  1. Confluence Data Center
  2. CONFSERVER-21952

Unable to remove Personal Information from Lucene index


Details


    Description

      NOTE: This suggestion is for Confluence Server. Using Confluence Cloud? See the corresponding suggestion.

      This feature request is related to support ticket https://support.atlassian.com/browse/CSP-58347. After contacting Atlassian Support on behalf of our client about this issue, I was told that Lucene index customization is currently not supported and was asked to submit a feature request instead.

      Business Use Case

      Our client wishes to use Confluence as a partner portal, delegating spaces to their partners. This seems like a very common use case, and one that leverages a core benefit of Confluence: distributing content across spaces.

      However, since their partners are OEM manufacturers, and possibly competitors, it is logical that they do not want the names of the users to be visible to each other. We have suppressed the Profile Directory without problem, but since user names are indexed by Lucene, removing them from the index is not working (despite our code efforts to do so, as described below).

      Abstract

      We are in the process of implementing a user requirement to exclude all user information from Confluence search. The approach we took is removing Personal Information from the Lucene index using an extractor module.

      We've found from logging that the document appears to be removed from the index, but it still shows up in search results. Worse, after updating a user's profile and reindexing, we get index locking errors, among other problems.

      Detail

      The Index Limiter plugin has been written for the single purpose of removing personal information from the Lucene index.

      The purpose of this is to remove:

      1. username links from the rich text editor (RTE)
      2. username results from "quicksearch"
      3. user details/profile from the search results page e.g. /dosearchsite.action?queryString=admin

      I've attempted the unindexing in 2 parts:

      1. Invalidating the fields: Using an extractor module (source in svn) to invalidate the values of these fields: "type", "email", "fullName", "title", "username" within the Lucene documents of type PersonalInformation.CONTENT_TYPE – the addFields() method in the extractor
      2. Removing all personal information from the Lucene index: Remove all Lucene Documents whose handle starts with com.atlassian.confluence.user.PersonalInformation – the unIndex() method in the extractor

      Results

      After installing the Index Limiter plugin, run a complete reindex in Confluence Admin to trigger removal of the PersonalInformation data from Lucene.

      1. Invalidating the fields

      Takes a Lucene Document like this:

      Document<
      	stored/uncompressed,indexed<handle:com.atlassian.confluence.user.PersonalInformation-393217>
      	stored/uncompressed,indexed,tokenized<content-name-unstemmed:admin>
      	stored/uncompressed,indexed,tokenized<email:admin@example.com>
      	stored/uncompressed,indexed,tokenized<fullName:admin>
      	stored/uncompressed,indexed,tokenized<labelText:>
      	stored/uncompressed,indexed,tokenized<title:admin>
      	stored/uncompressed,indexed,tokenized<username:admin>
      	stored/uncompressed,indexed<created:0fl6inapf>
      	stored/uncompressed,indexed<fullNameUntokenized:admin>
      	stored/uncompressed,indexed<hasPersonalSpace:false>
      	stored/uncompressed,indexed<modified:000000000>
      	stored/uncompressed,indexed<urlPath:/~admin>
      	stored/uncompressed<content-version:1>
      	stored/uncompressed<excerpt:>
      	stored/uncompressed<version:1>
      	>

      Changes it to this:

      Document<
      	stored/uncompressed,indexed<handle:com.atlassian.confluence.user.PersonalInformation-393217>
      	stored/uncompressed,indexed,tokenized<content-name-unstemmed:admin>
      	stored/uncompressed,indexed,tokenized<email:admin@example.com>
      	stored/uncompressed,indexed,tokenized<email:appfusions.invalidate>
      	stored/uncompressed,indexed,tokenized<fullName:admin>
      	stored/uncompressed,indexed,tokenized<fullName:appfusions.invalidate>
      	stored/uncompressed,indexed,tokenized<labelText:>
      	stored/uncompressed,indexed,tokenized<title:admin>
      	stored/uncompressed,indexed,tokenized<title:appfusions.invalidate>
      	stored/uncompressed,indexed,tokenized<username:admin>
      	stored/uncompressed,indexed,tokenized<username:appfusions.invalidate>
      	stored/uncompressed,indexed<created:0fl6inapf>
      	stored/uncompressed,indexed<fullNameUntokenized:admin>
      	stored/uncompressed,indexed<hasPersonalSpace:false>
      	stored/uncompressed,indexed<modified:000000000>
      	stored/uncompressed,indexed<type:appfusions.invalidate>
      	stored/uncompressed,indexed<urlPath:/>
      	stored/uncompressed<content-version:1>
      	stored/uncompressed<excerpt:>
      	stored/uncompressed<version:1>
      	>

      Uses the following code:

      document.removeField(field);
      // Set an invalid/meaningless value
      document.add(new Field(field, "appfusions.invalidate", Field.Store.YES, Field.Index.TOKENIZED));
      

      It should change the value of each field to appfusions.invalidate, but actually adds a duplicate field with this value.
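
      One possible reading of the duplicates, offered as an assumption rather than a confirmed diagnosis: a Lucene document can hold several values under one field name, so add() appends instead of overwriting. If the built-in extractor contributes its values after this extractor has run, removeField() finds nothing to remove and both values end up in the document. The sketch below models that ordering with a hypothetical MultiValuedDoc class built on the standard library only; it is not the Lucene Document API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a multi-valued document; not the Lucene API. */
final class MultiValuedDoc {
    private final List<String[]> fields = new ArrayList<>();

    /** Appends a value; any existing value with the same name is kept. */
    void add(String name, String value) {
        fields.add(new String[] {name, value});
    }

    /** Removes only the first value with the given name, if there is one. */
    void removeFirst(String name) {
        for (int i = 0; i < fields.size(); i++) {
            if (fields.get(i)[0].equals(name)) {
                fields.remove(i);
                return;
            }
        }
    }

    /** Returns every value stored under the given name, in insertion order. */
    List<String> values(String name) {
        List<String> out = new ArrayList<>();
        for (String[] f : fields) {
            if (f[0].equals(name)) {
                out.add(f[1]);
            }
        }
        return out;
    }
}

final class DuplicateFieldDemo {
    /** Assumed ordering: the custom extractor runs before the built-in one. */
    static List<String> customThenBuiltIn() {
        MultiValuedDoc doc = new MultiValuedDoc();
        doc.removeFirst("username");                  // no-op: field not present yet
        doc.add("username", "appfusions.invalidate");
        doc.add("username", "admin");                 // assumed built-in extractor
        return doc.values("username");
    }

    /** Opposite ordering: the built-in value exists and can be replaced. */
    static List<String> builtInThenCustom() {
        MultiValuedDoc doc = new MultiValuedDoc();
        doc.add("username", "admin");
        doc.removeFirst("username");
        doc.add("username", "appfusions.invalidate");
        return doc.values("username");
    }

    public static void main(String[] args) {
        System.out.println(customThenBuiltIn());
        System.out.println(builtInThenCustom());
    }
}
```

      In this model only the second ordering leaves a single value per field. Whether extractor ordering is what happens inside Confluence has not been verified here.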

      In any case, invalidating the fields has the following effects:

      • Removes user details from the RTE & quicksearch

      • Only partially removes information from the search results page

        [Table: Original | Updated (cell contents missing)]

      2. Remove all personal information from the Lucene index

      Inject com.atlassian.bonnie.ILuceneConnection into the extractor module using property injection, and call the unIndex() method (at the bottom of this page) from the addFields() method.

      After doing this, logging suggests that the documents have been removed, but search results still include them.
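
      A possible explanation, stated as an assumption: the stack trace further down shows addFields() being called from AddDocumentIndexTask.perform() via getDocument(), which suggests the document under construction is written to the index after the extractor returns. If so, a delete issued from inside addFields() is followed by an add of the same handle. The sketch below illustrates that ordering with a hypothetical in-memory ToyIndex; none of these classes are Confluence or Lucene APIs.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory model of an index keyed by document handle. */
final class ToyIndex {
    static final String PI_PREFIX = "com.atlassian.confluence.user.PersonalInformation";
    private final List<String> handles = new ArrayList<>();

    void add(String handle) {
        handles.add(handle);
    }

    /** Deletes every handle that starts with the prefix; returns the number deleted. */
    int deleteByPrefix(String prefix) {
        int before = handles.size();
        handles.removeIf(h -> h.startsWith(prefix));
        return before - handles.size();
    }

    boolean contains(String handle) {
        return handles.contains(handle);
    }
}

final class DeleteThenAddDemo {
    /** Mimics an update task: the extractor runs first, then the task writes the document. */
    static boolean stillIndexedAfterUpdate() {
        String handle = ToyIndex.PI_PREFIX + "-393217";
        ToyIndex index = new ToyIndex();
        index.add(handle);                                       // state before the reindex

        int deleted = index.deleteByPrefix(ToyIndex.PI_PREFIX);  // stands in for unIndex()
        System.out.println("unindexed " + deleted + " document(s)");

        index.add(handle);                                       // task writes the built document
        return index.contains(handle);
    }

    public static void main(String[] args) {
        System.out.println("still indexed: " + stillIndexedAfterUpdate());
    }
}
```

      In this model the log line reports a successful removal, yet the handle is present again once the task finishes, which would match the observation above.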

      Updated User Profiles

      After updating a user profile and reindexing, further problems occur with search index locking:

      2011-03-03 11:06:01,019 ERROR [DefaultQuartzScheduler_Worker-9] [atlassian.bonnie.search.BaseDocumentBuilder] getDocument Error extracting search fields from userinfo: admin v.2 (393217) using BackwardsCompatibleExtractor wrapping com.appfusions.confluence.plugins.indexlimiter.extractor.PersonalInformationExtractor@5342836a (com.appfusions.confluence.plugins.indexlimiter:PersonalInformationExtractor): org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: SimpleFSLock@/Users/david/projects/appfusions/confluence/plugins/indexlimiter/trunk/target/confluence/home/index/write.lock
      com.atlassian.bonnie.LuceneException: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: SimpleFSLock@/Users/david/projects/appfusions/confluence/plugins/indexlimiter/trunk/target/confluence/home/index/write.lock
      at com.atlassian.bonnie.LuceneConnection.withReaderAndDeletes(LuceneConnection.java:302)
      at com.appfusions.confluence.plugins.indexlimiter.extractor.PersonalInformationExtractor.unIndex(PersonalInformationExtractor.java:95)
      at com.appfusions.confluence.plugins.indexlimiter.extractor.PersonalInformationExtractor.addFields(PersonalInformationExtractor.java:85)
      at com.atlassian.confluence.plugin.descriptor.ExtractorModuleDescriptor$BackwardsCompatibleExtractor.addFields(ExtractorModuleDescriptor.java:45)
      at com.atlassian.bonnie.search.BaseDocumentBuilder.getDocument(BaseDocumentBuilder.java:104)
      at com.atlassian.confluence.search.lucene.ConfluenceDocumentBuilder.getDocument(ConfluenceDocumentBuilder.java:102)
      at com.atlassian.confluence.search.lucene.tasks.AddDocumentIndexTask.perform(AddDocumentIndexTask.java:43)
      at com.atlassian.confluence.search.lucene.tasks.UpdateDocumentIndexTask.perform(UpdateDocumentIndexTask.java:40)
      at com.atlassian.confluence.search.lucene.tasks.BulkWriteIndexTask.perform(BulkWriteIndexTask.java:44)
      at com.atlassian.bonnie.LuceneConnection.withWriter(LuceneConnection.java:331)
      at com.atlassian.confluence.search.lucene.tasks.LuceneConnectionBackedIndexTaskPerformer.perform(LuceneConnectionBackedIndexTaskPerformer.java:20)
      at com.atlassian.confluence.search.lucene.DefaultConfluenceIndexManager$BatchUpdateAction.perform(DefaultConfluenceIndexManager.java:361)
      at com.atlassian.bonnie.LuceneConnection.withBatchUpdate(LuceneConnection.java:405)
      at com.atlassian.confluence.search.lucene.DefaultConfluenceIndexManager.processTasks(DefaultConfluenceIndexManager.java:161)
      at com.atlassian.confluence.search.lucene.DefaultConfluenceIndexManager.flushQueue(DefaultConfluenceIndexManager.java:128)
      at sun.reflect.GeneratedMethodAccessor337.invoke(Unknown Source)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
      at java.lang.reflect.Method.invoke(Method.java:597)
      at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:304)
      at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:182)
      at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:149)
      at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:106)
      at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:171)
      at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:204)
      at $Proxy35.flushQueue(Unknown Source)
      at com.atlassian.confluence.search.lucene.IndexQueueFlusher.executeJob(IndexQueueFlusher.java:29)
      at com.atlassian.confluence.setup.quartz.AbstractClusterAwareQuartzJobBean.surroundJobExecutionWithLogging(AbstractClusterAwareQuartzJobBean.java:63)
      at com.atlassian.confluence.setup.quartz.AbstractClusterAwareQuartzJobBean.executeInternal(AbstractClusterAwareQuartzJobBean.java:46)
      at org.springframework.scheduling.quartz.QuartzJobBean.execute(QuartzJobBean.java:86)
      at org.quartz.core.JobRunShell.run(JobRunShell.java:199)
      at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:549)
      Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: SimpleFSLock@/Users/david/projects/appfusions/confluence/plugins/indexlimiter/trunk/target/confluence/home/index/write.lock
      at org.apache.lucene.store.Lock.obtain(Lock.java:70)
      at org.apache.lucene.index.IndexReader.acquireWriteLock(IndexReader.java:638)
      at org.apache.lucene.index.IndexReader.deleteDocument(IndexReader.java:672)
      at com.appfusions.confluence.plugins.indexlimiter.extractor.PersonalInformationExtractor$1.perform(PersonalInformationExtractor.java:109)
      at com.atlassian.bonnie.LuceneConnection.withReaderAndDeletes(LuceneConnection.java:298)
      ... 30 more
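
      Reading the trace bottom-up, the outer frame is LuceneConnection.withWriter(), and the failing inner call is IndexReader.deleteDocument() trying to acquire write.lock via withReaderAndDeletes(). That pattern looks like a nested request for a lock the same call chain already holds. The sketch below reproduces the pattern with a hypothetical WriteLockDemo class that uses a single-permit Semaphore as a stand-in for a non-reentrant file lock; it is an illustration, not the bonnie or Lucene locking code.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

/** Hypothetical model of a single, non-reentrant index write lock. */
final class WriteLockDemo {
    private final Semaphore writeLock = new Semaphore(1);

    /** Runs the action while holding the lock; returns false if the lock times out. */
    boolean withLock(long timeoutMillis, Runnable action) {
        boolean acquired;
        try {
            acquired = writeLock.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        if (!acquired) {
            return false; // analogous to "Lock obtain timed out"
        }
        try {
            action.run();
            return true;
        } finally {
            writeLock.release();
        }
    }

    /** The outer writer section calls code that asks for the same lock again. */
    static boolean[] nestedAttempt() {
        WriteLockDemo index = new WriteLockDemo();
        boolean[] result = new boolean[2]; // [0] = outer acquired, [1] = inner acquired
        result[0] = index.withLock(50, () -> {
            // Stands in for addFields() -> unIndex() running inside the writer section.
            result[1] = index.withLock(50, () -> { });
        });
        return result;
    }

    public static void main(String[] args) {
        boolean[] r = nestedAttempt();
        System.out.println("outer acquired: " + r[0] + ", inner acquired: " + r[1]);
    }
}
```

      The inner request can never succeed while the outer section is still running, so it only ends when the timeout expires. If the same holds for the real index lock, deletions would have to happen outside the writer section rather than from within addFields().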

      Supporting code

      atlassian-plugin.xml:

      <atlassian-plugin key="${project.groupId}.${project.artifactId}" name="${project.name}">
          <plugin-info>
              <description>${project.description}</description>
              <version>${project.version}</version>
              <vendor name="${project.organization.name}" url="${project.organization.url}" />
          </plugin-info>
          <extractor name="Personal Information Extractor"
                     key="PersonalInformationExtractor"
                     class="com.appfusions.confluence.plugins.indexlimiter.extractor.PersonalInformationExtractor"
                     priority="900">
              <description>Removes some personal information from the search index.</description>
          </extractor>
      </atlassian-plugin>

      com.appfusions.confluence.plugins.indexlimiter.extractor.PersonalInformationExtractor:

      package com.appfusions.confluence.plugins.indexlimiter.extractor;
      
      import org.apache.log4j.Logger;
      import org.apache.lucene.document.Document;
      import org.apache.lucene.document.Field;
      import org.apache.lucene.index.IndexReader;
      import org.apache.lucene.index.Term;
      import org.slf4j.MDC;
      
      import com.atlassian.bonnie.Searchable;
      import com.atlassian.bonnie.ILuceneConnection;
      import com.atlassian.bonnie.search.Extractor;
      import com.atlassian.bonnie.search.BaseDocumentBuilder;
      import com.atlassian.bonnie.search.DocumentBuilder;
      import com.atlassian.confluence.core.ContentEntityObject;
      import com.atlassian.confluence.user.PersonalInformation;
      import com.atlassian.confluence.user.UserAccessor;
      
      import java.io.IOException;
      
      /**
       * User: david
       * Date: Feb 25, 2011
       * Time: 7:51:57 PM
       */
      public class PersonalInformationExtractor implements Extractor
      {
          private UserAccessor userAccessor;
          private ILuceneConnection luceneConnection;
          private DocumentBuilder documentBuilder;
      
          public void setUserAccessor(UserAccessor userAccessor) {
              this.userAccessor = userAccessor;
          }
      
          /**
           * @param luceneConnection set by dependency injection, required
           */
          public void setLuceneConnection(ILuceneConnection luceneConnection) {
              this.luceneConnection = luceneConnection;
          }
      
          public void setDocumentBuilder(DocumentBuilder documentBuilder) {
              this.documentBuilder = documentBuilder;
          }
      
          /**
           * Initially replace the contents of the fields in the index.
           * This approach will remove PersonalInformation from quicksearch and the rich text editor.
           */
          public void addFields(Document document, StringBuffer defaultSearchableText, Searchable searchable)
          {
              if (searchable instanceof PersonalInformation)
              {
                  PersonalInformation personalInformation = (PersonalInformation) searchable;
      
                  if(userAccessor.getUser(personalInformation.getUsername()) != null)
                  {
                      // Most important is to change the type field to an unknown value (to Confluence)
                      String[] fieldsTokenized = {"email", "fullName", "title", "username"}; // tokenized fields
      
                      for (String field : fieldsTokenized)
                      {
                          document.removeField(field);
                          // Set an invalid/meaningless value
                          document.add(new Field(field, "appfusions.invalidate", Field.Store.YES, Field.Index.TOKENIZED));
                      }
      
                      String[] fieldsUntokenized = {"type"}; // untokenized fields
      
                      for (String field : fieldsUntokenized)
                      {
                          document.removeField(field);
                          // Set an invalid/meaningless value
                          document.add(new Field(field, "appfusions.invalidate", Field.Store.YES, Field.Index.UN_TOKENIZED));
                      }
      
                      // Redirect/rewrite the urlPath to the context root
                      // -- if we can't remove this item from search results, at least redirect.
                      document.removeField("urlPath");
                      document.add(new Field("urlPath", "/", Field.Store.YES, Field.Index.UN_TOKENIZED));
      
                      // Finally, attempt to remove all documents related to PersonalInformation
                      unIndex(); // unIndex(personalInformation);
                  }
              }
          }
      
          /**
           * Find *all* Lucene Documents where "handle" starts with "com.atlassian.confluence.user.PersonalInformation"
           * - likely to be rather heavy handed, so perhaps later target just the single document in the index
           */
          public void unIndex()
          {
              luceneConnection.withReaderAndDeletes(new ILuceneConnection.ReaderAction()
              {
                  public Object perform(IndexReader indexReader) throws IOException
                  {
                      int max = indexReader.maxDoc();
                      for (int i = 0; i < max; i++)
                      {
                          Field handle = indexReader.document(i).getField("handle");
      
                          if (handle != null)
                          {
                              if (handle.stringValue().startsWith("com.atlassian.confluence.user.PersonalInformation"))
                              {
                                  System.out.println(" unindexing "+indexReader.document(i).toString());
                                  indexReader.deleteDocument(i);
                              }
                          }
                      }
                      return null;
                  }
              });
          }
      }
      


            People

              Assignee: Unassigned
              Reporter: Danielle Zhu
              Votes: 7
              Watchers: 7
