
Type: Suggestion
Resolution: Answered
Fix Version/s: None
Component/s: Search - Core
Labels:
Environment:
Although we are testing in Confluence 3.4.8, this potentially affects other versions too.

NOTE: This suggestion is for Confluence Server. Using Confluence Cloud? See the corresponding suggestion.

This feature request is related to support ticket https://support.atlassian.com/browse/CSP-58347. After contacting Atlassian Support on behalf of our client about this issue, I was told that Lucene index customization is currently not supported and was asked to submit a feature request instead.

Business Use Case

Our client wishes to use Confluence as a partner portal, delegating spaces to their partners. This seems like a very common use case, and it leverages one of the core benefits of Confluence spaces.

However, since their partners are OEM manufacturers, and possibly competitors, it is logical that they do not want the names of the users to be visible to each other. We have suppressed the Profile Directory without problem, but since user names are indexed by Lucene, removing them from the index is not working (despite our code efforts to do so, as described below).

Abstract

We are in the process of implementing a user requirement to exclude all user information from Confluence search. The approach we took is removing Personal Information from the Lucene index using an extractor module.

Logging suggests that the document is removed from the index, but the search results persist. Worse, after updating a user's profile and reindexing, we get index locking errors, amongst other things.

Detail

The Index Limiter plugin has been written for the single purpose of removing personal information from the Lucene index.

The purpose of this is to remove:

username links from the rich text editor (RTE)
username results from "quicksearch"
user details/profile from the search results page e.g. /dosearchsite.action?queryString=admin

I've attempted the unindexing in 2 parts:

Invalidating the fields: Using an extractor module (source in svn) to invalidate the values of the fields "type", "email", "fullName", "title" and "username" within Lucene documents of type PersonalInformation.CONTENT_TYPE. See the addFields() method in the extractor.
Remove all personal information from the Lucene index: Remove all Lucene documents whose handle starts with com.atlassian.confluence.user.PersonalInformation. See the unIndex() method in the extractor.
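
The selection rule in the second part can be illustrated without Lucene. The following is a minimal sketch in plain Java, not the plugin's code: HandleFilterSketch and its method names are hypothetical, and the index is modelled as a plain list of handle strings. It contrasts the prefix match described above with an exact match on one handle.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, Lucene-free model: each string stands for the "handle"
// field of one indexed document (null means the document has no handle).
class HandleFilterSketch {
    static final String PERSONAL_INFO_PREFIX =
            "com.atlassian.confluence.user.PersonalInformation";

    // Positions of all handles that start with the PersonalInformation prefix.
    static List<Integer> matchByPrefix(List<String> handles) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < handles.size(); i++) {
            String handle = handles.get(i);
            if (handle != null && handle.startsWith(PERSONAL_INFO_PREFIX)) {
                hits.add(i);
            }
        }
        return hits;
    }

    // Position of one exact handle, or -1 if it is not present.
    static int matchExact(List<String> handles, String handle) {
        return handles.indexOf(handle);
    }
}
```

The prefix match selects every user document, whereas the exact match selects only the one whose handle is supplied, for example com.atlassian.confluence.user.PersonalInformation-393217.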

Results

After installing the Index Limiter plugin, run a complete reindex in Confluence Admin to trigger removal of the PersonalInformation data from Lucene.

1. Invalidating the fields

Takes a Lucene Document like this:

Document<
	stored/uncompressed,indexed<handle:com.atlassian.confluence.user.PersonalInformation-393217>
	stored/uncompressed,indexed,tokenized<content-name-unstemmed:admin>
	stored/uncompressed,indexed,tokenized<email:admin@example.com>
	stored/uncompressed,indexed,tokenized<fullName:admin>
	stored/uncompressed,indexed,tokenized<labelText:>
	stored/uncompressed,indexed,tokenized<title:admin>
	stored/uncompressed,indexed,tokenized<username:admin>
	stored/uncompressed,indexed<created:0fl6inapf>
	stored/uncompressed,indexed<fullNameUntokenized:admin>
	stored/uncompressed,indexed<hasPersonalSpace:false>
	stored/uncompressed,indexed<modified:000000000>
	stored/uncompressed,indexed<urlPath:/~admin>
	stored/uncompressed<content-version:1>
	stored/uncompressed<excerpt:>
	stored/uncompressed<version:1>
	>

Changes it to this:

Document<
	stored/uncompressed,indexed<handle:com.atlassian.confluence.user.PersonalInformation-393217>
	stored/uncompressed,indexed,tokenized<content-name-unstemmed:admin>
	stored/uncompressed,indexed,tokenized<email:admin@example.com>
	stored/uncompressed,indexed,tokenized<email:appfusions.invalidate>
	stored/uncompressed,indexed,tokenized<fullName:admin>
	stored/uncompressed,indexed,tokenized<fullName:appfusions.invalidate>
	stored/uncompressed,indexed,tokenized<labelText:>
	stored/uncompressed,indexed,tokenized<title:admin>
	stored/uncompressed,indexed,tokenized<title:appfusions.invalidate>
	stored/uncompressed,indexed,tokenized<username:admin>
	stored/uncompressed,indexed,tokenized<username:appfusions.invalidate>
	stored/uncompressed,indexed<created:0fl6inapf>
	stored/uncompressed,indexed<fullNameUntokenized:admin>
	stored/uncompressed,indexed<hasPersonalSpace:false>
	stored/uncompressed,indexed<modified:000000000>
	stored/uncompressed,indexed<type:appfusions.invalidate>
	stored/uncompressed,indexed<urlPath:/>
	stored/uncompressed<content-version:1>
	stored/uncompressed<excerpt:>
	stored/uncompressed<version:1>
	>

Uses the following code:

document.removeField(field);
// Set an invalid/meaningless value
document.add(new Field(field, "appfusions.invalidate", Field.Store.YES, Field.Index.TOKENIZED));

It should change the value of each field to appfusions.invalidate, but actually adds a duplicate field with this value.
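
One possible reading of this result, which this report does not confirm: a Lucene Document keeps its fields in an ordered list and allows several fields with the same name, and removeField(name) can only remove a field that is already present. If the extractor that writes the real values runs after this one, there is nothing to remove yet and both values end up in the document. That would be consistent with "type" and "urlPath" being replaced cleanly above while "email", "fullName", "title" and "username" are duplicated. The sketch below models the two orderings with plain Java collections; DocumentModel is a hypothetical stand-in, not the Lucene class.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a Lucene Document: an ordered list of name/value
// pairs in which the same field name may appear more than once.
class DocumentModel {
    private final List<String[]> fields = new ArrayList<>();

    void add(String name, String value) {
        fields.add(new String[] {name, value});
    }

    // Removes only the first field with this name, if there is one.
    void removeField(String name) {
        for (int i = 0; i < fields.size(); i++) {
            if (fields.get(i)[0].equals(name)) {
                fields.remove(i);
                return;
            }
        }
    }

    List<String> values(String name) {
        List<String> out = new ArrayList<>();
        for (String[] field : fields) {
            if (field[0].equals(name)) {
                out.add(field[1]);
            }
        }
        return out;
    }

    // Assumed ordering: the invalidating extractor runs before the real value is added.
    static List<String> invalidateThenExtract() {
        DocumentModel doc = new DocumentModel();
        doc.removeField("username");                  // nothing to remove yet
        doc.add("username", "appfusions.invalidate");
        doc.add("username", "admin");                 // real value arrives afterwards
        return doc.values("username");
    }

    // Opposite ordering: the real value already exists, so it is replaced.
    static List<String> extractThenInvalidate() {
        DocumentModel doc = new DocumentModel();
        doc.add("username", "admin");
        doc.removeField("username");
        doc.add("username", "appfusions.invalidate");
        return doc.values("username");
    }
}
```

In this model only the second ordering leaves a single invalidated value; the first leaves both values on the document.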

In any case, this has the following effect on the index:

Removes user details from the RTE & quicksearch
Only partially removes information from the search results page

Original and updated search results: see the attached screenshots search-result-original.png and search-result-updated.png.

2. Remove all personal information from the Lucene index

Inject com.atlassian.bonnie.ILuceneConnection into the extractor module with property injection and call the unIndex() method (listed under Supporting code) from the addFields() method.

Having attempted to do this, logging suggests that documents have been removed, but search results suggest otherwise.

Updated User Profiles

After updating a user profile and reindexing, further problems occur with search index locking:

2011-03-03 11:06:01,019 ERROR [DefaultQuartzScheduler_Worker-9] [atlassian.bonnie.search.BaseDocumentBuilder] getDocument Error extracting search fields from userinfo: admin v.2 (393217) using BackwardsCompatibleExtractor wrapping com.appfusions.confluence.plugins.indexlimiter.extractor.PersonalInformationExtractor@5342836a (com.appfusions.confluence.plugins.indexlimiter:PersonalInformationExtractor): org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: SimpleFSLock@/Users/david/projects/appfusions/confluence/plugins/indexlimiter/trunk/target/confluence/home/index/write.lock
com.atlassian.bonnie.LuceneException: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: SimpleFSLock@/Users/david/projects/appfusions/confluence/plugins/indexlimiter/trunk/target/confluence/home/index/write.lock
at com.atlassian.bonnie.LuceneConnection.withReaderAndDeletes(LuceneConnection.java:302)
at com.appfusions.confluence.plugins.indexlimiter.extractor.PersonalInformationExtractor.unIndex(PersonalInformationExtractor.java:95)
at com.appfusions.confluence.plugins.indexlimiter.extractor.PersonalInformationExtractor.addFields(PersonalInformationExtractor.java:85)
at com.atlassian.confluence.plugin.descriptor.ExtractorModuleDescriptor$BackwardsCompatibleExtractor.addFields(ExtractorModuleDescriptor.java:45)
at com.atlassian.bonnie.search.BaseDocumentBuilder.getDocument(BaseDocumentBuilder.java:104)
at com.atlassian.confluence.search.lucene.ConfluenceDocumentBuilder.getDocument(ConfluenceDocumentBuilder.java:102)
at com.atlassian.confluence.search.lucene.tasks.AddDocumentIndexTask.perform(AddDocumentIndexTask.java:43)
at com.atlassian.confluence.search.lucene.tasks.UpdateDocumentIndexTask.perform(UpdateDocumentIndexTask.java:40)
at com.atlassian.confluence.search.lucene.tasks.BulkWriteIndexTask.perform(BulkWriteIndexTask.java:44)
at com.atlassian.bonnie.LuceneConnection.withWriter(LuceneConnection.java:331)
at com.atlassian.confluence.search.lucene.tasks.LuceneConnectionBackedIndexTaskPerformer.perform(LuceneConnectionBackedIndexTaskPerformer.java:20)
at com.atlassian.confluence.search.lucene.DefaultConfluenceIndexManager$BatchUpdateAction.perform(DefaultConfluenceIndexManager.java:361)
at com.atlassian.bonnie.LuceneConnection.withBatchUpdate(LuceneConnection.java:405)
at com.atlassian.confluence.search.lucene.DefaultConfluenceIndexManager.processTasks(DefaultConfluenceIndexManager.java:161)
at com.atlassian.confluence.search.lucene.DefaultConfluenceIndexManager.flushQueue(DefaultConfluenceIndexManager.java:128)
at sun.reflect.GeneratedMethodAccessor337.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:304)
at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:182)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:149)
at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:106)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:171)
at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:204)
at $Proxy35.flushQueue(Unknown Source)
at com.atlassian.confluence.search.lucene.IndexQueueFlusher.executeJob(IndexQueueFlusher.java:29)
at com.atlassian.confluence.setup.quartz.AbstractClusterAwareQuartzJobBean.surroundJobExecutionWithLogging(AbstractClusterAwareQuartzJobBean.java:63)
at com.atlassian.confluence.setup.quartz.AbstractClusterAwareQuartzJobBean.executeInternal(AbstractClusterAwareQuartzJobBean.java:46)
at org.springframework.scheduling.quartz.QuartzJobBean.execute(QuartzJobBean.java:86)
at org.quartz.core.JobRunShell.run(JobRunShell.java:199)
at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:549)
Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: SimpleFSLock@/Users/david/projects/appfusions/confluence/plugins/indexlimiter/trunk/target/confluence/home/index/write.lock
at org.apache.lucene.store.Lock.obtain(Lock.java:70)
at org.apache.lucene.index.IndexReader.acquireWriteLock(IndexReader.java:638)
at org.apache.lucene.index.IndexReader.deleteDocument(IndexReader.java:672)
at com.appfusions.confluence.plugins.indexlimiter.extractor.PersonalInformationExtractor$1.perform(PersonalInformationExtractor.java:109)
at com.atlassian.bonnie.LuceneConnection.withReaderAndDeletes(LuceneConnection.java:298)
... 30 more
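
The trace shows unIndex() being called from addFields() while the flush job is already inside LuceneConnection.withWriter, and the nested IndexReader.deleteDocument then timing out on the same write.lock. Below is a minimal sketch of that pattern, using a java.util.concurrent.Semaphore as a stand-in for a non-reentrant lock with a timeout. WriteLockSketch and its methods are hypothetical and do not use the Bonnie or Lucene APIs.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the index write lock: one permit, not reentrant.
class WriteLockSketch {
    private final Semaphore writeLock = new Semaphore(1);

    // Runs the action while holding the lock. Returns false if the lock
    // could not be obtained within the timeout (the "Lock obtain timed out" case).
    boolean withLock(long timeoutMillis, Runnable action) {
        boolean acquired;
        try {
            acquired = writeLock.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        if (!acquired) {
            return false;
        }
        try {
            action.run();
            return true;
        } finally {
            writeLock.release();
        }
    }

    // Nested use: the inner call asks for the lock that the outer call still holds.
    // Result: [0] = outer obtained the lock, [1] = inner obtained the lock.
    static boolean[] nestedAttempt() {
        WriteLockSketch index = new WriteLockSketch();
        boolean[] obtained = new boolean[2];
        obtained[0] = index.withLock(50, () -> {
            obtained[1] = index.withLock(50, () -> { });
        });
        return obtained;
    }

    // Sequential use: the second call starts after the first has released the lock.
    static boolean[] sequentialAttempt() {
        WriteLockSketch index = new WriteLockSketch();
        boolean first = index.withLock(50, () -> { });
        boolean second = index.withLock(50, () -> { });
        return new boolean[] {first, second};
    }
}
```

In this model the nested attempt fails and the sequential one succeeds, which matches a delete issued from inside the writer callback.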

Supporting code

atlassian-plugin.xml:

<atlassian-plugin key="${project.groupId}.${project.artifactId}" name="${project.name}">
    <plugin-info>
        <description>${project.description}</description>
        <version>${project.version}</version>
        <vendor name="${project.organization.name}" url="${project.organization.url}" />
    </plugin-info>
    <extractor name="Personal Information Extractor"
               key="PersonalInformationExtractor"
               class="com.appfusions.confluence.plugins.indexlimiter.extractor.PersonalInformationExtractor"
               priority="900">
        <description>Removes some personal information from the search index.</description>
    </extractor>
</atlassian-plugin>

com.appfusions.confluence.plugins.indexlimiter.extractor.PersonalInformationExtractor:

package com.appfusions.confluence.plugins.indexlimiter.extractor;

import org.apache.log4j.Logger;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.slf4j.MDC;

import com.atlassian.bonnie.Searchable;
import com.atlassian.bonnie.ILuceneConnection;
import com.atlassian.bonnie.search.Extractor;
import com.atlassian.bonnie.search.BaseDocumentBuilder;
import com.atlassian.bonnie.search.DocumentBuilder;
import com.atlassian.confluence.core.ContentEntityObject;
import com.atlassian.confluence.user.PersonalInformation;
import com.atlassian.confluence.user.UserAccessor;

import java.io.IOException;

/**
 * User: david
 * Date: Feb 25, 2011
 * Time: 7:51:57 PM
 */
public class PersonalInformationExtractor implements Extractor
{
    private UserAccessor userAccessor;
    private ILuceneConnection luceneConnection;
    private DocumentBuilder documentBuilder;

    public void setUserAccessor(UserAccessor userAccessor) {
        this.userAccessor = userAccessor;
    }

    /**
     * @param luceneConnection set by dependency injection, required
     */
    public void setLuceneConnection(ILuceneConnection luceneConnection) {
        this.luceneConnection = luceneConnection;
    }

    public void setDocumentBuilder(DocumentBuilder documentBuilder) {
        this.documentBuilder = documentBuilder;
    }

    /**
     * Initially replace the contents of the fields in the index.
     * This approach will remove PersonalInformation from quicksearch and the rich text editor.
     */
    public void addFields(Document document, StringBuffer defaultSearchableText, Searchable searchable)
    {
        if (searchable instanceof PersonalInformation)
        {
            PersonalInformation personalInformation = (PersonalInformation) searchable;

            if(userAccessor.getUser(personalInformation.getUsername()) != null)
            {
                // Most important is to change the type field to an unknown value (to Confluence)
                String[] fieldsTokenized = {"email", "fullName", "title", "username"}; // tokenized fields

                for (String field : fieldsTokenized)
                {
                    document.removeField(field);
                    // Set an invalid/meaningless value
                    document.add(new Field(field, "appfusions.invalidate", Field.Store.YES, Field.Index.TOKENIZED));
                }

                String[] fieldsUntokenized = {"type"}; // untokenized fields

                for (String field : fieldsUntokenized)
                {
                    document.removeField(field);
                    // Set an invalid/meaningless value
                    document.add(new Field(field, "appfusions.invalidate", Field.Store.YES, Field.Index.UN_TOKENIZED));
                }

                // Redirect/rewrite the urlPath to the context root
                // -- if we can't remove this item from search results, at least redirect.
                document.removeField("urlPath");
                document.add(new Field("urlPath", "/", Field.Store.YES, Field.Index.UN_TOKENIZED));

                // Finally, attempt to remove all documents related to PersonalInformation
                unIndex(); // unIndex(personalInformation);
            }
        }
    }

    /**
     * Find *all* Lucene Documents where "handle" starts with "com.atlassian.confluence.user.PersonalInformation"
     * - likely to be rather heavy handed, so perhaps later target just the single document in the index
     */
    public void unIndex()
    {
        luceneConnection.withReaderAndDeletes(new ILuceneConnection.ReaderAction()
        {
            public Object perform(IndexReader indexReader) throws IOException
            {
                int max = indexReader.maxDoc();
                for (int i = 0; i < max; i++)
                {
                    Field handle = indexReader.document(i).getField("handle");

                    if (handle != null)
                    {
                        if (handle.stringValue().startsWith("com.atlassian.confluence.user.PersonalInformation"))
                        {
                            System.out.println(" unindexing "+indexReader.document(i).toString());
                            indexReader.deleteDocument(i);
                        }
                    }
                }
                return null;
            }
        });
    }
}

Attachments

rich-text-editor-limited.png (09/Mar/2011 4:21 PM, 43 kB, Danielle Zhu)
search-result-original.png (09/Mar/2011 4:21 PM, 35 kB, Danielle Zhu)
search-result-updated.png (09/Mar/2011 4:21 PM, 33 kB, Danielle Zhu)

relates to: AI-817 Unable to remove Personal Information from Lucene index (Closed)
