CONFSERVER-32566

Indexing job becomes stuck due to the UAX29URLEmailTokenizerImpl analyzer


Details

    Description

      Problem summary

      The Lucene indexing job does not always complete, whether it is triggered automatically by adding content to Confluence or manually via Confluence Admin > Content Indexing.

      Details

      The version of the Lucene index was upgraded in Confluence 5.2. Along with that upgrade, a few new analyzers were added, which parse content in files that were not previously parsed.

      UAX29URLEmailTokenizerImpl is one such analyzer. It is affected by the following Lucene bug, which can drive CPU usage to 100% while tokenization proceeds extremely slowly: https://issues.apache.org/jira/browse/LUCENE-5400

      In Confluence, this Lucene bug manifests as the index rebuild task appearing to hang.

      Diagnosis and Workaround

      A workaround is to locate the attachment file that is preventing the indexing job from finishing and remove it from Confluence.

      1. First, enable DEBUG logging on the following via Confluence Admin > Logging & Profiling:
        com.atlassian.confluence.search.lucene.ReindexWorkBatch
        com.atlassian.confluence.search.lucene.tasks
        


      2. Trigger an indexing job via Confluence Admin > Content Indexing
      3. Once the job stops making visible progress, take a few thread dumps. Look for an indexing thread in the RUNNABLE state that looks like this:
        "Indexer: 1" daemon prio=10 tid=0x00007f38a8048800 nid=0x5a03 runnable [0x00007f3873bf9000]
           java.lang.Thread.State: RUNNABLE
        	at org.apache.lucene.analysis.standard.UAX29URLEmailTokenizerImpl.getNextToken(UAX29URLEmailTokenizerImpl.java:4348)
        	at org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer.incrementToken(UAX29URLEmailTokenizer.java:145)
        	at org.apache.lucene.analysis.standard.StandardFilter.incrementToken(StandardFilter.java:49)
        	at org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:54)
        	at org.apache.lucene.analysis.util.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:50)
        	at org.apache.lucene.analysis.en.KStemFilter.incrementToken(KStemFilter.java:64)
        	at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:174)
        	at org.apache.lucene.index.DocFieldProcessor.processDocument(DocFieldProcessor.java:254)
        	at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:256)
        	at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:376)
        	at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1485)
        	at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1160)
        	at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1141)
        	at com.atlassian.confluence.search.lucene.tasks.AddDocumentIndexTask.perform(AddDocumentIndexTask.java:43)
        	at com.atlassian.confluence.search.lucene.ReindexWorkBatch.indexCollection(ReindexWorkBatch.java:146)
        	at com.atlassian.confluence.search.lucene.ReindexWorkBatch$1.doInTransaction(ReindexWorkBatch.java:113)
        	at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:128)
        	at com.atlassian.confluence.search.lucene.ReindexWorkBatch.executeTransaction(ReindexWorkBatch.java:84)
        	at com.atlassian.confluence.search.lucene.ReindexWorkBatch.run(ReindexWorkBatch.java:72)
        	at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
        	at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
        	at java.util.concurrent.FutureTask.run(Unknown Source)
        	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        	at java.lang.Thread.run(Unknown Source)
        


      4. Note the indexing thread (in the above example, "Indexer: 1") and look in atlassian-confluence.log for the last line referencing this thread. This line only appears if DEBUG logging is enabled for the classes listed in step 1. For example:
        2013-12-17 13:41:39,837 DEBUG [Indexer: 1] [confluence.search.lucene.ReindexWorkBatch] indexCollection Index Attachment: Example Attachment Name.xml (32376037) example_username [com.atlassian.confluence.pages.Attachment] 
        


      5. In this example, the attachment "Example Attachment Name.xml" is preventing the indexing job from finishing. Remove this file from the Confluence instance, then try reindexing again.
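
      Steps 3 to 5 can be partly automated. The following is a minimal Python sketch, not an Atlassian tool: it assumes the thread dump and log lines have exactly the shape shown in the examples above, and the function names (stuck_threads, last_attachment_per_thread) are hypothetical. It finds RUNNABLE threads whose stack contains UAX29URLEmailTokenizerImpl.getNextToken, then reports the last attachment each of those threads logged.

```python
import re

# Frame that marks a thread as busy in the affected tokenizer (taken from the stack trace above).
STUCK_FRAME = "UAX29URLEmailTokenizerImpl.getNextToken"

# Header of a thread entry in a thread dump, e.g. "Indexer: 1" daemon prio=10 ...
THREAD_HEADER = re.compile(r'^\s*"(?P<name>[^"]+)"')

# DEBUG line written during reindexing; the pattern is an assumption based on
# the single example log line above and may need adjusting for other versions.
LOG_LINE = re.compile(
    r"DEBUG \[(?P<thread>[^\]]+)\] \[confluence\.search\.lucene\.ReindexWorkBatch\] "
    r"indexCollection Index Attachment: (?P<title>.+) \((?P<id>\d+)\)"
)


def stuck_threads(dump_text):
    """Return names of RUNNABLE threads whose stack contains the tokenizer frame."""
    names = []
    current, runnable = None, False
    for line in dump_text.splitlines():
        header = THREAD_HEADER.match(line)
        if header:
            # A new thread entry starts; reset the per-thread state.
            current, runnable = header.group("name"), False
        elif "java.lang.Thread.State: RUNNABLE" in line:
            runnable = True
        elif current and runnable and STUCK_FRAME in line and current not in names:
            names.append(current)
    return names


def last_attachment_per_thread(log_text, threads):
    """Map each given thread name to the (title, id) of the last attachment it logged."""
    last = {}
    for line in log_text.splitlines():
        match = LOG_LINE.search(line)
        if match and match.group("thread") in threads:
            # Later lines overwrite earlier ones, so the final value is the last entry.
            last[match.group("thread")] = (match.group("title"), int(match.group("id")))
    return last
```

      Pass the text of one thread dump to stuck_threads, then pass the contents of atlassian-confluence.log and the returned names to last_attachment_per_thread. A thread is only a real suspect if it shows the same frame across several dumps taken some time apart, so it is worth running the first function on each dump and comparing the results.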

      Workaround 2

      It is possible to prevent the contents of text attachments from being indexed entirely, thereby bypassing this Lucene bug. This is done by modifying an XML file inside one of the core Confluence plugins. Please see the warnings/caveats section below before proceeding.

      Please replace the version number (denoted by "x.x.x" in the file names below) with the actual version of Confluence you're using. The below procedure was originally carried out against Confluence 5.3.4.

      1. Shut down Confluence
      2. Make a backup copy of <confluence_install>/confluence/WEB-INF/lib/confluence-x.x.x.jar outside of the Confluence installation.
      3. Extract the jar's contents to a blank directory: unzip -d /path/to/blank/directory confluence-x.x.x.jar
      4. Navigate to the extracted contents and open the following in a text editor: /plugins/core-extractors.xml
      5. Comment out or remove the following section:
        <extractor name="Text Attachment Content Extractor" key="textAttachmentContentExtractor" class="com.atlassian.bonnie.search.extractor.DefaultTextContentExtractor" priority="1000">
            <description>Indexes text attachments if nothing else has indexed them already</description>
        </extractor>
        
      6. Repackage the contents (jar -cf confluence-x.x.x-modified.jar *), then replace the original confluence-x.x.x.jar in <confluence_install>/confluence/WEB-INF/lib/ with the modified jar
      7. Start Confluence
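
      Steps 3 to 6 can also be scripted. The following Python sketch is an illustration under stated assumptions, not a supported procedure: it assumes the jar entry is named plugins/core-extractors.xml (per step 4), that the file is UTF-8 encoded, and that the extractor element appears exactly once and is not already commented out. The function names are hypothetical. It copies every entry of the original jar into a new jar and wraps only the text attachment extractor element in an XML comment. The warnings below apply to it unchanged.

```python
import re
import zipfile

# Entry name inside the jar, taken from step 4 above.
EXTRACTORS_ENTRY = "plugins/core-extractors.xml"

# Matches the <extractor> element whose key is textAttachmentContentExtractor.
EXTRACTOR_BLOCK = re.compile(
    r'<extractor\b[^>]*\bkey="textAttachmentContentExtractor"[^>]*>.*?</extractor>',
    re.DOTALL,
)


def comment_out_text_extractor(xml_text):
    """Wrap the text attachment extractor element in an XML comment."""
    new_text, count = EXTRACTOR_BLOCK.subn(
        lambda m: "<!-- " + m.group(0) + " -->", xml_text
    )
    if count != 1:
        # Refuse to guess if the file does not look like the snippet in step 5.
        raise ValueError(f"expected exactly one matching <extractor>, found {count}")
    return new_text


def write_modified_jar(original_jar, modified_jar):
    """Copy all entries of original_jar into modified_jar, patching only the extractors file."""
    with zipfile.ZipFile(original_jar) as src, \
            zipfile.ZipFile(modified_jar, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = src.read(info)
            if info.filename == EXTRACTORS_ENTRY:
                # Assumption: the XML file is UTF-8 encoded.
                data = comment_out_text_extractor(data.decode("utf-8")).encode("utf-8")
            dst.writestr(info, data)
```

      For example, write_modified_jar("confluence-x.x.x.jar", "confluence-x.x.x-modified.jar") writes the patched jar next to the original and leaves the original file untouched, so the backup from step 2 is still worth keeping. Inspect the patched XML before deploying the result.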

      Warnings/caveats with this workaround:

      • This is not a configuration that we regularly test against in Confluence releases. You apply this workaround at your own risk.
      • Some plugins or macros may break if they rely on text attachments being indexed correctly.
      • The customization does not carry over automatically between versions, so it must be re-applied after each upgrade.

            People

              mtran@atlassian.com Minh Tran
              rchang Robert Chang