Details

Type: Bug
Resolution: Fixed
Priority: Medium
Fix Version/s: 5.7.1, 5.7-OD-46-015
Affects Version/s: 5.2, 5.3.4, 5.5.4
Component/s: Search - Core
Labels:

Description

Problem summary

The indexing job in Lucene does not always complete, whether triggered automatically by adding content to Confluence, or manually via Confluence Admin > Content Indexing.

Details

The version of the Lucene index was upgraded in Confluence 5.2. Along with that, a few new analyzers were added, which parse content in files that was not previously being parsed.

UAX29URLEmailTokenizerImpl is one such analyzer. It suffers from the following Lucene bug, which can drive CPU usage to 100% while tokenization proceeds extremely slowly: https://issues.apache.org/jira/browse/LUCENE-5400

In Confluence, this Lucene bug manifests itself as the index rebuild task "hanging".

Diagnosis and Workaround

A workaround is to locate the attachment file that is preventing the indexing job from finishing and remove it from Confluence.

First, enable DEBUG logging on the following via Confluence Admin > Logging & Profiling:

com.atlassian.confluence.search.lucene.ReindexWorkBatch
com.atlassian.confluence.search.lucene.tasks

Trigger an indexing job via Confluence Admin > Content Indexing

Once the job reaches the point where it does not seem to proceed anymore, take a few thread dumps. Look for an indexing thread in RUNNABLE state that looks like this:

"Indexer: 1" daemon prio=10 tid=0x00007f38a8048800 nid=0x5a03 runnable [0x00007f3873bf9000]
   java.lang.Thread.State: RUNNABLE
	at org.apache.lucene.analysis.standard.UAX29URLEmailTokenizerImpl.getNextToken(UAX29URLEmailTokenizerImpl.java:4348)
	at org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer.incrementToken(UAX29URLEmailTokenizer.java:145)
	at org.apache.lucene.analysis.standard.StandardFilter.incrementToken(StandardFilter.java:49)
	at org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:54)
	at org.apache.lucene.analysis.util.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:50)
	at org.apache.lucene.analysis.en.KStemFilter.incrementToken(KStemFilter.java:64)
	at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:174)
	at org.apache.lucene.index.DocFieldProcessor.processDocument(DocFieldProcessor.java:254)
	at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:256)
	at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:376)
	at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1485)
	at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1160)
	at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1141)
	at com.atlassian.confluence.search.lucene.tasks.AddDocumentIndexTask.perform(AddDocumentIndexTask.java:43)
	at com.atlassian.confluence.search.lucene.ReindexWorkBatch.indexCollection(ReindexWorkBatch.java:146)
	at com.atlassian.confluence.search.lucene.ReindexWorkBatch$1.doInTransaction(ReindexWorkBatch.java:113)
	at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:128)
	at com.atlassian.confluence.search.lucene.ReindexWorkBatch.executeTransaction(ReindexWorkBatch.java:84)
	at com.atlassian.confluence.search.lucene.ReindexWorkBatch.run(ReindexWorkBatch.java:72)
	at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
	at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
	at java.util.concurrent.FutureTask.run(Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.lang.Thread.run(Unknown Source)
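
When several thread dumps have been collected, checking them by hand can be tedious. The following is a minimal Python sketch of that check, not an Atlassian tool. It assumes each dump is plain text in the format shown above, with thread entries separated by blank lines; the function names are made up for illustration.

```python
import re

# Frame named in the stack trace above; treated here as the signature of the hang.
STUCK_FRAME = "UAX29URLEmailTokenizerImpl.getNextToken"


def find_stuck_indexers(dump_text):
    """Return names of RUNNABLE threads whose stack contains STUCK_FRAME."""
    stuck = []
    # Assumption: thread entries in the dump are separated by blank lines.
    for block in re.split(r"\n\s*\n", dump_text):
        header = re.match(r'\s*"([^"]+)"', block)
        if (
            header
            and "java.lang.Thread.State: RUNNABLE" in block
            and STUCK_FRAME in block
        ):
            stuck.append(header.group(1))
    return stuck


def stuck_in_every_dump(dumps):
    """Names reported as stuck in all of the given dumps (taken some time apart)."""
    names = [set(find_stuck_indexers(d)) for d in dumps]
    return set.intersection(*names) if names else set()
```

A thread that shows up in every dump, such as "Indexer: 1" in the example, is the likely candidate to follow up on in the log.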

Note the indexing thread (in the above example, "Indexer: 1") and look in the atlassian-confluence.log for the last line referencing this thread. This will only show up if you have DEBUG logging enabled on the classes described above. For example:
```
2013-12-17 13:41:39,837 DEBUG [Indexer: 1] [confluence.search.lucene.ReindexWorkBatch] indexCollection Index Attachment: Example Attachment Name.xml (32376037) example_username [com.atlassian.confluence.pages.Attachment] 
```
In this example, the attachment "Example Attachment Name.xml" is preventing the indexing job from finishing. Remove this file from the Confluence instance, then try reindexing again.
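
The log lookup can also be scripted. The sketch below is illustrative only: the regular expression is modelled on the single example line above, so other Confluence versions may need a different pattern, and the helper names are hypothetical. The number in parentheses is returned as-is, since the log line does not say what it represents.

```python
import re

# Assumption: the message has the shape "Index Attachment: <name> (<number>) ...",
# as in the one example line above.
ATTACHMENT_RE = re.compile(r"Index Attachment: (?P<name>.+) \((?P<number>\d+)\)")


def last_line_for_thread(log_lines, thread_name):
    """Return the last log line tagged with [thread_name], or None."""
    tag = "[" + thread_name + "]"
    last = None
    for line in log_lines:
        if tag in line:
            last = line.rstrip("\n")
    return last


def attachment_from_line(line):
    """Extract (file name, number in parentheses) from the line, or None."""
    match = ATTACHMENT_RE.search(line)
    if match is None:
        return None
    return match.group("name"), int(match.group("number"))
```

Because an open file yields its lines when iterated, `last_line_for_thread(open("atlassian-confluence.log"), "Indexer: 1")` would scan the whole log; the result can then be passed to `attachment_from_line`.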

Workaround 2

It is possible to prevent the contents of text attachments from being indexed entirely, thereby bypassing this Lucene bug. This is done by modifying an XML file inside one of the core Confluence plugins. Please see the warnings/caveats section below before proceeding.

Replace the version number (denoted by "x.x.x" in the file names below) with the actual version of Confluence you are using. The procedure below was originally carried out against Confluence 5.3.4.

Shut down Confluence
Make a backup copy of <confluence_install>/confluence/WEB-INF/lib/confluence-x.x.x.jar outside of the Confluence installation.
Extract the jar's contents to an empty directory: unzip -d /path/to/blank/directory confluence-x.x.x.jar
Navigate to the extracted contents and open the following in a text editor: /plugins/core-extractors.xml

Comment out or remove the following section:

<extractor name="Text Attachment Content Extractor" key="textAttachmentContentExtractor" class="com.atlassian.bonnie.search.extractor.DefaultTextContentExtractor" priority="1000">
    <description>Indexes text attachments if nothing else has indexed them already</description>
</extractor>

Re-create the jar from the extracted contents (jar -cf confluence-x.x.x-modified.jar *), then replace the original confluence-x.x.x.jar in <confluence_install>/confluence/WEB-INF/lib/ with the modified one.
Start Confluence
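
For anyone scripting the edit to core-extractors.xml, the "comment out" step might look like the Python sketch below. It is a sketch under assumptions, not a tested procedure: it works on the file's text with a regular expression, assumes the extractor element appears as shown above and is not already commented out, and uses a hypothetical function name. Extracting and re-creating the jar are left to the unzip and jar commands in the steps.

```python
import re

# Key of the extractor shown in the steps above.
EXTRACTOR_KEY = "textAttachmentContentExtractor"


def comment_out_extractor(xml_text, key=EXTRACTOR_KEY):
    """Wrap the <extractor ... key="KEY" ...>...</extractor> element in an XML comment."""
    pattern = re.compile(
        r'<extractor\b[^>]*\bkey="' + re.escape(key) + r'"[^>]*>.*?</extractor>',
        re.DOTALL,
    )

    def wrap(match):
        element = match.group(0)
        # "--" is not allowed inside an XML comment, so refuse rather than
        # produce a malformed file.
        if "--" in element:
            raise ValueError("element contains '--' and cannot be commented out")
        return "<!-- " + element + " -->"

    # Only the first matching element is wrapped; other extractors are untouched.
    return pattern.sub(wrap, xml_text, count=1)
```

Reading the file, passing its text through `comment_out_extractor`, and writing the result back leaves every other extractor definition unchanged. If no element with that key is found, the text is returned as-is.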

Warnings/caveats with this workaround:

This is not a configuration that we regularly test against in Confluence releases. You apply this workaround at your own risk.
Some plugins or macros may break if they rely on the contents of text attachments being indexed.
The workaround must be re-applied after each upgrade, as the customization does not carry over automatically between versions.

Attachments

testLongEMAILatomText.txt
797 kB
02/Mar/2015 3:32 AM

Issue Links

relates to

CONFSERVER-32752 Option in the UI to disable all attachment content indexing in Confluence

Closed

links to

Atlassian lucene library

Indexing job becomes stuck due to the UAX29URLEmailTokenizerImpl analyzer
