
Type: Bug
Resolution: Fixed
Priority: High
Fix Version/s: 11.0.0, 9.2.15, 10.2.6
Affects Version/s: 9.2.10
Component/s: Content - Attachments, Search - Indexing
Labels:

Support reference count:
1
Symptom Severity:
Severity 2 - Major
UIS:
1

Issue Summary

Uploading a DOCX file that contains a legacy OLE (Object Linking and Embedding) object can cause the Word text extractor to hang, blocking index updates and index rebuilds.

Confluence should apply a timeout to text extraction, but currently has none.

When a DOCX file is uploaded, Confluence by default extracts its content and indexes it so that users can search for it.
However, under certain conditions, an OLE object within the file can cause the Aspose library (which Confluence uses for extraction) to hang indefinitely, preventing the index update from proceeding.
The exact conditions under which OLE triggers this issue are not known. The file that reproduces it contains customer information and therefore will not be attached.

Testing Environment

This issue was tested in Confluence 9.2.10, but since Confluence does not apply any *timeout* to text extraction, other releases can be affected as well.

Steps to Reproduce

Go to General Configuration > Logging and Profiling.
- Add com.atlassian.confluence.internal.index.attachment at DEBUG level.
- This enables the logging needed to identify the issue.
Upload a problematic DOCX file to Confluence.
Confluence should start extracting the file's text shortly afterwards, for indexing purposes.

Expected Results

Text extraction by Aspose should never run indefinitely; it should time out if extraction takes too long.
Under normal circumstances, the following entries appear in atlassian-confluence.log, marking the start and end of the extraction:

2026-01-14 13:52:42,440 DEBUG [attachment-text-extraction-worker-1] [internal.index.attachment.AttachmentTextExtractionFunction] apply Text extraction for 4227073 starting
...
2026-01-14 13:52:42,588 DEBUG [attachment-text-extraction-worker-1] [internal.index.attachment.AttachmentTextExtractionFunction] apply Text extraction for 4227073 took 148 ms
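
The expected behaviour (a bounded wait around the extraction call) can be illustrated with a minimal sketch. This is not Confluence's implementation and not the actual fix; it is a generic Python illustration of the pattern, and `extract_with_timeout`, `fast_extractor`, and `slow_extractor` are hypothetical names introduced only for this example.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def extract_with_timeout(extract, data, timeout_s):
    """Run extract(data) on a worker thread; return None if it exceeds timeout_s."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(extract, data)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # The caller stops waiting, so indexing could skip this attachment.
        # Note: this does not stop the worker thread itself.
        return None
    finally:
        pool.shutdown(wait=False)


def fast_extractor(data):
    # Hypothetical stand-in for an extraction that finishes quickly.
    return data.decode("utf-8")


def slow_extractor(data):
    # Hypothetical stand-in for an extraction that takes too long.
    time.sleep(0.5)
    return data.decode("utf-8")
```

With this shape, the caller is released after `timeout_s` even if the extractor keeps running, which is the property the index queue flush needs. A worker that never returns would still occupy its thread, so a real fix would also have to deal with the stuck worker.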

Actual Results

In rare cases, when a problematic file is uploaded, text extraction hangs indefinitely.
In atlassian-confluence.log only the "starting" entry appears; the closing entry, "Text extraction for xx took xx ms", is never logged:

2026-01-14 13:52:42,440 DEBUG [attachment-text-extraction-worker-1] [internal.index.attachment.AttachmentTextExtractionFunction] apply Text extraction for 4227074 starting
...
...
...

A thread dump captured at this point should show the following two threads, which together indicate this issue:

A Caesium thread for flushing the Index Queue in WAITING state

"Caesium-1-1" daemon prio=1 tid=0x00000000000002bd nid=0 waiting on condition 
   java.lang.Thread.State: WAITING (parking)
	at java.base@17.0.5/jdk.internal.misc.Unsafe.park(Native Method)
	- parking to wait for <0x0000000036863690> (a java.util.concurrent.CompletableFuture$Signaller)
	at java.base@17.0.5/java.util.concurrent.locks.LockSupport.park(LockSupport.java:211)
...
	at com.atlassian.confluence.internal.search.LuceneIncrementalIndexManager.flushQueue(LuceneIncrementalIndexManager.java:125)

An attachment-text-extraction-worker thread for extracting the file via Aspose in RUNNABLE state

"attachment-text-extraction-worker-1" #363 [159491] daemon prio=1 os_prio=31 cpu=7762.76ms elapsed=9.07s tid=0x000000013779e800 nid=159491 runnable  [0x000000032f794000]
   java.lang.Thread.State: RUNNABLE
	at com.aspose.words.internal.zzZxK.zzYIY(Unknown Source)
	at com.aspose.words.internal.zzZsL.zzO3(Unknown Source)
	at com.aspose.words.zzVTn.zzXnd(Unknown Source)
	at com.aspose.words.zzVTn.zzia(Unknown Source)
...
	at com.aspose.words.Document.zzO3(Unknown Source)
	at com.aspose.words.Document.zzYIY(Unknown Source)
	at com.aspose.words.Document.<init>(Unknown Source)
	at com.aspose.words.Document.<init>(Unknown Source)
	at com.aspose.words.Document.<init>(Unknown Source)
	at com.atlassian.plugins.conversion.extract.xml.WordXMLExtractor.extractText(WordXMLExtractor.java:19)
	at com.atlassian.confluence.extra.officeconnector.index.word.WordXMLTextExtractor.extractText(WordXMLTextExtractor.java:45)
	at com.atlassian.confluence.extra.officeconnector.index.AbstractAttachmentExtractor.extract(AbstractAttachmentExtractor.java:27)

Rebuilding the index from scratch also involves text extraction for attachments, so it is affected in the same way.
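
The thread-dump pattern described above can be checked with a small script. The sketch below assumes a plain-text thread dump in the format quoted in this report (a header line starting with the quoted thread name, followed by the state and stack frames); the function names and `SAMPLE_DUMP` are hypothetical and exist only for illustration.

```python
WORKER_PREFIX = "attachment-text-extraction-worker"


def _is_stuck_candidate(name, frames):
    """True for an extraction worker that is RUNNABLE with Aspose frames on its stack."""
    body = "\n".join(frames)
    return (
        name.startswith(WORKER_PREFIX)
        and "java.lang.Thread.State: RUNNABLE" in body
        and "com.aspose.words" in body
    )


def stuck_extraction_threads(dump_text):
    """Return the names of candidate stuck extraction threads in a thread dump."""
    threads = []  # list of (thread name, lines belonging to that thread)
    for line in dump_text.splitlines():
        if line.startswith('"'):
            # A header line such as: "Caesium-1-1" daemon prio=1 ...
            threads.append((line.split('"')[1], []))
        elif threads:
            threads[-1][1].append(line)
    return [name for name, frames in threads if _is_stuck_candidate(name, frames)]


# Abbreviated sample modelled on the excerpts in this report.
SAMPLE_DUMP = '''"Caesium-1-1" daemon prio=1 tid=0x00000000000002bd nid=0 waiting on condition
   java.lang.Thread.State: WAITING (parking)
    at java.base@17.0.5/jdk.internal.misc.Unsafe.park(Native Method)

"attachment-text-extraction-worker-1" #363 daemon prio=1
   java.lang.Thread.State: RUNNABLE
    at com.aspose.words.Document.<init>(Unknown Source)
'''

print(stuck_extraction_threads(SAMPLE_DUMP))
```

A single dump only shows that the worker was inside Aspose at that moment. Running the check against several dumps taken some time apart, and seeing the same worker each time, is a stronger sign that extraction is stuck.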

Workaround

With com.atlassian.confluence.internal.index.attachment set to DEBUG level, monitor atlassian-confluence.log to find the Attachment ID whose text extraction started but never finished.
A Confluence admin can use that Attachment ID to locate the file on disk and delete it, which removes the problematic file and avoids the issue.
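
Matching "starting" entries against "took" entries by hand is tedious in a busy log. The following is a minimal sketch of that matching, assuming the DEBUG line format quoted in this report; `find_unfinished`, `LINE_RE`, and `SAMPLE_LOG` are hypothetical names used only here.

```python
import re

# Pattern modelled on the DEBUG entries quoted in this report.
LINE_RE = re.compile(r"Text extraction for (\d+) (starting|took \d+ ms)")


def find_unfinished(log_lines):
    """Return attachment IDs that logged 'starting' with no later 'took' entry."""
    pending = {}  # attachment ID -> its 'starting' log line
    for line in log_lines:
        match = LINE_RE.search(line)
        if not match:
            continue
        attachment_id, event = match.groups()
        if event == "starting":
            pending[attachment_id] = line.rstrip()
        else:
            # A 'took N ms' entry means this extraction completed.
            pending.pop(attachment_id, None)
    return sorted(pending)


SAMPLE_LOG = [
    "2026-01-14 13:52:42,440 DEBUG [attachment-text-extraction-worker-1] "
    "[internal.index.attachment.AttachmentTextExtractionFunction] apply "
    "Text extraction for 4227073 starting",
    "2026-01-14 13:52:42,588 DEBUG [attachment-text-extraction-worker-1] "
    "[internal.index.attachment.AttachmentTextExtractionFunction] apply "
    "Text extraction for 4227073 took 148 ms",
    "2026-01-14 13:52:43,001 DEBUG [attachment-text-extraction-worker-1] "
    "[internal.index.attachment.AttachmentTextExtractionFunction] apply "
    "Text extraction for 4227074 starting",
]

print(find_unfinished(SAMPLE_LOG))
```

To run it against a real log, pass an open file object (for example, `open("atlassian-confluence.log")`) instead of `SAMPLE_LOG`. An ID that has only just started extracting will also be reported, so treat the most recent entries with caution.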

If multiple files are causing this issue, text extraction for Word files can be disabled altogether as a further safeguard:

Set the officeconnector.textextract.word.docxmaxsize system property to 1 (byte) so that text extraction is skipped for all Word files.
