- Type: Bug
- Resolution: Fixed
- Priority: High
- Affects Version/s: 9.2.10
- Component/s: Content - Attachments, Search - Indexing
- Severity 2 - Major
Issue Summary
Uploading a DOCX file that contains a legacy OLE (Object Linking and Embedding) object can cause the Word extractor to get stuck, which blocks index updates and index rebuilds.
Confluence should have a timeout for text extraction, but currently has none.
When a DOCX file is uploaded, Confluence by default extracts its content and indexes it so that users can search for it.
However, under certain conditions, an OLE object within the file can cause the Aspose library (which Confluence uses for extraction) to hang indefinitely, preventing the index update from proceeding.
The exact conditions under which OLE causes this issue are not known, and the file that reproduces it contains customer information, so it is not attached.
Testing Environment
This issue was tested on Confluence 9.2.10, but since Confluence does not include any *timeout* for text extraction, it can affect other releases as well.
Steps to Reproduce
- Go to General Configuration > Logging and Profiling
- Add com.atlassian.confluence.internal.index.attachment at DEBUG level
  (this enables the logging needed to identify the issue)
- Upload a problematic DOCX file to Confluence
- Confluence should start extracting the file shortly afterwards for indexing
Expected Results
File extraction by Aspose should not run indefinitely; it should time out if the extraction takes too long.
Under normal circumstances, the following entries appear in atlassian-confluence.log to mark the start and end of file extraction:
2026-01-14 13:52:42,440 DEBUG [attachment-text-extraction-worker-1] [internal.index.attachment.AttachmentTextExtractionFunction] apply Text extraction for 4227073 starting
...
2026-01-14 13:52:42,588 DEBUG [attachment-text-extraction-worker-1] [internal.index.attachment.AttachmentTextExtractionFunction] apply Text extraction for 4227073 took 148 ms
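The timeout behaviour requested above can be illustrated with a bounded wait on a worker. The following is a minimal, language-agnostic sketch written in Python, not Confluence's or Aspose's actual code; extract_with_timeout and the stand-in extractor functions are hypothetical names introduced only for this illustration. Note that a timed wait only stops the caller from waiting: the worker itself keeps running unless it can be interrupted or is isolated in a separate process.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError


def extract_with_timeout(extract, data, timeout_seconds):
    """Run extract(data) in a worker thread.

    Returns the extracted text, or None if the worker does not finish
    within timeout_seconds. The worker is NOT killed on timeout; the
    caller simply stops waiting for it.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(extract, data)
    try:
        return future.result(timeout=timeout_seconds)
    except TimeoutError:
        return None
    finally:
        # Do not block the caller on a worker that may still be running.
        pool.shutdown(wait=False)


def fast_extract(data):
    # Hypothetical stand-in for an extraction that completes normally.
    return data.upper()


def slow_extract(data):
    # Hypothetical stand-in for an extraction that takes too long.
    time.sleep(0.5)
    return data.upper()


print(extract_with_timeout(fast_extract, "hello", 1.0))
print(extract_with_timeout(slow_extract, "hello", 0.05))
```

With a guard like this, the caller (here, the index queue flush) could skip the attachment and continue instead of waiting forever.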
Actual Results
File extraction can hang indefinitely on the rare occasion that a problematic file is uploaded.
In atlassian-confluence.log, only the "starting" entry is logged; the closing entry "Text extraction for xx took xx ms" never appears:
2026-01-14 13:52:42,440 DEBUG [attachment-text-extraction-worker-1] [internal.index.attachment.AttachmentTextExtractionFunction] apply Text extraction for 4227074 starting
...
...
...
When a thread dump is captured at this point, the following two threads indicate this issue:
- A Caesium thread for flushing the Index Queue in WAITING state
"Caesium-1-1" daemon prio=1 tid=0x00000000000002bd nid=0 waiting on condition
   java.lang.Thread.State: WAITING (parking)
    at java.base@17.0.5/jdk.internal.misc.Unsafe.park(Native Method)
    - parking to wait for <0x0000000036863690> (a java.util.concurrent.CompletableFuture$Signaller)
    at java.base@17.0.5/java.util.concurrent.locks.LockSupport.park(LockSupport.java:211)
    ...
    at com.atlassian.confluence.internal.search.LuceneIncrementalIndexManager.flushQueue(LuceneIncrementalIndexManager.java:125)
- An attachment-text-extraction-worker thread for extracting the file via Aspose in RUNNABLE state
"attachment-text-extraction-worker-1" #363 [159491] daemon prio=1 os_prio=31 cpu=7762.76ms elapsed=9.07s tid=0x000000013779e800 nid=159491 runnable [0x000000032f794000]
   java.lang.Thread.State: RUNNABLE
    at com.aspose.words.internal.zzZxK.zzYIY(Unknown Source)
    at com.aspose.words.internal.zzZsL.zzO3(Unknown Source)
    at com.aspose.words.zzVTn.zzXnd(Unknown Source)
    at com.aspose.words.zzVTn.zzia(Unknown Source)
    ...
    at com.aspose.words.Document.zzO3(Unknown Source)
    at com.aspose.words.Document.zzYIY(Unknown Source)
    at com.aspose.words.Document.<init>(Unknown Source)
    at com.aspose.words.Document.<init>(Unknown Source)
    at com.aspose.words.Document.<init>(Unknown Source)
    at com.atlassian.plugins.conversion.extract.xml.WordXMLExtractor.extractText(WordXMLExtractor.java:19)
    at com.atlassian.confluence.extra.officeconnector.index.word.WordXMLTextExtractor.extractText(WordXMLTextExtractor.java:45)
    at com.atlassian.confluence.extra.officeconnector.index.AbstractAttachmentExtractor.extract(AbstractAttachmentExtractor.java:27)
Rebuilding the index from scratch also involves text extraction for attachments, so it is affected in the same way.
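To check a captured thread dump for the pattern described above, the dump can be searched for threads whose stack contains Aspose frames. The following is a minimal Python sketch, assuming a jstack-style text dump in which threads are separated by blank lines; find_threads_in_package is a hypothetical helper name, and the sample dump is an abbreviated version of the threads quoted in this report.

```python
import re


def find_threads_in_package(thread_dump, package):
    """Return (thread name, state) for each thread with a frame in `package`."""
    hits = []
    # Assumption: each thread's entry is separated from the next by a blank line.
    for block in re.split(r"\n\s*\n", thread_dump):
        name = re.match(r'\s*"([^"]+)"', block)
        if name and f"at {package}." in block:
            state = re.search(r"java\.lang\.Thread\.State: (\w+)", block)
            hits.append((name.group(1), state.group(1) if state else None))
    return hits


# Abbreviated sample based on the two threads shown in this report.
SAMPLE_DUMP = '''"Caesium-1-1" daemon prio=1 tid=0x00000000000002bd nid=0 waiting on condition
   java.lang.Thread.State: WAITING (parking)
    at java.base@17.0.5/jdk.internal.misc.Unsafe.park(Native Method)

"attachment-text-extraction-worker-1" #363 daemon prio=1 runnable
   java.lang.Thread.State: RUNNABLE
    at com.aspose.words.Document.<init>(Unknown Source)
'''

print(find_threads_in_package(SAMPLE_DUMP, "com.aspose.words"))
# prints [('attachment-text-extraction-worker-1', 'RUNNABLE')]
```

A worker that shows up in RUNNABLE state inside com.aspose.words across several dumps taken minutes apart is a strong hint that the extraction is stuck rather than merely slow.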
Workaround
By setting com.atlassian.confluence.internal.index.attachment to DEBUG level and monitoring atlassian-confluence.log, it is possible to identify the attachment ID whose text extraction started but never finished.
A Confluence admin can then use the attachment ID to locate the file on disk and delete it to avoid this issue.
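One way to apply the log-monitoring step above is to pair each "starting" message with its "took ... ms" message and report the IDs left unpaired. The following is a minimal Python sketch, assuming the two DEBUG message formats quoted in this report; find_unfinished_extractions is an illustrative name and not part of Confluence. An extraction that is still legitimately running is reported as well, so results should be checked against the log timestamps.

```python
import re

# Matches both DEBUG messages quoted in this report.
EVENT_RE = re.compile(r"Text extraction for (\d+) (starting|took \d+ ms)")


def find_unfinished_extractions(log_lines):
    """Return attachment IDs that logged 'starting' but no later 'took ... ms'."""
    pending = []  # IDs with an unmatched "starting" entry, in log order
    for line in log_lines:
        # finditer handles several messages on one physical line
        for match in EVENT_RE.finditer(line):
            attachment_id, event = match.group(1), match.group(2)
            if event == "starting":
                if attachment_id not in pending:
                    pending.append(attachment_id)
            elif attachment_id in pending:
                pending.remove(attachment_id)
    return pending


# Sample lines taken from the log excerpts in this report.
PREFIX = ("DEBUG [attachment-text-extraction-worker-1] "
          "[internal.index.attachment.AttachmentTextExtractionFunction] apply ")
SAMPLE_LOG = [
    "2026-01-14 13:52:42,440 " + PREFIX + "Text extraction for 4227073 starting",
    "2026-01-14 13:52:42,588 " + PREFIX + "Text extraction for 4227073 took 148 ms",
    "2026-01-14 13:52:42,440 " + PREFIX + "Text extraction for 4227074 starting",
]

print(find_unfinished_extractions(SAMPLE_LOG))
# prints ['4227074']
```

To scan a real log, the function can be passed an open file object for atlassian-confluence.log, since iterating over a file yields its lines.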
If multiple files are causing this issue, text extraction for Word files can be disabled altogether as a further safeguard:
- Set officeconnector.textextract.word.docxmaxsize to 1 byte so that text extraction is skipped for all Word files