Confluence Data Center / CONFSERVER-58824

Failed to extract index from Word doc file smaller than 16MB.

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Low
    • Fix Version/s: None
    • Affects Version/s: 6.15.2
    • Component/s: Search - Indexing
    • Labels: None

      Issue Summary

      The following error is seen when attaching a 12MB Word file:

      2019-08-14 17:02:52,191 WARN [attachment-text-extraction-worker-2] [bonnie.search.extractor.BaseAttachmentContentExtractor] addFields Error indexing attachment (Attachment: test v1.0.docx v.1 (31099938) user1)
      com.atlassian.bonnie.search.extractor.ExtractorException: java.lang.Exception: Error reading content of Word XML document: Document too big for text extraction, bailing out
          at com.atlassian.confluence.extra.officeconnector.index.word.WordXMLTextExtractor.extractText(WordXMLTextExtractor.java:53)

      Environment

      The issue can be reproduced with:

      • Confluence 6.15.2, 6.15.9
      • MS SQL Server
      • PostgreSQL 9.6

      Steps to Reproduce

      The attached sample .docx files can be used to reproduce the issue.

      • sample_file2.docx: file size 14181362 bytes, no error.
      • sample_file3.docx: file size 14259624 bytes, error.
      1. Create a page and attach sample_file3.docx.
      2. Check the Confluence log. It contains the exception "Document too big for text extraction".

      Expected Results

      No error is logged for a Word file smaller than 16 MB.

      Actual Results

      The exception below is logged in atlassian-confluence.log after DEBUG logging is enabled for the following packages:

      • com.atlassian.confluence.internal.index
      • com.atlassian.confluence.search.lucene
      • com.atlassian.bonnie.search.extractor
      2019-09-06 11:01:33,903 DEBUG [Caesium-1-2] [internal.index.attachment.DefaultAttachmentExtractedTextManager] getContent Can't read extracted text of attachment 983046
      2019-09-06 11:01:33,904 DEBUG [Caesium-1-2] [search.lucene.extractor.AttachmentExtractedTextExtractor] addFields Extracted text of 983046 is not available, request an extraction
      2019-09-06 11:01:33,905 DEBUG [attachment-text-extraction-worker-2] [internal.index.attachment.AttachmentTextExtractionFunction] apply Text extraction for 983046 starting
      2019-09-06 11:01:33,906 DEBUG [attachment-text-extraction-worker-2] [internal.index.attachment.DefaultAttachmentExtractedTextManager] getContent Can't read extracted text of attachment 983046
      2019-09-06 11:01:33,907 DEBUG [attachment-text-extraction-worker-2] [bonnie.search.extractor.BaseAttachmentContentExtractor] addFields Starting to index attachment: sample_file3.docx
      2019-09-06 11:01:33,978 WARN [attachment-text-extraction-worker-2] [bonnie.search.extractor.BaseAttachmentContentExtractor] addFields Error indexing attachment (Attachment: sample_file3.docx v.1 (983046) admin)
      com.atlassian.bonnie.search.extractor.ExtractorException: java.lang.Exception: Error reading content of Word XML document: Document too big for text extraction, bailing out
      

      Workaround

      No workaround

      Atlassian status as of July 2020

      The size check is based on the uncompressed size of the file. Because .docx files are compressed (ZIP) archives, text extraction may fail even though the file size on disk appears to be under the 16 MB limit.
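
To see how far the on-disk size and the uncompressed size of a .docx can diverge, the archive's entry sizes can be read directly. The following is a minimal Python sketch and not Confluence's own check. It assumes that "16 MB" means 16 * 1024 * 1024 bytes, and it reports both the total of all entries and the size of word/document.xml, because the issue does not say which of these the extractor measures. `docx_sizes` and `make_sample` are hypothetical names used only for illustration.

```python
import io
import zipfile

# Assumption: "16 MB" is taken as 16 * 1024 * 1024 bytes. The issue does not
# state the exact default, nor the unit the system property expects.
ASSUMED_LIMIT_BYTES = 16 * 1024 * 1024


def docx_sizes(source):
    """Return (compressed_total, uncompressed_total, main_part_uncompressed).

    source: a path or a binary file-like object containing a .docx (ZIP) file.
    """
    with zipfile.ZipFile(source) as archive:
        entries = archive.infolist()
    compressed = sum(e.compress_size for e in entries)
    uncompressed = sum(e.file_size for e in entries)
    # word/document.xml is the usual main part of a .docx package.
    main = sum(e.file_size for e in entries if e.filename == "word/document.xml")
    return compressed, uncompressed, main


def make_sample(main_part_bytes):
    """Hypothetical helper: build an in-memory, highly compressible archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("word/document.xml", b"a" * main_part_bytes)
    buf.seek(0)
    return buf


if __name__ == "__main__":
    compressed, uncompressed, main = docx_sizes(make_sample(17 * 1024 * 1024))
    print(f"compressed={compressed} uncompressed={uncompressed} main={main}")
    print("over assumed limit:", uncompressed > ASSUMED_LIMIT_BYTES)
```

Calling `docx_sizes("sample_file3.docx")` on a downloaded copy of the attachment would show whether its uncompressed content exceeds the assumed limit even though the file is 13.60 MB on disk.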

      You can increase the limit using the officeconnector.textextract.word.docxmaxsize system property.

      We plan to improve the error message for this check in a future release.

      Attachments

        1. content.png (467 kB)
        2. sample_file2.docx (13.52 MB)
        3. sample_file3.docx (13.60 MB)

      Assignee: Unassigned
      Reporter: Jaeha (Inactive), jlee5@atlassian.com
      Votes: 4
      Watchers: 9