CONFSERVER-40176

Confluence ignores the system property officeconnector.textextract.word.docxmaxsize

      NOTE: This bug report is for Confluence Server.

      Summary

      The current System Properties documentation lists the setting officeconnector.textextract.word.docxmaxsize, but Confluence ignores it when it is set in setenv.sh.

      Environment

      • Confluence 5.8.x or Confluence 5.9.x

      Steps to Reproduce

      1. Add the following line to setenv.sh:
        setenv.sh
        CATALINA_OPTS="-Dofficeconnector.textextract.word.docxmaxsize=1000 ${CATALINA_OPTS}"
        
      2. Insert the attached file lorum.docx into a page.
      3. Check the logs. The following error does not appear:
        atlassian-confluence.log
        com.atlassian.bonnie.search.extractor.ExtractorException: Error reading content of Word XML document: Document too big for text extraction, bailing out
        

      Expected Results

      The following error should appear, because the maximum size is now set to 1000 (about 1K) and the document exceeds it:

      atlassian-confluence.log
      com.atlassian.bonnie.search.extractor.ExtractorException: Error reading content of Word XML document: Document too big for text extraction, bailing out
      

      Actual Results

      The file is indexed and no error is logged, so the configured limit is not applied.

      Notes

      For the attached file you will need to increase your Java Heap Space in setenv.sh to something like -Xmx8192m.

      The value officeconnector.textextract.word.docxmaxsize is referenced in com.atlassian.confluence.extra.officeconnector.index.word#WordXMLTextExtractor.java as

      WordXMLTextExtractor.java
      private static final long MAX_XML_SIZE = Long.getLong("officeconnector.textextract.word.docxmaxsize", 1024 * 1024 * 16); // Maximum, 16Mb of XML text, more than enough for most files
      ...
      if (finalSize > MAX_XML_SIZE) {
          throw new ExtractorException("Document too big for text extraction, bailing out");
      }
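
      The lookup quoted above relies on standard Java behavior: Long.getLong reads a JVM system property and falls back to the supplied default when the property is missing or cannot be parsed as a number. The following is a minimal standalone sketch of that behavior, not the plugin's actual code. The class DocxLimitDemo and the helpers resolveLimit and tooBig are hypothetical names introduced only for illustration; the property name and the default come from the quoted source. Unlike the quoted static final field, which is read once when the class is initialized, this sketch reads the property on every call.

```java
public class DocxLimitDemo {
    // Property name taken from the bug report.
    static final String PROP = "officeconnector.textextract.word.docxmaxsize";

    // Default taken from the quoted source: 1024 * 1024 * 16.
    static final long DEFAULT_LIMIT = 1024L * 1024 * 16;

    // Hypothetical helper: resolves the limit at call time.
    // Long.getLong returns the default if the property is unset or unparsable.
    static long resolveLimit() {
        return Long.getLong(PROP, DEFAULT_LIMIT);
    }

    // Hypothetical helper mirroring the size comparison in the quoted source.
    static boolean tooBig(long finalSize) {
        return finalSize > resolveLimit();
    }

    public static void main(String[] args) {
        System.out.println("limit = " + resolveLimit());
        System.out.println("2000 too big? " + tooBig(2000));
    }
}
```

      If a JVM started with -Dofficeconnector.textextract.word.docxmaxsize=1000 resolves the limit to 1000 in a sketch like this, the flag syntax used in the steps to reproduce is valid, which suggests checking whether the option actually reaches the Confluence JVM or whether the extractor code path is used at all.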
      

      Workaround

      There is no workaround.


      Status: Closed
      Resolution: Low Engagement
              Unassigned Unassigned
              jrichards@atlassian.com James Richards
              Affected customers: 13
              Watchers: 26