  1. Confluence Data Center
  2. CONFSERVER-40176

Confluence ignores the system property officeconnector.textextract.word.docxmaxsize

Details

    Description

      NOTE: This bug report is for Confluence Server. Using Confluence Cloud? See the corresponding bug report.

      Summary

      The current System Properties documentation lists the setting officeconnector.textextract.word.docxmaxsize, but Confluence ignores it when it is set in setenv.sh.

      Environment

      • Confluence 5.8.x or Confluence 5.9.x

      Steps to Reproduce

      1. Add the following to setenv.sh:
        setenv.sh
        CATALINA_OPTS="-Dofficeconnector.textextract.word.docxmaxsize=1000 ${CATALINA_OPTS}"
        
      2. Insert the attached file lorum.docx into a page.
      3. Check the logs. The following error does not appear:
        atlassian-confluence.log
        com.atlassian.bonnie.search.extractor.ExtractorException: Error reading content of Word XML document: Document too big for text extraction, bailing out

      Expected Results

      The following error should appear, because the size limit is now set to 1000 (about 1 KB):

      atlassian-confluence.log
      com.atlassian.bonnie.search.extractor.ExtractorException: Error reading content of Word XML document: Document too big for text extraction, bailing out

      Actual Results

      The file is indexed correctly.

      Notes

      To process the attached file, you will need to increase the Java heap size in setenv.sh, for example to -Xmx8192m.

      The value officeconnector.textextract.word.docxmaxsize is referenced in com.atlassian.confluence.extra.officeconnector.index.word#WordXMLTextExtractor.java as

      WordXMLTextExtractor.java
      private static final long MAX_XML_SIZE = Long.getLong("officeconnector.textextract.word.docxmaxsize", 1024 * 1024 * 16); // Maximum, 16Mb of XML text, more than enough for most files
      ...
      if (finalSize > MAX_XML_SIZE) {
          throw new ExtractorException("Document too big for text extraction, bailing out");
      }
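
      To help tell apart "the property never reached the JVM" from "the property was read but not applied", the lookup quoted above can be tried in isolation. The following is only a minimal sketch, not Confluence code: the class name DocxMaxSizeSketch and its methods are hypothetical, and it re-reads the property on every call, whereas the quoted code reads it once into a static final field. It relies on the documented behaviour of Long.getLong, which returns the default when the property is unset or is not a valid long (for example "1000k").

```java
// Hypothetical sketch, not Confluence source: mirrors the property lookup quoted above.
final class DocxMaxSizeSketch {
    // Property name and default are taken from the quoted WordXMLTextExtractor snippet.
    static final String PROPERTY = "officeconnector.textextract.word.docxmaxsize";
    static final long DEFAULT_LIMIT = 1024L * 1024L * 16L;

    // Long.getLong falls back to the default if the property is unset
    // or its value cannot be parsed as a long.
    static long resolveLimit() {
        return Long.getLong(PROPERTY, DEFAULT_LIMIT);
    }

    // Same comparison as the quoted check: strictly greater than the limit.
    static boolean exceedsLimit(long extractedSize) {
        return extractedSize > resolveLimit();
    }

    public static void main(String[] args) {
        System.out.println("raw value: " + System.getProperty(PROPERTY));
        System.out.println("effective limit: " + resolveLimit());
        System.out.println("size 2000 exceeds limit: " + exceedsLimit(2000L));
    }
}
```

      If the file is saved as DocxMaxSizeSketch.java, it can be run on Java 11 or later with java -Dofficeconnector.textextract.word.docxmaxsize=1000 DocxMaxSizeSketch.java. If the raw value prints as null inside the real Confluence JVM, the CATALINA_OPTS line is probably not being picked up at all.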
      

      Workaround

      There is no workaround.

            People

              Assignee: Unassigned
              Reporter: James Richards (jrichards@atlassian.com)
              Votes: 13
              Watchers: 25
