[CONFSERVER-40176] Confluence ignores the system property officeconnector.textextract.word.docxmaxsize

      NOTE: This bug report is for Confluence Server.

      Summary

      The current System Properties documentation lists the setting officeconnector.textextract.word.docxmaxsize, but Confluence ignores it when it is set in setenv.sh.

      Environment

      • Confluence 5.8.x or Confluence 5.9.x

      Steps to Reproduce

      1. Add the following line to setenv.sh:
        setenv.sh
        CATALINA_OPTS="-Dofficeconnector.textextract.word.docxmaxsize=1000 ${CATALINA_OPTS}"
        
      2. Attach the file lorum.docx to a page.
      3. Check the logs. The error
        atlassian-confluence.log
        com.atlassian.bonnie.search.extractor.ExtractorException: Error reading content of Word XML document: Document too big for text extraction, bailing out
        

        does not appear.

      Expected Results

      The error

      atlassian-confluence.log
      com.atlassian.bonnie.search.extractor.ExtractorException: Error reading content of Word XML document: Document too big for text extraction, bailing out
      

      should appear, because the maximum size for text extraction is now set to 1000 (about 1 KB) and the attached file is far larger.

      Actual Results

      The file is indexed correctly.

      Notes

      To index the attached file you will need to increase the Java heap size in setenv.sh to something like -Xmx8192m.

      The value officeconnector.textextract.word.docxmaxsize is referenced in com.atlassian.confluence.extra.officeconnector.index.word#WordXMLTextExtractor.java as

      WordXMLTextExtractor.java
      private static final long MAX_XML_SIZE = Long.getLong("officeconnector.textextract.word.docxmaxsize", 1024 * 1024 * 16); // Maximum, 16Mb of XML text, more than enough for most files
      ...
      if (finalSize > MAX_XML_SIZE) {
          throw new ExtractorException("Document too big for text extraction, bailing out");
      }
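
The lookup in the excerpt above can be tried in isolation. The following is a minimal sketch, not the Office Connector code: the class DocxMaxSizeSketch and its helper methods are hypothetical names introduced here, and only the property name, the default value and the comparison come from the excerpt. Long.getLong reads a JVM system property (the kind set with -D) and falls back to the default when the property is missing or is not a valid number.

```java
public class DocxMaxSizeSketch {
    // Property name and default value copied from the excerpt.
    static final String PROPERTY = "officeconnector.textextract.word.docxmaxsize";
    static final long DEFAULT_MAX = 1024L * 1024 * 16;

    // Hypothetical helper: resolves the limit with the same call the excerpt uses.
    // Long.getLong returns the default if the property is unset or unparsable.
    static long resolveLimit() {
        return Long.getLong(PROPERTY, DEFAULT_MAX);
    }

    // Hypothetical helper: mirrors the "finalSize > MAX_XML_SIZE" comparison.
    static boolean exceedsLimit(long finalSize) {
        return finalSize > resolveLimit();
    }

    public static void main(String[] args) {
        // Prints "null" for the raw value when no -D flag reached this JVM.
        System.out.println(PROPERTY + " = " + System.getProperty(PROPERTY));
        System.out.println("effective limit = " + resolveLimit());
    }
}
```

Running this with java -Dofficeconnector.textextract.word.docxmaxsize=1000 DocxMaxSizeSketch shows whether a -D flag is visible to a JVM in general. It says nothing about why Confluence ignores the value. Note that in the excerpt the limit is a static final field, so it is read once when the class is initialized, whereas this sketch re-reads the property on every call.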
      

      Workaround

      There is no workaround.

        1. lorum.docx
          402 kB
          James Richards



            Jens Kasperek (Bosch GmbH) added a comment - Same issue here on Confluence 5.9.10

            April added a comment -

            Thanks for doing that testing, Guilherme! The memory usage from this bug is a little nerve-wracking, so I appreciate you getting to this so quickly.


            April added a comment -

            Thanks James, I will do a bit more testing to see if I can narrow this down any further.

            The documents are 8.53 MB, 17.44 MB, and 6.08 MB, with no video content, so it just doesn't seem like they should be running into this.


            James Richards added a comment - - edited

            Hi adaly,

            It's 16MB of uncompressed text. The code strips out a few multimedia formats (see CONF-40432), so if you have embedded video it may then jump to the higher values, but it really depends on the document size and contents.

            If it's an issue, please log a ticket with support and we can investigate further.

            James.
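
Because a .docx file is a ZIP archive, its size on disk can be much smaller than the uncompressed XML inside it. The following is a rough sketch for estimating that uncompressed size, assuming that counting the .xml parts is a reasonable approximation. How the extractor actually computes its size is not shown in this issue, and the class and method names here are hypothetical.

```java
import java.io.IOException;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class DocxUncompressedSize {
    // Sums the uncompressed size, in bytes, of every .xml part in the archive.
    // Media parts (images, video) are skipped; this is an assumption made for
    // the estimate, not the extractor's documented behaviour.
    static long uncompressedXmlBytes(String path) throws IOException {
        long total = 0;
        try (ZipFile zip = new ZipFile(path)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                // getSize() returns -1 when the size is unknown; ignore those.
                if (!entry.isDirectory() && entry.getName().endsWith(".xml") && entry.getSize() > 0) {
                    total += entry.getSize();
                }
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java DocxUncompressedSize <file.docx>");
            return;
        }
        System.out.println(uncompressedXmlBytes(args[0]) + " bytes of uncompressed XML");
    }
}
```

Comparing the printed figure with the 16 MB default (1024 * 1024 * 16 bytes) gives a first indication of whether a document is likely to trip the check.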


            April added a comment -

            What is the default value for this system property?

            I ask because with Confluence 5.9.7, I have three docx files that throw this error.

            When this occurs, system memory usage immediately jumps 2 gig (yes, gig) and does not return that memory until I restart the service.


              Assignee: Unassigned
              Reporter: James Richards (jrichards@atlassian.com)
              Affected customers: 13
              Watchers: 25