[CONFSERVER-40176] Confluence ignores the system property officeconnector.textextract.word.docxmaxsize

Type: Bug
Resolution: Unresolved
Priority: Low
Fix Version/s: None
Affects Version/s: 5.9.1, 5.8.10, 5.9.5, 5.9.7, 5.10.8
Component/s: Search - Indexing
Labels:

Support reference count:
23
Symptom Severity:
Severity 3 - Minor
UIS:
2
Bug Fix Policy:
View Atlassian Server bug fix policy

NOTE: This bug report is for Confluence Server. Using Confluence Cloud? See the corresponding bug report.

Summary

The current System Properties documentation lists the setting officeconnector.textextract.word.docxmaxsize, but Confluence ignores it when it is set in setenv.sh.

Environment

Confluence 5.8.x or Confluence 5.9.x

Steps to Reproduce

Add the following line to setenv.sh:

setenv.sh

CATALINA_OPTS="-Dofficeconnector.textextract.word.docxmaxsize=1000 ${CATALINA_OPTS}"

Add the attached file lorum.docx to a page.

Check the logs. The error

atlassian-confluence.log

com.atlassian.bonnie.search.extractor.ExtractorException: Error reading content of Word XML document: Document too big for text extraction, bailing out

does not appear.
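
Before concluding that the property is ignored, it can help to rule out a quoting or typing mistake in the -D flag. The following is a minimal sketch, not part of Confluence: the class name DocxMaxSizeCheck and the method describe() are hypothetical. It runs in its own JVM, so it only shows how a given flag is parsed there; it says nothing about what the Confluence process itself received.

```java
public class DocxMaxSizeCheck {
    // Property name taken from this report.
    static final String PROPERTY = "officeconnector.textextract.word.docxmaxsize";

    // Describes how the property resolves in the current JVM.
    static String describe() {
        String raw = System.getProperty(PROPERTY);
        // Long.getLong returns null when the property is unset or is not a valid long.
        Long parsed = Long.getLong(PROPERTY);
        if (raw == null) {
            return PROPERTY + " is not set";
        }
        if (parsed == null) {
            return PROPERTY + " is set to '" + raw + "' but is not a valid long";
        }
        return PROPERTY + " = " + parsed;
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running it as `java -Dofficeconnector.textextract.word.docxmaxsize=1000 DocxMaxSizeCheck` uses the same flag as the setenv.sh line above.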

Expected Results

The error

atlassian-confluence.log

com.atlassian.bonnie.search.extractor.ExtractorException: Error reading content of Word XML document: Document too big for text extraction, bailing out

should appear, because the maximum size is now set to 1000 (about 1 KB), far below the size of the attached file.

Actual Results

The file is indexed correctly.

Notes

For the attached file, you will need to increase the Java heap size in setenv.sh to something like -Xmx8192m.

The property officeconnector.textextract.word.docxmaxsize is read in com.atlassian.confluence.extra.officeconnector.index.word#WordXMLTextExtractor.java as follows:

WordXMLTextExtractor.java

private static final long MAX_XML_SIZE = Long.getLong("officeconnector.textextract.word.docxmaxsize", 1024 * 1024 * 16); // Maximum of 16 MB of XML text, more than enough for most files
...
if (finalSize > MAX_XML_SIZE) {
    throw new ExtractorException("Document too big for text extraction, bailing out");
}
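
The lookup pattern in that excerpt can be tried in isolation. The sketch below is an illustration under stated assumptions, not the Office Connector code: the class MaxSizeSketch and the methods resolveLimit() and checkSize() are hypothetical, and IllegalStateException stands in for ExtractorException, which is not available outside Confluence. It shows two properties of the standard library call: Long.getLong(name, default) falls back to the default when the property is unset or not a valid long, and a value stored in a static final field is read only once, when the class is initialized.

```java
public class MaxSizeSketch {
    // Property name and default taken from the excerpt in this report.
    static final String PROPERTY = "officeconnector.textextract.word.docxmaxsize";
    static final long DEFAULT_MAX = 1024L * 1024 * 16; // 16 MB

    // Reads the limit the same way the excerpt does, but on every call,
    // so the effect of setting the property can be observed.
    static long resolveLimit() {
        return Long.getLong(PROPERTY, DEFAULT_MAX);
    }

    // Mirrors the size check; IllegalStateException is a stand-in for
    // ExtractorException (an assumption made for this sketch).
    static void checkSize(long finalSize, long limit) {
        if (finalSize > limit) {
            throw new IllegalStateException("Document too big for text extraction, bailing out");
        }
    }

    public static void main(String[] args) {
        long limit = resolveLimit();
        System.out.println("limit = " + limit);
        try {
            checkSize(402_000L, limit); // roughly the size of lorum.docx on disk
            System.out.println("size check passed");
        } catch (IllegalStateException e) {
            System.out.println("size check failed: " + e.getMessage());
        }
    }
}
```

If this standalone check honours -Dofficeconnector.textextract.word.docxmaxsize=1000 while Confluence does not, that would suggest the flag is not reaching the Confluence JVM or that a different code path handles the file; the sketch cannot tell which.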

Workaround

There is no workaround.


lorum.docx
09/Dec/2015 12:50 AM
402 kB
James Richards

is related to

CONFSERVER-40914 Check size of attachment before content indexing

Closed

relates to

AI-206 Confluence ignores the system property officeconnector.textextract.word.docxmaxsize

Closed

Testing discovered

CONFSERVER-40432 Filter Out All Media Files from Microsoft Word Documents to Improve Indexing in Confluence

Closed

mentioned in: Attachment indexing - stop the burn

Jens Kasperek (Bosch GmbH) added a comment - 29/Nov/2016 8:52 AM

Same issue here on Confluence 5.9.10


April added a comment - 31/Mar/2016 2:55 PM

Thanks for doing that testing, Guilherme! The memory usage from this bug is a little nerve-wracking, so I appreciate you getting to this so quickly.


April added a comment - 28/Mar/2016 1:20 PM

Thanks James, I will do a bit more testing to see if I can narrow this down any further.

The documents are 8.53 MB, 17.44 MB, and 6.08 MB, with no video content, so it just doesn't seem like they should be running into this.


James Richards added a comment - 26/Mar/2016 5:19 AM - edited 28/Mar/2016 10:24 PM

Hi adaly,

It's 16 MB of uncompressed text. The code strips out a few multimedia formats (see CONF-40432), so if you have embedded video the size may still jump to higher values, but it really depends on the document size and contents.

If it's an issue, please log a ticket with support and we can investigate further.

James.
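
Because a .docx file is a ZIP archive, its size on disk can be much smaller than the uncompressed text inside it. The following rough sketch estimates the uncompressed size of the parts in a .docx whose names end with a given suffix. The class DocxUncompressedSize and the method totalUncompressedSize() are hypothetical, and which parts the extractor actually counts is an assumption; the sketch simply sums the sizes recorded in the archive.

```java
import java.io.IOException;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class DocxUncompressedSize {
    // Sums the recorded uncompressed sizes of all entries whose names end with the suffix.
    static long totalUncompressedSize(String path, String suffix) throws IOException {
        long total = 0;
        try (ZipFile zip = new ZipFile(path)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                long size = entry.getSize(); // -1 when the size is not recorded
                if (!entry.isDirectory() && entry.getName().endsWith(suffix) && size > 0) {
                    total += size;
                }
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java DocxUncompressedSize <file.docx>");
            return;
        }
        long bytes = totalUncompressedSize(args[0], ".xml");
        System.out.println("uncompressed XML parts: " + bytes + " bytes");
        System.out.println("default limit:          " + (1024L * 1024 * 16) + " bytes");
    }
}
```

A document of 8.53 MB on disk could, depending on its contents, exceed 16 MB once its XML parts are decompressed, which would be consistent with the error described in this thread.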


April added a comment - 26/Mar/2016 5:13 AM

What is the default value for this system property?

I ask because with Confluence 5.9.7, I have three docx files that throw this error.

When this occurs, system memory usage immediately jumps 2 gig (yes, gig) and does not return that memory until I restart the service.

