[CONFSERVER-40432] Filter Out All Media Files from Microsoft Word Documents to Improve Indexing in Confluence

Type: Suggestion
Resolution: Won't Do
Fix Version/s: None
Component/s: Search - Core
Labels:
- search

UIS: 0
Support reference count: 2
Feedback Policy:

We collect Confluence feedback from various sources, and we evaluate what we've collected when planning our product roadmap. To understand how this piece of feedback will be reviewed, see our Implementation of New Features Policy.

NOTE: This suggestion is for Confluence Server. Using Confluence Cloud? See the corresponding suggestion.

Problem Definition

Before indexing a Microsoft Word document, Confluence currently strips out only embedded media files with the following extensions:

.png
.emf
.wmf
.jpg
.jpeg
.gif

If a document embeds a file of any other type, that file is not stripped, and the document may not be indexed if its remaining content is too large.

Background

Confluence currently does not index a file if its content, after the media types listed above are removed, is larger than 16 MB. A system property is meant to raise this limit, but Confluence ignores it. See CONF-40176 for more details.
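The size check described above could look roughly like the following. This is a minimal sketch, not Confluence's actual code: the class and method names are hypothetical, the property name is taken from the linked issue, and it is an assumption that the value is expressed in bytes and that the 16 MB default means 16 * 1024 * 1024 bytes.

```java
public class DocxSizeLimit {

    // Property name as quoted in the linked issue; the unit (bytes) is an assumption.
    public static final String MAX_SIZE_PROPERTY = "officeconnector.textextract.word.docxmaxsize";

    // Assumed default: 16 MB interpreted as 16 * 1024 * 1024 bytes.
    public static final long DEFAULT_MAX_BYTES = 16L * 1024 * 1024;

    // Reads the limit from the system property, falling back to the default
    // when the property is unset or not a valid number.
    public static long maxBytes() {
        return Long.getLong(MAX_SIZE_PROPERTY, DEFAULT_MAX_BYTES);
    }

    // Hypothetical check: the document is indexed only when the content that
    // remains after media stripping does not exceed the limit.
    public static boolean isIndexable(long strippedContentBytes) {
        return strippedContentBytes <= maxBytes();
    }

    public static void main(String[] args) {
        long twentyMb = 20L * 1024 * 1024;
        System.out.println("limit = " + maxBytes() + ", 20 MB indexable = " + isIndexable(twentyMb));
    }
}
```

With a check of this shape, starting the JVM with -Dofficeconnector.textextract.word.docxmaxsize=33554432 would raise the limit to 32 MB.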

Suggested Solution

In com.atlassian.confluence.extra.officeconnector.index.word#WordXMLTextExtractor, the extractor contains this check:

WordXMLTextExtractor.java

    if (!(name.contains("/media/") ||
            processedName.endsWith(".png") || processedName.endsWith(".emf") ||
            processedName.endsWith(".wmf") || processedName.endsWith(".jpg") ||
            processedName.endsWith(".jpeg") || processedName.endsWith(".gif")))

Either:

- Strip out all content in the /media folder, or
- Add all media types that are possible to add to a Word document. See Types of media files you can add.
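The first option, skipping everything under a media folder regardless of extension, might be sketched as follows. This is an illustration only, not the Office Connector implementation: the class MediaEntryFilter and its methods are hypothetical, and the sample archive is a hand-built zip with made-up entry names rather than a real Word document.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class MediaEntryFilter {

    // Hypothetical helper: true when a zip entry sits inside a "media" folder.
    // The leading "/" also catches a media folder at the archive root.
    public static boolean isMediaEntry(String entryName) {
        String normalized = "/" + entryName.replace('\\', '/').toLowerCase(Locale.ROOT);
        return normalized.contains("/media/");
    }

    // Lists the entries that would still be passed on to text extraction.
    public static List<String> indexableEntries(byte[] archive) throws IOException {
        List<String> kept = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(archive))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.isDirectory() && !isMediaEntry(entry.getName())) {
                    kept.add(entry.getName());
                }
            }
        }
        return kept;
    }

    // Builds a tiny in-memory zip with illustrative entry names for the demo.
    public static byte[] sampleArchive() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            String[] names = {"word/document.xml", "word/media/image1.png", "word/media/video1.mp4"};
            for (String name : names) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write("placeholder".getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // The .mp4 entry is skipped even though its extension is not in the current list.
        System.out.println(indexableEntries(sampleArchive()));
    }
}
```

Filtering by folder avoids having to maintain an extension list as Word adds support for new media types, at the cost of assuming that embedded media always lives under a media folder.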

Notes

Similar issues occur with other Microsoft Office documents (e.g. PowerPoint).

Discovered while testing: CONFSERVER-40176 Confluence ignores the system property officeconnector.textextract.word.docxmaxsize (Closed)

relates to: AI-781 Filter Out All Media Files from Microsoft Word Documents to Improve Indexing in Confluence (Closed)

mentioned in: Attachment indexing - stop the burn

Assignee: Unassigned

Reporter: James Richards

Votes: 0

Watchers: 5

Created: 11/Jan/2016 1:20 AM

Updated: 19/Sep/2019 5:24 AM

Resolved: 20/Sep/2018 12:46 PM
