Type: Suggestion
Resolution: Won't Fix
NOTE: This suggestion is for Confluence Cloud. Using Confluence Server? See the corresponding suggestion.
Problem Definition
Currently Confluence only strips out a limited number of media files embedded in a Microsoft Word document before indexing it:
- .png
- .emf
- .wmf
- .jpg
- .jpeg
- .gif
If a document embeds a file of any other type, that file is not stripped, and the document may then be too large to be indexed.
Background
Currently Confluence does not index a file if its content, after the media types listed above have been removed, is larger than 16 MB. A system property exists to raise this limit, but Confluence ignores it. See CONF-40176 for more details.
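To make the limit described above concrete, here is a minimal sketch of how such a threshold could be read from a system property with a 16 MB fallback. The property name is taken from the linked issue AI-206; the class name, the method names, and the assumption that the value is expressed in bytes are hypothetical and not taken from the Confluence source.

```java
final class DocxSizeLimit {
    // Property name as quoted in AI-206; that its value is in bytes is an assumption.
    static final String PROPERTY = "officeconnector.textextract.word.docxmaxsize";

    // Fallback mirrors the 16 MB figure mentioned in this ticket.
    static final long DEFAULT_MAX_BYTES = 16L * 1024 * 1024;

    // Returns the configured limit, or the default when the property is unset,
    // not a valid number, or not positive.
    static long maxBytes() {
        Long configured = Long.getLong(PROPERTY);
        return (configured != null && configured > 0) ? configured : DEFAULT_MAX_BYTES;
    }

    // True when content of the given size would still be indexed under the limit.
    static boolean withinLimit(long extractedBytes) {
        return extractedBytes <= maxBytes();
    }
}
```

With this shape, starting the JVM with -Dofficeconnector.textextract.word.docxmaxsize=33554432 would raise the limit to 32 MB, which is the behavior the ticket reports as not working.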
Suggested Solution
com.atlassian.confluence.extra.officeconnector.index.word#WordXMLTextExtractor currently contains this check:
if (!(name.contains("/media/") || processedName.endsWith(".png") || processedName.endsWith(".emf") || processedName.endsWith(".wmf") || processedName.endsWith(".jpg") || processedName.endsWith(".jpeg") || processedName.endsWith(".gif") ))
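Either:
- Strip out all content in the /media folder, or
- Add all media types that can be embedded in a Word document. See Types of media files you can add (https://support.office.com/en-au/article/Types-of-media-files-you-can-add-067fac9c-ec90-4208-94e7-7459c695cfcc#).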
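The first option could look roughly like the sketch below, which treats every archive entry under a media folder as non-indexable regardless of its extension. MediaEntryFilter, isMedia, and nonMediaBytes are hypothetical names for illustration; this is not the actual WordXMLTextExtractor code, and it assumes the document is an OOXML (zip) archive that stores embedded media under a path such as word/media/.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Locale;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

final class MediaEntryFilter {
    // True for any entry stored under a "media" folder, whatever its file type.
    static boolean isMedia(String entryName) {
        String name = entryName.toLowerCase(Locale.ROOT).replace('\\', '/');
        return name.startsWith("media/") || name.contains("/media/");
    }

    // Sums the uncompressed size of all non-media entries in the archive,
    // i.e. the content that would remain after stripping the media folder.
    static long nonMediaBytes(InputStream docx) throws IOException {
        long total = 0;
        byte[] buffer = new byte[8192];
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory() || isMedia(entry.getName())) {
                    continue; // skip folders and everything under /media/
                }
                int read;
                while ((read = zip.read(buffer)) != -1) {
                    total += read;
                }
            }
        }
        return total;
    }
}
```

Filtering by folder rather than by extension avoids maintaining a list of media types, and the same path test would also match ppt/media/ entries in PowerPoint files.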
Notes
Similar issues occur with other Microsoft Office documents (e.g. PowerPoint).
- Discovered while testing: AI-206 Confluence ignores the system property officeconnector.textextract.word.docxmaxsize (Closed)
- Is related to: CONFSERVER-40432 Filter Out All Media Files from Microsoft Word Documents to Improve Indexing in Confluence (Closed)
[AI-781] Filter Out All Media Files from Microsoft Word Documents to Improve Indexing in Confluence
Component/s | Original: Search - Core [ 46383 ] | |
Component/s | New: Search - Core [ 75296 ] | |
Support reference count | Original: 2 | |
Project | Original: Confluence Cloud [ 18513 ] | New: Atlassian Intelligence [ 23110 ] |
Resolution | New: Won't Fix [ 2 ] | |
Status | Original: Gathering Interest [ 11772 ] | New: Closed [ 6 ] |
Labels | Original: search | New: search timeout-suggestion-bulk-close202104 |
Workflow | Original: JAC Suggestion Workflow [ 3428134 ] | New: JAC Suggestion Workflow 3 [ 3611076 ] |
Workflow | Original: Confluence Workflow - Public Facing v3 [ 2248060 ] | New: JAC Suggestion Workflow [ 3428134 ] |
Status | Original: Needs Verification [ 10004 ] | New: Gathering Interest [ 11772 ] |
Support reference count | New: 2 |
Workflow | Original: Confluence Workflow - Public Facing v3 - TEMP [ 2143890 ] | New: Confluence Workflow - Public Facing v3 [ 2248060 ] |
Workflow | Original: Confluence Workflow - Public Facing v3 [ 1896297 ] | New: Confluence Workflow - Public Facing v3 - TEMP [ 2143890 ] |
Workflow | Original: Confluence Workflow - Public Facing v2 [ 1800515 ] | New: Confluence Workflow - Public Facing v3 [ 1896297 ] |