
Filter Out All Media Files from Microsoft Word Documents to Improve Indexing in Confluence


      NOTE: This suggestion is for Confluence Cloud. Using Confluence Server? See the corresponding suggestion.

      Problem Definition

      Before indexing a Microsoft Word document, Confluence currently strips out only a limited set of embedded media file types:

      • .png
      • .emf
      • .wmf
      • .jpg
      • .jpeg
      • .gif

      If a document embeds a media file of a type not listed here, that file is not stripped and still counts toward the document's size, so the document may not get indexed if it is too large.

      Background

      Currently Confluence does not index a file if its content, after the media types listed above have been removed, is larger than 16 MB. There is a system property that can be set to raise this limit, but it isn't used. See CONF-40176 for more details.
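The size check described above can be sketched roughly as follows. This is only an illustration, not Confluence's implementation: the class name, the property name `example.index.maxBytes`, and the reading of "16 MB" as 16 * 1024 * 1024 bytes are all assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

class IndexSizeCheck {
    // Hypothetical property name; the real system property is not named in this issue.
    static final String LIMIT_PROPERTY = "example.index.maxBytes";
    // Assumption: "16 MB" means 16 * 1024 * 1024 bytes.
    static final long DEFAULT_LIMIT_BYTES = 16L * 1024 * 1024;

    /** Reads the limit from the (hypothetical) system property, falling back to the default. */
    static long limitBytes() {
        return Long.getLong(LIMIT_PROPERTY, DEFAULT_LIMIT_BYTES);
    }

    /** Sums the sizes of all entries that are NOT stripped before indexing. */
    static long remainingBytes(Map<String, Long> entrySizes, Predicate<String> isStripped) {
        long total = 0;
        for (Map.Entry<String, Long> entry : entrySizes.entrySet()) {
            if (!isStripped.test(entry.getKey())) {
                total += entry.getValue();
            }
        }
        return total;
    }

    /** A document is indexed only if what remains fits within the limit. */
    static boolean isIndexable(long remainingBytes, long limitBytes) {
        return remainingBytes <= limitBytes;
    }

    public static void main(String[] args) {
        Map<String, Long> sizes = new LinkedHashMap<>();
        sizes.put("word/document.xml", 1_000_000L);
        sizes.put("word/media/image1.png", 5_000_000L);
        sizes.put("word/media/movie.mp4", 20_000_000L);

        // Only .png is stripped here, so the .mp4 still counts toward the limit.
        long remaining = remainingBytes(sizes, name -> name.endsWith(".png"));
        System.out.println(isIndexable(remaining, limitBytes()));
    }
}
```

In this made-up example the unstripped video alone pushes the remaining content past the limit, which is the failure mode the issue describes.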

      Suggested Solution

      WordXMLTextExtractor, in com.atlassian.confluence.extra.officeconnector.index.word, currently contains this check:

      WordXMLTextExtractor.java
          if (!(name.contains("/media/") ||
                  processedName.endsWith(".png") || processedName.endsWith(".emf") || processedName.endsWith(".wmf") ||
                  processedName.endsWith(".jpg") || processedName.endsWith(".jpeg") ||
                  processedName.endsWith(".gif")
             ))


      Either

      1. Strip out all content in the /media folder
        or
      2. Extend the extension list to cover every media type that can be added to a Word document. See Types of media files you can add.
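The two options above could look something like the following sketch. It is a standalone illustration rather than a patch to WordXMLTextExtractor: the class and method names are hypothetical, and the extension set contains only the six types this issue lists as already stripped. Under option 2 that set would be extended; the additional types are not enumerated here.

```java
import java.util.Locale;
import java.util.Set;

class MediaEntryFilter {
    // The extensions this issue lists as stripped today.
    // Option 2 would extend this set with further media types.
    private static final Set<String> KNOWN_EXTENSIONS =
            Set.of(".png", ".emf", ".wmf", ".jpg", ".jpeg", ".gif");

    /** Option 1: treat every entry under a media folder as media, whatever its extension. */
    static boolean isInMediaFolder(String entryName) {
        // Prefix a slash so a top-level "media/..." entry also matches.
        String normalized = "/" + entryName.replace('\\', '/').toLowerCase(Locale.ROOT);
        return normalized.contains("/media/");
    }

    /** Option 2: match on a set of known media extensions, case-insensitively. */
    static boolean hasKnownMediaExtension(String entryName) {
        String lower = entryName.toLowerCase(Locale.ROOT);
        int dot = lower.lastIndexOf('.');
        return dot >= 0 && KNOWN_EXTENSIONS.contains(lower.substring(dot));
    }

    /** An entry is skipped if either rule matches. */
    static boolean shouldStrip(String entryName) {
        return isInMediaFolder(entryName) || hasKnownMediaExtension(entryName);
    }

    public static void main(String[] args) {
        System.out.println(shouldStrip("word/media/image1.tiff"));    // in media folder
        System.out.println(shouldStrip("word/embeddings/photo.JPG")); // known extension
        System.out.println(shouldStrip("word/document.xml"));         // document text
    }
}
```

The folder rule has the advantage of needing no maintenance as new media types appear, whereas the extension set has to be kept in step with whatever Word supports.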

      Notes

      Similar issues occur with other Microsoft Office documents (e.g. PowerPoint).

            [AI-781] Filter Out All Media Files from Microsoft Word Documents to Improve Indexing in Confluence

            pqz made changes -
            Component/s Original: Search - Core [ 46383 ]
            Component/s New: Search - Core [ 75296 ]
            Key Original: CONFCLOUD-40432 New: AI-781
            Support reference count Original: 2
            Project Original: Confluence Cloud [ 18513 ] New: Atlassian Intelligence [ 23110 ]
            Matthew Hunter made changes -
            Resolution New: Won't Fix [ 2 ]
            Status Original: Gathering Interest [ 11772 ] New: Closed [ 6 ]
            Matthew Hunter made changes -
            Labels Original: search New: search timeout-suggestion-bulk-close202104
            Katherine Yabut made changes -
            Workflow Original: JAC Suggestion Workflow [ 3428134 ] New: JAC Suggestion Workflow 3 [ 3611076 ]
            Monique Khairuliana (Inactive) made changes -
            Workflow Original: Confluence Workflow - Public Facing v3 [ 2248060 ] New: JAC Suggestion Workflow [ 3428134 ]
            Status Original: Needs Verification [ 10004 ] New: Gathering Interest [ 11772 ]
            SET Analytics Bot made changes -
            Support reference count New: 2
            Katherine Yabut made changes -
            Workflow Original: Confluence Workflow - Public Facing v3 - TEMP [ 2143890 ] New: Confluence Workflow - Public Facing v3 [ 2248060 ]
            Katherine Yabut made changes -
            Workflow Original: Confluence Workflow - Public Facing v3 [ 1896297 ] New: Confluence Workflow - Public Facing v3 - TEMP [ 2143890 ]
            Katherine Yabut made changes -
            Workflow Original: Confluence Workflow - Public Facing v2 [ 1800515 ] New: Confluence Workflow - Public Facing v3 [ 1896297 ]

              Unassigned Unassigned
              jrichards@atlassian.com James Richards