Filter Out All Media Files from Microsoft Word Documents to Improve Indexing in Confluence

    • Type: Suggestion
    • Resolution: Won't Do
    • None
    • Search - Core
    • 0
    • 2
    • We collect Confluence feedback from various sources, and we evaluate what we've collected when planning our product roadmap. To understand how this piece of feedback will be reviewed, see our Implementation of New Features Policy.

      NOTE: This suggestion is for Confluence Server. Using Confluence Cloud? See the corresponding suggestion.

      Problem Definition

      Currently, Confluence strips only a limited set of embedded media file types from a Microsoft Word document before indexing it:

      • .png
      • .emf
      • .wmf
      • .jpg
      • .jpeg
      • .gif

      If a document embeds a file of a type not listed here, that file is not stripped, and the document may not be indexed if the remaining content is too large.

      Background

      Currently, Confluence does not index a file if its content, after the media types listed above have been removed, is larger than 16 MB. A system property exists that is meant to raise this limit, but it is not currently used. See CONF-40176 for more details.
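As a rough illustration of the limit described above, the check could look like the sketch below if the system property were honoured. The class name IndexSizeLimit and the property name example.index.maxContentBytes are hypothetical, since the issue does not name the real property, and 16 MB is assumed to mean 16 * 1024 * 1024 bytes.

```java
/** Hypothetical sketch, not Confluence code: a size limit that a system property can override. */
final class IndexSizeLimit {

    // Default limit from the issue text; the exact byte value is an assumption.
    static final long DEFAULT_MAX_BYTES = 16L * 1024 * 1024;

    // Hypothetical property name used only for this example.
    static final String LIMIT_PROPERTY = "example.index.maxContentBytes";

    private IndexSizeLimit() {}

    /** Reads the limit from the system property, falling back to the default. */
    static long maxBytes() {
        return Long.getLong(LIMIT_PROPERTY, DEFAULT_MAX_BYTES);
    }

    /** Content is indexable when what remains after media removal does not exceed the limit. */
    static boolean isIndexable(long remainingBytes) {
        return remainingBytes <= maxBytes();
    }
}
```

With such a check, starting the JVM with -Dexample.index.maxContentBytes=33554432 would raise the limit to 32 MB without a code change.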

      Suggested Solution

      com.atlassian.confluence.extra.officeconnector.index.word#WordXMLTextExtractor contains the following check:

      WordXMLTextExtractor.java
          if (!(name.contains("/media/") ||
                  processedName.endsWith(".png") || processedName.endsWith(".emf") ||
                  processedName.endsWith(".wmf") || processedName.endsWith(".jpg") ||
                  processedName.endsWith(".jpeg") || processedName.endsWith(".gif")))
      

      Either

      1. Strip out all content in the /media folder
        or
      2. Add all media types that are possible to add to a Word document. See Types of media files you can add.
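The two options above could be sketched as a pair of predicates over ZIP entry names. This is only an illustration, not the Office Connector code. The class MediaEntryFilter and its methods are hypothetical. The first six extensions come from the existing check, and the remaining ones are example additions rather than a complete list of what Word accepts.

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper, not part of Confluence: decides whether a .docx ZIP entry should be skipped. */
final class MediaEntryFilter {

    // First six extensions are from the existing check; the rest are illustrative additions.
    private static final Set<String> MEDIA_EXTENSIONS = Set.of(
            ".png", ".emf", ".wmf", ".jpg", ".jpeg", ".gif",
            ".bmp", ".tif", ".tiff", ".svg");

    private MediaEntryFilter() {}

    /** Option 1: skip everything stored under a media folder, whatever its extension. */
    static boolean isInMediaFolder(String entryName) {
        String n = normalize(entryName);
        return n.startsWith("media/") || n.contains("/media/");
    }

    /** Option 2: skip entries whose extension appears in the extended media list. */
    static boolean hasMediaExtension(String entryName) {
        String n = normalize(entryName);
        int dot = n.lastIndexOf('.');
        return dot >= 0 && MEDIA_EXTENSIONS.contains(n.substring(dot));
    }

    /** Combines both options: an entry is skipped if either predicate matches. */
    static boolean shouldSkip(String entryName) {
        return isInMediaFolder(entryName) || hasMediaExtension(entryName);
    }

    // Lower-cases the name and unifies path separators so matching is case-insensitive.
    private static String normalize(String entryName) {
        return entryName.replace('\\', '/').toLowerCase(Locale.ROOT);
    }
}
```

The folder-based check needs no maintenance when new media formats appear, while the extension list also catches media stored outside the media folder. Combining them, as shouldSkip does, covers both cases.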

      Notes

      Similar issues occur with other Microsoft Office documents (e.g. PowerPoint).

            [CONFSERVER-40432] Filter Out All Media Files from Microsoft Word Documents to Improve Indexing in Confluence


              Assignee: Unassigned
              Reporter: James Richards (jrichards@atlassian.com)
              Votes: 0
              Watchers: 5