
Use Apache POI returned information to attempt to index Office 2007 where incorrect extension was used


      NOTE: This suggestion is for Confluence Cloud. Using Confluence Server? See the corresponding suggestion.

      I carried out a test upgrade from Confluence 3.0.2 to 3.3 over the weekend and noticed that the re-index threw over 2000 errors relating to attachments. Some of them were caused by problematic PDFs, and I've voted on CONF-18962 to get those resolved.
      However, the vast majority of the errors related to .xls and .csv files not being indexed properly.
      In many cases the following error appeared:

      org.apache.poi.poifs.filesystem.OfficeXmlFileException: The supplied data appears to be in the Office 2007+ XML. You are calling the part of POI that deals with OLE2 Office Documents. You need to call a different part of POI to process this data (eg XSSF instead of HSSF)
      

      POI is correct here. We use a tool called FlorenceSoft DiffEngineX to diff Excel documents, and it inexplicably writes Office 2007 format output while using a .xls (instead of .xlsx) extension.
      I'm not aware of any other tools that make this mistake, but I'm sure we're not the only ones with content saved under the wrong extension. Since POI can tell that the content is probably Office 2007, perhaps Confluence could catch the exception and retry indexing the document as Excel 2007? It would be fantastic, and I'd really appreciate it.
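
To illustrate the idea of trusting the content rather than the extension, here is a minimal sketch in plain Java. It is not Confluence's or POI's actual code: the class `OfficeFormatSniffer` and its `Format` enum are hypothetical names invented for this example. It only inspects the leading bytes of a file: OLE2 documents (the format HSSF reads) start with a fixed 8-byte signature, while Office 2007+ files are ZIP packages and start with `PK\003\004`. An indexer could use such a check, or simply catch `OfficeXmlFileException` as suggested above, to decide whether to hand the attachment to HSSF or XSSF.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

/** Hypothetical helper: guesses the container format from the leading bytes, ignoring the extension. */
final class OfficeFormatSniffer {

    enum Format { OLE2, OOXML, UNKNOWN }

    // Signature of an OLE2 compound document (classic .xls, .doc, .ppt).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // ZIP local file header signature; Office 2007+ files (.xlsx, .docx) are ZIP packages.
    private static final byte[] ZIP_MAGIC = { 0x50, 0x4B, 0x03, 0x04 };

    /** Classifies a buffer holding at least the first 8 bytes of the file. */
    static Format sniff(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) {
            return Format.OLE2;
        }
        if (startsWith(header, ZIP_MAGIC)) {
            // Assumption: any ZIP is treated as a candidate OOXML package.
            // A real indexer would still need to handle a parse failure here.
            return Format.OOXML;
        }
        return Format.UNKNOWN;
    }

    /** Convenience overload that reads the header from a stream (Java 11+ for readNBytes). */
    static Format sniff(InputStream in) throws IOException {
        return sniff(in.readNBytes(OLE2_MAGIC.length));
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }
}
```

With a check like this, a file named `report.xls` whose first bytes are the ZIP signature would be classified as `OOXML`, so it could be routed to the Office 2007 extractor instead of failing in the OLE2 one.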

            [AI-772] Use Apache POI returned information to attempt to index Office 2007 where incorrect extension was used

            pqz made changes -
            Component/s Original: Search - Core [ 46383 ]
            Component/s New: Search - Core [ 75296 ]
            Key Original: CONFCLOUD-20594 New: AI-772
            Affects Version/s Original: 3.3 [ 67569 ]
            Project Original: Confluence Cloud [ 18513 ] New: Atlassian Intelligence [ 23110 ]
            Katherine Yabut made changes -
            Workflow Original: JAC Suggestion Workflow [ 3405829 ] New: JAC Suggestion Workflow 3 [ 3624586 ]
            Status Original: RESOLVED [ 5 ] New: Closed [ 6 ]
            Monique Khairuliana (Inactive) made changes -
            Workflow Original: Confluence Workflow - Public Facing v3 [ 2240189 ] New: JAC Suggestion Workflow [ 3405829 ]
            Katherine Yabut made changes -
            Workflow Original: Confluence Workflow - Public Facing v3 - TEMP [ 2152241 ] New: Confluence Workflow - Public Facing v3 [ 2240189 ]
            Katherine Yabut made changes -
            Workflow Original: Confluence Workflow - Public Facing v3 [ 1890498 ] New: Confluence Workflow - Public Facing v3 - TEMP [ 2152241 ]
            Katherine Yabut made changes -
            Workflow Original: Confluence Workflow - Public Facing v2 [ 1806856 ] New: Confluence Workflow - Public Facing v3 [ 1890498 ]
            jonah (Inactive) made changes -
            Description: added the note panel at the top ("NOTE: This suggestion is for Confluence Cloud. Using Confluence Server? See the corresponding suggestion", linking to http://jira.atlassian.com/browse/CONFSERVER-20594); the rest of the description is unchanged.
            jonah (Inactive) made changes -
            Link New: This issue is related to CONFSERVER-20594 [ CONFSERVER-20594 ]
            vkharisma made changes -
            Project Import New: Sat Apr 01 14:06:06 UTC 2017 [ 1491055566265 ]
            Katherine Yabut made changes -

              Unassigned
              David Corley
              Votes: 3
              Watchers: 6
