
PDF extractor throws data format exception error in logs

    • Severity 3 - Minor
    • 56

      NOTE: This bug report is for Confluence Cloud. Using Confluence Server? See the corresponding bug report.

      The PDF indexer logs a large number of error messages when indexing PDF files.

      ERROR [Indexer: 3] [apache.pdfbox.filter.FlateFilter] decode FlateFilter: stop reading corrupt stream due to a DataFormatException
      

      This is probably caused by a bug in PDFBox:
      https://issues.apache.org/jira/browse/PDFBOX-2497

      That bug was fixed in 1.8.8, yet we are using 1.8.10 and still see the error message, so it may be a regression.

      Workaround :

      Note that this workaround has only been tested on small instances. If you run into any issues after applying it, restore the default bundled PDFBOX version, clear the plugin cache, and restart Confluence.
      This workaround applies only if your PDFBOX version is 1.8.x.

      1. Download PDFBOX version 1.8.12
      2. Shut down Confluence
      3. Go to <Confluence Installation Directory>\confluence\WEB-INF\lib and find the PDFBOX 1.8.xx jar file. Remove the jar file and keep it in a folder outside the Confluence installation.
        It is important not to leave two versions of the same plugin jar file in the installation directory, as all of them will be deployed on startup.
      4. Place the PDFBOX 1.8.12 jar file in that directory
      5. Clear the plugin cache
      6. Start Confluence

      The errors will no longer appear once the content index has been rebuilt.
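Step 3 warns against leaving two PDFBOX versions in the lib directory. The following is a minimal sketch of a pre-start check, assuming the jar follows the `pdfbox-<version>.jar` naming seen in this report (for example `pdfbox-1.8.10.jar`). The function names are hypothetical and not part of Confluence or PDFBox.

```python
import re
from pathlib import Path

# Assumption: the main PDFBox jar is named pdfbox-<version>.jar,
# e.g. pdfbox-1.8.10.jar. Other jars in the directory are ignored.
PDFBOX_JAR = re.compile(r"^pdfbox-(\d+(?:\.\d+)*)\.jar$")


def find_pdfbox_versions(lib_dir):
    """Return the sorted list of PDFBox versions found as jars in lib_dir."""
    versions = []
    for entry in Path(lib_dir).iterdir():
        match = PDFBOX_JAR.match(entry.name)
        if match:
            versions.append(match.group(1))
    return sorted(versions)


def check_single_pdfbox(lib_dir):
    """Return the single PDFBox version in lib_dir, or raise if there is not exactly one."""
    versions = find_pdfbox_versions(lib_dir)
    if len(versions) != 1:
        raise RuntimeError(f"expected exactly one PDFBox jar, found: {versions}")
    return versions[0]
```

Running `check_single_pdfbox` against the `WEB-INF\lib` directory before starting Confluence would flag a leftover 1.8.10 jar sitting next to the new 1.8.12 one.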

            [AI-396] PDF extractor throws data format exception error in logs

            Mark Symons added a comment (edited)

            Affects Confluence v5.10.0 (which still uses pdfbox-1.8.10.jar)

            I have investigated this problem further:

            Identify Problem Attachments

            Add the following package to "Logging and Profiling" and set it to DEBUG level. Per the docs, the change is not persisted and will be lost when you restart Confluence.

            com.atlassian.bonnie.search.extractor.BaseAttachmentContentExtractor

            Then, when running a re-index, the log output identifies every file being indexed:

            2016-07-01 13:23:53,117 DEBUG [Indexer: 3] [bonnie.search.extractor.BaseAttachmentContentExtractor] addFields Starting to index attachment: ITSO_TS_1000-7_V2_1_4_2010-02.pdf
            2016-07-01 13:23:54,195 ERROR [Indexer: 3] [apache.pdfbox.filter.FlateFilter] decode FlateFilter: stop reading corrupt stream due to a DataFormatException
            

            (Log snippet above edited 5th July: the original had a copy-and-paste error whereby the two log lines came from different indexer threads.)

            And the problem file named in the log? Here it is:

            http://www.kti.hu/uploads/KMK/2011/ITSO%20Tud%C3%A1st%C3%A1r/Szabv%C3%A1ny%20UK/ITSO_TS_1000-7_V2_1_4_2010-02.pdf

            Try it out... upload it to Confluence v5.10.0 and you should get the PDFBox error straight away on upload.

            What else can we tell?

            1) Thanks to DEBUG logging, it initially appeared that a couple of xls and xlsx files also gave the DataFormatException. This was incorrect: examining the Indexer thread IDs shows that the DataFormatException that appeared to come from xls* files actually "belonged" to a previously-read PDF.

            2) Some files give 2 (or even 3) DataFormatExceptions. So, on my test server, my 12 errors equate to only 8 unique files.

            3) Files that give errors do still seem to be searchable in general... although that is no guarantee that "bits are not missing".
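The thread-ID matching described in points 1 and 2 can be automated. The sketch below is a rough illustration that assumes the log line layout shown in the snippet above, and that each indexer thread logs its DataFormatException before it starts its next attachment. The function name and the `<unknown>` placeholder are hypothetical.

```python
import re
from collections import Counter

# Captures the indexer thread ID and the message that follows the logger name.
LINE = re.compile(r"\[Indexer: (\d+)\] \[[^\]]+\] (.*)$")
START = "addFields Starting to index attachment: "
ERROR_MARK = "DataFormatException"


def count_errors_per_attachment(log_lines):
    """Attribute each DataFormatException to the attachment its thread last started."""
    current = {}        # indexer thread ID -> attachment currently being indexed
    errors = Counter()  # attachment name -> number of DataFormatException lines
    for line in log_lines:
        match = LINE.search(line)
        if not match:
            continue
        thread, message = match.groups()
        if message.startswith(START):
            current[thread] = message[len(START):].strip()
        elif ERROR_MARK in message:
            # Errors on a thread with no recorded attachment are grouped together.
            errors[current.get(thread, "<unknown>")] += 1
    return errors
```

Feeding the lines of the Confluence log to `count_errors_per_attachment` gives one entry per problem file, so the number of keys is the number of unique files and each value is that file's error count.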

            Testing PDFBox from command line

            The current version of PDFBox is 2.0.2 (or 1.8.12, if you are nervous about the 2.0 release notes listing 1249 issues).

            I downloaded snapshots of 1.8.13 and 2.0.3 from here:

            https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/

            These snapshots contain all required dependencies and can be run from the command line:

            java -jar pdfbox-app-1.8.13-20160630.170856-17.jar ExtractText ITSO_TS_1000-7_V2_1_4_2010-02.pdf
            

            No error!
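To repeat this command-line check over several problem files, a small wrapper like the one below could be used. It is only a sketch: it builds the same `java -jar ... ExtractText ...` invocation shown above, and it assumes that any FlateFilter message would appear in the tool's console output, which depends on the logging configuration. All function names are hypothetical.

```python
import subprocess


def extract_text_command(app_jar, pdf_path):
    """Build the ExtractText invocation used in the manual test above."""
    return ["java", "-jar", str(app_jar), "ExtractText", str(pdf_path)]


def run_extract_text(app_jar, pdf_path, timeout=120):
    """Run ExtractText and return (exit_code, combined stdout and stderr)."""
    result = subprocess.run(
        extract_text_command(app_jar, pdf_path),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so log messages are captured too
        text=True,
        timeout=timeout,
    )
    return result.returncode, result.stdout


def mentions_data_format_exception(output):
    """True if the captured output contains the error seen in the Confluence log."""
    return "DataFormatException" in output
```

Calling `run_extract_text` once per file identified from the DEBUG log, and passing the output to `mentions_data_format_exception`, gives a quick comparison between PDFBox versions without touching the Confluence installation.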


              Assignee: Unassigned
              Reporter: Rodrigo Girardi Adami (rgadami)
              Affected customers: 11
              Watchers: 20
