Type: Bug
Resolution: Fixed
Priority: Medium
Severity 3 - Minor
56
NOTE: This bug report is for Confluence Cloud. Using Confluence Server? See the corresponding bug report.
The PDF indexer logs a large number of error messages when indexing PDF files:
ERROR [Indexer: 3] [apache.pdfbox.filter.FlateFilter] decode FlateFilter: stop reading corrupt stream due to a DataFormatException
This is probably caused by a bug in PDFBox:
https://issues.apache.org/jira/browse/PDFBOX-2497
That bug was fixed in PDFBox 1.8.8, but we are using 1.8.10 and still see the error message, so it may be a regression.
Workaround:
Note that this workaround has only been tested on small instances. If you run into any issues after applying it, restore the default bundled PDFBox version and clear the plugin cache with a restart.
This is only applicable if your PDFBox version is 1.8.x.
- Download PDFBox version 1.8.12.
- Shut down Confluence.
- Go to <Confluence Installation Directory>\confluence\WEB-INF\lib and find the PDFBox 1.8.xx jar file. Remove the jar file and keep it in a folder outside the Confluence installation.
It is important not to leave two versions of the same plugin jar file in the installation directory, as all of them will be deployed on startup.
- Place the PDFBox 1.8.12 jar file in the same directory.
- Clear the plugin cache.
- Start Confluence.
The errors will not appear again once the content has been reindexed.
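Before starting Confluence again, it may be worth confirming that only one PDFBox jar is left in the lib directory. The following is a minimal sketch, not an Atlassian-provided tool. It assumes the jar follows the `pdfbox-<version>.jar` naming seen in `pdfbox-1.8.10.jar`; the function names are made up for illustration.

```python
import re
from pathlib import Path

# Assumption: the core jar is named pdfbox-<version>.jar (e.g. pdfbox-1.8.10.jar).
# Jars such as pdfbox-app-*.jar or fontbox-*.jar are deliberately not matched.
PDFBOX_JAR = re.compile(r"^pdfbox-(\d+(?:\.\d+)*)\.jar$", re.IGNORECASE)


def find_pdfbox_jars(lib_dir):
    """Return {version: path} for every PDFBox jar directly inside lib_dir."""
    found = {}
    for path in sorted(Path(lib_dir).iterdir()):
        match = PDFBOX_JAR.match(path.name)
        if match:
            found[match.group(1)] = path
    return found


def check_single_version(lib_dir, expected="1.8.12"):
    """Return (ok, message); ok is True only if the expected jar is the sole PDFBox jar."""
    versions = sorted(find_pdfbox_jars(lib_dir))
    if versions == [expected]:
        return True, f"only PDFBox {expected} is present"
    return False, f"expected only PDFBox {expected}, found: {versions or 'none'}"
```

Calling `check_single_version(r"<Confluence Installation Directory>\confluence\WEB-INF\lib")` with the real path substituted returns `False` together with the list of versions found if the old jar is still sitting next to the new one.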
is related to: CONFSERVER-39892 PDF extractor throws data format exception error in logs (Closed)
Affects Confluence v5.10.0 (which still uses pdfbox-1.8.10.jar)
I have investigated this problem further:
Identify Problem Attachments
Add the following class to "Logging and Profiling" and set it to DEBUG level. Per the docs, the change is not persisted and will be lost when you restart Confluence.
com.atlassian.bonnie.search.extractor.BaseAttachmentContentExtractor
Then, when running a re-index, the log output identifies every file being indexed:
Edited above log snippet 5th July... the original had a copy and paste error whereby the two log lines were from different indexer threads
And the problem file named in the log? Here it is:
http://www.kti.hu/uploads/KMK/2011/ITSO%20Tud%C3%A1st%C3%A1r/Szabv%C3%A1ny%20UK/ITSO_TS_1000-7_V2_1_4_2010-02.pdf
Try it out... upload it to Confluence v5.10.0 and you should get the PDFBox error straight away on upload.
What else can we tell?
1) Thanks to debug logging I can see that a couple of xls and xlsx files also give the DataFormatException. This was incorrect... examining the Indexer thread IDs shows that the DataFormatException that appeared to come from xls* files actually "belonged" to a previously read PDF.
2) Some files give 2 (or even 3) DataFormatExceptions. So, on my test server, my 12 errors equate to only 8 unique files.
3) Files that give errors do still seem to be searchable in general... although that is no guarantee that "bits are not missing".
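Matching errors to files by Indexer thread ID, as described in point 1, can be scripted. This is a rough sketch under stated assumptions. The `[Indexer: N]` marker and the `DataFormatException` text come from the error line quoted in the description. The layout of the DEBUG lines from `BaseAttachmentContentExtractor` is an assumption: the script treats whatever follows the thread marker on such a line as the identifier of the file that thread is currently indexing.

```python
import re
from collections import Counter

# Thread marker as it appears in the quoted error line: "[Indexer: 3]"
THREAD = re.compile(r"\[Indexer: (\d+)\]")


def attribute_errors(lines):
    """Count DataFormatException lines per file, matching on Indexer thread ID.

    Assumption: each DEBUG line mentioning BaseAttachmentContentExtractor names
    the file that thread is about to index; the exact layout is not known, so
    the remainder of the line after the thread marker is used as the label.
    """
    current = {}        # thread id -> label of the file that thread is working on
    errors = Counter()  # file label -> number of DataFormatException lines
    for line in lines:
        match = THREAD.search(line)
        if not match:
            continue
        thread = match.group(1)
        if "DataFormatException" in line:
            errors[current.get(thread, "<unknown file>")] += 1
        elif "BaseAttachmentContentExtractor" in line:
            current[thread] = line[match.end():].strip()
    return errors
```

Feeding it the lines of the Confluence log (for example `attribute_errors(open(log_path))`, where `log_path` is wherever your log lives) gives a count per file, which makes it easy to see both the files that produce 2 or 3 errors and the errors that merely follow an xls* line written by a different thread.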
Testing PDFBox from command line
The current version of pdfbox is 2.0.2 (or 1.8.12 if nervous about 2.0 release notes listing 1249 issues).
I downloaded snapshots of 1.8.13 and 2.0.3 from here:
https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/
These snapshots contain all required dependencies and can be run from the command line:
No error!
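To repeat this kind of check across several jars and files, a small wrapper can help. The sketch below is only one way to do it and is not necessarily the command used above. It assumes `java` is on the PATH, uses the `ExtractText` tool with the `-console` option from the pdfbox-app jar, and assumes PDFBox's log output ends up on stderr (the default when no logging configuration is supplied). The function names and file names are hypothetical.

```python
import subprocess


def extract_text_command(jar_path, pdf_path):
    """Build the java command that runs PDFBox's ExtractText tool on one PDF."""
    return ["java", "-jar", str(jar_path), "ExtractText", "-console", str(pdf_path)]


def logs_data_format_exception(jar_path, pdf_path):
    """Run ExtractText and report whether stderr mentions DataFormatException.

    Assumption: PDFBox logging goes to stderr, while the extracted text goes
    to stdout because of -console.
    """
    result = subprocess.run(
        extract_text_command(jar_path, pdf_path),
        capture_output=True,
        text=True,
        errors="replace",  # extracted text may contain bytes the locale cannot decode
    )
    return "DataFormatException" in result.stderr
```

For example, calling `logs_data_format_exception("pdfbox-app-1.8.13-SNAPSHOT.jar", "ITSO_TS_1000-7_V2_1_4_2010-02.pdf")` and then the same with the 2.0.3 snapshot jar would show whether either version still logs the exception for the problem file. The jar file names here are placeholders for whatever the downloaded snapshots are actually called.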