[AI-396] PDF extractor throws data format exception error in logs

Type: Bug
Resolution: Fixed
Priority: Medium
Component/s: Admin - General, Search - Core (DO NOT USE)
Labels:

Symptom Severity: Severity 3 - Minor
UIS: 56

NOTE: This bug report is for Confluence Cloud. Using Confluence Server? See the corresponding bug report: http://jira.atlassian.com/browse/CONFSERVER-39892

The PDF indexer logs a large number of error messages when indexing PDF files:

ERROR [Indexer: 3] [apache.pdfbox.filter.FlateFilter] decode FlateFilter: stop reading corrupt stream due to a DataFormatException
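
To put a number on "a large number of error messages", the log can be scanned for this line. The sketch below is illustrative only: the function name `count_flate_errors` and the regular expression are assumptions built from the single log line quoted above, not part of Confluence or PDFBox.

```python
import re
from collections import Counter
from typing import Iterable

# Pattern derived from the log line quoted in this report. The thread name
# (e.g. "Indexer: 3") varies, so it is captured instead of matched literally.
FLATE_ERROR = re.compile(
    r"ERROR \[(?P<thread>[^\]]+)\] \[apache\.pdfbox\.filter\.FlateFilter\] "
    r".*DataFormatException"
)


def count_flate_errors(lines: Iterable[str]) -> Counter:
    """Count FlateFilter DataFormatException errors per logging thread."""
    counts: Counter = Counter()
    for line in lines:
        match = FLATE_ERROR.search(line)
        if match:
            counts[match.group("thread")] += 1
    return counts
```

Passing an open log file (which one depends on your installation) to `count_flate_errors` before and after the workaround gives a rough way to check whether the errors have stopped.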

This is probably caused by a bug in PDFBox:
https://issues.apache.org/jira/browse/PDFBOX-2497

That bug was fixed in PDFBox 1.8.8, yet we are using 1.8.10 and still see the error message, so it may be a regression.
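
For context, the Flate filter in a PDF is zlib/deflate compression, and the logged message indicates that the decoder hit compressed data it could not read and stopped reading that stream. The Python sketch below only illustrates that general behavior with the standard `zlib` module. It is not PDFBox's code, and the function name `inflate_until_corrupt` is hypothetical.

```python
import zlib
from typing import Tuple


def inflate_until_corrupt(data: bytes, chunk_size: int = 64) -> Tuple[bytes, bool]:
    """Inflate a zlib stream, keeping whatever decoded before any failure.

    Returns (decoded_bytes, stream_was_bad). The stream counts as bad if zlib
    raises an error or the data ends before the end-of-stream marker.
    """
    decoder = zlib.decompressobj()
    decoded = bytearray()
    try:
        # Feed small chunks so output produced before a failure is retained.
        for start in range(0, len(data), chunk_size):
            decoded += decoder.decompress(data[start:start + chunk_size])
    except zlib.error:
        # Rough analogue of Java's DataFormatException: stop reading here.
        return bytes(decoded), True
    return bytes(decoded), not decoder.eof
```

A damaged stream in this sketch yields partial output plus a flag rather than a failure, which is roughly why the log shows an error message while indexing carries on.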

Workaround:

Note that this workaround has only been tested on small instances. If you run into any issues after applying it, restore the default bundled PDFBox version and clear the plugin cache with a restart.
This workaround applies only if your PDFBox version is 1.8.x.
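
One way to check the 1.8.x precondition is to inspect the jar file names in the `WEB-INF\lib` directory. This is a minimal sketch that assumes the bundled jar follows the Maven naming pattern `pdfbox-<version>.jar` (as the 1.8.12 download does). The helper names `parse_pdfbox_version` and `workaround_applies` are hypothetical.

```python
import re
from pathlib import Path
from typing import Optional, Tuple

# Assumed naming pattern, e.g. "pdfbox-1.8.10.jar". Related artifacts such as
# fontbox or jempbox are deliberately not matched.
PDFBOX_JAR = re.compile(r"^pdfbox-(\d+)\.(\d+)\.(\d+)\.jar$")


def parse_pdfbox_version(jar_name: str) -> Optional[Tuple[int, ...]]:
    """Return (major, minor, patch) for a PDFBox jar name, else None."""
    match = PDFBOX_JAR.match(jar_name)
    return tuple(int(part) for part in match.groups()) if match else None


def workaround_applies(lib_dir: Path) -> bool:
    """True if lib_dir holds at least one PDFBox jar and all are 1.8.x."""
    versions = [
        version
        for version in (parse_pdfbox_version(p.name) for p in lib_dir.iterdir())
        if version is not None
    ]
    return bool(versions) and all(v[:2] == (1, 8) for v in versions)
```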

1. Download PDFBox version 1.8.12: http://search.maven.org/remotecontent?filepath=org/apache/pdfbox/pdfbox/1.8.12/pdfbox-1.8.12.jar
2. Shut down Confluence.
3. Go to <Confluence Installation Directory>\confluence\WEB-INF\lib and find the PDFBox 1.8.xx jar file. Remove the jar file and keep it in a folder outside the Confluence installation.
   It is important not to leave two versions of the same plugin jar file in the installation directory, as all of them will be deployed on startup.
4. Copy the PDFBox 1.8.12 jar file into that lib directory.
5. Clear the plugin cache:
   - https://confluence.atlassian.com/display/CONFKB/How+to+clear+Confluence+plugins+cache
6. Start Confluence.
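
The jar swap itself (removing the old jar and adding the new one) can be scripted. The sketch below makes several assumptions: the function name `swap_pdfbox_jar` is hypothetical, the bundled jar is assumed to match `pdfbox-1.8.*.jar`, and all paths are supplied by the caller. It must only be run while Confluence is shut down, and it does not clear the plugin cache or restart Confluence.

```python
import shutil
from pathlib import Path
from typing import List


def swap_pdfbox_jar(lib_dir: Path, new_jar: Path, backup_dir: Path) -> List[Path]:
    """Move bundled pdfbox-1.8.*.jar files out of lib_dir, then copy new_jar in.

    Returns the backup locations of the jars that were moved.
    """
    if not new_jar.is_file():
        raise FileNotFoundError(new_jar)
    lib_resolved = lib_dir.resolve()
    backup_resolved = backup_dir.resolve()
    # The old jar must not stay under the lib directory, otherwise two
    # versions could be deployed on startup.
    if backup_resolved == lib_resolved or lib_resolved in backup_resolved.parents:
        raise ValueError("backup_dir must be outside lib_dir")
    backup_dir.mkdir(parents=True, exist_ok=True)
    moved: List[Path] = []
    for old_jar in sorted(lib_dir.glob("pdfbox-1.8.*.jar")):
        target = backup_dir / old_jar.name
        shutil.move(str(old_jar), str(target))
        moved.append(target)
    # Copy the replacement in only after the old jars are out of the way.
    shutil.copy2(new_jar, lib_dir / new_jar.name)
    return moved
```

Keeping the moved jars in `backup_dir` makes it straightforward to restore the default bundled version if problems appear afterwards.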

The errors will no longer appear once the content has been reindexed.

is related to: CONFSERVER-39892 PDF extractor throws data format exception error in logs (Closed)

pqz made changes - 10/Apr/2024 3:36 AM

Component/s	Original: Search - Core [ 46383 ]
Component/s	Original: Integrations - Office Macros [ 46351 ]
Component/s		New: Search - Core [ 75296 ]
Component/s		New: Admin Experience [ 74216 ]
Fix Version/s	Original: 5.10.4 [ 68162 ]
Key	Original: CONFCLOUD-39892	New: AI-396
Support reference count	Original: 19
Symptom Severity	Original: Severity 2 - Major [ 14431 ]	New: Severity 3 - Minor [ 14432 ]
Affects Version/s	Original: 5.10.0 [ 68013 ]
Affects Version/s	Original: 5.9.5 [ 67959 ]
Affects Version/s	Original: 5.9.2 [ 67894 ]
Affects Version/s	Original: 5.8.15 [ 67883 ]
Project	Original: Confluence Cloud [ 18513 ]	New: Atlassian Intelligence [ 23110 ]

Monique Khairuliana (Inactive) made changes - 22/Aug/2019 3:51 AM

Workflow	Original: Confluence Workflow - Public Facing - Restricted v5 - TEMP [ 2365018 ]	New: JAC Bug Workflow v3 [ 3405481 ]
Status	Original: Resolved [ 5 ]	New: Closed [ 6 ]

Katherine Yabut made changes - 22/Jun/2017 6:38 AM

Workflow	Original: Confluence Workflow - Public Facing - Restricted v5 [ 2236634 ]	New: Confluence Workflow - Public Facing - Restricted v5 - TEMP [ 2365018 ]

Katherine Yabut made changes - 31/May/2017 5:19 AM

Workflow	Original: Confluence Workflow - Public Facing - Restricted v5.1 - TEMP [ 2200776 ]	New: Confluence Workflow - Public Facing - Restricted v5 [ 2236634 ]

Katherine Yabut made changes - 31/May/2017 4:46 AM

Workflow	Original: Confluence Workflow - Public Facing - Restricted v5 - TEMP [ 2147592 ]	New: Confluence Workflow - Public Facing - Restricted v5.1 - TEMP [ 2200776 ]

Katherine Yabut made changes - 31/May/2017 1:49 AM

Workflow	Original: Confluence Workflow - Public Facing - Restricted v5 [ 1895865 ]	New: Confluence Workflow - Public Facing - Restricted v5 - TEMP [ 2147592 ]

Katherine Yabut made changes - 03/Apr/2017 3:47 AM

Workflow	Original: Confluence Workflow - Public Facing - Restricted v3 [ 1793179 ]	New: Confluence Workflow - Public Facing - Restricted v5 [ 1895865 ]

jonah (Inactive) made changes - 02/Apr/2017 8:50 AM

Description

Original: Identical to the new description below, without the NOTE panel at the top.

New: {panel:bgColor=#e7f4fa}
*NOTE:* This bug report is for *Confluence Cloud*. Using *Confluence Server*? [See the corresponding bug report|http://jira.atlassian.com/browse/CONFSERVER-39892].
{panel}

The pdf indexer throws a lot of error messages when indexing pdf files.
{code}
ERROR [Indexer: 3] [apache.pdfbox.filter.FlateFilter] decode FlateFilter: stop reading corrupt stream due to a DataFormatException
{code}
This is probably caused by a bug in the pdfbox.
https://issues.apache.org/jira/browse/PDFBOX-2497

The bug above is fixed in 1.8.8 although we are using 1.8.10 and still seeing the error message. it can possibly be a regression.

h3. Workaround :
(!) Do note that this workaround is only tested in small instances and if you're facing any issues after applying this, restore back the PDFBOX version to the default bundled version and clear the plugin cache with a restart.
(!) This is only applicable if your PDFBOX version is 1.8.x.

# Download this [PDFBOX version 1.8.12 here|http://search.maven.org/remotecontent?filepath=org/apache/pdfbox/pdfbox/1.8.12/pdfbox-1.8.12.jar]
# Shutdown Confluence
# Go to {{<Confluence Installation Directory>\confluence\WEB-INF\lib}} and search for PDFBOX 1.8.xx jar file. Remove the jar file and keep it somewhere in a non-Confluence folder.
(!) It is important not to leave two versions of the same plugin jar file in the installation directory as all of them will be deployed upon start up.
# Insert the PDFBOX 1.8.12 version here.
# Clear the plugin cache
#- https://confluence.atlassian.com/display/CONFKB/How+to+clear+Confluence+plugins+cache
# Start Confluence

The errors will not appear again after a content index.

jonah (Inactive) made changes - 02/Apr/2017 8:50 AM

Link		New: This issue is related to CONFSERVER-39892 [ CONFSERVER-39892 ]

vkharisma made changes - 01/Apr/2017 2:08 PM

Project Import		New: Sat Apr 01 14:06:06 UTC 2017 [ 1491055566265 ]

Assignee: Unassigned
Reporter: Rodrigo Girardi Adami
Affected customers: 11
Watchers: 20
Created: 12/Nov/2015 5:03 PM
Updated: 10/Apr/2024 3:36 AM
Resolved: 02/Sep/2016 12:40 AM
