Bug
Resolution: Fixed
Medium
3.5.2, 4.0
When indexing, we are seeing this warning:
2011-04-19 22:29:48,063 WARN [Indexer: 4] [apache.pdfbox.util.PDFStreamEngine] processOperator java.io.IOException: Error: expected hex character and not :32 - url: /admin/reindex.action | userName: admin | referer: https://confluenceurl/admin/search-indexes.action | action: reindex java.io.IOException: Error: expected hex character and not :32
This is a bug in PDFBox 1.2.1 that has been fixed in 1.3.1: https://issues.apache.org/jira/browse/PDFBOX-790
[CONFSERVER-22358] Upgrade PDFBox to 1.3.1
For anyone else seeing errors with PDF content indexing: the file names themselves should still be indexed. This only affects indexing of the content within PDF files.
Hello Lachlan,
From what I can see, .pdf files aren't indexed which is bad!
This should be fixed sooner rather than later.
Cheers,
Leon
Hi Leon, we've had some issues appear upon updating the version, so we've rolled back the change for now and are going to see if the update is still viable.
Here is a log snapshot:
2011-12-14 16:22:17,313 WARN [Indexer: 1] [apache.pdfbox.util.PDFStreamEngine] processOperator java.io.IOException: Error: expected hex character and not :32
java.io.IOException: Error: expected hex character and not :32
    at org.apache.fontbox.cmap.CMapParser.parseNextToken(CMapParser.java:336)
    at org.apache.fontbox.cmap.CMapParser.parse(CMapParser.java:139)
    at org.apache.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:556)
    at org.apache.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:390)
    at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:386)
    at org.apache.pdfbox.util.operator.ShowText.process(ShowText.java:45)
    at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:567)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:250)
    at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:208)
    at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:378)
    at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:302)
    at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:258)
    at com.atlassian.bonnie.search.extractor.PdfContentExtractor.extractText(PdfContentExtractor.java:50)
    at com.atlassian.bonnie.search.extractor.BaseAttachmentContentExtractor.addFields(BaseAttachmentContentExtractor.java:40)
    at com.atlassian.confluence.plugin.descriptor.ExtractorModuleDescriptor$BackwardsCompatibleExtractor.addFields(ExtractorModuleDescriptor.java:36)
    at com.atlassian.bonnie.search.BaseDocumentBuilder.getDocument(BaseDocumentBuilder.java:104)
    at com.atlassian.confluence.search.lucene.ConfluenceDocumentBuilder.getDocument(ConfluenceDocumentBuilder.java:97)
    at com.atlassian.confluence.search.lucene.tasks.AddDocumentIndexTask.perform(AddDocumentIndexTask.java:43)
    at com.atlassian.bonnie.index.TempIndexWriter.perform(TempIndexWriter.java:73)
    at com.atlassian.confluence.search.lucene.TempIndexWriterStrategy.perform(TempIndexWriterStrategy.java:43)
    at com.atlassian.confluence.search.lucene.tasks.TempIndexBackedIndexTaskPerformer.perform(TempIndexBackedIndexTaskPerformer.java:21)
    at com.atlassian.confluence.search.lucene.reindex.ReindexWorkBatch.indexCollection(ReindexWorkBatch.java:128)
    at com.atlassian.confluence.search.lucene.reindex.ReindexWorkBatch$1.doInTransaction(ReindexWorkBatch.java:88)
    at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:128)
    at com.atlassian.confluence.search.lucene.reindex.ReindexWorkBatch.run(ReindexWorkBatch.java:58)
    at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
    at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
    at java.util.concurrent.FutureTask.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
The identical warning and stack trace are logged four more times: twice at 2011-12-14 16:22:17,314 and twice at 2011-12-14 16:22:17,315.
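To gauge how often this warning fires during a reindex, the entries in a log like the snapshot above can be tallied by timestamp. The following is a minimal sketch, not part of Confluence or PDFBox: the function names `count_cmap_warnings` and `count_in_file` are made up for illustration, and the pattern assumes the entry layout shown in this issue (timestamp, `WARN`, thread name, logger name, then the exception message).

```python
import re
from collections import Counter

# Header of one warning entry, as it appears in the log lines quoted in this issue:
# "<timestamp> WARN [<thread>] [apache.pdfbox.util.PDFStreamEngine] processOperator ..."
ENTRY = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) WARN \[[^\]]+\] "
    r"\[apache\.pdfbox\.util\.PDFStreamEngine\] processOperator "
    r"java\.io\.IOException: Error: expected hex character"
)


def count_cmap_warnings(log_text):
    """Return a Counter mapping each timestamp to its number of matching warnings."""
    return Counter(match.group(1) for match in ENTRY.finditer(log_text))


def count_in_file(path):
    """Hypothetical convenience wrapper: read a log file and count its warnings."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return count_cmap_warnings(handle.read())
```

Calling `count_in_file` with the path to your Confluence log (the path depends on your installation) returns per-timestamp counts; summing the values gives the total number of PDF text operations that hit the error.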
Fixed where?
I've installed the latest 4.1 version and am still getting a bunch of those errors for every PDF file Confluence indexes.
And the PDFBox version is still 1.2.1 ;(
QA'd this locally using a PDF that was previously giving this exception when indexing; all good now.
Thanks Lachlan,
I can see that this issue is "In Progress".
Is anyone going to be assigned to this in the near future?
Cheers,
Leon
Hi,
I've simply replaced pdfbox-1.2.1.jar with pdfbox-1.6.0.jar (under confluence-3.5.13-std/confluence/WEB-INF/lib/) and ran the reindex again.
I don't know whether it will break anything else, though...
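Before or after swapping jars like this, it can help to confirm which PDFBox version an installation actually bundles. Below is a rough sketch under a few assumptions: the jar is named `pdfbox-<version>.jar` as in the comment above, the version is purely numeric, and 1.3.1 is the first fixed release as stated in the issue description. The helper names (`bundled_jars`, `has_fix`) are hypothetical.

```python
import re
from pathlib import Path

# First PDFBox release containing the fix, per the issue description.
FIXED_IN = (1, 3, 1)


def parse_version(text):
    """Turn a dotted numeric version such as '1.2.1' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def bundled_jars(lib_dir, artifact="pdfbox"):
    """Return a sorted list of (version_tuple, file_name) for <artifact>-<version>.jar files."""
    pattern = re.compile(r"^" + re.escape(artifact) + r"-(\d+(?:\.\d+)*)\.jar$")
    found = []
    for path in Path(lib_dir).iterdir():
        match = pattern.match(path.name)
        if match:
            found.append((parse_version(match.group(1)), path.name))
    return sorted(found)


def has_fix(version, fixed_in=FIXED_IN):
    """True if the given version tuple is at or above the first fixed release."""
    return version >= fixed_in
```

Pointing `bundled_jars` at the `WEB-INF/lib` directory of the installation lists the PDFBox jars found there, and `has_fix` flags any that predate 1.3.1. The failing parser in the stack trace sits in the `org.apache.fontbox` package, so passing `artifact="fontbox"` may be worth checking too, assuming that library is shipped as its own jar in your installation.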
Thanks Steve,
Can you please tell me how to fix it?
What version incorporates this fix?
Cheers,
Leon