  1. Confluence Data Center
  2. CONFSERVER-22699

Attachment extractor can't extract RTF files generated by ГАРАНТ (Russian government legal documents database)


Details

    Description

      Summary of the Bug

      The indexer cannot index or extract text from RTF documents generated by "ГАРАНТ" (Russian government legal documents database).

      The following stack trace is recorded in the logs:

      2011-05-20 22:29:28,850 WARN [Indexer: 2] [bonnie.search.extractor.BaseAttachmentContentExtractor] addFields Error indexing attachment (Attachment: 110-п_от_15_05_2009_Постановление_Правительства_Ханты-Мансийского_АО_-_Югры.rtf v.1 (1179649) adminconf)
       -- referer: http://localhost:8354/admin/search-indexes.action | url: /admin/reindex.action | userName: adminconf | action: reindex
      com.atlassian.bonnie.search.extractor.ExtractorException: Error reading content of Word document: The document appears to be corrupted and cannot be loaded.
      	at com.atlassian.confluence.extra.officeconnector.index.word.WordTextExtractor.extractText(WordTextExtractor.java:41)
      	at com.atlassian.bonnie.search.extractor.BaseAttachmentContentExtractor.addFields(BaseAttachmentContentExtractor.java:40)
      	at com.atlassian.confluence.plugin.descriptor.ExtractorModuleDescriptor$BackwardsCompatibleExtractor.addFields(ExtractorModuleDescriptor.java:45)
      	at com.atlassian.bonnie.search.BaseDocumentBuilder.getDocument(BaseDocumentBuilder.java:104)
      	at com.atlassian.confluence.search.lucene.ConfluenceDocumentBuilder.getDocument(ConfluenceDocumentBuilder.java:102)
      	at com.atlassian.confluence.search.lucene.tasks.AddDocumentIndexTask.perform(AddDocumentIndexTask.java:43)
      	at com.atlassian.bonnie.index.TempIndexWriter.perform(TempIndexWriter.java:73)
      	at com.atlassian.confluence.search.lucene.TempIndexWriterStrategy.perform(TempIndexWriterStrategy.java:43)
      	at com.atlassian.confluence.search.lucene.tasks.TempIndexBackedIndexTaskPerformer.perform(TempIndexBackedIndexTaskPerformer.java:21)
      	at com.atlassian.confluence.search.lucene.DefaultObjectQueueWorker.indexCollection(DefaultObjectQueueWorker.java:78)
      	at com.atlassian.confluence.search.lucene.DefaultObjectQueueWorker$1.doInTransactionWithoutResult(DefaultObjectQueueWorker.java:62)
      	at org.springframework.transaction.support.TransactionCallbackWithoutResult.doInTransaction(TransactionCallbackWithoutResult.java:33)
      	at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:127)
      	at com.atlassian.confluence.search.lucene.DefaultObjectQueueWorker.run(DefaultObjectQueueWorker.java:51)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
      	at java.lang.Thread.run(Thread.java:662)
      Caused by: com.aspose.words.FileCorruptedException: The document appears to be corrupted and cannot be loaded.
      	at com.aspose.words.Document.a(Unknown Source)
      	at com.aspose.words.Document.b(Unknown Source)
      	at com.aspose.words.Document.a(Unknown Source)
      	at com.aspose.words.Document.<init>(Unknown Source)
      	at com.aspose.words.Document.<init>(Unknown Source)
      	at com.aspose.words.Document.<init>(Unknown Source)
      	at com.atlassian.confluence.extra.officeconnector.index.word.WordTextExtractor.extractText(WordTextExtractor.java:37)
      	... 16 more
      Caused by: java.lang.NullPointerException: style
      	at asposewobfuscated.am.c(Unknown Source)
      	at com.aspose.words.aav.a(Unknown Source)
      	at com.aspose.words.wp.a(Unknown Source)
      	at com.aspose.words.wp.d(Unknown Source)
      	at com.aspose.words.fq.gg(Unknown Source)
      	at com.aspose.words.fq.d(Unknown Source)
      	at com.aspose.words.fq.read(Unknown Source)
      	... 22 more
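
The trace above nests two causes: the FileCorruptedException reported to Confluence wraps a NullPointerException ("style") thrown inside the obfuscated Aspose.Words code. When triaging similar reports, it can help to pull out the innermost cause automatically. The following is a minimal sketch in Python; the function name `root_cause` is hypothetical and it assumes the standard Java "Caused by: " line prefix shown in the trace.

```python
from typing import Iterable, Optional

CAUSE_PREFIX = "Caused by: "


def root_cause(trace_lines: Iterable[str]) -> Optional[str]:
    """Return the text of the last 'Caused by:' line in a Java stack trace.

    Returns None when the trace contains no 'Caused by:' line.
    """
    cause = None
    for line in trace_lines:
        stripped = line.strip()
        if stripped.startswith(CAUSE_PREFIX):
            # Later causes are nested deeper, so keep overwriting.
            cause = stripped[len(CAUSE_PREFIX):]
    return cause
```

Applied to the trace in this report, the innermost cause is the "style" NullPointerException rather than the generic "document appears to be corrupted" message.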
      

      Steps to Reproduce

      1. Download the attached file
      2. Attach it to a page in Confluence
      3. Wait a minute (the indexer runs every minute)
      4. Check atlassian-confluence.log
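
Step 4 can be scripted when many attachments are involved. The sketch below scans log lines for the "Error indexing attachment" warning shown in the stack trace and extracts the attachment name, version, and the number in parentheses. It assumes the log line format quoted in this report; the names `failed_attachments` and `scan_log` are hypothetical, and treating the parenthesised number as an attachment ID is an assumption.

```python
import re
from typing import Dict, Iterable, List

# Pattern based on the WARN line quoted in this report, e.g.
# "... Error indexing attachment (Attachment: file.rtf v.1 (1179649) adminconf)"
ATTACHMENT_RE = re.compile(
    r"Error indexing attachment \(Attachment: (?P<name>.+?) "
    r"v\.(?P<version>\d+) \((?P<id>\d+)\)"
)


def failed_attachments(lines: Iterable[str]) -> List[Dict[str, object]]:
    """Collect every attachment the indexer reported it could not index."""
    found = []
    for line in lines:
        match = ATTACHMENT_RE.search(line)
        if match:
            found.append({
                "name": match.group("name"),
                "version": int(match.group("version")),
                # Assumed to be the attachment ID; the report does not say.
                "id": int(match.group("id")),
            })
    return found


def scan_log(path: str) -> List[Dict[str, object]]:
    """Read a log file (for example atlassian-confluence.log) and scan it."""
    with open(path, encoding="utf-8", errors="replace") as log:
        return failed_attachments(log)
```

Calling `scan_log` with the path to atlassian-confluence.log returns one entry per failed attachment, which gives a list of documents to run through the workaround.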

      Steps to create the bad RTF document

      1. Go to http://english.garant.ru/
      2. Open demo version
      3. Open any full text available document.
      4. Press the "Export to Word" button

      Workaround

      1. Open the problematic document in Microsoft Office
      2. Re-save the document from Microsoft Office
      3. Re-attach the re-saved document to Confluence

            People

              Assignee: Unassigned
              Reporter: Septa Cahyadiputra (Inactive)
