PDF Extractor unable to index Chinese and Japanese characters


      Confluence's Lucene-based search cannot find Chinese characters (both traditional and simplified) in a PDF file.
      The same characters are indexed correctly in a Word DOC file.

      It appears that the Confluence PDF Extractor fails to extract the Chinese characters (see the attached screenshots). Alphabetic (Latin) text can be searched without any problem.
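
One way to confirm the symptom outside Confluence is to compare the text an extractor returns with text the document is known to contain. The sketch below is a hypothetical diagnostic, not Confluence code: `count_cjk` and `extraction_lost_cjk` are illustrative names, and the `extracted` string is assumed to come from whichever PDF text extractor is under test. It only checks the most common Han, Hiragana, and Katakana blocks.

```python
# Hypothetical diagnostic: detect whether CJK characters were lost
# during text extraction. Covers only the most common Unicode blocks.
CJK_RANGES = (
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
)


def count_cjk(text: str) -> int:
    """Count characters that fall inside one of the CJK_RANGES blocks."""
    return sum(
        1
        for ch in text
        if any(lo <= ord(ch) <= hi for lo, hi in CJK_RANGES)
    )


def extraction_lost_cjk(expected: str, extracted: str) -> bool:
    """Return True if `expected` has more CJK characters than `extracted`."""
    return count_cjk(expected) > count_cjk(extracted)
```

If `extraction_lost_cjk` returns True for the PDF but False for the DOC version of the same text, the loss happens in PDF extraction rather than in indexing or searching.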

        1. characters_encoding_test.PNG
          46 kB
        2. chinesechars_pdf_fails.PNG
          46 kB
        3. search_chinese.PNG
          40 kB
        4. test.doc
          62 kB
        5. test.pdf
          33 kB

            Assignee:
            Katrina Walser (Inactive)
            Reporter:
            Roy Hartono [Atlassian]
            Votes:
            4
            Watchers:
            5

              Created:
              Updated:
              Resolved: