  1. Confluence Data Center
  2. CONFSERVER-10911

PDF Extractor unable to index Chinese and Japanese characters

Details

    Description

      Confluence's Lucene-based search cannot find Chinese characters (both traditional and simplified) in PDF files.
      The same characters are indexed correctly in Word DOC files.

      It appears that the Confluence PDF Extractor fails to extract the Chinese characters (see the attached screenshots). Latin alphabetic text in the same PDF can be searched without any problem.
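
One way to confirm this symptom outside Confluence is to compare the text that was typed into the document with whatever an extractor returns, and check whether the CJK characters survived. The following is only a minimal sketch under that assumption: it is not Confluence's extractor, and the function names `count_cjk` and `extraction_lost_cjk` are hypothetical helpers introduced for illustration.

```python
import unicodedata

# Unicode character-name prefixes treated as CJK for this check (an assumption;
# extend the tuple if other scripts matter for your test documents).
CJK_NAME_PREFIXES = (
    "CJK UNIFIED IDEOGRAPH",
    "CJK COMPATIBILITY IDEOGRAPH",
    "HIRAGANA",
    "KATAKANA",
)


def count_cjk(text: str) -> int:
    """Count characters whose Unicode name starts with one of the CJK prefixes."""
    return sum(
        1 for ch in text
        if unicodedata.name(ch, "").startswith(CJK_NAME_PREFIXES)
    )


def extraction_lost_cjk(expected: str, extracted: str) -> bool:
    """Return True if the source text had CJK characters but the extracted text has none."""
    return count_cjk(expected) > 0 and count_cjk(extracted) == 0
```

If `extraction_lost_cjk` returns True for the PDF but False for the equivalent DOC, the characters are being dropped at extraction time rather than at search time, which matches the behaviour described above.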

      Attachments

        1. characters_encoding_test.PNG
          46 kB
          Roy Hartono [Atlassian]
        2. chinesechars_pdf_fails.PNG
          46 kB
          Roy Hartono [Atlassian]
        3. search_chinese.PNG
          40 kB
          Roy Hartono [Atlassian]
        4. test.doc
          62 kB
          Roy Hartono [Atlassian]
        5. test.pdf
          33 kB
          Roy Hartono [Atlassian]

            People

              akdominguez Katrina Walser (Inactive)
              rhartono Roy Hartono [Atlassian]
              Votes: 4
              Watchers: 5