  1. Confluence Data Center
  2. CONFSERVER-78782

Importing a site export can fail with Invalid byte 2 of 4-byte UTF-8 sequence

      We don't plan to backport the fix for this bug to earlier Long Term Support versions

      The fix for this bug isn't suitable for backporting to a bug fix release for any previous LTS versions. This is often because the fix is considered too high risk to implement in an older version.

      The fix for this issue will be included in future Long Term Support versions.

      Issue Summary

      Imports of a site containing UTF-8 characters can fail with "Import failed. Check your server logs for more information. com.atlassian.confluence.importexport.ImportExportException: Unable to complete import: Invalid byte 2 of 4-byte UTF-8 sequence" shown on the web UI. This appears to be due to a bug (XERCESJ-1668) in the Apache Xerces library, and the underlying logs show the attempt failing with a SAXParseException.

      Steps to Reproduce

      1. Create a space with multiple pages containing many special characters.
      2. Export the site in XML format.
      3. Import the XML file.

      This can be difficult to reproduce: a multi-byte UTF-8 character has to straddle the end of the reader buffer, so that only part of it is read and the remaining bytes land in the next buffer, which throws the calculation off by one.
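The boundary condition described above can be illustrated outside Confluence. The following Python sketch is not the Xerces code and does not reproduce its exact off-by-one; it only shows, as an assumption-laden analogy, how a 4-byte UTF-8 character split across two fixed-size reads breaks a decoder that treats each buffer independently, while a decoder that carries state across buffers succeeds. The buffer size, function names, and sample text are hypothetical.

```python
import codecs

def chunks(data: bytes, size: int):
    """Yield successive fixed-size slices, mimicking a reader buffer."""
    for start in range(0, len(data), size):
        yield data[start:start + size]

def decode_naively(data: bytes, size: int) -> str:
    """Decode each buffer on its own; fails if a character straddles two buffers."""
    return "".join(chunk.decode("utf-8") for chunk in chunks(data, size))

def decode_incrementally(data: bytes, size: int) -> str:
    """Carry any partial byte sequence over to the next buffer."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    parts = [decoder.decode(chunk) for chunk in chunks(data, size)]
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)

# Hypothetical payload: 7 ASCII bytes followed by a 4-byte emoji (U+1F600).
payload = ("abcdefg" + "\U0001F600" + "tail").encode("utf-8")
BUFFER_SIZE = 8  # the emoji's first byte is the last byte of the first buffer

try:
    decode_naively(payload, BUFFER_SIZE)
    naive_ok = True
except UnicodeDecodeError:
    naive_ok = False  # the split character cannot be decoded per buffer

incremental_result = decode_incrementally(payload, BUFFER_SIZE)
```

Changing `BUFFER_SIZE` so that the emoji falls entirely inside one buffer makes the naive version succeed too, which mirrors why the failure depends on where the character happens to sit in the export.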

      Alternatively you can import the following site export to see the issue:
      xmlexport-20220516-093414-6.zip

      Expected Results

      Import should complete without error.

      Actual Results

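      The import fails with the following error on screen:

      Import failed. Check your server logs for more information. com.atlassian.confluence.importexport.ImportExportException: Unable to complete import: Invalid byte 2 of 4-byte UTF-8 sequence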

      The below exception is thrown in the confluence.log file:

      2022-05-16 09:36:20,874 ERROR [Long running task: Importing data] [confluence.importexport.xmlimport.BackupImporter] importEntities Cannot import the entities:
       -- url: /longrunningtaskxml.action | referer: http://10.108.15.254:8090/admin/restore-local-file.action | traceId: fe357468ab26f515 | userName: admin | action: longrunningtaskxml
      com.atlassian.confluence.importexport.ImportExportException: Unable to complete import: Invalid byte 2 of 4-byte UTF-8 sequence.
              at com.atlassian.confluence.importexport.xmlimport.DefaultXmlImporter.doImportInternal(DefaultXmlImporter.java:64)
              at com.atlassian.confluence.importexport.xmlimport.DefaultXmlImporter.doImport(DefaultXmlImporter.java:42)
              at com.atlassian.confluence.importexport.xmlimport.BackupImporter.importEntities(BackupImporter.java:402)
              at com.atlassian.confluence.importexport.xmlimport.BackupImporter.importEverything(BackupImporter.java:371)
              at com.atlassian.confluence.importexport.xmlimport.FileBackupImporter.importEverything(FileBackupImporter.java:170)
              at com.atlassian.confluence.importexport.xmlimport.BackupImporter$1.doInTransactionWithoutResult(BackupImporter.java:262)
              at org.springframework.transaction.support.TransactionCallbackWithoutResult.doInTransaction(TransactionCallbackWithoutResult.java:36)
              at com.atlassian.confluence.importexport.xmlimport.RestorePluginStateStoreTransactionCallbackDecorator.doInTransaction(RestorePluginStateStoreTransactionCallbackDecorator.java:49)
              at com.atlassian.confluence.importexport.xmlimport.RestoreBandanaValuesTransactionCallbackDecorator.doInTransaction(RestoreBandanaValuesTransactionCallbackDecorator.java:56)
              at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:140)
              at com.atlassian.confluence.importexport.xmlimport.BackupImporter.doImportInternal(BackupImporter.java:224)
              at com.atlassian.confluence.importexport.Importer.doImport(Importer.java:73)
              at com.atlassian.confluence.importexport.DefaultImportExportManager.performImportInternal(DefaultImportExportManager.java:118)
              at com.atlassian.confluence.importexport.DefaultImportExportManager.doPerformImport(DefaultImportExportManager.java:106)
              at com.atlassian.confluence.importexport.DefaultImportExportManager.performImport(DefaultImportExportManager.java:101)
              at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
              at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
              at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
              at java.lang.reflect.Method.invoke(Method.java:498)
              at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:343)
              at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:198)
              at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)
              at org.springframework.transaction.interceptor.TransactionAspectSupport.invokeWithinTransaction(TransactionAspectSupport.java:295)
              at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:98)
              at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
              at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:212)
              at com.sun.proxy.$Proxy172.performImport(Unknown Source)
              at com.atlassian.confluence.importexport.actions.ImportLongRunningTask.runInternal(ImportLongRunningTask.java:78)
              at com.atlassian.confluence.util.longrunning.ConfluenceAbstractLongRunningTask.run(ConfluenceAbstractLongRunningTask.java:26)
              at com.atlassian.confluence.util.longrunning.ManagedTask.run(ManagedTask.java:39)
              at com.atlassian.confluence.impl.util.concurrent.ConfluenceExecutors$ThreadLocalContextTaskWrapper.lambda$wrap$1(ConfluenceExecutors.java:90)
              at com.atlassian.confluence.vcache.VCacheRequestContextOperations.lambda$doInRequestContext$0(VCacheRequestContextOperations.java:50)
              at com.atlassian.confluence.impl.vcache.VCacheRequestContextManager.doInRequestContextInternal(VCacheRequestContextManager.java:84)
              at com.atlassian.confluence.impl.vcache.VCacheRequestContextManager.doInRequestContext(VCacheRequestContextManager.java:68)
              at com.atlassian.confluence.vcache.VCacheRequestContextOperations.doInRequestContext(VCacheRequestContextOperations.java:49)
              at com.atlassian.confluence.vcache.VCacheRequestContextOperations.lambda$withRequestContext$2(VCacheRequestContextOperations.java:66)
              at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
              at java.util.concurrent.FutureTask.run(FutureTask.java:266)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              at java.lang.Thread.run(Thread.java:748)
      Caused by: org.xml.sax.SAXParseException; lineNumber: 115232; columnNumber: 1140; Invalid byte 2 of 4-byte UTF-8 sequence.
              at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
              at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
              at com.atlassian.security.xml.RestrictedXMLReader.parse(RestrictedXMLReader.java:103)
              at com.atlassian.confluence.importexport.xmlimport.DefaultXmlImporter.parseBackup(DefaultXmlImporter.java:86)
              at com.atlassian.confluence.importexport.xmlimport.DefaultXmlImporter.initProgressMeter(DefaultXmlImporter.java:75)
              at com.atlassian.confluence.importexport.xmlimport.DefaultXmlImporter.doImportInternal(DefaultXmlImporter.java:47)
              ... 40 more
      Caused by: org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2 of 4-byte UTF-8 sequence.
              at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
              at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
              at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
              at org.apache.xerces.impl.XMLEntityScanner.scanData(Unknown Source)
              at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanCDATASection(Unknown Source)
              at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
              at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
              at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
              at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
              at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
              ... 46 more 

      Workaround

      Use the atlassian-xml-cleaner-0.1.jar documented in the article 'Incorrect string value' error thrown when restoring XML backup in Confluence.

      This cleans the entities.xml file. However, the imported site will be missing emojis (they are replaced with question marks) or, worse, some emojis will be changed from their original form to a different emoji.

       


            Peter Heubeck added a comment -

            It seems this problem may also cause another error, or at least a similar one that also relates to character/string length when reading the backup.

            In my case, importing one day's 7.4.5 XML backup into another 7.4.5 instance crashed with the error "Index 2048 out of bounds for length 2048", but an XML backup from the next day instead crashed with the error "Invalid byte 2 of 4-byte UTF-8 sequence" mentioned in this ticket. Both errors rendered Confluence unusable, and I had to restore a database backup to get it running again. The index out of bounds error gave no indication of where the problem was in the XML file, but the invalid byte error pointed to a specific row, where I did find a multibyte emoji.

            When I tried importing the same backups into an 8.5.0 instance, it worked fine. So the conclusion is: if an XML import in a pre-8.3 version crashes with something that sounds related to character/string length, upgrade to at least 8.3.

            I'm adding this comment in case others search for the index out of bounds error, because it took me a long time to find out what that one was about.
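Because the SAXParseException reports a line and column, one way to see what sits near that position is to scan entities.xml for characters that need four bytes in UTF-8 (code points above U+FFFF). The following is a minimal Python sketch, assuming the file has been extracted from the export zip and is itself valid UTF-8; the function name and the sample text are hypothetical.

```python
import io
from typing import Iterator, TextIO, Tuple

def find_four_byte_chars(stream: TextIO) -> Iterator[Tuple[int, int, str]]:
    """Yield (line, column, character) for each code point above U+FFFF.

    Lines and columns are 1-based and counted in characters; a parser may
    count columns differently, so treat the column as approximate.
    """
    for line_no, line in enumerate(stream, start=1):
        for col_no, ch in enumerate(line, start=1):
            if ord(ch) > 0xFFFF:
                yield line_no, col_no, ch

# Hypothetical stand-in for the content of entities.xml.
sample = io.StringIO("<a>plain</a>\n<b>smile \U0001F600</b>\n")
hits = list(find_four_byte_chars(sample))
for line_no, col_no, ch in hits:
    print(f"line {line_no}, column {col_no}: U+{ord(ch):04X}")
```

For a real export, the `io.StringIO` stand-in would be replaced with `open("entities.xml", encoding="utf-8")`. Reading line by line keeps memory use low even for large backups.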

            James Ponting added a comment -

            Hey 50e16252c2f5,

            Your timing was almost perfect.

            It's fixed in Confluence 8.3.0 with our new export/import implementation.

            Hope this helps.

            Thanks,
            James Ponting
            Engineering Manager - Confluence Data Center

            James Whitehead added a comment -

            A fix for this issue is available in Confluence Server and Data Center 8.3.0.
            Upgrade now or check out the Release Notes to see what other issues are resolved.

            William W added a comment -

            Can anyone confirm yet whether this has been addressed or patched in any of the 8.x releases? I see the note above that the fix will not be backported, but I'm curious whether it still happens in the later DC versions.

            Kelly Pumphrey added a comment - edited

            I opened the entities.xml file in the Sublime editor and did a regex search:

            [^\x00-\x7F]

            This highlighted a few emoji characters. I removed them and re-added entities.xml to the zip file, and the import then worked.
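The manual edit described in this comment could be scripted along the following lines. This is a rough Python sketch, not an Atlassian tool: the function name is hypothetical, and it deliberately narrows the `[^\x00-\x7F]` search to code points above U+FFFF (the 4-byte characters named in the error) so that accented letters and other non-ASCII text survive. That narrowing is an assumption made here, not something stated in the comment.

```python
import io
import re
import zipfile

# Matches only code points above U+FFFF, i.e. characters that take 4 bytes in UTF-8.
FOUR_BYTE = re.compile("[\U00010000-\U0010FFFF]")

def strip_four_byte_chars(src_zip, dst_zip, member="entities.xml"):
    """Copy a zip archive, removing 4-byte characters from one member.

    Returns the number of characters removed. Each member is read fully
    into memory, so very large exports may need a streaming approach.
    """
    removed = 0
    with zipfile.ZipFile(src_zip) as src, \
         zipfile.ZipFile(dst_zip, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename == member:
                text, removed = FOUR_BYTE.subn("", data.decode("utf-8"))
                data = text.encode("utf-8")
            dst.writestr(info.filename, data)
    return removed

# Hypothetical in-memory archive standing in for a real site export.
source = io.BytesIO()
with zipfile.ZipFile(source, "w") as archive:
    archive.writestr("entities.xml", "<p>caf\u00e9 \U0001F600</p>")
    archive.writestr("other.txt", "unchanged")

cleaned = io.BytesIO()
removed_count = strip_four_byte_chars(source, cleaned)

with zipfile.ZipFile(cleaned) as archive:
    cleaned_xml = archive.read("entities.xml").decode("utf-8")
    other_text = archive.read("other.txt").decode("utf-8")
```

With real files, paths such as the original export and a new output file name can be passed in place of the `BytesIO` objects. As with the manual approach, the removed emojis are lost from the imported site, so the original export should be kept.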

            Alexander added a comment - edited

            Also getting this error in 7.13.12. ...

            It seems there is a workaround for the XERCES-J bug: [LUCENE-3937] Workaround the XERCES-J bug in Benchmark - ASF JIRA (apache.org)

            Maria Murphy added a comment -

            Also seeing this in 7.13.7.

            Alessandro Di Prima added a comment -

            This also affects 7.17.5.

              glipatov George Lipatov
              7829eff5df87 Dean Norman
              Affected customers: 26
              Watchers: 37
