Confluence Data Center / CONFSERVER-22342

Synchronising LDAP/Crowd can completely fail because transactions are not properly rolled back in Hibernate2BatchProcessor

      Hibernate2BatchProcessor#commitTransaction() clears the transaction from the ThreadLocal before trying to flush the Hibernate session. If the flush fails, rollbackTransaction() cannot find the transaction, so it does nothing and the session is never cleared. After this the whole batch operation fails, because the offending operation stays in the uncleared session and fails again on every subsequent flush.
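The ordering problem described above can be illustrated with a minimal, self-contained sketch. It does not use Hibernate and is not the actual Hibernate2BatchProcessor code: FakeSession, Processor and the forgetBeforeFlush flag are hypothetical names invented for this illustration, and the "fixed" branch is only one plausible ordering, not necessarily what the attached patch does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CommitOrderSketch {
    /** Hypothetical stand-in for a Hibernate session: queues objects and may reject a flush. */
    static class FakeSession {
        final List<String> pending = new ArrayList<>();

        void save(String entity) { pending.add(entity); }

        void flush() {
            // Simulates a constraint violation, e.g. a user that already exists.
            if (pending.contains("duplicate")) {
                throw new IllegalStateException("constraint violation on flush");
            }
            pending.clear();
        }

        void clear() { pending.clear(); }
    }

    /** Hypothetical processor; forgetBeforeFlush=true mimics the reported ordering. */
    static class Processor {
        final FakeSession session = new FakeSession();
        final ThreadLocal<Object> currentTx = new ThreadLocal<>();
        final boolean forgetBeforeFlush;

        Processor(boolean forgetBeforeFlush) { this.forgetBeforeFlush = forgetBeforeFlush; }

        void begin() { currentTx.set(new Object()); }

        void commit() {
            if (forgetBeforeFlush) {
                currentTx.remove(); // transaction forgotten first
                session.flush();    // a failure here leaves nothing to roll back
            } else {
                session.flush();    // a failure here leaves the transaction registered
                currentTx.remove();
            }
        }

        void rollback() {
            if (currentTx.get() == null) {
                return; // no-op: transaction not found, session is NOT cleared
            }
            session.clear();
            currentTx.remove();
        }

        /** Returns true if the batch was flushed successfully. */
        boolean processBatch(List<String> items) {
            begin();
            try {
                for (String item : items) {
                    session.save(item);
                }
                commit();
                return true;
            } catch (IllegalStateException e) {
                rollback();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        for (boolean buggy : new boolean[] {true, false}) {
            Processor p = new Processor(buggy);
            boolean first = p.processBatch(Arrays.asList("alice", "duplicate"));
            boolean second = p.processBatch(Arrays.asList("bob"));
            System.out.println("forgetBeforeFlush=" + buggy + " first=" + first + " second=" + second);
        }
    }
}
```

In this sketch, the first batch fails in both variants. The difference is the second, unrelated batch: with the reported ordering the stale entry is still pending and the flush fails again, whereas with the transaction kept until after the flush the rollback clears the session and the next batch succeeds.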

      The directory synchronisation algorithm is not atomic, so it can sometimes try to add users that already exist in the database, which causes the session flush to fail. AbstractBatchProcessor has logic to handle these failures gracefully, but it does not work in Confluence because the session is not cleared properly.

      In a large Confluence instance the following behaviour could trigger this issue:

      1. A new directory is added with a large number of users
      2. A sync is started; the synchronisation algorithm finds that all users in the new directory need to be added and starts adding them
      3. A user from the large user set logs in before the user sync is completed (this triggers user creation in the local instance)
      4. The sync operation tries to add the user who was created in the previous step, causing the flush to fail

      At this stage no new users can be added as all flushes will fail. Membership synchronisation will proceed very slowly as some users have not been added, so batch operations will fall back to individual processing.
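The fallback from batch to individual processing mentioned above can be sketched as follows. This is an illustration only, not the AbstractBatchProcessor implementation: UserStore and addWithFallback are hypothetical names, and the in-memory set merely stands in for a database table with a uniqueness constraint.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class BatchFallbackSketch {
    /** Hypothetical store with a uniqueness constraint on user names. */
    static class UserStore {
        final Set<String> users = new HashSet<>();

        /** All-or-nothing insert: throws without changing anything if any name is a duplicate. */
        void addAll(List<String> names) {
            Set<String> seen = new HashSet<>();
            for (String name : names) {
                if (users.contains(name) || !seen.add(name)) {
                    throw new IllegalStateException("duplicate user: " + name);
                }
            }
            users.addAll(names);
        }

        /** Single insert: throws if the name already exists. */
        void add(String name) {
            if (!users.add(name)) {
                throw new IllegalStateException("duplicate user: " + name);
            }
        }
    }

    /** Tries the whole batch first; on failure, retries one by one and returns the names that failed. */
    static List<String> addWithFallback(UserStore store, List<String> batch) {
        List<String> failed = new ArrayList<>();
        try {
            store.addAll(batch);
        } catch (IllegalStateException batchFailure) {
            // Slow path: one operation per item, so a single bad record cannot block the rest.
            for (String name : batch) {
                try {
                    store.add(name);
                } catch (IllegalStateException e) {
                    failed.add(name);
                }
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        UserStore store = new UserStore();
        store.add("carol"); // simulates a user created by a login during the sync
        List<String> failed = addWithFallback(store, Arrays.asList("alice", "bob", "carol"));
        System.out.println("failed=" + failed + " stored=" + store.users.size());
    }
}
```

The slow path performs one operation per record, which is why a sync that keeps falling back to it makes very slow progress on a large directory.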

      Patch

      Attached is an updated version of the atlassian-embedded-crowd-hibernate2 jar that patches this issue. With the patch applied, transactions roll back correctly, allowing the synchronisation to complete. All records in a rolled-back transaction will be ignored until the next synchronisation attempt (or until the affected users log in).

      It is known to work in Confluence 3.5.4, and might work in earlier versions, but these have not been tested. It is not needed in Confluence 3.5.6, as that version already contains this fix.

      This patch also addresses CONF-22631, so that any records that fail to synchronise are logged correctly. Users with Confluence 3.5.5 should install this patch to avoid that issue.

      Installation

      To install the patch:

      1. Stop Confluence
      2. Move the old atlassian-embedded-crowd-hibernate2 jar out of <confluence install dir>/confluence/WEB-INF/lib
      3. Copy the new jar into the same directory
      4. Start Confluence

        1. CONF-22342__Testcase.patch
          4 kB
          Olli Nevalainen
        2. atlassian-embedded-crowd-hibernate2-1.2.9-m3.jar
          66 kB
          Richard Atkins


            Michael S added a comment -

            It is known to work in Confluence 3.5.4, and might work in earlier versions, but these have not been tested.

            A customer has reported that 3.5.4 was working for a while and then stopped. Sounds like the effects may not be seen straight away.


            Richard Atkins added a comment -

            The previous patch (atlassian-embedded-crowd-hibernate2-1.2.9-m1.jar) had an issue that would prevent XML backups from restoring group memberships correctly. I've updated the patch to resolve this issue.

            Richard Atkins added a comment -

            We've opted to release 3.5.5 with this fix as is for now, but I'll also update the patch attached to this issue with the fix for CONF-22631.

            Richard Atkins added a comment -

            Thanks to Colin Goudie, we've found an issue with the patch that will cause a null pointer exception if a duplicate membership is detected while synchronising memberships. This exception will cause the synchronisation attempt to abort, preventing all memberships after the affected batch from being synchronised until the next synchronisation attempt. I'll attach an updated patch to fix this additional issue shortly.

            Richard Atkins added a comment - - edited

            (Comment deleted, obsolete)


            Matt Ryall added a comment -

            An IM comment from Olli:

            I wrote a test, and I think I understand it now. In Hibernate2BatchProcessor, the transaction is forgotten on line 107, so when the flush on line 108 fails, rollbackTransaction() on line 121 becomes a no-op and so fails to clear the session

            Sounds like it's pretty straightforward to fix.


            Olli Nevalainen added a comment -

            Attached a test case that triggers this problem.

              akdominguez Katrina Walser (Inactive)
              onevalainen Olli Nevalainen

                  Original Estimate: 4h
                  Remaining Estimate: 4h
                  Time Spent: Not Specified