
Confluence modifies tables with active objects on every restart

      Background

      When Confluence starts up, it verifies the schema of every Active Objects (AO) table and adds or modifies columns where required. For example, if a new version of a plugin is installed and it introduces a new field in an AO class, the corresponding table is updated and a new column is created.
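
      To make the intended behaviour concrete, here is a minimal sketch of the "add a column when the entity gains a field" step. It is not the Active Objects implementation: the function name, the dictionary representation of a schema, and the AO_EXAMPLE_TASK table are all hypothetical and exist only for illustration.

```python
def missing_column_statements(table, declared, existing):
    """Return ADD COLUMN statements for declared columns the table lacks.

    `declared` and `existing` map column name -> SQL type. Both are
    hypothetical stand-ins for "fields on the AO class" and "columns
    currently in the database".
    """
    statements = []
    for name, sql_type in declared.items():
        if name not in existing:
            statements.append(
                f"ALTER TABLE {table} ADD COLUMN {name} {sql_type}"
            )
    return statements


# Hypothetical example: a plugin upgrade adds a DUE_DATE field.
declared = {"ID": "BIGINT", "BODY": "LONGTEXT", "DUE_DATE": "DATETIME"}
existing = {"ID": "BIGINT", "BODY": "LONGTEXT"}

statements = missing_column_statements("AO_EXAMPLE_TASK", declared, existing)
unchanged = missing_column_statements("AO_EXAMPLE_TASK", existing, existing)
```

      When the declared and existing columns already match, the list is empty and no DDL needs to run.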

      The problem

      On MySQL, the migration procedure has a bug: it alters several tables on every boot, for example:

      [[SQLAction{statement='ALTER TABLE AO_BAF3AA_AOINLINE_TASK CHANGE COLUMN BODY BODY LONGTEXT DEFAULT NULL', undoAction=null}]]
      [[SQLAction{statement='ALTER TABLE AO_88BB94_BATCH_NOTIFICATION CHANGE COLUMN PAYLOAD PAYLOAD LONGTEXT NOT NULL', undoAction=null}]]
      [[SQLAction{statement='ALTER TABLE AO_21D670_WHITELIST_RULES CHANGE COLUMN EXPRESSION EXPRESSION LONGTEXT NOT NULL', undoAction=null}]]
      ...etc...

      With a single node this is not a serious issue (it only increases boot time), but with two or more nodes it can deadlock Confluence. Example:

      1. The first node is executing a slow query on the AO_BAF3AA_AOINLINE_TASK table.
      2. The second node is restarting and tries to modify AO_BAF3AA_AOINLINE_TASK. This operation is blocked until the first node finishes all of its SQL queries on AO_BAF3AA_AOINLINE_TASK.
      3. The first node tries to run more SQL queries on AO_BAF3AA_AOINLINE_TASK, but all of them are blocked by the pending DDL operation from the second node.

      As a result, all threads could be blocked.
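
      The three steps above can be replayed with a toy lock queue. This is a simplified model under stated assumptions, not MySQL's actual lock manager: queries take a shared lock, DDL takes an exclusive lock, and requests are served strictly in arrival order. The class and owner names are hypothetical.

```python
from collections import deque


class TableLockQueue:
    """Toy FIFO model of shared (query) and exclusive (DDL) table locks."""

    def __init__(self):
        self.holders = []       # (owner, mode) pairs currently holding the lock
        self.waiting = deque()  # (owner, mode) pairs queued in arrival order

    def request(self, owner, mode):
        """Try to take the lock; return True if granted, False if queued."""
        if not self.waiting:
            only_shared = all(m == "shared" for _, m in self.holders)
            if mode == "shared" and only_shared:
                self.holders.append((owner, mode))
                return True
            if mode == "exclusive" and not self.holders:
                self.holders.append((owner, mode))
                return True
        # Anything arriving behind a waiter queues too, so a pending DDL
        # blocks every later query on the same table.
        self.waiting.append((owner, mode))
        return False

    def release(self, owner):
        """Drop the owner's lock, then grant whatever can now proceed."""
        self.holders = [(o, m) for o, m in self.holders if o != owner]
        while self.waiting:
            next_owner, next_mode = self.waiting[0]
            if next_mode == "exclusive":
                if self.holders:
                    break
                self.holders.append(self.waiting.popleft())
                break
            if any(m == "exclusive" for _, m in self.holders):
                break
            self.holders.append(self.waiting.popleft())


lock = TableLockQueue()
slow_granted = lock.request("node1-slow-query", "shared")     # step 1
ddl_granted = lock.request("node2-alter-table", "exclusive")  # step 2
new_granted = lock.request("node1-new-query", "shared")       # step 3
blocked = [owner for owner, _ in lock.waiting]

# Nothing moves until the slow query finishes.
lock.release("node1-slow-query")
holders_after_slow_query = [owner for owner, _ in lock.holders]
```

      In this model the new query on node 1 is stuck behind a DDL statement that changes nothing, and it stays stuck for as long as the slow query runs.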

      Expected behaviour

      Confluence should not run any DDL operations on boot if tables do not require changes.
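
      One way such a check might look is sketched below: compare the declared column with the existing one and emit a statement only when they differ. This is an illustration, not the actual fix. The Column type, the function names, and the choice to compare case-insensitively are assumptions; the issue does not say why the migration considers these columns changed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Column:
    """Hypothetical description of one table column."""
    name: str
    sql_type: str
    nullable: bool


def normalized(column: Column) -> tuple:
    # Assumption: names and types should compare case-insensitively.
    return (column.name.upper(), column.sql_type.strip().upper(), column.nullable)


def plan_column_change(table: str, declared: Column, existing: Column) -> Optional[str]:
    """Return an ALTER TABLE statement, or None when nothing has changed."""
    if normalized(declared) == normalized(existing):
        return None  # no DDL, so no table lock is requested on boot
    null_clause = "DEFAULT NULL" if declared.nullable else "NOT NULL"
    name = declared.name.upper()
    sql_type = declared.sql_type.strip().upper()
    return f"ALTER TABLE {table} CHANGE COLUMN {name} {name} {sql_type} {null_clause}"


declared = Column("BODY", "LONGTEXT", nullable=True)

# Same column as reported by the database: nothing to do.
no_op = plan_column_change(
    "AO_BAF3AA_AOINLINE_TASK", declared, Column("body", "longtext", nullable=True)
)

# Genuinely different type: a statement is produced.
change = plan_column_change(
    "AO_BAF3AA_AOINLINE_TASK", declared, Column("BODY", "VARCHAR(255)", nullable=True)
)
```

      Returning None for an unchanged column means the restart never requests a lock on that table.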

      Technical details

      This happens in SchemaGenerator.migrate.

            [CONFSERVER-57375] Confluence modifies tables with active objects on every restart

            Quan Pham added a comment -

            A fix for this issue is available to Server and Data Center customers in Confluence 6.15.6
            Upgrade now or check out the Release Notes to see what other issues are resolved.

            Michelle added a comment -

            We've had to withdraw 6.15.5, and will release 6.15.6 soon. We'll update this issue once the new version is available. You can find out more about the related issue here: https://jira.atlassian.com/browse/CONFSERVER-58490

            Quan Pham added a comment -

            A fix for this issue is available to Server and Data Center customers in Confluence 6.15.5
            Upgrade now or check out the Release Notes to see what other issues are resolved.

            Florian Maupas added a comment -

            As of today, we faced another outage due to this bug. A simple restart of one node of our cluster triggered the bug and the entire outage.

            Florian Maupas added a comment - edited

            Hi,

            Today we had an outage, and this bug cost us 30 minutes of investigation because our second node locked the DB. While I understand this may be limited to specific circumstances, the impact can be huge for an organization. Could this be considered for an LTS bug fix? This is typically the type of bug that enterprises pay to avoid (I mean, come on, it leads to all threads being blocked!).

              nhdang Nhan Dang
              glipatov George Lipatov