Details
-
Bug
-
Resolution: Fixed
-
Low
-
None
-
7.3.0, 7.3.1, 7.3.2, 7.3.3
-
Severity 1 - Critical
Description
Issue Summary
This issue appears to be isolated to PostgreSQL databases due to how they determine which merge strategies to use on table joins.
A4J 7.3 changed the data type for AUDIT_ITEM_ID and ID in the tables AO_589059_AUDIT_ITEM_CGE_ITEM and AO_589059_AUDIT_ITEM_COMP_CGE from INT to BIGINT.
This is causing some queries to use a hash join strategy instead of a merge join.
If the hash tables of the tables are too large to be loaded into memory, eg they are larger than 4MB but work_mem is set to the default of 4MB, Postgres begins writing them out to temporary disk files.
This causes DB Write IOPS to spike, and causes other DB queries to back up and return more slowly.
This will cause Automation Rules to be actioned more slowly.
With heavy load, it can also cause DB Connection Pool exhaustion, which in turn will cause HTTP Thread exhaustion, which will result in an outage.
This is a good read on join strategies for more information: https://www.cybertec-postgresql.com/en/join-strategies-and-performance-in-postgresql/
Steps to Reproduce
- Have a large number of entries in the tables listed here
- Upgrade A4J to 7.3.x
- Run this query and observe which merge strategy is in use:
explain analyse select "AO_589059_AUDIT_ITEM_COMP_CGE"."AUDIT_ITEM_ID", "AO_589059_AUDIT_ITEM_COMP_CGE"."COMPONENT_NAME_KEY", "AO_589059_AUDIT_ITEM_COMP_CGE"."COMPONENT", "AO_589059_AUDIT_ITEM_COMP_CGE"."COMPONENT_ID", "AO_589059_AUDIT_ITEM_COMP_CGE"."DURATION", "AO_589059_AUDIT_ITEM_COMP_CGE"."ID", "AO_589059_AUDIT_ITEM_COMP_CGE"."OPTIMISED_IDS", "AO_589059_AUDIT_ITEM_COMP_CGE"."START_TIME", "AO_589059_AUDIT_ITEM_CGE_ITEM"."AUDIT_ITEM_COMPONENT_CHANGE_ID", "AO_589059_AUDIT_ITEM_CGE_ITEM"."AUDIT_ITEM_ID", "AO_589059_AUDIT_ITEM_CGE_ITEM"."CHANGE_FROM", "AO_589059_AUDIT_ITEM_CGE_ITEM"."CHANGE_TO", "AO_589059_AUDIT_ITEM_CGE_ITEM"."FIELD_NAME", "AO_589059_AUDIT_ITEM_CGE_ITEM"."ID", "AO_589059_AUDIT_ITEM_CGE_ITEM"."MESSAGE" from "public"."AO_589059_AUDIT_ITEM_COMP_CGE" "AO_589059_AUDIT_ITEM_COMP_CGE" left join "public"."AO_589059_AUDIT_ITEM_CGE_ITEM" "AO_589059_AUDIT_ITEM_CGE_ITEM" on "AO_589059_AUDIT_ITEM_COMP_CGE"."ID" = "AO_589059_AUDIT_ITEM_CGE_ITEM"."AUDIT_ITEM_COMPONENT_CHANGE_ID" where "AO_589059_AUDIT_ITEM_COMP_CGE"."AUDIT_ITEM_ID" = 10;
Expected Results
- A4J is as performant as prior to the upgrade
- Running the query above shows a merge join strategy is in use
Actual Results
- A4J is extremely non performant and causes Jira instability
- DB Write IOPS is very high
- Running the query above shows a hash join strategy is in use
Workaround
NB: Currently, downgrading to A4J 7.2.x is NOT guaranteed to resolve the performance issue
RECOMMENDED FIX
- Run ANALYZE VERBOSE on the tables to collect statistics about the tables, so the query planner can use these statistics to determine the most efficient execution plans for the queries.
- Run the query above and confirm that the join strategy is hash join
- Run ANALYZE VERBOSE "AO_589059_AUDIT_ITEM_COMP_CGE";
- Run ANALYZE VERBOSE "AO_589059_AUDIT_ITEM_CGE_ITEM";
- Run the query above and confirm that the join strategy is now: merge left join
- Confirm DB Write IOPS has dropped and performance is restored
Other workarounds
- Increase work_mem until you see Write IOPS drop and performance improve: SET work_mem = '16MB';
- This may need to be increased further and monitored
- Be careful not to increase this beyond the capacity of the memory on your database server. Bear in mind that multiple queries will run concurrently, so there may be multiple queries using memory at once, and each query can have multiple operations which consume 16mb of memory each, so the memory in use can grow exponentially.
- Alternatively, follow the instructions in the 7.3 release notes to truncate the data from your AUDITLOG tables
- Ideally this would be done *prior* to the upgrade to 7.3 to avoid any performance problems, but this can be done post upgrade to immediately resolve performance issues. This can be done while the system is running, and you should see immediate recovery.
- This will result in the loss of all the audit log data
- Over time the audit log will fill back up and you may see a resurgence of this problem
- Consider expiring audit log items more frequently, so the tables are smaller
Attachments
Issue Links
- causes
-
JIRAAUTOSERVER-266 After upgrading A4J to 7.3.x the majority of my rules stopped working
- Closed
-
HOT-95319 Loading...
-
A4J-2712 Loading...
- mentioned in
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
- PIR - Medium Priority Action
-
PIR-9061 Loading...
- relates to
-
A4J-2714 Loading...