Type: Bug
Resolution: Fixed
Priority: High
Affects Version/s: 6.3.9, 6.03
I'm asking that this issue be considered a Critical Bug, because I believe it is the primary cause of many JIRA outages that we have experienced over the last year and probably much longer. We've been trying to figure this one out for a long time!
The quick summary of the problem: gadgets such as "Issue Statistics" and "Pie Chart" handle "Irrelevant" issues or "None" results (where the group-by value resolves to "Irrelevant" or "None") extremely poorly, which can easily result in "Out of Memory" errors and a JIRA application failure requiring a restart. By poorly, I mean that it is relatively easy for users to add "useful" gadgets to their dashboards that drive very high Java heap utilization and long-running server-side computation. With hundreds or thousands of users all customizing their dashboards and trying to get the most use out of JIRA, we see this happening on a regular basis.
Up until now, we've been asking users to please stop using JIRA Dashboards to report on large issue sets, and we have been providing similar reports via our R&D Data Warehouse. However, since the point of JIRA Dashboards is to make this information available to users from one tool in one place, asking them to get their data from a different tool a) doesn't work very well, because they will still use JIRA and cause the system to crash, and b) isn't the direction we want to pursue.
Now to the explanation of my research including my proposal for one way to address this problem:
I instrumented a few of the JIRA gadget classes to narrow down where the problem is. I wrote off all "first queries", so the results below are always a reproducible "second or later query". In this particular case, the "Issue Statistics" gadget is being used against 250,060 issues, grouped by the "Product Owner" field, with 241,606 "Irrelevant" issues. This is on a stand-alone development box that only I am using, so the timings represent an "idle system" scenario in a controlled environment where I choose when to refresh each gadget, one at a time. With a "heavily loaded" system, the times are much worse, and the memory usage is out of control.
Most intervals that I instrumented completed very fast and had minimal impact on memory. I was somewhat impressed that, even with all the complexity, JIRA mostly works extremely efficiently. However, the one interval I break down below is VERY SLOW and allocates a significant amount of heap on every invocation. This is easily reproducible for me, as it happens every single time with almost no variance:
2014-11-23 04:31:08,750 http-bio-8080-exec-17 ERROR 0501056 271x9x1 ku7mpt 10.128.110.207 /rest/gadget/1.0/stats/generate [jira.gadgets.system.StatisticAccessorBean] search:B = 488395 KBytes allocated
2014-11-23 04:31:10,923 http-bio-8080-exec-17 ERROR 0501056 271x9x1 ku7mpt 10.128.110.207 /rest/gadget/1.0/stats/generate [jira.gadgets.system.StatisticAccessorBean] search:C2 = 689221 KBytes allocated
There is nothing else running on this JIRA instance. I have also run the test dozens of times and it is always approximately the same.
Between "search:B" and "search:C2" it took 2,173 milliseconds and allocated at least 196 MBytes of Java heap. If you imagine dozens or hundreds of people refreshing their dashboards at the same time, this appetite for Java heap easily leads to "Out of Memory" during peak periods, when the queries line up and run concurrently. The long runtime makes that overlap far more likely: it is much easier for queries to become concurrent and compete for shared resources when each takes 2,173 milliseconds to complete than when each takes 200 milliseconds.
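To make the concurrency argument concrete, here is a back-of-envelope model I put together (it is not from any JIRA source): by Little's law, average in-flight queries = arrival rate * per-query duration, so slower queries overlap more and multiply transient heap demand. The 10 refreshes/second arrival rate is an assumed illustrative figure; only the 2,173 ms, 90 ms, and ~196 MB numbers come from the measurements above.

```java
// Illustrative sketch only: models how query duration drives concurrency
// and transient heap demand. The arrival rate is an assumption.
public class GadgetLoadModel {
    // Average number of queries in flight (Little's law) for a given
    // arrival rate (per second) and per-query duration (milliseconds).
    static double concurrentQueries(double arrivalsPerSecond, double durationMs) {
        return arrivalsPerSecond * (durationMs / 1000.0);
    }

    // Transient heap demand in MB if each in-flight query allocates allocMb.
    static double heapDemandMb(double arrivalsPerSecond, double durationMs, double allocMb) {
        return concurrentQueries(arrivalsPerSecond, durationMs) * allocMb;
    }

    public static void main(String[] args) {
        // 10 dashboard refreshes/second, 2,173 ms each, ~196 MB per query:
        System.out.printf("slow: %.1f concurrent, %.0f MB in flight%n",
                concurrentQueries(10, 2173), heapDemandMb(10, 2173, 196));
        // The same load at 90 ms per query (post-fix timing):
        System.out.printf("fast: %.1f concurrent, %.0f MB in flight%n",
                concurrentQueries(10, 90), heapDemandMb(10, 90, 196));
    }
}
```

Under these assumptions the slow path keeps roughly 22 queries in flight, each holding up to ~196 MB, which is more than enough to exhaust a typical heap on its own.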
The above interval happens in this code:
atlassian-jira-6.3.9-source/jira-project/jira-components/jira-api/src/main/java/com/atlassian/jira/web/bean/StatisticAccessorBean.java
These are the original lines plus my extra logging and try/finally: (I only added lines)
protected SearchStatisticsResult searchCountMap(final String groupField) throws SearchException
{
    LOGGER.error("search:A = " + ((Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory()) / 1024) + " KBytes allocated");
    try
    {
        OneDimensionalTermHitCollector hitCollector = new OneDimensionalTermHitCollector(groupField, fieldVisibilityManager, readerCache, fieldManager, projectManager);
        LOGGER.error("search:B = " + ((Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory()) / 1024) + " KBytes allocated");
        if (overrideSecurity)
        {
            searchProvider.searchOverrideSecurity(filter.getQuery(), remoteUser, hitCollector);
            LOGGER.error("search:C1 = " + ((Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory()) / 1024) + " KBytes allocated");
        }
        else
        {
            searchProvider.search(filter.getQuery(), remoteUser, hitCollector);
            LOGGER.error("search:C2 = " + ((Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory()) / 1024) + " KBytes allocated");
        }
        return new SearchStatisticsResult(hitCollector.getHitCount(), hitCollector.getIrrelevantCount(), hitCollector.getResult());
    }
    finally
    {
        LOGGER.error("search:D = " + ((Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory()) / 1024) + " KBytes allocated");
    }
}
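The used-heap-delta pattern in the instrumentation above can be factored into a small standalone helper. This is my own sketch (the names HeapMeter and measure are mine, not part of the JIRA API), and one caveat applies to both my helper and the instrumentation: without forcing a GC, the delta is only an estimate of allocation, and a collection occurring during the measured call can even make it negative.

```java
// Sketch of the measurement technique used in the instrumentation above.
// All names here are mine; nothing in this class is JIRA API.
public class HeapMeter {
    // Currently used heap in KB, matching the expression in the log lines.
    static long usedKb() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / 1024;
    }

    // Runs the task and returns the change in used heap, in KB.
    // Note: a GC during task.run() can understate (or negate) the result.
    static long measure(Runnable task) {
        long before = usedKb();
        task.run();
        return usedKb() - before;
    }

    public static void main(String[] args) {
        final byte[][] keep = new byte[1][]; // retain the allocation so it is visible
        long delta = measure(() -> keep[0] = new byte[16 * 1024 * 1024]);
        System.out.println("allocated approximately " + delta + " KB");
    }
}
```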
You can see that between "search:B" and "search:C2" is this:
searchProvider.search(filter.getQuery(), remoteUser, hitCollector);
This is the Lucene search. However, it is more than just Lucene, as it includes other aspects like the JIRA issue permission checking and the hit collector. I pursued both paths and came to the realization that the "Issue Statistics" gadget completes slowly when given an issue filter and grouping field that produce many "Irrelevant" issues, but completes much more quickly when the filter and grouping field produce few or no "Irrelevant" issues.
This led me to this class:
jira-project/jira-components/jira-api/src/main/java/com/atlassian/jira/issue/statistics/util/AbstractOneDimensionalHitCollector.java
And this method:
public void collect(int i)
{
    ++hitCount;
    Collection<String> terms = docToTerms[i];
    if (terms == null)
    {
        // We know there will always be a project and issue type for each issue
        final Long projectId = new Long(getSingleValue(docToProject[i]));
        final String issueTypeId = getSingleValue(docToIssueType[i]);
        // Determine the I AM NOT SHOWN STATE and the NONE state
        if (fieldVisibilityManager.isFieldHidden(projectId, fieldId, issueTypeId) || isFieldOutOfScope(projectId, fieldId, issueTypeId))
        {
            irrelevantCount++;
            collectIrrelevant(docBase + i);
        }
        else
        {
            collectWithTerms(docBase + i, terms);
        }
    }
    else
    {
        collectWithTerms(docBase + i, terms);
    }
}
In the case that the Lucene document "term" is null, it calls these:
if (fieldVisibilityManager.isFieldHidden(projectId, fieldId, issueTypeId) ||
isFieldOutOfScope(projectId, fieldId, issueTypeId))
One or probably both of them are slow. Slow times 250,000 is very slow. They appear to do a lot of work including allocating a lot of memory on the heap. At first it was a suspicion, but then I changed the code to cache results between calls to collect() like this:
private Map<Long, Map<String, Boolean>> projectIdIssueTypeIdFieldVisibility = new HashMap<Long, Map<String, Boolean>>();

...

public void collect(int i)
{
    ++hitCount;
    Collection<String> terms = docToTerms[i];
    if (terms == null)
    {
        // We know there will always be a project and issue type for each issue
        final Long projectId = new Long(getSingleValue(docToProject[i]));
        final String issueTypeId = getSingleValue(docToIssueType[i]);
        Map<String, Boolean> issueTypeIdFieldVisibility = projectIdIssueTypeIdFieldVisibility.get(projectId);
        if (issueTypeIdFieldVisibility == null)
        {
            issueTypeIdFieldVisibility = new HashMap<String, Boolean>();
            projectIdIssueTypeIdFieldVisibility.put(projectId, issueTypeIdFieldVisibility);
        }
        Boolean fieldVisibility = issueTypeIdFieldVisibility.get(issueTypeId);
        if (fieldVisibility != null)
        {
            if (fieldVisibility)
            {
                collectWithTerms(docBase + i, terms);
            }
            else
            {
                irrelevantCount++;
                collectIrrelevant(docBase + i);
            }
        }
        else
        {
            // Determine the I AM NOT SHOWN STATE and the NONE state
            if (fieldVisibilityManager.isFieldHidden(projectId, fieldId, issueTypeId) || isFieldOutOfScope(projectId, fieldId, issueTypeId))
            {
                irrelevantCount++;
                collectIrrelevant(docBase + i);
                fieldVisibility = false;
            }
            else
            {
                collectWithTerms(docBase + i, terms);
                fieldVisibility = true;
            }
            issueTypeIdFieldVisibility.put(issueTypeId, fieldVisibility);
        }
    }
    else
    {
        collectWithTerms(docBase + i, terms);
    }
}
My thinking is that the "field" is always the same within a single hit collector: only the project and issue type change, and the visibility check depends on nothing but those two values. Since there are relatively few (projectId, issueTypeId) pairs compared to the number of issues in the system, they make an excellent cache key, so a cache keyed on (projectId, issueTypeId) and local to one hit collector instance is safe. I confirmed that the resulting "Irrelevant" numbers were the same before and after deploying this change.
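For what it's worth, on Java 8+ the same per-collector memoization could be written more compactly with computeIfAbsent and a single composite key instead of nested maps. This is only a sketch: the VisibilityCache and FieldVisibility names are mine, and the FieldVisibility interface stands in for the real isFieldHidden/isFieldOutOfScope pair.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a more compact (projectId, issueTypeId) memoization using
// Java 8 computeIfAbsent. FieldVisibility is a hypothetical stand-in
// for fieldVisibilityManager.isFieldHidden(...) || isFieldOutOfScope(...).
public class VisibilityCache {
    interface FieldVisibility {
        boolean isIrrelevant(Long projectId, String issueTypeId);
    }

    private final Map<String, Boolean> cache = new HashMap<>();
    private final FieldVisibility delegate;
    int misses; // counts how often we actually consulted the delegate

    VisibilityCache(FieldVisibility delegate) {
        this.delegate = delegate;
    }

    boolean isIrrelevant(Long projectId, String issueTypeId) {
        // One flat map keyed on "projectId:issueTypeId" replaces the nested maps.
        return cache.computeIfAbsent(projectId + ":" + issueTypeId, key -> {
            misses++;
            return delegate.isIrrelevant(projectId, issueTypeId);
        });
    }

    public static void main(String[] args) {
        VisibilityCache cached = new VisibilityCache((p, t) -> p == 1L && t.equals("bug"));
        for (int i = 0; i < 250_000; i++) {
            cached.isIrrelevant(1L, "bug");  // expensive check runs once
            cached.isIrrelevant(2L, "task"); // and once here
        }
        System.out.println("delegate calls: " + cached.misses); // 2, not 500,000
    }
}
```

Whether the nested-map or composite-key form is preferable is a style choice; both reduce 250,000 expensive checks to one per distinct (projectId, issueTypeId) pair.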
And now I am getting this result for the exact same "Issue Statistics" gadget refresh:
2014-11-23 04:41:53,762 http-bio-8080-exec-17 ERROR 0501056 281x49x1 44osdj 10.128.110.207 /rest/gadget/1.0/stats/generate [jira.gadgets.system.StatisticAccessorBean] search:B = 511048 KBytes allocated
2014-11-23 04:41:53,852 http-bio-8080-exec-17 ERROR 0501056 281x49x1 44osdj 10.128.110.207 /rest/gadget/1.0/stats/generate [jira.gadgets.system.StatisticAccessorBean] search:C2 = 520425 KBytes allocated
From 2,173 milliseconds down to 90 milliseconds!
I realized that the code path also applies to "None" issues and went back to specifically confirm this. For a query against "Milestone" which is highly relevant (8,716 of 250,060 issues are "Irrelevant"), but not widely used (230,648 are "None"), this is the interval timing with the original code:
2014-11-23 15:33:50,264 http-bio-8080-exec-8 ERROR 0501056 933x115x1 1u4xb6y 10.128.110.207 /rest/gadget/1.0/stats/generate [jira.gadgets.system.StatisticAccessorBean] search:B = 667549 KBytes allocated
2014-11-23 15:33:52,255 http-bio-8080-exec-8 ERROR 0501056 933x115x1 1u4xb6y 10.128.110.207 /rest/gadget/1.0/stats/generate [jira.gadgets.system.StatisticAccessorBean] search:C2 = 467289 KBytes allocated
You can see it triggered at least one garbage collection and also took 1,991 milliseconds to complete. After the fix, the same query results in:
2014-11-23 15:40:09,440 http-bio-8080-exec-7 ERROR 0501056 940x4x1 1n24425 10.128.110.207 /rest/gadget/1.0/stats/generate [jira.gadgets.system.StatisticAccessorBean] search:B = 619192 KBytes allocated
2014-11-23 15:40:09,540 http-bio-8080-exec-7 ERROR 0501056 940x4x1 1n24425 10.128.110.207 /rest/gadget/1.0/stats/generate [jira.gadgets.system.StatisticAccessorBean] search:C2 = 628381 KBytes allocated
From 1,991 milliseconds down to 100 milliseconds. So the same fix applies to both "Irrelevant" and "None" issue results.
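The before/after consistency check mentioned above (same "Irrelevant" counts with and without the cache) can be sketched as a small harness: run the same document stream through a direct visibility check and a memoized one, and compare the counts. All names here are mine; the direct lambda stands in for the real isFieldHidden/isFieldOutOfScope calls.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// Hypothetical consistency harness: verifies that memoizing the visibility
// check by (projectId, issueTypeId) does not change the "Irrelevant" count.
public class CacheConsistencyCheck {
    interface Check { boolean isIrrelevant(long projectId, String issueTypeId); }

    static long countIrrelevant(long[] projects, String[] types, Check check) {
        long count = 0;
        for (int i = 0; i < projects.length; i++) {
            if (check.isIrrelevant(projects[i], types[i])) count++;
        }
        return count;
    }

    // Wraps a check in a (projectId, issueTypeId) memo, like the patch does.
    static Check memoized(Check inner) {
        Map<String, Boolean> memo = new HashMap<>();
        return (p, t) -> memo.computeIfAbsent(p + ":" + t, k -> inner.isIrrelevant(p, t));
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        int n = 250_000;
        long[] projects = new long[n];
        String[] types = new String[n];
        for (int i = 0; i < n; i++) {
            projects[i] = rnd.nextInt(20);       // 20 projects
            types[i] = "type-" + rnd.nextInt(5); // 5 issue types
        }
        Check direct = (p, t) -> p % 3 == 0;     // arbitrary visibility rule
        long before = countIrrelevant(projects, types, direct);
        long after = countIrrelevant(projects, types, memoized(direct));
        System.out.println(before == after ? "counts match" : "MISMATCH");
    }
}
```

This only demonstrates that the memoization is count-preserving when the underlying check is a pure function of (projectId, issueTypeId), which is the assumption the fix rests on.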
I think this fix, or one like it, is important. It is definitely important for us. But, because it requires tweaking jira-api, which is something we don't want to do, I would much rather get the change from you in an official patch release.
Thank you.
is related to: JRASERVER-37475 "Implementation of certain gadgets may cause performance degradation" (Gathering Interest)