Status: Closed
Affects Version/s: Archived Jira Cloud
Component/s: Dashboard & Gadgets
Support reference count: 137
Symptom Severity: Severity 3 - Minor
JIRA can get into a state where most gadgets in the "Add a gadget" dialog on dashboards never load, and JIRA displays the message "Some gadgets failed to load". Clicking the "Try again" link does not always resolve the problem.
JIRA Cloud customers should raise a support request at support.atlassian.com referencing this bug. As a temporary workaround until the bug itself is fixed, the support team will restart JIRA and increase the maximum amount of memory available to it.
The dashboard fetches gadgets in a two-step operation:
- When the user clicks "add gadget", the dashboard directory is loaded and populated with the response to /rest/config/1.0/directoryitems/local.json. If this call times out (60 seconds), the user can retry it. This is the current design and behaviour: when the timeout is reached, a "Broken Pipe" exception is logged, which is also expected.
- At the same time, the dashboard fetches external gadgets via a background request to /rest/config/1.0/directoryitems/external.json, and tells the end user about it through the "Not all gadgets have loaded" information box on the dashboard. This is not an error; it is normal behaviour.
If this request for external gadgets times out (60 seconds), the "Some gadgets failed to load" message is shown along with a prompt to try again. This is also expected behaviour. However, if this timeout occurs frequently on a specific instance, it may indicate a problem with that instance that should be investigated for that customer. For instance, the customer might have too many applinks, or connectivity to the applinked JIRA and Confluence instances might be too slow.
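The two-step flow described above can be modelled roughly as follows. This is a minimal Python sketch, not JIRA's dashboard code: `load_directory` and the two fetch callables are hypothetical stand-ins for the requests to local.json and external.json, both requests are assumed to start together, and the local-timeout message text is illustrative (the other two strings are the ones quoted in this issue).

```python
import concurrent.futures
import time
from typing import Callable, Dict, List

TIMEOUT_SECONDS = 60.0  # timeout described in this issue for both requests

Fetcher = Callable[[], List[str]]  # hypothetical: returns gadget spec names


def load_directory(fetch_local: Fetcher, fetch_external: Fetcher,
                   timeout: float = TIMEOUT_SECONDS) -> Dict[str, List[str]]:
    """Model of the two-step load: local specs, plus external specs in the background."""
    gadgets: List[str] = []
    messages: List[str] = []
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)
    start = time.monotonic()
    local = pool.submit(fetch_local)        # step 1: local.json
    external = pool.submit(fetch_external)  # step 2: external.json, in the background
    try:
        gadgets.extend(local.result(timeout=timeout))
    except concurrent.futures.TimeoutError:
        # Illustrative text: the issue only says a retry is available here.
        messages.append("Local directory timed out (retry offered)")
    if not external.done():
        # Still waiting on external specs: informational, not an error.
        messages.append("Not all gadgets have loaded")
    # Both requests started together, so external only gets the time that is left.
    remaining = max(0.0, timeout - (time.monotonic() - start))
    try:
        gadgets.extend(external.result(timeout=remaining))
    except concurrent.futures.TimeoutError:
        messages.append("Some gadgets failed to load")  # shown with a "Try again" prompt
    pool.shutdown(wait=False)  # do not block on a request that is still hanging
    return {"gadgets": gadgets, "messages": messages}
```

In this model a slow external request never blocks the local gadgets from appearing; it only adds the two informational messages.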
What can be done now?
- We can consider bumping the Nginx timeout for these requests.
- We can, and probably should, improve the messaging in the UI so that what is actually happening is clearer to end users and to support. I can even spot a spelling mistake in one of the messages.
What can be done in the longer term?
The current fetching/retry mechanism is quite naive: it tries to fetch everything in two one-shot requests and retrieves nothing at all from a request that fails. This is especially harmful for remote specs. The number of remote specs is proportional to the number of applinked instances, so this approach will eventually stop scaling and time out more and more frequently (made worse by the overhead of fetching the data from a remote host).
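A back-of-the-envelope calculation shows why a single one-shot request stops scaling against a fixed 60-second timeout. The sketch assumes, purely for illustration, that remote specs are fetched one after another at a constant cost per spec; the function names and the example figures are hypothetical, not measurements.

```python
TIMEOUT_SECONDS = 60.0  # request timeout described in this issue


def one_shot_duration(applinks: int, specs_per_applink: int,
                      seconds_per_spec: float) -> float:
    """Estimated duration of one request that fetches every remote spec serially.

    Assumes (hypothetically) a constant cost per spec, so the total grows
    linearly with the number of applinked instances.
    """
    return applinks * specs_per_applink * seconds_per_spec


def max_applinks_before_timeout(specs_per_applink: int, seconds_per_spec: float,
                                timeout: float = TIMEOUT_SECONDS) -> int:
    """Largest number of applinks whose one-shot request still fits in the timeout."""
    return int(timeout // (specs_per_applink * seconds_per_spec))
```

With a hypothetical 4 specs per applink at 1.5 seconds each, 10 applinks fill the 60-second window exactly and an 11th pushes the request to 66 seconds.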
We can redesign the fetching mechanism so that specs are retrieved via a paging mechanism (say, 5 specs at a time) to achieve scalability and reduce the incidence of timeouts.
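A paged fetch could look roughly like this. It is a sketch under stated assumptions: `fetch_page` is a hypothetical callable that loads one page of spec URLs and raises `TimeoutError` when that page's request times out, and the page size of 5 mirrors the example figure in the proposal.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

PAGE_SIZE = 5  # "say, 5 specs at a time": the example figure from this issue


def pages(spec_urls: Sequence[str], page_size: int = PAGE_SIZE) -> Iterator[Sequence[str]]:
    """Split the spec URLs into consecutive pages of at most page_size items."""
    for start in range(0, len(spec_urls), page_size):
        yield spec_urls[start:start + page_size]


def fetch_paged(spec_urls: Sequence[str],
                fetch_page: Callable[[Sequence[str]], List[str]],
                page_size: int = PAGE_SIZE) -> Tuple[List[str], List[str]]:
    """Fetch specs page by page; a timed-out page does not discard other pages.

    fetch_page is a hypothetical callable that returns the loaded specs for one
    page and raises TimeoutError if that page's request times out.
    """
    loaded: List[str] = []
    failed: List[str] = []
    for page in pages(spec_urls, page_size):
        try:
            loaded.extend(fetch_page(page))
        except TimeoutError:
            failed.extend(page)  # keep these for a later, targeted retry
    return loaded, failed
```

Because failures are tracked per page, a slow applinked instance only costs the specs on its own page, and the failed URLs can be retried on their own instead of refetching everything.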
We can profile the fetching algorithm for a given spec to understand what performance improvements can be made, so that we reduce its computation time where possible.
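Profiling starts with knowing where the time for one spec goes. JIRA runs on the JVM, so a JVM profiler would be the real tool; the Python sketch below only illustrates per-stage timing, and the fetch/parse/render split and the `profile_spec_fetch` helper are hypothetical, not JIRA's actual pipeline.

```python
import time
from contextlib import contextmanager
from typing import Any, Callable, Dict, Iterator, Tuple


@contextmanager
def stage(timings: Dict[str, float], name: str) -> Iterator[None]:
    """Add the wall-clock time spent inside the with-block to timings[name]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = timings.get(name, 0.0) + (time.perf_counter() - start)


def profile_spec_fetch(url: str,
                       fetch: Callable[[str], Any],
                       parse: Callable[[Any], Any],
                       render: Callable[[Any], Any]) -> Tuple[Any, Dict[str, float]]:
    """Run one (hypothetical) fetch -> parse -> render pass and time each stage."""
    timings: Dict[str, float] = {}
    with stage(timings, "fetch"):
        raw = fetch(url)
    with stage(timings, "parse"):
        spec = parse(raw)
    with stage(timings, "render"):
        result = render(spec)
    return result, timings
```

Summing these stage times over many specs would indicate whether the cost is dominated by fetching from remote hosts or by local processing.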
Furthermore, as part of other work being undertaken, we are replacing/removing the global cache that holds these gadget specs. I believe that re-architecting this fetching mechanism and the profiling described above are prerequisites for removing this cache, so all three pieces of work need to be done together.