Uploaded image for project: 'Jira Data Center'
  1. Jira Data Center
  2. JRASERVER-66265

Null Pointer Exception thrown due to JIRA not handling correctly an inconsistency between the database and the lucene index.

    XMLWordPrintable

Details

    Description

      Summary

      Issuing a rest API call or JQL that have sub-tasks as part of their results will fail with a Null Pointer Exception if the parent of the sub-task had been deleted on a different node.

      Based on the stack trace written to the JIRA log files we understand that:

      1) A rest API call is issued on one node (lets suppose NODE_A) of the JIRA DC cluster.
      2) As part of processing the results for the rest API call or JQL, a sub-task is converted into an issue.
      3) To convert the sub-task, its respective parent needs to be found.
      4) JIRA tries to get the parent of the subtask from cache (on NODE_A as it is the node where the rest call or JQL were issued) and does not find it.
      5) JIRA then goes to get the parent of the subtask from the database and does not find it either. Instead, a NULL value is returned as the result from the SELECT statement.
      It is likely that the parent of the sub-task had been deleted on any other node apart from NODE_A within the JIRA DC cluster (and hence from the database) however the lucene index on that NODE_A was not yet updated.
      At this point is where we find the inconsistency as the value found in the Lucene index cannot be found in the database.
      6) The rest API call (or JQL) fails and an exception is written to the log file.

      It must be stressed that the Lucene index is eventually consistent, which means that there will be some delay in updating the index across all nodes.

      Environment

      • JIRA Data Center with more than one node.
      • JIRA single node (it is less likely to happen but still can happen)

      Steps to Reproduce

      There no specific steps to reproduce this problem. The only way to reproduce it would be in a controlled environment and setting breakpoints in the JIRA code.

      Expected Results

      JIRA should handle this in a different way, detecting that a NULL value was returned from the SELECT statement and deducting (as the database is our source) that due to eventual consistency the lucene index had not been updated yet. Then, handling the inconsistency rather than throwing a NULL pointer exception.

      Actual Results

      The rest API call fails and the below exception is thrown in the JIRA log files file.

      Please note that the below exception shows a rest API causing it as this was how it was reported however this can be caused by a JQL as well and hence the stack trace might be different. The key for this issue is the stack trace reported below "Caused by:".

       

      2017-08-03 01:57:46,291 ajp-nio-127.0.0.1-8009-exec-3549 ERROR srvCatalyst 117x12153229x1 1y091dw 10.147.129.135 /rest/api/2/search [c.a.j.rest.exception.ExceptionInterceptor] Returning internal server error in response
      java.lang.reflect.InvocationTargetException
      	at sun.reflect.GeneratedMethodAccessor1098.invoke(Unknown Source)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:498)
      	at com.atlassian.plugins.rest.common.interceptor.impl.DispatchProviderHelper$TypeOutInvoker$1.invoke(DispatchProviderHelper.java:169)
      	at com.atlassian.plugins.rest.common.interceptor.impl.DispatchProviderHelper$1.intercept(DispatchProviderHelper.java:83)
      	... 3 filtered
      	at com.atlassian.jira.rest.exception.ExceptionInterceptor.intercept(ExceptionInterceptor.java:55)
      	... 1 filtered
      	at com.atlassian.jira.rest.v2.issue.scope.RequestScopeInterceptor.intercept(RequestScopeInterceptor.java:39)
      	... 14 filtered
      	at com.atlassian.plugins.rest.module.RestDelegatingServletFilter$JerseyOsgiServletContainer.doFilter(RestDelegatingServletFilter.java:154)
      	... 1 filtered
      	at com.atlassian.plugins.rest.module.RestDelegatingServletFilter.doFilter(RestDelegatingServletFilter.java:68)
      	... 36 filtered
      	at com.atlassian.web.servlet.plugin.request.RedirectInterceptingFilter.doFilter(RedirectInterceptingFilter.java:21)
      	... 53 filtered
      	at com.atlassian.jira.security.JiraSecurityFilter.lambda$doFilter$0(JiraSecurityFilter.java:76)
      	at com.atlassian.jira.security.JiraSecurityFilter.dt_access$218(JiraSecurityFilter.java)
      	... 1 filtered
      	at com.atlassian.jira.security.JiraSecurityFilter.doFilter(JiraSecurityFilter.java:74)
      	... 16 filtered
      	at com.atlassian.plugins.rest.module.servlet.RestSeraphFilter.doFilter(RestSeraphFilter.java:37)
      	... 20 filtered
      	at com.atlassian.jira.servermetrics.CorrelationIdPopulatorFilter.doFilter(CorrelationIdPopulatorFilter.java:30)
      	... 10 filtered
      	at com.atlassian.web.servlet.plugin.request.RedirectInterceptingFilter.doFilter(RedirectInterceptingFilter.java:21)
      	... 4 filtered
      	at com.atlassian.web.servlet.plugin.LocationCleanerFilter.doFilter(LocationCleanerFilter.java:36)
      	... 29 filtered
      	at com.atlassian.jira.servermetrics.MetricsCollectorFilter.doFilter(MetricsCollectorFilter.java:25)
      	... 27 filtered
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      	at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
      	at java.lang.Thread.run(Thread.java:745)
      Caused by: com.google.common.cache.CacheLoader$InvalidCacheLoadException: CacheLoader returned null for key 2276745.
      	at com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2350)
      	at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2320)
      	at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2282)
      	at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2197)
      	at com.google.common.cache.LocalCache.get(LocalCache.java:3937)
      	at com.atlassian.jira.rest.v2.issue.IssueBeanBuilder2.createParentLink(IssueBeanBuilder2.java:317)
      	at com.atlassian.jira.rest.v2.issue.IssueBeanBuilder2.build(IssueBeanBuilder2.java:155)
      	at com.atlassian.jira.rest.v2.search.SearchResource$IssueToIssueBean.apply(SearchResource.java:291)
      	at com.atlassian.jira.rest.v2.search.SearchResource$IssueToIssueBean.apply(SearchResource.java:275)
      	at com.google.common.collect.Lists$TransformingRandomAccessList$1.transform(Lists.java:617)
      	at com.google.common.collect.TransformedIterator.next(TransformedIterator.java:48)
      	at java.util.AbstractCollection.toArray(AbstractCollection.java:141)
      	at java.util.ArrayList.<init>(ArrayList.java:177)
      	at com.google.common.collect.Lists.newArrayList(Lists.java:144)
      	at com.atlassian.jira.rest.v2.search.SearchResource.asResultsBean(SearchResource.java:217)
      	at com.atlassian.jira.rest.v2.search.SearchResource.search(SearchResource.java:179)
      	at com.atlassian.jira.rest.v2.search.SearchResource.searchUsingSearchRequest(SearchResource.java:203)
      	... 237 more
      

       

      Workaround

      The workaround is to re-issue the rest API call or JQL as eventually will not fail, once the lucene index is updated on the node running the rest API or JQL.

      • Note that the index may not always update. In this case, please run a project re-index.

       Observed a case where issue the REST API call and project reindex had NOT solved the issue, only a full reindex.

      • Project reindex does not always fix the issue, but a full reindex, recreating the indices 

      Note on Fix

      In Jira 8.12 we have change how Index replication works and added Jira Document Based Replication (DBR), this means documents are only created on the source node (where the action triggering the change happened) and the document is then distributed to all other nodes.  Check Document-based replication in Jira Data Center page for more details.

      As part of the DBR change, there are other changes, which we belive prevent or self-heal the problem mentioned in this ticket:

      • we have added versioning of all entities - now every change which triggers an issue, comment or worklog reindex bumps this entity version (bumping the version is only triggered by user actions, not full/background re-index) 
      • all index updates (except full-foreground re-index) are now doing a conditional update: we only update the index if the entity in the index is older/equal (in therms of versioning - not time) to the new entity: this gives us a guarantee that all update conflicts will be resolved on all nodes with the same result; when doing a full re-index we have a clean index and there are no concurrent index updates so the index updates are non-conditional;
      • on both server and DC there is a task running every 5sec which checks the consistency of the recent (last 15 sec) changes in the index and re-applies deletes if there was a late update which has overridden the delete (conditional updates do not solve this problem); this task checks for the consistency of updates as well to handle corner cases like thread was gone after updating the DB (for DC this is the index replication table, for server the entity version table) but before updating the local index

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              ecasuscelli Esteban Casuscelli
              Votes:
              10 Vote for this issue
              Watchers:
              25 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: