[JRASERVER-67613] Race Condition with Jira Startup Consistency Checker Can Result in File Lock Errors

Type: Bug
Resolution: Obsolete
Priority: Low
Fix Version/s: 7.2.14, 7.10.0, 7.7.4, 7.8.4, 7.11.0, 7.6.7
Affects Version/s: 6.4.14, 7.2.13, 7.6.6
Component/s: Data Center - Other
Labels:
- pse-request

Fixed in Long Term Support Release/s:
7.6
Introduced in Version:
6.04
Symptom Severity:
Severity 2 - Major
Bug Fix Policy:
View Atlassian Server bug fix policy

Problem

During startup, Jira performs a consistency check on various items, including whether the index directories already hold a lock. The check does not explicitly wait for other startup services, so it can run after existing services or jobs have already started and taken those locks. When this happens, the consistency check fails and Jira startup fails with it.

This has been observed in Data Center environments where the NodeReindexServiceThread starts indexing before the consistency checker runs. The checker then finds the index write.lock files held by the reindex thread and reports index lock errors.
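To illustrate the ordering problem, here is a minimal Java sketch. It is not Jira's code: StartupOrderingSketch, findExistingLocks and runStartup are hypothetical names, a plain write.lock file stands in for the index lock, and the CountDownLatch is only one generic way to make the reindex thread wait until the check has run. Without the latch, the reindex thread could create its write.lock files before findExistingLocks runs, which is the race described above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.stream.Collectors;

public class StartupOrderingSketch {

    // Stand-in for the consistency check: report write.lock files that already exist.
    static List<Path> findExistingLocks(List<Path> indexDirs) {
        return indexDirs.stream()
                .map(dir -> dir.resolve("write.lock"))
                .filter(Files::exists)
                .collect(Collectors.toList());
    }

    static List<Path> runStartup(List<Path> indexDirs) throws InterruptedException {
        CountDownLatch checkDone = new CountDownLatch(1);

        // Stand-in for NodeReindexServiceThread (hypothetical): it creates the
        // index locks, but only after the check has signalled completion.
        Thread reindex = new Thread(() -> {
            try {
                checkDone.await();
                for (Path dir : indexDirs) {
                    Files.createFile(dir.resolve("write.lock"));
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }, "reindex-sketch");
        reindex.start();

        List<Path> locks = findExistingLocks(indexDirs); // the check runs first...
        checkDone.countDown();                            // ...and only then releases the indexer
        reindex.join();
        return locks;
    }

    public static void main(String[] args) throws Exception {
        Path home = Files.createTempDirectory("indexes");
        List<Path> dirs = List.of(home.resolve("issues"), home.resolve("comments"));
        for (Path dir : dirs) {
            Files.createDirectories(dir);
        }
        System.out.println("Locks seen by the check: " + runStartup(dirs));
    }
}
```

Running main prints an empty list of locks, because the indexer cannot create its lock files until countDown() is called. Removing the checkDone.await() call reintroduces the race, and the check may then report the lock files.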

Steps to Reproduce

One situation known to delay the consistency checker is JMX monitoring (see JRASERVER-67614 for details), so we'll use it as an example.

Start up a Jira Data Center node with JMX monitoring enabled.

Notice the long delay during startup, in this case ~28s between these two log entries:

    2018-07-13 17:55:09,844 localhost-startStop-1 INFO [c.a.j.instrumentation.external.DatabaseExternalGauges] Installing DBCP monitoring instruments: DatabaseExternalGauges.JiraDbcpInstruments[instruments=[DBCP_MAX, DBCP_ACTIVE, DBCP_IDLE],objectName=com.atlassian.jira:name=BasicDataSource]
    2018-07-13 17:55:38,300 localhost-startStop-1 WARN      [c.a.jira.health.HealthChecks] Your database is using an unsupported collation
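A delay like this is easiest to spot as a gap between consecutive timestamps logged by the startup thread. The Java sketch below is a hypothetical helper (StartupGapFinder is not an Atlassian tool). It assumes log lines begin with a yyyy-MM-dd HH:mm:ss,SSS timestamp as in the excerpt above, keeps only lines from localhost-startStop-1, and uses an arbitrary 10-second default threshold.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

public class StartupGapFinder {

    // Assumed: log lines start with a timestamp such as "2018-07-13 17:55:09,844" (23 chars).
    private static final int TS_LENGTH = 23;
    private static final DateTimeFormatter TS =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    // Reports each gap between consecutive timestamped lines that exceeds the threshold.
    static List<String> findGaps(List<String> lines, Duration threshold) {
        List<String> gaps = new ArrayList<>();
        LocalDateTime previous = null;
        for (String line : lines) {
            if (line.length() < TS_LENGTH) {
                continue; // too short to carry a timestamp
            }
            LocalDateTime current;
            try {
                current = LocalDateTime.parse(line.substring(0, TS_LENGTH), TS);
            } catch (DateTimeParseException e) {
                continue; // e.g. a stack-trace continuation line
            }
            if (previous != null) {
                Duration gap = Duration.between(previous, current);
                if (gap.compareTo(threshold) > 0) {
                    gaps.add(gap.toMillis() + " ms before: " + line);
                }
            }
            previous = current;
        }
        return gaps;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: StartupGapFinder <atlassian-jira.log> [thresholdSeconds]");
            return;
        }
        Duration threshold = Duration.ofSeconds(args.length > 1 ? Long.parseLong(args[1]) : 10);
        // ISO-8859-1 never fails to decode, so odd bytes in the log cannot abort the scan.
        List<String> startupLines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.ISO_8859_1)
                .stream()
                .filter(l -> l.contains("localhost-startStop-1"))
                .collect(Collectors.toList());
        findGaps(startupLines, threshold).forEach(System.out::println);
    }
}
```

For the two log entries shown above, the reported gap is 28456 ms.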

After a longer wait, we see:

    ********************************************************************************************************************************************************************************************************
    Index lock file(s) found. This occurs either because JIRA was not cleanly shutdown
    or because there is another instance of this JIRA installation currently running.
    Please ensure that no other instance of this JIRA installation is running
    and then remove the following lock file(s) and restart JIRA:
    
    UtilConcurrentLock{lock=true, fullLockName=/opt/atlassian/home/caches/indexes/comments/write.lock} UtilConcurrentLock{lock=true, fullLockName=/opt/atlassian/home/caches/indexes/issues/write.lock} UtilConcurrentLock{lock=true, fullLockName=/opt/atlassian/home/caches/indexes/changes/write.lock} UtilConcurrentLock{lock=true, fullLockName=/opt/atlassian/home/caches/indexes/worklogs/write.lock}
    
    Once restarted you will need to reindex your data to ensure that indexes are up to date.
    
    Do NOT delete the lock file(s) if there is another JIRA running with the same index directory
    instead cleanly shutdown the other instance.
    ********************************************************************************************************************************************************************************************************
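Before removing anything, it helps to see exactly which lock files exist. This Java sketch (ListIndexLocks is a hypothetical helper, not part of Jira) walks an index root, defaulting to the /opt/atlassian/home/caches/indexes path from the message above, and prints every write.lock file it finds. It only lists the files; it cannot tell whether another running instance still holds them, so the warning in the message still applies.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ListIndexLocks {

    // Recursively collects every regular file named write.lock under the index root.
    static List<Path> findLockFiles(Path indexRoot) throws IOException {
        try (Stream<Path> paths = Files.walk(indexRoot)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().equals("write.lock"))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : "/opt/atlassian/home/caches/indexes");
        List<Path> locks = findLockFiles(root);
        if (locks.isEmpty()) {
            System.out.println("No write.lock files under " + root);
        } else {
            locks.forEach(System.out::println);
        }
    }
}
```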

Diagnosing

If you are receiving the index write.lock errors, diagnose them as follows:

Enable DEBUG logging for the following classes via log4j.properties and restart Jira:

com.atlassian.jira.index.LuceneCorruptionChecker
com.atlassian.jira.upgrade.ConsistencyCheckerImpl
com.atlassian.jira.startup.JiraStartupLogger
com.atlassian.jira.util.LuceneDirectoryUtilsImpl.UtilConcurrentLock

This shows whether index operations are occurring before the consistency checker runs.

Capture thread dumps at short intervals (every 2 seconds) to examine what is happening during the startup delay.
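One way to collect the dumps is to call the JDK's jcmd tool in a loop. The Java sketch below assumes jcmd is on the PATH and is run as the same OS user as the Jira process; ThreadDumpSeries and its arguments are hypothetical, and the default of 15 dumps at 2-second intervals is simply chosen to span a delay of roughly 30 seconds like the one above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class ThreadDumpSeries {

    // "jcmd <pid> Thread.print" prints a full thread dump of the target JVM.
    static List<String> dumpCommand(long pid) {
        return List.of("jcmd", Long.toString(pid), "Thread.print");
    }

    // Zero-padded names keep the dumps in capture order when sorted.
    static Path outputFile(Path dir, int index) {
        return dir.resolve(String.format("threaddump-%03d.txt", index));
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        if (args.length < 1) {
            System.err.println("usage: ThreadDumpSeries <jira-pid> [count] [intervalSeconds]");
            return;
        }
        long pid = Long.parseLong(args[0]);
        int count = args.length > 1 ? Integer.parseInt(args[1]) : 15;
        long interval = args.length > 2 ? Long.parseLong(args[2]) : 2;
        Path outDir = Files.createDirectories(Paths.get("threaddumps"));

        for (int i = 0; i < count; i++) {
            Process p = new ProcessBuilder(dumpCommand(pid))
                    .redirectErrorStream(true)
                    .redirectOutput(outputFile(outDir, i).toFile())
                    .start();
            p.waitFor();
            if (i < count - 1) {
                TimeUnit.SECONDS.sleep(interval); // wait before the next dump
            }
        }
    }
}
```

For example, java ThreadDumpSeries 12345 writes threaddumps/threaddump-000.txt through threaddump-014.txt, where 12345 stands for the Jira process ID.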

Workaround

If JMX monitoring is causing startup delays in Jira Data Center, disable JMX monitoring before starting up the new node. Once startup is complete, JMX monitoring can be re-enabled.

is caused by

JRASERVER-67614 JMX Monitoring may take a long time to load during Jira Startup

Gathering Impact

JRASERVER-67619 Start of Jira datacenter replication thread NodeReindexServiceThread is not synchronised with Jira start localhost-startStop thread

Gathering Impact

relates to: DELTA-374

Assignee: Unassigned
Reporter: David Chan
Affected customers: 1
Watchers: 4
Created: 18/Jul/2018 12:55 PM
Updated: 25/Aug/2023 12:08 PM
Resolved: 19/Jul/2018 7:01 AM
