Jira Data Center / JRASERVER-68653

Asynchronous cache replication queue - leaking file descriptor when queue file corrupted

      Problem

      If a cache replication queue file is corrupted while a node is shutting down, then after the next node start Jira will try to open this queue file every time it is needed, i.e. whenever a cache replication message is sent to another node on this particular channel (each channel maps to one file). Because the existing file is corrupted, every attempt fails with the following error:

      ERROR      [c.a.j.c.distribution.localq.LocalQCacheManager] Error when creating cache replication queue for node: [node_name]. This node will be inconsistent. Error: File is corrupt; length stored in header is 0.

      This results in:

      • cache replication messages not being delivered
      • a file descriptor leaked on every failed attempt
        • Jira eventually hits the "Too many open files" error; reviewing lsof output shows many localq entries (see the example below).
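
      A quick way to confirm the leak on a running node is to count the open descriptors on localq files and watch the number grow over time. This is only a sketch; 19797 below is a placeholder for the Jira process id on the affected node:

      # Placeholder PID - substitute the Jira process id on the affected node.
      lsof -p 19797 | grep -c localq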

      Desired Jira behaviour

      If the file is corrupted, back it up (copy it with a corrupted_ prefix) and create a new, empty queue file.

      Workaround

      Delete the corrupted queue file. Steps to identify the corrupted file can be found in comment-1917799.
      It should not be necessary to shut down the node; Jira should recreate the queue file automatically.
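
      For example, once the corrupted file has been identified (the path and name below are purely illustrative), it can be backed up and removed while the node keeps running; the backup step simply mirrors the behaviour proposed above:

      # Illustrative path - substitute the corrupted queue file identified on your node.
      QUEUE=/opt/atlassian/jira/jira_home/localq/queue_node2_9_78882aaeb08e9a4c81687b5de2add74f
      # Optional backup with a corrupted_ prefix, then delete; Jira recreates the queue file automatically.
      cp "$QUEUE" "$(dirname "$QUEUE")/corrupted_$(date +%s)_$(basename "$QUEUE")"
      rm "$QUEUE"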
       


            Maciej Swinarski (Inactive) added a comment - edited

            dnorton@atlassian.com

            2019-01-10 00:05:49,141 localq-reader-18 WARN      [c.a.j.c.distribution.localq.LocalQCacheOpReader] Runtime exception: UnrecoverableFailure occurred when processing: LocalQCacheOp{cacheName='com.almworks.structure.issues-to-rows', action=REMOVE, key=<2323>, value=null, creationTimeInMillis=1547078406338} from cache replication queue: [queueId=queue_<node-1>_8_8db34e23442a564ae4d4d9b06c049f8a, queuePath=/scmdata/jira/home/localq/queue_<node-1>_8_8db34e23442a564ae4d4d9b06c049f8a], failuresCount: 150/200, error: java.rmi.NotBoundException: com.almworks.structure.issues-to-rows 

            This has nothing to do with the original problem. This warning means that the other node (node-1) is alive but the cache "com.almworks.structure.issues-to-rows" does not exist there. This is expected while <node-1> is starting - the RMI port is opened first while plugins are still starting, and in this case the plugin registering the cache com.almworks.structure.issues-to-rows has not started yet.

            If node-1 is fully up and other nodes can't deliver this message, it means something is wrong with the plugin registering this cache (not up?). Either way, the problem is not related to the cache replication mechanism itself.

            In addition, clearing the localq folder works for that node, but other nodes eventually start showing the same problem - much like playing whack-a-mole.

             

            Which problem? The corrupted queue problem? That should be really rare. If you are talking about the "java.rmi.NotBoundException", it is fine on start as explained above; otherwise it would mean this particular cache is not present on the other node.

             


            Maciej Swinarski (Inactive) added a comment - edited

            Fix description

            When Jira starts and one of the cache replication queue files is corrupted, a backup of this file is created (named "corrupted_[timestamp]_[original_queue_name]") and a new, empty queue file is created. No file descriptors are leaked when this occurs.
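
            For example (file names are illustrative only, assuming queue #9 on node2 was the corrupted one), the localq folder would afterwards contain something like:

            corrupted_<timestamp>_queue_node2_9_78882aaeb08e9a4c81687b5de2add74f   <- backup of the corrupted file
            queue_node2_9_78882aaeb08e9a4c81687b5de2add74f                         <- new, empty queue file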

            The error message:

            Error when creating cache replication queue for node: {node_id}. This node will be inconsistent. Error: {error_message} 

            has been changed to also contain the queue id, so the affected queue file can be identified:

            Error when creating cache replication queue: {queue_id} for node: {node_id}. This node will be inconsistent. Error: {error_message} 
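
            For example (values are illustrative):

            Error when creating cache replication queue: queue_node2_9_78882aaeb08e9a4c81687b5de2add74f for node: node2. This node will be inconsistent. Error: File is corrupt; length stored in header is 0.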

             


            Maciej Swinarski (Inactive) added a comment

            Hi bvaradha1,

            Unfortunately the current log message does not tell you exactly which file it is. Thank you for letting me know about this problem; I will add this information to the fix as well.

            You can delete all files for this node matching node-name_[0-9], but then you will lose some cache replication messages from valid cache replication queues, so it's better to delete just the corrupted file. Since this bug also leaks file descriptors for the corrupted queue file, you can use that to find exactly which queue file is corrupted.

            For the sake of this example, let's assume:

            • jira process id = 19797
            • node-name = node2

            Run the following command:

            ls -l /proc/19797/fd | grep localq | grep node2
            

            Since there is a leak for the corrupted file (let's say it's queue #9), you should see something like this:

            1316 -> /opt/atlassian/jira/jira_home/localq/queue_node2_7_78882aaeb08e9a4c81687b5de2add74f
            1320 -> /opt/atlassian/jira/jira_home/localq/queue_node2_1_78882aaeb08e9a4c81687b5de2add74f
            1321 -> /opt/atlassian/jira/jira_home/localq/queue_node2_2_78882aaeb08e9a4c81687b5de2add74f
            1322 -> /opt/atlassian/jira/jira_home/localq/queue_node2_3_78882aaeb08e9a4c81687b5de2add74f
            1323 -> /opt/atlassian/jira/jira_home/localq/queue_node2_4_78882aaeb08e9a4c81687b5de2add74f
            1325 -> /opt/atlassian/jira/jira_home/localq/queue_node2_5_78882aaeb08e9a4c81687b5de2add74f
            1326 -> /opt/atlassian/jira/jira_home/localq/queue_node2_6_78882aaeb08e9a4c81687b5de2add74f
            1327 -> /opt/atlassian/jira/jira_home/localq/queue_node2_8_78882aaeb08e9a4c81687b5de2add74f
            1328 -> /opt/atlassian/jira/jira_home/localq/queue_node2_9_78882aaeb08e9a4c81687b5de2add74f
            1329 -> /opt/atlassian/jira/jira_home/localq/queue_node2_9_78882aaeb08e9a4c81687b5de2add74f
            1330 -> /opt/atlassian/jira/jira_home/localq/queue_node2_9_78882aaeb08e9a4c81687b5de2add74f
            1331 -> /opt/atlassian/jira/jira_home/localq/queue_node2_9_78882aaeb08e9a4c81687b5de2add74f
            1332 -> /opt/atlassian/jira/jira_home/localq/queue_node2_9_78882aaeb08e9a4c81687b5de2add74f
            241 -> /opt/atlassian/jira/jira_home/localq/queue_node2_0_78882aaeb08e9a4c81687b5de2add74f
            

            As in the example above, the corrupted queue file will have multiple file descriptors open against it, so based on this information I would only have to delete the file:

            /opt/atlassian/jira/jira_home/localq/queue_node2_9_78882aaeb08e9a4c81687b5de2add74f
            
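            To go straight to the leaked entry without scanning the whole listing, the same data can be aggregated (a sketch reusing the assumed process id and node name from above); the corrupted queue is the one that appears more than once:

            # Count open descriptors per queue file; a count > 1 points at the corrupted queue.
            ls -l /proc/19797/fd | grep localq | grep node2 | awk '{print $NF}' | sort | uniq -c | sort -rn
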

            Thank you for letting me know about this problem.


            Badrinath Varadharajan added a comment

            @Maciej: We faced the same issue a few weeks back on one of our Jira Data Center nodes.

            [c.a.j.c.distribution.localq.LocalQCacheManager] Error when creating cache replication queue for node: <node-name>. This node will be inconsistent. Error: File is corrupt; length stored in header is 0. java.io.IOException: File is corrupt; length stored in header is 0.

            Just wanted to know: if we face the same issue again, do we need to remove the <node-name> files from the jira-home/localq folder?

            We do see a lot of files for <node-name> in the jira-home/localq folder:

            queue_<node-name>_0_558f4e07e022f3d0952240a1abc249c3  
            queue_<node-name>_1_558f4e07e022f3d0952240a1abc249c3 
            queue_<node-name>_2_558f4e07e022f3d0952240a1abc249c3
            queue_<node-name>_3_558f4e07e022f3d0952240a1abc249c3
            queue_<node-name>_4_558f4e07e022f3d0952240a1abc249c3

            Do we need to remove all of these?


            Dibyandu Roy added a comment

            7.6.10 is also affected by this issue.


              mswinarski Maciej Swinarski (Inactive)
              Affected customers: 2
              Watchers: 13