HipChat / HCPUB-1407

On new and upgraded deployments of v1.4.3 or later, runsv curler is stuck in a limbo state

This issue belongs to an archived project.


    • Severity 2 - Major

      Summary

      On new and upgraded deployments of v1.4.3, runsv curler is stuck in a limbo state. This causes several intermittent issues; most notably, any message notifications that rely on API v1 (PagerDuty, Jenkins, etc.) will not fire.

      This may also prevent email notifications and push notifications from firing.

      Environment

      HipChat Server v1.4.x -> v1.4.3 (upgraded instances)
      HipChat Server v1.4.3 (new deployments)

      Steps to Reproduce

      Upgrade to HipChat Server v1.4.3 from a v1.4.x instance; or
      spin up a new v1.4.3 instance.

      Actual Results

      Log into the HipChat Server command line and check to see if the curler service is running. The quickest way to do this is to grep for curler.pid:

      ps aux | grep curler.pid
      

      There should be at least one result (sometimes two) that looks similar to this:

      hipchat  21321  0.0  0.2  63776 17516 ?        S    Aug22   0:00 /hipchat-scm/curler/vendor/virtualenv/bin/python /hipchat/curler/current/vendor/virtualenv/bin/twistd --pidfile=/var/run/hipchat/curler.pid --syslog --facility=168 --prefix=curler --nodaemon curler --base-urls=http://localhost:8080/_jobs --job-queue=*curler* --gearmand-server=localhost:4730 --num-workers=5
      

      If no such process appears, curler isn't fully running.
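The check above can be wrapped in a short snippet (a sketch; the curler\.pid pattern comes from the --pidfile argument in the process listing above):

```shell
# Look for the twistd worker by its --pidfile=/var/run/hipchat/curler.pid
# argument; pgrep -f matches against the full command line, so this is
# equivalent to the ps aux | grep curler.pid check above.
if pgrep -f 'curler\.pid' > /dev/null 2>&1; then
    status="running"
else
    status="not running"
fi
echo "curler worker is $status"
```

On an affected instance this reports "not running", matching the check above.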

      Notes

      • There is also a part of curler called curler-export.
      • The actual issue may lie with the runsv curler service, as restarting curler by itself does not work. You will see this error:
        runsv curler: fatal: unable to lock supervise/lock: temporary failure
        runsv curler-export: fatal: unable to lock supervise/lock: temporary failure
        

        If so, please run through the workaround below.
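Since runsv is runit's per-service supervisor, the supervisor's own view of the two services can also be queried with sv, where available (a sketch; the curler and curler-export service names are taken from the error above):

```shell
# Ask the runit supervisor for the state of both curler services.
# sv(8) ships with runit; fall back to a ps hint if it is not on PATH.
if command -v sv > /dev/null 2>&1; then
    sv_out=$(sv status curler curler-export 2>&1) || true
else
    sv_out="sv not found; inspect the supervisors with: ps aux | grep runsv"
fi
echo "$sv_out"
```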

      Workaround

      Please be aware that once the curler service is restarted, all queued jobs (push notifications, email notifications) will fire off at once, which may result in a flood of notifications. These notifications can safely be ignored.

      1. Log into the HipChat Server command line.
      2. Gain root access:
        sudo dont-blame-hipchat
        
      3. Next, stop the curler service:
        /etc/init.d/curler stop
        
      4. Check whether any leftover (zombie) curler processes exist:
        ps aux | grep curler
        
      5. If so, then they will need to be killed:
        kill -9 curler_PID
        

        Where "curler_PID" is the PID of a remaining curler process; repeat for each PID found.

      6. Next, kill the runsv curler and runsv curler-export services:
        kill -9 runsv_curler_PID
        

        Where "runsv_curler_PID" is the PID of the runsv curler process found in step 4.

        kill -9 runsv_curler-export_PID
        

        Where "runsv_curler-export_PID" is the PID of the runsv curler-export process found in step 4.

      7. Start curler:
        /etc/init.d/curler start
        
      8. Verify curler is up:
        ps aux | grep curler
        

      If the service is shown as up, send a test notification from your integration. If the service is not up, please reach out to HipChat Server support at support.atlassian.com and attach the output of hipchat diagnostics -b to the support ticket.

              Assignee: Unassigned
              Reporter: David Maye (dmaye@atlassian.com)
              Archiver: Michael Andreacchio (mandreacchio)