  Server Deployments and Scale / SCALE-141

Jira deployed on Kubernetes using Helm throws an error on stop when JMX monitoring is enabled

    • Severity: 2 - Major
    • Development Effort: S

      Issue Summary

      When Jira is deployed on Kubernetes with JMX monitoring enabled and a stop command is executed, the JMX agent throws a "port already in use" error and the stop command fails. As a result, the JVM is terminated by an OS signal.

      Steps to Reproduce

      1. Deploy Jira on Kubernetes using the Helm chart with JMX monitoring enabled (a minimal example follows these steps)
      2. Execute the following command to stop Jira:
      kubectl exec -it jira-0 -c jira -- /shutdown-wait.sh 
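
      For reference, a minimal deployment command might look like the following; the repo alias and release name here are illustrative, and monitoring.exposeJmxMetrics is the chart value referenced in the comments below:

      helm repo add atlassian-data-center https://atlassian.github.io/data-center-helm-charts
      helm install jira atlassian-data-center/jira --set monitoring.exposeJmxMetrics=true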

      Expected Results

      Jira stops without any error

      Actual Results

      The following exception is thrown:

      $ kubectl exec -it jira-0 -c jira --  /shutdown-wait.sh
      Shutting down Jira...
      executing as current user
      
                        `sMMMMMMMMMMMMMM+
                           MMMMMMMMMMMMMM
                           :sdMMMMMMMMMMM
                                   MMMMMM
                `sMMMMMMMMMMMMMM+  MMMMMM
                   MMMMMMMMMMMMMM  +MMMMM
                    :sMMMMMMMMMMM   MMMMM
                           MMMMMM    `UOJ
         `sMMMMMMMMMMMMM+  MMMMMM
           MMMMMMMMMMMMMM  +MMMMM
            :sdMMMMMMMMMM   MMMMM
                   MMMMMM    `UOJ
                   MMMMMM
                   +MMMMM
                    MMMMM
                     `UOJ
      
            Atlassian Jira
            Version : 9.12.13
                        
      
      If you encounter issues starting or stopping Jira, please see the Troubleshooting guide at https://docs.atlassian.com/jira/jadm-docs-0912/Troubleshooting+installation
      
      Using JIRA_HOME:       /var/atlassian/application-data/jira
      
      Server startup logs are located in /opt/atlassian/jira/logs/catalina.out
      Using CATALINA_BASE:   /opt/atlassian/jira
      Using CATALINA_HOME:   /opt/atlassian/jira
      Using CATALINA_TMPDIR: /opt/atlassian/jira/temp
      Using JRE_HOME:        /opt/java/openjdk
      Using CLASSPATH:       /opt/atlassian/jira/bin/bootstrap.jar:/opt/atlassian/jira/bin/tomcat-juli.jar
      Using CATALINA_OPTS:   
      Using CATALINA_PID:    /opt/atlassian/jira/work/catalina.pid
      NOTE: Picked up JDK_JAVA_OPTIONS:  --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.rmi/sun.rmi.transport=ALL-UNNAMED
      Exception in thread "main" java.lang.reflect.InvocationTargetException
              at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
              at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
              at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
              at java.base/java.lang.reflect.Method.invoke(Method.java:569)
              at java.instrument/sun.instrument.InstrumentationImpl.loadClassAndStartAgent(InstrumentationImpl.java:491)
              at java.instrument/sun.instrument.InstrumentationImpl.loadClassAndCallPremain(InstrumentationImpl.java:503)
      Caused by: java.net.BindException: Address already in use
              at java.base/sun.nio.ch.Net.bind0(Native Method)
              at java.base/sun.nio.ch.Net.bind(Net.java:555)
              at java.base/sun.nio.ch.ServerSocketChannelImpl.netBind(ServerSocketChannelImpl.java:337)
              at java.base/sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:294)
              at java.base/sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:89)
              at jdk.httpserver/sun.net.httpserver.ServerImpl.<init>(ServerImpl.java:142)
              at jdk.httpserver/sun.net.httpserver.HttpServerImpl.<init>(HttpServerImpl.java:50)
              at jdk.httpserver/sun.net.httpserver.DefaultHttpServerProvider.createHttpServer(DefaultHttpServerProvider.java:35)
              at jdk.httpserver/com.sun.net.httpserver.HttpServer.create(HttpServer.java:150)
              at io.prometheus.jmx.shaded.io.prometheus.client.exporter.HTTPServer.<init>(HTTPServer.java:396)
              at io.prometheus.jmx.JavaAgent.premain(JavaAgent.java:31)
              ... 6 more
      *** java.lang.instrument ASSERTION FAILED ***: "!errorOutstanding" with message Outstanding error when calling method in invokeJavaAgentMainMethod at src/java.instrument/share/native/libinstrument/JPLISAgent.c line: 619
      *** java.lang.instrument ASSERTION FAILED ***: "success" with message invokeJavaAgentMainMethod failed at src/java.instrument/share/native/libinstrument/JPLISAgent.c line: 459
      *** java.lang.instrument ASSERTION FAILED ***: "result" with message agent load/premain call failed at src/java.instrument/share/native/libinstrument/JPLISAgent.c line: 422
      FATAL ERROR in native method: processing of -javaagent failed, processJavaStart failed
      Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
      V  [libjvm.so+0x8b385e]  jni_FatalError+0xbe
      V  [libjvm.so+0xa327f2]  JvmtiExport::post_vm_initialized()+0x132
      V  [libjvm.so+0xe6df56]  Threads::create_vm(JavaVMInitArgs*, bool*)+0x946
      V  [libjvm.so+0x8c1e22]  JNI_CreateJavaVM+0x52
      C  [libjli.so+0x9c7f]  JavaMain+0x8f
      C  [libjli.so+0xdf79]  ThreadJavaMain+0x9
      C  [libc.so.6+0x9ca94]
      
      /opt/atlassian/jira/bin/catalina.sh: line 575:   743 Aborted                 (core dumped) "/opt/java/openjdk/bin/java" -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.nio.file=ALL-UNNAMED --add-opens=java.base/javax.crypto=ALL-UNNAMED --add-opens=java.management/javax.management=ALL-UNNAMED --add-opens=java.desktop/sun.font=ALL-UNNAMED --add-opens=java.base/sun.reflect.generics.parser=ALL-UNNAMED --add-opens=java.base/java.time=ALL-UNNAMED --add-opens=jdk -Xms2g -Xmx2g -XX:InitialCodeCacheSize=32m -XX:ReservedCodeCacheSize=${JVM_RESERVED_CODE_CACHE_SIZE:=512m} -Dlog4j2.contextSelector=org.apache.logging.log4j.core.selector.BasicContextSelector -Dlog4j2.disableJmx=true -Dlog4j2.garbagefree.threadContextMap=true -Dlog4j2.isWebapp=false -Djava.awt.headless=true -Datlassian.standalone=JIRA -Dorg.apache.jasper.runtime.BodyContentImpl.LIMIT_BUFFER=true -Dmail.mime.decodeparameters=true -Dorg.dom4j.factory=com.atlassian.core.xml.InterningDocumentFactory -Dorg.apache.catalina.connector.RECYCLE_FACADES=false -Datlassian.logging.cloud.enabled=false -XX:ActiveProcessorCount=2 -javaagent:/var/atlassian/application-data/shared-home/jmx_prometheus_javaagent.jar=30000:/opt/atlassian/jmx/jmx-config.yaml -XX:-OmitStackTraceInFastThrow -Djava.locale.providers=COMPAT -Djira.home=/var/atlassian/application-data/jira -Djdk.tls.ephemeralDHKeySize=2048 -Djava.protocol.handler.pkgs=org.apache.catalina.webresources -Dorg.apache.catalina.security.SecurityListener.UMASK=0027 -Dignore.endorsed.dirs="" -classpath "/opt/atlassian/jira/bin/bootstrap.jar:/opt/atlassian/jira/bin/tomcat-juli.jar" -Dcatalina.base="/opt/atlassian/jira" -Dcatalina.home="/opt/atlassian/jira" -Djava.io.tmpdir="/opt/atlassian/jira/temp" org.apache.catalina.startup.Bootstrap stop
      The stop command failed. Attempting to signal the process to stop through OS signal.
      command terminated with exit code 137

      Workaround

      Currently there is no known workaround for this behavior. A workaround will be added here when available. (A community-provided mitigation is described in the comments below.)

            Activity

            Yevhen made changes -
            Resolution New: Fixed [ 1 ]
            Status Original: In Review [ 10051 ] New: Closed [ 6 ]

            Yevhen added a comment -

            Available in 1.22.0

            Yevhen made changes -
            Status Original: In Progress [ 3 ] New: In Review [ 10051 ]
            Yevhen made changes -
            Status Original: Short Term Backlog [ 12074 ] New: In Progress [ 3 ]
            Yevhen made changes -
            Labels Original: dcxp-triaged product New: dcxp-triaged
            Yevhen made changes -
            Labels Original: dcxp-triaged New: dcxp-triaged product
            Yevhen made changes -
            Labels New: dcxp-triaged

            To expand on this issue:

            When monitoring.exposeJmxMetrics is set to true, Jira starts an additional Java agent, which by default listens on port 9999 (see the env var JVM_SUPPORT_RECOMMENDED_ARGS for details).
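
            In the log above, the agent flag injected through JVM_SUPPORT_RECOMMENDED_ARGS looks like this (port 30000 rather than the default 9999 in that deployment):

            -javaagent:/var/atlassian/application-data/shared-home/jmx_prometheus_javaagent.jar=30000:/opt/atlassian/jmx/jmx-config.yaml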

            When the cluster then decides to terminate the pod, it runs the command specified in the Helm value jira.shutdown.command: "/shutdown-wait.sh"

            Through a complex chain of events, this script triggers catalina.sh which, before finally starting a new Java process, picks up the values in JVM_SUPPORT_RECOMMENDED_ARGS by running setenv.sh.
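
            Simplified from the command line visible in the log above (other flags abbreviated to "..."), the same -javaagent flag ends up on both JVMs:

            # Running Jira server: the agent's premain has already bound the metrics port
            java ... -javaagent:/var/atlassian/application-data/shared-home/jmx_prometheus_javaagent.jar=30000:/opt/atlassian/jmx/jmx-config.yaml ... org.apache.catalina.startup.Bootstrap start

            # Stop JVM launched by catalina.sh: same flag, same port, so its premain fails with BindException
            java ... -javaagent:/var/atlassian/application-data/shared-home/jmx_prometheus_javaagent.jar=30000:/opt/atlassian/jmx/jmx-config.yaml ... org.apache.catalina.startup.Bootstrap stop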

            This new Java process also tries to spin up a javaagent listening on (binding to) the same port as the running Jira's javaagent. This fails, and "java.net.BindException: Address already in use" is thrown.

            Jira then appears to terminate immediately, potentially corrupting any ongoing SQL queries, processes, etc. On top of the potential corruption problems, we also see "core.xxx" files created in the Jira local home; they seem to be the size of JVM_MAXIMUM_MEMORY, which in our prod environment is ~35 GB, and with 4 pods each holding several of these core files, disk space fills up quickly.
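
            To gauge the impact, something like the following lists the dumps (the path matches the JIRA_HOME shown in the log above; adjust it for your deployment):

            kubectl exec jira-0 -c jira -- find /var/atlassian/application-data/jira -maxdepth 1 -name 'core.*' -exec ls -lh {} +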

            The workaround we implemented was to update the shutdown command:

            --set jira.shutdown.command='sed -i "JVM_SUPPORT_RECOMMENDED_ARGS=$(echo "$JVM_SUPPORT_RECOMMENDED_ARGS" | sed "s|-javaagent:/var/atlassian/application-data/shared-home/jmx_prometheus_javaagent.jar=[0-9]*:/opt/atlassian/jmx/jmx-config.yaml||") && /shutdown-wait.sh'
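
            For reference, a self-contained variant of the same idea, assuming /shutdown-wait.sh picks JVM_SUPPORT_RECOMMENDED_ARGS up from the environment (if the agent flag is instead baked into bin/setenv.sh inside the container, an equivalent sed -i edit of that file is needed):

            --set jira.shutdown.command='export JVM_SUPPORT_RECOMMENDED_ARGS="$(echo "$JVM_SUPPORT_RECOMMENDED_ARGS" | sed "s|-javaagent:/var/atlassian/application-data/shared-home/jmx_prometheus_javaagent.jar=[0-9]*:/opt/atlassian/jmx/jmx-config.yaml||")" && /shutdown-wait.sh'

            This strips only the -javaagent flag, so the short-lived stop JVM never tries to bind the metrics port.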

            If you are still having similar shutdown problems, have a look at: https://jira.atlassian.com/browse/JRASERVER-70785

            Yevhen made changes -
            Assignee New: Yevhen [ c64f33b2bce3 ]
            Yevhen made changes -
            Development Effort New: S [ 13031 ]
            Status Original: Needs Triage [ 10030 ] New: Short Term Backlog [ 12074 ]

              Assignee: Yevhen
              Reporter: Kutay Peksen
              Affected customers: 2
              Watchers: 6
