Investigate use of the StackShadowPages JVM option to prevent SIGSEGV JVM crashes


    • Type: Suggestion
    • Resolution: Won't Do
    • Component/s: None

      NOTE: This suggestion is for JIRA Server. Using JIRA Cloud? See the corresponding suggestion.

      Hi, this is probably a bit unusual, but I'm filing a ticket to provide you with a solution. Your article was a huge help in confirming that these crashes weren't unique to our product, so I thought you'd be interested in the Java bug we've raised with Oracle, as these crashes have caused significant problems for our clients:

      http://confluence.atlassian.com/display/GHKB/JIRA+with+GreenHopper+Crashes+Java+with+a+SIGSEGV+Fault+on+Linux+64bit+JVMs

      The short summary is: stack overflows in Java native calls should never crash the JVM, unless you have your own native library that's deeply recursing. Crashes have been confirmed to be prevented with -XX:StackShadowPages=20 on Linux x86_64, and the same setting may also address them on UltraSPARC (sparcv9).
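      To confirm that a running JVM actually picked up the flag, something like the following sketch could be used. It scans the JVM input arguments (via the standard `ManagementFactory.getRuntimeMXBean().getInputArguments()` call) for an explicit `-XX:StackShadowPages=N`. The class and method names are hypothetical, and treating the last occurrence as the effective one is an assumption about how the launcher resolves repeated flags.

```java
import java.lang.management.ManagementFactory;
import java.util.List;
import java.util.OptionalInt;

// Hypothetical helper, not part of any product mentioned in this ticket.
final class ShadowPagesFlag {
    private static final String PREFIX = "-XX:StackShadowPages=";

    // Returns the value of the last well-formed -XX:StackShadowPages=N argument, if any.
    // Assumption: when the flag is repeated, the last occurrence is the one in effect.
    static OptionalInt explicitShadowPages(List<String> jvmArgs) {
        OptionalInt result = OptionalInt.empty();
        for (String arg : jvmArgs) {
            if (arg.startsWith(PREFIX)) {
                try {
                    result = OptionalInt.of(Integer.parseInt(arg.substring(PREFIX.length())));
                } catch (NumberFormatException ignored) {
                    // Malformed value: keep whatever was found earlier.
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        OptionalInt pages = explicitShadowPages(jvmArgs);
        if (pages.isPresent()) {
            System.out.println("StackShadowPages explicitly set to " + pages.getAsInt());
        } else {
            System.out.println("StackShadowPages not set explicitly; platform default applies");
        }
    }
}
```

      Running it as `java -XX:StackShadowPages=20 ShadowPagesFlag` should report the explicit value; without the flag it only tells you that the platform default is in use, not what that default is.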

      Cheers!
      Danny

      Java bug: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7059899

      As our clients have migrated to 64-bit JVMs we've seen a significant number of JVM crashes due to SIGSEGV on Unix platforms. These are spontaneous and do not trigger a HotSpot crash report. Each crash involved an application bug that caused deep recursion that should have resulted in a java.lang.StackOverflowError, for instance an infinite Struts forward or a search continuance/referral loop during LDAP authentication. This affects both Solaris and Linux platforms, but a 64-bit JVM is always the common factor.
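      For reference, the expected behaviour looks like this minimal sketch: unbounded recursion in pure Java code ends in a catchable java.lang.StackOverflowError and the JVM keeps running. The class and method names are hypothetical, and the example deliberately makes no native calls, so it illustrates the normal case rather than reproducing the crash.

```java
// Hypothetical demo class; it shows the expected, non-crashing outcome of deep recursion.
final class OverflowDemo {
    private static int depth;

    // Recurses without a base case until the thread stack is exhausted.
    private static void recurse() {
        depth++;
        recurse();
    }

    // Returns how many frames were entered before StackOverflowError was thrown.
    static int depthUntilOverflow() {
        depth = 0;
        try {
            recurse();
        } catch (StackOverflowError expected) {
            // The JVM unwound the stack and handed control back to us.
        }
        return depth;
    }

    public static void main(String[] args) {
        System.out.println("Recovered from StackOverflowError after " + depthUntilOverflow() + " frames");
    }
}
```

      The depth reached depends on the thread stack size and on whether the method is interpreted or compiled, so no particular number should be expected.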

      We referred to the Java SE Troubleshooting Guide section '4.1.3 Crash due to Stack Overflow' and found that the StackShadowPages value is likely too small for these platforms. The guide discusses custom JNI libraries; however, we're seeing these conditions in "normal" native calls, usually socket operations, reads or writes, which led us to investigate further, as this should not be the case. According to the OpenJDK source the default on x86 platforms is 3, and is doubled to 6 on AMD64. There is an x86 Solaris value, seemingly to accommodate C++ compiler bugs on that platform, however our experience has shown that perhaps this is

      http://hg.openjdk.java.net/jdk6/jdk6/hotspot/file/9b013e207574/src/cpu/x86/vm/globals_x86.hpp

             #ifdef AMD64
             // Very large C++ stack frames using solaris-amd64 optimized builds
             // due to lack of optimization caused by C++ compiler bugs
             define_pd_global(intx, StackShadowPages, SOLARIS_ONLY(20) NOT_SOLARIS(6) DEBUG_ONLY(+2));
             #else
             define_pd_global(intx, StackShadowPages, 3 DEBUG_ONLY(+5));
             #endif // AMD64
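
      Read as plain logic, the macro above selects the default as in the following sketch. This is only a Java transliteration of the quoted snippet for readability; the class, method, and parameter names are hypothetical and nothing here is HotSpot code.

```java
// Hypothetical transliteration of the quoted globals_x86.hpp defaults.
final class ShadowPageDefaults {
    // amd64:      true when the AMD64 branch of the #ifdef applies
    // solaris:    true when SOLARIS_ONLY(...) expands, false when NOT_SOLARIS(...) expands
    // debugBuild: true when DEBUG_ONLY(...) expands
    static int defaultStackShadowPages(boolean amd64, boolean solaris, boolean debugBuild) {
        if (amd64) {
            int base = solaris ? 20 : 6;          // SOLARIS_ONLY(20) NOT_SOLARIS(6)
            return debugBuild ? base + 2 : base;  // DEBUG_ONLY(+2)
        }
        return debugBuild ? 3 + 5 : 3;            // 3 DEBUG_ONLY(+5)
    }

    public static void main(String[] args) {
        System.out.println("Linux AMD64 product build default: "
                + defaultStackShadowPages(true, false, false));
    }
}
```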
      

      We can only conclude that either 64-bit stack frames on AMD64 are generally far larger than their 32-bit equivalents or there's a problem with the way this value is calculated (I believe it's OS pagesize * StackShadowPages), allowing previously benign stack overflows in Java code to crash the JVM. Lab testing indicates that 17 was the smallest StackShadowPages size that prevented the JVM from crashing with a segmentation fault. We have not confirmed the value on Solaris (UltraSPARC; we don't support our product on Solaris x86), however we have certainly seen these conditions affect both platforms.
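      To put numbers on that, here is a small sketch of the arithmetic, taking the "OS pagesize * StackShadowPages" formula above at face value and assuming a 4096-byte page. The page size is an assumption (it is typical on Linux x86_64 but is not stated in this report), and the class and method names are hypothetical.

```java
// Hypothetical helper for the shadow-zone arithmetic discussed above.
final class ShadowZoneSize {
    // Assumed page size in bytes; not taken from the report.
    static final long ASSUMED_PAGE_SIZE = 4096L;

    // Shadow zone size in bytes, using pageSize * StackShadowPages as believed in the report.
    static long shadowZoneBytes(int stackShadowPages, long pageSizeBytes) {
        if (stackShadowPages < 0 || pageSizeBytes <= 0) {
            throw new IllegalArgumentException("pages must be >= 0 and page size > 0");
        }
        return Math.multiplyExact((long) stackShadowPages, pageSizeBytes);
    }

    public static void main(String[] args) {
        int[] candidates = {6, 17, 20}; // AMD64 default, lab minimum, suggested setting
        for (int pages : candidates) {
            System.out.println(pages + " pages -> "
                    + shadowZoneBytes(pages, ASSUMED_PAGE_SIZE) + " bytes");
        }
    }
}
```

      Under that assumption the default of 6 reserves 24576 bytes, the lab minimum of 17 reserves 69632 bytes, and the suggested 20 reserves 81920 bytes.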

      Others have also encountered 64-bit-specific SIGSEGVs, and bug 6346701 seems to report exactly this kind of condition. However, I could not find any discussion indicating that there could be a problem with the default shipping value, or with the calculation of the number of pages to look ahead before invoking native methods:

      http://confluence.atlassian.com/display/GHKB/JIRA+with+GreenHopper+Crashes+Java+with+a+SIGSEGV+Fault+on+Linux+64bit+JVMs
      http://fusesource.com/forums/thread.jspa?messageID=7830

      We identified the offending threads using gdb, and jstack thread dumps revealed the following two crash cases:

      Crash 1 - Two application methods that call each other recursively, executing database statements, causing overflow during Oracle thin driver socket read:

      Thread 18073: (state = IN_NATIVE)
      - java.net.SocketInputStream.socketRead0(java.io.FileDescriptor, byte[], int, int, int) @bci=0 (Compiled frame; information may be imprecise)
      - java.net.SocketInputStream.read(byte[], int, int) @bci=84, line=129 (Compiled frame)
      - oracle.net.ns.Packet.receive() @bci=31, line=240 (Compiled frame)
      - oracle.net.ns.DataPacket.receive() @bci=1, line=92 (Compiled frame)
      - oracle.net.ns.NetInputStream.getNextPacket() @bci=48, line=172 (Compiled frame)
      - oracle.net.ns.NetInputStream.read(byte[], int, int) @bci=33, line=117 (Compiled frame)
      - oracle.net.ns.NetInputStream.read(byte[]) @bci=5, line=92 (Compiled frame)
      - oracle.jdbc.driver.T4CMAREngine.buffer2Value(byte) @bci=325, line=2320 (Compiled frame)
      - oracle.jdbc.driver.T4CMAREngine.unmarshalUB4() @bci=2, line=1200 (Compiled frame)
      - oracle.jdbc.driver.T4CTTIoer.unmarshal() @bci=200, line=270 (Compiled frame)
      - oracle.jdbc.driver.T4C8Oall.receive() @bci=1507, line=1015 (Compiled frame)
      - oracle.jdbc.driver.T4CPreparedStatement.doOall8(boolean, boolean, boolean, boolean) @bci=655, line=194 (Compiled frame)
      - oracle.jdbc.driver.T4CPreparedStatement.executeForDescribe() @bci=39, line=791 (Compiled frame)
      - oracle.jdbc.driver.T4CPreparedStatement.executeMaybeDescribe() @bci=104, line=866 (Compiled frame)
      - oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout() @bci=139, line=1186 (Compiled frame)
      - oracle.jdbc.driver.OraclePreparedStatement.executeInternal() @bci=98, line=3387 (Compiled frame)
      - oracle.jdbc.driver.OraclePreparedStatement.executeQuery() @bci=13, line=3431 (Compiled frame)
      - oracle.jdbc.driver.OraclePreparedStatementWrapper.executeQuery() @bci=4, line=1491 (Compiled frame)
      - org.apache.commons.dbcp.DelegatingPreparedStatement.executeQuery() @bci=9, line=93 (Compiled frame)
      - org.apache.commons.dbcp.DelegatingPreparedStatement.executeQuery() @bci=9, line=93 (Compiled frame)
      ...
      

      Crash 2 - Infinite LDAP search referral/continuance due to incorrectly configured Active Directory server, overflowing during socket write:

      Thread 11962: (state = IN_NATIVE)
       - java.net.SocketOutputStream.socketWrite0(java.io.FileDescriptor, byte[], int, int) @bci=0 (Interpreted frame)
       - java.net.SocketOutputStream.socketWrite(byte[], int, int) @bci=44, line=92 (Interpreted frame)
       - java.net.SocketOutputStream.write(byte[], int, int) @bci=4, line=136 (Interpreted frame)
       - java.io.BufferedOutputStream.flushBuffer() @bci=20, line=65 (Interpreted frame)
       - java.io.BufferedOutputStream.flush() @bci=1, line=123 (Interpreted frame)
       - com.sun.jndi.ldap.Connection.writeRequest(com.sun.jndi.ldap.BerEncoder, int, boolean) @bci=73, line=396 (Interpreted frame)
       - com.sun.jndi.ldap.LdapClient.ldapBind(java.lang.String, byte[], javax.naming.ldap.Control[], java.lang.String, boolean) @bci=196, line=334 (Interpreted frame)
       - com.sun.jndi.ldap.LdapClient.authenticate(boolean, java.lang.String, java.lang.Object, int, java.lang.String, javax.naming.ldap.Control[], java.util.Hashtable) @bci=315, line=192 (Interpreted frame)
       - com.sun.jndi.ldap.LdapCtx.connect(boolean) @bci=316, line=2694 (Interpreted frame)
       - com.sun.jndi.ldap.LdapCtx.<init>(java.lang.String, java.lang.String, int, java.util.Hashtable, boolean) @bci=390, line=293 (Interpreted frame)
       - com.sun.jndi.ldap.LdapCtxFactory.getUsingURL(java.lang.String, java.util.Hashtable) @bci=227, line=175 (Interpreted frame)
       - com.sun.jndi.ldap.LdapCtxFactory.getLdapCtxInstance(java.lang.Object, java.util.Hashtable) @bci=12, line=134 (Interpreted frame)
       - com.sun.jndi.url.ldap.ldapURLContextFactory.getObjectInstance(java.lang.Object, javax.naming.Name, javax.naming.Context, java.util.Hashtable) @bci=17, line=35 (Interpreted frame)
       - javax.naming.spi.NamingManager.getURLObject(java.lang.String, java.lang.Object, javax.naming.Name, javax.naming.Context, java.util.Hashtable) @bci=62, line=584 (Interpreted frame)
       - javax.naming.spi.NamingManager.processURL(java.lang.Object, javax.naming.Name, javax.naming.Context, java.util.Hashtable) @bci=31, line=364 (Interpreted frame)
       - javax.naming.spi.NamingManager.processURLAddrs(javax.naming.Reference, javax.naming.Name, javax.naming.Context, java.util.Hashtable) @bci=56, line=344 (Interpreted frame)
       - javax.naming.spi.NamingManager.getObjectInstance(java.lang.Object, javax.naming.Name, javax.naming.Context, java.util.Hashtable) @bci=124, line=316 (Interpreted frame)
       - com.sun.jndi.ldap.LdapReferralContext.<init>(com.sun.jndi.ldap.LdapReferralException, java.util.Hashtable, javax.naming.ldap.Control[], javax.naming.ldap.Control[], java.lang.String, boolean, int) @bci=212, line=93 (Interpreted frame)
       - com.sun.jndi.ldap.LdapReferralException.getReferralContext(java.util.Hashtable, javax.naming.ldap.Control[]) @bci=38, line=132 (Interpreted frame)
       - com.sun.jndi.ldap.LdapCtx.searchAux(javax.naming.Name, java.lang.String, javax.naming.directory.SearchControls, boolean, boolean, com.sun.jndi.toolkit.ctx.Continuation) @bci=269, line=1838 (Interpreted frame)
       - com.sun.jndi.ldap.LdapCtx.c_search(javax.naming.Name, java.lang.String, javax.naming.directory.SearchControls, com.sun.jndi.toolkit.ctx.Continuation) @bci=14, line=1749 (Interpreted frame)
       - com.sun.jndi.toolkit.ctx.ComponentDirContext.p_search(javax.naming.Name, java.lang.String, javax.naming.directory.SearchControls, com.sun.jndi.toolkit.ctx.Continuation) @bci=72, line=368 (Interpreted frame)
       - com.sun.jndi.toolkit.ctx.PartialCompositeDirContext.search(javax.naming.Name, java.lang.String, javax.naming.directory.SearchControls) @bci=32, line=338 (Interpreted frame)
       - com.sun.jndi.ldap.LdapReferralContext.search(javax.naming.Name, java.lang.String, javax.naming.directory.SearchControls) @bci=44, line=639 (Interpreted frame)
       - com.sun.jndi.ldap.LdapCtx.searchAux(javax.naming.Name, java.lang.String, javax.naming.directory.SearchControls, boolean, boolean, com.sun.jndi.toolkit.ctx.Continuation) @bci=282, line=1844 (Interpreted frame)
       - com.sun.jndi.ldap.LdapCtx.c_search(javax.naming.Name, java.lang.String, javax.naming.directory.SearchControls, com.sun.jndi.toolkit.ctx.Continuation) @bci=14, line=1749 (Interpreted frame)
       - com.sun.jndi.toolkit.ctx.ComponentDirContext.p_search(javax.naming.Name, java.lang.String, javax.naming.directory.SearchControls, com.sun.jndi.toolkit.ctx.Continuation) @bci=72, line=368 (Interpreted frame)
       - com.sun.jndi.toolkit.ctx.PartialCompositeDirContext.search(javax.naming.Name, java.lang.String, javax.naming.directory.SearchControls) @bci=32, line
      ...
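
      Spotting the recursion in dumps like the two above can be partly automated. The following sketch counts how often each frame signature occurs in one thread's jstack output and keeps only the repeated ones. It assumes the line format shown in these dumps ("- signature @bci=..."); the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for scanning a single thread's jstack frames for repetition.
final class RepeatedFrames {
    // Extracts "pkg.Class.method(args)" from a frame line, or returns null for
    // anything that is not a frame (thread headers, "..." markers, blank lines).
    static String frameSignature(String line) {
        String trimmed = line.trim();
        if (!trimmed.startsWith("- ")) {
            return null;
        }
        int end = trimmed.indexOf(" @bci=");
        return end < 0 ? trimmed.substring(2) : trimmed.substring(2, end);
    }

    // Returns signatures that occur at least twice, in first-seen order, with their counts.
    static Map<String, Integer> repeated(List<String> dumpLines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : dumpLines) {
            String signature = frameSignature(line);
            if (signature != null) {
                counts.merge(signature, 1, Integer::sum);
            }
        }
        counts.values().removeIf(n -> n < 2);
        return counts;
    }
}
```

      Because the bytecode index and line number are stripped, the two LdapCtx.searchAux frames in Crash 2 (@bci=269 and @bci=282) would count as the same signature.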
      

            Assignee:
            Unassigned
            Reporter:
            Bogdan Dziedzic [Atlassian]