In the last several weeks we've been seeing a lot of Confluence instabilities at wikis.sun.com - all of them related to running out of heap space. Several iterations of increasing -Xmx didn't help (we started at 3GB and are now at 5GB on a 64-bit JVM).

      I took several memory dumps during outages and analyzed them with Eclipse Memory Analyzer, which repeatedly found two issues:

      • Something is storing Xerces SAXParser objects in ThreadLocal variables. This retains up to 90MB per thread, and I see several instances of this size held in memory at once, retaining 800-1200MB in total.
      • Hundreds of instances of net.sf.hibernate.impl.SessionImpl retain an additional ~780MB of memory - I'll document this as a separate issue.

      Just before taking the heap dump, I also took a thread dump. By comparing the two I found that the threads holding on to the huge thread-local variables were currently idle in the container's thread pool and were not processing any requests - and thus should have minimal memory requirements.
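The retention pattern described above can be sketched in a few lines - a standalone demo, not Confluence code; the 10MB byte array stands in for the ~90MB parser:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ThreadLocalLeakDemo {
    // Stand-in for the per-thread SAXParser cache.
    static final ThreadLocal<byte[]> PARSER = new ThreadLocal<byte[]>();

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        // A "request" runs on the pool thread and caches a big object.
        pool.submit(new Runnable() {
            public void run() { PARSER.set(new byte[10 * 1024 * 1024]); }
        }).get();
        // The request is done, but the idle pool thread still references the array,
        // so it can never be garbage collected while the thread lives:
        Future<Boolean> retained = pool.submit(new Callable<Boolean>() {
            public Boolean call() { return PARSER.get() != null; }
        });
        System.out.println("retained by idle pool thread: " + retained.get());
        pool.shutdown();
    }
}
```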

      I'm attaching some annotated screenshots from Eclipse Memory Analyzer and a thread dump that proves that the misbehaving threads were idle.

        1. SAXParserInstanceListing.jpg (151 kB)
        2. ThreadInstanceDrilldown.png (266 kB)
        3. ThreadInstanceListing.jpg (166 kB)
        4. ThreadSuspectSummary.png (121 kB)
        5. wikis-threaddump-090320_1106.txt (274 kB)
        6. XMLReaderManager.class (3 kB)

            [CONFSERVER-14988] SAXParser memory leaks

            Louise added a comment -

            Hi, I have this problem as well. Where can I download the Xalan jar file with this fix?

            Igor Minar added a comment -

            73 hours later and our heap is still happy - fluctuating between 1GB and 2.5GB. Before the fix 5GB wasn't enough. We'll see what happens tomorrow during the peak hours.

            Igor Minar added a comment -

            In the first ~21 hours after applying the patch, our memory usage dropped by ~1GB on average. I suggest that you include this patch in 2.10.3 so that others can benefit from it as well.

            Igor Minar added a comment -

            The patch is now deployed on the prod server, I'll let you know tomorrow how the mem consumption looks. At the moment it looks pretty good, but it's too early to tell.

            Andrew Lynch (Inactive) added a comment -

            Hi Igor,

            That's correct. Let me know how it goes with your production server.
            I'll consider creating a forked Xalan for 2.10.3 if necessary.

            Regards,
            Andrew Lynch

            Igor Minar added a comment -

            Can I assume that you took the patch from XALANJ-2195, applied it & compiled it to get the class?

            I tested it and it seems to work. I'll apply it to our prod server tonight.

            thanks

            Andrew Lynch (Inactive) added a comment -

            Hi Igor,

            While this sounds like a good solution, I think there is substantial risk involved with this approach which I'm not sure is appropriate for a point release.
            I think this is even riskier than patching Xalan itself.

            I've provided a patched binary of the suspected class responsible: placing this in WEB-INF/classes/org/apache/xml/utils should prevent the memory leak.

            Regards,
            Andrew Lynch

            Igor Minar added a comment -

            Hi Andrew,

            Closing this issue as obsolete is just not acceptable. This bug forces us to do daily restarts of Confluence, so it must be fixed asap.

            What if you developed a plugin similar to my servlet filter plugin that would clean up the leaked thread-local variables at the end of each request? Basically, any thread-local variables present at the filter's exit but not at its entry would be forcefully nulled out via reflection. Delivering the fix as a plugin wouldn't interrupt your 2.10.3 dev cycle and would fix the issue for all the other unknown (and probably less significant) thread-local memory leaks that I spotted.
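A minimal sketch of the reflection idea, not the actual plugin: the field name "threadLocals" is a Sun/HotSpot implementation detail rather than public API (newer JDKs may refuse access without --add-opens java.base/java.lang=ALL-UNNAMED), and a real filter would diff entry/exit state per request as described above instead of dropping the whole map:

```java
import java.lang.reflect.Field;

public class ThreadLocalCleaner {
    // Null out the thread's entire ThreadLocal map via reflection, so that
    // leaked values become unreachable and collectable. Returns false if
    // reflective access was refused (module system, SecurityManager, or a
    // JVM that names the field differently).
    static boolean clear(Thread t) {
        try {
            Field f = Thread.class.getDeclaredField("threadLocals");
            f.setAccessible(true);
            f.set(t, null);
            return true;
        } catch (Throwable e) {
            return false;
        }
    }

    public static void main(String[] args) {
        ThreadLocal<String> leak = new ThreadLocal<String>();
        leak.set("simulated 90MB parser");
        boolean cleared = clear(Thread.currentThread());
        // After a successful clear, the thread-local reads back as null.
        System.out.println("cleared=" + cleared + " value=" + leak.get());
    }
}
```

In a servlet filter, the clear call would go in a finally block after chain.doFilter(), so it runs even when the request throws.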

            Andrew Lynch (Inactive) added a comment -

            Hi Igor,

            I've found the reason I wasn't able to reproduce this: my system is using a different DTMManagerDefault (from my JDK, instead of from the version of Xalan we ship with Confluence) for XSLT, which does not use a thread local to store state.

            Given that we will be removing Xalan as part of 3.0 and are wrapping up 2.10.3, I'm inclined to close this as obsolete at this point.
            Unfortunately this issue has not yet been fixed in Xalan (and probably never will be), so the alternative would be to patch Xalan and provide it as a dependency, but I'm unsure whether this is too risk-prone and whether there is any benefit over providing a patched jar on this issue that affected customers could use.

            Regards,
            Andrew Lynch

            Igor Minar added a comment -

            http://issues.apache.org/jira/browse/XALANJ-2195
