-
Bug
-
Resolution: Fixed
-
Highest
-
2.10.2
-
None
-
jdk6u12 (32bit and 64bit), mysql5, war, cluster - but there is only one node in the cluster
In the last several weeks we've been seeing a lot of confluence instabilities at wikis.sun.com - all of them were related to running out of heap space. Several iterations of increasing Xmx didn't help (we started at 3GB and now we are at 5GB and 64bit JVM).
I took several memory dumps during outages and analyzed them with Eclipse Memory Analyzer, which repeatedly found two issues:
- Something is storing Xerces SaxParser objects as ThreadLocal variables, this results in up to 90MB being retained per thread and I see several instances of this size being held in memory causing total of 800-1200MB of the memory to be retained
- Hundreds of instances of net.sf.hibernate.impl.SessionImpl retain additional ~780MB of memory - I'll document this as a separate issue
Just before taking the heap dump, I also took a thread dump. By comparing the two I found that threads that were holding on the the huge thread local variables were currently in the containers thread pool and were not processing any requests - thus should have minimal memory requirements.
I'm attaching some annotated screenshots from Eclipse Memory Analyzer and a thread dump that proves that the misbehaving threads were idle.
[CONFSERVER-14988] SAXParser memory leaks
73 hours later and our heap is still happy - fluctuating between 1GB and 2.5GB. Before the fix 5GB wasn't enough. We'll see what happens tomorrow during the peak hours.
In the first ~21hours after applying the patch our memory usage dropped by ~1GB on average. I suggest that you include this patch with 2.10.3 so that others can benefit from it as well.
The patch is now deployed on the prod server, I'll let you know tomorrow how the mem consumption looks. At the moment it looks pretty good, but it's too early to tell.
Hi Igor,
That's correct. Let me know how if goes with your production server.
I'll consider creating a forked Xalan for 2.10.3 if necessary.
Regards,
Andrew Lynch
Can I assume that you took the patch from XALANJ-2195, applied it & compiled it to get the class?
I tested it and it seems to work. I'll apply it to our prod server tonight.
thanks
Hi Igor,
While this sounds like a good solution, I think there is substantial risk involved with this approach which I'm not sure is appropriate for a point release.
I think this is even riskier than patching Xalan itself.
I've provided a patched binary of the suspected class responsible : placing this in WEB-INF/classes//org/apache/xml/utils should prevent the memory leak.
Regards,
Andrew Lynch
Hi Andrew,
Closing this issue as obsolete is just not acceptable. This bug forces us to do daily restarts of Confluence, so it must be fixed asap.
What if you developed a plugin similar to my servlet filter plugin, that would clean up the leaked thread-local variables at the end of the request (basically the diff between the thread-local variables at the entrance to the filter and exit would be forcefully nulled out (via reflection)). Delivering the fix as a plugin wouldn't interrupt your 2.10.3 dev cycle and would fix the issue for all the other unknown (and probably less significant) thread-local memory leaks that I spotted.
Hi Igor,
I've found the reason I wasn't able to reproduce this : my system is using a different DTMManagerDefault (from my JDK, instead of from the version of Xalan we ship with Confluence) for XSLT, which does not use a thread local to store state.
Given that we will be removing Xalan as part of 3.0 and are wrapping up 2.10.3, I'm inclined to close this as obsolete at this point.
Unfortunately this issue has not yet been fixed in Xalan (and probably never will be), so the alternative would be to patch Xalan and provide this as a dependency, but I'm unsure if this is too risk prone and whether there is any benefit over providing a patched jar on this issue that affected customers could use.
Regards,
Andrew Lynch
For some reason BTrace was not able to instrument the ThreadLocal classes, but DTrace was able to find this:
libjvm.so`__1cNSharedRuntimeTdtrace_method_entry6FpnKJavaThread_pnNmethodOopDesc__i_+0xb5 java/lang/ThreadLocal.<init>()V org/apache/xml/utils/XMLReaderManager.getXMLReader()Lorg/xml/sax/XMLReader; org/apache/xml/dtm/ref/DTMManagerDefault.getXMLReader(Ljavax/xml/transform/Source;)Lorg/xml/sax/XMLReader; org/apache/xml/dtm/ref/DTMManagerDefault.getDTM(Ljavax/xml/transform/Source;ZLorg/apache/xml/dtm/DTMWSFilter;ZZ)Lorg/apache/xml/dtm/DTM; org/apache/xalan/transformer/TransformerImpl.transform(Ljavax/xml/transform/Source;Z)V org/apache/xalan/transformer/TransformerImpl.transform(Ljavax/xml/transform/Source;Ljavax/xml/transform/Result;Z)V org/apache/xalan/transformer/TransformerImpl.transform(Ljavax/xml/transform/Source;Ljavax/xml/transform/Result;)V com/atlassian/confluence/importexport/impl/WikiExporter.convertXHtmlToXslFo(Ljava/lang/String;)Ljava/lang/String; com/atlassian/confluence/importexport/impl/WikiExporter.exportWikiToXslFo(Ljava/lang/String;Lcom/atlassian/confluence/renderer/PageContext;)Ljava/lang/String; com/atlassian/confluence/importexport/impl/WikiExporter.exportWikiToXslFo(Lcom/atlassian/confluence/core/ContentEntityObject;)Ljava/lang/String; com/atlassian/confluence/importexport/impl/WikiExporter.exportWikiToXslFo(Lcom/atlassian/confluence/core/ContentEntityObject;)Ljava/lang/String; libjvm.so`__1cJJavaCallsLcall_helper6FpnJJavaValue_pnMmethodHandle_pnRJavaCallArguments_pnGThread__v_+0x22a libjvm.so`__1cCosUos_exception_wrapper6FpFpnJJavaValue_pnMmethodHandle_pnRJavaCallArguments_pnGThread__v2468_v_+0x18 libjvm.so`__1cJJavaCallsEcall6FpnJJavaValue_nMmethodHandle_pnRJavaCallArguments_pnGThread__v_+0x2e libjvm.so`__1cKReflectionGinvoke6FnTinstanceKlassHandle_nMmethodHandle_nGHandle_bnOobjArrayHandle_nJBasicType_4bpnGThread__pnHoopDesc__+0x1284 libjvm.so`__1cKReflectionNinvoke_method6FpnHoopDesc_nGHandle_nOobjArrayHandle_pnGThread__2_+0x1d6 libjvm.so`JVM_InvokeMethod+0x293 libjava.so`Java_sun_reflect_NativeMethodAccessorImpl_invoke0+0x12 sun/reflect/NativeMethodAccessorImpl.invoke0(Ljava/lang/reflect/Method;Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; sun/reflect/NativeMethodAccessorImpl.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; sun/reflect/DelegatingMethodAccessorImpl.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;* 0xfffffd7fd802a3e0
So it seems that xalan is responsible for the leak.
I do get Tomcat's classloader placed in the ThreadLocal when exporting a space, but that does not seem be the problem your screenshots exhibit.
Hi Igor,
I performed by setting a conditional breakpoint inside the TheadLocal, so this would work whether it was placed in there by Confluence or any of our libraries.
None of these objects are ever put into a ThreadLocal on my instance, so I really don't have any clues as to how / why they are residing in your thread locals.
I'm using Java 1.5.0_13 on OS X 10.4, with a 2.10.2 instance without any customizations.
Have you also verified that non of the libraries you use stores stuff in thread-locale storage?
Hi Igor,
I've recompiled my JDK to not strip debugging symbols so that I could debug it effectively, and I've verified (on my instance) we never put any of the following objects in a ThreadLocal while exporting a pdf page:
1) Anything with xml in the class or package name
2) Anything with SAXParser in the class or package name
3) Anything with apache in the class or package name
I think xwork might not correctly clear its thread local after servicing an action, but I'm not sure how much of a problem this is.
Regards,
Andrew Lynch
I kind of have a feeling that the issue is not caused by Confluence, but by one of its dependencies.... just a hunch.
I wrote a Confluence Servlet Filter plugin that detects possible threadlocal memory leaks before and after request is dispatched to Confluence. The filter found many dangling thread local references - some of which might be legitimate, e.g. those set by container.
The filter uses reflection to inspect thread local variables set for the current thread and prints classes of all the references. If a threadlocal instance object exists, but currently points to null, "null" is printed instead of a class name.
The filter inspects the threadlocal store twice for each request - once before the request is dispatched to Confluence and once after. This allows me to see any change in the store when hitting different URIs.
WARN 2009-03-31 08:03:03,664 [service-j2ee-4] ThreadLocalLeakFilter:detectLeak - possible ThreadLocal leaks: 111 of 256 = [null, null, null, java.lang.Long, java.lang.Long, null, null, null, java.util.HashMap, null, null, null, null, null, java.util.HashMap, java.lang.Long, null, null, null, null, java.lang.Long, null, null, null, java.lang.Long, null, null, null, java.lang.Long, null, null, java.lang.ref.SoftReference, java.util.concurrent.locks.ReentrantReadWriteLock$Sync$HoldCounter, null, java.lang.Long, null, java.lang.Long, java.util.HashMap, null, null, java.lang.Long, java.lang.Long, null, null, null, java.lang.Long, null, null, java.lang.Long, java.lang.Boolean, null, null, java.lang.Long, null, java.lang.Long, [Ljava.text.SimpleDateFormat;, null, java.lang.Long, null, java.lang.Long, java.lang.Long, null, java.lang.ref.SoftReference, null, null, null, java.lang.String, null, java.util.concurrent.locks.ReentrantReadWriteLock$Sync$HoldCounter, null, null, null, null, java.lang.Long, null, null, null, java.lang.Long, org.apache.catalina.core.ApplicationContext$DispatchData, null, null, null, null, null, java.util.LinkedList, java.lang.Long, java.util.HashMap, java.lang.Long, null, null, null, null, null, java.lang.Long, null, null, null, java.lang.Long, java.lang.Long, null, java.util.LinkedList, java.lang.Long, null, null, null, null, com.opensymphony.xwork.ActionContext, null, null, java.lang.Long, null,] beforeFilter http://wikis-tst/pages/doexportpage.action
...
WARN 2009-03-31 08:03:08,039 [service-j2ee-4] ThreadLocalLeakFilter:detectLeak - possible ThreadLocal leaks: 127 of 256 = [null, null, null, null, java.lang.Long, java.lang.Long, null, java.lang.Long, null, null, java.util.HashMap, null, java.lang.Long, null, null, null, null, org.apache.xerces.parsers.SAXParser, [C, java.util.HashMap, java.lang.Long, null, null, null, null, null, java.lang.Long, null, null, null, java.lang.Long, null, null, null, java.lang.Long, null, java.lang.Long, null, java.lang.ref.SoftReference, java.util.concurrent.locks.ReentrantReadWriteLock$Sync$HoldCounter, null, java.lang.Long, null, null, java.lang.Long, java.util.HashMap, null, null, null, org.dom4j.DocumentFactory, java.lang.Long, java.lang.Long, null, null, null, java.lang.Long, null, null, java.lang.Long, java.lang.Boolean, null, null, java.lang.Long, null, java.lang.Long, [Ljava.text.SimpleDateFormat;, org.apache.lucene.index.SegmentTermEnum, null, java.lang.Long, null, java.lang.Long, null, null, java.lang.Long, null, java.lang.ref.SoftReference, null, null, null, java.lang.String, null, java.util.concurrent.locks.ReentrantReadWriteLock$Sync$HoldCounter, null, java.util.Stack, null, null, null, java.lang.Long, null, null, null, java.lang.Long, org.apache.catalina.core.ApplicationContext$DispatchData, null, null, null, java.lang.Long, null, null, java.util.LinkedList, java.lang.Long, java.util.HashMap, java.lang.Long, null, null, null, null, null, java.lang.Long, null, null, null, java.lang.Long, java.lang.Long, java.lang.Long, null, java.util.LinkedList, java.lang.Long, null, null, null, null, com.opensymphony.xwork.ActionContext, null, null, java.lang.Long, null,] afterFilter http://wikis-tst/pages/doexportpage.action
The output above shows that when hitting /pages/doexportpage.action, the number of objects in the threadlocal store increases by 16, and one of these 16 references is a reference to org.apache.xerces.parsers.SAXParser.
I don't think that there is a way to use reflection to find out where the instance of SAXParser was instantiated. Other tools like BTrace should reveal this with a little effort.
The output above reveals other questionable references, e.g. com.opensymphony.xwork.ActionContext, this is something that should be looked at as well.
I posted the logs at CSP-30409. I'll try to dig in the heap dump a bit more to see if I can find any other useful info.
Hi Igor,
I can't provide too much in the way of suggestions unfortunately. From the screenshots you have provided, it appears that the SAXParser objects which are being leaked contain FOP objects, and we only use FOP for PDF exports (and space exports run in seperate threads). Do all of the 63 SAXParser objects contain FOP objects in them?
I was hoping you would be able to reproduce this problem by replaying some of the PDF page export requests. I'm assuming only a small number of requests would actually be for PDF exports.
Regards,
Andrew Lynch
Hi Andrew,
What do you want me to look for in the access logs (we do have these logs)? Just pdf export activity in general or something more specific?
We had 25 million requests captured in the access logs for February, so it would be good if you could help me narrow down what we are looking for.
Hi Igor,
I haven't been able to find anything which strores SaxParser objects as ThreadLocals.
As far as I know, the only place we use FOP is for PDF exports (of Pages / Spaces). I've done both of these operations and I don't appear to have any lingering instances after completion of the export.
We may unfortunately require more investigation on your part.
Given that the threads were all http serving threads, it's likely that PDF export of pages may have lead to the creation of these objects, but I haven't been able to reproduce this.
It might be necessary to analyse your access logs (if you have them).
Regards,
Andrew Lynch
Thanks a lot Igor for these detailed descriptions. We will have a look very soon.
Hi I have this problem as well, where can I download the xalan jar file with this fix?