Details
Type: Suggestion
Resolution: Fixed
Description
NOTE: This suggestion is for Confluence Server. Using Confluence Cloud? See the corresponding suggestion.
To-space allocation failures during G1 garbage collection have caused outages on our own internal instances and for some customers. To mitigate this risk, I would like to add this parameter to the setenv scripts for all Server and DC customers:
-XX:G1ReservePercent=20
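As a rough sketch of what that change could look like in a setenv.sh file: the snippet below prepends the flag to CATALINA_OPTS. The variable name and the prepend pattern are assumptions based on a typical Tomcat-style setenv script, not a copy of the shipped file, so check your own script before editing it.

```shell
#!/bin/sh
# Sketch only: assumes JVM flags are collected in CATALINA_OPTS,
# as in a typical Tomcat-style setenv.sh.
# Prepend the G1 reserve setting while keeping any flags already present.
CATALINA_OPTS="-XX:G1ReservePercent=20 ${CATALINA_OPTS:-}"
export CATALINA_OPTS
```

The flag only takes effect after the application is restarted, and only when the JVM is running with G1 as its collector.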
To-space allocation failures are very slow to recover from because of JDK bugs that have been fixed in Java 9. We need to avoid these collections where possible.
To briefly explain what this parameter does: G1GC splits the heap into regions and allocates those regions to Eden, Survivor and Tenured. When an object needs to be promoted from Eden to Survivor (for example), there must be enough room in a region that is already allocated to Survivor, or the heap must allocate a new region. If all regions are already allocated, you get these failures, and you will then see pauses of 30 seconds to 1 minute (or worse).
The heap keeps a certain percentage of regions free to try to mitigate this scenario: 10% by default. This parameter raises that to 20%. On our internal instances this was enough to eliminate to-space allocation failures entirely, and it stopped our outages.
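To put the percentages in concrete terms, here is a small arithmetic sketch. The 8 GiB heap and 4 MiB region size are assumed figures chosen for illustration; the region count on a real instance depends on its heap settings, and rounding up is a choice made here, not a statement about the JVM's exact behaviour.

```shell
#!/bin/sh
# Illustrative figures only (assumptions, not taken from a real instance):
heap_mib=8192     # 8 GiB heap
region_mib=4      # 4 MiB per G1 region
total_regions=$((heap_mib / region_mib))

# Number of regions held back for a given reserve percentage, rounded up.
reserve_regions() {
  echo $(( (total_regions * $1 + 99) / 100 ))
}

echo "total regions:  $total_regions"
echo "reserve at 10%: $(reserve_regions 10) regions"
echo "reserve at 20%: $(reserve_regions 20) regions"
```

Under these assumptions, doubling the reserve roughly doubles the number of regions kept free as to-space, at the cost of that much less heap being available for ordinary allocation.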
FYI, this is the bug I raised with Oracle, which they have fixed for Java 9: https://bugs.openjdk.java.net/browse/JDK-8155256
Diagnosis
This parameter is helpful if you are experiencing long to-space allocation pauses in your garbage collection. To determine if this is the case:
- Enable GC Logging
- After a few days, or if you experience any outages, review the GC logs:
  - Download GC Viewer
  - Open your GC logs in GC Viewer
  - Click on the Event Details tab
  - Check the pause types you are experiencing. If you have long 'to-space exhausted' pauses, this parameter can help.
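For a quick first check before opening GC Viewer, the logs can also be searched directly. The sketch below counts log lines containing the text "to-space exhausted". It assumes the GC log records that label verbatim in its pause lines, as Java 8 G1 logs with detailed GC logging typically do; the function name is hypothetical, and the log path is whatever your GC logging configuration writes to.

```shell
#!/bin/sh
# Count GC log lines that mention a to-space exhausted pause.
# $1 = path to a GC log file.
count_to_space_exhausted() {
  # grep -c prints the number of matching lines; it exits non-zero when
  # there are none, so "|| true" keeps this usable under "set -e".
  grep -c 'to-space exhausted' "$1" || true
}

# When run as a script, report the count for the log given as the first argument.
if [ -n "${1:-}" ]; then
  count_to_space_exhausted "$1"
fi
```

A non-zero count only shows that such pauses occurred. GC Viewer's Event Details tab is still the place to see how long they lasted.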