Crowd Data Center / CWD-6276

Running Crowd with Oracle Database Native Network Encryption degrades performance or causes outages

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Low
    • Fix Version: None
    • Affects Version: 5.1.2
    • Component: Database

      Issue Summary

      Oracle provides a feature called Native Network Encryption (see ORACLE-BASE - Native Network Encryption for Database Connections). This feature was previously part of the Advanced Security Option license, and provides connection encryption without requiring client-side configuration.

      When this feature is enabled, it adds 350ms or more to the time required to establish a database connection. This alone causes noticeable performance degradation, but when combined with c3p0, the default database connection pool manager in Crowd, it can cause intermittent outages and extreme performance degradation.

      Oracle have stated that this latency is working as intended: Slow Connection Using 12c Client When Network Encryption Is Enabled

      This is reproducible on Data Center: yes

      Steps to Reproduce

      1. Install any version of Crowd
      2. Install Oracle DB 11g or later with Native Network Encryption enabled
      3. Introduce load to the system. The problem is usually exacerbated with a larger value of hibernate.c3p0.max_size in crowd.cfg.xml, due to the nature of the c3p0 pool scaling bug.
      4. Monitor Crowd for delayed or timeout responses
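
      For reference, the encryption setting in step 2 is typically enabled server-side in sqlnet.ora. A sketch, assuming AES256/SHA256 as the algorithm choices (your site's list may differ):

```
# $ORACLE_HOME/network/admin/sqlnet.ora (server side)
SQLNET.ENCRYPTION_SERVER = REQUIRED
SQLNET.ENCRYPTION_TYPES_SERVER = (AES256)
SQLNET.CRYPTO_CHECKSUM_SERVER = REQUIRED
SQLNET.CRYPTO_CHECKSUM_TYPES_SERVER = (SHA256)
```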

      Expected Results

      Crowd should continue operating normally, scaling the size of the c3p0 pool appropriately.

      Actual Results

      There is a prolonged delay in establishing database connections, which causes c3p0 to get stuck in a loop attempting to obtain additional database connections. Because each new connection is slow to establish, growing the pool takes far longer than normal.

      Crowd will remain unresponsive until it reaches the c3p0 maximum pool size for the node.

      This issue is not visible in the logs by default, but the following KB provides additional details on how to diagnose it: Confluence Unresponsive Due to High Database Connection Latency (written for Confluence, which used c3p0 until Confluence 7.14, not inclusive; a KB article will be drafted for Crowd shortly).

      Thread dumps will show unresponsive threads waiting on com.mchange.v2.resourcepool.BasicResourcePool.awaitAvailable for long periods, and this is indicative of the problem.
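
      To confirm the pattern quickly, grep a captured thread dump for that wait site. Shown here against an inline sample; in practice, point it at a dump taken with `jstack <pid>`:

```shell
# Count threads parked in c3p0's pool wait; a persistently high count
# across successive dumps indicates connection-pool starvation.
grep -c 'BasicResourcePool.awaitAvailable' <<'EOF'
"http-nio-8095-exec-103" waiting on condition
        at java.base@11.0.21/java.lang.Object.wait(Native Method)
        at com.mchange.v2.resourcepool.BasicResourcePool.awaitAvailable(BasicResourcePool.java:1503)
EOF
# prints 1 for this sample
```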

      Additionally, if the StuckThreadDetectionValve is enabled in the <Host> block within server.xml with an appropriate threshold value, the Tomcat (catalina) logs will show the same stuck-thread stack trace. For example:

      <Valve className="org.apache.catalina.valves.StuckThreadDetectionValve" threshold="60"/>
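
      For context, the valve sits directly inside the <Host> element of Tomcat's conf/server.xml. A sketch using Tomcat's default Host attributes, which may differ in your install:

```xml
<Host name="localhost" appBase="webapps" unpackWARs="true" autoDeploy="true">
    <!-- Logs a stack trace for any request thread active longer than the threshold (seconds) -->
    <Valve className="org.apache.catalina.valves.StuckThreadDetectionValve" threshold="60"/>
</Host>
```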
      

      The Tomcat (catalina) log will then show:

      15-Jun-2024 11:16:50.109 WARNING [ContainerBackgroundProcessor[StandardEngine[Catalina]]] org.apache.catalina.valves.StuckThreadDetectionValve.notifyStuckThreadDetected Thread [http-nio-8095-exec-103 url: /crowd/rest/usermanagement/1/authentication] (id=[306]) has been active for [67,206] milliseconds (since [6/15/24, 11:15 AM]) to serve the same request for [https://testsite.atlassian.com/crowd/rest/usermanagement/1/authentication?username=user55] and may be stuck (configured threshold for this StuckThreadDetectionValve is [60] seconds). There is/are [138] thread(s) in total that are monitored by this Valve and may be stuck.
              java.lang.Throwable
                      at java.base@11.0.21/java.lang.Object.wait(Native Method)
                      at com.mchange.v2.resourcepool.BasicResourcePool.awaitAvailable(BasicResourcePool.java:1503)
                      at com.mchange.v2.resourcepool.BasicResourcePool.prelimCheckoutResource(BasicResourcePool.java:644)
                      at com.mchange.v2.resourcepool.BasicResourcePool.checkoutResource(BasicResourcePool.java:554)
                      at com.mchange.v2.c3p0.impl.C3P0PooledConnectionPool.checkoutAndMarkConnectionInUse(C3P0PooledConnectionPool.java:758)
                      at com.mchange.v2.c3p0.impl.C3P0PooledConnectionPool.checkoutPooledConnection(C3P0PooledConnectionPool.java:685)
                      ....
      

      Workaround

      A workaround is detailed in this KB: Confluence Unresponsive Due to High Database Connection Latency. In essence, the quickest option is workaround #2, which can be implemented by:

      1. Editing the crowd.cfg.xml
      2. Changing the hibernate.c3p0.timeout value from the default of 30 to a larger value (e.g. 900 or 1800). For example, change:
        <property name="hibernate.c3p0.timeout">30</property>
        

        to

        <property name="hibernate.c3p0.timeout">900</property>
        

        If in doubt, set 1800 and validate that the problem is resolved, then try a lower value, choosing one that is still high enough to prevent the problem.

      3. Restart Crowd for the setting to take effect. This must be completed on all nodes if Crowd is clustered.
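
      The edit in step 2 can also be scripted. A minimal sketch, assuming the default single-line property format (demonstrated on a throwaway copy; point CFG at the real crowd.cfg.xml in the Crowd home directory, and back it up first):

```shell
# Create a throwaway stand-in for crowd.cfg.xml.
CFG=$(mktemp)
echo '<property name="hibernate.c3p0.timeout">30</property>' > "$CFG"

# Raise the c3p0 timeout from the default 30 to 900 seconds.
sed -i 's|\(name="hibernate.c3p0.timeout">\)30<|\1900<|' "$CFG"

grep 'hibernate.c3p0.timeout' "$CFG"   # now shows the 900 value
```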

      However, it may be preferable to implement SSL/TLS to the database with a proper certificate exchange, or to disable Native Network Encryption entirely.
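
      If the TLS route is taken, the JDBC URL in crowd.cfg.xml would switch to Oracle's TCPS listener. A sketch with placeholder host, port, and service name (2484 is the conventional TCPS port):

```
jdbc:oracle:thin:@(DESCRIPTION=
  (ADDRESS=(PROTOCOL=tcps)(HOST=db.example.com)(PORT=2484))
  (CONNECT_DATA=(SERVICE_NAME=crowd)))
```

      The Oracle thin driver also needs to trust the database certificate, e.g. via the javax.net.ssl.trustStore and javax.net.ssl.trustStorePassword system properties.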


              Assignee: Unassigned
              Reporter: Malcolm Ninnes (mninnes@atlassian.com)
              Affected customers: 0
              Watchers: 4