- Bug
- Resolution: Fixed
- Medium
- 2.7.1
- None
- None
Symptoms
Crowd becomes unresponsive. A thread dump shows that all threads are in WAITING state, except one which is RUNNABLE and reading from the JDBC socket (SocketInputStream.read) while at the same time holding the WRITE lock in SwitchableTokenManagerImpl.
Postgres logs contain "LOG: could not send data to client: Broken pipe".
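The failure mode above can be sketched in a few lines. This is a minimal, hypothetical illustration (not Crowd's actual code): one thread blocks indefinitely on a socket read with no timeout while holding the write lock, so every other thread parks waiting for the read lock. The class and variable names below are invented for the demo; the "silent" server stands in for a dead database connection.

```java
import java.io.InputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class FrozenTokenManagerSketch {
    public static void main(String[] args) throws Exception {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        // A server that accepts the TCP connection but never sends a byte,
        // standing in for a broken JDBC connection with no socket timeout.
        ServerSocket silentDb = new ServerSocket(0);
        CountDownLatch writeLockHeld = new CountDownLatch(1);

        Thread jdbcThread = new Thread(() -> {
            lock.writeLock().lock();          // the WRITE lock held during the DB call
            writeLockHeld.countDown();
            try (Socket s = new Socket("localhost", silentDb.getLocalPort());
                 InputStream in = s.getInputStream()) {
                in.read();                    // RUNNABLE forever: no data, no timeout
            } catch (Exception ignored) {
            } finally {
                lock.writeLock().unlock();    // never reached while the read blocks
            }
        });
        jdbcThread.setDaemon(true);           // let this demo JVM exit anyway
        jdbcThread.start();
        writeLockHeld.await();

        // Every other request thread ends up here, WAITING for the read lock.
        boolean gotReadLock = lock.readLock().tryLock(500, TimeUnit.MILLISECONDS);
        System.out.println("other threads can authenticate: " + gotReadLock);
    }
}
```

In a thread dump this looks exactly like the symptoms: one RUNNABLE thread inside SocketInputStream.read holding the write lock, and all others WAITING.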
Steps to reproduce
This issue is affecting some customers; I haven't been able to reproduce it locally. The key to reproducing it seems to be killing a connection between Crowd and Postgres in such a way that Postgres believes it's closed ("broken pipe") while Crowd keeps waiting to read from the socket.
This issue seems to happen only when using database token storage.
[CWD-3768] A failure in a single DB connection causes deadlock in Crowd
2.7.2 is as unstable as 2.7.1 for me, even with tokens in memory.
And it seems it's affecting JIRA and other Atlassian apps heavily.
Thank you for your patience.
As described at https://jira.atlassian.com/browse/CWD-3769?focusedCommentId=591918&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-591918, we have simplified the locking around the token storage, making it impossible for an unresponsive database connection to hold essential resources and cause the whole server to freeze. We have also changed the transaction model to eliminate the deadlocks. We believe these changes will fix the problem described in this issue. They will be part of the upcoming Crowd 2.7.2 release.
Nevertheless, as a best practice to improve resilience against unexpected failures, we still recommend setting socket timeouts in your JDBC driver, and transaction timeouts in your database server. Please check the documentation of your database to configure timeouts.
If you still experience deadlocks and stability problems after the upgrade to the upcoming Crowd 2.7.2 release, please open a support ticket. Thank you.
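The locking simplification described above could hypothetically look like the following sketch (this is illustrative only, not the actual Crowd 2.7.2 patch; all names are invented): do the slow database work outside any lock and publish the result atomically, so a stalled JDBC call delays only its own request thread instead of freezing the server.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: instead of holding a write lock across a JDBC call,
// compute the new token-store choice with no lock held, then publish it
// atomically. Readers never block on a thread stuck in a database read.
public class SwitchableStoreSketch {
    private final AtomicReference<String> activeStore = new AtomicReference<>("memory");

    String loadConfiguredStoreFromDatabase() {
        return "database"; // stand-in for a (possibly slow or hung) JDBC query
    }

    void switchStore() {
        String next = loadConfiguredStoreFromDatabase(); // no lock held here
        activeStore.set(next);                           // atomic publish
    }

    public static void main(String[] args) {
        SwitchableStoreSketch s = new SwitchableStoreSketch();
        s.switchStore();
        System.out.println("active store: " + s.activeStore.get());
    }
}
```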
FWIW, we removed the foreign key from table dbo.cwd_user and all of the deadlocks have ceased, and performance has been very good. This resolved the issue for us.
Using SQL Server 2008 R2 64-bit, when using the sqljdbc4.jar driver, Crowd didn't crash whether or not we used database storage of authentication tokens.
When we switched to the jtds-1.2.7.jar driver, since all of our other Atlassian tools are using it, Crowd crashed. Crowd and all of our Atlassian applications became unresponsive too. I had to use a SQL script to turn off database storage of authentication tokens, restore the sqljdbc4.jar driver in the crowd config, and restart the server to get everything back online.
Would appreciate an update comment from Atlassian on this issue, as the last comment was nearly one month ago. How close is a resolution to this issue?
This issue seems Critical to me rather than Major, unless Major is the highest priority. Crowd is the nucleus of all Atlassian tools; this needs to be resolved ASAP.
We just upgraded Crowd from 2.4.2 to 2.7.1 on Windows Server 2008 R2 using SQL Server 2008 R2 both 64-bit and can confirm the comments made previous to this one: Moving Authentication Token Storage to "Memory Cache" did not help.
The only reason I upgraded was to get all of the Atlassian tools using Java 7, otherwise version 2.4.2 wasn't having any issues. I sort of regret the upgrade, due to this bug. I will check Atlassian's Jira issues prior to upgrading in the future.
We are getting deadlocks, and they are filling up our SQL logs; however, Crowd is not crashing. Sessions are timing out frequently, though, across the Atlassian applications (JIRA, Confluence, Fisheye, etc.), requiring our users to re-login frequently, and saving documents in Confluence errors out due to the lost session even when the user has just logged in.
Thanks.
We get the following on the database side (Postgres 9.3):
16387 | crowd | 62073 | 16396 | crowd | | 172.17.2.178 | | 40219 | 2014-02-10 12:00:00.057242+02 | 2014-02-10 12:01:00.607062+02 | 2014-02-10 12:01:00.699423+02 | 2014-02-10 12:01:00.699961+02 | f | idle in transaction | select property0_.property_key as property1_13_0_, property0_.property_name as property2_13_0_, property0_.property_value as property3_13_0_ from cwd_property property0_ where property0_.property_key=$1 and property0_.property_name=$2
16387 | crowd | 34633 | 16396 | crowd | | 172.17.2.178 | | 38868 | 2014-02-10 11:15:47.146698+02 | 2014-02-10 11:16:01.402503+02 | 2014-02-10 11:16:01.405518+02 | 2014-02-10 11:16:01.405521+02 | t | active | insert into cwd_token (directory_id, entity_name, random_number, identifier_hash, random_hash, created_date, last_accessed_date, last_accessed_time, duration, id) values ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10)
16387 | crowd | 25933 | 16396 | crowd | | 172.17.2.178 | | 41748 | 2014-02-10 12:51:59.465915+02 | | 2014-02-10 12:52:34.667549+02 | 2014-02-10 12:52:34.667726+02 | f | idle | DISCARD ALL
16387 | crowd | 34635 | 16396 | crowd | | 172.17.2.178 | | 38870 | 2014-02-10 11:15:48.357651+02 | 2014-02-10 11:16:01.02871+02 | 2014-02-10 11:16:01.356168+02 | 2014-02-10 11:16:01.35656+02 | f | idle in transaction | delete from cwd_token where id=$1
16387 | crowd | 25934 | 16396 | crowd | | 172.17.2.178 | | 41749 | 2014-02-10 12:51:59.47671+02 | | 2014-02-10 12:52:34.668734+02 | 2014-02-10 12:52:34.668846+02 | f | idle | DISCARD ALL
16387 | crowd | 25935 | 16396 | crowd | | 172.17.2.178 | | 41750 | 2014-02-10 12:51:59.477418+02 | 2014-02-10 12:52:11.390548+02 | 2014-02-10 12:52:11.466483+02 | 2014-02-10 12:52:11.466827+02 | f | idle in transaction | select property0_.property_key as property1_13_0_, property0_.property_name as property2_13_0_, property0_.property_value as property3_13_0_ from cwd_property property0_ where property0_.property_key=$1 and property0_.property_name=$2
16387 | crowd | 62074 | 16396 | crowd | | 172.17.2.178 | | 40220 | 2014-02-10 12:00:00.058206+02 | | 2014-02-10 12:00:00.063045+02 | 2014-02-10 12:00:00.063565+02 | f | idle | SHOW TRANSACTION ISOLATION LEVEL
16387 | crowd | 35507 | 16396 | crowd | | 172.17.2.178 | | 38975 | 2014-02-10 11:19:49.546673+02 | 2014-02-10 11:20:20.105722+02 | 2014-02-10 11:20:20.195675+02 | 2014-02-10 11:20:20.196117+02 | f | idle in transaction | select property0_.property_key as property1_13_0_, property0_.property_name as property2_13_0_, property0_.property_value as property3_13_0_ from cwd_property property0_ where property0_.property_key=$1 and property0_.property_name=$2
16387 | crowd | 35508 | 16396 | crowd | | 172.17.2.178 | | 38976 | 2014-02-10 11:19:49.58113+02 | 2014-02-10 11:19:55.930264+02 | 2014-02-10 11:19:56.006547+02 | 2014-02-10 11:19:56.006703+02 | f | idle in transaction | select property0_.property_key as property1_13_0_, property0_.property_name as property2_13_0_, property0_.property_value as property3_13_0_ from cwd_property property0_ where property0_.property_key=$1 and property0_.property_name=$2
16387 | crowd | 26809 | 16396 | crowd | | 172.17.2.178 | | 41855 | 2014-02-10 12:56:00.286072+02 | 2014-02-10 12:56:00.313195+02 | 2014-02-10 12:56:00.395825+02 | 2014-02-10 12:56:00.396254+02 | f | idle in transaction | select property0_.property_key as property1_13_0_, property0_.property_name as property2_13_0_, property0_.property_value as property3_13_0_ from cwd_property property0_ where property0_.property_key=$1 and property0_.property_name=$2
16387 | crowd | 45089 | 16396 | crowd | | 172.17.2.178 | | 39215 | 2014-02-10 11:24:56.00944+02 | 2014-02-10 11:24:56.032012+02 | 2014-02-10 11:24:56.126233+02 | 2014-02-10 11:24:56.126891+02 | f | idle in transaction | select property0_.property_key as property1_13_0_, property0_.property_name as property2_13_0_, property0_.property_value as property3_13_0_ from cwd_property property0_ where property0_.property_key=$1 and property0_.property_name=$2
16387 | crowd | 26983 | 16396 | crowd | | 172.17.2.178 | | 41887 | 2014-02-10 12:57:11.439214+02 | | 2014-02-10 12:59:59.807766+02 | 2014-02-10 12:59:59.80853+02 | f | idle | SHOW TRANSACTION ISOLATION LEVEL
16387 | crowd | 26985 | 16396 | crowd | | 172.17.2.178 | | 41888 | 2014-02-10 12:57:11.45502+02 | 2014-02-10 12:57:11.465899+02 | 2014-02-10 12:57:11.553381+02 | 2014-02-10 12:57:11.553848+02 | f | idle in transaction | select property0_.property_key as property1_13_0_, property0_.property_name as property2_13_0_, property0_.property_value as property3_13_0_ from cwd_property property0_ where property0_.property_key=$1 and property0_.property_name=$2
16387 | crowd | 61210 | 16396 | crowd | | 172.17.2.178 | | 40118 | 2014-02-10 11:56:00.500355+02 | 2014-02-10 11:56:00.525963+02 | 2014-02-10 11:56:00.608745+02 | 2014-02-10 11:56:00.60894+02 | f | idle in transaction | select property0_.property_key as property1_13_0_, property0_.property_name as property2_13_0_, property0_.property_value as property3_13_0_ from cwd_property property0_ where property0_.property_key=$1 and property0_.property_name=$2
16387 | crowd | 26986 | 16396 | crowd | | 172.17.2.178 | | 41889 | 2014-02-10 12:57:11.456641+02 | 2014-02-10 12:57:21.703485+02 | 2014-02-10 12:57:21.777868+02 | 2014-02-10 12:57:21.77803+02 | f | idle in transaction | select property0_.property_key as property1_13_0_, property0_.property_name as property2_13_0_, property0_.property_value as property3_13_0_ from cwd_property property0_ where property0_.property_key=$1 and property0_.property_name=$2
16387 | crowd | 27361 | 16396 | crowd | | 172.17.2.178 | | 41940 | 2014-02-10 12:59:08.022304+02 | 2014-02-10 13:01:00.357251+02 | 2014-02-10 13:01:00.43968+02 | 2014-02-10 13:01:00.44007+02 | f | idle in transaction | select property0_.property_key as property1_13_0_, property0_.property_name as property2_13_0_, property0_.property_value as property3_13_0_ from cwd_property property0_ where property0_.property_key=$1 and property0_.property_name=$2
The deadlock is caused by pid 34633.
---Jaco
I tried adding ?socketTimeout=30 to my JDBC URL, but it still doesn't work. This is a brand-new install of Crowd. I went through the installation wizard, logged in afterwards, and was able to use it. After I restarted (I had tried to add a plugin jar), I have not been able to log in since; this is a server with zero activity on it, other than me trying to log in as an admin. I see Postgres showing one backend in "INSERT waiting" and another "idle in transaction".
The backend with status "idle in transaction" is on: delete from cwd_token where id=$1
The active query is: insert into cwd_token (directory_id, entity_name, random_number, identifier_hash, random_hash, created_date, last_accessed_date, last_accessed_time, duration, id) values ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10)
Just to add to the comments already here. Our fresh evaluation setup consists of:
- Centos 6 + OpenJDK 1.7.0
- PostgreSQL 9.2
- Crowd 2.7.1
After a clean new installation and setup, with a completely unloaded server and no data, we are unable to log back in. Crowd hangs while trying to log in, with PostgreSQL showing: postgres: crowd crowd_db_01 127.0.0.1(33930) INSERT waiting
And the following in logs:
ERROR: duplicate key value violates unique constraint "uk_token_id_hash"
DETAIL: Key (identifier_hash)=(ggFyZinu0tfC85Ccyz4fRA00) already exists.
STATEMENT: insert into cwd_token (directory_id, entity_name, random_number, identifier_hash, random_hash, created_date, last_accessed_date, last_accessed_time, duration, id) values ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10)
LOG: could not send data to client: Broken pipe
FATAL: connection to client lost
Adding socket timeout to the connection URL did not help.
Further, we are on Crowd v2.7.0. We've seen the PostgreSQL log error referenced in the first message alongside a couple of instances of Crowd logging the "Directory 'xxxx' is not functional during authentication of 'uuuuu'. Skipped." message; however, we have many more instances of Crowd logging the "Directory...not functional" errors without it.
We're seeing this on a lightly loaded server.
Config: Ubuntu 12.04.4 LTS (GNU/Linux 3.2.0-56-generic x86_64), 8GB RAM, 4 CPU (Intel Xeon X5675 @ 3.07GHz), PostgreSQL 9.1.11-0ubuntu, VMware vSphere 5.5, hosting JIRA, Crowd & Confluence on the same machine. Thus we do have three applications all accessing the same database engine on the machine they run on. Considering our load, that should not be a problem.
I will try to use the socketTimeout during off-peak hours.
Maybe it would be good to ship Crowd with a bundled JRE and an installer, as with the other Atlassian products (JIRA, Confluence).
Hi,
we experienced this issue after migrating to a new server.
The old one was Ubuntu Hardy (8.04) with PostgreSQL 8.4 and Java 6. The new OS is Debian Wheezy (amd64) with the latest Java 1.7 from Oracle and PostgreSQL 9.1. We got the deadlock right after the first login attempt.
We're investigating this issue. If anyone is experiencing server crashes and is using Postgres, we suggest you modify the JDBC connection URL in crowd.cfg.xml to add the parameter ?socketTimeout=30. For instance, in my case it looks like:
<property name="hibernate.connection.url">jdbc:postgresql://localhost:5432/crowd?socketTimeout=30</property>
Please let us know if that improves the stability of the server. Thank you.
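For anyone configuring a data source programmatically rather than through crowd.cfg.xml, the same setting can be passed as a driver property. This is a sketch under the assumption that the PostgreSQL JDBC driver is in use (its socketTimeout is in seconds; other drivers use different property names), with placeholder credentials:

```java
import java.util.Properties;

// Sketch: building PostgreSQL JDBC connection properties with a socket
// timeout, equivalent to appending ?socketTimeout=30 to the URL.
public class JdbcTimeoutSketch {
    static Properties withSocketTimeout(int seconds) {
        Properties props = new Properties();
        props.setProperty("user", "crowd");      // placeholder credentials
        props.setProperty("password", "secret");
        props.setProperty("socketTimeout", String.valueOf(seconds));
        return props;
    }

    public static void main(String[] args) {
        Properties p = withSocketTimeout(30);
        // DriverManager.getConnection("jdbc:postgresql://localhost:5432/crowd", p)
        // would then abort any socket read that stalls longer than 30 seconds.
        System.out.println("socketTimeout=" + p.getProperty("socketTimeout"));
    }
}
```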
This issue has similar effects to CWD-3692 (Crowd freezes), but different causes. In particular, this issue does not require the Crowd server to be under high load. Once the situation described in the "Symptoms"/"Steps to reproduce" sections above happens, Crowd will eventually crash.
This issue also bears some resemblance to CWD-3568. In particular, we have observed the line "ERROR: duplicate key value violates unique constraint "cwd_token_identifier_hash_key" STATEMENT: insert into cwd_token (directory_id, entity_name, random_number, identifier_hash, random_hash, created_date, last_accessed_date, last_accessed_time, duration, id) values ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10)" in the Postgres logs just before the "LOG: could not send data to client: Broken pipe". The effects are quite different, though: CWD-3568 never caused a server crash, just some requests to fail.
The "LOG: could not send data to client: Broken pipe" line was also seen in CWD-3495.
Hi danilo.tuler, we're sorry to hear that you're having problems with 2.7.2. I've noticed that you've opened CWD-3915. The cause of your problems with 2.7.2 seems to be unrelated to this issue (CWD-3768).