[CONFSERVER-12319] Adding users to a large group is slow with default (Hibernate) user management Created: 03/Jul/2008  Updated: 17/Feb/2017  Resolved: 21/Jan/2010

Status: Resolved
Project: Confluence Server
Component/s: None
Affects Version/s: 2.8
Fix Version/s: 3.1.1

Type: Bug Priority: High
Reporter: Igor Minar Assignee: Andrew Lynch [Atlassian]
Resolution: Fixed Votes: 11
Labels: affects-server, bugfix_support_backlog, performance, permissions, users&groups
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

MySQL5


Attachments: Zip Archive search-3.37.zip    
Issue Links:
Cloners
was cloned as CONFSERVER-18347 Adding LDAP User to a large internal ... Resolved
Reference
relates to CONFSERVER-10030 db2: queries that use 'lower' do not ... Resolved
is related to CONFSERVER-14989 Possible net.sf.hibernate.impl.Sessio... Resolved
is related to CONFSERVER-13754 HibernateGroupManager.hasExternalMemb... Resolved
is related to CONFSERVER-8675 Support for thousands of groups needs... Resolved
Support reference count: 4
Participants:
Last Touched By: Katherine Yabut
Last commented: 8 years, 12 weeks, 3 days ago
Internal Complexity: 4
Internal Value: 6
Reviewers:
Chris Kiehl

 Description   

While debugging an outage at wikis.sun.com I noticed that the code in HibernateUserManager#addMembership generates some extremely inefficient queries that were giving our db and network a hard time:

membership = dGroup.getLocalMembers();   // lazy Hibernate collection of all group members

if (membership == null)
{
    membership = new HashSet();
    dGroup.setLocalMembers(membership);
}

membership.add(user);   // forces full initialization of the collection before the add

The last line of the code translates to:

DEBUG 2008-07-02 11:35:43,801 [service-j2ee-3] BatcherImpl:log - select localmembe0_.groupid as groupid__, localmembe0_.userid as userid__, defaulthib1_.id as id0_, defaulthib1_.name as name0_, defaulthib1_.password as password0_, defaulthib1_.email as email0_, defaulthib1_.created as created0_, defaulthib1_.fullname as fullname0_ from local_members localmembe0_ inner join users defaulthib1_ on localmembe0_.userid=defaulthib1_.id where localmembe0_.groupid=?

DEBUG 2008-07-02 11:35:43,806 [service-j2ee-3] BatcherImpl:log - insert into local_members (groupid, userid) values (?, ?)

Which means: retrieve *all* members of a given group and then insert the user into the db.

If you run this query on our db with 25k users in a group, you run into some really big problems. By that I mean that the query can easily run for several minutes and affect the overall db performance.

The code should be rewritten so that the uniqueness constraint is checked with a targeted SELECT, followed by the INSERT only if no dupe is found; otherwise this code will never scale.
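A minimal sketch of that SELECT-then-INSERT approach, assuming plain JDBC (for example via the connection underlying the Hibernate session) and the local_members table from the query log above; the class and method names are illustrative, not the actual Confluence code:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class MembershipDao
{
    private final Connection connection; // e.g. obtained from the Hibernate session

    public MembershipDao(Connection connection)
    {
        this.connection = connection;
    }

    // Adds the user to the group unless the membership row already exists.
    public void addMembership(long groupId, long userId) throws SQLException
    {
        // 1. Cheap uniqueness check: a single indexed row lookup instead of
        //    loading all 25k members into a java.util.Set.
        PreparedStatement select = connection.prepareStatement(
                "select 1 from local_members where groupid = ? and userid = ?");
        try
        {
            select.setLong(1, groupId);
            select.setLong(2, userId);
            ResultSet rs = select.executeQuery();
            boolean alreadyMember = rs.next();
            rs.close();
            if (alreadyMember)
            {
                return;
            }
        }
        finally
        {
            select.close();
        }

        // 2. Insert just the one new membership row.
        PreparedStatement insert = connection.prepareStatement(
                "insert into local_members (groupid, userid) values (?, ?)");
        try
        {
            insert.setLong(1, groupId);
            insert.setLong(2, userId);
            insert.executeUpdate();
        }
        finally
        {
            insert.close();
        }
    }
}

Assuming the usual composite primary key on (groupid, userid), both statements are single indexed lookups, so the cost stays flat no matter how large the group grows.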



 Comments   
Comment by Don Willis [ 03/Jul/2008 ]

Hi Igor,

Thanks for raising this issue.
While I obviously agree that's a very inefficient way to perform the operation, I'm also surprised that such a query would take several minutes, even with 25,000 users in the group.
Are you missing an index on either local_members.groupid or users.id?
What does MySQL say if you ask it to EXPLAIN the actual query that runs?

Cheers,
Don

Comment by Igor Minar [ 03/Jul/2008 ]

Hi Don,

Here is the query plan:

mysql> explain select localmembe0_.groupid as groupid__, localmembe0_.userid as userid__, defaulthib1_.id as id0_, defaulthib1_.name as name0_, defaulthib1_.password as password0_, defaulthib1_.email as email0_, defaulthib1_.created as created0_, defaulthib1_.fullname as fullname0_ from local_members localmembe0_ inner join users defaulthib1_ on localmembe0_.userid=defaulthib1_.id where localmembe0_.groupid=13205505;
+----+-------------+--------------+--------+----------------------------+---------+---------+---------------------------+-------+-------------+
| id | select_type | table        | type   | possible_keys              | key     | key_len | ref                       | rows  | Extra       |
+----+-------------+--------------+--------+----------------------------+---------+---------+---------------------------+-------+-------------+
|  1 | SIMPLE      | localmembe0_ | ref    | PRIMARY,FK6B8FB445CE2B3226 | PRIMARY | 8       | const                     | 22686 | Using index | 
|  1 | SIMPLE      | defaulthib1_ | eq_ref | PRIMARY                    | PRIMARY | 8       | wikis.localmembe0_.userid |     1 |             | 
+----+-------------+--------------+--------+----------------------------+---------+---------+---------------------------+-------+-------------+

I think that we are facing several issues here, a combination of which causes our db to perform terribly.

Issues we know about are:

cheers,
Igor

Comment by Don Willis [ 10/Jul/2008 ]

CONF-10030 is definitely a problem, but that query plan shows that this query should run very fast. I suspect the MySQL people will be able to help you with your MySQL configuration to avoid your thread-thrashing problems.
Now that you've addressed CONF-10030 in your instance, are you still having problems with the groups query?

Comment by Igor Minar [ 11/Jul/2008 ]

wikis.sun.com has been stable since the patch for CONF-10030 was applied, but I still consider addMembership to be an issue waiting to bite us at any time, especially as the number of registered users grows rapidly.

Running the query above currently takes quite some time:

25437 rows in set (13.57 sec)

Other than this query, the db is screaming fast at the moment.

Comment by Andrew Lynch [Atlassian] [ 11/Nov/2008 ]

Hi Igor,

We have an experimental fix for this issue which will be included with 2.10. It changes the mapping of the relationship between Users and Groups, and it is disabled by default because we have not had time to test it exhaustively.
It can be enabled by setting the system property -Dcom.atlassian.user.experimentalMapping=true.
We hope to make this the default behaviour by 2.10.

Regards,
Andrew Lynch

Comment by Igor Minar [ 11/Nov/2008 ]

Awesome! I'll test it. Thanks.

What do you expect could go wrong, if anything?

Comment by Andrew Lynch [Atlassian] [ 14/Nov/2008 ]

Users might not be added to groups correctly.

I haven't actually seen any evidence that it doesn't work properly; you will probably be fine, but I think you should exercise some caution and ensure your users / groups are getting updated correctly.

Regards,
Andrew Lynch

Comment by Igor Minar [ 03/Apr/2009 ]

The fix is looking good. We are now using it in production and it makes a huge difference for the cache & db.

thnx

Comment by Igor Minar [ 17/Apr/2009 ]

We found a problem. When the experimental mapping is turned on, we can't remove users from groups via the admin UI. Is there an alternative or enhanced solution?

Comment by Andrew Lynch [Atlassian] [ 29/Apr/2009 ]

Hi Igor,

Looks like a bug; until this is fixed, you won't be able to remove users via the UI while the experimental mapping mode is in effect.
I'll try to get this fixed for 3.0 and I'll provide a patched class for this.

Regards,
Andrew Lynch

Comment by Igor Minar [ 29/Apr/2009 ]

We would appreciate that.

thanks

Comment by Igor Minar [ 07/May/2009 ]

The mapping also breaks adding users to and removing users from a group via the Custom Space User Management plugin.

Comment by Brian M. Thomas [ 20/Aug/2009 ]

We are having a problem that sounds suspiciously similar under similar circumstances, and the workaround is easy to apply.

However, I'd like some clarification about the problems it causes: do they clear up when the flag is turned off, or does it cause persistent database problems? We have concluded that, to avoid logjams when masses of new users sign up at once, we will just pre-register them in bulk. We could do this, except that adding a new user takes at least 2.5s, and bulk-loading users via the SOAP CLI gobbles memory until Confluence crashes after adding a few hundred.

If the problem persists only while the flag is on, we could do the bulk load with it on and then turn it off to resume normal operations.

Comment by Igor Minar [ 20/Aug/2009 ]

Yes, the problem goes away when the flag is removed.

However, I wonder if you'll be able to add users while the flag is on.

The bulk pre-registration sounds like a good solution, but it doesn't work for us, because our main IDM repo has millions of users :-/

Also keep in mind that if you get several new users after the bulk pre-registration then you are back to the original problem.

Per emailed me in the past that a proper fix would be considered for Confluence 3.1; I wonder if that is still the case.

Comment by Anatoli Kazatchkov [ 26/Aug/2009 ]

Hi

We found a problem. When the experimental mapping is turned on, we can't remove users from groups via the admin UI. Is there an alternative or enhanced solution?

This problem will be fixed in 3.0.2, although you will still need to have the -Dcom.atlassian.user.experimentalMapping=true property set. The corresponding problem in the atlassian-user project is USER-258.

We are still considering implementing a proper fix for 3.1.

Anatoli.

Comment by Igor Minar [ 26/Aug/2009 ]

Thanks for the update. I'll be eagerly waiting for 3.0.2. Do you have an approximate release date?

Comment by Anatoli Kazatchkov [ 27/Aug/2009 ]

Igor, we are targeting mid/end September for 3.0.2 release.

Comment by Igor Minar [ 10/Nov/2009 ]

Anatoli,

This bug was not fixed in 3.0.2, and I don't see it fixed in the 3.1 beta either.

This issue has been open for 16 months and prevents us from using the Confluence cluster we've been paying for all this time.

Can you please get it fixed in 3.1 already?

thanks,
Igor

Comment by Andreas Richter [ 19/Nov/2009 ]

Hi,

We have the same problem. At first let me say a few words about my instance, so you know what I'm talking about.
Confluence 3.0.1, clustered, 200,000 db users, all users are members of the confluence-users group, Oracle db.
I have no connection to LDAP, because our LDAP has nested groups, Confluence can't handle them, and I'm not interested in using unsupported code at this point.
And the users would see all the unusable LDAP groups.

At the moment I get the db users synchronized via a webservice triggered by our identity management. We pre-register all our users. Sometimes we have bulk jobs, when we have bought another company or when we do housekeeping.
As long as we had a single installation it worked. A bulk job took a long time, but ehcache was able to handle it. Tangosol is more sensitive. My application monitoring tool tells me that I lose the most time in the Tangosol cache.
At the moment I am trying to get proper byte code instrumentation in place. Hopefully I'll see which method causes the problem or where Oracle made a programming mistake.

Atlassian support suggested that I test this experimental fix. From the previous posts I have seen that there are some problems:

  • It's not possible to remove users via the UI (wouldn't affect me)
  • The mapping also breaks adding and removing group members via the Custom Space User Management plugin. Is this still an issue?
  • Has somebody tried to add a user to a group via the UI (administrators section)? Is this working?

Regards,
Andreas

Comment by Igor Minar [ 19/Nov/2009 ]

Andreas,

This is not a Tangosol issue, but rather a problem caused by incorrect use of Hibernate.

Right now, when you add a single user to your confluence-users group, Hibernate is instructed to pull all 200k of your users from the db, create a Java collection for these users, add the new user to the collection, and then persist the collection to the db (which probably results in a single insert SQL statement that could have been issued without creating the collection in the first place).

So even ehcache is affected by this issue, but there it's not causing major problems with hundreds of thousands of users; things are just slow. When you set up a cluster, the problem is that Tangosol is instructed to replicate the huge collection of users to all of your nodes, and this takes forever and makes the app so busy that it results in cluster panics.
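As a rough illustration of the pattern described above (a sketch only; the Group entity and property names are hypothetical stand-ins, not the actual Confluence classes):

import java.io.Serializable;
import java.util.Set;
import net.sf.hibernate.HibernateException;
import net.sf.hibernate.Session;

public class NaiveMembershipUpdate
{
    // Hypothetical mapped entity standing in for the real group class.
    public static class Group
    {
        private Set localMembers;
        public Set getLocalMembers() { return localMembers; }
        public void setLocalMembers(Set members) { this.localMembers = members; }
    }

    public void addMember(Session session, Serializable groupId, Object user) throws HibernateException
    {
        Group group = (Group) session.load(Group.class, groupId);

        // Touching the lazy collection forces Hibernate to SELECT every
        // existing member (200k rows here) before the single add can happen.
        Set members = group.getLocalMembers();
        members.add(user);

        // The flush writes just one INSERT, but by now the whole collection
        // sits in memory, in the cache, and (in a cluster) is replicated
        // to every node.
        session.flush();
    }
}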

  • the mapping also breaks adding and removing users to a group via the Custom Space User Management plugin. Is this still an issue?

yes, still a problem

  • Has somebody tried to add an user to a group via UI (administrators section). Is this working?

haven't tried this one

/i

Comment by Andreas Richter [ 19/Nov/2009 ]

Hi Igor,

Thanks for your detailed answer. It saves time. I have to think about what I'll do with this information.

Best regards,
Andreas

Comment by Partha Kamal [ 20/Nov/2009 ]

Dear Watchers,

Currently, the fix, which requires you to set the flag -Dcom.atlassian.user.experimentalMapping=true, is included in both 3.0.2 and the upcoming 3.1 release.

The naming has not been changed (i.e. it still says experimentalMapping); however, it is the proper fix for situations where you have a large number of local users AND these are mapped to local groups.

This issue has not been closed, however, since the fix does not cover the situation where you have external users (from LDAP, for example) that are mapped to local groups. Development would like to include this fix as well before closing this issue.
They also need to polish this off (i.e. rename the parameter or remove it altogether).

Kind Regards,
Partha Kamal

Comment by Igor Minar [ 20/Nov/2009 ]

Partha, how can a fix that breaks other Confluence functionality be called a "proper fix"? Am I missing something?

Comment by Anatoli Kazatchkov [ 20/Nov/2009 ]

Hi Igor,

how can a fix that breaks other Confluence functionality be called a "proper fix"?

What functionality are you referring to? If you are talking about the problem you have mentioned before:

We found a problem. When the experimental mapping is turned on, we can't remove users from groups via the admin UI. Is there an alternative or enhanced solution?

then it has been fixed in 3.0.2.

If you meant something else, please let us know so that we can investigate it further.

Anatoli.

Comment by Igor Minar [ 30/Nov/2009 ]

Hi Anatoli,

Sorry, due to how this issue is linked with the other issue, I didn't notice that it was fixed in 3.0.2.

I'm going to test it now and will let you know if the fix really works.

Have you by chance run your test suite against a build with the flag on? That would be a good indicator of whether there are any remaining issues.

thanks,
Igor

Comment by Igor Minar [ 03/Dec/2009 ]

Sadly, even 3.0.2 still can't remove users via the admin interface. It fails with:

net.sf.hibernate.LazyInitializationException: Failed to lazily initialize a collection - no session or session was closed
at net.sf.hibernate.collection.PersistentCollection.initialize(PersistentCollection.java:209)
at net.sf.hibernate.collection.PersistentCollection.write(PersistentCollection.java:84)
at net.sf.hibernate.collection.Set.remove(Set.java:162)
at com.atlassian.user.impl.hibernate.HibernateGroupManager.removeMembership(HibernateGroupManager.java:405)
at com.atlassian.user.impl.cache.CachingGroupManager.removeMembership(CachingGroupManager.java:178)
at com.atlassian.user.impl.delegation.DelegatingGroupManager.removeMembership(DelegatingGroupManager.java:234)
at bucket.user.DefaultUserAccessor.removeMembership(DefaultUserAccessor.java:527)
at com.atlassian.confluence.user.DefaultUserAccessor.removeMembership(DefaultUserAccessor.java:97)
at com.atlassian.confluence.user.DefaultUserAccessor.removeUser(DefaultUserAccessor.java:226)

Comment by Anatoli Kazatchkov [ 03/Dec/2009 ]

Hi Igor,

We ran all our tests against the build with the flag on and found some problems, which are logged in USER-267. A patch jar is attached to that case.

Anatoli.

Comment by Igor Minar [ 04/Dec/2009 ]

Thanks Anatoli, I did some testing and the patched atlassian-user resolved all the known issues.

As far as I can tell at the moment, everything seems to work as expected.

Will the patch make it to 3.1?

cheers,
Igor

Comment by Anatoli Kazatchkov [ 04/Dec/2009 ]

Hi Igor,

The patch almost certainly will NOT make it into 3.1. I would say it is more likely to make it into 3.1.1, but I don't want to disappoint you in case it does not.

Anatoli.
