Adding users to a large group is slow with default (Hibernate) user management

      While debugging an outage at wikis.sun.com I noticed that the code in HibernateUserManager#addMembership generates some extremely inefficient queries that were giving our db and network a hard time:

      membership = dGroup.getLocalMembers();
      
      if (membership == null)
      {
             membership = new HashSet();
             dGroup.setLocalMembers(membership);
      }
      
      membership.add(user);
      

      The last line of the code translates to:

      DEBUG 2008-07-02 11:35:43,801 [service-j2ee-3] BatcherImpl:log - select localmembe0_.groupid as groupid__, localmembe0_.userid as userid__, defaulthib1_.id as id0_, defaulthib1_.name as name0_, defaulthib1_.password as password0_, defaulthib1_.email as email0_, defaulthib1_.created as created0_, defaulthib1_.fullname as fullname0_ from local_members localmembe0_ inner join users defaulthib1_ on localmembe0_.userid=defaulthib1_.id where localmembe0_.groupid=?
      
      DEBUG 2008-07-02 11:35:43,806 [service-j2ee-3] BatcherImpl:log - insert into local_members (groupid, userid) values (?, ?)
      

      Which means: retrieve *all* members of the given group and then insert the user into the db.

      If you run this query on our db with 25k users in a group, you run into some really big problems. By that I mean that the query can easily run several minutes and affect the overall db performance.

      The code should be rewritten so that the uniqueness constraint is checked by a SELECT and if no dupe is found the INSERT can follow, otherwise this code will never scale.
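
The difference between the two approaches can be illustrated with a small in-memory model. This is only a sketch under stated assumptions, not the atlassian-user implementation: the class `MembershipSketch`, its methods and the `rowsExamined` counter are hypothetical names, and a plain `Set` stands in for the local_members table. It shows that the collection-based add touches every existing member of the group, while a targeted existence check followed by an insert touches a single key.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical in-memory stand-in for the local_members table (not real atlassian-user code). */
class MembershipSketch {
    // Each entry is "groupId:userId" and stands in for one (groupid, userid) row.
    private final Set<String> rows = new HashSet<>();

    // Counts member rows read by the load-all path and keyed lookups made by the check path.
    long rowsExamined = 0;

    private static String key(long groupId, long userId) {
        return groupId + ":" + userId;
    }

    /** Mimics the reported behaviour: read every member of the group, then add the new one. */
    boolean addByLoadingCollection(long groupId, long userId) {
        Set<Long> members = new HashSet<>();
        String prefix = groupId + ":";
        for (String row : rows) {
            if (row.startsWith(prefix)) {
                rowsExamined++; // one row read per existing member
                members.add(Long.parseLong(row.substring(prefix.length())));
            }
        }
        if (!members.add(userId)) {
            return false; // already a member
        }
        rows.add(key(groupId, userId));
        return true;
    }

    /** Mimics the proposal: look up the single (group, user) pair, insert only if absent. */
    boolean addByCheckThenInsert(long groupId, long userId) {
        rowsExamined++; // a single keyed lookup, independent of group size
        if (rows.contains(key(groupId, userId))) {
            return false; // duplicate found, no insert
        }
        rows.add(key(groupId, userId));
        return true;
    }

    public static void main(String[] args) {
        MembershipSketch table = new MembershipSketch();
        for (long user = 0; user < 25_000; user++) {
            table.addByCheckThenInsert(1L, user);
        }

        table.rowsExamined = 0;
        table.addByLoadingCollection(1L, 100_000L);
        System.out.println("load-all rows examined: " + table.rowsExamined);

        table.rowsExamined = 0;
        table.addByCheckThenInsert(1L, 100_001L);
        System.out.println("check-then-insert lookups: " + table.rowsExamined);
    }
}
```

Against a real database the check and the insert would have to run in one transaction, or rely on a uniqueness constraint, to stay correct under concurrent writes; the sketch ignores that.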

            [CONFSERVER-12319] Adding users to a large group is slow with default (Hibernate) user management

            Anatoli added a comment -

            Hi Igor,

            the patch almost certainly will NOT make it into 3.1. I would say it is more likely to make it into 3.1.1, but I don't want to disappoint you in case it does not.

            Anatoli.

            Igor Minar added a comment -

            Thanks Anatoli, I did some testing and the patched atlassian-user resolved all the known issues.

            As far as I can tell at the moment, everything seems to work as expected.

            Will the patch make it to 3.1?

            cheers,
            Igor

            Anatoli added a comment -

            Hi Igor,

            We ran all our tests against the build with the flag and found some problems that are logged in USER-267. A patch jar is attached to that case.

            Anatoli.

            Igor Minar added a comment -

            sadly, even 3.0.2 still can't remove users via the admin interface. It fails with:

            net.sf.hibernate.LazyInitializationException: Failed to lazily initialize a collection - no session or session was closed
            at net.sf.hibernate.collection.PersistentCollection.initialize(PersistentCollection.java:209)
            at net.sf.hibernate.collection.PersistentCollection.write(PersistentCollection.java:84)
            at net.sf.hibernate.collection.Set.remove(Set.java:162)
            at com.atlassian.user.impl.hibernate.HibernateGroupManager.removeMembership(HibernateGroupManager.java:405)
            at com.atlassian.user.impl.cache.CachingGroupManager.removeMembership(CachingGroupManager.java:178)
            at com.atlassian.user.impl.delegation.DelegatingGroupManager.removeMembership(DelegatingGroupManager.java:234)
            at bucket.user.DefaultUserAccessor.removeMembership(DefaultUserAccessor.java:527)
            at com.atlassian.confluence.user.DefaultUserAccessor.removeMembership(DefaultUserAccessor.java:97)
            at com.atlassian.confluence.user.DefaultUserAccessor.removeUser(DefaultUserAccessor.java:226)

            Igor Minar added a comment -

            Hi Anatoli,

            sorry, due to how this issue is linked with the other issue, I didn't notice that it was fixed in 3.0.2.

            I'm going to test it now and will let you know if the fix really works.

            Have you by chance run your test suite against a build with the flag on? That would be a good indicator if there are any remaining issues.

            thanks,
            Igor

            Anatoli added a comment -

            Hi Igor,

            how can a fix that breaks other confluence functionality be called "proper fix"?

            What functionality are you referring to? If you are talking about the problem you have mentioned before:

            We found a problem. When the experimental mapping is turned on, we can't remove users from groups via the admin UI. Is there an alternative or enhanced solution?

            then it has been fixed in 3.0.2

            If you meant something else, please let us know so that we can investigate it further.

            Anatoli.

            Igor Minar added a comment -

            Partha, how can a fix that breaks other confluence functionality be called "proper fix"? Am I missing something?

            Partha added a comment -

            Dear Watchers,

            Currently, the fix which requires you to use the flag -Dcom.atlassian.user.experimentalMapping=true is included in both 3.0.2 and the upcoming 3.1 release.

            The naming has not been changed (i.e. it still says experimentalMapping), however it is the proper fix for situations where you have a large number of local users AND these are mapped to local groups.

            This issue has not been closed however, since the fix does not cover the situation where you have external users (from LDAP for example) that are mapped to local groups. Development would like to include this fix as well, before closing this issue.
            Also they need to polish this off (i.e. change the naming of the parameter or remove it altogether).

            Kind Regards,
            Partha Kamal

            Andreas Richter added a comment -

            Hi Igor,

            Thanks for your detailed answer. It saves time. I have to think about what I'll do with this information.

            Best regards,
            Andreas

            Igor Minar added a comment -

            Andreas,

            This is not a Tangosol issue, but rather a problem caused by incorrect use of hibernate.

            Right now when you add a single user to your confluence-users group,
            hibernate is instructed to pull all 200k of your users from the db,
            create a java collection for these users, add the new user to the
            collection and then persist the collection into the db
            (which probably results in a single insert sql which could have been
            used without creating the collection with users in the first place).

            So even ehcache is affected by this issue, but it's not causing major
            issues with hundreds of thousands of users. Things are just slow.
            When you set up a cluster, the problem is that tangosol is instructed
            to replicate the huge collection of users on all of your nodes, and
            this takes forever + makes the app too busy which results in cluster
            panics.

            • the mapping also breaks adding and removing users to a group via the Custom Space User Management plugin. Is this still an issue?

            yes, still a problem

            • Has somebody tried to add an user to a group via UI (administrators section). Is this working?

            haven't tried this one

            /i

            Andreas Richter added a comment -

            Hi,

            We have the same problem. First let me say a few words about my instance, so you know what I'm talking about.
            Confluence 3.0.1, clustered, 200,000 db users, all users are members of the confluence-users group, Oracle db.
            I have no connection to LDAP, because our LDAP has nested groups, which Confluence can't handle, and I'm not interested in using unsupported code at this point.
            Users would also see all the unusable LDAP groups.

            At the moment I get the db users synchronized via a webservice triggered by our identity management. We pre-register all our users. Sometimes we have bulk jobs, when we have bought another company or when we do house-keeping.
            As long as we had a single installation it worked. A bulk job took a long time, but ehcache was able to handle it. Tangosol is more sensitive. My application monitoring tool tells me that I lose the most time in the Tangosol cache.
            At the moment I am trying to get proper byte code instrumentation. Hopefully I'll see which method causes the problem or where Oracle made a programming mistake.

            Atlassian support suggested that I test this experimental fix. From the previous posts I have seen that there are some problems:

            • It's not possible to remove users via UI (wouldn't affect me)
            • the mapping also breaks adding and removing users to a group via the Custom Space User Management plugin. Is this still an issue?
            • Has somebody tried to add a user to a group via UI (administrators section)? Is this working?

            Regards,
            Andreas

            Igor Minar added a comment -

            Anatoli,

            This bug was not fixed in 3.0.2 and I don't see it as fixed in 3.1 beta either.

            This issue has been open for 16 months and prevents us from using confluence cluster for which we've been paying all this time.

            Can you please get it fixed in 3.1 already?

            thanks,
            Igor

            Anatoli added a comment -

            Igor, we are targeting mid/end September for 3.0.2 release.

            Igor Minar added a comment -

            Thanks for the update. I'll be eagerly waiting for 3.0.2. Do you have some approximate release date?

            Anatoli added a comment -

            Hi

            We found a problem. When the experimental mapping is turned on, we can't remove users from groups via the admin UI. Is there an alternative or enhanced solution?

            This problem will be fixed in 3.0.2 although you will still need to have -Dcom.atlassian.user.experimentalMapping=true property set. The corresponding problem in the atlassian-user project is [USER-258].

            We are still considering implementing a proper fix for 3.1.

            Anatoli.

            Igor Minar added a comment -

            Yes, the problem goes away when the flag is removed.

            However, I wonder if you'll be able to add users while the flag is on.

            The bulk pre-registration sounds like a good solution, but it doesn't work for us, because our main IDM repo has millions of users :-/

            Also keep in mind that if you get several new users after the bulk pre-registration then you are back to the original problem.

            Per emailed me in the past that a proper fix would be considered for Confluence 3.1; I wonder if that is still the case.

            Brian M. Thomas added a comment -

            We are having a problem that sounds suspiciously similar under similar circumstances, and the workaround is easy to apply.

            However, I'd like some clarification about the problems it causes: do they clear up when the flag is turned off, or does it cause persistent database problems? We have concluded that to avoid logjams when masses of new users sign up at once, we will just pre-register them in bulk. We could do this if adding a new user didn't take at least 2.5s and if bulk-loading them via the SOAP CLI didn't gobble memory until Confluence crashes after adding a few hundred.

            If the problem persists only while the flag is on, we could do the bulk load with it on and then turn it off to resume normal operations.

            Igor Minar added a comment -

            the mapping also breaks adding and removing users to a group via the Custom Space User Management plugin

            Igor Minar added a comment -

            we would appreciate that.

            thanks

            Andrew Lynch (Inactive) added a comment - - edited

            Hi Igor,

            Looks like a bug; You won't be able to remove users via the UI while the experimental mapping mode is in effect until this is fixed.
            I'll try to get this fixed for 3.0 and I'll provide a patched class for this.

            Regards,
            Andrew Lynch

            Igor Minar added a comment -

            We found a problem. When the experimental mapping is turned on, we can't remove users from groups via the admin UI. Is there an alternative or enhanced solution?

            Igor Minar added a comment -

            the fix is looking good. We are now using it in production and it makes a huge difference for the cache & db.

            thnx

            Andrew Lynch (Inactive) added a comment -

            Users might not be added to groups correctly

            I haven't actually seen any evidence that it doesn't work properly, you will probably be fine but I think you should exercise some caution and ensure your users / groups are getting updated correctly.

            Regards,
            Andrew Lynch

            Igor Minar added a comment - - edited

            Awesome! I'll test it. Thanks.

            What do you expect could go wrong, if anything?

            Andrew Lynch (Inactive) added a comment -

            Hi Igor,

            We have an experimental fix for this issue which will be included with 2.10. It makes changes to the relationship between Users and Groups and is disabled by default due to the fact we have not had time to test it exhaustively.
            It can be enabled by providing the system property -Dcom.atlassian.user.experimentalMapping=true.
            We hope to make this the default behaviour by 2.10.

            Regards,
            Andrew Lynch
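
For readers unfamiliar with `-D` flags, the following is a minimal sketch of how a JVM system property of this kind can gate behaviour. It is an illustration only: the class `ExperimentalMappingFlag` and its methods are hypothetical, and nothing here is taken from the atlassian-user source. Only the property name comes from the comment above, and `Boolean.getBoolean` is the standard JDK call for reading a boolean system property.

```java
/** Hypothetical illustration of gating behaviour on a JVM system property. */
class ExperimentalMappingFlag {
    // Property name as quoted in the thread; passed to the JVM as -D<name>=true.
    static final String PROPERTY = "com.atlassian.user.experimentalMapping";

    /** True only when the property exists and equals "true" (case-insensitive). */
    static boolean isEnabled() {
        return Boolean.getBoolean(PROPERTY);
    }

    /** Placeholder for choosing between two code paths; the labels are illustrative. */
    static String describe() {
        return isEnabled() ? "experimental mapping" : "default mapping";
    }

    public static void main(String[] args) {
        System.out.println(PROPERTY + " -> " + describe());
    }
}
```

Because the property is unset unless it is passed on the JVM command line, the experimental path stays off by default, which matches the "disabled by default" behaviour described in the comment.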

            Igor Minar added a comment -

            wikis.sun.com has been stable since the patch for CONF-10030 was applied, but I still consider addMembership to be an issue waiting to bite us at any time, especially as the number of registered users grows rapidly.

            Running the query above currently takes quite some time:

            25437 rows in set (13.57 sec)
            

            Other than this query, the db is screaming fast at the moment.

            Don Willis added a comment -

            CONF-10030 is definitely a problem, but that query plan shows that this query should run very fast. I suspect the mysql people will be able to help you with your mysql configuration to avoid your thread thrashing problems.
            Now that you've addressed CONF-10030 in your instance are you still having problems with the groups query?

            Igor Minar added a comment -

            Hi Don,

            Here is the query plan:

            mysql> explain select localmembe0_.groupid as groupid__, localmembe0_.userid as userid__, defaulthib1_.id as id0_, defaulthib1_.name as name0_, defaulthib1_.password as password0_, defaulthib1_.email as email0_, defaulthib1_.created as created0_, defaulthib1_.fullname as fullname0_ from local_members localmembe0_ inner join users defaulthib1_ on localmembe0_.userid=defaulthib1_.id where localmembe0_.groupid=13205505;
            +----+-------------+--------------+--------+----------------------------+---------+---------+---------------------------+-------+-------------+
            | id | select_type | table        | type   | possible_keys              | key     | key_len | ref                       | rows  | Extra       |
            +----+-------------+--------------+--------+----------------------------+---------+---------+---------------------------+-------+-------------+
            |  1 | SIMPLE      | localmembe0_ | ref    | PRIMARY,FK6B8FB445CE2B3226 | PRIMARY | 8       | const                     | 22686 | Using index | 
            |  1 | SIMPLE      | defaulthib1_ | eq_ref | PRIMARY                    | PRIMARY | 8       | wikis.localmembe0_.userid |     1 |             | 
            +----+-------------+--------------+--------+----------------------------+---------+---------+---------------------------+-------+-------------+
            

            I think that we are facing several issues here, the combination of which causes our db to perform terribly.

            Issues we know about are:

            • this one
            • http://jira.atlassian.com/browse/CONF-10030
            • http://bugs.mysql.com/bug.php?id=37411

            cheers,
            Igor

            Don Willis added a comment -

            Hi Igor,

            Thanks for raising this issue.
            While obviously I agree that's a very inefficient way to perform the operation, I'm also surprised that such a query would take several minutes, even with 25000 users in the group.
            Are you missing an index on either local_members.groupid or users.id?
            What does MySQL say if you ask it to EXPLAIN the actual query that runs?

            Cheers,
            Don

              alynch Andrew Lynch (Inactive)
              15d9a6950818 Igor Minar