Issue Details (XML | Word | Printable)

Key: CONF-10030
Type: Improvement Improvement
Status: Open Open
Priority: Major Major
Assignee: Unassigned
Reporter: Anatoli Kazatchkov [Atlassian]
Votes: 7
Watchers: 3
Operations

Add/Edit UI Mockup to this issue
If you were logged in you would be able to see more operations.
Confluence

db2: queries that use 'lower' do not use index because of case sensitivity

Created: 22/Nov/07 06:58 PM   Updated: 10/Jul/08 10:54 PM
Component/s: Database / Hibernate, Performance
Affects Version/s: 2.5, 2.5.1, 2.5.2, 2.5.3, 2.5.4, 2.5.5, 2.5.6, 2.5.7, 2.5.8, 2.6-dr1, 2.6.0, 2.6.1
Fix Version/s: None

Time Tracking:
Not Specified

File Attachments: 1. Text File lower-func-performance.patch (19 kB)
2. Text File lower-func-performance.patch (3 kB)

Issue Links:
Cloners
 
Reference
 

Participants: Anatoli Kazatchkov [Atlassian], Don Willis [Atlassian], Igor Minar, Ivan Benko [Atlassian] and Patrick Bossman
Since last comment: 20 weeks, 4 days ago
Internal Complexity: 4
Support reference count: 3
Internal Value: 7
Labels:


 Description  « Hide
Most of the queries against CONTENT table use 'lower' db function for a title field. The goes for (space) key field in SPACES table.
We do it to make a case insensitive comparison, however it means that db2 does not use the existing indexes for those fields and performs a full table scan.
This results in a poor performance.

This can be applicable to other database too, but it is not verified. Also other databases may have global 'case sensitivity' flag.



 All   Comments   Work Log   Change History      Sort Order: Ascending order - Click to sort in descending order
Patrick Bossman added a comment - 18/Mar/08 10:50 AM
DB2 for z/OS version 9 supports column expression indexes - so you could create an index on LOWER(COL).

DB2 LUW does support indexes on column expressions through generated columns.
http://publib.boulder.ibm.com/infocenter/db2luw/v9r5/index.jsp?topic=/com.ibm.db2.luw.admin.dbobj.doc/doc/c0020109.html


Patrick Bossman added a comment - 18/Mar/08 11:56 AM
An alternate and DBMS agnostic solution would be for confluence to store a derived column itself, index the field, and ensure all uses of LOWER use the derived fields.

Don Willis [Atlassian] added a comment - 08/Apr/08 05:30 AM
I agree with Patrick's solution. If we need to look up the title in a case-insensitive fashion and Hibernate doesn't support a generic way of creating such an index on all our supported databases, we should do the to-lower-casing ourselves and store it in a separate column. We would have to watch out for Locales though. Whatever Locale we use to store the lowercase version needs to stay the same forever, or we need to update the whole table when it changes.

If this problem applies to DB2 it should also apply to PostgreSQL and Oracle. MySQL can be run with a case sensitive or insensitive collation.

I'm hoping we can just not do a case-insensitive search when looking up page titles. Any search that generic should probably using Lucene index rather than the Database.


Igor Minar added a comment - 30/Jun/08 10:12 PM
This is causing major problems for us at wikis.sun.com where we use MySQL5.

There are other reports of this issue occurring with MySQL:
http://jira.atlassian.com/browse/CONF-9152 (check out comments)
https://support.atlassian.com/browse/CSP-11031

MySQL RFE: http://bugs.mysql.com/bug.php?id=9639

The title of this JIRA should be changed so that it's clear that this is a generic issue.

PostgresSQL can create indexes for functions, but since Confluence doesn't generate the index, users don't benefit from this unless a DBA creates the index manually.

Drupal was facing the very same issues in the past (http://drupal.org/node/83738), their attempt to resolving it is similar to what Patrick mentioned. I'm all for this approach as well.


Igor Minar added a comment - 02/Jul/08 08:09 PM
Hi Ivan,

I'll attach an updated version of the patch. We just experienced major outage today that lasted the entire afternoon. According to the db investigation this was due to slow queries that were using the lower() function. After the patch was deployed, the outage cleared.

I suggest that you consider this to be a blocker RFE/bug as it can take a site down easily.

There is one more major bug that we found in atlassian-user code, I'll file a separate issue for that.

cheers,
Igor


Igor Minar added a comment - 02/Jul/08 08:21 PM
updated version of the patch

Igor Minar added a comment - 02/Jul/08 08:26 PM
And just to give you some numbers. I ran some simple benchmarks with and without the patch on our production db during our outage and this is the difference in time it takes to execute a query that retrieves a wiki page:

before: 10-15 seconds
after: 0.01-0.12 seconds

WHAT A DIFFERENCE!

My only concern is if the patch is cluster safe. We were running two nodes side by side, but since I don't understand all your clustering code (yet), I took one node down just to be safe. Can someone from Atlassian have a look at it and let us know if it's ok to run this in clustered mode?

cheers,
Igor


Ivan Benko [Atlassian] added a comment - 02/Jul/08 11:08 PM
Hi Igor,
I've discussed this issue with one of the developers, familiar with the code, and although he agrees that the performance improvement is enormous it is perhaps approached too specifically to the MySQL 5 database and not generic enough (not sure, perhaps there is even more effective way to implement it). We need to investigate and test the required fix on all supported databases against our functional and productions/test loading tests to be able to ascertain it's applicability and ability to be embedded into the main source code. As you would appreciate it takes a bit of time to investigate, and thus we shall provide an update on this ticket when necessary information is collated. Thanks for your effort mate ! Great work and perhaps I'd suggest to open a support ticket for the database problems so we can review the logs etc... to ensure there is nothing else going on additionally.
s pozdravom ostava
Ivan

Igor Minar added a comment - 02/Jul/08 11:58 PM
Hi Ivan,

I'm afraid that you are wrong.

The patch is as database agnostic as possible and will work with any database out there. I avoided the use of triggers, stored procedures or any other database specific technologies that would by compatible with other relational database systems. I even went to the trouble of using some Atlassian's utility methods to make the patch compatible with DB2.

Any database that supports indexes (that is any serious db out there) will enormously benefit from this patch.

cheers,
Igor


Don Willis [Atlassian] added a comment - 03/Jul/08 01:40 AM
Hi Igor,

As I've expressed earlier in this comments, I think the approach taken in your patch is probably the right one.

What Ivan was trying to express is that this patch is only required for some databases, whereas others will happily use their existing index despite the lower function. The patch is required for MySQL because it ignores indexes as soon as you introduce a function. The introduction of the column is definitely the most database-generic way of doing it.

I have looked quickly over your patch and cannot imagine a reason it would not work in a cluster. How did you migrate the data? By SQL? I am not sure when we will have time to test your patch, but I certainly hope we can do it soon.
However, in case we do not get to working your patch into the product soon, might I suggest a simpler patch for your particular instance?

MySQL will by default completely ignore case on data anyway. Certainly you could do it for this specific table, and in fact some of our customers run their whole mysql database with case insensitivity. We recommend against this because the majority of databases, and therefore of our testing, is done case sensitively. I suggest setting your content table's collation to be case insensitive, and simply taking out the call to the lower function in the queries that use it. cf http://dev.mysql.com/doc/refman/5.0/en/charset-table.html. That change will be smaller, and easier to maintain than the database agnostic one.

Of course, there are pitfalls to doing this, such as the chance of a unique key constraint collision where two keys differ only in case.

Cheers,
Don


Patrick Bossman added a comment - 03/Jul/08 07:58 AM
Hello,
I want to reiterate
1. DB2 9 for z/OS supports index on column expressions.
2. DB2 LUW supports generated columns with index on generated columns - where the database maintains the generated column and will automatically rewrite queries to use the generated column. In DB2 LUW 9.5, generated columns can be created as hidden columns (not returned for SELECT *).

So this problem is solvable in current releases of DB2 - but it requires user action to solve it.

This issue is a major problem for any site that has large tables. Basically, any customer that's going to have a large wiki installation - this issue is a show-stopper. The lower is used all over the place. Not just on content. The biggest issue is on the SPACES table - since it's used to look up content, blog posts, etc. So it's either a unique lookup on SPACES, or a table space scan to find 1 row.

We also saw this issue on LINKS.
select outgoingli0_.LINKID as LINKID, outgoingli0_.DESTPAGETITLE as DESTPAGE2_
, outgoingli0_.DESTSPACEKEY as DESTSPAC3_, outgoingli0_.CONTENTID as CONTENTID
, outgoingli0_.CREATOR as CREATOR, outgoingli0_.CREATIONDATE as CREATION6_
, outgoingli0_.LASTMODIFIER as LASTMODI7_, outgoingli0_.LASTMODDATE as LASTMODD8_
from LINKS outgoingli0_ CARD 1457006 NPAGES 57402
where (lower(outgoingli0_.DESTSPACEKEY)=? ) COLCARD 11320
and(lower(outgoingli0_.DESTPAGETITLE)=? ) COLCARD 356000

BLOGPOST Title also has the issue.

select blogpost0_.CONTENTID as CONTENTID, blogpost0_.SPACEID as SPACEID
, blogpost0_.TITLE as TITLE, blogpost0_.VERSION as VERSION
, blogpost0_.CREATOR as CREATOR, blogpost0_.CREATIONDATE as CREATION6_
, blogpost0_.LASTMODIFIER as LASTMODI7_, blogpost0_.LASTMODDATE as LASTMODD8_
, blogpost0_.VERSIONCOMMENT as VERSIONC9_, blogpost0_.PREVVER as PREVVER
, blogpost0_.CONTENT_STATUS as CONTENT11_
from CONTENT blogpost0_
, SPACES space1_
where blogpost0_.CONTENTTYPE='BLOGPOST'
and ((lower(space1_.SPACEKEY)=?
and blogpost0_.SPACEID=space1_.SPACEID)
and(lower(blogpost0_.TITLE)=? )
and(blogpost0_.CREATIONDATE>=? )
and(blogpost0_.CREATIONDATE<? )
and(blogpost0_.PREVVER is null )
and(blogpost0_.CONTENT_STATUS='current' ))

CONTENT_LABEL table.
select distinct label0_.LABELID as LABELID
, label0_.NAME as NAME, label0_.OWNER as OWNER
, label0_.NAMESPACE as NAMESPACE, label0_.CREATIONDATE as CREATION5_
, label0_.LASTMODDATE as LASTMODD6_
from LABEL label0_ CARD 22253 NPAGES 424
, CONTENT_LABEL labelling1_ CARD 73483 NPAGES 1711
where (lower(labelling1_.SPACEKEY)=? ) COLCARD 7813
AND(labelling1_.LABELID=label0_.LABELID ) COLCARD 22253/22253
AND(label0_.NAMESPACE=? ) COLCARD 4
order by label0_.NAME

ATTACHMENTS table.
SELECT attachmentid, title, contenttype, creator, creationdate
, lastmodifier, lastmoddate, filesize, attachment_comment
, attversion, prevver
FROM WIKI_RW.attachments CARD 403449
where pageid = ? COLCARD 94288
and lower(title) = ? COLCARD 253952
order by attversion desc with UR

Our wiki has a lot of content, a lot of attachements, a lot of spaces. Any big site - this issue is an absolute performance killer . The wiki quickly becomes useless.


Igor Minar added a comment - 03/Jul/08 11:17 AM
Don,

We took the risk and we deployed the patch to production yesterday. So far so good

The data was manually migrated using SQL, the queries should be in the comment section of the patch.

Thanks for the cluster update. I thought that just a simple serialization was used to push the objects over the wire to the second node, but I wasn't sure.

I'd prefer the solution proposed in the patch over changing the hibernate queries and exploiting MySQL insensitivity because the solution is the smallest common denominator for all of your supported databases, which would make it easier for you to test and support it.

Using this solution you could easily and safely provide this enhancements to all of your customers regardless of their database via an upgrade task (I can image that creating an upgrade task that would create custom indexes for all the supported databases would be much trickier).


Don Willis [Atlassian] added a comment - 10/Jul/08 02:51 AM
I completely agree that the use of lower in all these places is going to hurt performance for some databases and that Igor's approach is the most pragmatic one for us to take in fixing this. (This is despite the obvious ickiness of replicated data). I only suggested using the MySQL insensitivity because if we don't manage to prioritise this bug, then you're going to need to maintain whatever workaround you do, and I'd think that a simple workaround would be easier to maintain than one that is database agnostic.

Igor Minar added a comment - 10/Jul/08 10:54 PM
Thanks Don. I'm not happy about the data redundancy, but if it means that you have only one db schema to support that is fast for everyone, then I think it's definitely worth it.

The benefit for us is that you'll do all your QA with the same schema as we use, minimizing the chance of something breaking for us in the future because of this change.

/i