Uploaded image for project: 'Confluence Data Center'
  1. Confluence Data Center
  2. CONFSERVER-99555

Query for finding duplicate labels in LabelUniqueIndexUpgradeTask takes too much time for the big LABEL tables

XMLWordPrintable

    • Icon: Suggestion Suggestion
    • Resolution: Unresolved
    • None
    • Content - Tasks
    • None
    • 8
    • 7
    • We collect Confluence feedback from various sources, and we evaluate what we've collected when planning our product roadmap. To understand how this piece of feedback will be reviewed, see our Implementation of New Features Policy.

      We are using the below SQL query for finding duplicates while running the 'LabelUniqueIndexUpgradeTask':

          public static final String SQL_SELECT_DUPLICATE_LABELS = "SELECT l1.* " +
                  "FROM LABEL l1, LABEL l2 " +
                  "WHERE l1.LABELID <> l2.LABELID " +
                  "   AND (l1.NAME = l2.NAME OR (l1.NAME IS NULL AND l2.NAME IS NULL)) " +
                  "   AND (l1.OWNER = l2.OWNER OR (l1.OWNER IS NULL AND l2.OWNER IS NULL)) " +
                  "   AND (l1.NAMESPACE = l2.NAMESPACE OR (l1.NAMESPACE IS NULL AND l2.NAMESPACE IS NULL)) " +
                  "ORDER BY l1.NAME, l1.OWNER, l1.NAMESPACE, l1.LABELID ASC";
      

      In a PostgreSQL database with 80k records in LABEL table it took nearly 5minutes to execute this query. In customer's instance it took +10 minutes.

      When we checked the SQL query with EXPLAIN, it turned out that the indexes are not used effectively.

      (Execution Time: 304666.246 ms)

      Nested Loop  (cost=1001.35..73786.65 rows=159723919 width=71) (actual time=20.236..304665.775 rows=324 loops=1)
        ->  Gather Merge  (cost=1001.06..18070.10 rows=88091 width=71) (actual time=20.143..620.023 rows=88168 loops=1)
              Workers Planned: 1
              Workers Launched: 1
              ->  Incremental Sort  (cost=1.05..7159.85 rows=51818 width=71) (actual time=0.874..397.286 rows=44084 loops=2)
      "              Sort Key: l1.name, l1.owner, l1.namespace, l1.labelid"
                    Presorted Key: l1.name
                    Full-sort Groups: 105  Sort Method: quicksort  Average Memory: 27kB  Peak Memory: 27kB
      "              Pre-sorted Groups: 14  Sort Methods: quicksort, external merge  Average Memory: 23kB  Peak Memory: 25kB  Average Disk: 241kB  Peak Disk: 3384kB"
                    Worker 0:  Full-sort Groups: 542  Sort Method: quicksort  Average Memory: 28kB  Peak Memory: 28kB
      "                Pre-sorted Groups: 17  Sort Methods: quicksort, external merge  Average Memory: 23kB  Peak Memory: 25kB  Average Disk: 120kB  Peak Disk: 2048kB"
                    ->  Parallel Index Scan using l_name_idx on label l1  (cost=0.29..5026.71 rows=51818 width=71) (actual time=0.040..7.619 rows=44084 loops=2)
        ->  Index Scan using l_name_idx on label l2  (cost=0.29..0.62 rows=1 width=55) (actual time=3.445..3.448 rows=0 loops=88168)
              Index Cond: ((name)::text = (l1.name)::text)
              Filter: ((l1.labelid <> labelid) AND (((l1.owner)::text = (owner)::text) OR ((l1.owner IS NULL) AND (owner IS NULL))) AND ((l1.namespace)::text = (namespace)::text))
              Rows Removed by Filter: 51714
      Planning Time: 3.087 ms
      Execution Time: 304666.246 ms
      

      Based on the our research, 'OR' condition is possibly confusing the query planner.

      When the query adjusted as below, the execution time drastically drops to milliseconds

      SELECT l1.*
      FROM label l1
      JOIN label l2
        ON l1.name = l2.name
       AND l1.owner = l2.owner
       AND l1.namespace = l2.namespace
       AND l1.labelid <> l2.labelid
      UNION ALL
      SELECT l1.*
      FROM label l1
      JOIN label l2
        ON l1.name = l2.name
       AND l1.owner IS NULL
       AND l2.owner IS NULL
       AND l1.namespace = l2.namespace
       AND l1.labelid <> l2.labelid
      ORDER BY name, owner, namespace, labelid;
      

      EXPLAIN output: (Execution Time: 93.398 ms)

      Gather Merge  (cost=31598083.15..50527758.37 rows=161522001 width=71) (actual time=91.720..93.302 rows=324 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Sort  (cost=31597083.13..31883111.67 rows=114411417 width=71) (actual time=83.133..83.137 rows=108 loops=3)
      "        Sort Key: l1.name, l1.owner, l1.namespace, l1.labelid"
              Sort Method: quicksort  Memory: 25kB
              Worker 0:  Sort Method: quicksort  Memory: 70kB
              Worker 1:  Sort Method: quicksort  Memory: 25kB
              ->  Parallel Append  (cost=0.29..2205342.37 rows=114411417 width=71) (actual time=41.869..82.713 rows=108 loops=3)
                    ->  Parallel Hash Join  (cost=2966.26..6933.39 rows=14827 width=71) (actual time=28.958..50.164 rows=320 loops=1)
                          Hash Cond: (((l1.name)::text = (l2.name)::text) AND ((l1.owner)::text = (l2.owner)::text) AND ((l1.namespace)::text = (l2.namespace)::text))
                          Join Filter: (l1.labelid <> l2.labelid)
                          Rows Removed by Join Filter: 67527
                          ->  Parallel Seq Scan on label l1  (cost=0.00..1551.64 rows=51864 width=71) (actual time=0.003..2.940 rows=88168 loops=1)
                          ->  Parallel Hash  (cost=1551.64..1551.64 rows=51864 width=55) (actual time=28.532..28.532 rows=88168 loops=1)
                                Buckets: 131072  Batches: 1  Memory Usage: 6848kB
                                ->  Parallel Seq Scan on label l2  (cost=0.00..1551.64 rows=51864 width=55) (actual time=0.008..11.059 rows=88168 loops=1)
                    ->  Nested Loop  (cost=0.29..11131.89 rows=94998115 width=71) (actual time=41.911..65.984 rows=1 loops=3)
                          ->  Parallel Seq Scan on label l1_1  (cost=0.00..1551.64 rows=12238 width=71) (actual time=0.017..1.705 rows=6880 loops=3)
                                Filter: (owner IS NULL)
                                Rows Removed by Filter: 22509
                          ->  Index Scan using l_name_idx on label l2_1  (cost=0.29..0.77 rows=1 width=22) (actual time=0.009..0.009 rows=0 loops=20641)
                                Index Cond: ((name)::text = (l1_1.name)::text)
                                Filter: ((owner IS NULL) AND (l1_1.labelid <> labelid) AND ((l1_1.namespace)::text = (namespace)::text))
                                Rows Removed by Filter: 4
      Planning Time: 1.089 ms
      Execution Time: 93.398 ms
      

      Please edit the SQL query as above (or any other form that improves the execution time) to make it more performant for the customers that have large LABEL table.

              Unassigned Unassigned
              9f7de485df51 Basar Beykoz (Inactive)
              Votes:
              1 Vote for this issue
              Watchers:
              6 Start watching this issue

                Created:
                Updated: