Extremely slow indexing, thread dumps show HtmlEntityFilterExtractor busy processing payload from large (10+ MB) text files

    • Type: Bug
    • Resolution: Fixed
    • Priority: Medium
    • 7.4.11, 7.11.6, 7.12.5, 7.13.0
    • Affects Version/s: 7.4.1, 7.4.3, 7.4.4, 7.4.5, 7.4.6, 7.4.7, 7.4.8, 7.12.3
    • Component/s: Search - Indexing
    • Severity 3 - Minor

      Issue Summary

      The presence of large (10+ MB) text files containing lots of escapable/HTML characters (such as <, >, \, /, |, and &) will significantly slow down indexing, to the extent that it can appear to have stalled.
      However, thread dumps will show indexing thread(s):

      • in a runnable state (java.lang.Thread.State: RUNNABLE), and
      • constantly executing com.atlassian.confluence.util.HtmlEntityEscapeUtil.unEscapeHtmlEntities

      Note: Only large text files containing significant quantities of special/escapable characters will cause the problem.
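
      For intuition, here is a minimal, purely illustrative sketch of how an indexOf-and-copy unescape loop degrades on large, entity-dense input. This is not Confluence's actual implementation; only the String.indexOf calls in the stack traces below are known from the issue.

      // Illustrative only -- NOT the real HtmlEntityEscapeUtil code.
      // A naive unescape that rescans and copies the string for every
      // entity hit costs O(n) per hit, which adds up to hours of CPU
      // time on a 10+ MB entity-dense file.
      public class NaiveUnescapeSketch {
          static String naiveUnescape(String s) {
              String[][] entities = {{"&lt;", "<"}, {"&gt;", ">"}, {"&amp;", "&"}};
              for (String[] e : entities) {
                  int i;
                  while ((i = s.indexOf(e[0])) >= 0) {
                      // Each hit re-scans from the start and copies the
                      // whole string, so cost grows quadratically with size.
                      s = s.substring(0, i) + e[1] + s.substring(i + e[0].length());
                  }
              }
              return s;
          }

          public static void main(String[] args) {
              // Tiny demo input; the slowdown only shows at 10+ MB scale.
              System.out.println(naiveUnescape("&lt;a&gt; &amp; &lt;b&gt;"));
          }
      }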

      Steps to Reproduce

      (Tested on Confluence 7.4.[1-8] and 7.12)

      1. Attach a 10+ MB *.txt file containing lots of SQL queries, HTML source, or other special characters to a Confluence page (a sketch for generating such a file follows this list)
      2. Wait for the scheduled background indexing job (Caesium thread) to pick up the newly attached data, or rebuild indexes (via Confluence Administration > Content Indexing > Search Index > Rebuild)
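
      A minimal sketch for generating a suitable test file (the file name indexme10mb.txt and the exact contents are arbitrary; any 10+ MB text file dense in escapable characters will do):

      import java.io.IOException;
      import java.nio.charset.StandardCharsets;
      import java.nio.file.Files;
      import java.nio.file.Paths;

      // Writes a little over 10 MB of entity-dense text to indexme10mb.txt
      // so the HtmlEntityFilterExtractor slowdown can be reproduced.
      public class GenerateTestFile {
          public static void main(String[] args) throws IOException {
              String chunk = "<select a=\"1\" & b='2'>|\\/&amp;&lt;&gt;</select>\n";
              StringBuilder sb = new StringBuilder(11 * 1024 * 1024);
              while (sb.length() < 10 * 1024 * 1024) {
                  sb.append(chunk);
              }
              Files.write(Paths.get("indexme10mb.txt"),
                      sb.toString().getBytes(StandardCharsets.UTF_8));
          }
      }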

      Expected Results

      • When the scheduled background Caesium indexing thread picks up the data, indexing should finish in a reasonable time, and Confluence Administration > Content Indexing > Queue Contents should show 0 pending items
      • When a one-time/on-demand reindexing job is manually triggered, Indexer thread(s) should finish in a reasonable time and show 100% re-index task completion

      Actual Results

      • When the scheduled background Caesium indexing thread picks up the data:
        • it spends an inordinate amount of time processing the file's payload and is extremely slow to reach the remaining indexable items (visible as a steadily increasing Number of items in queue metric under Confluence Administration > Content Indexing > Queue Contents).
          Thread dumps will show the Caesium thread in the following state:
          thread dump excerpt of a "Caesium-1-N" thread
          "Caesium-1-1" #373 daemon prio=5 os_prio=0 tid=0x000001a95b71d000 nid=0x380c runnable [0x000000361d4fd000]
             java.lang.Thread.State: RUNNABLE
          	at java.lang.String.indexOf(String.java:1769)
          	at java.lang.String.indexOf(String.java:1718)
          	at com.atlassian.confluence.util.HtmlEntityEscapeUtil.unEscapeHtmlEntities(HtmlEntityEscapeUtil.java:388)
          	at com.atlassian.confluence.search.lucene.extractor.HtmlEntityFilterExtractor.addFields(HtmlEntityFilterExtractor.java:13)
          	at com.atlassian.confluence.plugin.descriptor.ExtractorModuleDescriptor$BackwardsCompatibleExtractor.addFields(ExtractorModuleDescriptor.java:39)
          	at com.atlassian.confluence.internal.index.lucene.LuceneContentExtractor.lambda$extract$0(LuceneContentExtractor.java:68)
          	at com.atlassian.confluence.internal.index.lucene.LuceneContentExtractor$$Lambda$2093/1823784346.accept(Unknown Source)
          	...
          



      • When a one-time/on-demand reindexing job is manually triggered:
        • Indexer: N thread(s) spend an inordinate amount of time processing the file's payload, giving the appearance that indexing is stuck at a specific completion percentage (sometimes for hours).
          Thread dumps will show Indexer thread(s) in the following state:
          thread dump excerpt of an "Indexer: N" thread
          "Indexer: 1" #369 daemon prio=5 os_prio=0 cpu=51468.75ms elapsed=51.59s allocated=54757K defined_classes=45 tid=0x000002a704504800 nid=0x1108 runnable  [0x000000bdaf7fd000]
             java.lang.Thread.State: RUNNABLE
          	at com.atlassian.confluence.util.HtmlEntityEscapeUtil.unEscapeHtmlEntities(HtmlEntityEscapeUtil.java:390)
          	at com.atlassian.confluence.search.lucene.extractor.HtmlEntityFilterExtractor.addFields(HtmlEntityFilterExtractor.java:13)
          	at com.atlassian.confluence.plugin.descriptor.ExtractorModuleDescriptor$BackwardsCompatibleExtractor.addFields(ExtractorModuleDescriptor.java:39)
          	at com.atlassian.confluence.internal.index.lucene.LuceneContentExtractor.lambda$extract$0(LuceneContentExtractor.java:68)
          	at com.atlassian.confluence.internal.index.lucene.LuceneContentExtractor$$Lambda$2457/0x0000000102b61c40.accept(Unknown Source)
          	...
          



      • The behavior can also be observed by performing the following steps:
        1. set the log level of the packages com.atlassian.bonnie.search.extractor and com.atlassian.confluence.internal.index.lucene to DEBUG (see the configuration sketch after this list)
        2. track indexing progress by looking at the application logs
        3. note the text files that cause long intervals between the log lines of the following two extractors (indexme10mb.txt in this example):
          atlassian-confluence.log with indexing-related DEBUG lines
          # PAYLOAD PASSED TO HtmlEntityFilterExtractor
          2021-07-05 16:00:08,327 DEBUG [Indexer: 1] [internal.index.lucene.LuceneContentExtractor] lambda$extract$0 Adding fields to document for Attachment: indexme10mb.txt v.1 (360471) admin using BackwardsCompatibleExtractor wrapping com.atlassian.confluence.search.lucene.extractor.HtmlEntityFilterExtractor@707d0516 (confluence.extractors.core:htmlEntitiesFilterExtractor)
          # ~600 SECONDS GAP
          # PAYLOAD PASSED TO TitleExtractor
          2021-07-05 16:10:42,266 DEBUG [Indexer: 1] [internal.index.lucene.LuceneContentExtractor] lambda$extract$0 Adding fields to document for Attachment: indexme10mb.txt v.1 (360471) admin using BackwardsCompatibleExtractor wrapping com.atlassian.confluence.search.lucene.extractor.TitleExtractor@1df9ba4a (confluence.extractors.core:titleExtractor)
          

          Occasionally, the TitleExtractor line may take tens of minutes or even hours to be written for the indexer thread in question ([Indexer: 1] in the above example).
          If the logs roll over in the meantime, the HtmlEntityFilterExtractor and TitleExtractor lines for a given indexer thread and file may therefore end up in different log files.
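
      The DEBUG levels can be set at runtime via Confluence Administration > Logging and Profiling. Assuming the bundled log4j 1.x configuration (the default in the affected Confluence versions), they can also be made persistent by adding the following to <install>/confluence/WEB-INF/classes/log4j.properties:

      # Assumed log4j 1.x syntax (Confluence's bundled logging framework)
      log4j.logger.com.atlassian.bonnie.search.extractor=DEBUG
      log4j.logger.com.atlassian.confluence.internal.index.lucene=DEBUG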

      Workaround

      Currently, the only workaround is to upload zipped/compressed versions of the problematic text files:

      1. Identify and download copies of the problematic text files from Confluence.
        As outlined previously, to identify problematic text files:
        • set com.atlassian.bonnie.search.extractor and com.atlassian.confluence.internal.index.lucene to DEBUG before a reindexing attempt, then note down the text files that show long intervals between the HtmlEntityFilterExtractor and TitleExtractor lines in the logs
        • then collate that list with the results of this modified SQL statement, adapted from the article about finding large attachments in Confluence:
          SQL to find all ATTACHMENTS, type = *.txt, size >= 10 MB
          select distinct 
          	c.contentid as contentid,
          	c.spaceid as spaceid,
          	c.title as attachmenttitle,
          	u.username as uploadedby,
          	co.title as pagetitle,
          	(cn.longval/(1024*1024)) as filesizeinmb,
          	s.spacekey as spacekey,
          	s.spacename as spacename
          from 	
          	content as c
          join user_mapping as u on u.user_key = c.creator
          join content as co on c.pageid = co.contentid
          join spaces as s on c.spaceid=s.spaceid
          join contentproperties as cn on cn.contentid = c.contentid
          where 	
          	c.contenttype = 'ATTACHMENT' and c.title like '%.txt'
          	and cn.longval is not null
          	and (cn.longval/(1024*1024)) >= 10
          order by (cn.longval/(1024*1024)) desc;
          
          Query for Oracle (identical, except that Oracle does not accept the AS keyword for table aliases)
          select distinct 
          	c.contentid as contentid,
          	c.spaceid as spaceid,
          	c.title as attachmenttitle,
          	u.username as uploadedby,
          	co.title as pagetitle,
          	(cn.longval/(1024*1024)) as filesizeinmb,
          	s.spacekey as spacekey,
          	s.spacename as spacename
          from 	
          	content c
          join user_mapping u on u.user_key = c.creator
          join content co on c.pageid = co.contentid
          join spaces s on c.spaceid=s.spaceid
          join contentproperties cn on cn.contentid = c.contentid
          where 	
          	c.contenttype = 'ATTACHMENT' and c.title like '%.txt'
          	and cn.longval is not null
          	and (cn.longval/(1024*1024)) >= 10
          order by (cn.longval/(1024*1024)) desc;
          

          Note: As mentioned before, only text files containing significant quantities of special/escapable characters are to be considered.

      2. After the problematic text files have been identified, compress the downloaded copies (using any compression utility; a minimal sketch follows this list) and re-upload them.
      3. Purge the original *.txt files from the instance.
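
      A minimal sketch of step 2's compression, using only java.util.zip from the JDK (any compression utility works equally well; the file names are the hypothetical ones from the examples above):

      import java.io.IOException;
      import java.nio.file.Files;
      import java.nio.file.Path;
      import java.nio.file.Paths;
      import java.util.zip.ZipEntry;
      import java.util.zip.ZipOutputStream;

      // Compresses a problematic text file into a .zip for re-upload.
      public class ZipAttachment {
          public static void main(String[] args) throws IOException {
              Path src = Paths.get("indexme10mb.txt");
              try (ZipOutputStream zos = new ZipOutputStream(
                      Files.newOutputStream(Paths.get("indexme10mb.zip")))) {
                  zos.putNextEntry(new ZipEntry(src.getFileName().toString()));
                  Files.copy(src, zos);
                  zos.closeEntry();
              }
          }
      }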

            Assignee:
            Unassigned
            Reporter:
            Mohit Sharma
            Votes:
            3
            Watchers:
            19
