Type: Bug
Resolution: Fixed
Priority: Medium
Affects Version/s: 7.4.1, 7.4.3, 7.4.4, 7.4.5, 7.4.6, 7.4.7, 7.4.8, 7.12.3
Component/s: Search - Indexing
Severity: 3 - Minor
Issue Summary
The presence of large text files (10+ MB) containing lots of escapable/HTML data (characters such as <, >, \, /, |, &, etc.) significantly slows down indexing, to the extent that indexing can appear to have stalled.
However, thread dumps will show indexing thread(s):
- in a runnable state (java.lang.Thread.State: RUNNABLE), and
- constantly executing com.atlassian.confluence.util.HtmlEntityEscapeUtil.unEscapeHtmlEntities code
Note: Only large text files containing significant quantities of special/escapable characters will cause the problem.
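The stack traces under Actual Results below show the time being spent in String.indexOf calls made from HtmlEntityEscapeUtil.unEscapeHtmlEntities. The actual Confluence source is not reproduced here; the following is only a minimal Java sketch (class and method names are illustrative) of the kind of per-occurrence scan-and-copy loop that turns quadratic on a large, entity-dense payload:

public class NaiveUnescapeSketch {
    // Per-occurrence indexOf plus full-string copy: O(n) work for each entity
    // replaced, hence O(n^2) overall when a 10+ MB string is dense with entities.
    static String unescapeNaively(String text) {
        String[][] entities = {{"&lt;", "<"}, {"&gt;", ">"}, {"&quot;", "\""}, {"&amp;", "&"}};
        for (String[] e : entities) {
            int i;
            while ((i = text.indexOf(e[0])) >= 0) {      // rescans from index 0 every time
                text = text.substring(0, i) + e[1]       // copies the whole string again
                        + text.substring(i + e[0].length());
            }
        }
        return text;
    }

    public static void main(String[] args) {
        // ~200 KB of dense entities already takes seconds with this pattern;
        // doubling the input roughly quadruples the runtime (String.repeat needs Java 11+).
        String payload = "&lt;p&gt;&amp;&quot;x&quot;&amp;&lt;/p&gt;".repeat(5_000);
        long t0 = System.nanoTime();
        unescapeNaively(payload);
        System.out.printf("unescaped %d chars in %.1f s%n", payload.length(), (System.nanoTime() - t0) / 1e9);
    }
}

With this shape of loop the thread stays RUNNABLE the whole time (it is busy scanning and copying, not blocked), which matches the thread-dump symptom described in this report.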
Steps to Reproduce
(Tested on Confluence 7.4.[1-8] and 7.12)
- Attach a 10+ MB *.txt file containing lots of SQL queries/HTML source/special characters to a Confluence page (a generator sketch follows this list)
- Wait for the scheduled background indexing job (Caesium thread) to pick up the newly attached data, or rebuild indexes (via Confluence Administration > Content Indexing > Search Index > Rebuild)
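For a quick reproduction, here is a minimal, hypothetical generator for such a file; the ~10 MB threshold comes from this report, while the chunk contents, class name, and the indexme10mb.txt file name (matching the log example later in this report) are arbitrary choices:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class MakeReproFile {
    public static void main(String[] args) throws IOException {
        // One line dense with escapable/special characters (<, >, &, ", \, |, /)
        String chunk = "select * from t where a < 1 & b > 2 | c <> 'x\\y/z'; <p>&amp;&lt;&gt;&quot;</p>\n";
        StringBuilder sb = new StringBuilder(11 * 1024 * 1024);
        while (sb.length() < 10 * 1024 * 1024) {    // grow to just over 10 MB
            sb.append(chunk);
        }
        Files.write(Paths.get("indexme10mb.txt"), sb.toString().getBytes(StandardCharsets.UTF_8));
    }
}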
Expected Results
- When the scheduled background Caesium indexing thread picks up the data, indexing should finish in a reasonable time, and Confluence Administration > Content Indexing > Queue Contents should show zero pending items
- When a one-time/on-demand reindexing job is manually triggered, the Indexer thread(s) should finish in a reasonable time and reach 100% re-index task completion
Actual Results
- When the scheduled background Caesium indexing thread picks up the data:
- it spends an inordinate amount of time processing the payload from the file and is extremely slow to get to the remaining indexable items (visible as the steadily increasing "Number of items in queue" metric under Confluence Administration > Content Indexing > Queue Contents).
Thread dumps will show the Caesium thread in the following state (excerpt of a "Caesium-1-N" thread):

"Caesium-1-1" #373 daemon prio=5 os_prio=0 tid=0x000001a95b71d000 nid=0x380c runnable [0x000000361d4fd000]
   java.lang.Thread.State: RUNNABLE
        at java.lang.String.indexOf(String.java:1769)
        at java.lang.String.indexOf(String.java:1718)
        at com.atlassian.confluence.util.HtmlEntityEscapeUtil.unEscapeHtmlEntities(HtmlEntityEscapeUtil.java:388)
        at com.atlassian.confluence.search.lucene.extractor.HtmlEntityFilterExtractor.addFields(HtmlEntityFilterExtractor.java:13)
        at com.atlassian.confluence.plugin.descriptor.ExtractorModuleDescriptor$BackwardsCompatibleExtractor.addFields(ExtractorModuleDescriptor.java:39)
        at com.atlassian.confluence.internal.index.lucene.LuceneContentExtractor.lambda$extract$0(LuceneContentExtractor.java:68)
        at com.atlassian.confluence.internal.index.lucene.LuceneContentExtractor$$Lambda$2093/1823784346.accept(Unknown Source)
        ...
- When a one-time/on-demand reindexing job is manually triggered:
- Indexer: N thread(s) spend an inordinate amount of time processing the payload from the file, giving the appearance that indexing is stuck at a specific completion percentage (sometimes for hours).
Thread dumps will show the Indexer thread(s) in the following state (excerpt of an "Indexer: N" thread):

"Indexer: 1" #369 daemon prio=5 os_prio=0 cpu=51468.75ms elapsed=51.59s allocated=54757K defined_classes=45 tid=0x000002a704504800 nid=0x1108 runnable [0x000000bdaf7fd000]
   java.lang.Thread.State: RUNNABLE
        at com.atlassian.confluence.util.HtmlEntityEscapeUtil.unEscapeHtmlEntities(HtmlEntityEscapeUtil.java:390)
        at com.atlassian.confluence.search.lucene.extractor.HtmlEntityFilterExtractor.addFields(HtmlEntityFilterExtractor.java:13)
        at com.atlassian.confluence.plugin.descriptor.ExtractorModuleDescriptor$BackwardsCompatibleExtractor.addFields(ExtractorModuleDescriptor.java:39)
        at com.atlassian.confluence.internal.index.lucene.LuceneContentExtractor.lambda$extract$0(LuceneContentExtractor.java:68)
        at com.atlassian.confluence.internal.index.lucene.LuceneContentExtractor$$Lambda$2457/0x0000000102b61c40.accept(Unknown Source)
        ...
- The behavior can also be observed by performing the following steps:
- set the log level of the packages com.atlassian.bonnie.search.extractor and com.atlassian.confluence.internal.index.lucene to DEBUG (a log4j.properties sketch follows this section)
- track indexing progress by looking at the application logs
- note the text files that cause long intervals between the printing of the following two extractor lines (indexme10mb.txt in this example):
atlassian-confluence.log with indexing-related DEBUG lines:

# PAYLOAD PASSED TO HtmlEntityFilterExtractor
2021-07-05 16:00:08,327 DEBUG [Indexer: 1] [internal.index.lucene.LuceneContentExtractor] lambda$extract$0 Adding fields to document for Attachment: indexme10mb.txt v.1 (360471) admin using BackwardsCompatibleExtractor wrapping com.atlassian.confluence.search.lucene.extractor.HtmlEntityFilterExtractor@707d0516 (confluence.extractors.core:htmlEntitiesFilterExtractor)

# ~600 SECONDS GAP

# PAYLOAD PASSED TO TitleExtractor
2021-07-05 16:10:42,266 DEBUG [Indexer: 1] [internal.index.lucene.LuceneContentExtractor] lambda$extract$0 Adding fields to document for Attachment: indexme10mb.txt v.1 (360471) admin using BackwardsCompatibleExtractor wrapping com.atlassian.confluence.search.lucene.extractor.TitleExtractor@1df9ba4a (confluence.extractors.core:titleExtractor)
Occasionally, the TitleExtractor line may take tens of minutes, or even hours, to be written for the indexer thread in question ([Indexer: 1] in the above example).
So if the logs roll over, the HtmlEntityFilterExtractor and TitleExtractor lines for a specific indexer thread processing a specific file may end up in different log files.
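For reference, DEBUG logging for these two packages can be enabled at runtime via Confluence Administration > Logging and Profiling, or persistently by adding entries along the following lines to log4j.properties under <confluence-install>/confluence/WEB-INF/classes (a sketch; the exact location can vary by version and deployment):

log4j.logger.com.atlassian.bonnie.search.extractor=DEBUG
log4j.logger.com.atlassian.confluence.internal.index.lucene=DEBUG

Changes made via Logging and Profiling take effect immediately but do not survive a restart; log4j.properties changes persist but require a restart.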
Workaround
Currently, the only workaround is to upload zipped/compressed versions of the problematic text files:
- Identify and download copies of the problematic text files from Confluence.
As outlined previously, to identify the problematic text files: set com.atlassian.bonnie.search.extractor and com.atlassian.confluence.internal.index.lucene to DEBUG before a reindexing attempt, then note down the text files that show long intervals between the printing of the HtmlEntityFilterExtractor and TitleExtractor lines in the logs.
- then, cross-reference that list with the results of this modified SQL statement, adapted from this article about finding large attachments in Confluence:
SQL to find all ATTACHMENTS, type = *.txt, size >= 10 MB:
select distinct
    c.contentid as contentid,
    c.spaceid as spaceid,
    c.title as attachmenttitle,
    u.username as uploadedby,
    co.title as pagetitle,
    (cn.longval/(1024*1024)) as filesizeinmb,
    s.spacekey as spacekey,
    s.spacename as spacename
from content as c
join user_mapping as u on u.user_key = c.creator
join content as co on c.pageid = co.contentid
join spaces as s on c.spaceid = s.spaceid
join contentproperties as cn on cn.contentid = c.contentid
where c.contenttype = 'ATTACHMENT'
  and c.title like '%.txt'
  and cn.longval is not null
  and (cn.longval/(1024*1024)) >= 10
order by (cn.longval/(1024*1024)) desc;
Query for Oracle (table aliases without the "as" keyword):

select distinct
    c.contentid as contentid,
    c.spaceid as spaceid,
    c.title as attachmenttitle,
    u.username as uploadedby,
    co.title as pagetitle,
    (cn.longval/(1024*1024)) as filesizeinmb,
    s.spacekey as spacekey,
    s.spacename as spacename
from content c
join user_mapping u on u.user_key = c.creator
join content co on c.pageid = co.contentid
join spaces s on c.spaceid = s.spaceid
join contentproperties cn on cn.contentid = c.contentid
where c.contenttype = 'ATTACHMENT'
  and c.title like '%.txt'
  and cn.longval is not null
  and (cn.longval/(1024*1024)) >= 10
order by (cn.longval/(1024*1024)) desc;
Note: As mentioned before, only text files containing significant quantities of special/escapable characters need to be considered.
- After the problematic text files have been identified, compress the downloaded copies (using any compression utility; a minimal Java sketch follows this list) and re-upload them.
- Purge the original *.txt files from the instance.
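For completeness, a minimal Java sketch of the compression step (the class name is illustrative, and any standalone zip/gzip utility achieves the same result; the file name is the example used earlier in this report):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class ZipAttachment {
    public static void main(String[] args) throws IOException {
        Path source = Paths.get("indexme10mb.txt");   // problematic file downloaded from Confluence
        Path target = Paths.get("indexme10mb.zip");   // compressed copy to re-upload
        try (ZipOutputStream zos = new ZipOutputStream(Files.newOutputStream(target))) {
            zos.putNextEntry(new ZipEntry(source.getFileName().toString()));
            Files.copy(source, zos);                  // stream the file bytes into the zip entry
            zos.closeEntry();
        }
    }
}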