Extremely slow indexing, thread dumps show HtmlEntityFilterExtractor busy processing payload from large (10+ MB) text files

    • Type: Bug
    • Resolution: Fixed
    • Priority: Medium
    • 7.4.11, 7.11.6, 7.12.5, 7.13.0
    • Affects Version/s: 7.4.1, 7.4.3, 7.4.4, 7.4.5, 7.4.6, 7.4.7, 7.4.8, 7.12.3
    • Component/s: Search - Indexing
    • Severity 3 - Minor

      Issue Summary

      The presence of large (10+ MB) text files containing lots of escapable/HTML characters (such as <, >, \, /, |, and &) will significantly slow down indexing, to the extent that it can appear to have stalled.
      However, thread dumps will show indexing thread(s):

      • in a runnable state (java.lang.Thread.State: RUNNABLE), and
      • constantly executing com.atlassian.confluence.util.HtmlEntityEscapeUtil.unEscapeHtmlEntities

      Note: Only large text files containing significant quantities of special/escapable characters will cause the problem.
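
      For intuition, here is a minimal, purely illustrative sketch of how an indexOf-and-copy unescape loop degrades on large, entity-dense input. This is not Confluence's actual implementation; only the String.indexOf calls in the stack traces below are known from the issue.

      // Illustrative only -- NOT the real HtmlEntityEscapeUtil code.
      // A naive unescape that rescans and copies the string for every
      // entity hit costs O(n) per hit, which adds up to hours of CPU
      // time on a 10+ MB entity-dense file.
      public class NaiveUnescapeSketch {
          static String naiveUnescape(String s) {
              String[][] entities = {{"&lt;", "<"}, {"&gt;", ">"}, {"&amp;", "&"}};
              for (String[] e : entities) {
                  int i;
                  while ((i = s.indexOf(e[0])) >= 0) {
                      // Each hit re-scans from the start and copies the
                      // whole string, so cost grows quadratically with size.
                      s = s.substring(0, i) + e[1] + s.substring(i + e[0].length());
                  }
              }
              return s;
          }

          public static void main(String[] args) {
              // Tiny demo input; the slowdown only shows at 10+ MB scale.
              System.out.println(naiveUnescape("&lt;a&gt; &amp; &lt;b&gt;"));
          }
      }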

      Steps to Reproduce

      (Tested on Confluence 7.4.[1-8] and 7.12)

      1. Attach a 10+ MB *.txt file containing lots of SQL queries, HTML source, or other special characters to a Confluence page (a sketch for generating such a file follows this list)
      2. Wait for the scheduled background indexing job (Caesium thread) to pick up the newly attached data, or rebuild indexes (via Confluence Administration > Content Indexing > Search Index > Rebuild)
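
      A minimal sketch for generating a suitable test file (the file name indexme10mb.txt and the exact contents are arbitrary; any 10+ MB text file dense in escapable characters will do):

      import java.io.IOException;
      import java.nio.charset.StandardCharsets;
      import java.nio.file.Files;
      import java.nio.file.Paths;

      // Writes a little over 10 MB of entity-dense text to indexme10mb.txt
      // so the HtmlEntityFilterExtractor slowdown can be reproduced.
      public class GenerateTestFile {
          public static void main(String[] args) throws IOException {
              String chunk = "<select a=\"1\" & b='2'>|\\/&amp;&lt;&gt;</select>\n";
              StringBuilder sb = new StringBuilder(11 * 1024 * 1024);
              while (sb.length() < 10 * 1024 * 1024) {
                  sb.append(chunk);
              }
              Files.write(Paths.get("indexme10mb.txt"),
                      sb.toString().getBytes(StandardCharsets.UTF_8));
          }
      }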

      Expected Results

      • When the scheduled background Caesium indexing thread picks up the data, indexing should finish in a reasonable time, and Confluence Administration > Content Indexing > Queue Contents should show 0 pending items
      • When a one-time/on-demand reindexing job is manually triggered, Indexer thread(s) should finish in a reasonable time and show 100% re-index task completion

      Actual Results

      • When the scheduled background Caesium indexing thread picks up the data:
        • it spends an inordinate amount of time processing the file's payload and is extremely slow to reach the remaining indexable items (visible as a steadily increasing Number of items in queue metric under Confluence Administration > Content Indexing > Queue Contents).
          Thread dumps will show the Caesium thread in the following state:
          thread dump excerpt of a "Caesium-1-N" thread
          "Caesium-1-1" #373 daemon prio=5 os_prio=0 tid=0x000001a95b71d000 nid=0x380c runnable [0x000000361d4fd000]
             java.lang.Thread.State: RUNNABLE
          	at java.lang.String.indexOf(String.java:1769)
          	at java.lang.String.indexOf(String.java:1718)
          	at com.atlassian.confluence.util.HtmlEntityEscapeUtil.unEscapeHtmlEntities(HtmlEntityEscapeUtil.java:388)
          	at com.atlassian.confluence.search.lucene.extractor.HtmlEntityFilterExtractor.addFields(HtmlEntityFilterExtractor.java:13)
          	at com.atlassian.confluence.plugin.descriptor.ExtractorModuleDescriptor$BackwardsCompatibleExtractor.addFields(ExtractorModuleDescriptor.java:39)
          	at com.atlassian.confluence.internal.index.lucene.LuceneContentExtractor.lambda$extract$0(LuceneContentExtractor.java:68)
          	at com.atlassian.confluence.internal.index.lucene.LuceneContentExtractor$$Lambda$2093/1823784346.accept(Unknown Source)
          	...
          



      • When a one-time/on-demand reindexing job is manually triggered:
        • Indexer: N thread(s) spend an inordinate amount of time processing the file's payload, giving the appearance that indexing is stuck at a specific completion percentage (sometimes for hours).
          Thread dumps will show Indexer thread(s) in the following state:
          thread dump excerpt of an "Indexer: N" thread
          "Indexer: 1" #369 daemon prio=5 os_prio=0 cpu=51468.75ms elapsed=51.59s allocated=54757K defined_classes=45 tid=0x000002a704504800 nid=0x1108 runnable  [0x000000bdaf7fd000]
             java.lang.Thread.State: RUNNABLE
          	at com.atlassian.confluence.util.HtmlEntityEscapeUtil.unEscapeHtmlEntities(HtmlEntityEscapeUtil.java:390)
          	at com.atlassian.confluence.search.lucene.extractor.HtmlEntityFilterExtractor.addFields(HtmlEntityFilterExtractor.java:13)
          	at com.atlassian.confluence.plugin.descriptor.ExtractorModuleDescriptor$BackwardsCompatibleExtractor.addFields(ExtractorModuleDescriptor.java:39)
          	at com.atlassian.confluence.internal.index.lucene.LuceneContentExtractor.lambda$extract$0(LuceneContentExtractor.java:68)
          	at com.atlassian.confluence.internal.index.lucene.LuceneContentExtractor$$Lambda$2457/0x0000000102b61c40.accept(Unknown Source)
          	...
          



      • The behavior can also be observed by performing the following steps:
        1. set the log level of the packages com.atlassian.bonnie.search.extractor and com.atlassian.confluence.internal.index.lucene to DEBUG (see the configuration sketch after this list)
        2. track indexing progress by looking at the application logs
        3. note the text files that cause long intervals between the log lines of the following two extractors (indexme10mb.txt in this example):
          atlassian-confluence.log with indexing-related DEBUG lines
          # PAYLOAD PASSED TO HtmlEntityFilterExtractor
          2021-07-05 16:00:08,327 DEBUG [Indexer: 1] [internal.index.lucene.LuceneContentExtractor] lambda$extract$0 Adding fields to document for Attachment: indexme10mb.txt v.1 (360471) admin using BackwardsCompatibleExtractor wrapping com.atlassian.confluence.search.lucene.extractor.HtmlEntityFilterExtractor@707d0516 (confluence.extractors.core:htmlEntitiesFilterExtractor)
          # ~600 SECONDS GAP
          # PAYLOAD PASSED TO TitleExtractor
          2021-07-05 16:10:42,266 DEBUG [Indexer: 1] [internal.index.lucene.LuceneContentExtractor] lambda$extract$0 Adding fields to document for Attachment: indexme10mb.txt v.1 (360471) admin using BackwardsCompatibleExtractor wrapping com.atlassian.confluence.search.lucene.extractor.TitleExtractor@1df9ba4a (confluence.extractors.core:titleExtractor)
          

          Occasionally, the TitleExtractor line may take tens of minutes or even hours to be written for the indexer thread in question ([Indexer: 1] in the above example).
          If the logs roll over in the meantime, the HtmlEntityFilterExtractor and TitleExtractor lines for a given indexer thread and file may therefore end up in different log files.
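
      The DEBUG levels can be set at runtime via Confluence Administration > Logging and Profiling. Assuming the bundled log4j 1.x configuration (the default in the affected Confluence versions), they can also be made persistent by adding the following to <install>/confluence/WEB-INF/classes/log4j.properties:

      # Assumed log4j 1.x syntax (Confluence's bundled logging framework)
      log4j.logger.com.atlassian.bonnie.search.extractor=DEBUG
      log4j.logger.com.atlassian.confluence.internal.index.lucene=DEBUG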

      Workaround

      Currently, the only workaround is to upload zipped/compressed versions of the problematic text files:

      1. Identify and download copies of the problematic text files from Confluence.
        As outlined previously, to identify problematic text files:
        • set com.atlassian.bonnie.search.extractor and com.atlassian.confluence.internal.index.lucene to DEBUG before a reindexing attempt, then note down the text files that show long intervals between the HtmlEntityFilterExtractor and TitleExtractor lines in the logs
        • then collate that list with the results of this modified SQL statement, adapted from the article about finding large attachments in Confluence:
          SQL to find all ATTACHMENTS, type = *.txt, size >= 10 MB
          select distinct 
          	c.contentid as contentid,
          	c.spaceid as spaceid,
          	c.title as attachmenttitle,
          	u.username as uploadedby,
          	co.title as pagetitle,
          	(cn.longval/(1024*1024)) as filesizeinmb,
          	s.spacekey as spacekey,
          	s.spacename as spacename
          from 	
          	content as c
          join user_mapping as u on u.user_key = c.creator
          join content as co on c.pageid = co.contentid
          join spaces as s on c.spaceid=s.spaceid
          join contentproperties as cn on cn.contentid = c.contentid
          where 	
          	c.contenttype = 'ATTACHMENT' and c.title like '%.txt'
          	and cn.longval is not null
          	and (cn.longval/(1024*1024)) >= 10
          order by (cn.longval/(1024*1024)) desc;
          
          Query for Oracle (identical, except that Oracle does not accept the AS keyword for table aliases)
          select distinct 
          	c.contentid as contentid,
          	c.spaceid as spaceid,
          	c.title as attachmenttitle,
          	u.username as uploadedby,
          	co.title as pagetitle,
          	(cn.longval/(1024*1024)) as filesizeinmb,
          	s.spacekey as spacekey,
          	s.spacename as spacename
          from 	
          	content c
          join user_mapping u on u.user_key = c.creator
          join content co on c.pageid = co.contentid
          join spaces s on c.spaceid=s.spaceid
          join contentproperties cn on cn.contentid = c.contentid
          where 	
          	c.contenttype = 'ATTACHMENT' and c.title like '%.txt'
          	and cn.longval is not null
          	and (cn.longval/(1024*1024)) >= 10
          order by (cn.longval/(1024*1024)) desc;
          

          Note: As mentioned before, only text files containing significant quantities of special/escapable characters are to be considered.

      2. After the problematic text files have been identified, compress the downloaded copies (using any compression utility; a minimal sketch follows this list) and re-upload them.
      3. Purge the original *.txt files from the instance.
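
      A minimal sketch of step 2's compression, using only java.util.zip from the JDK (any compression utility works equally well; the file names are the hypothetical ones from the examples above):

      import java.io.IOException;
      import java.nio.file.Files;
      import java.nio.file.Path;
      import java.nio.file.Paths;
      import java.util.zip.ZipEntry;
      import java.util.zip.ZipOutputStream;

      // Compresses a problematic text file into a .zip for re-upload.
      public class ZipAttachment {
          public static void main(String[] args) throws IOException {
              Path src = Paths.get("indexme10mb.txt");
              try (ZipOutputStream zos = new ZipOutputStream(
                      Files.newOutputStream(Paths.get("indexme10mb.zip")))) {
                  zos.putNextEntry(new ZipEntry(src.getFileName().toString()));
                  Files.copy(src, zos);
                  zos.closeEntry();
              }
          }
      }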

            Assignee:
            Unassigned
            Reporter:
            Mohit Sharma
            Votes:
            3
            Watchers:
            19
