Labeled Content page (/label/) should add rel="nofollow" or noindex to Related Labels links to prevent bot crawler infinite loop.


    • 1
    • CtB - Improve Existing

      When a public-facing Confluence instance has pages with labels, crawler bots 
      can reach the Labeled Content page at /label/<labelname>. 

      From that page, the "Related Labels" section (top-right) displays links to 
      all other labels that co-occur with the current label(s). Crucially, clicking 
      any "Related Label" appends it to the URL (e.g. /label/foo+bar+baz), and the 
      new page again shows its own "Related Labels", creating a near-infinite 
      combination of crawlable URLs.
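
      To give a rough sense of scale, the sketch below counts how many distinct 
      /label/ URLs could exist for a given number of labels. It assumes the worst 
      case where every label co-occurs with every other label, and it ignores 
      label order; the function name and the sample label counts are illustrative 
      only. If the server treats /label/foo+bar and /label/bar+foo as different 
      URLs, the real number would be higher.

```python
from math import comb
from typing import Optional


def count_label_urls(num_labels: int, max_depth: Optional[int] = None) -> int:
    """Upper bound on distinct /label/a+b+... URLs, ignoring label order.

    Assumes every label co-occurs with every other label (worst case).
    max_depth limits how many labels may be combined in one URL.
    """
    if max_depth is None:
        max_depth = num_labels
    # Sum the number of k-label combinations for k = 1 .. max_depth.
    return sum(comb(num_labels, k) for k in range(1, max_depth + 1))


if __name__ == "__main__":
    for n in (5, 10, 20, 30):
        print(n, count_label_urls(n))
```

      Under these assumptions the count is 2^n - 1, so a site with only 20 
      mutually co-occurring labels already exposes 1,048,575 crawlable 
      /label/ URLs.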

      This is confirmed in the Atlassian Support KB article:
      https://support.atlassian.com/confluence/kb/web-crawler-bots-and-confluence-how-public-access-can-lead-to-performance-issues/

      The article shows real-world log examples of bots (PanguBot, bingbot, etc.) 
      generating requests like:
        GET /label/aggregate+coverage+database_management+estimation+eu+intra_regional_trade+qa+territory+world

      These bots:

      • Use unique user-agent strings
      • Come from unique IP addresses
      • Actively ignore robots.txt directives

      The resulting request volume, spread across a near-infinite set of URL 
      combinations, degrades performance and can cause outages on public 
      Confluence Data Center instances.
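
      When triaging access logs for this pattern, it can help to measure how 
      many labels each /label/ request combines. The helper below is a minimal 
      sketch, not part of Confluence: the function name is hypothetical, and it 
      only handles the /label/<label>+<label> form shown in the sample request 
      above.

```python
def labels_in_request(request_line: str) -> list:
    """Return the labels combined in a '/label/a+b+c' request, else []."""
    parts = request_line.split()
    if not parts:
        return []
    # Accept "GET /path", "GET /path HTTP/1.1", or a bare "/path".
    path = parts[1] if len(parts) >= 2 else parts[0]
    # Drop any query string before inspecting the path.
    path = path.split("?", 1)[0]
    prefix = "/label/"
    if not path.startswith(prefix):
        return []
    return [label for label in path[len(prefix):].split("+") if label]


if __name__ == "__main__":
    sample = (
        "GET /label/aggregate+coverage+database_management+estimation"
        "+eu+intra_regional_trade+qa+territory+world"
    )
    print(len(labels_in_request(sample)))  # prints 9
```

      Requests that combine many labels, like the nine-label example from the 
      KB article, are unlikely to come from a person clicking through the UI.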

      This issue was partially addressed in CONFSERVER-11940 (fixed in 2.8.2), 
      which added rel="nofollow" to label links, but that fix does not appear 
      to cover the "Related Labels" links on the /label/ Labeled Content page 
      in modern versions of Confluence Data Center.

       

      STEPS TO REPRODUCE
      1. Set up a public-facing Confluence Data Center instance with anonymous access
      2. Add labels to several pages (e.g. "kb-how-to-article", "troubleshooting")
      3. Visit /label/kb-how-to-article
      4. Inspect the HTML of the "Related Labels" section in the top-right
      5. Observe that the label links do NOT have rel="nofollow"
      6. A crawler bot will follow each Related Label link, landing on a new 
         /label/ page with its own Related Labels, which produces a 
         combinatorial explosion of URLs
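
      Steps 4 and 5 can be checked with a script instead of by hand. The sketch 
      below uses only the Python standard library to list /label/ links that 
      lack rel="nofollow". The sample markup, including the "related-labels" 
      class name, is a hypothetical stand-in; the HTML that Confluence actually 
      renders may differ, so feed it the saved page source from step 3.

```python
from html.parser import HTMLParser


class LabelLinkAudit(HTMLParser):
    """Collects hrefs of /label/ links that do not carry rel="nofollow"."""

    def __init__(self):
        super().__init__()
        self.missing_nofollow = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href") or ""
        # rel may hold several space-separated tokens, e.g. "nofollow noopener".
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "/label/" in href and "nofollow" not in rel_tokens:
            self.missing_nofollow.append(href)


def label_links_missing_nofollow(html: str) -> list:
    audit = LabelLinkAudit()
    audit.feed(html)
    return audit.missing_nofollow


if __name__ == "__main__":
    # Hypothetical markup for illustration only.
    sample = """
    <div class="related-labels">
      <a href="/label/foo+bar">bar</a>
      <a href="/label/foo+baz" rel="nofollow">baz</a>
    </div>
    """
    print(label_links_missing_nofollow(sample))  # prints ['/label/foo+bar']
```

      An empty list would indicate that every label link on the page already 
      carries rel="nofollow".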

       

      EXPECTED BEHAVIOR

      • The "Related Labels" links on the /label/ Labeled Content page should 
          have rel="nofollow" and/or the page should include a 
          <meta name="robots" content="noindex,nofollow"> tag, preventing bots 
          from following the combinatorial label URL chains.

      OR alternatively:

      • Provide an admin-level option to disable the "Related Labels" feature 
          entirely, or restrict it to logged-in users only.

       

      ACTUAL BEHAVIOR

      • Related Labels links are fully followable by crawlers, with no 
          nofollow attribute, creating a near-infinite crawl loop.

       

      WORKAROUND (per Atlassian KB):

      • Add "Disallow: /label" to robots.txt. This does NOT work against 
          bots that ignore robots.txt.
      • Block IPs at the firewall level. This is impractical when bots use 
          thousands of unique IP addresses.
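
      For crawlers that do honor robots.txt, the first workaround can be 
      sanity-checked with Python's standard urllib.robotparser module. This is 
      a sketch under assumptions: the robots.txt content below is a minimal 
      example rather than a recommended full file, and /display/DOCS/Home is a 
      hypothetical page path used only for contrast.

```python
from urllib.robotparser import RobotFileParser

# Minimal example rules; a real robots.txt will usually contain more.
ROBOTS_TXT = """User-agent: *
Disallow: /label
"""


def is_allowed(robots_txt: str, path: str, user_agent: str = "*") -> bool:
    """Return True if a compliant crawler may fetch the given path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    print(is_allowed(ROBOTS_TXT, "/label/foo+bar"))      # prints False
    print(is_allowed(ROBOTS_TXT, "/display/DOCS/Home"))  # prints True
```

      This only confirms what a well-behaved crawler would do. It offers no 
      protection against the bots described above that ignore robots.txt.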

       

      RELATED TICKETS

              Assignee:
              Unassigned
              Reporter:
              Rigel Carbajal