
[CONFCLOUD-8749] Make Confluence more configurable in how it handles web crawlers such as the Google Search Appliance

    • Our product teams collect and evaluate feedback from a number of different sources. To learn more about how we use customer feedback in the planning process, check out our new feature policy.

      NOTE: This suggestion is for Confluence Cloud. Using Confluence Server? See the corresponding suggestion.

      We have discovered that Confluence is severely impacted by the Google Search Appliance, as seen in CSP-8619.

      • Since each page is dynamic, there is no cache setting, so the GSA will hit the same page multiple times a day. In some instances we've had pages re-indexed several times an hour!
      • GSA, like Google, is pretty aggressive in following every single link. It hits page source, history, comment focus, everything.
      • Our site has ~8,000 "current versions" of pages; GSA has indexed up to 65,000 pages!
      • We're using an aggressive robots.txt file (below) but have not yet determined how successful it is.
        • The ideal solution would have been to block EVERYTHING but /display/, but quite a few of our pages use characters that are not allowed in a page name (and thus URL), such as ?, so blocking everything else would knock out a significant chunk of content.

      Suggestions:

      • Set up something in the admin console for search-engine configuration, perhaps a checklist of which pages should be crawled and which should not. The pages that should not be crawled should either get noindex and nofollow meta tags, or Confluence should generate a custom robots.txt file on its own and place it in the root directory. Either is fine (see the sketch after this list).
      • There's no reason for historical pages (version history) not to have cache headers. Set them to never expire, as they'll never change.
      • All links to edit pages, admin pages, add-comment actions, and so on should be marked nofollow.
      • At the very least, it would be ideal if ALL current-version pages began with /display/, including the ones with non-standard characters, and EVERYTHING else lived under a different stem.
        • It's almost like this today, with the notable exception that pages with non-standard titles show up under the /viewpage stem instead of /display. My robots.txt file below tries to work around that, but unfortunately there's no way to pick up those pages without also picking up all of the version-history pages. Some might want those crawled as well; I suspect most would not.
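
      To make the first three suggestions concrete, here is a sketch of what Confluence could emit (hypothetical output; the directives, header values, page ID, and markup below are illustrative, not anything Confluence produces today):

      # Hypothetical robots.txt generated from an admin-console checklist
      User-agent: *
      Allow: /display/
      Disallow: /admin/
      Disallow: /pages/

      # Hypothetical response headers for a version-history page (it never changes)
      HTTP/1.1 200 OK
      Cache-Control: public, max-age=31536000

      <!-- Hypothetical markup: a history page opts out of indexing; an edit link is not followed -->
      <meta name="robots" content="noindex, nofollow">
      <a href="/pages/editpage.action?pageId=12345" rel="nofollow">Edit</a>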

      Thank you.

      Peter

      Our "aggressive" robots.txt file, again for which we're waiting to see if GSA picks it up like we'd like:

      # Note: this file uses parameters specific to Google, parameters that are not in the robots.txt standard
      # http://www.google.com/support/webmasters/, http://www.robotstxt.org/wc/faq.html and http://en.wikipedia.org/wiki/Robots_Exclusion_Standard were used to research said parameters
      # some links shouldn't show to an anonymous browser such as the GSA but are included for completeness
      
      User-agent: * # match all bots. GSA is our primary crawler but logs indicate there may be others on our Intranet
      Crawl-delay: 5 # per http://en.wikipedia.org/wiki/Robots.txt#Nonstandard_extensions, sets the number of seconds to wait between requests to 5; may not work
      # DISABLED FOR NOW Visit-time: 0600-0845 # per http://en.wikipedia.org/wiki/Robots.txt#Extended_Standard, only visit between 6:00 AM and 8:45 AM UT (GMT), may not work
      Disallow: /admin/ # administrator links
      Disallow: /administrators.action? # remove any administrator links
      Disallow: /createrssfeed.action? # remove internal RSS links
      Disallow: /dashboard.action? # remove the dashboard, heavy resource hit
      Allow: /display # ensure primary display pages are allowed
      Disallow: /display/*&tasklist.complete= # remove tasklist links
      Disallow: /display/*?decorator=printable # remove printable version links
      Disallow: /display/*?focusedCommentId= # remove page comment focus links
      Disallow: /display/*?refresh= # prevent crawler from clicking refresh button
      Disallow: /display/*?replyToComment= # remove reply to comment links
      Disallow: /display/*?rootCommentId= # remove news comment focus links
      Disallow: /display/*?showComments=true&showCommentArea=true#addcomment # remove add comment links
      Disallow: /doexportpage.action? # remove pdf export links
      Disallow: /dopeopledirectorysearch.action # people search
      Disallow: /dosearchsite.action? # remove specific site searches
      Disallow: /exportword? # remove word export links
      Disallow: /login.action? # remove the login page
      # The next (commented-out) Allow line will be enabled when the Disallow: /pages/ line after it is removed
      # Allow: /pages/viewpage.action?* # allows indexing of pages with invalid titles for html (such as ?'s). Unfortunately currently allows page history to sneak in
      Disallow: /pages/ # this line to purge GSA of all old page entries, will be removed in next iteration so that specific /pages/ lines below take effect
      Disallow: /pages/copypage.action? # remove copy page links
      Disallow: /pages/createblogpost.action? # remove add news links
      Disallow: /pages/createpage.action? # remove add page links
      Disallow: /pages/diffpages.action? # remove page comparison pages
      Disallow: /pages/diffpagesbyversion.action? # remove page comparison links
      Disallow: /pages/editblogpost.action? # remove edit news links
      Disallow: /pages/editpage.action? # remove edit page links
      Disallow: /pages/removepage.action? # remove the remove page links
      Disallow: /pages/revertpagebacktoversion.action? # remove reversion links
      Disallow: /pages/templates # remove template pages
      Disallow: /pages/templates/ # block template indexes
      Disallow: /pages/viewchangessincelastlogin.action? # remove page comparison pages
      Disallow: /pages/viewpagesrc.action? # remove view page source links
      Disallow: /pages/viewpreviouspageversions.action? # remove the link to previous versions
      Disallow: /plugins/ # blocks plug-in calls
      Disallow: /rpc/ # remove any RPC links
      Disallow: /searchsite.action? # remove the wiki search engine pages
      Disallow: /spaces/ # remove space action pages
      Disallow: /themes/ # theme links
      Disallow: /users/ # remove user action pages
      Disallow: /x/ # remove tiny link urls
      
      # End file
      


            Metin Savignano added a comment -

            We (are trying to) use the Confluence wiki to publish support documents, but Confluence blocks all crawlers from indexing our content, so it remains invisible to the internet. Support tells me there is nothing that can be done about it. Really, Atlassian?

            BL added a comment -

            I'm trying to make publicly open wikis, one of them for a computer game. At the moment it's impossible for anyone to find anything in the wikis through search engines.
            I was asked to give input here on how this issue affects me.


            Adrien Persuy added a comment -

            +1

            K. Yamamoto added a comment -

            A customer asks for the ability to configure the nofollow attribute only for pages/comments that anonymous users can create or update.

            Sorin Sbarnea (Citrix) added a comment -

            Please note that not following HTTP standards is not a new feature; it is a product bug, and also proof that QA didn't do a good job of finding these.

            Sorin Sbarnea (Citrix) added a comment - edited

            It seems that Confluence does not follow even the most basic HTTP standard requirements, like returning 304 Not Modified when content has not changed, or including the change date as a meta tag like `<META name="date" content="13-01-08">`; the Date HTTP header contains the datetime of the response, NOT of the last change made to the page content.

            Due to this, I should not have been so surprised when I observed that 66% of our traffic is generated by GSA indexing, which was obviously reindexing each page daily and some pages more often: over 230,000 requests in 24 hours, roughly 4 per second.

            This being said, where is the enterprise quality?
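
            (For reference, a conditional GET exchange under those standards would look like the sketch below; the URL and dates are illustrative. The 304 response lets a well-behaved crawler skip re-downloading an unchanged page.)

            GET /display/DOC/Home HTTP/1.1
            Host: wiki.example.com
            If-Modified-Since: Tue, 08 Jan 2013 10:00:00 GMT

            HTTP/1.1 304 Not Modified
            Last-Modified: Tue, 08 Jan 2013 10:00:00 GMT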


            Sorin Sbarnea (Citrix) added a comment -

            Clearly, with the current setup, GSA can easily take your instance down. I just added host-load: 1 based on http://www.stonetemple.com/articles/interview-matt-cutts.shtml and am waiting to see whether it makes a difference.

            The overall solution should probably (also) use the X-Robots-Tag header, which wasn't around when this issue was created.

            https://developers.google.com/webmasters/control-crawl-index/docs/robots_meta_tag

            There are some SEO plugins that might provide control over things like this.
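
            (For example, a response that tells crawlers to skip a non-content page might carry the header below; the values are illustrative, with the header name per the Google documentation above.)

            HTTP/1.1 200 OK
            X-Robots-Tag: noindex, nofollow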


            Mary Washburn added a comment -

            Having an admin tool to create a robots.txt file would be greatly appreciated!

            Sergey Svishchev added a comment -

            Any news?

              Assignee: Unassigned
              Reporter: Peter Raymond
              Votes: 88
              Watchers: 88