Confluence Data Center / CONFSERVER-8749

Make Confluence more configurable in how it handles web crawlers such as the Google Search Appliance


    • We collect Confluence feedback from various sources, and we evaluate what we've collected when planning our product roadmap. To understand how this piece of feedback will be reviewed, see our Implementation of New Features Policy.

      NOTE: This suggestion is for Confluence Server. Using Confluence Cloud? See the corresponding suggestion.

      We have discovered that Confluence is severely impacted by the Google Search Appliance, as seen in CSP-8619.

      • Since each page is dynamic there is no cache setting, so the GSA will hit the same page multiple times a day. In some instances we've had pages re-indexed several times an hour!
      • GSA, like Google, is pretty aggressive in following every single link. It hits page source, history, comment focus, everything.
      • Our site has ~8000 "current versions" of pages. GSA has indexed up to 65k worth of pages!
      • We're using an aggressive robots.txt file (below) but have not yet determined how successful it is.
        • The ideal solution would have been to block EVERYTHING but /display/, but so many of our pages use characters in the page name (and thus the URL) that are not allowed there, such as ?, that doing so would knock out a significant chunk of content.

      Suggestions:

      • Add a search engine configuration screen to the admin console, something like a checklist of which pages should be crawled and which should not. The ones that should not be crawled should either use noindex and nofollow meta-tags, or Confluence should generate a custom robots.txt file on its own and place it in the root directory. Either is fine.
      • There's no reason for historical pages (version history) to not have cache headers. Set them to never expire as they'll never change.
      • All links to edit pages, admin pages, add comment, etc. should be marked nofollow (for example with rel="nofollow" on the link).
      • At the least, it would be ideal if ALL current-version pages began with /display/, including the ones with non-standard characters, and EVERYTHING else sat under a different stem.
        • It's almost like this today, with the notable exception that pages with non-standard titles show up under the /viewpage stem instead of /display. My robots.txt file below tries to work around that but, unfortunately, there's no way to pick up those pages without also picking up all of the version history pages. Some might want those crawled as well; I suspect most would not.
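To make the suggestions above concrete, here is a minimal Python sketch of how an admin checklist could drive both a generated robots.txt and per-page directives. Everything in it is an assumption made for illustration: the `CrawlRule` type, the `CHECKLIST` contents, and the function names are hypothetical and are not part of Confluence. The one-year `max-age` is a stand-in for "never expire".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrawlRule:
    """One row of a hypothetical admin-console checklist."""
    path_prefix: str  # URL stem the rule applies to
    crawl: bool       # True if crawlers may index URLs under this stem


# Hypothetical checklist; the stems are taken from the robots.txt in this issue.
CHECKLIST = [
    CrawlRule("/display/", crawl=True),
    CrawlRule("/pages/viewpreviouspageversions.action", crawl=False),
    CrawlRule("/pages/editpage.action", crawl=False),
    CrawlRule("/admin/", crawl=False),
]


def render_robots_txt(rules, user_agent="*"):
    """Render the checklist as the text of a robots.txt file."""
    lines = [f"User-agent: {user_agent}"]
    for rule in rules:
        directive = "Allow" if rule.crawl else "Disallow"
        lines.append(f"{directive}: {rule.path_prefix}")
    return "\n".join(lines) + "\n"


def robots_meta_tag(path, rules):
    """Return a noindex/nofollow meta tag for blocked paths, else None."""
    matches = [r for r in rules if path.startswith(r.path_prefix)]
    if not matches:
        return None
    # The most specific (longest) matching stem decides.
    rule = max(matches, key=lambda r: len(r.path_prefix))
    if rule.crawl:
        return None
    return '<meta name="robots" content="noindex, nofollow">'


def cache_headers(is_historical_version):
    """Historical versions never change, so let clients cache them for a year."""
    if is_historical_version:
        return {"Cache-Control": "max-age=31536000"}
    return {}


if __name__ == "__main__":
    print(render_robots_txt(CHECKLIST))
    print(robots_meta_tag("/admin/users", CHECKLIST))
    print(cache_headers(True))
```

Either output (the generated file or the meta tag) would satisfy the first suggestion; the sketch produces both from the same checklist so an administrator would only maintain one list.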

      Thank you.

      Peter

      Our "aggressive" robots.txt file; again, we are still waiting to see whether GSA picks it up the way we would like:

      # Note: this file uses parameters specific to Google, parameters that are not in robots.txt standard
      # http://www.google.com/support/webmasters/, http://www.robotstxt.org/wc/faq.html and http://en.wikipedia.org/wiki/Robots_Exclusion_Standard were used to research said parameters
      # some links shouldn't show to an anonymous browser such as GSA but are included for completeness
      
      User-agent: * # match all bots. GSA is our primary crawler but logs indicate there may be others on our Intranet
      Crawl-delay: 5 # per http://en.wikipedia.org/wiki/Robots.txt#Nonstandard_extensions, sets number of seconds to wait between requests to 5 seconds. may not work
      Disallow: /pages/ # this line to purge GSA of all old page entries, will be removed in next iteration so that specific /pages/ lines below take effect
      # DISABLED FOR NOW Visit-time: 0600-0845 # per http://en.wikipedia.org/wiki/Robots.txt#Extended_Standard, only visit between 6:00 AM and 8:45 AM UT (GMT), may not work
      Disallow: /admin/ # administrator links
      Disallow: /administrators.action? # remove any administrator links
      Disallow: /createrssfeed.action? # remove internal RSS links
      Disallow: /dashboard.action? # remove the dashboard, heavy resource hit
      Allow: /display # ensure primary display pages are allowed
      Disallow: /display/*&tasklist.complete= # remove tasklist links
      Disallow: /display/*?decorator=printable # remove printable version links
      Disallow: /display/*?focusedCommentId= # remove page comment focus links
      Disallow: /display/*?refresh= # prevent crawler from clicking refresh button
      Disallow: /display/*?replyToComment= # remove reply to comment links
      Disallow: /display/*?rootCommentId= # remove news comment focus links
      Disallow: /display/*?showComments=true&showCommentArea=true#addcomment # remove add comment links
      Disallow: /doexportpage.action? # remove pdf export links
      Disallow: /dopeopledirectorysearch.action # people search
      Disallow: /dosearchsite.action? # remove specific site searches
      Disallow: /exportword? # remove word export links
      Disallow: /login.action? # remove the login page
      # Next line, 26, will be enabled when line after, 27, is removed
      # Allow: /pages/viewpage.action?* # allows indexing of pages with invalid titles for html (such as ?'s). Unfortunately currently allows page history to sneak in
      Disallow: /pages/ # this line to purge GSA of all old page entries, will be removed in next iteration so that specific /pages/ lines below take effect
      Disallow: /pages/copypage.action? # remove copy page links
      Disallow: /pages/createblogpost.action? # remove add news links
      Disallow: /pages/createpage.action? # remove add page links
      Disallow: /pages/diffpages.action? # remove page comparison pages
      Disallow: /pages/diffpagesbyversion.action? # remove page comparison links
      Disallow: /pages/editblogpost.action? # remove edit news links
      Disallow: /pages/editpage.action? # remove edit page links
      Disallow: /pages/removepage.action? # remove the remove page links
      Disallow: /pages/revertpagebacktoversion.action? # remove reversion links
      Disallow: /pages/templates # remove template pages
      Disallow: /pages/templates/ # block template indexes
      Disallow: /pages/viewchangessincelastlogin.action? # remove page comparison pages
      Disallow: /pages/viewpagesrc.action? # remove view page source links
      Disallow: /pages/viewpreviouspageversions.action? # remove the link to previous versions
      Disallow: /plugins/ # blocks plug-in calls
      Disallow: /rpc/ # remove any RPC links
      Disallow: /searchsite.action? # remove the wiki search engine pages
      Disallow: /spaces/ # remove space action pages
      Disallow: /themes/ # theme links
      Disallow: /users/ # remove user action pages
      Disallow: /x/ # remove tiny link urls
      
      # End file
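
Since it is not yet clear how the crawler will interpret this file, a small offline check can help. The Python sketch below evaluates Allow/Disallow rules the way Google documents them: `*` matches any run of characters, the longest matching pattern wins, and Allow wins a tie. Whether the GSA applies exactly these rules is an assumption. The function names and the `SAMPLE` excerpt are illustrative, and the parser treats the whole file as a single `User-agent: *` group.

```python
import re

# A short excerpt of the rules above, used only for illustration.
SAMPLE = """
User-agent: *
Allow: /display # ensure primary display pages are allowed
Disallow: /display/*?focusedCommentId= # remove page comment focus links
Disallow: /pages/
Disallow: /pages/editpage.action?
"""


def parse_rules(robots_txt):
    """Return (directive, pattern) pairs, skipping comments and other fields."""
    rules = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # "#" starts a comment
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        if field.lower() in ("allow", "disallow") and value:
            rules.append((field.lower(), value))
    return rules


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern ("*" wildcard, optional "$") to a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(url_path, rules):
    """Longest matching pattern wins; Allow wins ties; no match means allowed."""
    best = None  # (pattern length, is_allow)
    for directive, pattern in rules:
        if pattern_to_regex(pattern).match(url_path):
            candidate = (len(pattern), directive == "allow")
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]


if __name__ == "__main__":
    rules = parse_rules(SAMPLE)
    for path in ("/display/DOC/Home",
                 "/display/DOC/Home?focusedCommentId=42",
                 "/pages/viewpage.action?pageId=123"):
        print(path, is_allowed(path, rules))
```

Because `#` starts a comment, a pattern such as `...showCommentArea=true#addcomment` is cut off at the `#` by this parser. Under these assumptions, the blanket `Disallow: /pages/` line also blocks `/pages/viewpage.action`, which matches the intent described in the file's comments.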
      

              Assignee: Unassigned
              Reporter: Peter Raymond
              Votes: 66
              Watchers: 59