Suggestion
Resolution: Unresolved
NOTE: This suggestion is for Confluence Server. Using Confluence Cloud? See the corresponding suggestion.
We have discovered that Confluence is severely impacted by the Google Search Appliance, as seen in CSP-8619.
- Since each page is served dynamically, no cache headers are set, so the GSA will hit the same page multiple times a day. In some instances we've had pages re-indexed several times an hour!
- GSA, like Google, is pretty aggressive in following every single link. It hits page source, history, comment focus, everything.
- Our site has ~8,000 "current versions" of pages, yet the GSA has indexed as many as 65,000 URLs!
- We're using an aggressive robots.txt file (below) but have not yet determined how successful it is.
- The ideal solution would have been to block EVERYTHING but /display/, but we have quite a few pages whose names (and thus URLs) contain characters not allowed there, such as ?, so blocking everything else would knock out a significant chunk of content.
Suggestions:
- Set up something in the admin console for search-engine configuration, perhaps a checklist of which pages should and should not be crawled. Pages that should not be crawled should either get noindex and nofollow meta tags, or Confluence could generate a custom robots.txt file on its own and place it in the root directory. Either is fine.
- There's no reason for historical pages (version history) not to have cache headers. Set them to never expire, as they'll never change (see the header sketch after this list).
- All links to edit pages, admin pages, add-comment actions, and so on should be marked rel="nofollow" (or their target pages given noindex meta tags).
- At the least, it would be ideal if ALL current-version pages began with /display/, including the ones with non-standard characters, and EVERYTHING else lived under a different stem.
- It's almost like this today, with the notable exception that pages with non-standard titles show up under the /pages/viewpage.action stem instead of /display. My robots.txt file below tries to work around that, but unfortunately there's no way to pick up those pages without also picking up all of the version-history pages. Some might want those crawled as well; I suspect most would not.
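To make the cache-header and nofollow suggestions concrete, here is a minimal sketch of the intended policy. Confluence itself is a Java servlet application, so this Python function is only an illustration; the path fragments and header values are assumptions, not anything Confluence actually does today.

# Hypothetical sketch: map Confluence URL paths to the crawler-related
# response headers suggested above. The path fragments are assumptions.
def crawler_headers(path: str) -> dict:
    headers = {}
    # Version-history views never change, so caches can keep them forever.
    if ("viewpreviouspageversions.action" in path
            or "diffpagesbyversion.action" in path):
        headers["Cache-Control"] = "public, max-age=31536000"
    # Edit/remove/copy actions should never be indexed or followed;
    # X-Robots-Tag is the header equivalent of a noindex/nofollow meta tag.
    if any(action in path for action in
           ("editpage.action", "removepage.action", "copypage.action")):
        headers["X-Robots-Tag"] = "noindex, nofollow"
    return headers

print(crawler_headers("/pages/viewpreviouspageversions.action?pageId=42"))
# {'Cache-Control': 'public, max-age=31536000'}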
Thank you.
Peter
Our "aggressive" robots.txt file, again for which we're waiting to see if GSA picks it up like we'd like:
# Note: this file uses parameters specific to Google, parameters that are not in the robots.txt standard
# http://www.google.com/support/webmasters/, http://www.robotstxt.org/wc/faq.html and http://en.wikipedia.org/wiki/Robots_Exclusion_Standard were used to research said parameters
# some links shouldn't show to an anonymous browser such as GSA but are included for completeness
User-agent: * # match all bots. GSA is our primary crawler but logs indicate there may be others on our Intranet
Crawl-delay: 5 # per http://en.wikipedia.org/wiki/Robots.txt#Nonstandard_extensions, sets number of seconds to wait between requests to 5 seconds. may not work
Disallow: /pages/ # this line to purge GSA of all old page entries, will be removed in next iteration so that specific /pages/ lines below take effect
# DISABLED FOR NOW Visit-time: 0600-0845 # per http://en.wikipedia.org/wiki/Robots.txt#Extended_Standard, only visit between 6:00 AM and 8:45 AM UT (GMT), may not work
Disallow: /admin/ # administrator links
Disallow: /adminstrators.action? # remove any administrator links
Disallow: /createrssfeed.action? # remove internal RSS links
Disallow: /dashboard.action? # remove the dashboard, heavy resource hit
Allow: /display # ensure primary display pages are allowed
Disallow: /display/*&tasklist.complete= # remove tasklist links
Disallow: /display/*?decorator=printable # remove printable version links
Disallow: /display/*?focusedCommentId= # remove page comment focus links
Disallow: /display/*?refresh= # prevent crawler from clicking refresh button
Disallow: /display/*?replyToComment= # remove reply to comment links
Disallow: /display/*?rootCommentId= # remove news comment focus links
Disallow: /display/*?showComments=true&showCommentArea=true#addcomment # remove add comment links
Disallow: /doexportpage.action? # remove pdf export links
Disallow: /dopeopledirectorysearch.action # people search
Disallow: /dosearchsite.action? # remove specific site searches
Disallow: /exportword? # remove word export links
Disallow: /login.action? # remove the login page
# Next line, 26, will be enabled when line after, 27, is removed
# Allow: /pages/viewpage.action?* # allows indexing of pages with invalid titles for html (such as ?'s). Unfortunately currently allows page history to sneak in
Disallow: /pages/ # this line to purge GSA of all old page entries, will be removed in next iteration so that specific /pages/ lines below take effect
Disallow: /pages/copypage.action? # remove copy page links
Disallow: /pages/createblogpost.action? # remove add news links
Disallow: /pages/createpage.action? # remove add page links
Disallow: /pages/diffpages.action? # remove page comparison pages
Disallow: /pages/diffpagesbyversion.action? # remove page comparison links
Disallow: /pages/editblogpost.action? # remove edit news links
Disallow: /pages/editpage.action? # remove edit page links
Disallow: /pages/removepage.action? # remove the remove page links
Disallow: /pages/revertpagebacktoversion.action? # remove reversion links
Disallow: /pages/templates # remove template pages
Disallow: /pages/templates/ # block template indexes
Disallow: /pages/viewchangessincelastlogin.action? # remove page comparison pages
Disallow: /pages/viewpagesrc.action? # remove view page source links
Disallow: /pages/viewpreviouspageversions.action? # remove the link to previous versions
Disallow: /plugins/ # blocks plug-in calls
Disallow: /rpc/ # remove any RPC links
Disallow: /searchsite.action? # remove the wiki search engine pages
Disallow: /spaces/ # remove space action pages
Disallow: /themes/ # theme links
Disallow: /users/ # remove user action pages
Disallow: /x/ # remove tiny link urls
# End file
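A quick way to sanity-check the file above: Python's standard urllib.robotparser does not understand the Google-specific * wildcards used here, so below is a small sketch of Google-style rule matching (the most specific, i.e. longest, matching rule wins, and Allow wins ties). It approximates Google's documented behavior, not the GSA's actual implementation.

import re

def rule_to_regex(rule):
    # Escape regex metacharacters, then restore '*' wildcards and a
    # trailing '$' anchor, per Google's documented extensions.
    pattern = re.escape(rule).replace(r"\*", ".*")
    if pattern.endswith(r"\$"):
        pattern = pattern[:-2] + "$"
    return re.compile(pattern)

def is_allowed(path, rules):
    # rules: (directive, value) pairs, e.g. ("Disallow", "/pages/").
    best = None  # (rule length, is_allow): longest rule wins, Allow wins ties
    for directive, value in rules:
        if value and rule_to_regex(value).match(path):
            candidate = (len(value), directive.lower() == "allow")
            if best is None or candidate > best:
                best = candidate
    return best is None or best[1]

# Spot-check a few of the rules above:
rules = [("Allow", "/display"),
         ("Disallow", "/display/*?refresh="),
         ("Disallow", "/pages/")]
print(is_allowed("/display/DOCS/Home", rules))                  # True
print(is_allowed("/display/DOCS/Home?refresh=900", rules))      # False
print(is_allowed("/pages/viewpage.action?pageId=123", rules))   # False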
is caused by:
- CONFSERVER-9289 Resources served from /display/* are not sent with correct cache headers (Closed)
- CONFSERVER-9290 Improve browser-caching and back-navigation by removing the "no-store" cache control headers (Closed)
is duplicated by:
- CONFSERVER-27053 Bundle a default robots.txt with Confluence (Closed)
is related to:
- CONFSERVER-27053 Bundle a default robots.txt with Confluence (Closed)
relates to:
- CONFCLOUD-8749 Make Confluence more configureable in how it handles web crawlers such as the Google Search Appliance (Gathering Interest)