
Type: Suggestion
Resolution: Unresolved
Fix Version/s: None
Component/s: None
Labels:

UIS:
21
Support reference count:
13
Feedback Policy:

We collect Confluence feedback from various sources, and we evaluate what we've collected when planning our product roadmap. To understand how this piece of feedback will be reviewed, see our Implementation of New Features Policy.

NOTE: This suggestion is for Confluence Server. Using Confluence Cloud? See the corresponding suggestion.

We have discovered that Confluence is severely impacted by the Google Search Appliance, as seen in CSP-8619.

Since each page is dynamic there is no cache setting, so the GSA will hit the same page multiple times a day. In some instances we've had pages re-indexed several times an hour!
GSA, like Google, is pretty aggressive in following every single link. It hits page source, history, comment focus, everything.
Our site has ~8000 "current versions" of pages. GSA has indexed up to 65k worth of pages!
We're using an aggressive robots.txt file (below) but have not yet determined how successful it is.
- The ideal solution would have been to block EVERYTHING but /display/, but so many of our pages use characters in the page name (and thus the URL) that are not allowed there, such as ?, that doing so would knock out a significant chunk of content.

Suggestions:

Add a search engine configuration section to the admin console, for example a checklist of which page types should be crawled and which should not. Pages that should not be crawled could either carry noindex and nofollow meta tags, or Confluence could generate a custom robots.txt file itself and place it in the root directory. Either is fine.
There's no reason for historical pages (version history) to not have cache headers. Set them to never expire as they'll never change.
All links to edit pages, admin pages, add comment, etc. should be marked nofollow.
At the least, it would be ideal if ALL current version pages had /display/ at the beginning, including the ones with non-standard characters, and EVERYTHING else sat under a different stem.
- It's almost like this today, with the notable exception that pages with non-standard titles show up under the /viewpage stem instead of /display. My robots.txt file below tries to work around that but, unfortunately, there's no way to pick up those pages without also picking up all of the version history pages. Some might want those crawled as well; I suspect most would not.
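
To make the first suggestion concrete, here is a minimal sketch of how an admin-console checklist could be turned into a generated robots.txt. Nothing here is an existing Confluence API: the `CHECKLIST` structure and the `build_robots_txt` function are hypothetical names chosen for illustration, and the URL patterns are taken from the robots.txt file further down in this report.

```python
# Hypothetical checklist: page type -> (URL pattern, should it be crawled?)
# The patterns are copied from the hand-written robots.txt in this report.
CHECKLIST = {
    "current pages": ("/display", True),
    "page history": ("/pages/viewpreviouspageversions.action?", False),
    "page source": ("/pages/viewpagesrc.action?", False),
    "edit page": ("/pages/editpage.action?", False),
}


def build_robots_txt(checklist, user_agent="*"):
    """Render one Allow/Disallow line per checklist entry."""
    lines = [f"User-agent: {user_agent}"]
    for label, (pattern, crawl) in checklist.items():
        directive = "Allow" if crawl else "Disallow"
        lines.append(f"{directive}: {pattern} # {label}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_robots_txt(CHECKLIST), end="")
```

An administrator would only tick or untick page types; the file itself would be regenerated on each change rather than edited by hand.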

Thank you.

Peter

Our "aggressive" robots.txt file; again, we're still waiting to see whether GSA picks it up the way we'd like:

# Note: this file uses parameters specific to Google, parameters that are not in the robots.txt standard
# http://www.google.com/support/webmasters/, http://www.robotstxt.org/wc/faq.html and http://en.wikipedia.org/wiki/Robots_Exclusion_Standard were used to research said parameters
# some links shouldn't show to an anonymous browser such as GSA but are included for completeness

User-agent: * # match all bots. GSA is our primary crawler but logs indicate there may be others on our Intranet
Crawl-delay: 5 # per http://en.wikipedia.org/wiki/Robots.txt#Nonstandard_extensions, sets number of seconds to wait between requests to 5 seconds. may not work
Disallow: /pages/ # this line to purge GSA of all old page entries, will be removed in next iteration so that specific /pages/ lines below take effect
# DISABLED FOR NOW Visit-time: 0600-0845 # per http://en.wikipedia.org/wiki/Robots.txt#Extended_Standard, only visit between 6:00 AM and 8:45 AM UT (GMT), may not work
Disallow: /admin/ # administrator links
Disallow: /adminstrators.action? # remove any administrator links
Disallow: /createrssfeed.action? # remove internal RSS links
Disallow: /dashboard.action? # remove the dashboard, heavy resource hit
Allow: /display # ensure primary display pages are allowed
Disallow: /display/*&tasklist.complete= # remove tasklist links
Disallow: /display/*?decorator=printable # remove printable version links
Disallow: /display/*?focusedCommentId= # remove page comment focus links
Disallow: /display/*?refresh= # prevent crawler from clicking refresh button
Disallow: /display/*?replyToComment= # remove reply to comment links
Disallow: /display/*?rootCommentId= # remove news comment focus links
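Disallow: /display/*?showComments=true&showCommentArea=true # remove add comment links (the trailing "#addcomment" fragment was dropped: "#" starts a comment in robots.txt, so it was never part of the rule)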
Disallow: /doexportpage.action? # remove pdf export links
Disallow: /dopeopledirectorysearch.action # people search
Disallow: /dosearchsite.action? # remove specific site searches
Disallow: /exportword? # remove word export links
Disallow: /login.action? # remove the login page
# Next line, 26, will be enabled when line after, 27, is removed
# Allow: /pages/viewpage.action?* # allows indexing of pages with invalid titles for html (such as ?'s). Unfortunately currently allows page history to sneak in
Disallow: /pages/ # this line to purge GSA of all old page entries, will be removed in next iteration so that specific /pages/ lines below take effect
Disallow: /pages/copypage.action? # remove copy page links
Disallow: /pages/createblogpost.action? # remove add news links
Disallow: /pages/createpage.action? # remove add page links
Disallow: /pages/diffpages.action? # remove page comparison pages
Disallow: /pages/diffpagesbyversion.action? # remove page comparison links
Disallow: /pages/editblogpost.action? # remove edit news links
Disallow: /pages/editpage.action? # remove edit page links
Disallow: /pages/removepage.action? # remove the remove page links
Disallow: /pages/revertpagebacktoversion.action? # remove reversion links
Disallow: /pages/templates # remove template pages
Disallow: /pages/templates/ # block template indexes
Disallow: /pages/viewchangessincelastlogin.action? # remove page comparison pages
Disallow: /pages/viewpagesrc.action? # remove view page source links
Disallow: /pages/viewpreviouspageversions.action? # remove the link to previous versions
Disallow: /plugins/ # blocks plug-in calls
Disallow: /rpc/ # remove any RPC links
Disallow: /searchsite.action? # remove the wiki search engine pages
Disallow: /spaces/ # remove space action pages
Disallow: /themes/ # theme links
Disallow: /users/ # remove user action pages
Disallow: /x/ # remove tiny link urls

# End file
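
Since it is not yet clear how successful these rules are, a small offline check can help before waiting on the next crawl. The sketch below is an assumption-laden illustration, not a description of how GSA behaves: it implements the precedence Google documents for its own crawler (the longest matching pattern wins, and Allow wins a tie), with `*` matching any run of characters and a trailing `$` anchoring the end. The function names `rule_to_regex` and `is_allowed`, the `RULES` subset, and the sample URLs are all hypothetical.

```python
import re


def rule_to_regex(pattern):
    """Translate a robots.txt path pattern into an anchored-at-start regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape everything literally, then let "*" match any run of characters.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(rules, path):
    """Return True if `path` may be crawled under `rules`.

    `rules` is a list of (directive, pattern) pairs, directive being
    "allow" or "disallow". Assumed precedence: longest pattern wins,
    and "allow" wins when two matching patterns have equal length.
    """
    best = None  # (pattern length, is_allow)
    for directive, pattern in rules:
        if not pattern:
            continue  # an empty pattern matches nothing
        if rule_to_regex(pattern).match(path):
            candidate = (len(pattern), directive == "allow")
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]


# A small subset of the rules above, for illustration only.
RULES = [
    ("allow", "/display"),
    ("disallow", "/display/*?focusedCommentId="),
    ("disallow", "/pages/"),
    ("disallow", "/pages/viewpagesrc.action?"),
]

if __name__ == "__main__":
    for url in ("/display/DOC/Home",
                "/display/DOC/Home?focusedCommentId=123",
                "/pages/viewpage.action?pageId=42"):
        print(url, "->", "allowed" if is_allowed(RULES, url) else "blocked")
```

Feeding a sample of URLs from the access logs through a check like this shows which requests the rules are intended to block; whether a given crawler honors the wildcard syntax still has to be confirmed from its own logs.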

is caused by

CONFSERVER-9289 Resources served from /display/* are not sent with correct cache headers

Closed

CONFSERVER-9290 Improve browser-caching and back-navigation by removing the "no-store" cache control headers

Closed

is duplicated by

CONFSERVER-27053 Bundle a default robots.txt with Confluence

Closed

is related to

CONFSERVER-27053 Bundle a default robots.txt with Confluence

Closed

relates to

CONFCLOUD-8749 Make Confluence more configureable in how it handles web crawlers such as the Google Search Appliance

Gathering Interest
