Confluence Data Center / CONFSERVER-8749

Make Confluence more configurable in how it handles web crawlers such as the Google Search Appliance

    • We collect Confluence feedback from various sources, and we evaluate what we've collected when planning our product roadmap. To understand how this piece of feedback will be reviewed, see our Implementation of New Features Policy.

      NOTE: This suggestion is for Confluence Server. Using Confluence Cloud? See the corresponding suggestion.

      We have discovered that Confluence is severely impacted by the Google Search Appliance, as seen in CSP-8619.

      • Since each page is dynamic, there is no cache setting, so the GSA will hit the same page multiple times a day. In some instances we've had pages re-indexed several times an hour!
      • The GSA, like Google, is quite aggressive in following every single link. It hits page source, history, comment focus, everything.
      • Our site has ~8,000 "current versions" of pages. The GSA has indexed up to 65k pages!
      • We're using an aggressive robots.txt file (below) but have not yet determined how successful it is.
        • The ideal solution would have been to block EVERYTHING except /display/, but we have quite a few pages whose names (and thus URLs) use disallowed characters, such as ?, so doing that would knock out a significant chunk of content.

      Suggestions:

      • Add something to the admin console for search engine configuration, perhaps a checklist of which pages should be crawled and which should not. The pages that should not be crawled should either get noindex and nofollow meta tags, or Confluence should generate a custom robots.txt file on its own and place it in the root directory. Either is fine (see the sketch after this list).
      • There's no reason for historical pages (version history) not to have cache headers. Set them to never expire, as they'll never change.
      • All links to edit pages, admin pages, add comment, etc. should be marked nofollow.
      • At the very least, it would be ideal if ALL current-version pages started with /display/, including the ones with non-standard characters, with EVERYTHING else under a different stem.
        • It's almost like this today, with the notable exception that pages with non-standard titles show up under the /pages/viewpage.action stem instead of /display/. My robots.txt file below tries to work around that, but unfortunately there's no way to pick up those pages without also picking up all of the version history pages. Some might want those crawled as well; I suspect most would not.
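
      For illustration only, a sketch of the kind of output such an admin option might generate; this is not something Confluence does today:

      <!-- emitted on pages an administrator marks "do not index" (hypothetical) -->
      <meta name="robots" content="noindex, nofollow">

      # or an auto-generated robots.txt placed in the web root (hypothetical)
      User-agent: *
      Disallow: /pages/viewpreviouspageversions.action
      Allow: /display/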

      Thank you.

      Peter

      Our "aggressive" robots.txt file, again for which we're waiting to see if GSA picks it up like we'd like:

      # Note: this file uses parameters specific to Google, parameters that are not in the robots.txt standard
      # http://www.google.com/support/webmasters/, http://www.robotstxt.org/wc/faq.html and http://en.wikipedia.org/wiki/Robots_Exclusion_Standard were used to research said parameters
      # some links shouldn't show to an anonymous browser such as the GSA but are included for completeness
      
      User-agent: * # match all bots. GSA is our primary crawler but logs indicate there may be others on our Intranet
      Crawl-delay: 5 # per http://en.wikipedia.org/wiki/Robots.txt#Nonstandard_extensions, sets number of seconds to wait between requests to 5 seconds. may not work
      # DISABLED FOR NOW Visit-time: 0600-0845 # per http://en.wikipedia.org/wiki/Robots.txt#Extended_Standard, only visit between 6:00 AM and 8:45 AM UT (GMT), may not work
      Disallow: /admin/ # administrator links
      Disallow: /adminstrators.action? # remove any administrator links
      Disallow: /createrssfeed.action? # remove internal RSS links
      Disallow: /dashboard.action? # remove the dashboard, heavy resource hit
      Allow: /display # ensure primary display pages are allowed
      Disallow: /display/*&tasklist.complete= # remove tasklist links
      Disallow: /display/*?decorator=printable # remove printable version links
      Disallow: /display/*?focusedCommentId= # remove page comment focus links
      Disallow: /display/*?refresh= # prevent crawler from clicking refresh button
      Disallow: /display/*?replyToComment= # remove reply to comment links
      Disallow: /display/*?rootCommentId= # remove news comment focus links
      Disallow: /display/*?showComments=true&showCommentArea=true#addcomment # remove add comment links
      Disallow: /doexportpage.action? # remove pdf export links
      Disallow: /dopeopledirectorysearch.action # people search
      Disallow: /dosearchsite.action? # remove specific site searches
      Disallow: /exportword? # remove word export links
      Disallow: /login.action? # remove the login page
      # The commented-out Allow line below will be enabled when the Disallow: /pages/ line that follows it is removed
      # Allow: /pages/viewpage.action?* # allows indexing of pages with invalid titles for html (such as ?'s). Unfortunately currently allows page history to sneak in
      Disallow: /pages/ # this line to purge GSA of all old page entries, will be removed in next iteration so that specific /pages/ lines below take effect
      Disallow: /pages/copypage.action? # remove copy page links
      Disallow: /pages/createblogpost.action? # remove add news links
      Disallow: /pages/createpage.action? # remove add page links
      Disallow: /pages/diffpages.action? # remove page comparison pages
      Disallow: /pages/diffpagesbyversion.action? # remove page comparison links
      Disallow: /pages/editblogpost.action? # remove edit news links
      Disallow: /pages/editpage.action? # remove edit page links
      Disallow: /pages/removepage.action? # remove the remove page links
      Disallow: /pages/revertpagebacktoversion.action? # remove reversion links
      Disallow: /pages/templates # remove template pages
      Disallow: /pages/templates/ # block template indexes
      Disallow: /pages/viewchangessincelastlogin.action? # remove page comparison pages
      Disallow: /pages/viewpagesrc.action? # remove view page source links
      Disallow: /pages/viewpreviouspageversions.action? # remove the link to previous versions
      Disallow: /plugins/ # blocks plug-in calls
      Disallow: /rpc/ # remove any RPC links
      Disallow: /searchsite.action? # remove the wiki search engine pages
      Disallow: /spaces/ # remove space action pages
      Disallow: /themes/ # theme links
      Disallow: /users/ # remove user action pages
      Disallow: /x/ # remove tiny link urls
      
      # End file
      


            [CONFSERVER-8749] Make Confluence more configurable in how it handles web crawlers such as the Google Search Appliance

            Sorin Sbarnea (Citrix) added a comment -

            Please note that not following HTTP standards is not a new feature; it is a product bug, and also proof that QA didn't do a good job of finding these issues.


            Sorin Sbarnea (Citrix) added a comment - edited

            It seems that Confluence does not follow even the most basic HTTP standard requirements, such as returning 304 when the content has not changed, or including the change date as a meta tag like `<META name="date" content="13-01-08">`; the Date in the HTTP headers contains the datetime of the response, NOT of the last change made to the page content.

            Due to this, I should not have been so surprised when I observed that 66% of our traffic is generated by GSA indexing, which was obviously re-indexing each page daily, and some pages more often: over 230,000 requests in 24 hours, ~4/second.

            This being said, where is the enterprise quality?
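
            For reference, the standard mechanism being described here is an HTTP conditional request; a minimal illustration (the URL and dates are placeholders):

            GET /display/SPACE/Some+Page HTTP/1.1
            If-Modified-Since: Tue, 08 Jan 2013 10:00:00 GMT

            HTTP/1.1 304 Not Modified
            Last-Modified: Tue, 08 Jan 2013 10:00:00 GMT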


            Sorin Sbarnea (Citrix) added a comment -

            Clearly, with the current setup the GSA can easily put your instance down. I just added host-load: 1 based on http://www.stonetemple.com/articles/interview-matt-cutts.shtml and am waiting to see if it makes a difference.


            Yuji Shinozaki added a comment -

            The overall solution should probably (also) use the X-Robots-Tag response header, which wasn't around when this issue was created.

            https://developers.google.com/webmasters/control-crawl-index/docs/robots_meta_tag

            There are some SEO plugins that might provide control over things like this.
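
            As a minimal illustration of that header on a response (directive values as documented in the Google page linked above):

            HTTP/1.1 200 OK
            X-Robots-Tag: noindex, nofollow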


            Mary Washburn added a comment -

            Having an Admin tool to create a robots.txt file would be greatly appreciated!


            Sergey Svishchev added a comment -

            Any news?


            William Zanchet (Inactive) added a comment -

            Hi Chris, let me try to reach one of our team members to see if this is on our roadmap. As soon as I get an answer I'll let you know.


            Feldhacker added a comment -

            Any updates on this 6-year-old issue?


            Ben added a comment -

            Commenting here as it seems related: we would like to be able to block certain spaces from being indexed, either via an option to add custom robots.txt entries or via a noindex meta tag.
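
            By way of illustration, a per-space robots.txt exclusion would look like the WikiDevQA line in the file above (the space key here is hypothetical):

            Disallow: /display/INTERNAL/ # keep this space out of the index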


            Daniel Flower added a comment -

            It seems that any custom robots.txt file needs to be updated as Confluence evolves, so at the very least a robots.txt should be shipped with Confluence that excludes PDF/Word exports, and perhaps historical page versions etc.
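
            For reference, a minimal shipped default along those lines could reuse the export exclusions already shown in the robots.txt above (a sketch, not an Atlassian-provided file):

            User-agent: *
            Disallow: /doexportpage.action? # remove pdf export links
            Disallow: /exportword? # remove word export links
            Disallow: /pages/viewpreviouspageversions.action? # remove links to previous page versions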


            childnode added a comment - edited

            This problem affects all Confluence instances that are indexed by search engines. As seen in this search result, historic versions of pages should NOT be indexed and should therefore be disallowed! If different users have different needs, please add an option to "disallow historic pages from indexing".

            Thank you and with kind regards,
            ~Marcel


            jeff peichel added a comment -

            Has anyone created a robots.txt for 3.1? Not sure what differences there would be...


            Peter Raymond added a comment - edited

            Given the last two comments, I'd drop back from the "set cache headers to never expire" stance and move to putting <META NAME="ROBOTS" CONTENT="NOINDEX,FOLLOW"> on historical pages.

            Anyway, more analysis of our logs has caused us to add the following to our robots.txt file now that we've moved to v2.9.2:

            Disallow: /spaces/usage/report.action # remove report generator
            Disallow: /checkinout/ # remove Attachment Checkout plug-in links
            Disallow: /customspacemgmt/ # remove space management links
            Disallow: /homepage.action # remove home page action
            Disallow: /spaces/space-bookmarks.action? # remove social bookmarking links
            

            The social bookmarking one is the critical one of this bunch. On our system, at least, it takes around 2 minutes for any "&mode=bookmarksfor" page to come up in a browser. Since the crawler hits a bunch of these, it may be the source of our random slowdowns.

            I'd still really like a simple config page in Confluence Admin where I could just go down and check yes/no on which types of pages we want indexed, and then have Confluence generate the proper page headers on the fly as needed, particularly the "nofollow" on links to pages that we don't want Google to even see, much less crawl.
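
            For illustration, link-level nofollow of the kind being asked for is usually expressed with a rel attribute on the link itself (the pageId here is hypothetical):

            <a href="/pages/editpage.action?pageId=12345" rel="nofollow">Edit</a>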


            Don Willis added a comment -

            "There's no reason for historical pages (version history) to not have cache headers. Set them to never expire as they'll never change."

            Aside from Chris's point about page renaming, the dynamically rendered content of a historical page can change. For example:

            • A user's permissions might change, allowing them to see the content of an include macro.
            • A page linked to by the historic page might be created or removed, which changes the appearance of the link.
            • Dynamic data such as a JIRA Issues macro will change frequently.


            Christopher Owen [Atlassian] added a comment -

            "There's no reason for historical pages (version history) to not have cache headers. Set them to never expire as they'll never change."

            Unfortunately that's not quite true at the moment. A page can be deleted (or renamed) and a new page with the same name created, which will start the version count over again. This makes it pretty much impossible to declare historical pages as never expiring.


            Paul Curren added a comment -

            This problem should be addressed by the fixes in the referenced defects.


            Peter Raymond added a comment -

            I was looking at the "There's no reason for historical pages (version history) to not have cache headers. Set them to never expire as they'll never change." suggestion in the original request and started thinking: maybe it would be better for historical pages to have <META NAME="ROBOTS" CONTENT="NOINDEX,FOLLOW"> added to them automatically when a newer version is added. Thoughts?

            Also, we've continued to refine the robots.txt file based on ongoing log analysis. Basically, we keep finding new URLs that the crawler is hitting that we'd rather it not. Here's the latest:

            # Note: this file uses parameters specific to Google, parameters that are not in the robots.txt standard
            # http://www.google.com/support/webmasters/, http://www.robotstxt.org/wc/faq.html and http://en.wikipedia.org/wiki/Robots_Exclusion_Standard were used to research said parameters
            # some links shouldn't show to an anonymous browser such as the GSA but are included for completeness
            
            # Updated 2007.06.30.09.44
            
            User-agent: * # match all bots. The Google Search Appliance (GSA) is our primary crawler but logs indicate there may be others on our Intranet
            Crawl-delay: 5 # per http://en.wikipedia.org/wiki/Robots.txt#Nonstandard_extensions, sets number of seconds to wait between requests to 5 seconds. may not work
            Request-rate: 1/5 # per http://en.wikipedia.org/wiki/Robots.txt#Extended_Standard, maximum rate is one page every 5 seconds. may not work
            # DISABLED FOR NOW Visit-time: 0600-0845 # per http://en.wikipedia.org/wiki/Robots.txt#Extended_Standard, only visit between 6:00 AM and 8:45 AM UT (GMT), may not work
            Disallow: /*?decorator=printable # remove printable version links, non-display URLs
            Disallow: /*javascript* # remove any javascript links, per log analysis
            Disallow: /admin/ # administrator links
            Disallow: /adminstrators.action? # remove any administrator links
            Disallow: /createrssfeed.action? # remove internal RSS links
            # Disallow: /dashboard.action # primary dashboard link
            Disallow: /dashboard.action? # remove secondary dashboard links, not needed for anonymous crawling
            Allow: /display # ensure primary display pages are allowed
            Disallow: /display/*&tasklist.complete= # remove tasklist links
            Disallow: /display/*&tasklist.uncomplete= # remove tasklist links
            Disallow: /display/*?decorator=normal # remove redundant link for standard display
            Disallow: /display/*?decorator=printable # remove printable version links, display URLs
            Disallow: /display/*?focusedCommentId= # remove page comment focus links
            Disallow: /display/*?refresh= # prevent crawler from clicking refresh button
            Disallow: /display/*?replyToComment= # remove reply to comment links
            Disallow: /display/*?rootCommentId= # remove news comment focus links
            Disallow: /display/*?showChildren= # remove the children view links, not needed, anonymous defaults to showing children
            # Disallow: /display/*?showChildren=true # remove show children link - DISABLED for now so crawler can see more "real" pages
            Disallow: /display/*?sortBy= # remove sort by links for pages with embedded attachments, not needed
            Disallow: /display/*showComments= # remove comment links
            Disallow: /display/WikiDevQA/ # remove the DEV Space from being indexed
            Disallow: /doexportpage.action? # remove pdf export links
            Disallow: /dopeopledirectorysearch.action # people search
            Disallow: /dosearchsite.action # remove generic site searches
            Disallow: /dosearchsite.action? # remove specific site searches
            Disallow: /download/attachments/*?version= # knock out previous versions of attachments
            Disallow: /download/userResources/ # knock out user resource links, per log analysis
            Disallow: /download/resources/ # knock out resource links, per log analysis
            Disallow: /dwr/ # knock out DWR links, per log analysis and http://getahead.org/dwr/
            Disallow: /exportword? # remove word export links
            Disallow: /form-mail-plugin/ # remove form mail links
            Disallow: /label/ # remove all label links, per vendor analysis
            Disallow: /labels/ # remove all label links, per vendor analysis
            Disallow: /labels-javascript # remove label javascript
            Allow: /labels/listlabels-alphaview.action # allow label index page
            Disallow: /login.action # remove the login page
            Disallow: /login.action? # remove the login page derivatives
            # The commented-out Allow line below will be enabled when the Disallow: /pages/ line that follows it is removed
            # Allow: /pages/viewpage.action?* # allows indexing of pages with invalid titles for html (such as ?'s). Unfortunately currently allows page history to sneak in
            Disallow: /pages/ # this line to purge GSA of all old page entries, _may_ eventually be removed so that specific /pages/ lines below take effect and non-html compatible titled pages can be crawled
            # DISABLED FOR NOW Disallow: /pages/pageinfo.action? # exclude all the previous versions of pages by excluding Page Info pages
            # Disallow: /pages/*?showChildren=true # remove show children link - DISABLED for now so crawler can see more "real" pages
            Disallow: /pages/*&tasklist.complete= # remove tasklist links
            Disallow: /pages/*&tasklist.uncomplete= # remove tasklist links
            Disallow: /pages/*?decorator=normal # remove redundant link for standard display
            Disallow: /pages/*?decorator=printable # remove printable version links, display URLs
            Disallow: /pages/*?focusedCommentId= # remove page comment focus links
            Disallow: /pages/*?refresh= # prevent crawler from clicking refresh button
            Disallow: /pages/*?replyToComment= # remove reply to comment links
            Disallow: /pages/*?rootCommentId= # remove news comment focus links
            Disallow: /pages/*?showChildren=false # remove the don't show children link, not needed, per log analysis
            Disallow: /pages/*?sortBy= # remove sort by links for pages with embedded attachments, not needed
            Disallow: /pages/*showComments= # remove comment links
            Disallow: /pages/copypage.action? # remove copy page links
            Disallow: /pages/createblogpost.action? # remove add news links
            Disallow: /pages/createpage.action? # remove add page links
            Disallow: /pages/diffpages.action? # remove page comparison pages
            Disallow: /pages/diffpagesbyversion.action? # remove page comparison links
            Disallow: /pages/editblogpost.action? # remove edit news links
            Disallow: /pages/editpage.action? # remove edit page links
            Disallow: /pages/removepage.action? # remove the remove page links
            Disallow: /pages/revertpagebacktoversion.action? # remove reversion links
            Disallow: /pages/templates # remove template pages
            Disallow: /pages/templates/ # block template indexes
            Disallow: /pages/viewchangessincelastlogin.action? # remove page comparison pages
            Disallow: /pages/viewpage.action?*&showComments # remove comments links
            Disallow: /pages/viewpage.action?spaceKey= # remove page view links that are "duplicates" of the Display URL pages
            Disallow: /pages/viewpagesrc.action? # remove view page source links
            Disallow: /pages/viewpreviouspageversions.action? # remove the link to previous versions
            Disallow: /plugins/ # blocks plug-in calls
            Disallow: /rpc/ # remove any RPC links
            Disallow: /s/ # remove any links to label calls down this path, per log analysis
            Disallow: /searchsite.action? # remove the wiki search engine pages
            Disallow: /spaces/*&decorator=printable # remove printable version links
            Disallow: /spaces/blogrss.action? # remove rss links
            Disallow: /spaces/listrssfeeds.action? # remove rss links
            Disallow: /spaces/viewmail.action? # remove view mail links (we don't have email integration enabled anyway)
            Disallow: /spaces/viewmailarchive.action? # remove view mail archive links (we don't have email integration enabled anyway)
            Disallow: /spaces/viewthread.action? # remove view mail thread links (we don't have email integration enabled anyway)
            Disallow: /themes/ # theme links
            Disallow: /users/ # remove user action pages
            Disallow: /x/ # remove tiny link urls
            
            # End file
            


            Peter Raymond added a comment -

            • Page labels should be embedded into the HTML code as metadata for search engines to key in on.
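
            For illustration, one common way to expose labels as page metadata (a sketch; the label values are hypothetical, and this is not how Confluence currently renders labels):

            <meta name="keywords" content="crawler, robots, search-appliance">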


            Peter Raymond added a comment - edited
            • Attachments should also have "never expire" cache headers on historical versions.
            • Image files should have a long cache time, preferably specifiable in the configuration somewhere.
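
            A minimal illustration of the response headers being asked for (values are placeholders; the # lines are annotations, not part of the response):

            # historical attachment version: effectively never expires
            Cache-Control: max-age=31536000

            # current image file: long but configurable lifetime, e.g. one day
            Cache-Control: max-age=86400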


              Assignee: Unassigned
              Reporter: Peter Raymond
              Votes: 66
              Watchers: 61
