Confluence Data Center / CONFSERVER-8749

Make Confluence more configurable in how it handles web crawlers such as the Google Search Appliance

    • We collect Confluence feedback from various sources, and we evaluate what we've collected when planning our product roadmap. To understand how this piece of feedback will be reviewed, see our Implementation of New Features Policy.

      NOTE: This suggestion is for Confluence Server. Using Confluence Cloud? See the corresponding suggestion.

      We have discovered that Confluence is severely impacted by the Google Search Appliance, as seen in CSP-8619.

      • Since each page is dynamic, there is no cache setting, so the GSA will hit the same page multiple times a day. In some instances we've had pages re-indexed several times an hour!
      • The GSA, like Google, is quite aggressive in following every single link. It hits page source, history, comment focus, everything.
      • Our site has ~8,000 "current versions" of pages. The GSA has indexed up to 65k pages!
      • We're using an aggressive robots.txt file (below) but have not yet determined how successful it is.
        • The ideal solution would have been to block EVERYTHING except /display/, but we have quite a few pages whose names (and thus URLs) use disallowed characters, such as ?, so doing that would knock out a significant chunk of content.

      Suggestions:

      • Add something to the admin console for search engine configuration, perhaps a checklist of which pages should be crawled and which should not. The pages that should not be crawled should either get noindex and nofollow meta tags, or Confluence should generate a custom robots.txt file on its own and place it in the root directory. Either is fine (see the sketch after this list).
      • There's no reason for historical pages (version history) not to have cache headers. Set them to never expire, as they'll never change.
      • All links to edit pages, admin pages, add comment, etc. should be marked nofollow.
      • At the very least, it would be ideal if ALL current-version pages started with /display/, including the ones with non-standard characters, with EVERYTHING else under a different stem.
        • It's almost like this today, with the notable exception that pages with non-standard titles show up under the /pages/viewpage.action stem instead of /display/. My robots.txt file below tries to work around that, but unfortunately there's no way to pick up those pages without also picking up all of the version history pages. Some might want those crawled as well; I suspect most would not.
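
      For illustration only, a sketch of the kind of output such an admin option might generate; this is not something Confluence does today:

      <!-- emitted on pages an administrator marks "do not index" (hypothetical) -->
      <meta name="robots" content="noindex, nofollow">

      # or an auto-generated robots.txt placed in the web root (hypothetical)
      User-agent: *
      Disallow: /pages/viewpreviouspageversions.action
      Allow: /display/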

      Thank you.

      Peter

      Our "aggressive" robots.txt file, again for which we're waiting to see if GSA picks it up like we'd like:

      # Note: this file uses parameters specific to Google, parameters that are not in the robots.txt standard
      # http://www.google.com/support/webmasters/, http://www.robotstxt.org/wc/faq.html and http://en.wikipedia.org/wiki/Robots_Exclusion_Standard were used to research said parameters
      # some links shouldn't show to an anonymous browser such as the GSA but are included for completeness
      
      User-agent: * # match all bots. GSA is our primary crawler but logs indicate there may be others on our Intranet
      Crawl-delay: 5 # per http://en.wikipedia.org/wiki/Robots.txt#Nonstandard_extensions, sets number of seconds to wait between requests to 5 seconds. may not work
      # DISABLED FOR NOW Visit-time: 0600-0845 # per http://en.wikipedia.org/wiki/Robots.txt#Extended_Standard, only visit between 6:00 AM and 8:45 AM UT (GMT), may not work
      Disallow: /admin/ # administrator links
      Disallow: /adminstrators.action? # remove any administrator links
      Disallow: /createrssfeed.action? # remove internal RSS links
      Disallow: /dashboard.action? # remove the dashboard, heavy resource hit
      Allow: /display # ensure primary display pages are allowed
      Disallow: /display/*&tasklist.complete= # remove tasklist links
      Disallow: /display/*?decorator=printable # remove printable version links
      Disallow: /display/*?focusedCommentId= # remove page comment focus links
      Disallow: /display/*?refresh= # prevent crawler from clicking refresh button
      Disallow: /display/*?replyToComment= # remove reply to comment links
      Disallow: /display/*?rootCommentId= # remove news comment focus links
      Disallow: /display/*?showComments=true&showCommentArea=true#addcomment # remove add comment links
      Disallow: /doexportpage.action? # remove pdf export links
      Disallow: /dopeopledirectorysearch.action # people search
      Disallow: /dosearchsite.action? # remove specific site searches
      Disallow: /exportword? # remove word export links
      Disallow: /login.action? # remove the login page
      # The commented-out Allow line below will be enabled when the Disallow: /pages/ line that follows it is removed
      # Allow: /pages/viewpage.action?* # allows indexing of pages with invalid titles for html (such as ?'s). Unfortunately currently allows page history to sneak in
      Disallow: /pages/ # this line to purge GSA of all old page entries, will be removed in next iteration so that specific /pages/ lines below take effect
      Disallow: /pages/copypage.action? # remove copy page links
      Disallow: /pages/createblogpost.action? # remove add news links
      Disallow: /pages/createpage.action? # remove add page links
      Disallow: /pages/diffpages.action? # remove page comparison pages
      Disallow: /pages/diffpagesbyversion.action? # remove page comparison links
      Disallow: /pages/editblogpost.action? # remove edit news links
      Disallow: /pages/editpage.action? # remove edit page links
      Disallow: /pages/removepage.action? # remove the remove page links
      Disallow: /pages/revertpagebacktoversion.action? # remove reversion links
      Disallow: /pages/templates # remove template pages
      Disallow: /pages/templates/ # block template indexes
      Disallow: /pages/viewchangessincelastlogin.action? # remove page comparison pages
      Disallow: /pages/viewpagesrc.action? # remove view page source links
      Disallow: /pages/viewpreviouspageversions.action? # remove the link to previous versions
      Disallow: /plugins/ # blocks plug-in calls
      Disallow: /rpc/ # remove any RPC links
      Disallow: /searchsite.action? # remove the wiki search engine pages
      Disallow: /spaces/ # remove space action pages
      Disallow: /themes/ # theme links
      Disallow: /users/ # remove user action pages
      Disallow: /x/ # remove tiny link urls
      
      # End file
      


            [CONFSERVER-8749] Make Confluence more configurable in how it handles web crawlers such as the Google Search Appliance

            Sorin Sbarnea (Citrix) added a comment -

            Please note that not following HTTP standards is not a new feature; it is a product bug, and also proof that QA didn't do a good job of finding these issues.


            Sorin Sbarnea (Citrix) added a comment - edited

            It seems that Confluence does not follow even the most basic HTTP standard requirements, such as returning 304 when the content has not changed, or including the change date as a meta tag like `<META name="date" content="13-01-08">`; the Date in the HTTP headers contains the datetime of the response, NOT of the last change made to the page content.

            Due to this, I should not have been so surprised when I observed that 66% of our traffic is generated by GSA indexing, which was obviously re-indexing each page daily, and some pages more often: over 230,000 requests in 24 hours, ~4/second.

            This being said, where is the enterprise quality?
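
            For reference, the standard mechanism being described here is an HTTP conditional request; a minimal illustration (the URL and dates are placeholders):

            GET /display/SPACE/Some+Page HTTP/1.1
            If-Modified-Since: Tue, 08 Jan 2013 10:00:00 GMT

            HTTP/1.1 304 Not Modified
            Last-Modified: Tue, 08 Jan 2013 10:00:00 GMT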


            Sorin Sbarnea (Citrix) added a comment -

            Clearly, with the current setup the GSA can easily put your instance down. I just added host-load: 1 based on http://www.stonetemple.com/articles/interview-matt-cutts.shtml and am waiting to see if it makes a difference.


            Yuji Shinozaki added a comment -

            The overall solution should probably (also) use the X-Robots-Tag response header, which wasn't around when this issue was created.

            https://developers.google.com/webmasters/control-crawl-index/docs/robots_meta_tag

            There are some SEO plugins that might provide control over things like this.
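
            As a minimal illustration of that header on a response (directive values as documented in the Google page linked above):

            HTTP/1.1 200 OK
            X-Robots-Tag: noindex, nofollow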


            Mary Washburn added a comment -

            Having an Admin tool to create a robots.txt file would be greatly appreciated!


            Sergey Svishchev added a comment -

            Any news?


            William Zanchet (Inactive) added a comment -

            Hi Chris, let me try to reach one of our team members to see if this is on our roadmap. As soon as I get an answer I'll let you know.


            Feldhacker added a comment -

            Any updates on this 6-year-old issue?


            Ben added a comment -

            Commenting here as it seems related: we would like to be able to block certain spaces from being indexed, either via an option to add custom robots.txt entries or via a noindex meta tag.
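
            By way of illustration, a per-space robots.txt exclusion would look like the WikiDevQA line in the file above (the space key here is hypothetical):

            Disallow: /display/INTERNAL/ # keep this space out of the index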


            Daniel Flower added a comment -

            It seems that any custom robots.txt file needs to be updated as Confluence evolves, so at the very least a robots.txt should be shipped with Confluence that excludes PDF/Word exports, and perhaps historical page versions etc.
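
            For reference, a minimal shipped default along those lines could reuse the export exclusions already shown in the robots.txt above (a sketch, not an Atlassian-provided file):

            User-agent: *
            Disallow: /doexportpage.action? # remove pdf export links
            Disallow: /exportword? # remove word export links
            Disallow: /pages/viewpreviouspageversions.action? # remove links to previous page versions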


            childnode added a comment - edited

            This problem affects all Confluence instances that are indexed by search engines. As seen in this search result, historic versions of pages should NOT be indexed and should therefore be disallowed! If different users have different needs, please add an option to "disallow historic pages from indexing".

            Thank you and with kind regards,
            ~Marcel


            jeff peichel added a comment -

            Has anyone created a robots.txt for 3.1? Not sure what differences there would be...


            Peter Raymond added a comment - edited

            Given the last two comments, I'd drop back from the "set cache headers to never expire" stance and move to putting <META NAME="ROBOTS" CONTENT="NOINDEX,FOLLOW"> on historical pages.

            Anyway, more analysis of our logs has caused us to add the following to our robots.txt file now that we've moved to v2.9.2:

            Disallow: /spaces/usage/report.action # remove report generator
            Disallow: /checkinout/ # remove Attachment Checkout plug-in links
            Disallow: /customspacemgmt/ # remove space management links
            Disallow: /homepage.action # remove home page action
            Disallow: /spaces/space-bookmarks.action? # remove social bookmarking links
            

            The social bookmarking one is the critical one of this bunch. On our system, at least, it takes around 2 minutes for any "&mode=bookmarksfor" page to come up in a browser. Since the crawler hits a bunch of these, it may be the source of our random slowdowns.

            I'd still really like a simple config page in Confluence Admin where I could just go down and check yes/no on which types of pages we want indexed, and then have Confluence generate the proper page headers on the fly as needed, particularly the "nofollow" on links to pages that we don't want Google to even see, much less crawl.
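
            For illustration, link-level nofollow of the kind being asked for is usually expressed with a rel attribute on the link itself (the pageId here is hypothetical):

            <a href="/pages/editpage.action?pageId=12345" rel="nofollow">Edit</a>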


            Don Willis added a comment -

            "There's no reason for historical pages (version history) to not have cache headers. Set them to never expire as they'll never change."

            Aside from Chris's point about page renaming, the dynamically rendered content of a historical page can change. For example:

            • A user's permissions might change, allowing them to see the content of an include macro.
            • A page linked to by the historic page might be created or removed, which changes the appearance of the link.
            • Dynamic data such as a JIRA Issues macro will change frequently.


            Christopher Owen [Atlassian] added a comment -

            "There's no reason for historical pages (version history) to not have cache headers. Set them to never expire as they'll never change."

            Unfortunately that's not quite true at the moment. A page can be deleted (or renamed) and a new page with the same name created, which will start the version count over again. This makes it pretty much impossible to declare historical pages as never expiring.


            Paul Curren added a comment -

            This problem should be addressed by the fixes in the referenced defects.


            Peter Raymond added a comment -

            I was looking at the "There's no reason for historical pages (version history) to not have cache headers. Set them to never expire as they'll never change." suggestion in the original request and started thinking: maybe it would be better for historical pages to have <META NAME="ROBOTS" CONTENT="NOINDEX,FOLLOW"> added to them automatically when a newer version is added. Thoughts?

            Also, we've continued to refine the robots.txt file based on ongoing log analysis. Basically, we keep finding new URLs that the crawler is hitting that we'd rather it not. Here's the latest:

            # Note: this file uses parameters specific to Google, parameters that are not in the robots.txt standard
            # http://www.google.com/support/webmasters/, http://www.robotstxt.org/wc/faq.html and http://en.wikipedia.org/wiki/Robots_Exclusion_Standard were used to research said parameters
            # some links shouldn't show to an anonymous browser such as the GSA but are included for completeness
            
            # Updated 2007.06.30.09.44
            
            User-agent: * # match all bots. The Google Search Appliance (GSA) is our primary crawler but logs indicate there may be others on our Intranet
            Crawl-delay: 5 # per http://en.wikipedia.org/wiki/Robots.txt#Nonstandard_extensions, sets number of seconds to wait between requests to 5 seconds. may not work
            Request-rate: 1/5 # per http://en.wikipedia.org/wiki/Robots.txt#Extended_Standard, maximum rate is one page every 5 seconds. may not work
            # DISABLED FOR NOW Visit-time: 0600-0845 # per http://en.wikipedia.org/wiki/Robots.txt#Extended_Standard, only visit between 6:00 AM and 8:45 AM UT (GMT), may not work
            Disallow: /*?decorator=printable # remove printable version links, non-display URLs
            Disallow: /*javascript* # remove any javascript links, per log analysis
            Disallow: /admin/ # administrator links
            Disallow: /adminstrators.action? # remove any administrator links
            Disallow: /createrssfeed.action? # remove internal RSS links
            # Disallow: /dashboard.action # primary dashboard link
            Disallow: /dashboard.action? # remove secondary dashboard links, not needed for anonymous crawling
            Allow: /display # ensure primary display pages are allowed
            Disallow: /display/*&tasklist.complete= # remove tasklist links
            Disallow: /display/*&tasklist.uncomplete= # remove tasklist links
            Disallow: /display/*?decorator=normal # remove redundant link for standard display
            Disallow: /display/*?decorator=printable # remove printable version links, display URLs
            Disallow: /display/*?focusedCommentId= # remove page comment focus links
            Disallow: /display/*?refresh= # prevent crawler from clicking refresh button
            Disallow: /display/*?replyToComment= # remove reply to comment links
            Disallow: /display/*?rootCommentId= # remove news comment focus links
            Disallow: /display/*?showChildren= # remove the children view links, not needed, anonymous defaults to showing children
            # Disallow: /display/*?showChildren=true # remove show children link - DISABLED for now so crawler can see more "real" pages
            Disallow: /display/*?sortBy= # remove sort by links for pages with embedded attachments, not needed
            Disallow: /display/*showComments= # remove comment links
            Disallow: /display/WikiDevQA/ # remove the DEV Space from being indexed
            Disallow: /doexportpage.action? # remove pdf export links
            Disallow: /dopeopledirectorysearch.action # people search
            Disallow: /dosearchsite.action # remove generic site searches
            Disallow: /dosearchsite.action? # remove specific site searches
            Disallow: /download/attachments/*?version= # knock out previous versions of attachments
            Disallow: /download/userResources/ # knock out user resource links, per log analysis
            Disallow: /download/resources/ # knock out resource links, per log analysis
            Disallow: /dwr/ # knock out DWR links, per log analysis and http://getahead.org/dwr/
            Disallow: /exportword? # remove word export links
            Disallow: /form-mail-plugin/ # remove form mail links
            Disallow: /label/ # remove all label links, per vendor analysis
            Disallow: /labels/ # remove all label links, per vendor analysis
            Disallow: /labels-javascript # remove label javascript
            Allow: /labels/listlabels-alphaview.action # allow label index page
            Disallow: /login.action # remove the login page
            Disallow: /login.action? # remove the login page derivatives
            # The commented-out Allow line below will be enabled when the Disallow: /pages/ line that follows it is removed
            # Allow: /pages/viewpage.action?* # allows indexing of pages with invalid titles for html (such as ?'s). Unfortunately currently allows page history to sneak in
            Disallow: /pages/ # this line to purge GSA of all old page entries, _may_ eventually be removed so that specific /pages/ lines below take effect and non-html compatible titled pages can be crawled
            # DISABLED FOR NOW Disallow: /pages/pageinfo.action? # exclude all the previous versions of pages by excluding Page Info pages
            # Disallow: /pages/*?showChildren=true # remove show children link - DISABLED for now so crawler can see more "real" pages
            Disallow: /pages/*&tasklist.complete= # remove tasklist links
            Disallow: /pages/*&tasklist.uncomplete= # remove tasklist links
            Disallow: /pages/*?decorator=normal # remove redundant link for standard display
            Disallow: /pages/*?decorator=printable # remove printable version links, display URLs
            Disallow: /pages/*?focusedCommentId= # remove page comment focus links
            Disallow: /pages/*?refresh= # prevent crawler from clicking refresh button
            Disallow: /pages/*?replyToComment= # remove reply to comment links
            Disallow: /pages/*?rootCommentId= # remove news comment focus links
            Disallow: /pages/*?showChildren=false # remove the don't show children link, not needed, per log analysis
            Disallow: /pages/*?sortBy= # remove sort by links for pages with embedded attachments, not needed
            Disallow: /pages/*showComments= # remove comment links
            Disallow: /pages/copypage.action? # remove copy page links
            Disallow: /pages/createblogpost.action? # remove add news links
            Disallow: /pages/createpage.action? # remove add page links
            Disallow: /pages/diffpages.action? # remove page comparison pages
            Disallow: /pages/diffpagesbyversion.action? # remove page comparison links
            Disallow: /pages/editblogpost.action? # remove edit news links
            Disallow: /pages/editpage.action? # remove edit page links
            Disallow: /pages/removepage.action? # remove the remove page links
            Disallow: /pages/revertpagebacktoversion.action? # remove reversion links
            Disallow: /pages/templates # remove template pages
            Disallow: /pages/templates/ # block template indexes
            Disallow: /pages/viewchangessincelastlogin.action? # remove page comparison pages
            Disallow: /pages/viewpage.action?*&showComments # remove comments links
            Disallow: /pages/viewpage.action?spaceKey= # remove page view links that are "duplicates" of the Display URL pages
            Disallow: /pages/viewpagesrc.action? # remove view page source links
            Disallow: /pages/viewpreviouspageversions.action? # remove the link to previous versions
            Disallow: /plugins/ # blocks plug-in calls
            Disallow: /rpc/ # remove any RPC links
            Disallow: /s/ # remove any links to label calls down this path, per log analysis
            Disallow: /searchsite.action? # remove the wiki search engine pages
            Disallow: /spaces/*&decorator=printable # remove printable version links
            Disallow: /spaces/blogrss.action? # remove rss links
            Disallow: /spaces/listrssfeeds.action? # remove rss links
            Disallow: /spaces/viewmail.action? # remove view mail links (we don't have email integration enabled anyway)
            Disallow: /spaces/viewmailarchive.action? # remove view mail archive links (we don't have email integration enabled anyway)
            Disallow: /spaces/viewthread.action? # remove view mail thread links (we don't have email integration enabled anyway)
            Disallow: /themes/ # theme links
            Disallow: /users/ # remove user action pages
            Disallow: /x/ # remove tiny link urls
            
            # End file
            


            Peter Raymond added a comment -

            • Page labels should be embedded into the HTML code as metadata for search engines to key in on.
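
            For illustration, one common way to expose labels as page metadata (a sketch; the label values are hypothetical, and this is not how Confluence currently renders labels):

            <meta name="keywords" content="crawler, robots, search-appliance">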


            Peter Raymond added a comment - edited
            • Attachments should also have "never expire" cache headers on historical versions.
            • Image files should have a long cache time, preferably specifiable in the configuration somewhere.
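
            A minimal illustration of the response headers being asked for (values are placeholders; the # lines are annotations, not part of the response):

            # historical attachment version: effectively never expires
            Cache-Control: max-age=31536000

            # current image file: long but configurable lifetime, e.g. one day
            Cache-Control: max-age=86400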


              Assignee: Unassigned
              Reporter: Peter Raymond
              Votes: 66
              Watchers: 61
