
Dangerous Robots.txt File


26 replies to this topic

#1 kdawber

kdawber

    HR 1

  • Members
  • Pip
  • 1 posts

Posted 25 March 2004 - 08:46 AM

In a recent 'High Ranking Advisor' email I received (issue 091) was the following advice relating to the robots.txt file:

To disallow specific directories or files, use the following code:

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /includes/
Disallow: /pdf/admin.pdf

This code excludes the cgi-bin folder, which normally has no real information for a search engine, as well as the images folder and the includes folder (this last one typically houses external JavaScript and CSS files).
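For anyone who wants to see exactly what those rules block, here is a minimal sketch using Python's standard `urllib.robotparser`. The sample paths are hypothetical and chosen only to exercise each rule; real spiders may interpret robots.txt somewhat differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The rules quoted above, verbatim.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /includes/
Disallow: /pdf/admin.pdf
"""

# Hypothetical paths, chosen only to exercise each rule.
SAMPLE_PATHS = [
    "/index.html",
    "/cgi-bin/form.pl",
    "/images/logo.gif",
    "/includes/style.css",
    "/pdf/admin.pdf",
    "/pdf/brochure.pdf",
]


def blocked_paths(robots_txt, paths, agent="*"):
    """Return the subset of paths that the rules forbid for the given agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [p for p in paths if not parser.can_fetch(agent, p)]


if __name__ == "__main__":
    for path in blocked_paths(ROBOTS_TXT, SAMPLE_PATHS):
        print("blocked:", path)
```

Rules are matched by path prefix, so the last rule blocks only that one PDF and leaves the rest of /pdf/ open.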


If I was a search engine spider, the above robots.txt would immediately make me suspicious. On finding this the spider has three options:

(i) Obey the robots.txt file and potentially index a page that uses a CSS file for tricks such as white-on-white text or shifting a layer (e.g. with div tags) over the top of text.

(ii) Disobey the robots.txt file in order to look at all the code on each page.

(iii) Give each of the pages the 'kiss of death' flag so that they won't appear in the SERPs.

I think that the third is the easiest option. Even if they haven't done that yet, it is probably only a matter of time before they do.

Do you all agree on this?

#2 Jill

Jill

    High Rankings Advisor

  • Admin
  • 32,324 posts

Posted 25 March 2004 - 08:52 AM

No, absolutely disagree.

It's silly to have those files/directories indexed by the search engines, and any smart programmer would disallow them so that the engines would get to the meat that they should be getting.

It's every Webmaster's prerogative to allow and disallow exactly what they want the engines to index. There's nothing spammy or whatever about that, and the engines are not so short-sighted (or dumb) as to not index pages that have a robots.txt which excludes their cgi-bin and other non-relevant info.

Jill

#3 Alan Perkins

Alan Perkins

    Token male admin

  • Admin
  • 1,559 posts
  • Location:UK

Posted 25 March 2004 - 09:13 AM

Nope, kdawber has a point.

Shari Thurow used to advise people to exclude JavaScript and CSS via robots.txt, until a search engine rep asked her to stop doing it for precisely the reasons that kdawber mentions.

It's OK to exclude files or directories. But by excluding includes, you are excluding parts of files which you want indexed ... which makes it difficult to rank those files using on-the-page criteria.

Don't exclude includes, is my advice.

#4 qwerty

qwerty

    HR 10

  • Moderator
  • 8,294 posts
  • Location:Somerville, MA

Posted 25 March 2004 - 09:19 AM

I'm confused, Alan. Even if you disallow the robots access to your includes directory, you're not keeping them from the content in those includes when they get loaded into other pages. You're just keeping them from getting indexed as if they were complete pages.

I do agree that you shouldn't block spiders from your CSS, but I don't think there was anything in Matt's article suggesting that.

#5 Alan Perkins

Alan Perkins

    Token male admin

  • Admin
  • 1,559 posts
  • Location:UK

Posted 25 March 2004 - 09:22 AM

"I'm confused, Alan."

Wrong type of include - you're talking about ASP or PHP includes.

Look at the context of the quote:

"This code excludes the cgi-bin folder, which normally has no real
information for a search engine, as well as the images folder and the
includes folder (this last one typically houses external JavaScript
and CSS files)."

The suggestion is to prevent spiders looking at external JS and CSS. This type of include is NOT included in the source of the document.

"I do agree that you shouldn't block spiders from your CSS, but I don't think there was anything in Matt's article suggesting that."

Yes there was. See above.

#6 SearchRank

SearchRank

    HR 7

  • Active Members
  • PipPipPipPipPipPipPip
  • 2,333 posts
  • Location:Phoenix, AZ

Posted 25 March 2004 - 09:23 AM

If you block spiders from your includes files, they will still pick up those elements in a page because they are already in there when the spider retrieves the page. Spiders do not index raw pages as we would see them in our html editors. Rather they index them and see the same content we would see in a browser, with the include elements.

#7 Alan Perkins

Alan Perkins

    Token male admin

  • Admin
  • 1,559 posts
  • Location:UK

Posted 25 March 2004 - 09:25 AM

Wrong type of includes again, David. :)

Note the use of the word "external" in Matt's article: "...and the includes folder (this last one typically houses external JavaScript and CSS files)."

#8 qwerty

qwerty

    HR 10

  • Moderator
  • 8,294 posts
  • Location:Somerville, MA

Posted 25 March 2004 - 09:31 AM

Sorry. I've got it now, Alan.

#9 SearchRank

SearchRank

    HR 7

  • Active Members
  • PipPipPipPipPipPipPip
  • 2,333 posts
  • Location:Phoenix, AZ

Posted 25 March 2004 - 09:31 AM

I am referring to include files that contain html source code such as headers, footers, navigations, etc. What you are referring to are external files such as CSS and JavaScript files, which are not "included" in a page but rather are "called for".

<!--#include file="includes/footer.inc"--> for example will put whatever code is in that include file into the source code for the page the spider is indexing.

<script src="scripts/rollover.js" type="text/javascript"></script> for example will not; it merely references the file.

That is what I was referring to.

At any rate, don't know why someone would want to exclude spiders from getting that anyway unless of course they are using spamming elements in the CSS or something like that.

#10 Grumpus

Grumpus

    HR 6

  • Active Members
  • PipPipPipPipPipPip
  • 786 posts

Posted 25 March 2004 - 09:32 AM

The difference that's causing confusion, as Alan points out, is the type of include. A CSS style sheet is "referenced" not "included".

If I have, say, "header.html" in my "/includes/" directory and I ban spiders from my includes directory, it'll prevent spiders from indexing "header.html" as an individual file. But, the pages that "include" the header.html, such as my "index.shtml" will get the header content into the search engines just fine. SSI compiles it all as one single document prior to sending it, so the spider doesn't even really know that "header.html" is actually a different file.

A CSS file never really becomes an included aspect of a document. It's referenced and so a separate call must be made in order to see the contents of that CSS file. Technically, CSS is a client side technology. Your browser gets the page, sees the reference to the CSS and downloads that too. It then formats the page you requested based upon the data in the CSS file.

G.
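To make the included-versus-referenced distinction concrete, here is a toy sketch in Python. It is not how Apache or any real server implements SSI, and the file names and contents are made up for illustration. It expands include directives the way a server would before replying, and leaves the stylesheet reference untouched.

```python
import re

# Hypothetical site files, held in memory so the sketch needs no web server.
FILES = {
    "includes/header.html": "<h1>Welcome to Example Widgets</h1>",
    "index.shtml": (
        '<link rel="stylesheet" href="includes/style.css">\n'
        '<!--#include file="includes/header.html" -->\n'
        "<p>Body text.</p>"
    ),
}

# Matches directives of the form <!--#include file="path" -->
INCLUDE_DIRECTIVE = re.compile(r'<!--#include\s+file="([^"]+)"\s*-->')


def render(name, files=FILES):
    """Expand include directives recursively, as the server would before replying."""
    return INCLUDE_DIRECTIVE.sub(lambda m: render(m.group(1), files), files[name])


if __name__ == "__main__":
    # What a spider receives when it requests index.shtml.
    print(render("index.shtml"))
```

The rendered page contains the header markup itself, so blocking /includes/ in robots.txt does not hide it. The stylesheet is still only a URL in the page, and a spider that obeys the block never sees what is inside it.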

#11 Alan Perkins

Alan Perkins

    Token male admin

  • Admin
  • 1,559 posts
  • Location:UK

Posted 25 March 2004 - 09:33 AM

"I am referring to include files that contain html source code such as headers, footers, navigations, etc. What you are referring to are external files such as CSS and JavaScript files, which are not "included" in a page but rather are "called for"."

Yes, I agree it's ambiguous language.

The way I read it, the suggestion was to disallow access to external JS and CSS files. Something to do with it saying "this last one typically houses external JavaScript and CSS files". :)

#12 Denyse

Denyse

    HR 4

  • Active Members
  • PipPipPipPip
  • 189 posts
  • Location:Montreal, Quebec, Canada

Posted 25 March 2004 - 10:40 AM

What about image dir, is it ok to exclude them?

#13 qwerty

qwerty

    HR 10

  • Moderator
  • 8,294 posts
  • Location:Somerville, MA

Posted 25 March 2004 - 10:48 AM

If you don't want your images spidered, I think the best way to do that is just ban the image robots from everything.
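As a rough sketch of that approach, the rules below shut one image crawler out of the whole site while leaving every other robot unrestricted. Googlebot-Image is used here as an assumed example of an image robot's user-agent token; the thread does not name one, and other image spiders would each need their own entry. The check uses Python's standard `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser

# Assumed example: Googlebot-Image stands in for "the image robots".
ROBOTS_TXT = """\
User-agent: Googlebot-Image
Disallow: /

User-agent: *
Disallow:
"""


def allowed(agent, path, robots_txt=ROBOTS_TXT):
    """Return True if the rules let the given agent fetch the given path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    # Hypothetical image path, for illustration only.
    for agent in ("Googlebot-Image", "Googlebot"):
        print(agent, allowed(agent, "/images/logo.gif"))
```

An empty Disallow line means "nothing is disallowed", so the second record keeps ordinary spiders free to crawl everything.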

#14 domokun

domokun

    Web jockey

  • Active Members
  • PipPipPipPip
  • 249 posts

Posted 25 March 2004 - 11:15 AM

what a fun thread!
stick-of-end-wrong. superb!
if i can be allowed to summarise:

feel free to disallow access to images folder.

if you disallow access to a folder containing ASP and PHP include files (used for such things as headers, footers, navigation bars etc.) it won't matter because these 'elements' will be shown to the Spider on the page they are 'included' into. so feel free to disallow this folder too.

if you have a folder containing external CSS and JavaScripts (which are referenced by your pages) DO NOT disallow the Spider access to this folder. it will make its tiny, binary ears prick up and make it think you are up to no good.
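One way to check a site against that last point is sketched below in Python, using only the standard library. It collects the external scripts and stylesheets a page references and reports any that robots.txt hides from spiders. The page, the rules, and the function names are hypothetical, and the sketch assumes root-relative URLs (relative ones would first need resolving against the page address).

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class AssetCollector(HTMLParser):
    """Collect the URLs of externally referenced scripts and stylesheets."""

    def __init__(self):
        super().__init__()
        self.assets = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script" and attrs.get("src"):
            self.assets.append(attrs["src"])
        elif tag == "link" and "stylesheet" in (attrs.get("rel") or "").lower():
            if attrs.get("href"):
                self.assets.append(attrs["href"])


def hidden_assets(html, robots_txt, agent="*"):
    """Return referenced CSS/JS paths that robots.txt forbids for the agent."""
    collector = AssetCollector()
    collector.feed(html)
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [a for a in collector.assets if not parser.can_fetch(agent, a)]


# Hypothetical page and rules, for illustration only.
PAGE = """
<html><head>
<link rel="stylesheet" href="/includes/style.css">
<script src="/scripts/rollover.js" type="text/javascript"></script>
</head><body><p>Hello</p></body></html>
"""

ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /includes/
"""

if __name__ == "__main__":
    for asset in hidden_assets(PAGE, ROBOTS_TXT):
        print("hidden from spiders:", asset)
```

Anything this reports is a file the page depends on but a well-behaved spider cannot read, which is the situation the summary warns against.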

#15 Alan Perkins

Alan Perkins

    Token male admin

  • Admin
  • 1,559 posts
  • Location:UK

Posted 25 March 2004 - 11:23 AM

Yes, good summary. :)



