
Anybody Else Ever See Google Index Their Robots.txt File?


24 replies to this topic

#1 Jill

Jill

    High Rankings Advisor

  • Admin
  • 32,316 posts

Posted 03 November 2008 - 10:49 AM

I was looking for something specific on my website via a Google search of just pages on my site, and found that they had actually indexed my robots.txt file. I don't recall ever seeing that before, and was wondering if others had noticed it?

Should I block it via itself?

#2 Alan Perkins

Alan Perkins

    Token male admin

  • Admin
  • 1,559 posts
  • Location:UK

Posted 03 November 2008 - 10:59 AM

Yep, I have seen it very often. It most often happens when somebody links to it.

#3 Jill

Jill

    High Rankings Advisor

  • Admin
  • 32,316 posts

Posted 03 November 2008 - 11:24 AM

Surprised I never noticed before. I guess it's no big deal, but for some reason I just assumed they wouldn't bother to index them. You didn't need to link to mine, Alan, they've already got it indexed!

#4 MaKa

MaKa

    HR 6

  • Active Members
  • 856 posts
  • Location:Llantwit Major, Wales, UK

Posted 04 November 2008 - 06:50 AM

QUOTE(Jill @ Nov 3 2008, 03:49 PM)
Should I block it via itself?


I've seen whole sites go unindexed because robots.txt returned the wrong HTTP status code. Because the crawlers couldn't access robots.txt, they didn't index anything on the site. Couldn't blocking robots.txt in robots.txt have the same result?
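To make the status-code point concrete, here is a minimal sketch of how a crawler might turn the HTTP status of a robots.txt fetch into a crawl decision. The function name and the policy labels are hypothetical, and the mapping is an assumption modeled on behavior that some crawlers document, not a statement of what Google actually did at the time.

```python
def crawl_policy_for_robots_status(status):
    """Map the HTTP status of a robots.txt fetch to a hypothetical crawl policy.

    The mapping below is an illustrative assumption, not Google's confirmed logic.
    """
    if 200 <= status < 300:
        return "parse-rules"      # file fetched: read it and obey its rules
    if 400 <= status < 500:
        return "allow-all"        # no usable file: assume no restrictions
    if 500 <= status < 600:
        return "disallow-all"     # server trouble: stay away until it recovers
    return "follow-or-retry"      # redirects and anything else


if __name__ == "__main__":
    for code in (200, 404, 503):
        print(code, crawl_policy_for_robots_status(code))
```

Under this assumed mapping, a server that answers robots.txt requests with a 5xx error would keep the whole site out of the crawl, which matches the symptom described above.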

#5 1dmf

1dmf

    Keep Asking, Keep Questioning, Keep Learning

  • Active Members
  • 2,154 posts
  • Location:Worthing - England

Posted 04 November 2008 - 07:15 AM


Isn't G's goal to index as many web pages and documents as possible, so it can produce the most relevant, targeted and accurate search results for its searchers?

These files exist only for robots, not humans. Why on earth would they index them? What purpose does this serve?

Do you think this is a bug?


#6 Jill

Jill

    High Rankings Advisor

  • Admin
  • 32,316 posts

Posted 04 November 2008 - 09:23 AM

QUOTE
Couldn't blocking the robots.txt in the robots.txt not result in the same?


Nah. If I just exclude it from being indexed (with a line within the text file), I don't think it will be a problem.

I did get pointed to a thread from Google's forum where someone from Google did say you could do that. I really don't care that it's indexed, so I think I'll just leave it alone though.

I was just surprised to see it indexed as I'd never noticed that happening before.

#7 qwerty

qwerty

    HR 10

  • Moderator
  • 8,287 posts
  • Location:Somerville, MA

Posted 04 November 2008 - 09:26 AM

I wonder if having it indexed could mean that they don't check it every time before spidering any other documents. That would be problematic, if you made a change to robots.txt and G continued to spider the blocked pages for a while before noticing the change.

#8 Randy

Randy

    Convert Me!

  • Moderator
  • 17,540 posts

Posted 04 November 2008 - 09:38 AM

It's an interesting question, and one I've never seen answered, Bob.

My guess however is that whether the robots.txt is indexed or not probably has no bearing whatsoever on crawling. The simplest thing to do with spiders is to have them check for a robots.txt every single run, regardless of whether they could know that it existed in the past or not. Otherwise they'd never know if there had been any recent changes to the robots.txt since their last spider run.

The reason most robots.txt files don't get indexed is simply because there are no links pointing at them. When a link is found pointing to one though, it would make sense for the spiders to crawl the file as an indexable page, considering spiders are fairly dumb creatures.

#9 Alan Perkins

Alan Perkins

    Token male admin

  • Admin
  • 1,559 posts
  • Location:UK

Posted 04 November 2008 - 10:12 AM

QUOTE(Jill)
If I just exclude it from being indexed (with a line within the text file), I don't think it will be a problem.


The problem with that logic is that a line in the text file doesn't stop it being indexed. Technically, it stops it being read. However, robots.txt is the one file that is legally exempt from the robots standards, so the listing of robots.txt IN robots.txt is open to interpretation. I'd suggest the safest method, in Google at least, is to prevent the file being indexed by sending a NOINDEX equivalent in the HTTP response header.
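As an illustration of the header approach, the sketch below checks whether a set of response headers asks a crawler not to index a file. It assumes the header in question is X-Robots-Tag, which Google documents as an HTTP equivalent of the robots meta tag. The function name is hypothetical, and the parsing only handles a plain directive list or a single "botname: directives" prefix.

```python
def header_blocks_indexing(headers, bot="googlebot"):
    """Return True if an X-Robots-Tag header asks `bot` not to index the response.

    `headers` is a dict of response header names to values (hypothetical input).
    """
    for name, value in headers.items():
        if name.lower() != "x-robots-tag":
            continue
        directives = value.lower()
        scope, sep, rest = directives.partition(":")
        # A leading "botname:" scopes the directives to one crawler.
        if sep and "," not in scope and scope.strip() != "unavailable_after":
            if scope.strip() != bot.lower():
                continue          # aimed at a different crawler
            directives = rest
        tokens = {token.strip() for token in directives.split(",")}
        # "none" is shorthand for "noindex, nofollow".
        if "noindex" in tokens or "none" in tokens:
            return True
    return False


if __name__ == "__main__":
    print(header_blocks_indexing({"Content-Type": "text/plain",
                                  "X-Robots-Tag": "noindex"}))
```

The point of doing this in the header rather than in robots.txt is that the crawler can still read the file, which it must, while being told not to list it in results.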


QUOTE(qwerty)
I wonder if having it indexed could mean that they don't check it every time before spidering any other documents.


No, there's no reason why it should. You can check your log files to make sure, of course. robots.txt is not checked EVERY time before a resource is accessed, but it is checked very often. This suggests that it is cached, but on a much shorter refresh cycle than a standard index entry.
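For anyone who wants to run that log-file check, here is a rough sketch that counts robots.txt requests per day for one crawler. It assumes the server writes the common "combined" access log format and that the crawler identifies itself with "Googlebot" in its user-agent string; the function name and the regular expression are illustrative, not taken from any particular tool.

```python
import re
from collections import Counter

# Assumed layout (combined log format):
# host ident user [day:time zone] "METHOD path protocol" status bytes "referer" "agent"
LOG_LINE = re.compile(
    r'\[(?P<day>[^:\]]+):[^\]]*\] '
    r'"(?:GET|HEAD) (?P<path>\S+)[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def robots_fetches_per_day(lines, bot="Googlebot"):
    """Count requests for /robots.txt per day whose user agent mentions `bot`."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # line is not in the assumed format
        if match.group("path") == "/robots.txt" and bot in match.group("agent"):
            counts[match.group("day")] += 1
    return counts
```

Passing an open log file to robots_fetches_per_day and comparing the daily counts with the number of other pages the same crawler requested gives a rough idea of how often the file is being re-fetched.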

#10 madams

madams

    HR 5

  • Active Members
  • 504 posts
  • Location:Costa Blanca, Spain

Posted 04 November 2008 - 10:50 AM

But what about pages you don't want people to see?

Surely anyone could access the robots.txt file (of any site) and get a complete road-map into the stuff you asked the bots NOT to show!

Sounds like hiding the front door key under a flower pot and then sticking a note on the front door with instructions where to find it.

#11 Randy

Randy

    Convert Me!

  • Moderator
  • 17,540 posts

Posted 04 November 2008 - 10:54 AM

You should never use robots.txt to provide any type of security, Madams. That's not its purpose, so any webmaster who tries to make robots.txt part of their security procedures is asking for trouble, and usually finds it. After all, bad bots can also read robots.txt to see exactly where they'd want to go.
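To show how little effort the "road-map" takes to read, here is a small sketch that pulls every Disallow path out of the text of a robots.txt file. The function name is hypothetical and the parsing is deliberately simple: it ignores which user agent each rule belongs to.

```python
def disallowed_paths(robots_txt):
    """Return the non-empty Disallow paths found in robots.txt text, in order."""
    paths = []
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()   # drop trailing comments
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths
```

Anything that has to stay private needs real access control, such as a login, because a file that every visitor and every bot can fetch cannot keep a secret.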

#12 madams

madams

    HR 5

  • Active Members
  • 504 posts
  • Location:Costa Blanca, Spain

Posted 04 November 2008 - 11:03 AM

Yeah, I get that, Randy, but don't you find it a little strange that they are indexed at all?

I constantly read that SEs serve up the most relevant results to the punters.

How does indexing a file whose only reason for "being" is to instruct the bots help the public?

#13 Randy

Randy

    Convert Me!

  • Moderator
  • 17,540 posts

Posted 04 November 2008 - 11:18 AM

They just don't exclude .txt files from getting indexed, Madams.

Could they exclude files specifically named robots.txt? Sure.

Should they? Maybe.

Does it make a huge difference either way? No, not really.

#14 Alan Perkins

Alan Perkins

    Token male admin

  • Admin
  • 1,559 posts
  • Location:UK

Posted 04 November 2008 - 11:26 AM

QUOTE(Randy)
Should they? Maybe.


Hmm, I think more than maybe! e.g. for a long time the RFCs were only available as text files. If Google's mission is to index all the world's info, then they can't leave out text files.

Sure, they can make a special case for robots.txt files, but these files are not private, quite the opposite, so there should be no problem at all with indexing them as long as the SE wants to waste(?) the space doing so. The logic is, if you link to it from (say) an HTML page, it may contain something worth indexing; anything that may legally be indexed is a candidate for indexing. If the only further tests are unique content and a PageRank threshold, then many robots.txt files will be indexed, as shown by the link I gave earlier. Some may even contain lots of meaningful text info. See, for example, http://www.webmaster....com/robots.txt.

#15 Randy

Randy

    Convert Me!

  • Moderator
  • 17,540 posts

Posted 04 November 2008 - 11:31 AM

Sorry for the confusion, Alan. With my "maybe" I was talking only about the special case of robots.txt files.

Frankly, if I were a search engine I wouldn't exclude those from being indexed. After all, if someone links to them there might be some useful information there.



