Anybody Else Ever See Google Index Their Robots.txt File?
#1
Posted 03 November 2008 - 10:49 AM
Should I block it via itself?
#2
Posted 03 November 2008 - 10:59 AM
#3
Posted 03 November 2008 - 11:24 AM
#4
Posted 04 November 2008 - 06:50 AM
I've seen whole sites fail to get indexed because robots.txt returned the wrong HTTP status code. Since the spiders couldn't access robots.txt, they didn't index anything on the site. Couldn't blocking robots.txt within robots.txt have the same result?
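To make the status-code concern concrete, here is a minimal sketch of how a cautious crawler might react to the response it gets for robots.txt. The function name `crawl_policy` and the policy itself are assumptions for illustration only; this is not Google's documented behaviour.

```python
def crawl_policy(status_code):
    """Map the HTTP status of a robots.txt request to a crawl decision.

    Hypothetical policy for illustration; real crawlers may differ.
    """
    if 200 <= status_code < 300:
        # File was served normally: read the rules and obey them.
        return "parse rules"
    if status_code in (401, 403):
        # Access denied: assume the owner wants nothing crawled.
        return "crawl nothing"
    if 400 <= status_code < 500:
        # No robots.txt (e.g. 404): assume no restrictions.
        return "crawl everything"
    # Server errors and anything unexpected: back off and retry later.
    return "crawl nothing for now"
```

Under this assumed policy, a server that answers a robots.txt request with a 403 or a 500 instead of a 404 would keep the whole site from being crawled, which matches the kind of failure described above.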
#5
Posted 04 November 2008 - 07:15 AM
Isn't G's goal to index as many web pages / documents as possible, so they can produce the most relevant, targeted and accurate search results for their searchers?
These files exist only for robots, not humans. Why on earth would they index them? What purpose does this serve?
Do you think this is a bug?
#6
Posted 04 November 2008 - 09:23 AM
Nah. If I just exclude it from being indexed (with a line within the text file), I don't think it will be a problem.
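For anyone curious what "a line within the text file" does in practice, here is a small sketch using Python's standard `urllib.robotparser`. The bot name `ExampleBot` and the `www.example.com` URLs are made up for illustration. It only models whether a rule-following crawler would fetch the URL, nothing more.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that lists itself as disallowed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /robots.txt
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Would a compliant crawler be allowed to fetch these URLs?
robots_fetchable = parser.can_fetch("ExampleBot", "http://www.example.com/robots.txt")
index_fetchable = parser.can_fetch("ExampleBot", "http://www.example.com/index.html")

print(robots_fetchable, index_fetchable)
```

The parser treats /robots.txt like any other path here; whether a real search engine honours a rule aimed at the robots.txt file itself is a separate question.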
I did get pointed to a thread from Google's forum where someone from Google did say you could do that. I really don't care that it's indexed, so I think I'll just leave it alone though.
I was just surprised to see it indexed as I'd never noticed that happening before.
#7
Posted 04 November 2008 - 09:26 AM
#8
Posted 04 November 2008 - 09:38 AM
My guess however is that whether the robots.txt is indexed or not probably has no bearing whatsoever on crawling. The simplest thing to do with spiders is to have them check for a robots.txt every single run, regardless of whether they could know that it existed in the past or not. Otherwise they'd never know if there had been any recent changes to the robots.txt since their last spider run.
The reason most robots.txt files don't get indexed is simply because there are no links pointing at them. When a link is found pointing to one though, it would make sense for the spiders to crawl the file as an indexable page, considering spiders are fairly dumb creatures.
#9
Posted 04 November 2008 - 10:12 AM
The problem with that logic is that a line in the text file doesn't stop it being indexed. Technically, it stops it being read. However, robots.txt is the one file that is exempt from the robots standards, so listing robots.txt IN robots.txt is open to interpretation. I'd suggest the safest method, in Google at least, is to prevent the file being indexed by using the NOINDEX equivalent in the HTTP response header.
No, no reason why it should. You can check your log files to make sure, of course. robots.txt is not checked EVERY time before a resource is accessed, but it is checked very often. This suggests that it is cached, but on a much shorter refresh than a standard index.
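As a rough sketch of the header approach: assuming the "NOINDEX equivalent" meant here is the `X-Robots-Tag` response header, a tiny WSGI app could serve robots.txt with that header attached. The app name `app` and the file contents are placeholders, not anyone's real setup.

```python
# Placeholder robots.txt contents (an empty Disallow permits everything).
ROBOTS_BODY = b"User-agent: *\nDisallow:\n"


def app(environ, start_response):
    """Tiny WSGI app: serves /robots.txt with an X-Robots-Tag: noindex header."""
    if environ.get("PATH_INFO") == "/robots.txt":
        headers = [
            ("Content-Type", "text/plain; charset=utf-8"),
            ("Content-Length", str(len(ROBOTS_BODY))),
            # Assumption: this is the header-level equivalent of NOINDEX.
            ("X-Robots-Tag", "noindex"),
        ]
        start_response("200 OK", headers)
        return [ROBOTS_BODY]

    # Everything else is a plain 404 in this sketch.
    body = b"Not found\n"
    start_response(
        "404 Not Found",
        [
            ("Content-Type", "text/plain; charset=utf-8"),
            ("Content-Length", str(len(body))),
        ],
    )
    return [body]
```

To try it locally, `app` can be passed to `wsgiref.simple_server.make_server`. The file stays fully readable by spiders; the header only asks that it be kept out of the index. On a real site the same header would more likely be set in the web server configuration than in application code.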
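The "cached, but on a short refresh" behaviour could look something like the sketch below. The class name `RobotsCache`, the one-hour default and the injected `fetch` function are all assumptions for illustration; no search engine's actual refresh interval is implied.

```python
import time


class RobotsCache:
    """Caches robots.txt text per host and refetches after a time-to-live."""

    def __init__(self, fetch, ttl_seconds=3600.0, clock=time.monotonic):
        self._fetch = fetch      # callable: host -> robots.txt text
        self._ttl = ttl_seconds  # how long a cached copy stays fresh
        self._clock = clock      # injectable so the cache can be tested
        self._entries = {}       # host -> (fetched_at, robots_text)

    def get(self, host):
        now = self._clock()
        entry = self._entries.get(host)
        if entry is not None and now - entry[0] < self._ttl:
            # Cached copy is still fresh: no request is made.
            return entry[1]
        # Missing or stale: fetch again and remember when we did.
        text = self._fetch(host)
        self._entries[host] = (now, text)
        return text
```

With a design like this the log files would show robots.txt requests at fairly regular intervals rather than one before every page fetch, which is consistent with what is described above.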
#10
Posted 04 November 2008 - 10:50 AM
Surely anyone could access the robots.txt file (of any site) and get a complete road-map into the stuff you asked the bots NOT to show!
Sounds like hiding the front door key under a flower pot and then sticking a note on the front door with instructions where to find it.
#11
Posted 04 November 2008 - 10:54 AM
#12
Posted 04 November 2008 - 11:03 AM
I constantly read that SEs serve up the most relevant results to the punters.
How does indexing a file whose only reason for "being" is to instruct the bots help the public?
#13
Posted 04 November 2008 - 11:18 AM
Could they exclude files that carry a specific name of robots.txt? Sure.
Should they? Maybe.
Does it make a huge difference either way? No, not really.
#14
Posted 04 November 2008 - 11:26 AM
Hmm, I think more than maybe! e.g. for a long time the RFCs were only available as text files. If Google's mission is to index all the world's info, then they can't leave out text files.
Sure, they can make a special case for robots.txt files, but these files are not private, quite the opposite, so there should be no problem at all with indexing them as long as the SE wants to waste(?) the space doing so. The logic is: if you link to it from (say) an HTML page, it may contain something worth indexing, and anything that may legally be indexed is a candidate for indexing. If the only further tests are unique content and a PageRank threshold, then many robots.txt files will be indexed, as shown by the link I gave earlier. Some may even contain lots of meaningful text info. See, for example, http://www.webmaster....com/robots.txt.
#15
Posted 04 November 2008 - 11:31 AM
Frankly, if I were a search engine I wouldn't exclude those from being indexed. After all, if someone links to them there might be some useful information there.