Nov 3 2008, 10:49 AM
Post
#1
|
|
![]() High Rankings Advisor Group: Admin Posts: 29,196 Joined: 21-July 03 User's local time: Feb 9 2010, 08:14 AM From: Ashland, MA Member No.: 2 |
I was looking for something specific on my website via a Google search of just pages on my site, and found that they had actually indexed my robots.txt file. I don't recall ever seeing that before, and was wondering if others had noticed it?
Should I block it via itself? |
|
|
|
Nov 3 2008, 10:59 AM
Post
#2
|
|
![]() Token male admin Group: Admin Posts: 1,436 Joined: 28-July 03 User's local time: Feb 9 2010, 01:14 PM From: UK Member No.: 45 |
Yep, I have seen it very often. It most often happens when somebody links to it.
|
|
|
|
Nov 3 2008, 11:24 AM
Post
#3
|
|
![]() High Rankings Advisor Group: Admin Posts: 29,196 Joined: 21-July 03 User's local time: Feb 9 2010, 08:14 AM From: Ashland, MA Member No.: 2 |
Surprised I never noticed before. I guess it's no big deal, but for some reason I just assumed they wouldn't bother to index them. You didn't need to link to mine, Alan, they've already got it indexed!
|
|
|
|
Nov 4 2008, 06:50 AM
Post
#4
|
|
![]() HR 6 ![]() ![]() ![]() ![]() ![]() ![]() Group: Active Members Posts: 848 Joined: 21-November 05 User's local time: Feb 9 2010, 01:14 PM From: Ogmore-by-Sea, Wales, UK Member No.: 9,487 |
QUOTE Should I block it via itself? I've seen whole sites go unindexed because the robots.txt returned the wrong HTTP status code. As they couldn't access the robots.txt they didn't index anything on the site. Couldn't blocking the robots.txt in the robots.txt not result in the same? |
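A rough sketch of how a crawler might turn the robots.txt response status into a crawl policy. The mapping follows the robots exclusion rules as later written down in RFC 9309 (4xx means "no restrictions", 5xx means "assume everything is disallowed"); the function name and the returned strings are made up for illustration, and any particular search engine may behave differently.

```python
def robots_policy(status_code: int) -> str:
    """Hypothetical helper: map a robots.txt HTTP status to a crawl policy."""
    if 200 <= status_code < 300:
        return "parse rules"      # file fetched: obey what it says
    if 300 <= status_code < 400:
        return "follow redirect"  # look for the file at the new location
    if 400 <= status_code < 500:
        return "allow all"        # no robots.txt available: nothing is restricted
    if 500 <= status_code < 600:
        return "disallow all"     # server error: treat the whole site as off limits
    return "unknown"


for code in (200, 404, 503):
    print(code, robots_policy(code))
```

Under these rules, a server that answers robots.txt requests with a 5xx error would keep a compliant crawler away from every page, which would fit the unindexed sites described above.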
|
|
|
Nov 4 2008, 07:15 AM
Post
#5
|
|
![]() Keep Asking, Keep Questioning, Keep Learning ![]() ![]() ![]() ![]() ![]() ![]() ![]() Group: Active Members Posts: 1,950 Joined: 24-May 07 User's local time: Feb 9 2010, 01:14 PM From: Worthing - England Member No.: 17,339 |
Isn't G's goal to index as many web pages / documents as possible, to produce the most relevant, targeted and accurate search results they can for their searchers? These files exist only for robots, not humans, so why on earth would they index them? What purpose does this serve? Do you think this is a bug? |
|
|
|
Nov 4 2008, 09:23 AM
Post
#6
|
|
![]() High Rankings Advisor Group: Admin Posts: 29,196 Joined: 21-July 03 User's local time: Feb 9 2010, 08:14 AM From: Ashland, MA Member No.: 2 |
QUOTE Couldn't blocking the robots.txt in the robots.txt not result in the same? Nah. If I just exclude it from being indexed (with a line within the text file), I don't think it will be a problem. I did get pointed to a thread from Google's forum where someone from Google did say you could do that. I really don't care that it's indexed, so I think I'll just leave it alone though. I was just surprised to see it indexed as I'd never noticed that happening before. |
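For illustration, here is what "a line within the text file" looks like to a generic parser. This sketch uses Python's standard `urllib.robotparser`, which treats `/robots.txt` like any other path; the sample rules are hypothetical, and it says nothing about how Google itself handles a robots.txt that lists itself.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that lists itself
rules = """\
User-agent: *
Disallow: /robots.txt
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# The rule matches the file's own path, and nothing else on the site
robots_allowed = parser.can_fetch("*", "/robots.txt")
page_allowed = parser.can_fetch("*", "/index.html")
print(robots_allowed, page_allowed)
```

Note that the rule only expresses "do not fetch this path". A crawler still has to fetch robots.txt to learn the rule in the first place.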
|
|
|
Nov 4 2008, 09:26 AM
Post
#7
|
|
![]() HR 10 Group: Moderator Posts: 7,489 Joined: 24-July 03 User's local time: Feb 9 2010, 08:14 AM From: Somerville, MA Member No.: 22 |
I wonder if having it indexed could mean that they don't check it every time before spidering any other documents. That would be problematic, if you made a change to robots.txt and G continued to spider the blocked pages for a while before noticing the change.
|
|
|
|
Nov 4 2008, 09:38 AM
Post
#8
|
|
![]() Convert Me! Group: Admin Posts: 17,377 Joined: 17-August 03 User's local time: Feb 9 2010, 07:14 AM Member No.: 551 |
It's an interesting question and one I've never seen answered, Bob.
My guess, however, is that whether the robots.txt is indexed or not probably has no bearing whatsoever on crawling. The simplest thing to do with spiders is to have them check for a robots.txt every single run, regardless of whether they could know that it existed in the past or not. Otherwise they'd never know if there had been any recent changes to the robots.txt since their last spider run. The reason most robots.txt files don't get indexed is simply that there are no links pointing at them. When a link is found pointing to one, though, it would make sense for the spiders to crawl the file as an indexable page, considering spiders are fairly dumb creatures. |
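To make the re-check idea concrete, a minimal sketch of a crawler-side store that refetches robots.txt once its copy is older than some limit. Everything here is hypothetical (the class name, the `fetch` callback, the time limit); it is not a description of how any real spider works. With a limit of zero it refetches on every run, as guessed above.

```python
import time
from typing import Callable, Dict, Tuple


class RobotsCache:
    """Hypothetical per-host robots.txt store with a time-to-live."""

    def __init__(self, fetch: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch          # callback that downloads robots.txt for a host
        self._ttl = ttl_seconds      # how long a stored copy counts as fresh
        self._clock = clock          # injectable clock, handy for testing
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, host: str) -> str:
        now = self._clock()
        cached = self._entries.get(host)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]         # still fresh: reuse the stored copy
        body = self._fetch(host)     # missing or stale: fetch again
        self._entries[host] = (now, body)
        return body
```

The trade-off is the one raised in the question: the longer the limit, the longer a change to robots.txt can go unnoticed.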
|
|
|
Nov 4 2008, 10:12 AM
Post
#9
|
|
![]() Token male admin Group: Admin Posts: 1,436 Joined: 28-July 03 User's local time: Feb 9 2010, 01:14 PM From: UK Member No.: 45 |
QUOTE(Jill) If I just exclude it from being indexed (with a line within the text file), I don't think it will be a problem.
The problem with that logic is that a line in the text file doesn't stop it being indexed. Technically, it stops it being read. However, robots.txt is the one file that is legally exempt from the robots standards, so the listing of robots.txt IN robots.txt is open to interpretation. I'd suggest the safest method, in Google at least, is to use this technique to prevent a file being indexed using NOINDEX equivalents in the HTTP response header.
QUOTE(qwerty) I wonder if having it indexed could mean that they don't check it every time before spidering any other documents.
No, there's no reason why it should. You can check your log files to make sure, of course. robots.txt is not checked EVERY time before a resource is accessed, but it is checked very often. This suggests that it is cached, but on a much shorter refresh than a standard index. |
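A sketch of the header approach, assuming the "NOINDEX equivalent" meant here is the `X-Robots-Tag` response header that Google supports. The tiny WSGI app below is hypothetical and only shows where the header goes; on a real site you would more likely set it in the web server configuration.

```python
ROBOTS_BODY = b"User-agent: *\nDisallow:\n"


def app(environ, start_response):
    """Minimal WSGI app: serve robots.txt with a noindex response header."""
    if environ.get("PATH_INFO") == "/robots.txt":
        start_response("200 OK", [
            ("Content-Type", "text/plain; charset=utf-8"),
            ("Content-Length", str(len(ROBOTS_BODY))),
            # The file stays fetchable, so spiders can still read the rules,
            # but engines that honor this header are asked not to list it.
            ("X-Robots-Tag", "noindex"),
        ])
        return [ROBOTS_BODY]
    start_response("404 Not Found",
                   [("Content-Type", "text/plain; charset=utf-8")])
    return [b"not found\n"]
```

It can be tried locally with the standard library, for example `wsgiref.simple_server.make_server("", 8000, app).serve_forever()`.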
|
|
|
Nov 4 2008, 10:50 AM
Post
#10
|
|
![]() HR 5 ![]() ![]() ![]() ![]() ![]() Group: Active Members Posts: 463 Joined: 22-March 04 User's local time: Feb 9 2010, 02:14 PM From: Costa Blanca, Spain Member No.: 2,974 |
But what about pages you don't want people to see?
Surely anyone could access the robots.txt file (of any site) and get a complete road-map to the stuff you asked the bots NOT to show! Sounds like hiding the front door key under a flower pot and then sticking a note on the front door with instructions on where to find it. |
|
|
|
Nov 4 2008, 10:54 AM
Post
#11
|
|
![]() Convert Me! Group: Admin Posts: 17,377 Joined: 17-August 03 User's local time: Feb 9 2010, 07:14 AM Member No.: 551 |
You should never use robots.txt to provide any type of security, Madams. That's not its purpose, so any webmaster who tries to make robots.txt part of their security procedures is asking for trouble, and usually finds it. After all, bad bots can also read the robots.txt to see exactly where they'd want to go.
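To show how little effort that takes, a short sketch that pulls every `Disallow` path out of a robots.txt file. The sample file and its paths are invented for the example.

```python
def disallowed_paths(robots_txt: str) -> list:
    """Return every path named in a Disallow line, in file order."""
    paths = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and padding
        field, _, value = line.partition(":")
        if field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


# Invented sample file
sample = """\
User-agent: *
Disallow: /admin/
Disallow: /private-reports/   # not linked anywhere
Disallow:
"""

print(disallowed_paths(sample))  # ['/admin/', '/private-reports/']
```

Anything that must stay private needs real access control (a login, for instance), not a line in a public text file.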
|
|
|
|
Nov 4 2008, 11:03 AM
Post
#12
|
|
![]() HR 5 ![]() ![]() ![]() ![]() ![]() Group: Active Members Posts: 463 Joined: 22-March 04 User's local time: Feb 9 2010, 02:14 PM From: Costa Blanca, Spain Member No.: 2,974 |
Yeah, I get that, Randy, but don't you find it a little strange that they are indexed at all?
I constantly read that SEs serve up the most relevant results to the punters. How does indexing a file whose only reason for "being" is to instruct the bots help the public? |
|
|
|
Nov 4 2008, 11:18 AM
Post
#13
|
|
![]() Convert Me! Group: Admin Posts: 17,377 Joined: 17-August 03 User's local time: Feb 9 2010, 07:14 AM Member No.: 551 |
They just don't exclude .txt files from getting indexed Madams.
Could they exclude files that carry a specific name of robots.txt? Sure. Should they? Maybe. Does it make a huge difference either way? No, not really. |
|
|
|
Nov 4 2008, 11:26 AM
Post
#14
|
|
![]() Token male admin Group: Admin Posts: 1,436 Joined: 28-July 03 User's local time: Feb 9 2010, 01:14 PM From: UK Member No.: 45 |
QUOTE(Randy) Should they? Maybe.
Hmm, I think more than maybe! E.g. for a long time the RFCs were only available as text files. If Google's mission is to index all the world's info, then they can't leave out text files. Sure, they can make a special case for robots.txt files - but these files are not private, quite the opposite, so there should be no problem at all with indexing them as long as the SE wants to waste(?) the space doing so. The logic is: if you link to it from (say) an HTML page, it may contain something worth indexing; anything that may legally be indexed is a candidate for indexing. If the only further tests are unique content and PageRank threshold, then many robots.txt files will be indexed, as shown by the link I gave earlier. Some may even contain lots of meaningful text info. See, for example, http://www.webmasterworld.com/robots.txt. |
|
|
|
Nov 4 2008, 11:31 AM
Post
#15
|
|
![]() Convert Me! Group: Admin Posts: 17,377 Joined: 17-August 03 User's local time: Feb 9 2010, 07:14 AM Member No.: 551 |
Sorry for the confusion, Alan. With my "maybe" I was talking only about the special case of robots.txt files.
Frankly, if I were a search engine I wouldn't exclude those from being indexed. After all, if someone links to them there might be some useful information there. |
|
|
|