High Rankings Search Engine Optimization ForumHigh Rankings Advisor Search Marketing Newsletter

> Anybody Else Ever See Google Index Their Robots.txt File?, Should I exclude it via itself? :D
Jill
post Nov 3 2008, 10:49 AM
Post #1


High Rankings Advisor

Group: Admin
Posts: 29,196
Joined: 21-July 03
User's local time:
Feb 9 2010, 08:14 AM
From: Ashland, MA
Member No.: 2



I was looking for something specific on my website via a Google search of just pages on my site, and found that they had actually indexed my robots.txt file. I don't recall ever seeing that before, and was wondering if others had noticed it?

Should I block it via itself?
Alan Perkins
post Nov 3 2008, 10:59 AM
Post #2


Token male admin

Group: Admin
Posts: 1,436
Joined: 28-July 03
User's local time:
Feb 9 2010, 01:14 PM
From: UK
Member No.: 45



Yep, I have seen it very often. It most often happens when somebody links to it.
Jill
post Nov 3 2008, 11:24 AM
Post #3


High Rankings Advisor

Group: Admin
Posts: 29,196
Joined: 21-July 03
User's local time:
Feb 9 2010, 08:14 AM
From: Ashland, MA
Member No.: 2



Surprised I never noticed before. I guess it's no big deal, but for some reason I just assumed they wouldn't bother to index them. You didn't need to link to mine, Alan, they've already got it indexed!
MaKa
post Nov 4 2008, 06:50 AM
Post #4


HR 6
******

Group: Active Members
Posts: 848
Joined: 21-November 05
User's local time:
Feb 9 2010, 01:14 PM
From: Ogmore-by-Sea, Wales, UK
Member No.: 9,487



QUOTE(Jill @ Nov 3 2008, 03:49 PM)
Should I block it via itself?


I've seen whole sites not being indexed because the robots.txt returned the wrong HTTP status code. As they couldn't access the robots.txt, they didn't index anything on the site. Couldn't blocking the robots.txt in the robots.txt not result in the same?
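
To make the status-code concern concrete, here is a minimal sketch of how a crawler might turn the HTTP status of a robots.txt request into a crawl decision. The function name `crawl_policy` and the policy labels are hypothetical. The 401/403 and other-4xx branches mirror what Python's `urllib.robotparser` does when its own fetch fails; the final branch is an assumption. None of this is a description of how Google actually behaves.

```python
def crawl_policy(status: int) -> str:
    """Map the HTTP status of a robots.txt fetch to a crawl decision.

    Hypothetical policy, for illustration only.
    """
    if 200 <= status < 300:
        return "parse_rules"   # file was served: read it and obey it
    if status in (401, 403):
        return "disallow_all"  # access denied: treat the whole site as off limits
    if 400 <= status < 500:
        return "allow_all"     # e.g. 404: no robots.txt, so nothing is blocked
    # Assumption: on 5xx or anything unexpected, stay away until it is fixed.
    # (Redirects are assumed to be followed by the HTTP client before this point.)
    return "disallow_all"


for code in (200, 403, 404, 500):
    print(code, crawl_policy(code))
```

Under a policy like this, a server that answers robots.txt requests with 403 or 500 would keep the crawler off every page, which matches the symptom described above.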
1dmf
post Nov 4 2008, 07:15 AM
Post #5


Keep Asking, Keep Questioning, Keep Learning
*******

Group: Active Members
Posts: 1,950
Joined: 24-May 07
User's local time:
Feb 9 2010, 01:14 PM
From: Worthing - England
Member No.: 17,339




Isn't G's goal to index as many web pages / documents as possible, so they can produce the most relevant, targeted and accurate search results they possibly can for their searchers?

These files exist only for robots, not humans. Why on earth would they index them? What purpose does this serve?

Do you think this is a bug?
Jill
post Nov 4 2008, 09:23 AM
Post #6


High Rankings Advisor

Group: Admin
Posts: 29,196
Joined: 21-July 03
User's local time:
Feb 9 2010, 08:14 AM
From: Ashland, MA
Member No.: 2



QUOTE
Couldn't blocking the robots.txt in the robots.txt not result in the same?


Nah. If I just exclude it from being indexed (with a line within the text file), I don't think it will be a problem.

I did get pointed to a thread from Google's forum where someone from Google did say you could do that. I really don't care that it's indexed, so I think I'll just leave it alone though.

I was just surprised to see it indexed as I'd never noticed that happening before.
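
For reference, the kind of line being discussed is a `Disallow` rule that names robots.txt itself. The sketch below feeds such a rule to Python's standard `urllib.robotparser` to show what a rule-following crawler would conclude from it. The host `example.com` and the `ExampleBot` user agent are placeholders, and real search engines may treat robots.txt as a special case.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that lists itself, as discussed above.
RULES = """\
User-agent: *
Disallow: /robots.txt
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# The rule only covers the path it names; other pages are unaffected.
robots_ok = parser.can_fetch("ExampleBot", "http://example.com/robots.txt")
page_ok = parser.can_fetch("ExampleBot", "http://example.com/index.html")
print("may fetch robots.txt:", robots_ok)
print("may fetch index.html:", page_ok)
```

Note that this parser only answers "may I fetch this URL?". It says nothing about whether a URL may appear in search results.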
qwerty
post Nov 4 2008, 09:26 AM
Post #7


HR 10

Group: Moderator
Posts: 7,489
Joined: 24-July 03
User's local time:
Feb 9 2010, 08:14 AM
From: Somerville, MA
Member No.: 22



I wonder if having it indexed could mean that they don't check it every time before spidering any other documents. That would be problematic, if you made a change to robots.txt and G continued to spider the blocked pages for a while before noticing the change.
Randy
post Nov 4 2008, 09:38 AM
Post #8


Convert Me!

Group: Admin
Posts: 17,377
Joined: 17-August 03
User's local time:
Feb 9 2010, 07:14 AM
Member No.: 551



It's an interesting question and one I've never seen answered, Bob.

My guess however is that whether the robots.txt is indexed or not probably has no bearing whatsoever on crawling. The simplest thing to do with spiders is to have them check for a robots.txt every single run, regardless of whether they could know that it existed in the past or not. Otherwise they'd never know if there had been any recent changes to the robots.txt since their last spider run.

The reason most robots.txt files don't get indexed is simply because there are no links pointing at them. When a link is found pointing to one though, it would make sense for the spiders to crawl the file as an indexable page, considering spiders are fairly dumb creatures.
Alan Perkins
post Nov 4 2008, 10:12 AM
Post #9


Token male admin

Group: Admin
Posts: 1,436
Joined: 28-July 03
User's local time:
Feb 9 2010, 01:14 PM
From: UK
Member No.: 45



QUOTE(Jill)
If I just exclude it from being indexed (with a line within the text file), I don't think it will be a problem.


The problem with that logic is that a line in the text file doesn't stop it being indexed. Technically, it stops it being read. However, robots.txt is the one file that is legally exempt from the robots standards, so the listing of robots.txt IN robots.txt is open to interpretation. I'd suggest the safest method in Google at least is to use this technique to prevent a file being indexed using NOINDEX equivalents in the HTTP response header.
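
One header of that kind is `X-Robots-Tag`; whether it is the exact technique the post originally linked to cannot be recovered from the text. As a rough sketch, assuming you control how robots.txt is served, the handler below uses Python's standard `http.server` to send the file with `X-Robots-Tag: noindex`. The names `robots_headers`, `RobotsHandler`, `serve`, and the `ROBOTS_BODY` contents are made up for illustration. On a production site this would more likely be set in the web server configuration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

ROBOTS_BODY = b"User-agent: *\nDisallow:\n"  # placeholder rules: allow everything


def robots_headers(body: bytes) -> list[tuple[str, str]]:
    """Build the response headers for robots.txt, including the noindex hint."""
    return [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
        # The file stays fetchable; this only asks engines not to list it.
        ("X-Robots-Tag", "noindex"),
    ]


class RobotsHandler(BaseHTTPRequestHandler):
    """Serve /robots.txt and answer 404 for everything else."""

    def do_GET(self):
        if self.path != "/robots.txt":
            self.send_error(404)
            return
        self.send_response(200)
        for name, value in robots_headers(ROBOTS_BODY):
            self.send_header(name, value)
        self.end_headers()
        self.wfile.write(ROBOTS_BODY)


def serve(port: int = 8000) -> None:
    """Run the demo server until interrupted (not called automatically)."""
    HTTPServer(("127.0.0.1", port), RobotsHandler).serve_forever()
```

Calling `serve()` and requesting `/robots.txt` should show the extra header in the response. The point of doing it this way is that the crawler can still read the rules, which a `Disallow` line for the file itself would not guarantee.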


QUOTE(qwerty)
I wonder if having it indexed could mean that they don't check it every time before spidering any other documents.


No, no reason why it should. You can check your log files to make sure, of course. robots.txt is not checked EVERY time before a resource is accessed, but it is checked very often. This suggests that it is cached, but on a much shorter refresh than a standard index.
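
The "cached, but refreshed often" behaviour can be sketched as a small time-based cache. Everything here is hypothetical: the `RobotsCache` class, the injected `fetch` and `clock` callables, and the one-hour default are assumptions chosen for illustration, not Google's real refresh interval.

```python
import time
from typing import Callable, Dict, Tuple


class RobotsCache:
    """Cache robots.txt bodies per host and refetch once they are older than ttl."""

    def __init__(self, fetch: Callable[[str], str], ttl: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch      # function that downloads robots.txt for a host
        self._ttl = ttl          # maximum age in seconds before a refetch
        self._clock = clock      # injectable so the cache can be tested
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, host: str) -> str:
        now = self._clock()
        cached = self._entries.get(host)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]     # still fresh: no request reaches the server
        body = self._fetch(host)  # missing or stale: fetch and remember
        self._entries[host] = (now, body)
        return body
```

With a cache like this, a change to robots.txt is only noticed after the current entry expires, which is why a shorter refresh interval narrows the window qwerty was worried about. It also fits what the server logs would show: fewer robots.txt requests than page requests, but still plenty of them.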
madams
post Nov 4 2008, 10:50 AM
Post #10


HR 5
*****

Group: Active Members
Posts: 463
Joined: 22-March 04
User's local time:
Feb 9 2010, 02:14 PM
From: Costa Blanca, Spain
Member No.: 2,974



But what about pages you don't want people to see?

Surely anyone could access the robots.txt file (of any site) and get a complete road-map into the stuff you asked the bots NOT to show!

Sounds like hiding the front door key under a flower pot and then sticking a note on the front door with instructions where to find it.
Randy
post Nov 4 2008, 10:54 AM
Post #11


Convert Me!

Group: Admin
Posts: 17,377
Joined: 17-August 03
User's local time:
Feb 9 2010, 07:14 AM
Member No.: 551



You should never use robots.txt to provide any type of security, Madams. That's not its purpose, so any webmaster who tries to make robots.txt part of their security procedures is asking for trouble, and usually finds it. After all, bad bots can also read the robots.txt to see exactly where they'd want to go.
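
To illustrate how little effort the "road-map" takes to read, here is a minimal sketch that pulls every `Disallow` path out of a robots.txt body. The function name `disallowed_paths` and the sample paths are hypothetical, and the parsing is deliberately simplified.

```python
def disallowed_paths(robots_txt: str) -> list[str]:
    """Return every non-empty Disallow path found in a robots.txt body."""
    paths = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


SAMPLE = """\
User-agent: *
Disallow: /admin/       # hypothetical paths
Disallow: /private/reports/
Disallow:
"""
print(disallowed_paths(SAMPLE))
```

Anything that genuinely needs protecting should sit behind authentication rather than a `Disallow` line.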
madams
post Nov 4 2008, 11:03 AM
Post #12


HR 5
*****

Group: Active Members
Posts: 463
Joined: 22-March 04
User's local time:
Feb 9 2010, 02:14 PM
From: Costa Blanca, Spain
Member No.: 2,974



Yeah, I get that, Randy, but don't you find it a little strange that they are indexed at all?

I constantly read that SEs serve up the most relevant results to the punters.

How does indexing a file whose only reason for "being" is to instruct the bots help the public?
Randy
post Nov 4 2008, 11:18 AM
Post #13


Convert Me!

Group: Admin
Posts: 17,377
Joined: 17-August 03
User's local time:
Feb 9 2010, 07:14 AM
Member No.: 551



They just don't exclude .txt files from getting indexed, Madams.

Could they exclude files that carry a specific name of robots.txt? Sure.

Should they? Maybe.

Does it make a huge difference either way? No, not really.
Alan Perkins
post Nov 4 2008, 11:26 AM
Post #14


Token male admin

Group: Admin
Posts: 1,436
Joined: 28-July 03
User's local time:
Feb 9 2010, 01:14 PM
From: UK
Member No.: 45



QUOTE(Randy)
Should they? Maybe.


Hmm, I think more than maybe! e.g. for a long time the RFCs were only available as text files. If Google's mission is to index all the world's info, then they can't leave out text files.

Sure, they can make a special case for robots.txt files - but these files are not private, quite the opposite, so there should be no problem at all with indexing them as long as the SE wants to waste(?) the space doing so. The logic is, if you link to it from (say) an HTML page, it may contain something worth indexing; anything that may legally be indexed is a candidate for indexing. If the only further tests are unique content and PageRank threshold, then many robots.txt files will be indexed, as shown by the link I gave earlier. Some may even contain lots of meaningful text info. See, for example, http://www.webmasterworld.com/robots.txt.
Randy
post Nov 4 2008, 11:31 AM
Post #15


Convert Me!
Group Icon

Group: Admin
Posts: 17,377
Joined: 17-August 03
User's local time:
Feb 9 2010, 07:14 AM
Member No.: 551



Sorry for the confusion, Alan. With my "maybe" I was talking only about the special case of robots.txt files.

Frankly, if I were a search engine I wouldn't exclude those from being indexed. After all, if someone links to them there might be some useful information there.
This forum is sponsored by High Rankings, a Boston SEO Agency