What Does "noindex" Mean


7 replies to this topic

#1 maleman

maleman

    HR 6

  • Active Members
  • 677 posts

Posted 10 July 2008 - 12:23 PM

Greetings Folks,

What exactly will a legit bot do when it sees the following in robots.txt?

CODE
User-agent: *
Disallow: /private/

Will it not store information about the directory or files in the directory? Will the info not be stored on a search engine server and displayed in a SERP?

Will the URL (hyperlink) be stored on a search engine server and displayed in a SERP with only the URL shown and no info about the content?

What happens if the "private" directory contains several .html files but no index.html file?

Thanks

Edited by maleman, 10 July 2008 - 12:41 PM.


#2 Randy

Randy

    Convert Me!

  • Moderator
  • 17,540 posts

Posted 10 July 2008 - 12:36 PM

QUOTE
Does it mean that any information about the directory or files in the directory will NOT be crawled and stored on a search engine server and displayed in a SERP?


Per the specs, yes, this is what it means. Robots that follow the standards will not crawl or index any pages in the /private/ subdirectory. If they can reach existing links that point to pages in the /private/ subdirectory, they'll know those pages exist, but they still will not crawl or index them because of the robots.txt exclusion.
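As a rough illustration of the crawl check described above, here is a minimal Python sketch using the standard library's urllib.robotparser. The bot name ExampleBot and the example.com URLs are made up for illustration, and real crawlers use their own implementations.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt rules from the original question.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_crawl(url, user_agent="ExampleBot"):
    """Return True if a compliant bot is allowed to fetch the URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://example.com/index.html",
        "http://example.com/private/",
        "http://example.com/private/report.html",
    ):
        print(url, "->", "fetch" if may_crawl(url) else "skip")
```

The rule is a path-prefix match, so it covers every file under /private/ whether or not the directory has an index.html. Note that this check only answers whether a URL may be fetched.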

Because they don't crawl the pages in the first place, any links on these excluded pages that point to other pages will not be seen, so they won't end up in the linking graph or linking database. As an aside, this is how robots.txt and a meta robots noindex instruction in the HTML differ. With the latter (assuming there's not a nofollow also), the engines will in fact see the links on your excluded page. The excluded page will still not be included in the index, but the links on the excluded page may appear in the linking graph.
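To make the contrast concrete, here is a small Python sketch of what a bot can learn once it is allowed to fetch a page that carries a meta robots noindex. The HTML snippet and the class name RobotsMetaParser are hypothetical; the point is only that the page has to be read before the instruction (and the links) can be seen.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect meta robots directives and link targets from fetched HTML."""

    def __init__(self):
        super().__init__()
        self.directives = set()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])


# Hypothetical page: the bot had to fetch it to see the noindex instruction.
PAGE = """
<html><head><meta name="robots" content="noindex"></head>
<body><a href="/other-page.html">Other page</a></body></html>
"""

page = RobotsMetaParser()
page.feed(PAGE)

index_page = "noindex" not in page.directives     # page stays out of the index
follow_links = "nofollow" not in page.directives  # links may still be used
```

With robots.txt exclusion, none of this parsing happens, because the page is never requested.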

QUOTE
What happens if the "private" directory contains several .html files but no index.html file?


Since you're excluding the entire subdirectory via robots.txt, none of the pages will be crawled or indexed.

#3 maleman

maleman

    HR 6

  • Active Members
  • 677 posts

Posted 10 July 2008 - 12:59 PM

Hey Randy,

Everything you said there matches how I understood it.

When I do a site search, my files in the "private" dir are showing up as hyperlinks with no description or anything else, just the links.

I have links to files in the "private" dir on pages that are not excluded. Evidently, these links are why the links to the excluded files are showing up in the SERPs.

Now, if I had stuck those files in a private dir with no links pointing to them and no index in the directory, the files would not be found, so they would not be indexed, nor would they show in a SERP.

Jill inspired this thread from another post I made about robots.txt

Thanks

#4 Randy

Randy

    Convert Me!

  • Moderator
  • 17,540 posts

Posted 10 July 2008 - 01:59 PM

Do you have a Webmaster Tools account set up for the domain by chance?

If so, I'd complain directly to Google about it. After all, their own help docs (it's the first one) state that they're breaking their own standards.

Not that it should matter, since normal searchers wouldn't do a site: type of search. But still, the possibility exists that some rogue webmaster out there might use it. Of course they'll probably just tell you to use their URL removal service instead of actually fixing the problem on their end. That's what big companies do. They've already said as much at the end of the above document, where they give themselves an out that they frankly shouldn't have if they're going to follow the specs.

Or, if you really wanna screw with them, drop an .htaccess in that subdirectory that looks for any of the Google user agents (Googlebot, MediaPartners, Googlebot-image) and sends their bot to a 404 error page no matter what page they request in your private subdirectory. You could even give the 404 error page an appropriate message deriding their implementation of the robots.txt standard, complete with flowery language. That could be fun.
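The decision logic behind that idea can be sketched in Python rather than .htaccess syntax. This is only an illustration of the check, not server configuration: the function name status_for and the token list are assumptions, with the substrings taken from the user agents named above, and real user-agent strings should be confirmed against your own logs.

```python
# Substrings based on the user agents named in the post; matching is
# case-insensitive. "googlebot" also matches Googlebot-image.
GOOGLE_UA_TOKENS = ("googlebot", "mediapartners")


def status_for(user_agent, path, private_prefix="/private/"):
    """Return 404 for Google bots requesting anything under the private prefix."""
    ua = (user_agent or "").lower()
    is_google_bot = any(token in ua for token in GOOGLE_UA_TOKENS)
    if is_google_bot and path.startswith(private_prefix):
        return 404
    return 200
```

Ordinary visitors and requests outside the private subdirectory fall through to the normal 200 response.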

Edited by Randy, 10 July 2008 - 02:05 PM.


#5 maleman

maleman

    HR 6

  • Active Members
  • 677 posts

Posted 10 July 2008 - 03:54 PM

QUOTE
Do you have a Webmaster Tools account set up for the domain by chance?

No sir, I don't have any on this site. I only have one tool set up on a different site, a Live Search tool.

QUOTE
That could be fun

Yes sir that could be fun. But I can't monkey around too much on this site because it's important for business. So, I may try some of that on another site. I do have one I can play with.

#6 Jill

Jill

    High Rankings Advisor

  • Admin
  • 32,324 posts

Posted 10 July 2008 - 06:21 PM

QUOTE
When I do a site search, my files in the "private" dir are showing up as hyperlinks with no description or anything else, just the links.


This is very common. It's generally because they already knew about those URLs previously. Did you have those URLs un-excluded for a while?

#7 Alan Perkins

Alan Perkins

    Token male admin

  • Admin
  • 1,559 posts
  • Location:UK

Posted 11 July 2008 - 05:45 AM

robots.txt is not an indexing standard, it is a crawling standard. No part of the robots.txt standard itself has anything to do with indexing. Therefore, the effect of robots.txt upon indexing is one of interpretation.

To work out what then happens (or may happen), you have to separate URLs from content.

Content can't be indexed if it can't be read. robots.txt prevents content being read. Therefore, it can prevent content being indexed.

However, that doesn't stop the URL itself from being indexed. Nothing can stop that except interpretation. The way Google interprets things is as follows:
  • As soon as Googlebot sees a link to a URL, it may index that URL. Along with the link to the URL, Google may also index some metadata, such as the ODP description associated with that URL or the anchor text used in the link to that URL. Note that all of this metadata can be obtained without the URL itself ever being fetched. The result is a partially-indexed page - the URL and some metadata are indexed, but not the content at the URL itself.
  • Ultimately Googlebot may try to fetch the URL itself. Before doing so, it will check the robots.txt file to ensure it is allowed to fetch the URL. At this point, Googlebot may discover that the URL is protected by robots.txt. If that's the case, it will remove the partially-indexed page from its index.
  • As time goes on, Googlebot may discover or re-discover links to the URL. It appears that Googlebot does not keep a list of URLs that it may not index; nor does it check robots.txt before creating a partially-indexed page. Therefore, at the point where Googlebot discovers or re-discovers a link to the URL, it may re-create the partially-indexed page.
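The three steps above can be sketched as a toy model in Python. This is only a simulation of the behaviour as described in this post, not Google's actual implementation; the class ToyIndex, the bot name ExampleBot, and the example.com URL are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser


class ToyIndex:
    """Toy model of the three steps described above; not Google's actual code."""

    def __init__(self, robots_lines):
        self.robots = RobotFileParser()
        self.robots.parse(robots_lines)
        # url -> {"anchor_texts": [...], "content": str or None}
        self.entries = {}

    def discover_link(self, url, anchor_text):
        # Steps 1 and 3: a link creates (or re-creates) a partial entry
        # without consulting robots.txt and without fetching the URL.
        entry = self.entries.setdefault(url, {"anchor_texts": [], "content": None})
        entry["anchor_texts"].append(anchor_text)

    def try_fetch(self, url, fetch, user_agent="ExampleBot"):
        # Step 2: robots.txt is checked only when a fetch is attempted.
        if not self.robots.can_fetch(user_agent, url):
            self.entries.pop(url, None)  # drop the partial entry
            return False
        entry = self.entries.setdefault(url, {"anchor_texts": [], "content": None})
        entry["content"] = fetch(url)
        return True


index = ToyIndex(["User-agent: *", "Disallow: /private/"])
url = "http://example.com/private/report.html"

index.discover_link(url, "quarterly report")
listed_after_discovery = url in index.entries

index.try_fetch(url, fetch=lambda u: "<html>...</html>")
listed_after_fetch_attempt = url in index.entries

index.discover_link(url, "quarterly report")
listed_after_rediscovery = url in index.entries
```

In this model the excluded URL is listed, dropped, and listed again, and its content stays empty throughout because the fetch is never allowed.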


#8 maleman

maleman

    HR 6

  • Active Members
  • 677 posts

Posted 11 July 2008 - 09:23 AM

Thanks to all!

Wow Alan Perkins, that's a lot of info. I appreciate it.

QUOTE
Did you have those URLS un-excluded for awhile?

Not to my knowledge. I had them in a password protected directory during the testing phase and moved them to the excluded public directory after testing.

Near as I can figure, the links on the un-excluded pages are what show in the return.

TGIF



