Robots.txt Not Working?


3 replies to this topic

#1 ephricon

    HR 4

  • Active Members
  • 236 posts
  • Location:over there -->

Posted 20 May 2004 - 06:29 AM

I just built a page on my site that I'm doing some experimenting with. I want to block Google from it, so to keep things simple I created a directory for it. I put the page in a directory on the root and blocked "Googlebot" from that directory in my robots.txt file.
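For readers following along, here is a minimal sketch of the kind of rule described above, checked with Python's standard `urllib.robotparser`. The directory name `/experiment/` and the `www.example.com` URLs are assumptions for illustration; the original post does not name them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: "/experiment/" is an assumed directory name,
# not one taken from the original post.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /experiment/
"""


def can_fetch(agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if `agent` may crawl `url` under the given robots.txt text."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("Googlebot", "http://www.example.com/experiment/page.html"),
        ("Googlebot", "http://www.example.com/index.html"),
        ("OtherBot", "http://www.example.com/experiment/page.html"),
    ]:
        print(agent, url, can_fetch(agent, url))
```

Note that this only answers "may this crawler fetch the URL?". It says nothing about whether a search engine will still list the bare URL, which is the question this thread is about.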

I did an allinurl search this morning and the site is indeed in Google's index, although the listing shows only a link to the URL, with no title or description for the page (which Google does show for all my other pages that are not blocked in the robots.txt file).

Is this what is supposed to happen? Are they essentially just recognizing that the page exists because it is linked from other pages, but not indexing the content or attributes of the page that would cause it to be returned in search results? This is my first time ever blocking a file this way, so I'm not sure how Google handles it.

Thanks!

#2 qwerty

    HR 10

  • Moderator
  • 8,294 posts
  • Location:Somerville, MA

Posted 20 May 2004 - 07:53 AM

robots.txt, as I understand it, doesn't stop a search engine from adding a document to its index; it just keeps the engine from spidering it. We're talking about a new site, so what I expect would happen over time is that they'd drop the page from the index if they're never given the opportunity to spider it. For the time being, they know the page exists, but they don't know what's in it.

If you went the other way, using the robots meta tag, the spider would be permitted to go to the document in question, but when it got there it would be explicitly told not to index it. I expect that would keep it from showing up at all.
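To make the meta tag route concrete, here is a small sketch that checks whether a page carries a `noindex` directive in a robots meta tag, using Python's standard `html.parser`. The sample HTML and the names `RobotsMetaParser` and `is_noindex` are made up for illustration.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from any <meta name="robots" content="..."> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # Attribute values can be None when the attribute has no value.
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        for directive in content.split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def is_noindex(html: str) -> bool:
    """Return True if the page asks robots not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


# Hypothetical page markup, for illustration only.
SAMPLE_PAGE = """<html><head>
<title>Experiment</title>
<meta name="robots" content="noindex, follow">
</head><body>Test page</body></html>"""

if __name__ == "__main__":
    print(is_noindex(SAMPLE_PAGE))
```

The catch is that a spider can only see this tag if it is allowed to fetch the page, so the page must not also be blocked in robots.txt.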

#3 Ron Carnell

    HR 6

  • Moderator
  • 959 posts
  • Location:Michigan USA

Posted 20 May 2004 - 03:15 PM

To directly answer the question, no, that's not the way robots.txt works. If the page is excluded, it is completely excluded.

But you also have to give it time (or, more accurately, timing). When you exclude a directory in robots.txt and then create a new page in that directory, you are following a specific sequence. However, there is no guarantee that a spider will follow your same sequence, and indeed, it's highly unlikely. Google doesn't check your robots.txt before grabbing every page, or even before every visit. The only way to truly know something is excluded is to add it to robots.txt and then wait until the spider gets a fresh copy of the file. Only then can you safely create a new page and know it won't be indexed.

I would guess you and Googlebot are simply out of sync with each other. Give it time and the spider should adjust.

#4 bkernst

    HR 5

  • Active Members
  • 385 posts
  • Location:Cape Town, South Africa

Posted 21 May 2004 - 06:04 AM

Checking various log files for many sites every month shows that robots.txt is downloaded a lot, usually right after or before the entrance page.
If I don't want anyone to find a page that links to a document, I sometimes block both that specific page and the folder containing the document from being indexed. Some documents are not meant for members of the general public who are unfamiliar with the topic. Some of my clients are very specialised, and only their own staff and direct clients have any use for certain documents.
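The log check mentioned above can be roughly sketched as follows. This assumes an access log in the common "combined" format; the sample lines, IP addresses, and the function name `robots_fetches` are invented for illustration and are not taken from any real log.

```python
import re
from collections import Counter

# Matches the request, status, size, referrer and user-agent fields of a
# combined-format access log line (an assumed format; adjust for your server).
LINE_RE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def robots_fetches(lines):
    """Count requests for /robots.txt per user-agent string."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match and match.group("path") == "/robots.txt":
            counts[match.group("agent")] += 1
    return counts


# Invented sample lines, for illustration only.
SAMPLE_LOG = [
    '192.0.2.1 - - [20/May/2004:06:00:00 +0000] "GET /robots.txt HTTP/1.0" 200 64 "-" "Googlebot/2.1"',
    '192.0.2.1 - - [20/May/2004:06:00:02 +0000] "GET /index.html HTTP/1.0" 200 5120 "-" "Googlebot/2.1"',
    '192.0.2.7 - - [20/May/2004:07:15:00 +0000] "GET /robots.txt HTTP/1.0" 200 64 "-" "ExampleBot/1.0"',
    '192.0.2.1 - - [21/May/2004:06:00:00 +0000] "GET /robots.txt HTTP/1.0" 200 64 "-" "Googlebot/2.1"',
]

if __name__ == "__main__":
    for agent, count in robots_fetches(SAMPLE_LOG).most_common():
        print(count, agent)
```

Comparing the timestamps of these fetches with the date you edited robots.txt gives a rough idea of whether a given spider has seen the new rules yet.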

Bernhard



