I just built a page on my site that I'm doing some experimenting with. I want to block Google from it so to make things easiest I just decided to create a directory for it. I put the page in a directory on the root and blocked "Googlebot" from that directory in my robots.txt file.
I did an allinurl: search this morning and the page is indeed in Google's index, although it shows only the link to the URL, with no title or description (unlike all my other pages, which are not blocked in the robots.txt file).
Is this what is supposed to happen? Are they essentially just recognizing that the page exists b/c it is linked from other pages but then not going to index the content or attributes of the page that would cause it to be provided as search results??? This is my first time ever blocking a file this way so I'm not sure how Google handles it.
Thanks!
Robots.txt Not Working?
Started by ephricon, May 20 2004 06:29 AM
3 replies to this topic
#1
Posted 20 May 2004 - 06:29 AM
#2
Posted 20 May 2004 - 07:53 AM
robots.txt, as I understand it, doesn't stop an SE from adding a document to its index; it just keeps it from spidering it. We're talking about a new site, so what I expect would happen over time is that they'd drop the page from the index if they're never given the opportunity to spider it, but for the time being, they know about it, but they don't know what's in it.
If you went the other way, using the robots meta tag, the spider would be permitted to go to the document in question, but when it got there it would be explicitly told not to index it. I expect that would keep it from showing up at all.
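To make the difference concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The directory name `/experiment/` and the `example.com` URLs are assumptions for illustration, not details from the original post. It only shows which URLs a robots.txt rule stops Googlebot from fetching; it says nothing about whether a search engine lists a URL it already knows about.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the directory name /experiment/ is assumed.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /experiment/
"""

# The alternative from the post above: a robots meta tag in the page's <head>.
# The spider has to be allowed to fetch the page in order to see this tag.
NOINDEX_META = '<meta name="robots" content="noindex">'


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    blocked = "http://www.example.com/experiment/page.html"
    allowed = "http://www.example.com/index.html"
    print(can_crawl(ROBOTS_TXT, "Googlebot", blocked))
    print(can_crawl(ROBOTS_TXT, "Googlebot", allowed))
```

Note that the two approaches don't combine well: if the page is disallowed in robots.txt, the spider never fetches it and so never reads the meta tag.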
#3
Posted 20 May 2004 - 03:15 PM
To directly answer the question, no, that's not the way robots.txt works. If the page is excluded, it is completely excluded.
But you also have to give it time (or, more accurately, timing). When you exclude a directory in robots.txt and then create a new page in that directory, you are following a specific sequence. However, there is no guarantee that a spider will follow your same sequence, and indeed, it's highly unlikely. Google doesn't check your robots.txt before grabbing every page, or even before every visit. The only way to truly know something is excluded is to add it to robots.txt and then wait until the spider gets a fresh copy of the file. Only then can you safely create a new page and know it won't be indexed.
I would guess you and Googlebot are simply out of sync with each other. Give it time and the spider should adjust.
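The out-of-sync scenario can be sketched as a toy model. Everything here is hypothetical: the `CachedRobots` class, the one-day refresh interval, and the directory names are assumptions for illustration, not a description of how Googlebot actually schedules its robots.txt fetches.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents before and after the site owner's change.
OLD_ROBOTS = "User-agent: Googlebot\nDisallow: /private/\n"
NEW_ROBOTS = "User-agent: Googlebot\nDisallow: /private/\nDisallow: /experiment/\n"


class CachedRobots:
    """Toy crawler-side cache that re-reads robots.txt only when it is stale."""

    def __init__(self, max_age_seconds: float):
        self.max_age = max_age_seconds  # assumed refresh interval
        self.fetched_at = None
        self.parser = None

    def allowed(self, live_robots_txt: str, url: str, now: float,
                agent: str = "Googlebot") -> bool:
        # Re-parse the live file only if there is no copy yet or it is too old.
        if self.fetched_at is None or now - self.fetched_at >= self.max_age:
            self.parser = RobotFileParser()
            self.parser.parse(live_robots_txt.splitlines())
            self.fetched_at = now
        return self.parser.can_fetch(agent, url)


if __name__ == "__main__":
    url = "http://www.example.com/experiment/page.html"
    cache = CachedRobots(max_age_seconds=86400)
    print(cache.allowed(OLD_ROBOTS, url, now=0))       # spider caches the old file
    print(cache.allowed(NEW_ROBOTS, url, now=3600))    # rule added, cache still old
    print(cache.allowed(NEW_ROBOTS, url, now=90000))   # cache refreshed
```

In this model the new page is fetchable for as long as the spider is working from its old copy, which is why the order of "update robots.txt, wait for a fresh fetch, then publish" matters.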
#4
Posted 21 May 2004 - 06:04 AM
Checking various log files for many sites every month shows that robots.txt is downloaded a lot, usually right after or before the entrance page.
What I sometimes do, if I don't want somebody finding a page that links to a document, is disallow indexing of both that specific page and the folder the document is in. Some documents are not meant for members of the general public who don't know what the page topic is about. Some of my clients are very specialised, and only their own staff and direct clients have any use for some documents.
Bernhard