Posted 25 October 2003 - 09:29 PM
It's a good idea to have a blank robots.txt even if you don't know how to configure one properly. I heard rumors a while back, when Google was going through some changes, that the spider wasn't crawling sites that didn't have one. I did notice it turning away from my site after requesting robots.txt repeatedly and not getting it, so I put one up and it crawled. But if Googlebot really didn't crawl sites without a robots.txt, there'd be a lot fewer sites in the index-- and a lot worse results!
So, it probably doesn't hurt but most likely doesn't make a difference.
I haven't used robots.txt to ban anything, but it is a *very* good idea to exclude all robots from your testing directory and the like.
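For example, a robots.txt that tells all well-behaved bots to stay out of a testing directory (the directory name here is just a placeholder-- use whatever yours is called) would look like this:

User-agent: *
Disallow: /testing/

A completely blank file, or one with an empty Disallow line, just means "everyone is welcome everywhere."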
A related issue is robots meta tags-- you can put a tag in the head of a document if you want it left out of a search engine's results. I've done this for one article I wrote that I've published elsewhere-- I offer it on my site so visitors can read it, but I don't want it to show up in Google and elsewhere, because it might cause the other place it's located to be filtered out of the results as duplicate content.
So, for that one, I've put in this tag:
<meta name="robots" content="noindex, follow">
meaning that Googlebot (or Slurp or Scooter or whoever) should not index the page, but should follow any links on the page.
You can also say noindex, nofollow, meaning that the bots should neither index the page nor follow the links. And you can say index, nofollow to tell the robots to index the page but not follow the links. Of course, you can say index, follow as well, but that's the default value for the tag, so it's the same as simply not including the tag at all.
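So, written out the same way as the tag above, the four combinations are:

<meta name="robots" content="index, follow"> (the default-- same as no tag)
<meta name="robots" content="index, nofollow">
<meta name="robots" content="noindex, follow">
<meta name="robots" content="noindex, nofollow">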
Some people say it's ineffective and robots ignore it, but I mostly just care about the one bot, so I use it happily.
In order to ban email harvesters and the like, you may need to get heavier. If you have any really dorky friends who know about this stuff, or some free time, you can research these methods. They're server-level measures that actually make it impossible for the offending bot to access your website. .htaccess is one method you can use. I don't know how to use it, though-- I just give the name so you can look it up if you choose!
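Just to give a taste of what this looks like, here's a rough sketch of an Apache .htaccess block-- the "EmailSiphon" user-agent string is only an example, and you'd want to check the Apache docs before trusting anything like this on a live server:

SetEnvIfNoCase User-Agent "EmailSiphon" bad_bot
Order Allow,Deny
Allow from all
Deny from env=bad_bot

The idea is that any request whose User-Agent header matches the pattern gets refused outright, before the bot ever sees a page.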
A lot of people design spider traps, too-- they're designed to catch any bot that disobeys robots.txt and tangle up (or even crash) the program that runs it. Lots of email harvesters will come onto a site looking for files with the names of common email scripts, so if you give one of your bot-trapping scripts such a name, you can sometimes lure a malicious bot into a recursive loop of generated links that ties it up and makes it unable to continue its crawl. It's fun to read about these traps...
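Just to illustrate the recursive-loop idea, here's a minimal sketch in Python-- the /trap/ path is made up, and a real setup would also list that directory in robots.txt so obedient bots never wander in:

```python
import random
import string

def trap_page(path):
    """Build an HTML page for any requested trap URL.

    Every page links to five more randomly named pages under the
    same fake directory, so a bot that follows the links just
    keeps crawling in circles and never reaches real content.
    """
    links = []
    for _ in range(5):
        name = "".join(random.choice(string.ascii_lowercase) for _ in range(8))
        links.append('<a href="/trap/%s.html">%s</a>' % (name, name))
    return "<html><body>%s</body></html>" % "\n".join(links)
```

A real trap would sit behind a CGI script or URL rewrite so every /trap/ request hits this function, and some people add a delay per page to slow the bot down even further.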
Anyhow, happy bot-hunting! And welcome out of lurking!