Posted 02 January 2008 - 05:05 PM
I'm not sure why you're using
Disallow: /*.htm
Personally I wouldn't, even if I didn't currently have any .htm pages on the site. Partly that's because I wouldn't want to rule out ever having a .htm page that I wanted spidered, and partly because robots.txt can get really screwy sometimes: every spider out there could be using its own non-standard implementation of the standard.
For instance, in the robots.txt standard I believe there is supposed to be an implied wildcard at the end of each line. (Alan, correct me if I'm wrong, you're the expert on the robots exclusion standards!) With this wildcard implementation, if you had a line that said:
[b]Disallow: /lederhosen[/b]
the spiders should technically exclude anything in your /lederhosen/ subdirectory and any file at the root level whose name starts with the text string lederhosen. For example, files named lederhosen.htm, lederhosen.html, lederhosen.c, lederhosen_red, etc.
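To make the implied trailing wildcard concrete, here is a minimal sketch of prefix matching, assuming a spider simply checks whether the URL path starts with the Disallow value. The function name prefix_blocks is made up for illustration; real spiders may well do something different.

```python
def prefix_blocks(path, rule):
    """Return True if `path` is excluded by a Disallow `rule`.

    Assumption: the spider treats the rule as a plain prefix, which is
    the same as having an implied wildcard at the end of the line.
    """
    return path.startswith(rule)


rule = "/lederhosen"
for path in ["/lederhosen/index.htm", "/lederhosen.htm",
             "/lederhosen.html", "/lederhosen_red", "/shorts.htm"]:
    print(path, "blocked" if prefix_blocks(path, rule) else "allowed")
```

Under that assumption everything except /shorts.htm comes back blocked, including the root-level files that merely start with "lederhosen".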
Given this, I would be worried about some engine actually sticking to the wildcard standard and making
Disallow: /*.htm
also disallow .html files.
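Here is a rough sketch of how that could happen, assuming an engine supports * inside a rule and still applies the implied wildcard at the end. The helper names rule_to_regex and wildcard_blocks are hypothetical, and the $ end-anchor handled below is an extension that some major crawlers document rather than part of the original standard, so treat this as an illustration and not as any particular engine's behaviour.

```python
import re


def rule_to_regex(rule):
    """Translate a Disallow value into a compiled regex (illustrative only).

    Assumptions: '*' matches any run of characters, the rule is matched
    from the start of the path, and the end is left open (implied trailing
    wildcard) unless the rule ends with '$'.
    """
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces and join them with '.*' for each '*'.
    pattern = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile("^" + pattern + ("$" if anchored else ""))


def wildcard_blocks(path, rule):
    """Return True if `path` is excluded by `rule` under the assumptions above."""
    return rule_to_regex(rule).match(path) is not None


for path in ["/page.htm", "/page.html", "/page.php"]:
    print(path, "blocked" if wildcard_blocks(path, "/*.htm") else "allowed")
```

With these assumptions /page.html is blocked along with /page.htm, which is exactly the worry. On an engine that honours the $ anchor, a rule of /*.htm$ would block /page.htm only.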
Gonna move this to the Robots section of the forum, where those more knowledgeable about the standards will see it.