
Robots.txt Exclusion Syntax


3 replies to this topic

#1 assafi


    HR 3

  • Active Members
  • 68 posts

Posted 06 June 2004 - 04:00 AM

Hi,

I've been using robots.txt for a few years now.
My robots.txt file contains commands such as these:

Exclude: \info\center\data
Exclude: \cgi-bin\archived
Exclude: \cgi-bin\

Now I've noticed that whoever configured it ages ago seems to have used the wrong syntax. I can think of only two reasons why this didn't keep my pages from being indexed: (A) the case-sensitivity issue (the file is named Robots.txt, which I now know is completely ignored), or (B) the wrong syntax caused the bots to ignore the file altogether.

Either way I'm now optimizing this robots.txt file.

My current dilemma is syntax related and is as follows:
I ran into this document at robotstxt.org:
"Note that you need a separate "Disallow" line for every URL prefix you want to exclude -- you cannot say "Disallow: /cgi-bin/ /tmp/". Also, you may not have blank lines in a record, as they are used to delimit multiple records. "

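A rough way to check a file against the rules quoted above is a small linter. This is only a sketch in Python: the function name `lint_robots`, the list of accepted fields (just the two from the original standard) and the messages are all assumptions made for illustration, not part of any official tool.

```python
# Hypothetical linter; field list and messages are illustrative assumptions.
KNOWN_FIELDS = {"user-agent", "disallow"}


def lint_robots(text):
    """Return (line_number, message) pairs for lines that look wrong."""
    problems = []
    in_record = False  # True once a User-agent line has opened a record
    for number, raw in enumerate(text.splitlines(), start=1):
        if not raw.strip():
            in_record = False  # a blank line ends the current record
            continue
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue  # comment-only line, not a record boundary
        field, sep, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if not sep:
            problems.append((number, "missing ':' separator"))
        elif field not in KNOWN_FIELDS:
            problems.append((number, f"unknown field '{field}'"))
        elif field == "user-agent":
            in_record = True
        elif not in_record:
            problems.append((number, "Disallow outside a record"))
        elif "\\" in value:
            problems.append((number, "backslash in path; use '/'"))
        elif len(value.split()) > 1:
            problems.append((number, "more than one path on a Disallow line"))
    return problems


sample = "\n".join([
    "User-agent: *",
    "Exclude: \\cgi-bin\\",       # line taken from the old file
    "Disallow: /cgi-bin/ /tmp/",  # two prefixes on one line
    "",
    "Disallow: /info/",           # cut off from its record by the blank line
])
for number, message in lint_robots(sample):
    print(number, message)
```

The sample flags the `Exclude:` line, the two-prefix `Disallow` line, and the `Disallow` that follows a blank line.
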
Now to get to the specifics, here's my case-study directory structure:
/info/
/info/archived/
/info/news/

Now I want to disallow (exclude bots from) the archived subdirectory and the info directory itself, but still allow /info/news/ (a first-level subdirectory).

Can I use the following syntax:
User-agent: *
Disallow: /info/archive/ --> Is this a correct syntax according to the quote I've included above?
Disallow: /info/ ---> Will this cause the exclusion of all subdirectories including News for example, or only files located on /info/?

Any input on this?

#2 chrishirst


    A not so moderate moderator.

  • Moderator
  • 5,881 posts
  • Location:Blackpool UK

Posted 06 June 2004 - 04:18 AM

Can I use the following syntax:
User-agent: *
Disallow: /info/archive/  --> Is this a correct syntax according to the quote I've included above?
Disallow: /info/  ---> Will this cause the exclusion of all subdirectories including News for example, or only files located on /info/? 


That is the correct syntax, and the effect will be:

User-agent: *
Disallow: /info/archive/ # excludes the archive folder, its files and subfolders
Disallow: /info/ # excludes all files and subfolders located in the /info folder


I'm not sure about the filename being case-sensitive; it may well be on *nix servers. Either way, the wrong syntax would result in it getting ignored.
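To see the prefix behaviour for yourself, here is a minimal sketch using Python's standard `urllib.robotparser`. The bot name `ExampleBot`, the host `www.example.com` and the file names are made up for the example; a real spider may differ in details, so treat this as an illustration only.

```python
from urllib.robotparser import RobotFileParser

# The two rules from the question, without the inline remarks.
rules = """\
User-agent: *
Disallow: /info/archive/
Disallow: /info/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())


def allowed(path):
    """True if the hypothetical bot may fetch this path."""
    return parser.can_fetch("ExampleBot", "http://www.example.com" + path)


# Hypothetical paths: three under /info/, two outside it.
for path in ["/info/archive/old.html", "/info/news/today.html",
             "/info/index.html", "/information.html", "/products/"]:
    print(path, allowed(path))
```

Everything whose path starts with /info/ is blocked, including /info/news/, so the first Disallow line is redundant once the second is present. A path such as /information.html is not affected, because it does not start with the full prefix /info/.
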

#3 Randy


    Convert Me!

  • Moderator
  • 17,540 posts

Posted 06 June 2004 - 07:29 AM

That's a nasty problem on the /info/news/ portion of your question.

Google does support an "Allow:" directive, but it's not part of the standard. So who knows what the other search engines would think of that one.

So for Google the following would work:
User-Agent: *
Disallow: /info/ #Block info directory and all sub-directories
Allow: /info/news/ #Specifically allow anything in the /info/news directory
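How an Allow line interacts with a broader Disallow depends on the parser. The sketch below models one possible policy, "the longest matching prefix wins, and Allow wins a tie", which is how Google documents its own handling. The function `is_allowed` and the sample paths are hypothetical, and it ignores wildcards and User-agent groups entirely.

```python
def is_allowed(path, rules):
    """Apply longest-prefix precedence to (directive, prefix) pairs.

    This is an assumed model of one crawler's policy, not a full parser.
    """
    best_len = -1
    best_allow = True  # no matching rule means the path may be crawled
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        allow = directive.lower() == "allow"
        # A longer prefix is more specific; on equal length, Allow wins.
        if len(prefix) > best_len or (len(prefix) == best_len and allow):
            best_len, best_allow = len(prefix), allow
    return best_allow


rules = [("Disallow", "/info/"), ("Allow", "/info/news/")]

for path in ["/info/news/today.html", "/info/archived/old.html", "/index.html"]:
    print(path, is_allowed(path, rules))
```

Under this policy /info/news/ stays crawlable while the rest of /info/ is blocked. A parser that instead applies the first matching rule in file order would hit `Disallow: /info/` first and block /info/news/ as well, which is one reason the result can vary between spiders.
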

Assuming you want to satisfy all of the search engine spiders, I see only two solutions, neither of which is very good. One is to move the files in /info/news/ to a location that you're not excluding, i.e. above the /info/ directory. The other is to use a META exclusion on all of the pages in your /info/ directory and sub-directories, making sure there is a non-excluded page that lists the /info/news/ pages you want to be indexed.

The only other way I can think of to do this via robots.txt would be to disallow each sub-directory under /info/, skipping the /news/ directory in that, and then also disallowing each file that resides inside the /info/ directory that you want to be excluded. Lots of work if the site is very large, but that might actually be the best way to go in the long run.
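If the list gets long, the lines could be generated rather than typed. A minimal sketch, assuming you already have the names of the sub-directories and files directly under /info/ (the helper `build_disallow_lines` and the sample names below are invented for illustration):

```python
def build_disallow_lines(base, subdirs, files, keep):
    """Build a record that disallows everything under base except `keep`.

    base    -- URL prefix ending in '/', e.g. "/info/"
    subdirs -- names of sub-directories directly under base
    files   -- names of files directly under base to exclude
    keep    -- sub-directory names that must stay crawlable
    """
    lines = ["User-agent: *"]
    for name in sorted(subdirs):
        if name not in keep:
            lines.append(f"Disallow: {base}{name}/")
    for name in sorted(files):
        lines.append(f"Disallow: {base}{name}")
    return lines


# Hypothetical listing of /info/; only "news" should remain open.
record = build_disallow_lines(
    "/info/",
    subdirs=["archived", "news"],
    files=["index.html"],
    keep={"news"},
)
print("\n".join(record))
```

Because no line mentions /info/ itself or /info/news/, anything under /info/news/ stays crawlable without relying on Allow. The record has to be regenerated whenever a new file or sub-directory is added under /info/.
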

Oh, and yes the capitalization of your robots.txt will make a difference. Spiders all look for a file named "robots.txt". In a *nix environment they would get a 404 Not Found error if you had it named as "Robots.txt". Thus they would think you simply didn't have any exclusions and would roam freely.

#4 assafi


    HR 3

  • Active Members
  • 68 posts

Posted 07 June 2004 - 07:00 AM

I've actually noticed the ..\News\ folder problem.

At first I thought to move the folder to a higher tree level, or to another relevant sub-folder. This was before I learned about the other possibilities Randy has just brought up.

The thing is that moving the directory requires updating about a million links (JK :lol:), so I might actually go for the second choice Randy has offered: "The only other way I can think of to do this via robots.txt would be to disallow each sub-directory under /info/, skipping the /news/ directory in that, and then also disallowing each file that resides inside the /info/ directory that you want to be excluded. Lots of work if the site is very large, but that might actually be the best way to go in the long run."

A quick word on meta tag exclusions: I don't think this approach is practical for sites with a lot of pages. It is way too clumsy and hard to maintain, in my view.



