
Possible Robot Text File Problem


8 replies to this topic

#1 donbmjr

donbmjr

    HR 2

  • Active Members
  • PipPip
  • 17 posts

Posted 19 February 2004 - 08:48 PM

Hi

I wanted to keep the bots out of the nascar_races dir except for the two PHP pages below. I have not optimized the rest of the pages, and there are a considerable number of them in there.
Google paid a visit a week ago and indexed approx. 50 of these pages. Not only that, but these 50 pages appear to have replaced 50 pages that had already been indexed, some of which were my top keyword pages.

Where should I go from here? Are the lines below correct, or should I just have a single line, Disallow: /nascar_races/ ?

Google has indexed approx. 350 pages. I thought it would read and index them all, not start replacing pages that had already been indexed.

There are no tricks on the site (hidden text, etc.).


Disallow: /nascar_races/driver-list-2003-finishes.php4
Disallow: /nascar_races/nascar-winston-nextel-cup-results-2003.php4

Thanks

donbmjr

#2 SearchRank

SearchRank

    HR 7

  • Active Members
  • PipPipPipPipPipPipPip
  • 2,333 posts
  • Location:Phoenix, AZ

Posted 19 February 2004 - 09:18 PM

"Disallow: /nascar_races/" should keep bots out of that directory, period. If you list specific pages instead, then I would assume bots would crawl the directory but exclude those two pages.
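That assumption can be checked offline. The following is a minimal sketch using Python's standard urllib.robotparser, assuming a User-agent: * line sits above the two Disallow lines from the original post. The crawler name ExampleBot and the file name some-other-page.php4 are made up for illustration, and the result shows only how Python's parser reads the rules, not how any particular search engine behaves.

```python
from urllib.robotparser import RobotFileParser

# The two Disallow lines from the original post, with an assumed
# "User-agent: *" line above them.
ORIGINAL_RULES = """\
User-agent: *
Disallow: /nascar_races/driver-list-2003-finishes.php4
Disallow: /nascar_races/nascar-winston-nextel-cup-results-2003.php4
"""


def load_rules(text):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ORIGINAL_RULES)

listed = "/nascar_races/driver-list-2003-finishes.php4"
unlisted = "/nascar_races/some-other-page.php4"  # hypothetical file name

listed_ok = rules.can_fetch("ExampleBot", listed)      # page named in a Disallow line
unlisted_ok = rules.can_fetch("ExampleBot", unlisted)  # any other page in the directory

print("listed page may be fetched:  ", listed_ok)
print("unlisted page may be fetched:", unlisted_ok)
```

Under these rules the parser refuses the listed page and allows the unlisted one, which matches the reading that only the two named pages are excluded.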

#3 Randy

Randy

    Convert Me!

  • Moderator
  • 17,540 posts

Posted 19 February 2004 - 11:14 PM

Hey donbmjr,

If I'm reading your question correctly, the two files you mentioned are the only ones in the directory you want to be spidered. Correct?

If so, your robots.txt is doing exactly the opposite of what you want it to. It's Disallowing those two files, but allowing everything else in that directory.

Try this one instead:

User-agent: *
Allow: /nascar_races/driver-list-2003-finishes.php4
Allow: /nascar_races/nascar-winston-nextel-cup-results-2003.php4
Disallow: /nascar_races/

The above will tell all spiders that they're allowed to grab those two specific pages in the nascar_races directory, but everything else is off limits.
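As a rough illustration of how an Allow-aware parser might read that file, here is a sketch using Python's standard urllib.robotparser, which applies the first rule whose path prefix matches. That is an assumption about this one parser only. It says nothing about how Googlebot or any other crawler treats Allow. ExampleBot and other-page.php4 are made-up names.

```python
from urllib.robotparser import RobotFileParser

ALLOW_FIRST = """\
User-agent: *
Allow: /nascar_races/driver-list-2003-finishes.php4
Allow: /nascar_races/nascar-winston-nextel-cup-results-2003.php4
Disallow: /nascar_races/
"""

# Same rules with the Disallow line moved to the top, to show that
# rule order matters for a first-match parser.
DISALLOW_FIRST = """\
User-agent: *
Disallow: /nascar_races/
Allow: /nascar_races/driver-list-2003-finishes.php4
Allow: /nascar_races/nascar-winston-nextel-cup-results-2003.php4
"""


def load_rules(text):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


wanted = "/nascar_races/driver-list-2003-finishes.php4"
other = "/nascar_races/other-page.php4"  # hypothetical file name

allow_first = load_rules(ALLOW_FIRST)
disallow_first = load_rules(DISALLOW_FIRST)

wanted_with_allow_first = allow_first.can_fetch("ExampleBot", wanted)
other_with_allow_first = allow_first.can_fetch("ExampleBot", other)
wanted_with_disallow_first = disallow_first.can_fetch("ExampleBot", wanted)

print("Allow first, wanted page:   ", wanted_with_allow_first)
print("Allow first, other page:    ", other_with_allow_first)
print("Disallow first, wanted page:", wanted_with_disallow_first)
```

With the Allow lines on top, this parser lets the two pages through and blocks the rest of the directory. With Disallow on top, the same parser blocks everything, so the result depends on both the rule order and the parser doing the reading.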

HTH

#4 Hank Cowdog

Hank Cowdog

    HR 4

  • Active Members
  • PipPipPipPip
  • 113 posts
  • Location:Chair, Den, Wylie, Outside Dallas, Texas, USA, North America, Earth, Sol, Milky Way

Posted 20 February 2004 - 03:27 AM

Randy Suggested:

User-agent: *
Allow: /nascar_races/driver-list-2003-finishes.php4
Allow: /nascar_races/nascar-winston-nextel-cup-results-2003.php4
Disallow: /nascar_races/

Unfortunately, the Allow: line mentioned above is not a part of the standard ;).

FYI, this site is the official source for the standard: www.robotstxt.org

and the only recognized lines (which may appear multiple times) are the "User-Agent" and "Disallow" lines.

As far as the example you gave, I assume that there is a User-Agent: * line above the Disallow lines. At least one User-Agent line is required. As configured, Randy is right, you are excluding only the two pages you mentioned, rather than allowing the two pages you mentioned. Note also that the robots.txt file must be placed in the domain root (i.e. http://somesite.com/robots.txt). There is no support for placing a robots.txt file in a subdir, having multiple robots.txt files, etc. There can be only one.

I would recommend that you move (leaving a 301 redirect behind) the two pages to which you want to allow access into a different subdir and implement the Disallow: line that searchrank suggested.

searchrank's suggestion to directly submit the two pages has some merit, but if the robot is abiding by the standard, it will request the robots.txt file first, and still not spider the directly-submitted pages.

Another option is to leave access to all the files open, and then add a robots META tag to each page to be excluded. I don't know which engines support this (SearchEngineWatch says "Most major engines support the meta robots tag"), but here is the format:
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

This would need to be added to every page to be excluded (unless already excluded by the robots.txt file).
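If the tag is added by hand to many pages, it is easy to miss one. Below is a small sketch, using only Python's standard html.parser, that reports which robots directives a page declares. The sample page string is made up for illustration, and the check only confirms the tag is present. It does not tell you whether a given engine honors it.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <META NAME=...>
        # arrives here as tag "meta" with attribute "name".
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() != "robots":
            return
        for token in attributes.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_directives(html_text):
    """Return the set of lowercase robots directives declared in html_text."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    return parser.directives


# Hypothetical page carrying the tag from the post above.
sample_page = (
    "<html><head>"
    '<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">'
    "</head><body>race results</body></html>"
)

found = robots_directives(sample_page)
print(sorted(found))
```

A page without the tag yields an empty set, so running this over every page you meant to exclude would flag the ones that were missed.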

#5 Alan Perkins

Alan Perkins

    Token male admin

  • Admin
  • 1,559 posts
  • Location:UK

Posted 20 February 2004 - 04:02 AM

Excellent reply from Hank Cowdog.

Expanding on searchrank's reply, one further option is open. That option is to exclude each file in the nascar_races directory that you don't want to be indexed, leaving the files that you do.

So instead of

Disallow: /nascar_races/

Use this:

Disallow: /nascar_races/file1.php
Disallow: /nascar_races/file2.php
Disallow: /nascar_races/file3.php

Making sure that you don't have these lines:

Disallow: /nascar_races/driver-list-2003-finishes.php4
Disallow: /nascar_races/nascar-winston-nextel-cup-results-2003.php4
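With a considerable number of files in the directory, typing those lines by hand is error-prone. Here is a minimal sketch that builds them from a list of file names. The names file1.php to file3.php are the placeholders from the post above, and the function names are made up for illustration. Because Disallow values are matched as path prefixes, the sketch also re-checks the result with Python's urllib.robotparser to confirm the two pages to keep are not blocked by accident.

```python
from urllib.robotparser import RobotFileParser

# The two pages that should stay open to spiders.
KEEP = {
    "driver-list-2003-finishes.php4",
    "nascar-winston-nextel-cup-results-2003.php4",
}


def build_robots_txt(directory, filenames, keep):
    """Return robots.txt text that disallows every file in the directory except those in keep."""
    lines = ["User-agent: *"]
    for name in sorted(filenames):
        if name not in keep:
            lines.append(f"Disallow: /{directory}/{name}")
    return "\n".join(lines) + "\n"


def blocked_keepers(robots_text, directory, keep):
    """Return the kept pages that the generated rules would still block."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return sorted(
        name for name in keep
        if not parser.can_fetch("ExampleBot", f"/{directory}/{name}")
    )


# Placeholder listing; in practice this would come from the real directory.
all_files = ["file1.php", "file2.php", "file3.php", *KEEP]

robots_text = build_robots_txt("nascar_races", all_files, KEEP)
problems = blocked_keepers(robots_text, "nascar_races", KEEP)

print(robots_text)
print("kept pages accidentally blocked:", problems)
```

The drawback of this approach is maintenance: every new file in the directory needs its own line, so the list has to be regenerated whenever the directory changes.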

#6 Randy

Randy

    Convert Me!

  • Moderator
  • 17,540 posts

Posted 20 February 2004 - 08:18 AM

Agreed that Allow is not part of the standard, Hank, and it's been a while since I tested it, but Allow used to work anyway, with Googlebot at least. I'll have to set up a test for that one again sometime to see if it's still viable or not.

#7 Grumpus

Grumpus

    HR 6

  • Active Members
  • PipPipPipPipPipPip
  • 786 posts

Posted 20 February 2004 - 08:34 AM

I thought we were all very excited that, soon, Google wouldn't be the only game in town? ;)

Easiest solution here is to take those two files out of the directory you want everything else banned from and put them somewhere else.

Then disallow that directory with a single line and go have a beer.

G.

#8 donbmjr

donbmjr

    HR 2

  • Active Members
  • PipPip
  • 17 posts

Posted 20 February 2004 - 09:09 AM

Thanks to all for your suggestions.

I believe for the time being I will go with the single line, as Grumpus suggested, except for the beer.

Disallow: /nascar_races/
I do have a User-agent: * at the top.

Just to be sure, am I correct in assuming that even if there is a link to a file in this directory, it will not get indexed, using just the single line above?

One other comment that has been made a few times before:

One of my best-ranked pages, a well-written, no-funny-stuff page that had been running at #3 for a few months, dropped to #103 after the last Google run. Most of the pages ahead of it are terrible.

Thanks again

donbmjr

#9 Alan Perkins

Alan Perkins

    Token male admin

  • Admin
  • 1,559 posts
  • Location:UK

Posted 20 February 2004 - 12:10 PM

donbmjr asked:

"Just to be sure, am I correct in assuming that even if there is a link to a file in this directory it will not get indexed, using just the single line above."

Short answer: yes.

More complete answer: Once a URL is protected by robots.txt then it shouldn't be read. The URL may be indexed now and again, but the content won't be.
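The "shouldn't be read" part can be sketched with Python's standard urllib.robotparser. This is only an illustration of a compliant fetch check: the host somesite.com is the placeholder used earlier in the thread, and ExampleBot, any-linked-page.php4 and index.html are made-up names. It shows that a link pointing into the directory makes no difference to whether the page may be fetched. It cannot show whether an engine lists the bare URL.

```python
from urllib.robotparser import RobotFileParser

# The single-line rule set donbmjr settled on.
RULES = """\
User-agent: *
Disallow: /nascar_races/
"""


def load_rules(text):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(RULES)

# A page inside the directory that some other page links to (hypothetical name).
linked_page = "http://somesite.com/nascar_races/any-linked-page.php4"
# A page outside the directory (hypothetical name).
outside_page = "http://somesite.com/index.html"

linked_ok = rules.can_fetch("ExampleBot", linked_page)
outside_ok = rules.can_fetch("ExampleBot", outside_page)

print("page inside nascar_races may be fetched: ", linked_ok)
print("page outside nascar_races may be fetched:", outside_ok)
```

The check depends only on the URL path, so everything under /nascar_races/ is refused no matter how the crawler discovered it, while the rest of the site stays open.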



