
Robot.txt


28 replies to this topic

#16 Scottie

Scottie

    Psycho Mom

  • Admin
  • 6,294 posts
  • Location:Columbia, SC

Posted 22 September 2003 - 07:27 PM

Welcome to the forum, Guidaro! :embarrassed:

Corey, I think I've welcomed you before, but just in case I haven't- :applause: to you too.

#17 Corey Bryant

Corey Bryant

    HR 2

  • Members
  • PipPip
  • 18 posts
  • Location:Castle Pines North, CO

Posted 23 September 2003 - 01:13 PM

Thanks scottiecl - but please don't make me remember :unsure: But I think you have. LOL

#18 don1

don1

    HR 4

  • Active Members
  • PipPipPipPip
  • 173 posts
  • Location:Marlborough, MA

Posted 10 November 2003 - 01:34 PM

I am having a different situation with the robots.txt.

In October I had 394 entries via the robots.txt and 212 exited immediately (without looking at/crawling other pages). That means only about 46% went on into the site. I have been having a difficult time getting these buggers to crawl new pages. Googlebot shows up every few days but does not go anywhere. Why? Are they looking for changes in the file before crawling anywhere else? Have I told them to stop? I have tested the site with our search spider and they all seem to work as planned. I'm stumped.

#19 Think Web

Think Web

    Our Anti-Moron™ software comes in many flavors!

  • Active Members
  • PipPipPipPip
  • 162 posts
  • Location:Canton, Ohio

Posted 10 November 2003 - 04:48 PM

Sometimes web crawlers will look for the "Last-Modified" HTTP response header. If there hasn't been a change to the page since the last spider session, it may move on to another site.
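To illustrate the Last-Modified check described above, here is a minimal Python sketch of how a crawler might decide whether a page needs fetching again. The function name `changed_since` and the sample dates are made up for illustration. Real crawlers more typically send an If-Modified-Since request header and let the server answer 304 Not Modified, and their exact revisit logic is not public, so treat this as a sketch of the idea only.

```python
from email.utils import parsedate_to_datetime

def changed_since(last_modified_header, last_crawl_header):
    """Return True if the page was modified after the previous crawl.

    Both arguments are HTTP-date strings, such as the value of a
    Last-Modified response header. A missing Last-Modified header is
    treated as "changed" so the page gets fetched again.
    """
    if not last_modified_header:
        return True
    modified = parsedate_to_datetime(last_modified_header)
    last_crawl = parsedate_to_datetime(last_crawl_header)
    return modified > last_crawl

# Hypothetical dates: one page untouched since the last visit, one updated.
unchanged = changed_since("Mon, 03 Nov 2003 10:00:00 GMT", "Mon, 10 Nov 2003 13:34:00 GMT")
updated = changed_since("Tue, 11 Nov 2003 09:00:00 GMT", "Mon, 10 Nov 2003 13:34:00 GMT")
print(unchanged, updated)
```

A spider following this logic would skip the first page and re-fetch the second.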

#20 don1

don1

    HR 4

  • Active Members
  • PipPipPipPip
  • 173 posts
  • Location:Marlborough, MA

Posted 10 November 2003 - 04:54 PM

That makes sense if they were entering the index page or other html page, but these critters are hitting the robots.txt first and turning tail. I'm going to take a shot at removing the file and see if they drill any further. I know this will create an error but I'm desperate.

#21 dragonlady7

dragonlady7

    HR 6

  • Active Members
  • PipPipPipPipPipPip
  • 618 posts
  • Location:Buffalo, NY

Posted 10 November 2003 - 04:57 PM

Getting back to the robots.txt and why to bother question-- Robots.txt is a good starting place. It makes sure you're square with everyone who's legit. And it also means you can complain if someone displays pages you didn't want spidered-- it means you're aware of the issue, at least, and it means you've got a leg to stand on if you need to complain about a noncompliant bot.

It's also very handy if you want to get further into things-- start watching your logs for bots who never request it, and there you have your rogues that you should think about banning. :yingyang: You'll have to use more advanced techniques to keep the others off your site, but at least you've got a starting place.
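As a rough sketch of the log-watching idea above, the following Python groups requests by user agent and lists the agents that never asked for /robots.txt. It assumes the server writes Apache-style Combined Log Format; the sample lines and the bot names PoliteBot and RudeBot are invented for the example. Ordinary browsers never request robots.txt either, so the result is only a list of candidates to look at, not a ban list.

```python
import re
from collections import defaultdict

# Assumes Combined Log Format:
# ip ident user [time] "METHOD path protocol" status bytes "referer" "agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?:GET|HEAD|POST) (?P<path>\S+)[^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

def agents_skipping_robots_txt(lines):
    """Return the sorted user agents that never requested /robots.txt."""
    paths_by_agent = defaultdict(set)
    for line in lines:
        match = LOG_LINE.match(line)
        if match:
            paths_by_agent[match.group("agent")].add(match.group("path"))
    return sorted(
        agent for agent, paths in paths_by_agent.items()
        if "/robots.txt" not in paths
    )

# Invented sample data for illustration.
sample_log = [
    '192.0.2.1 - - [10/Nov/2003:13:34:00 -0500] "GET /robots.txt HTTP/1.0" 200 68 "-" "PoliteBot/1.0"',
    '192.0.2.1 - - [10/Nov/2003:13:34:05 -0500] "GET /index.html HTTP/1.0" 200 5120 "-" "PoliteBot/1.0"',
    '192.0.2.9 - - [10/Nov/2003:14:02:11 -0500] "GET /private/list.html HTTP/1.0" 200 2048 "-" "RudeBot/0.1"',
]
suspects = agents_skipping_robots_txt(sample_log)
print(suspects)
```

Grouping by user agent rather than IP address is a deliberate choice here, since robots can change IP address between requests.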

#22 bwelford

bwelford

    HR 5

  • Active Members
  • PipPipPipPipPip
  • 484 posts
  • Location:Langley, British Columbia, Canada

Posted 11 November 2003 - 06:35 AM

... but these critters are hitting the robots.txt first and turning tail.

There's nothing to worry about here. It seems to be the normal behaviour. Less often, I see the opposite behaviour, where a robot keeps going after getting the robots.txt file. Usually another robot from the same family will come and look for other files. That's much more typical in my log files.

#23 Corey Bryant

Corey Bryant

    HR 2

  • Members
  • PipPip
  • 18 posts
  • Location:Castle Pines North, CO

Posted 12 November 2003 - 12:07 PM

Sorry to dredge this up, but I have a question re: the robots.txt.

If the first line is:
User-agent: *
Disallow:

But then goes on to say:
User-agent: SpankBot
Disallow: /

Would SpankBot be banned or would it review the first one?

Thanks!!!

#24 dragonlady7

dragonlady7

    HR 6

  • Active Members
  • PipPipPipPipPipPip
  • 618 posts
  • Location:Buffalo, NY

Posted 12 November 2003 - 12:30 PM

I think that you'd simply omit the first one.
Just put in the
"User-agent: SpankBot
Disallow: /"
line, and don't bother with the other.

Other bots will read it, say "I am not SpankBot", and go on their merry way spidering your site.

That's how it's supposed to work, anyway. (Yes, I looked it up!!!)

#25 Alan Perkins

Alan Perkins

    Token male admin

  • Admin
  • 1,559 posts
  • Location:UK

Posted 12 November 2003 - 12:42 PM

Would SpankBot be banned or would it review the first one?

Definitive Answer: It would be banned. "*" means "any robot not mentioned elsewhere".
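One way to check this rule without waiting for a real spider is Python's standard `urllib.robotparser` module. The sketch below feeds it the file from the question and the shorter SpankBot-only variant suggested earlier in the thread. The helper `can_crawl`, the test path /index.html and the name OtherBot are made up for the example, and the result only shows how this one parser reads the file, not how any particular robot behaves.

```python
from urllib.robotparser import RobotFileParser

def can_crawl(robots_txt, user_agent, path="/index.html"):
    """Parse robots.txt text and ask whether user_agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)

# The file from the question: a catch-all record, then a SpankBot record.
both_records = """\
User-agent: *
Disallow:

User-agent: SpankBot
Disallow: /
"""

# The shorter variant: only the SpankBot record.
spankbot_only = """\
User-agent: SpankBot
Disallow: /
"""

for name, text in [("both records", both_records), ("SpankBot only", spankbot_only)]:
    print(name, can_crawl(text, "SpankBot"), can_crawl(text, "OtherBot"))
```

With either file, this parser reports SpankBot as banned and OtherBot as allowed.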

In October I had 394 entries via the robots.txt and 212 exited immediately (without looking at/crawling other pages).

Hits on robots.txt are usually from robots. Web stats for "entry", "exit" and "path" don't work too well for robots, since robots can change IP address between requests and they don't accept cookies. Just ignore it.

#26 meta

meta

    HR 5

  • Active Members
  • PipPipPipPipPip
  • 301 posts
  • Location:Chicago

Posted 12 November 2003 - 04:11 PM

Many threads have included mentions of "bad" bots that nose through things that the webmaster does not want indexed. What exactly are they doing with the stuff? Why do they want it?

#27 dragonlady7

dragonlady7

    HR 6

  • Active Members
  • PipPipPipPipPipPip
  • 618 posts
  • Location:Buffalo, NY

Posted 12 November 2003 - 04:34 PM

Often they're spammers, simply looking for more email addresses to harvest. Sometimes they're just the kind of bad bots that the exclusion protocol was written for-- not malicious, but overeager. Sometimes people hide good things in those directories, as well, like cgi-bins and the like. Sometimes a bad bot will go straight for the robots.txt to find out what it's not supposed to take, so it can take it!
So, there are a number of reasons.
Most people need not worry about them, but advanced users have plenty to keep them on their toes!

#28 Corey Bryant

Corey Bryant

    HR 2

  • Members
  • PipPip
  • 18 posts
  • Location:Castle Pines North, CO

Posted 13 November 2003 - 11:01 AM

Great - thanks for the information!! Sorry I did not get back to you sooner. I just was not e-mailed anything about all these wonderful responses.

#29 dzinerbear

dzinerbear

    HR 3

  • Active Members
  • PipPipPip
  • 86 posts
  • Location:Toronto Canada

Posted 02 December 2003 - 01:47 PM

Hi all,

I have a question about excluding certain spiders from my site. I stumbled across a site that will generate a robots.txt file to include what it feels are a whole bunch of "nasty" bots.

The URL is here for anyone who wants to take a look at it:
http://www.1-hit.com...t-generator.htm

As I scan the list I recognize many of the names, so I'm pretty confident that they are bots I'm not interested in having index my site. But in reading this thread, it seems that the spiders may or may not abide by my robots.txt file, so is there any point in excluding these 20 or so spiders?

Thanks
Dzinerbear
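For reference, what such a generator produces is simple enough to sketch in a few lines of Python. The function `build_robots_txt` and the name ExampleHarvester are hypothetical, and this is not the code behind the linked site; SpankBot is just the example name used earlier in the thread. As discussed above, the resulting file is only a request, so it keeps out the bots that choose to obey it.

```python
def build_robots_txt(blocked_agents):
    """Build robots.txt text that bans each named agent from the whole site."""
    records = [f"User-agent: {agent}\nDisallow: /" for agent in blocked_agents]
    # Final catch-all record: an empty Disallow leaves everything open
    # to every robot not named above.
    records.append("User-agent: *\nDisallow:")
    return "\n\n".join(records) + "\n"

# Hypothetical list of unwanted bots.
robots_txt = build_robots_txt(["SpankBot", "ExampleHarvester"])
print(robots_txt)
```

Each banned bot gets its own record, separated by a blank line, with the catch-all record last.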



