Spiders - Looking For Robot.txt


13 replies to this topic

#1 stillgreen

stillgreen

    HR 1

  • Members
  • Pip
  • 5 posts
  • Location:Midwest USA

Posted 24 October 2003 - 11:51 PM

I found the answer to one of my questions here. It seems some days that I have more spiders than visitors. Glad that's "normal".

Some of the bots are trying to access a file I don't think I have: "robots.txt".

Should I have this file?

What's it for? Do I need it?

How do I go about getting one?

Is my new blog causing this?

Yep. I'm still a newbie as you can see. ;)

Wonder Wyant

Edited by scottiecl, 25 October 2003 - 12:16 AM.


#2 Scottie

Scottie

    Psycho Mom

  • Admin
  • 6,294 posts
  • Location:Columbia, SC

Posted 25 October 2003 - 12:20 AM

Hi Stillgreen! ;)

Welcome to the forum! (You can go to My Controls and put your URL in your signature file. :) )

Open Notepad, save a blank document as robots.txt, then upload it to the same directory your homepage is in. This will give permission to any robot to access your site.

If you want to exclude certain robots or certain pages on your site, see robotstxt.org.

#3 stillgreen

stillgreen

    HR 1

  • Members
  • Pip
  • 5 posts
  • Location:Midwest USA

Posted 25 October 2003 - 09:09 AM

Thanks for the welcome, Scottie!

I've been a lurker and learner for a while now! Amazing how much little things count. According to Digital Point, I've brought my rankings for my (very obscure) keywords up by many points in the last three weeks. So thanks, everybody, for all the help you've already given me! ;)

That was a wonderful site. I learned a lot about robots. There are a couple I'm going to ban from my site now that I've learned how. Hard to tell but I think they might be malicious little beasties.

Again, thanks for making me feel welcome and the help.

#4 qwerty

qwerty

    HR 10

  • Moderator
  • 8,295 posts
  • Location:Somerville, MA

Posted 25 October 2003 - 09:18 AM

One thing about the malicious little beastie robots -- they tend to ignore the robots exclusion protocol.

If you tell Googlebot not to look at a certain directory, it won't. But if you use robots.txt to say "please don't crawl my site, Mr Email Harvesting Spider," it's not going to make any difference.

#5 _Yura_

_Yura_

    HR 2

  • Active Members
  • PipPip
  • 20 posts
  • Location:Toronto, Ontario, Canada

Posted 25 October 2003 - 09:21 PM

Does it make a difference having an empty robot.txt file or not having it at all?

Does it play any role (having an empty robot.txt file)?

#6 qwerty

qwerty

    HR 10

  • Moderator
  • 8,295 posts
  • Location:Somerville, MA

Posted 25 October 2003 - 09:28 PM

It shouldn't make any difference at all, since there's nothing in a robots.txt that actually gives permission. It's just there to tell spiders not to go to specific places on the site, and everything else is assumed. In other words, without a robots.txt, everything is assumed to be permitted.

However, I've heard from a number of people who claim that if they don't have a robots.txt file, spiders come to their site, look for the file, don't find it, and leave. That really shouldn't happen, and it's never happened to me, but I've always uploaded a file when I first published a site, even if it was an empty file. It may be that the people who claim this happens just aren't noticing some other problem.

But I see no reason not to upload the file. It can't hurt to have it there.
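
For anyone who wants to check this for themselves, here is a minimal sketch using Python's standard urllib.robotparser module. It shows how a compliant parser treats an empty robots.txt compared with one that has a Disallow rule. The /testing/ directory, the example.com URLs and the can_fetch helper are made up for illustration, and real spiders may interpret the file somewhat differently from this parser.

```python
from urllib.robotparser import RobotFileParser


def can_fetch(robots_txt, agent, url):
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# An empty file: nothing is disallowed.
EMPTY = ""

# Hypothetical rules keeping all robots out of a /testing/ directory.
BLOCK_TESTING = """\
User-agent: *
Disallow: /testing/
"""

print(can_fetch(EMPTY, "Googlebot", "http://example.com/page.html"))
print(can_fetch(BLOCK_TESTING, "Googlebot", "http://example.com/testing/draft.html"))
print(can_fetch(BLOCK_TESTING, "Googlebot", "http://example.com/page.html"))
```

The first and third checks should come back allowed and the second blocked. Keep in mind this only models robots that choose to obey the file.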

#7 dragonlady7

dragonlady7

    HR 6

  • Active Members
  • PipPipPipPipPipPip
  • 618 posts
  • Location:Buffalo, NY

Posted 25 October 2003 - 09:29 PM

It's a good idea to have a blank robots.txt if you don't know how to configure one properly, just because. I heard rumors a while back while Google was going through some changes that the spider wasn't crawling sites that didn't have one. I did notice it turning away from my site after requesting robots.txt repeatedly and not getting it, so I put one up and it crawled. But if Googlebot didn't crawl sites without a robots.txt, there'd be a lot fewer sites in the index-- and a lot worse results!
So, it probably doesn't hurt but most likely doesn't make a difference.
I haven't used robots.txt to ban anything, but it is a *very* good idea to exclude all robots from your testing directory and the like.
A related issue is robots meta tags-- you can put tags in the head of a document if you want it left out of the search engine's results. I've done this for one article I wrote that I've published elsewhere-- I offer it on my site so visitors can read it, but I don't want it to show up in Google and elsewhere because it might cause the other place it's located to be filtered out of the results for duplicate content.
So, for that one, I've put in this tag:
<meta name="robots" content="noindex, follow">
meaning that Googlebot (or Slurp or Scooter or whoever) should not index the page, but should follow any links on the page.
You can also say noindex, nofollow, meaning that the bots should neither index the page nor follow the links. And you can also say index, nofollow to tell the robots to index the page but not follow the links. Of course, you can say index, follow as well, but that is the default value for the tag, so that's the same as simply not including the tag at all.
Some people say it's ineffective and robots ignore it, but I mostly just care about the one bot, so I use it happily.
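
To make the combinations concrete, here is a rough sketch in Python of how a well-behaved robot might read the tag. It uses only the standard html.parser module. The RobotsMetaParser class and robots_policy function are names invented for this example, not part of any real crawler, and actual search engines may handle edge cases differently.

```python
from html.parser import HTMLParser
from typing import Tuple


class RobotsMetaParser(HTMLParser):
    """Collects the directives from any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            # Directives are comma separated and case-insensitive.
            for part in content.split(","):
                if part.strip():
                    self.directives.add(part.strip().lower())


def robots_policy(html: str) -> Tuple[bool, bool]:
    """Return (may_index, may_follow); both default to True when no tag is present."""
    parser = RobotsMetaParser()
    parser.feed(html)
    found = parser.directives
    may_index = not ({"noindex", "none"} & found)
    may_follow = not ({"nofollow", "none"} & found)
    return may_index, may_follow


page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(robots_policy(page))
```

For the tag above, the result says not to index the page but to follow its links. A page with no robots meta tag comes out as index and follow, matching the default.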

In order to ban email harvesters and the like, you may need to get heavier. If you have any really dorky friends who know about this stuff, or some free time, you can research the options. They're server-level measures that actually make it impossible for the offending bot to access your website. .htaccess is one method you can use. I don't know how to use it, though-- I just give the name so you can look it up if you choose!

A lot of people design spider traps, too-- they're designed to catch any bot that disobeys robots.txt, and crash the program that runs the bot. Lots of email harvesters will come onto a site looking for files that have the names of email scripts and the like, so if you name one of these bot-killing scripts with that name, you can sometimes trick a malicious bot into recursive loops that tangle it up and make it unable to continue its crawl. It's fun to read about these traps...

Anyhow, happy bot-hunting! And welcome out of lurking!

#8 stillgreen

stillgreen

    HR 1

  • Members
  • Pip
  • 5 posts
  • Location:Midwest USA

Posted 25 October 2003 - 10:34 PM

Wow! I just had a weird thing happen.

I was going to post something about my 404 errors then decided at the last minute to check my cpanel again and make sure my figures were correct.

I used my browser's "back" button to come back to the forum but when I got here the whole screen was filled with what looked like the css script for the site.

I tried refreshing the page a couple of times but the result didn't change. I had to open a new page before it looked normal.

I took a screen shot if anyone would like to look at it. It's at my website right next to my new robots.txt file. Thank you! :whistle:

BTW - What I went to check was the fact that my 404 error rate is already almost nil since adding the bot file.

Oh, and another question. Now I have a bot looking for "/sumthin". Is this a joke?

#9 stillgreen

stillgreen

    HR 1

  • Members
  • Pip
  • 5 posts
  • Location:Midwest USA

Posted 25 October 2003 - 11:15 PM

O.K. Does this happen for other people? Every time I refresh this page, I get the aforementioned text at the top of the page (I can scroll down to the actual posts).

I checked several other sites and this does not happen.

Are you techie types playin' games with the ol' carny lady?

I wouldn't have made another post but I can't figure out how to edit the previous after it is posted.

#10 chrishirst

chrishirst

    A not so moderate moderator.

  • Moderator
  • 5,887 posts
  • Location:Blackpool UK

Posted 26 October 2003 - 04:05 AM

The GET /sumthin request is a common, usually automated, telnet-style scripted probe designed to generate a 404 response. Because the error response returns information about the server configuration, it can be used to direct another attack on a vulnerable server.


You can test this on your own site or server:
Telnet into your site over port 80
(telnet example.com 80)
Type GET /sumthin HTTP/1.0 and press Enter twice.

You will get something like this:

HTTP/1.1 302 Found
Date: Sun, 26 Oct 2003 08:57:56 GMT
Server: Apache/1.3.20 Sun Cobalt (Unix) Chili!Soft-ASP/3.6.2 mod_ssl/2.8.4 OpenSSL/0.9.6b PHP/4.1.2 mod_auth_pam_external/0.1 FrontPage/4.0.4.3 mod_perl/1.25
Location: url removed/sumthin
Connection: close
Content-Type: text/html; charset=iso-8859-1

The Server information can then be used to direct other attacks depending on what is installed and which version is vulnerable.
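
If you would rather script the telnet test than type it, a minimal Python sketch along these lines should do much the same thing. The function names probe and server_header are made up for this example, and the Host header is an addition not in the telnet steps above (many shared hosts need it to find the right site). Only point it at a server you own.

```python
import socket


def server_header(raw_response):
    """Pull the Server header value out of a raw HTTP response, or None if absent."""
    lines = raw_response.splitlines()
    for line in lines[1:]:          # skip the status line
        if not line.strip():        # blank line marks the end of the headers
            break
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "server":
            return value.strip()
    return None


def probe(host, path="/sumthin", port=80, timeout=5.0):
    """Send the same request as the telnet test and return the raw response text."""
    request = "GET {} HTTP/1.0\r\nHost: {}\r\n\r\n".format(path, host)
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(request.encode("ascii"))
        while True:
            data = sock.recv(4096)
            if not data:            # server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks).decode("iso-8859-1")


# A shortened version of the response shown above, used as sample input.
SAMPLE = (
    "HTTP/1.1 302 Found\r\n"
    "Date: Sun, 26 Oct 2003 08:57:56 GMT\r\n"
    "Server: Apache/1.3.20 Sun Cobalt (Unix)\r\n"
    "Connection: close\r\n"
    "\r\n"
)
print(server_header(SAMPLE))
```

Against a live server you would call something like server_header(probe("example.com")). The less detail that comes back in the Server line, the less an attacker learns from this kind of probe.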


Chris.

#11 stillgreen

stillgreen

    HR 1

  • Members
  • Pip
  • 5 posts
  • Location:Midwest USA

Posted 27 October 2003 - 05:26 AM

Thanks Chris!

Unfortunately, you are WAY over my head.

So here I go off to do research, research . . .:rolleyes:

#12 Jill

Jill

    High Rankings Advisor

  • Admin
  • 32,325 posts

Posted 27 October 2003 - 08:44 AM

Wonder, you need to clear out your cache to get rid of that weird code. I believe that and a reboot should do it for you.

Jill

#13 Alan Perkins

Alan Perkins

    Token male admin

  • Admin
  • 1,559 posts
  • Location:UK

Posted 27 October 2003 - 11:21 AM

Please note:
  • It's robots.txt, not robot.txt
  • A blank robots.txt gives explicit permission to crawl your site - a missing robots.txt gives implicit permission. A nice difference. :)
  • A blank robots.txt is handy when your site has a special 404 handler ... but make sure you set the permissions right so that it can still be read, even if it's empty!
  • It's only a protocol and nothing has to obey it. The bad guys don't.
  • Regarding the robots meta tag, the only attributes that MUST be obeyed by compliant robots are the negative ones (noindex, nofollow and none). And, again, it's only a protocol.


#14 websage

websage

    WebSage

  • Active Members
  • PipPipPipPipPip
  • 362 posts
  • Location:Arlington, VA, USA

Posted 27 October 2003 - 11:30 AM

For more information on the subject read the FAQ at:

http://www.robotstxt.../wc/robots.html



