Spiders!


11 replies to this topic

#1 hypntzd3

hypntzd3

    HR 3

  • Active Members
  • 78 posts

Posted 02 April 2004 - 09:35 AM

Hey! :drunk:

I have a friend who, for traffic reasons, would like to block spiders from accessing his site during the day but allow them at night. Is there a list of the major spiders that crawl for all the search engines? If so, do you know how I could go about getting a copy of that list?

Thanks in advance,
Hypntzd3 :thumbup:

#2 qwerty

qwerty

    HR 10

  • Moderator
  • 8,287 posts
  • Location:Somerville, MA

Posted 02 April 2004 - 10:12 AM

I don't know how recently the list was updated, but here you go: http://www.robotstxt...html/index.html

I don't know of a way to block spiders during certain hours, though. Maybe somebody here will have some thoughts on that.

#3 essexell

essexell

    HR 3

  • Active Members
  • 62 posts
  • Location:UK

Posted 02 April 2004 - 11:14 AM

I'm not a great expert...

but I expect you could choose when spiders visit by setting up a CRON job on your server to call a PHP script that rewrites your robots.txt file at specified times.
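
A minimal sketch of the cron idea above, written in Python rather than PHP purely for illustration. The crawl window in `ALLOWED_HOURS`, the function names, and the file path are all assumptions, not anything taken from this thread. Keep in mind that robots may cache robots.txt, so there is no guarantee they notice the change when it happens.

```python
from datetime import datetime
from pathlib import Path

# Hypothetical crawl window: 2:00 am to 4:59 am server time.
ALLOWED_HOURS = range(2, 5)

ALLOW_ALL = "User-agent: *\nDisallow:\n"    # empty Disallow = everything allowed
BLOCK_ALL = "User-agent: *\nDisallow: /\n"  # Disallow: / = nothing allowed


def robots_for_hour(hour):
    """Return the robots.txt body that should be live at the given hour (0-23)."""
    return ALLOW_ALL if hour in ALLOWED_HOURS else BLOCK_ALL


def write_robots(path, now=None):
    """Overwrite the robots.txt at `path` based on the hour of `now`."""
    now = now or datetime.now()
    Path(path).write_text(robots_for_hour(now.hour))
```

To use it, a script would call `write_robots()` with the path to the site's robots.txt, and cron would run that script once an hour.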

#4 qwerty

qwerty

    HR 10

  • Moderator
  • 8,287 posts
  • Location:Somerville, MA

Posted 02 April 2004 - 11:35 AM

I'm not sure how (compliant) robots would respond to that. Let's say I set things up so that Googlebot only has access to my site between 2 and 5 am, and it happens to come by 3 times in a row at 5:30 am. So it's seen three times that access is denied to it. I doubt it would try coming back at a different time to see if the situation has changed.

Basically, my question comes down to how often a spider will come (if at all) once it's been told to go away. If you're hoping the bot will come in once a week during the hours you're allowing it to come in but it's already received a message that it can't come in, I suppose it would come back at some point to see if the denial of access has been lifted, but I rather doubt it would be soon.

#5 Grumpus

Grumpus

    HR 6

  • Active Members
  • 786 posts

Posted 02 April 2004 - 11:47 AM

Yeah - I agree with Qwerty. If it hits during an "off hour" you're done for that crawl. It'll try again next month (or tomorrow in the case of a Google fresh crawl), but there may be some "they said no last time, so this is a low priority" factor that we don't know about.

There are lots of things you can do to lower the load on your server, like using If-Modified-Since headers so that an unchanged page can be answered with a short 304 Not Modified instead of the full page. I highly doubt you are gonna find a crawl-only-during-specific-hours solution. I suppose some sort of error header might work, but I'm not sure what would do it in a way as to not adversely affect you...

G.
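
The If-Modified-Since idea mentioned above could look roughly like this on the server side. This is only a sketch: `conditional_status` is a made-up helper name, and it assumes `last_modified` is a timezone-aware UTC datetime for the page being requested.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_status(if_modified_since, last_modified):
    """Return 304 if the client's cached copy is still current, else 200."""
    if not if_modified_since:
        return 200  # no conditional header: send the full page
    try:
        client_time = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200  # unparseable header: play it safe and send the page
    if client_time.tzinfo is None:
        # Treat a date without a usable zone as UTC.
        client_time = client_time.replace(tzinfo=timezone.utc)
    # HTTP dates only have one-second resolution.
    if last_modified.replace(microsecond=0) <= client_time:
        return 304
    return 200


# Example: the spider sends back the Last-Modified value it saw earlier.
last_modified = datetime(2004, 4, 2, 9, 35, tzinfo=timezone.utc)
header = format_datetime(last_modified, usegmt=True)
print(conditional_status(header, last_modified))  # 304
```

A 304 response carries no body, so a spider re-checking an unchanged page costs the server very little.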

#6 essexell

essexell

    HR 3

  • Active Members
  • 62 posts
  • Location:UK

Posted 02 April 2004 - 12:08 PM

Is it likely that a robot visiting once a week - or even daily - is going to make that much difference to the server load anyway?

I'm probably being a bit naive here, but there are only a few robots that you really want to visit anyway, so I can't see that it would have a noticeable effect on the site for your visitors.

#7 hypntzd3

hypntzd3

    HR 3

  • Active Members
  • 78 posts

Posted 02 April 2004 - 12:27 PM

I don't have exact figures for the number of hits or the bandwidth being used, but his site gets visited daily by a good number of spiders and he gets lots of traffic. Additionally, he will be regulating the spiders for some "ideas" we have on the back end.

Thanks for the URL qwerty, I'll take a look at it now.

Hypntzd3 :zap:

#8 Randy

Randy

    Convert Me!

  • Moderator
  • 17,540 posts

Posted 02 April 2004 - 12:39 PM

My two cents would have to agree with the above sentiments. Trying to do a time-of-day bot exclusion is a concept that is fraught with potential issues. Technically possible, but I wouldn't do it.

Nobody knows for sure what the bots would do, but my guess is that one of two things will happen eventually. 1.) The spiders will simply stop coming by at all after being told multiple times they're not welcome; or 2.) If they do keep coming the site will not be spidered nearly as often or deeply as it currently is.

Two solutions as I see it... Totally restrict the bots from larger files that might put a stress on the server (PDFs, JPGs, highly dynamic pages, etc.) or, better yet, simply tell your friend that he needs to get on a better server.
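
For the first option, here is a rough sketch of a robots.txt that keeps all compliant bots out of the heavy areas, checked with Python's standard `urllib.robotparser`. The directory names are hypothetical. It assumes the large files live under a few path prefixes, since the original robots.txt convention matches on prefixes rather than file extensions.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical paths; a real site would list its own heavy directories.
HEAVY_PATHS_ROBOTS = """\
User-agent: *
Disallow: /pdf/
Disallow: /images/
Disallow: /search/
"""

heavy_parser = RobotFileParser()
heavy_parser.parse(HEAVY_PATHS_ROBOTS.splitlines())


def bot_may_fetch(path, agent="Googlebot"):
    """True if a compliant robot with this user agent may request `path`."""
    return heavy_parser.can_fetch(agent, path)


print(bot_may_fetch("/index.html"))          # True
print(bot_may_fetch("/pdf/whitepaper.pdf"))  # False
```

The ordinary pages stay crawlable at any hour, which avoids the "told to go away" problem discussed earlier in the thread.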

Heck, I have servers that house highly popular and dynamic sites that pump through 2,000+ Gigs in a month's time --no that's not a typo-- and they never struggle one iota. Average server load is still less than 15%. But the server hardware is set up to support that kind of volume.

FWIW, your friend can get a decent dedicated server for as little as $50 per month these days, and it comes with 700 gigs of free bandwidth usage per month. That'll easily handle almost any 50 sites you want to throw on it.

#9 burgeltz

burgeltz

    HR 3

  • Active Members
  • 80 posts
  • Location:Plano, TX

Posted 02 April 2004 - 12:56 PM

I absolutely agree that your friend's plan is dangerous.

Ok, maybe he has problems with misbehaving spiders, but Googlebot and other SE spiders aren't the issue. Google is very mindful of server load, and will reduce crawl speed/frequency if they notice slow server response times.

A better approach would be to comb his log files for a list of pages accessed by spiders. That will help you find who the bad boys are. Then you can either block them by IP or throttle their rate of access.
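
Combing the logs as suggested above could start with something like this. It's a sketch that assumes the server writes Apache-style "combined" log lines, where the user agent is the last quoted field. The sample lines, IP addresses, and the function name are invented for illustration.

```python
import re
from collections import Counter

# Client IP is the first field; the user agent is the last quoted field.
LOG_LINE = re.compile(r'^(?P<ip>\S+) .*"(?P<agent>[^"]*)"\s*$')


def requests_per_client(lines):
    """Count requests per (ip, user agent) pair across the given log lines."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match:
            counts[(match.group("ip"), match.group("agent"))] += 1
    return counts


# Invented sample data, not real log entries.
sample = [
    '192.0.2.10 - - [02/Apr/2004:09:35:00 -0500] "GET /a.html HTTP/1.0" 200 5120 "-" "ia_archiver"',
    '192.0.2.10 - - [02/Apr/2004:09:35:02 -0500] "GET /b.html HTTP/1.0" 200 4096 "-" "ia_archiver"',
    '198.51.100.7 - - [02/Apr/2004:09:36:00 -0500] "GET /a.html HTTP/1.0" 200 5120 "-" "Googlebot/2.1"',
]

for (ip, agent), hits in requests_per_client(sample).most_common():
    print(hits, ip, agent)
```

Sorting by hit count puts the greediest clients at the top, which gives you the IPs to block or throttle.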

#10 hypntzd3

hypntzd3

    HR 3

  • Active Members
  • 78 posts

Posted 02 April 2004 - 04:47 PM

Oh, don't get me wrong, I'm not sure I like the idea either, but he asked me for a list and he was going to have his developer do the work to block the spiders. There is no convincing this guy...once he has made up his mind, there is nothing else to talk about.

So...oh well. Thank you everyone for your valued feedback.

Hypntzd3 :)

#11 Ron Carnell

Ron Carnell

    HR 6

  • Moderator
  • 959 posts
  • Location:Michigan USA

Posted 02 April 2004 - 09:15 PM

Not only is the idea potentially dangerous, it likely won't even accomplish what is wanted.

An examination of raw server logs suggests that the spidering process is often more like being tickled than poked. One finger comes in and gets your robots.txt and other fingers, frequently much later, come back to explore the site. Ergo, if the 'bot finds the site open during the wee hours of the morning, it may well return during the heat of the day.

#12 Ledfish

Ledfish

    HR 3

  • Active Members
  • 71 posts
  • Location:Michigan, USA

Posted 03 April 2004 - 08:54 AM

I agree. Trying to limit it to certain hours for server balancing sounds like it would be a disaster, and over a period of time it will probably leave you on the outside looking in. It will then take several months to get them revisiting again.

A better approach might be to look at the logs, determine which specific spiders are acting a little too greedy, and ban just those spiders if they aren't that important.

For example, we recently made the decision to ban Wayback's ia_archiver because the spider had gotten to the point where it was visiting at least once, but more often twice, a day and looking at about 500 out of 1000 pages each time. We couldn't see the point in it doing this. Because our site is dynamic, ia_archiver was using load and bandwidth in a manner we decided was unnecessary, so we finally banned it just to get our logs and load to smooth out.
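
The post above doesn't say how the ban was done. One possible way, assuming the spider honors robots.txt, is a per-agent rule that shuts out ia_archiver and leaves everyone else alone. The sketch below checks such a file with Python's standard `urllib.robotparser`; the test paths are made up.

```python
from urllib.robotparser import RobotFileParser

# Ban one named spider; the empty Disallow leaves all others unrestricted.
BAN_ONE_ROBOTS = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow:
"""

ban_parser = RobotFileParser()
ban_parser.parse(BAN_ONE_ROBOTS.splitlines())

print(ban_parser.can_fetch("ia_archiver", "/page.html"))  # False
print(ban_parser.can_fetch("Googlebot", "/page.html"))    # True
```

A spider that ignores robots.txt would need a server-level block instead.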



