Posted 02 April 2004 - 09:35 AM
I have a friend who, for traffic reasons, would like to block spiders from accessing his site during the day but allow them at night. Is there a list of the major spiders that crawl for all the search engines? If so, do you know how I could go about getting a copy of that list?
Thanks in advance,
Posted 02 April 2004 - 11:14 AM
I expect you could choose when spiders visit by setting up a cron job on your server to call a PHP script that rewrites your robots.txt file at specified times.
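Something along these lines could work; the file path, hours, and crontab entry below are all assumptions for illustration, not anything tested:

    <?php
    // robots_switch.php -- hypothetical sketch, not taken from this thread.
    // Writes a restrictive robots.txt during assumed daytime hours (8:00-19:59)
    // and a permissive one at night. Path and hours are assumptions.

    $robotsPath = '/var/www/html/robots.txt';
    $hour = (int) date('G');   // current hour, 0-23, server time

    if ($hour >= 8 && $hour < 20) {
        // Daytime: ask all spiders to stay out.
        $rules = "User-agent: *\nDisallow: /\n";
    } else {
        // Night: no restrictions.
        $rules = "User-agent: *\nDisallow:\n";
    }

    file_put_contents($robotsPath, $rules);

A crontab line such as 0 8,20 * * * php /path/to/robots_switch.php would then flip the file at the two boundaries. Whether the major crawlers actually re-read robots.txt often enough for this to matter is exactly the doubt raised in the replies below.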
Posted 02 April 2004 - 11:35 AM
Basically, my question comes down to how often a spider will come back (if at all) once it's been told to go away. If you're hoping the bot will visit once a week during the hours you're allowing access, but it's already received a message that it can't come in, I suppose it would come back at some point to see if the denial of access has been lifted, but I rather doubt that would be soon.
Posted 02 April 2004 - 11:47 AM
There are lots of things you can do to lower the load on your server, like using If-Modified-Since headers and so on. I highly doubt you're going to find a crawl-only-during-specific-hours solution. I suppose some sort of error header might work, but I'm not sure what would do it in a way that wouldn't adversely affect you...
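To illustrate the If-Modified-Since idea, here's a minimal sketch for a page whose content only changes when a file on disk changes; the filename is just an example, not anything from this thread:

    <?php
    // Hypothetical sketch of honouring If-Modified-Since so repeat spider
    // visits to an unchanged page cost almost nothing.
    // $sourceFile stands in for whatever the page actually depends on.

    $sourceFile = 'content/page.html';
    $lastModified = filemtime($sourceFile);

    header('Last-Modified: ' . gmdate('D, d M Y H:i:s', $lastModified) . ' GMT');

    if (isset($_SERVER['HTTP_IF_MODIFIED_SINCE'])) {
        $since = strtotime($_SERVER['HTTP_IF_MODIFIED_SINCE']);
        if ($since !== false && $since >= $lastModified) {
            // Nothing has changed since the spider's last visit: 304, no body.
            header('HTTP/1.1 304 Not Modified');
            exit;
        }
    }

    // Otherwise serve the full page as usual.
    readfile($sourceFile);

As for the "error header", my best guess is a 503 Service Unavailable with a Retry-After header sent during peak hours, which well-behaved crawlers treat as "come back later", but that's a guess rather than something anyone has confirmed here.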
Posted 02 April 2004 - 12:08 PM
I'm probably being a bit naive here, but there are only a few robots that you really want to visit anyway, so I can't see that it would have a noticeable effect on the site for your visitors.
Posted 02 April 2004 - 12:27 PM
Thanks for the URL qwerty, I'll take a look at it now.
Posted 02 April 2004 - 12:39 PM
Nobody knows for sure what the bots would do, but my guess is that one of two things will happen eventually: 1) the spiders will simply stop coming by at all after being told multiple times they're not welcome; or 2) if they do keep coming, the site will not be spidered nearly as often or as deeply as it currently is.
Two solutions as I see it... Totally restrict the bots from larger files that might put a strain on the server (PDFs, JPGs, highly dynamic pages, etc.), or better yet, simply tell your friend that they need to get on a better server.
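If you go the first route, the restriction itself is only a few lines of robots.txt; the directory names here are made up purely for illustration:

    User-agent: *
    Disallow: /pdfs/
    Disallow: /images/
    Disallow: /cgi-bin/

Most well-known spiders respect directory-level Disallow rules; per-file-type wildcards are less universally supported, so grouping the heavy files under a directory is the safer bet.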
Heck, I have servers that house highly popular and dynamic sites that pump through 2,000+ gigs in a month's time (no, that's not a typo) and they never struggle one iota. Average server load is still less than 15%. But the server hardware is set up to support that kind of volume.
FWIW, your friend can get a decent dedicated server for as little as $50 per month these days, and it will come with 700 gigs of free bandwidth usage per month. That'll easily handle almost any 50 sites you want to throw on it.
Posted 02 April 2004 - 12:56 PM
Ok, maybe he has problems with misbehaving spiders, but Googlebot and other SE spiders aren't the issue. Google is very mindful of server load, and will reduce crawl speed/frequency if they notice slow server response times.
A better approach would be to comb his log files for a list of pages accessed by spiders. That will help you find who the bad boys are. Then you can either block them by IP or throttle their rate of access.
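If it helps, that combing can be as simple as a little command-line script that tallies requests per user agent; the default log path and the assumption of Apache's combined log format are mine, not your friend's actual setup:

    <?php
    // Hypothetical sketch: count requests per user agent in an Apache
    // combined-format access log so the heaviest spiders stand out.
    // Pass the real log path as the first argument.

    $logFile = isset($argv[1]) ? $argv[1] : '/var/log/apache2/access.log';
    $counts = array();

    $handle = fopen($logFile, 'r');
    if ($handle === false) {
        fwrite(STDERR, "Cannot open $logFile\n");
        exit(1);
    }

    while (($line = fgets($handle)) !== false) {
        // In the combined format the user agent is the last quoted field.
        if (preg_match('/"([^"]*)"\s*$/', $line, $m)) {
            $agent = $m[1];
            $counts[$agent] = isset($counts[$agent]) ? $counts[$agent] + 1 : 1;
        }
    }
    fclose($handle);

    arsort($counts);
    foreach (array_slice($counts, 0, 20, true) as $agent => $hits) {
        echo str_pad($hits, 8) . $agent . "\n";
    }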
Posted 02 April 2004 - 04:47 PM
So...oh well. Thank you everyone for your valued feedback.
Posted 02 April 2004 - 09:15 PM
An examination of raw server logs suggests that the spidering process is often more like being tickled than poked. One finger comes in and gets your robots.txt, and other fingers, frequently much later, come back to explore the site. Ergo, if the bot finds the site open during the wee hours of the morning, it may well return during the heat of the day.
Posted 03 April 2004 - 08:54 AM
A better approach might be to look at the logs, determine which specific spiders are acting a little too greedy, and, because they aren't that important, ban just those spiders.
For example, we recently made the decision to ban the Wayback Machine's ia_archiver because the spider had gotten to the point where it was visiting at least once, but more often twice, a day and looking at about 500 out of 1,000 pages each time. We couldn't see the point in it doing this. Because our site is dynamic, ia_archiver was using load and bandwidth in a manner we decided was unnecessary, so we finally banned it just to get our logs and load to smooth out.
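For anyone wanting to do the same, shutting out just that one spider is a short named block in robots.txt (ia_archiver does honour robots.txt); the catch-all section here is only a placeholder:

    User-agent: ia_archiver
    Disallow: /

    User-agent: *
    Disallow: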