Googlebot Crawl Stats Dramatically Changed
Posted 29 June 2009 - 10:13 PM
On average (as per my logs) I was getting 25,000 hits and at least 1 GB of downloaded data per day.
Now for the past 2 weeks, I'm getting 10,000 hits and only 100 MB per day. (40% hits & 10% data).
I have no idea why since nothing that I'm aware of has changed and no alerts in WMT. Any ideas? Has anyone else experienced the same?
PS. I'm seeing this in WMT as well, that's where I noticed it.
Posted 29 June 2009 - 10:42 PM
If your files are not actually updated often, and your server is correctly configured to return a 304 Not Modified status when a user agent sends a GET request with a properly formatted If-Modified-Since header (you'd be surprised how many servers are not configured to do this correctly), it would make complete sense for Googlebot to slow down its spidering of already-known pages. It's actually a good thing to see.
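The decision the server makes can be sketched in a few lines of Python (a hypothetical helper for illustration, not code from any actual web server):

```python
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import Optional

def conditional_status(if_modified_since: Optional[str], file_mtime: datetime) -> int:
    """Return 304 if the client's cached copy is still fresh, else 200."""
    if if_modified_since is None:
        return 200  # no validator sent: serve the full page
    try:
        cached = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200  # malformed header: fall back to a full response
    # HTTP dates have one-second resolution, so drop sub-second precision
    return 304 if file_mtime.replace(microsecond=0) <= cached else 200
```

A correctly configured server does this comparison on every request carrying the header; one that ignores it re-sends the full page (and the full bandwidth) every time Googlebot visits.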
Posted 29 June 2009 - 11:10 PM
I understand why it would make sense for it to slow down if nothing has changed, but why the change all of a sudden (why not have done this 6 months ago)?
How can I test if a 304 is returned?
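One quick way is to script the request yourself: send a GET with an If-Modified-Since header and read the status line that comes back. Here's a self-contained Python sketch that stands up a throwaway local server honoring the header, then probes it. The server side is deliberately simplified for the demo; to test a live site you'd point `probe()` at your own host and a date you know is later than the page's last change:

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

LAST_MODIFIED = "Mon, 01 Jun 2009 00:00:00 GMT"  # pretend mtime of the resource

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # A real server would compare the header against the file's mtime;
        # this demo just checks for an exact match with LAST_MODIFIED.
        if self.headers.get("If-Modified-Since") == LAST_MODIFIED:
            self.send_response(304)
            self.end_headers()
        else:
            body = b"hello"
            self.send_response(200)
            self.send_header("Last-Modified", LAST_MODIFIED)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

def probe(host, port, path, since=None):
    """Request `path`, optionally with If-Modified-Since, and return the status code."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    headers = {"If-Modified-Since": since} if since else {}
    conn.request("GET", path, headers=headers)
    status = conn.getresponse().status
    conn.close()
    return status

if __name__ == "__main__":
    server = HTTPServer(("127.0.0.1", 0), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    port = server.server_address[1]
    print(probe("127.0.0.1", port, "/"))                       # no validator sent
    print(probe("127.0.0.1", port, "/", since=LAST_MODIFIED))  # cached copy still fresh
    server.shutdown()
```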
Posted 30 June 2009 - 10:07 AM
To give you a quick glance at what it'll look like in your logs I'm pasting a couple of hits from my Chrome browser to a page on one of my test domains, with my IP number redacted. The first one is a 200 status. The second is a 304 status, because I'd just visited and cached the page. You'll see the HTTP status code right after where it says HTTP/1.1.
xx.xxx.xxx.xxx - - [30/Jun/2009:11:00:23 -0400] "GET /apache.html HTTP/1.1" 304 91 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/530.5 (KHTML, like Gecko) Chrome/188.8.131.52 Safari/530.5"
Posted 01 July 2009 - 12:54 PM
I looked at my logs and found 304s, but mostly for images and scripts. I then filtered for Googlebot's hits, and the only 304s I saw were from my blog (Blogger).
I came across this neat 304 tester: http://www.microsoft...ls/default.aspx which allowed me to learn more. My site is made up of several platforms, so I tested each and here are my results:
Joomla pages (main site): always returns 200
Blogger pages (static): returns 304
SMF (forum): always returns 200
Old forum pages (static): returns 304
Static pages: some return only 200 while most others return 304 ???
Moodle pages (courses): returns 303
Wordpress pages: always returns 200
So it seems that the server is able to respond with a 304, but only for certain types of pages, such as static pages or images, and not for dynamic or database-driven pages. I will have to research if and how I can get Joomla/WordPress to return 304s, because I like the idea of optimizing crawls this way; it just makes sense.
Although, I still remain puzzled as to why the sudden crawl change. I believe it has to do with my Google Custom Search account. I renewed & changed this on June 8 and the crawl rates changed as of June 14.
Posted 01 July 2009 - 04:19 PM
That'll likely explain why you get different results with some static html pages. You may simply have not visited them before.
Things get a good bit trickier for the server when you start talking about dynamic pages, such as PHP files, ASP files, or even SHTML pages with include statements. Sure, the server can see the date/time stamp of the file being requested, and even of any other files being included, but when you start asking when an individual database record was last updated, things get a lot more complicated, in a hurry. Thus most servers default to a safe position and always return a 200 OK (basically ignoring the If-Modified-Since) for all dynamic pages, regardless of whether the files hit a database or not.
There are ways to code around this little anomaly if one really feels the need. I never have personally. I just let the server give 'em a fresh page with a fresh Modified date each time a spider hits one of my php pages.
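For what it's worth, the usual way to code around it is to derive a Last-Modified value from the newest timestamp among the database records that build the page, then compare that against If-Modified-Since. A rough Python/SQLite sketch, with an invented schema (`posts` table, `updated_at` as a Unix timestamp) purely to illustrate the idea:

```python
import sqlite3
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime

def page_last_modified(db_path: str) -> datetime:
    """Newest `updated_at` among the records that build the page (schema is hypothetical)."""
    with sqlite3.connect(db_path) as db:
        (ts,) = db.execute("SELECT MAX(updated_at) FROM posts").fetchone()
    return datetime.fromtimestamp(ts, tz=timezone.utc)

def respond(if_modified_since, db_path):
    """Return (status, headers) for a dynamic page, honoring If-Modified-Since."""
    mtime = page_last_modified(db_path).replace(microsecond=0)
    if if_modified_since:
        try:
            if mtime <= parsedate_to_datetime(if_modified_since):
                return 304, {}
        except (TypeError, ValueError):
            pass  # malformed header: fall through to a full response
    return 200, {"Last-Modified": format_datetime(mtime, usegmt=True)}
```

The catch is exactly that MAX() query: you have to know every record (and included file) that feeds the page, which is why most dynamic platforms don't bother.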
And of course this 200 OK from the server still doesn't stop browsers from making their own decisions. So they might still load those dynamic pages from the local cache, even if a server tells them not to because the file has a fresh modified date.
Yeah, it all gets convoluted. Quickly.
I didn't ask before, but how long have you been tracking spider hits like this? Is it possible that the previous numbers had been inflated a bit or a lot? I ask because starting several weeks ago now Google was apparently doing a full index rebuild. Which causes them to send spiders scurrying about madly for weeks on end, hungrily gobbling up anything and everything they see.
And just to ask, because of the rather large amount of bandwidth transfer you mentioned above: does the site have a ton of images? Or PDF files? Or something similar that's large but doesn't change often? I have a couple of sites that contain these types of files (a couple with massive numbers of images, and another couple that get the double whammy because there are lots of PDFs and just as many images that are thumbnails of the PDFs). I ask because whenever Google performs an index reset they spider the heck out of these sites. But once they have confirmed that all of these more static files are still there, they pretty much leave them alone, other than the occasional spot check, a few hundred at a time.
Posted 01 July 2009 - 06:28 PM
I normally never do, just take a peek every once in a while. What caught my eye in WMT is the graph showing a drastic drop off (6 month graph).
Just took another look: my logs show consistently high numbers for the past 8 months, and prior to that it was about what it is today. These high numbers started shortly after I had another problem where Googlebot was spidering a ton of search pages (see older posts), which I cleaned up/blocked/redirected/noindexed about 6 months ago. Hmmm, maybe after 6 months (or some expiration period) they decided to respect the blocks I put in and have all of a sudden stopped examining the crap pages. I will need to examine my logs to see if that may be the case.
No, no, & no. It's mostly just content. There's a few videos but those are hosted elsewhere on a CDN (content delivery network).
I'm just puzzled by the drastic drop (from 1 GB to 100 MB per day) and want to understand why.
Since I suspect it may be from my Google Custom Search, I will be upgrading the account for an extra $1250 per year, just to play it safe, not because I need it or because it brings extra value. There was some confusion with that, and I was not able to get a clear answer from Google after speaking with them a couple of times. I guess I can justify throwing more money at them by hoping that they'll take better care of my site, or pay more attention to it.
Google's guidelines say "Make sure your web server supports the If-Modified-Since HTTP header. This feature allows your web server to tell Google whether your content has changed since we last crawled your site."
So now that you've made me aware of this, and that my database-driven pages are not doing it, I have to figure out how to fix this in Joomla & WordPress (there's gotta be a way). The first reference I came across, from another post, was: "I have tried with all the major cms like 'wordpress,joomla,etc'. But there is no valid answer for this. I need a cms with this feature inbuild. Do anyone know a solution for this issue." And of course, no answer.
So if anyone knows how to setup joomla or wordpress with 304 responses, please share.
Posted 01 July 2009 - 06:50 PM
For wordpress, there's a plugin called "1 Blog Cacher"
For Joomla, search Google for "Joomla Tips #1 : Turning On Caching" and the first result should be a post with a hack from someone's comment. Although that page dates back to 2005, so I don't know if it even applies to newer installs.
Posted 02 July 2009 - 08:13 AM
I wouldn't count on that. It's highly unlikely it will have any effect whatsoever.
Posted 02 July 2009 - 09:40 AM
In other words, I wouldn't spend a lot of time, effort or money on such a project because I've never seen a site get hurt by how it handles 304's.
Posted 02 July 2009 - 02:07 PM