Googlebot Crawl Stats Dramatically Changed


10 replies to this topic

#1 Dantek

Dantek

    HR 2

  • Members
  • PipPip
  • 14 posts

Posted 29 June 2009 - 10:13 PM

I don't know why, but Googlebot is no longer crawling my site the way it had for many months. The drop started on June 14 and continues through today.

On average (as per my logs) I was getting 25,000 hits and at least 1 GB of downloaded data per day.
Now, for the past 2 weeks, I'm getting 10,000 hits and only 100 MB per day (40% of the hits and 10% of the data).

I have no idea why since nothing that I'm aware of has changed and no alerts in WMT. Any ideas? Has anyone else experienced the same?

PS. I'm seeing this in WMT as well, that's where I noticed it.
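
Working those figures through gives one more data point: the average amount of data per hit. This is only a rough sketch. It assumes 1 GB is counted as 1000 MB and that the numbers are daily averages as quoted; the function name `crawl_change` is made up for illustration.

```python
def crawl_change(before_hits, before_mb, after_hits, after_mb):
    """Compare two daily crawl snapshots (hits and megabytes transferred)."""
    return {
        "hits_ratio": after_hits / before_hits,
        "data_ratio": after_mb / before_mb,
        # Average kilobytes transferred per hit, taking 1 MB as 1000 KB.
        "kb_per_hit_before": before_mb * 1000 / before_hits,
        "kb_per_hit_after": after_mb * 1000 / after_hits,
    }


# Figures from the post above (assumption: 1 GB = 1000 MB).
change = crawl_change(25_000, 1000, 10_000, 100)
print(change)
```

On these numbers the average transfer per hit falls from roughly 40 KB to roughly 10 KB, so the drop is not only in the number of requests. The figures alone do not say why.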

#2 Randy

Randy

    Convert Me!

  • Moderator
  • 17,540 posts

Posted 29 June 2009 - 10:42 PM

Does it really matter? Is the data they already know about changing that often? If it's not, there's really no need for them to eat up their bandwidth and yours spidering things they have already collected in the past, other than to make sure it's still there. And that can be accomplished with a simple If-Modified-Since request that includes the date their spider last happened by your site.

If your files are not actually getting updated often, and your server is configured correctly to deliver a 304 Not Modified status code when the user agent sends a correctly formatted If-Modified-Since header with its GET request (you'd be surprised how many servers are not correctly configured to do this), it would make complete sense for Googlebot to slow down its spidering of already known pages. That's actually a good thing to see.

#3 Dantek

Dantek

    HR 2

  • Members
  • PipPip
  • 14 posts

Posted 29 June 2009 - 11:10 PM

What matters to me at this time is why the change all of a sudden? What does it mean? As well as "are they picking up all my fresh content later than usual?" (+500 forum posts per day).

I understand why it would make sense for it to slow down if nothing has changed, but why the change all of a sudden (why not have done this 6 months ago)?

How can I test if a 304 is returned?


#4 Randy

Randy

    Convert Me!

  • Moderator
  • 17,540 posts

Posted 30 June 2009 - 10:07 AM

The easiest way to tell is probably to look in your raw log files. On a typical Apache server these will usually appear in the access_log file. What you'll be looking for is the status code, which will appear right after the HTTP version information in the raw log hits.

To give you a quick glance at what it'll look like in your logs I'm pasting a couple of hits from my Chrome browser to a page on one of my test domains, with my IP number redacted. The first one is a 200 status. The second is a 304 status, because I'd just visited and cached the page. You'll see the HTTP status code right after where it says HTTP/1.1.

CODE
xx.xxx.xxx.xxx - - [30/Jun/2009:10:59:44 -0400] "GET /apache.html HTTP/1.1" 200 12 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/530.5 (KHTML, like Gecko) Chrome/2.0.172.33 Safari/530.5"

xx.xxx.xxx.xxx - - [30/Jun/2009:11:00:23 -0400] "GET /apache.html HTTP/1.1" 304 91 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/530.5 (KHTML, like Gecko) Chrome/2.0.172.33 Safari/530.5"
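
For a larger log, eyeballing each hit gets tedious. The sketch below tallies status codes per user agent. It assumes the log uses the same Apache combined format as the two hits above; the names `LOG_LINE`, `status_counts` and `SAMPLE` are illustrative, not part of any tool.

```python
import re
from collections import Counter

# Apache "combined" log format, matching the layout of the sample hits above.
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def status_counts(lines, agent_substring="Googlebot"):
    """Count HTTP status codes for hits whose user agent contains agent_substring."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        # Lines that do not parse, or that come from other agents, are skipped.
        if match and agent_substring.lower() in match.group("agent").lower():
            counts[match.group("status")] += 1
    return counts


# The two sample hits from the post above, filtered on "Chrome" instead of Googlebot.
SAMPLE = [
    'xx.xxx.xxx.xxx - - [30/Jun/2009:10:59:44 -0400] "GET /apache.html HTTP/1.1" 200 12 "-" '
    '"Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/530.5 (KHTML, like Gecko) '
    'Chrome/2.0.172.33 Safari/530.5"',
    'xx.xxx.xxx.xxx - - [30/Jun/2009:11:00:23 -0400] "GET /apache.html HTTP/1.1" 304 91 "-" '
    '"Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/530.5 (KHTML, like Gecko) '
    'Chrome/2.0.172.33 Safari/530.5"',
]
sample_counts = status_counts(SAMPLE, agent_substring="Chrome")
print(sample_counts)
```

Against a real log you would pass an open file object instead of `SAMPLE` and leave `agent_substring` at its default to see only Googlebot's hits. Keep in mind that anyone can send a Googlebot user agent string, so this is a first pass only.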



#5 Dantek

Dantek

    HR 2

  • Members
  • PipPip
  • 14 posts

Posted 01 July 2009 - 12:54 PM

Thanks Randy, this 304 is new to me and interesting stuff.

I looked at my logs and found 304's, but mostly for images and scripts. I then filtered Googlebot's hits and the only 304's I saw were from my blog (Blogger).

I came across this neat 304 tester: http://www.microsoft...ls/default.aspx which allowed me to learn more. My site is made up of several platforms so I tested each, and here are my results:

Joomla pages (main site): always returns 200
Blogger pages (static): returns 304
SMF (forum): always returns 200
Old forum pages (static): returns 304
Static pages: some return only 200 while most others return 304 ???
Moodle pages (courses): returns 303
Wordpress pages: always returns 200

So it seems that the server is able to respond with a 304, but only for certain types of pages such as static pages or images, and not for dynamic or database-driven pages. I will have to research if and how I can get Joomla/Wordpress to return 304's, because I like the idea of optimizing crawls this way. It just makes sense.

Although, I still remain puzzled as to why the sudden crawl change. I believe it has to do with my Google Custom Search account. I renewed & changed this on June 8 and the crawl rates changed as of June 14.


#6 Randy

Randy

    Convert Me!

  • Moderator
  • 17,540 posts

Posted 01 July 2009 - 04:19 PM

Remember for there to be a 304 response normally the user agent (browser/spider/checking tool) would need to send a correctly formatted If-Modified-Since header to the server when it sends its GET request for a page. In a normal browser this means if you haven't visited a page before and/or have not cached it to your local machine for one reason or another, a 304 would never be triggered.

That'll likely explain why you get different results with some static html pages. You may simply have not visited them before.
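
If you'd rather test this without relying on a browser cache or a third-party tool, a conditional request can be sent by hand. This is a minimal sketch using Python's standard library. The function names are made up for illustration and the URL in the usage note is a placeholder.

```python
import urllib.error
import urllib.request
from datetime import datetime, timezone
from email.utils import format_datetime


def build_conditional_request(url, since):
    """Build a GET request carrying an If-Modified-Since header.

    `since` must be a timezone-aware datetime; it is sent as an HTTP date in GMT.
    """
    stamp = format_datetime(since.astimezone(timezone.utc), usegmt=True)
    return urllib.request.Request(url, headers={"If-Modified-Since": stamp})


def fetch_status(request, timeout=10):
    """Send the request and return the HTTP status code the server answered with."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises for non-2xx answers it does not handle itself,
        # and that includes 304 Not Modified.
        return err.code
```

Calling `fetch_status(build_conditional_request("http://example.com/apache.html", datetime.now(timezone.utc)))` should return 304 from a server that honors the header for that page and 200 from one that ignores it. Note that urllib follows redirects on its own, so a page that answers with a redirect will report the status of the final page.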

Things get a good bit trickier for the server when you start talking about dynamic pages, such as php files, asp files or even shtml pages with include statements. The issue being that, sure, the server can see the date/time stamp for the file being requested, and can even see the date/time stamp for any other files that may be included, but when you start talking about when an individual database record may have last been updated, things get a lot more complicated. In a hurry. Thus most servers will default to a safe position and always return a 200 OK (basically ignoring the If-Modified-Since) for all dynamic pages, regardless of whether the files hit a database or not.

There are ways to code around this little anomaly if one really feels the need. I never have personally. I just let the server give 'em a fresh page with a fresh Modified date each time a spider hits one of my php pages.
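
For anyone who does want to code around it, the core of it is a date comparison. Here is a rough sketch of that decision. It is not how any particular CMS or server does it. It assumes the application can supply a timezone-aware "last updated" timestamp for the page (for example from the newest database record on it), and `conditional_response` is a hypothetical name.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_response(last_updated, if_modified_since=None):
    """Return (status, headers) for a page whose content last changed at last_updated.

    last_updated: timezone-aware datetime supplied by the application.
    if_modified_since: raw If-Modified-Since header value, or None if absent.
    """
    # HTTP dates only have one-second resolution, so drop microseconds first.
    last_updated = last_updated.astimezone(timezone.utc).replace(microsecond=0)
    headers = {"Last-Modified": format_datetime(last_updated, usegmt=True)}

    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # malformed header: treat it as if it was not sent
        if since is not None:
            if since.tzinfo is None:
                since = since.replace(tzinfo=timezone.utc)
            if last_updated <= since:
                # Nothing changed since the client's copy: no body needed.
                return 304, headers

    return 200, headers
```

The hard part, as described above, is producing a trustworthy `last_updated` value for a page assembled from several database records. The comparison itself is the easy bit.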

And of course this 200 OK from the server still doesn't stop browsers from making their own decisions. So they might still load those dynamic pages from the local cache, even if a server tells them not to because the file has a fresh modified date.

Yeah, it all gets convoluted. Quickly. :)

I didn't ask before, but how long have you been tracking spider hits like this? Is it possible that the previous numbers had been inflated a bit or a lot? I ask because starting several weeks ago now Google was apparently doing a full index rebuild. Which causes them to send spiders scurrying about madly for weeks on end, hungrily gobbling up anything and everything they see.

And just to ask it, because of the rather large amount of bandwidth transfer you mentioned above: does the site have a ton of images? Or pdf files? Or something similar that's large but doesn't change often? I have a couple of sites that contain these types of files (a couple with massive numbers of images, and another couple that get the double whammy because there are lots of pdfs and just as many images that are thumbnails of the pdfs). I ask because whenever Google performs an index reset they spider the heck out of these sites. But once they have confirmed that all of these more static files are still there, they pretty much leave them alone, other than to do the occasional spot check a few hundred at a time.

#7 Dantek

Dantek

    HR 2

  • Members
  • PipPip
  • 14 posts

Posted 01 July 2009 - 06:28 PM

Great explanation.

QUOTE
That'll likely explain why you get different results with some static html pages. You may simply have not visited them before.

But I had :)

QUOTE
I didn't ask before, but how long have you been tracking spider hits like this?

I normally never do, just take a peek every once in a while. What caught my eye in WMT is the graph showing a drastic drop off (6 month graph).

QUOTE
Is it possible that the previous numbers had been inflated a bit or a lot?

Just took another look, and my logs show consistently high numbers for the past 8 months; prior to that it was about what it is today. These high numbers started shortly after I had another problem where Googlebot was spidering a ton of search pages (see older posts), which I cleaned up/blocked/redirected/noindexed about 6 months ago. Hmmm, maybe after 6 months they decided to respect the blocks I put in place (or some expiration period passed) and have all of a sudden decided to no longer examine the crap pages. I will need to examine my logs to see if that may be the case.

QUOTE
And just to ask it because of the rather large amount of bandwidth transfer you mentioned above, does the site have a ton of images? Or pdf files? Or something similar that's large but doesn't change often?

No, no, & no. It's mostly just content. There are a few videos but those are hosted elsewhere on a CDN (content delivery network).
I'm just puzzled by the drastic drop (from 1000 MB to 100 MB) and want to understand why.

Since I suspect it may be related to my Google Custom Search account, I will be upgrading the account for an extra $1250 per year, just to play it safe, not because I need it or because it brings extra value. There was some confusion with that and I was not able to get a clear response/understanding from Google after speaking with them a couple of times. I guess I can justify throwing more money at them by hoping that they'll take better care of my site or pay more attention to it :D

FYI
Google's guidelines say "Make sure your web server supports the If-Modified-Since HTTP header. This feature allows your web server to tell Google whether your content has changed since we last crawled your site."
So now that you've made me aware of this, and that my db pages are not doing it, I have to figure out how to fix this in Joomla & Wordpress (there's gotta be a way). The first reference I came across from another post was "I have tried with all the major cms like 'wordpress,joomla,etc'. But there is no valid answer for this. I need a cms with this feature inbuild. Do anyone know a solution for this issue." And of course, no answer :(

So if anyone knows how to set up Joomla or Wordpress to return 304 responses, please share.

#8 Dantek

Dantek

    HR 2

  • Members
  • PipPip
  • 14 posts

Posted 01 July 2009 - 06:50 PM

I think I found the solution for getting the 304 on db pages.

For Wordpress, there's a plugin called "1 Blog Cacher".

For Joomla, search Google for "Joomla Tips #1 : Turning On Caching" and the first result should be a post with a hack in one of the comments. That page dates back to 2005 though, so I don't know if it even applies to newer installs.

#9 Jill

Jill

    Recovering SEO

  • Admin
  • 33,012 posts

Posted 02 July 2009 - 08:13 AM

QUOTE
I guess I can justify throwing more money at them by hoping that they'll take better care of my site or pay more attention to it


I wouldn't count on that. It's highly unlikely it will have any effect whatsoever.

#10 Randy

Randy

    Convert Me!

  • Moderator
  • 17,540 posts

Posted 02 July 2009 - 09:40 AM

FTR, I know that bit about the If-Modified-Since is in their guidelines, but truth is I've never seen them ding a site for not dealing with it correctly. And remember above where I said almost all of my sites have PHP pages? I've never bothered with creating a way to make those show a 304 if the pages are relatively static because it's just never seemed a hard and fast rule.

In other words, I wouldn't spend a lot of time, effort or money on such a project because I've never seen a site get hurt by how it handles 304's.

#11 BBCoach

BBCoach

    HR 5

  • Moderator
  • 402 posts

Posted 02 July 2009 - 02:07 PM

QUOTE
In other words, I wouldn't spend a lot of time, effort or money on such a project because I've never seen a site get hurt by how it handles 304's.
Me neither. There are so many more important to-dos than worrying about this.



