Overall (which I hadn't noticed until I'd already posted some numbers) the pagecounts across the board seem to be lower than normal.
Qwerty's site seems (I didn't count) to only have about 14 pages, so it's always going to be impossible to crawl more pages than a site has.
Cre8asite.Net is in a weird period right now. Up until last month we had about 22-27K pages in there. We also had about 14-18K pages in Froogle from my Amazon Web Services feed section of the site. Froogle doesn't like to list "associate sites," so when it discovers that a site is an affiliate (I assume someone "blows the whistle," but I dunno), the site vanishes. For a month or so this tends to affect the overall pagecount as well, which suggests there is some kind of crosstalk between the Froogle and Google databases while they sort out what is what. It appears the whole site now needs to be recrawled with the "list in froogle" flag turned off, and that takes a month or so. (This is what happened with other sites I've used the Amazon scripts on.)
On your site (if it's the one in your profile) it looks like there are only about 100 pages in there, so again, it's not gonna crawl more than exist. (You mirror in two languages, and I'm not really sure how that works, though).
With these forums here, I've suggested in other posts that the pagecount is low. I'm not sure why - though there are some things in the code here that I recommend avoiding if you want a good crawl. Jill and Scottie didn't seem worried about it and were happy with their traffic, so I never explored deeper, but I think there are several key things that could be done codewise to pique interest in the pages here a bit more.
The reality of the matter is that there are lots of other things that will affect how deep the spiders will go. I believe, in fact, that Google doesn't say "This site has a mean PR of X so I'll crawl Y pages" but rather "This site has a mean PR of X so I'll crawl for Y number of hours (or days)."
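To make that time-budget idea concrete, here's a quick toy model. The formula and every number in it are my own guesses for illustration (nobody outside Google knows the real mechanics); the point is just that under a time budget, how fast your pages come back directly caps how many get crawled:

```python
# Toy model of a time-budgeted crawl -- the formula and figures are
# made up for illustration, not anything Google has published.

def pages_crawled(crawl_hours, seconds_per_page):
    """Pages fetched when the spider spends a fixed amount of time,
    rather than fetching a fixed number of pages."""
    return int(crawl_hours * 3600 / seconds_per_page)

# Same hypothetical 10-hour budget, two different server speeds:
print(pages_crawled(10, 0.5))  # 72000 -- snappy server
print(pages_crawled(10, 2.0))  # 18000 -- sluggish server, 1/4 the pages
```

Quadruple the response time and you lose three quarters of your potential pagecount - which a fixed "crawl Y pages" rule wouldn't explain.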
If everything else is optimal, then the figures I give out are closer to accurate for your specific case. My figures are only based on a sampling of sites that I follow (about 15-20 in total, so it's not a huge sampling). From time to time I compare this against comparably sized, comparably themed, and comparably PR'd sites. (That's something I really need to do again soon, since my numbers do seem way off right now - but there may be some other flux factors at play too... I'm not sure.)
So, other factors that'll come into play:
- response times - if it takes longer to crawl, you won't get as many pages in because it can only give you so many hours out of each crawl cycle
- messy/bloated code - if the engine has to do more work stripping out bad/sloppy code, then it takes more time to process your pages, and therefore you won't get as many processed within the invisible time limit it has imposed on you. Hundredths of a second do add up...
- quantity and variety of deep links - your page gets one "main crawl" period, but if a lot of sites are deep linking to a page, that may trigger an exploratory crawl on that page and the pages directly linked to it. The more links to that page, the more time you get on that "seeded crawl" and the more potential pages Google may find.
- full moons. Even though we don't have monthly updates anymore, your site still gets crawled and updated on a monthly basis - it's just not all sites at once anymore. Some months the crawls are deeper (longer) than others. (Some of the discrepancies we're seeing right now may be that some sites are on their "January" cycle - a short one - while others are still showing their "December" cycle - one that was very deep by most standards.)
- Good pyramidal linking structure. Remember, I talk in "mean PR" for depth of crawl, and your front page PR is not always a good indication of this. If your front page is a PR 5 and you link equally to gobs and gobs of other pages, then the PR of the second-level pages drops considerably. If your linking structure flows and moves through the site logically, then the PR will flow evenly downward through the site. When Google hits a page, its interest in that page (and in following the links on it) is gauged in no small part by the PR of that page. If the PR skips from 5 to 3, then it's only going to show a "level 3" interest in following the links on the page. If it goes from 5 to 4, then it's much more likely to continue on through that page and grab all of the pages it links to.
- Lots of well referenced fresh stuff. Since Google doesn't give "real" PR to a page until the monthly cycle update, fresh stuff tends to get a bonus (i.e. the guess is usually pretty generous). This stuff is done on a separate cycle from the deep crawl so if you have lots of well referenced fresh stuff that references other good stuff, then this stuff is likely to get in there on top of the regular cycle.
- and a few other things that I'm surely forgetting to mention right now.
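The pyramidal-linking point above can be sketched the same way. This is another back-of-the-envelope model, not Google's actual math: it just assumes toolbar PR is roughly logarithmic, so splitting a page's vote equally across n links costs about log10(n) toolbar points per level.

```python
import math

# Made-up sketch of PR dilution down a link pyramid. Assumes toolbar PR
# is roughly logarithmic, so splitting the vote across n equal links
# costs about log10(n) points. Illustrative only, not the real formula.

def child_pr(parent_pr, outlinks):
    """Rough toolbar PR of a second-level page when the parent splits
    its vote equally across `outlinks` links."""
    return max(0.0, parent_pr - math.log10(outlinks))

# A PR 5 front page linking to a tidy second tier of 10 pages:
print(f"{child_pr(5, 10):.1f}")    # 4.0 -- spiders stay interested
# The same front page linking equally to gobs and gobs of pages:
print(f"{child_pr(5, 1000):.1f}")  # 2.0 -- second level falls off fast
```

In this sketch the tidy structure only drops one "level of interest" per tier (5 to 4), while the flat one skips straight from 5 to 2 - which is exactly the difference in crawl enthusiasm I was describing.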
The numbers I give should only be taken for what they are, a general and often vague rule of thumb. There are lots of other variables and exceptions that come into play.