
GWMTs - XML Indexed Files Is Way Off



#1 ttw


Posted 13 February 2014 - 05:55 PM

I'm trying to make sense of the Index Status report versus the number of sitemap URLs reported as indexed.

 

  • We have about 2,000 URLs showing in Google when doing a site: search.

 

  • The "Total Indexed" is 1,660.  Google says this is the "The total number of URLs currently in Google's index.:

 

  • The sitemap was submitted with 504 URLs, but Google reports indexing only 120. I can't see any way of viewing which URLs it has indexed from the Sitemaps page.

 

The sitemap page is showing this error: 

When we tested a sample of the URLs from your Sitemap, we found that some URLs were not accessible to Googlebot due to an HTTP status error. All accessible URLs will still be submitted.
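One quick way to see which sitemap URLs are returning HTTP errors is to fetch the sitemap and request each URL yourself. This is only a minimal sketch: the sitemap address below is a placeholder, not the client's actual file, and some servers reject HEAD requests, in which case a GET would be needed.

```python
# Minimal sketch: fetch an XML sitemap and report any URL that does not return HTTP 200.
# The sitemap URL below is a placeholder; substitute the client's actual sitemap.
import requests
import xml.etree.ElementTree as ET

SITEMAP_URL = "https://www.example.com/sitemap.xml"  # placeholder
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

sitemap = requests.get(SITEMAP_URL, timeout=10)
root = ET.fromstring(sitemap.content)
urls = [loc.text.strip() for loc in root.findall(".//sm:loc", NS)]

for url in urls:
    try:
        # Some servers reject HEAD; switch to requests.get() if that happens.
        resp = requests.head(url, allow_redirects=True, timeout=10)
        if resp.status_code != 200:
            print(resp.status_code, url)
    except requests.RequestException as exc:
        print("error", url, exc)
```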

 

First, is there a way to see which URLs from the sitemap have been indexed? And second, should I care?

 

Why is there such a discrepancy between the number of URLs currently in the index and those in the sitemap? I'm guessing the client's site has about 500 HTML files.

 

 

Thanks

 



#2 chrishirst


Posted 13 February 2014 - 06:42 PM

 

First, is there a way to see which URLs from the sitemap have been indexed? And second, should I care?

 

Not really, and no.



#3 Michael Martinez


Posted 14 February 2014 - 11:50 AM

The sitemap page is showing this error: 

When we tested a sample of the URLs from your Sitemap, we found that some URLs were not accessible to Googlebot due to an HTTP status error. All accessible URLs will still be submitted.

First, is there a way to see which URLs from the sitemap have been indexed? And second, should I care?


I'm going to disagree with Chris because of your other thread about filtering botnets.

If your server is suffering heavy enough botnet crawling, that DOES have a direct impact on search engines' ability to crawl and index your content. They will THROTTLE DOWN their crawl to reduce load on a server that their algorithms conclude is unresponsive.

Although you have not provided sufficient evidence to show that botnets are affecting your server's performance (all you have done is ask a couple of questions), it's worth looking into.

If there is an internal configuration problem with your server independent of the botnets, again it's worth looking into.

If the search engine is telling you it cannot crawl your site completely, THAT is an issue. Sometimes these issues are intermittent and go away, but I have found more and more over the past year or two that these kinds of reports are signs that something is amiss and should be found and fixed.

#4 ttw


Posted 14 February 2014 - 10:38 PM

Hi Michael:  

 

The bot-initiated JavaScript post was about another client, so the two posts are unrelated. 

 

For the client with the indexing irregularities described above, I do not have access to any server information. I was simply sitting with the client reviewing the keyword data in GWMTs when I stumbled on the sitemap/index differences and was stymied.

 

So, now that you know they are unrelated posts, do you have any suggestions about where to look next for the discrepancy between what Google says is in its index versus what's in the sitemap?

 

I have a sneaking feeling that their Drupal platform could be producing a lot of extra pages, which of course wouldn't be a good thing either. 

 

When I run a Screaming Frog report for HTML-only files, I see 556 URLs, which is about the number in the sitemap file. 

 

So if Google is saying there are only 120 indexed URLs when I see over 500 in Screaming Frog, then I have no idea what Google is doing.
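A rough way to compare the two lists is to diff the sitemap URL set against the crawl export. This is only a sketch: the file names and the "Address" column are assumptions about the Screaming Frog HTML export, not something verified against the client's data.

```python
# Minimal sketch: compare the URLs in the XML sitemap against a Screaming Frog
# HTML crawl export, to see which crawled pages are missing from the sitemap
# (e.g. extra pages generated by Drupal) and which sitemap URLs were not crawled.
# The file names and the "Address" column are assumptions about the export format.
import csv
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

with open("sitemap.xml", "rb") as f:
    sitemap_urls = {loc.text.strip()
                    for loc in ET.parse(f).getroot().findall(".//sm:loc", NS)}

with open("screaming_frog_html.csv", newline="") as f:
    crawled_urls = {row["Address"].strip() for row in csv.DictReader(f)}

extra = crawled_urls - sitemap_urls
missing = sitemap_urls - crawled_urls

print("Crawled but not in sitemap:", len(extra))
for url in sorted(extra):
    print("  ", url)

print("In sitemap but not crawled:", len(missing))
for url in sorted(missing):
    print("  ", url)
```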



#5 chrishirst


Posted 15 February 2014 - 09:20 AM

That's the way it is with ALL kinds of ranking checks; they are NTSRT (Never The Same Results Twice).
 
 
@Michael, my 'no' reply was specifically about finding indexed pages from the XML sitemap. If Googlebot (other search bots are available) appears in the site access logs and fetches the source of all the URLs, it really doesn't matter how it got there, just as long as it did.
 
What matters is whether it CAN find all the site URLs. How bots discover URLs is in the minutiae of marketing data; how people 'discover' the URLs is far, far more valuable as a site metric.
 
 

then I have no idea what Google is doing


Of course you do; that is what site access logs are for.
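As a rough sketch of what that log check might look like, assuming a standard combined-format Apache/Nginx access log: the log path is a placeholder, and matching on the user-agent string alone can be spoofed, so a reverse-DNS check is needed to confirm real Googlebot traffic.

```python
# Minimal sketch: count which URLs "Googlebot" has requested, from a
# combined-format access log. The log path is a placeholder, and user-agent
# matching alone can be spoofed; verify real Googlebot hits via reverse DNS.
import re
from collections import Counter

LOG_PATH = "access.log"  # placeholder
# combined log format: ip - - [date] "METHOD /path HTTP/1.1" status size "referrer" "user-agent"
LINE_RE = re.compile(
    r'"[A-Z]+ (?P<path>\S+) HTTP/[^"]+" (?P<status>\d{3}) .*"(?P<ua>[^"]*)"$'
)

hits = Counter()
with open(LOG_PATH, errors="replace") as log:
    for line in log:
        m = LINE_RE.search(line)
        if m and "Googlebot" in m.group("ua"):
            hits[(m.group("path"), m.group("status"))] += 1

for (path, status), count in hits.most_common(25):
    print(f"{count:6d}  {status}  {path}")
```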



#6 Michael Martinez


Posted 16 February 2014 - 12:32 AM

For what it's worth, I have never seen a 1-to-1 correlation between crawl and index data except on very small sites, and I have been doing this since 1998. With a large enough site you should expect pages to drop out of the index because they have low internal PageRank (not Toolbar PR). Eventually they are crawled again and added back (for a while). It's a continuous cycle.

This is part of what I call "Peanut Butter SEO", which is based on a metaphor Matt Cutts used to explain how Google uses PageRank. He said "you get only so much (internal) PageRank for your site" and "you spread it across the site (through your navigation) like spreading peanut butter on bread." So at some point most sites run out of internal PageRank and the search engine stops crawling/indexing.

In reality it's a little more complicated. They use multiple "crawl queues" which you can think of as long lists of page URLs to be fetched. Each queue feeds a spider that goes out and grabs a page, then returns to the queue for another URL. Somehow they use your internal PageRank to prioritize how many and how often URLs from your site are added to these queues.
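Purely as an illustration of that queue idea (this is not Google's implementation; the URLs and internal PageRank figures below are invented for the example), a toy prioritized crawl queue might look like this:

```python
# Illustrative sketch only (not Google's actual system): a crawl queue where
# URLs with higher internal PageRank are popped first, so deep, low-PR pages
# wait longer between fetches. The PageRank values are made-up examples.
import heapq

internal_pr = {
    "/": 1.00,                        # root gets crawled constantly
    "/products/": 0.40,
    "/products/widget-a": 0.08,
    "/archive/2009/old-post": 0.01,   # deep page, rarely fetched
}

# heapq is a min-heap, so push negative PageRank to pop the highest PR first.
queue = [(-pr, url) for url, pr in internal_pr.items()]
heapq.heapify(queue)

while queue:
    neg_pr, url = heapq.heappop(queue)
    print(f"fetch {url} (internal PR {-neg_pr:.2f})")
```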

If they still do "deep crawls", where the spider starts with a URL on your site (usually your root URL) and pulls links from each fetched page as it goes along, they apparently also use internal PageRank to set some limits.

Obviously they are not sharing many details with us, so we have to generalize about what happens; but I have analyzed many server logs. I can see multiple crawlers (each with its own IP address) hitting the root URL and some of the more important pages of the site all day long, but the deeper you go into the site, the fewer hits each page gets and the fewer crawlers try to fetch it.

In addition to crawl priorities, there are also lag times built into the process. Again the amount of delay affecting a crawl-to-index window may be associated with internal PageRank (but I believe there are other factors as well, such as frequency of publication on the host or site).

#7 chrishirst


Posted 16 February 2014 - 07:25 AM

 

For what it's worth, I have never seen a 1-to-1 correlation between crawl and index data except on very small sites,

True, but if any of Google's crawlers have requested, or are requesting, any URL from time to time, you can be absolutely sure that Google 'knows' the URL exists and has it in the crawl scheduler. Whether or not that URL is shown in search results, even for a "site:" search, is an entirely different matter from knowing whether the URL has been 'found' or how it was discovered.

The chances of getting an 'accurate' count of the URLs that Google has in its index using the search interface are somewhere between slim and none.





