Problems With Google WT And Crawling Non-existent Pages
Posted 14 March 2010 - 08:53 AM
I use WT with my site and submitted my sitemap to it when I first set the site up. All was fine in the beginning, but then I added a translator tool (a WordPress plugin) which automatically translated pages and created new posts and categories based on the translations. To cut a long story short, I decided to stop using the plugin. I updated and resubmitted the sitemap accordingly and disallowed the affected directories from being crawled via robots.txt (and the site has been recrawled many times since). Yes, I know I should probably have implemented a redirect at the same time.
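For what it's worth, the blocking part of the robots.txt looked roughly like this (the directory names below are just placeholders for the translated-language directories, not the exact entries):

User-agent: *
Disallow: /translated-lang-1/
Disallow: /translated-lang-2/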
Problem is, some time has passed, the sitemap has been resubmitted again (and recrawled, as you'd expect), yet WT is still looking for certain pages that are no longer there. WT is reporting 10 404s and says that the pages are linked to via other pages (which themselves don't exist). I should add that the pages are no longer indexed and haven't been for some time.
I'm assuming that the correct way to deal with this is to implement redirects, but I'm wondering if there's a reason why WT is still looking for these pages, despite a new sitemap and no links on existing pages to the non-existent pages? How should this be properly fixed?
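For the redirects, what I have in mind is something along these lines in .htaccess (assuming Apache with mod_alias, and with a placeholder path, so treat it as a sketch rather than the finished file):

Redirect 301 /old-translated-dir/ http://xxxxxxxxxxxxx.co.uk/

As I understand it, Redirect matches the path prefix, so anything under /old-translated-dir/ would be sent to the matching path at the root, which may or may not be the right target for every post.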
Posted 14 March 2010 - 10:38 AM
If I'm reading the above correctly, you excluded the pages that were going bye-bye via robots.txt so that they could no longer be spidered?
If that part is correct, I think you'll want to remove the robots.txt exclusion, at least for a time. It may not make a bit of difference, but with it in place you're creating a bit of a catch-22 for the spiders: robots.txt tells them not to visit those pages, and if they don't visit them they can never receive the 404 Not Found response that tells them the pages are gone. All they can do is fall back on whatever information they previously had on file about the missing pages.
It's kinda like expecting them not to look, but to see with perfect clarity.
Not that it's a huge issue, or that things will necessarily change once you let the spiders see what's what. Most engines are slow to remove 404 pages from their dataset anyway, erring on the side of assuming that pages which once existed will always exist. As long as you have a good custom 404 error page set up, it's not a big deal.
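If the 404 page isn't already sorted, on an Apache setup it's usually just one line in .htaccess, something like this (the path is only an example):

ErrorDocument 404 /404.php

With WordPress, the theme's 404.php template normally takes care of this on its own, so there's a fair chance it's already covered.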
Posted 14 March 2010 - 11:15 AM
I can't remember why I would have added those directories to robots.txt, but I've just noticed that I applied to Google some time ago to have them removed. Anyway, Google says that the removal request was denied because, according to their help pages:
I understand the logic of what you're suggesting, though.
None of this appears to have had an effect on my rankings or traffic, but I just like to maintain good housekeeping.
Posted 14 March 2010 - 12:40 PM
Posted 15 March 2010 - 12:38 PM
If I recall correctly, I think I might even have used Google's own robots.txt generator.
Posted 15 March 2010 - 08:23 PM
Posted 16 March 2010 - 03:51 PM
Well, the /de, /pl, /pt and /it directories are the ones that Google isn't removing, even though the directories themselves no longer exist. By contrast, it doesn't seem to want to crawl directories like /fr and /nl, for example.
Posted 17 March 2010 - 10:17 AM
Is it /de, /pl, /pt and /it you want removed? Or /De, /Pl, /Pt and /It?
You have one in your response and the other in your robots.txt, and because of the capitalization they are two entirely different locations as far as Google is concerned.
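To illustrate (these lines are made up for the example, not taken from your actual file):

Disallow: /de/
# blocks /de/... but does nothing about /De/...
Disallow: /De/
# a separate rule like this would be needed for the capitalized version

So it's worth checking which casing the old URLs actually used.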
Posted 17 March 2010 - 01:02 PM
I know you can put the Sitemap line anywhere in the file and it will be recognized, but I wonder if that can cause a problem by separating an Allow or Disallow line from the User-agent line it should be associated with. For example, if the third and fourth lines in the code above were switched, I know it would work. A robot would read it as: "For all user-agents, allow everything, but disallow /wp-admin/. There is a sitemap file at xxxxxxxxxxxxx.co.uk/sitemap.xml."
But because of the way the code above is ordered, I wonder if a robot might read it as "For all user-agents, allow everything. There is a sitemap file at xxxxxxxxxxxxx.co.uk/sitemap.xml. Somebody is disallowed from /wp-admin/, but I don't know who."
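Spelled out, the two orderings I mean look roughly like this (paraphrasing the file rather than quoting it exactly):

As it stands:

User-agent: *
Allow: /
Sitemap: http://xxxxxxxxxxxxx.co.uk/sitemap.xml
Disallow: /wp-admin/

And with the third and fourth lines switched, which I know works:

User-agent: *
Allow: /
Disallow: /wp-admin/
Sitemap: http://xxxxxxxxxxxxx.co.uk/sitemap.xml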