
Problems With Google WT And Crawling Non-existent Pages


8 replies to this topic

#1 jsp1983

    HR 3

  • Active Members
  • 98 posts
  • Location:Blackpool, UK

Posted 14 March 2010 - 08:53 AM

I've got what seems to be a straightforward issue, but I can't see the straightforward way to fix it!

I use Google Webmaster Tools (WT) with my site and submitted my sitemap to it back when I first started. All was fine in the beginning, but then I added a translator tool (a WordPress plugin) which would automatically translate pages and create new posts and categories based on the translations. To cut a long story short, I decided to stop using the plugin. I updated and resubmitted the sitemap accordingly and disallowed the affected directories from being crawled via robots.txt (and the site has been recrawled many times since). Yes, I know I should probably have implemented a redirect at the same time.

Problem is, some time has passed, the sitemap has been resubmitted again (and recrawled, as you'd expect), yet WT is still looking for certain pages that are no longer there. WT is reporting 10 404s and says that the pages are linked to via other pages (which themselves don't exist). I should add that the pages are no longer indexed and haven't been for some time.

I'm assuming that the correct way to deal with this is to implement redirects, but I'm wondering if there's a reason why WT is still looking for these pages, despite a new sitemap and no links on existing pages to the non-existent pages? How should this be properly fixed?

#2 Randy

    Convert Me!

  • Moderator
  • 17,540 posts

Posted 14 March 2010 - 10:38 AM

Let me make sure I understand one part of the equation...

If I read correctly above, you excluded the pages that were going bye-bye via robots.txt so that they could no longer be spidered?

If that part is correct, I think you'll want to remove the robots.txt exclusion, at least for a time. It may not make a bit of difference, but with it you're creating a bit of a catch-22 for the spiders: your robots.txt tells them not to visit those pages, and if they don't visit those pages they never receive the 404 Not Found response that shows the pages are gone, so all they can do is fall back on the previous information they had on file about the missing pages.

It's kinda like expecting them not to look, but to see with perfect clarity. ;)

Not that it's a huge issue, or that things will necessarily change once you let the spiders see what's what. Most engines are slow to remove 404 pages from their dataset anyway, erring on the side of assuming that once-existing pages will always exist. As long as you have a good custom 404 error page set up, it's not a big deal.
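
If you want to see for yourself what those missing URLs are sending back right now, something like the quick script below will print the raw status codes. This is only a rough sketch (Python 3, with made-up URLs): swap in the ten paths WT is listing, plus a deliberately nonsense path so you can confirm your custom error page sends a genuine 404 status rather than a "soft 404" (a 200 with an error message on it).

CODE
# Rough sketch: print the HTTP status code each old URL currently returns.
# The URLs below are placeholders -- substitute the paths WT reports as 404s.
import urllib.error
import urllib.request

urls = [
    "http://www.example.co.uk/De/old-translated-post/",
    "http://www.example.co.uk/It/another-old-post/",
    "http://www.example.co.uk/this-page-never-existed/",
]

for url in urls:
    try:
        # urlopen follows redirects, so a 301 shows up as the destination's status
        response = urllib.request.urlopen(url)
        print(url, "->", response.getcode())  # 200 for a missing page = soft 404
    except urllib.error.HTTPError as err:
        print(url, "->", err.code)            # 404 (or 410) is what the spiders need to see
    except urllib.error.URLError as err:
        print(url, "->", "no response:", err.reason)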

#3 jsp1983

    HR 3

  • Active Members
  • 98 posts
  • Location:Blackpool, UK

Posted 14 March 2010 - 11:15 AM

Thanks for replying, Randy.

I can't remember why I would have added those directories to robots.txt, but I've just noticed that I applied to Google some time ago to have them removed. Anyway, Google says the request was denied because, according to their help pages:

QUOTE
Your request has been denied because the content hasn't been blocked with the appropriate robots.txt directive or metatags to block us from indexing or archiving this page.


I understand the logic of what you're suggesting, though.

None of this appears to have had an effect on my rankings or traffic, but I just like to maintain good housekeeping.

#4 Jill

    Recovering SEO

  • Admin
  • 32,774 posts

Posted 14 March 2010 - 12:40 PM

It sounds like perhaps you have an error in your robots.txt file and are not actually blocking Google.

#5 jsp1983

    HR 3

  • Active Members
  • 98 posts
  • Location:Blackpool, UK

Posted 15 March 2010 - 12:38 PM

This is what I've got in my robots.txt:

CODE
User-Agent: *
Allow: /
Sitemap: http://www.xxxxxxxxxxxxx.co.uk/sitemap.xml
Disallow: /wp-admin/
Disallow: /wp-content/themes/
Disallow: /wp-includes/
Disallow: /category/
Disallow: /page/
Disallow: /feed/
Disallow: /comments/
Disallow: /De
Disallow: /Fr
Disallow: /It
Disallow: /Es
Disallow: /Nl
Disallow: /Pt
Disallow: /Pl


If I recall correctly, I think I might even have used Google's own robots.txt generator.

#6 Jill

    Recovering SEO

  • Admin
  • 32,774 posts

Posted 15 March 2010 - 08:23 PM

That doesn't mean anything to us as we don't know which directories you want to remove and which you don't.

#7 jsp1983

    HR 3

  • Active Members
  • 98 posts
  • Location:Blackpool, UK

Posted 16 March 2010 - 03:51 PM

QUOTE(Jill @ Mar 16 2010, 01:23 AM)
That doesn't mean anything to us as we don't know which directories you want to remove and which you don't.


Well, the /de, /pl, /pt and /it directories are the ones that Google isn't removing, even though the directories themselves no longer exist. It doesn't seem to be trying to crawl directories like /fr and /nl, for example.

#8 Randy

    Convert Me!

  • Moderator
  • 17,540 posts

Posted 17 March 2010 - 10:17 AM

To clarify...

Is it /de, /pl, /pt and /it you want removed? Or /De, /Pl, /Pt and /It?

You have one form in your reply and the other in your robots.txt, and because of the capitalization they are two entirely different locations as far as Google is concerned.
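
If it helps to see the difference concretely, Python's standard-library robots.txt parser shows the same case-sensitive matching. This is just an illustrative sketch with a made-up domain, not Google's own parser, but robots.txt paths are matched case-sensitively either way:

CODE
# Minimal sketch: a rule written as "Disallow: /De" does not cover URLs under /de.
from urllib import robotparser

rules = [
    "User-Agent: *",
    "Disallow: /De",
    "Disallow: /It",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("*", "http://www.example.co.uk/De/page/"))  # False -- blocked
print(rp.can_fetch("*", "http://www.example.co.uk/de/page/"))  # True  -- not blocked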

#9 qwerty

    HR 10

  • Moderator
  • 8,547 posts
  • Location:Somerville, MA

Posted 17 March 2010 - 01:02 PM

I'm curious about something in the code:
CODE
User-Agent: *
Allow: /
Sitemap: http://www.xxxxxxxxxxxxx.co.uk/sitemap.xml
Disallow: /wp-admin/

I know you can put the Sitemap line anywhere in the file and it will be recognized, but I wonder if that can cause a problem by separating an Allow or Disallow line from the User-Agent line it should be associated with. For example, if the third and fourth lines in the code above were switched, I know it would work. A robot would read it as "For all user-agents, allow everything, but disallow /wp-admin/. There is a sitemap file at xxxxxxxxxxxxx.co.uk/sitemap.xml."

But because of the way the code above is ordered, I wonder if a robot might read it as "For all user-agents, allow everything. There is a sitemap file at xxxxxxxxxxxxx.co.uk/sitemap.xml. Somebody is disallowed from /wp-admin/, but I don't know who."
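
For whatever it's worth, the parser in Python's standard library follows the sitemaps.org note that the Sitemap directive is independent of its position, so it keeps the Disallow attached to the User-Agent: * group even with the Sitemap line sitting between them. That doesn't tell us how every crawler reads it, which is really the question here. The sketch below (Python 3.8+ for site_maps(), made-up domain) leaves out the Allow: / line only because this particular parser applies rules in the order they appear rather than by specificity, which would hide the Disallow in the test:

CODE
# Rough sketch: does a Sitemap line in the middle of a group detach the
# Disallow rules that follow it? This parser says no -- the Sitemap is
# recorded separately and the group stays intact.
from urllib import robotparser

rules = [
    "User-Agent: *",
    "Sitemap: http://www.example.co.uk/sitemap.xml",
    "Disallow: /wp-admin/",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.site_maps())  # ['http://www.example.co.uk/sitemap.xml']
print(rp.can_fetch("*", "http://www.example.co.uk/wp-admin/"))  # False -- still disallowed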



