We have a pair of websites on two separate domains, one English and one French. Google is crawling regularly, but it is indexing pages that do not exist. For example, on the English site, Google has indexed a page with the following title: Restaurant G%252525252525u00e9rant%2525252525252fe resto rapide ... The %252525252525u00e9 stands for a single accented French character from the equivalent French page on the other domain. The same character problem also shows up in the URL for the page.
Is there any tool that can help us simulate what Google might be seeing and where it goes astray?
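One way to see what that title string is made of: the repeated "25" runs look like a percent sign that has been percent-encoded over and over (% becomes %25, then %2525, and so on). Here is a minimal Python sketch, not a Google simulator, that peels those layers off one pass at a time. The function name `peel_percent_encoding` is made up for illustration.

```python
from urllib.parse import unquote

def peel_percent_encoding(text, max_rounds=20):
    """Decode repeatedly until the string stops changing.

    Returns (decoded_text, rounds), where rounds counts the decode passes
    that actually changed the string.
    """
    rounds = 0
    while rounds < max_rounds:
        decoded = unquote(text)
        if decoded == text:
            break  # nothing left to decode
        text = decoded
        rounds += 1
    return text, rounds

# Fragment copied from the indexed title in the first post.
fragment = "G%252525252525u00e9rant"
print(peel_percent_encoding(fragment))
```

What is left after the last pass is `%u00e9`, which appears to be a non-standard `%uXXXX`-style escape for é rather than the UTF-8 form, so a standard decoder leaves it alone. The number of passes hints at how many times something re-encoded the URL along the way.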
Google Indexing Non-existent Pages
Started by amabaie, May 15 2008 09:39 PM
5 replies to this topic
#1
Posted 15 May 2008 - 09:39 PM
#2
Posted 15 May 2008 - 09:59 PM
Someone else just asked a similar question. As Randy says, I would recommend not using any special characters in URLs. Having them in your URLs does not help, and here it obviously does harm.
Time to start 301'ing?
Edit: seeing as you probably know more about this than I do, I'm sure someone else has a more direct approach to dealing with your problem. Sorry!
Edited by ozaark, 15 May 2008 - 10:08 PM.
#3
Posted 15 May 2008 - 10:57 PM
First things first...
What status code does the server deliver when this funky page gets requested? I have to assume it's something other than a 404 Not Found, otherwise it wouldn't be indexed. The best long-term fix is to make sure non-existent pages actually deliver a proper 404 Not Found status code, rather than doing something else that ends up as a 200 OK response.
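If you want to check the status code without a browser add-on, a rough Python sketch like this would do. The function names and the `content_marker` idea (some string that only appears on real pages) are my own assumptions, not anything specific to your site.

```python
import urllib.error
import urllib.request

def fetch_status(url, timeout=10):
    """Return (status_code, body_text) for a GET request.

    Note: urlopen follows redirects, so this reports the final response.
    """
    request = urllib.request.Request(url, headers={"User-Agent": "status-check-sketch"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as exceptions but still carry a code and body.
        return err.code, err.read().decode("utf-8", errors="replace")

def classify(status, body, content_marker):
    """Label a response; content_marker is text found only on real pages."""
    if status in (404, 410):
        return "not-found"
    if status == 200:
        return "ok" if content_marker in body else "soft-404"
    return "other"

# Usage (hypothetical URL and marker, replace with your own):
# status, body = fetch_status("http://www.example.com/some-funky-url")
# print(status, classify(status, body, "<div id=\"listing\">"))
```

A "soft-404" result would mean the server says 200 OK while serving an empty template, which is the situation that lets these pages get indexed.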
After this question/suggestion, I guess the issue comes down to figuring out how Google is finding the page in the first place. I'm not sure if Xenu would show it or not, but it's free so I'd give that a whirl if it were me. Also, do you provide an XML feed to Google? Or an RSS feed? I'm just wondering if something might have crept into one of those where the character conversion didn't happen exactly smoothly.
Do you have access to your raw log files by chance? If so, I'd dig through them a bit, searching for parts of the weird strings you're seeing in the urls. If you can find a hit by a real user instead of a bot you might be able to see referrer info. It's a bit of a crapshoot, but shouldn't take that long if you use a search/find instead of trying to parse through the log file line by line.
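For the log search, something along these lines could work as a starting point. It is only a sketch and assumes your server writes the common "combined" log format with referrer and user agent in quotes. The bot hints are a crude guess, and the sample lines are invented.

```python
import re

# Assumes combined log format: ... "GET /path HTTP/1.1" 200 1234 "referrer" "user agent"
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Crude hints for spotting crawlers by user agent (an assumption, not a full list).
BOT_HINTS = ("bot", "crawl", "spider", "slurp")

def find_human_hits(lines, needle):
    """Return (path, status, referrer) for non-bot requests whose path contains needle."""
    hits = []
    for line in lines:
        match = LOG_LINE.search(line)
        if match is None or needle not in match.group("path"):
            continue
        agent = match.group("agent").lower()
        if any(hint in agent for hint in BOT_HINTS):
            continue  # skip crawler traffic, we want real visitors
        hits.append((match.group("path"), match.group("status"), match.group("referrer")))
    return hits

# Invented sample lines; in practice pass open("access.log") instead.
sample = [
    '1.2.3.4 - - [15/May/2008:21:00:00 +0000] "GET /G%2525u00e9rant HTTP/1.1" 200 5120 '
    '"http://example.com/links.html" "Mozilla/5.0 (Windows; U)"',
    '5.6.7.8 - - [15/May/2008:21:05:00 +0000] "GET /G%2525u00e9rant HTTP/1.1" 200 5120 '
    '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]
for hit in find_human_hits(sample, "%2525"):
    print(hit)
```

Searching for "%2525" rather than the whole string should catch any URL that has been encoded more than once, whatever the accented character was.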
hmm... Let me sleep on it. Maybe something more useful will come to me in a dream.
Hey, stop laughing! Most of my best ideas have come to me in the middle of the night.
#4
Posted 16 May 2008 - 07:23 AM
The resulting page is essentially the template for the site with only the URL and title tag different, but nothing in the content section. Perhaps a 404 solution might be the answer. I'll have the programmer check the XML feed.
The thing is that this error is not happening only on the French pages on the French domain. If it were, we could attribute the problem to the way the accents are being rendered. The fact that French title tags and URLs are showing up at all on the English domain is problematic. MSN and Yahoo are not having a problem with this, so perhaps it is the XML feed (which I think we submitted only to Google).
Regarding special characters in URL strings, we have them written like G%c3%a9rant (for Gérant), for example.
Thanks for the ideas.
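For anyone comparing the two forms, here is a small Python sketch of how the G%c3%a9rant style mentioned above relates to the %25 runs in the indexed title. It only illustrates standard percent-encoding and is an assumption about the mechanism, not a diagnosis of what our code or the feed is actually doing.

```python
from urllib.parse import quote, unquote

word = "Gérant"

# Standard percent-encoding writes each UTF-8 byte of é as %XX.
encoded = quote(word)
print(encoded)                         # G%C3%A9rant
print(unquote("G%c3%a9rant") == word)  # True (hex digits are case-insensitive)

# Encoding an already encoded string escapes the % signs themselves,
# which is where runs like %2525... would come from.
twice = quote(encoded)
print(twice)                           # G%25C3%25A9rant
```

If any step in the chain (template, feed generator, redirect) encodes a URL that is already encoded, each pass adds another "25".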
#5
Posted 16 May 2008 - 08:28 AM
I'd probably try to keep it simple and introduce an error routine into the template code if it were me. If you can work out something that throws a 404 Not Found header status when the page doesn't exist it should handle all of those, even if a link to a non-existent page pops up somewhere.
This way you could apply one solution that will cover the bases across all domains and all languages in one fell swoop. As far as the search engines are concerned it won't really matter what shows up on the visible page, as long as there's a 404 in the mix.
Then if they don't drop out of the index you can also use the URL Removal Tool at Google. It should accept the strange urls, since they deliver a proper 404.
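To make the error-routine idea concrete, here is a minimal sketch written as a plain Python WSGI app. I don't know what your site runs on, so treat the page table, template, and names as hypothetical. The point is only that the same template can be sent with a 404 status when the lookup fails.

```python
# Hypothetical lookup table standing in for the site's real page database.
KNOWN_PAGES = {
    "/": "Home",
    "/restaurant/gerant": "Restaurant manager listings",
}

TEMPLATE = "<html><head><title>{title}</title></head><body>{body}</body></html>"

def app(environ, start_response):
    """Serve known pages with 200 OK and everything else with 404 Not Found."""
    path = environ.get("PATH_INFO", "/")
    title = KNOWN_PAGES.get(path)
    if title is None:
        # Same site template, but the status line tells crawlers the page is missing.
        status = "404 Not Found"
        html = TEMPLATE.format(title="Page not found", body="Sorry, that page does not exist.")
    else:
        status = "200 OK"
        html = TEMPLATE.format(title=title, body="Content for %s goes here." % title)
    payload = html.encode("utf-8")
    start_response(status, [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(payload))),
    ])
    return [payload]
```

The visitor still sees a friendly page in the normal site design, while a crawler requesting one of the mangled URLs gets the 404 it needs to drop the page.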
#6
Posted 31 May 2008 - 01:37 AM
One thing to check is to see if your site got hacked. I have a website that I use for web development beta testing and I set a few directories to 777 and forgot about it. I did a search tonight on Google and noticed that the website got hacked....big time. Something to check for. You can tell pretty easily by hitting the 'cached' link in the search results.