The website mydomain.com currently has over 60,000 pages in Google's index. The problem is that 50,000 of those pages don't belong and I don't want them there. The objective is to get them removed as quickly as possible.
I don't want them because 32,000 of them are unrelated/off-topic to the theme of the website and all share the same Title and Description. The other 18,000 are related but also share the same Title and Description.
These URLs (as far as I know) were discovered through two search boxes, by Googlebot submitting random queries and spidering the results (stupid bot; let Matt know how stupid by sending him this post).
The problem URLs are www.mydomain.com/folder/page.htm?* (32,000) and www.mydomain.com/page.php?* (18,000).
Here's what I've done so far to try to get them removed:
For the 32,000 URLs:
1. Added Disallow: /folder/page.htm?* to robots.txt to deny crawling.
2. Set up a 301 from www.mydomain.com/folder/page.htm?* to www.mydomain.com/folder/results/page.htm?* so that I can block the new directory in robots.txt for future crawls.
3. Added Disallow: /folder/results/ to robots.txt to deny future crawling.
4. Requested URL removal for the directory /folder/results/ in WMT (this is probably useless, since these new URLs are not in the index).
5. Added meta robots noindex, nofollow, noarchive to the HTML. The noarchive was added in the hope that it will get the cached copy removed faster. (The robots.txt lines and the meta tag are sketched right after this list.)
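In case it helps anyone spot a mistake, this is roughly what those additions look like. The paths are of course placeholders for my real ones, and I'm showing them under User-agent: * for simplicity:

# robots.txt: block crawling of the old search-result URLs (action 1)
# and of the redirect target directory (action 3)
User-agent: *
Disallow: /folder/page.htm?*
Disallow: /folder/results/

And the meta tag (action 5), added to the head of the affected pages:

<meta name="robots" content="noindex, nofollow, noarchive">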
The ideal solution I was looking for was to ask WMT to remove URLs using a wildcard, for example Remove /folder/page.htm?*, but no such option exists.
Now I'm thinking I should remove action 1, because blocking the crawl means Googlebot may never follow the 301, and therefore never pick up the new blocked URL or update the cached copy. Action 2 could also be removed, since the noindex is there anyway, so the 301 is no longer needed and may only add delays. That would make action 3 useless as well. Hmmm... I just went ahead and removed actions 1 and 2, which leaves action 5 as the only one really doing anything.
For the 18,000 URLs:
1. Added Disallow: /page.php?* to robots.txt to deny crawling.
2. Set up a 301 from www.mydomain.com/page.php?* to www.mydomain.com/old-search/results.html so that I can block the new directory in robots.txt for future crawls.
3. Added Disallow: /old-search/ to robots.txt to deny future crawling. (A rough sketch of the redirect and the robots.txt entries follows this list.)
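For completeness, here is roughly how that looks on my end. The redirect is done at the server level; the snippet below is only a sketch assuming an Apache .htaccess with mod_rewrite, which may not match other setups:

# .htaccess: 301 any /page.php request that carries a query string (action 2);
# the trailing ? on the target drops the original query string
RewriteEngine On
RewriteCond %{QUERY_STRING} .
RewriteRule ^page\.php$ /old-search/results.html? [R=301,L]

And the robots.txt entries (actions 1 and 3):

Disallow: /page.php?*
Disallow: /old-search/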
I didn't bother with a URL removal request in WMT for that directory, since it doesn't even exist yet.
I'm not able to modify the HTML meta to add noindex, since it's part of a CMS that uses a global setting.
The next best thing I could come up with is to remove action 1 and then create an HTML sitemap with all 18,000 old URLs so that Google can update its index (that would be crazy). This might not even work, since action 3 may prevent it, and even if it did, the cached copies would probably remain for a while.
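If I ever went that route, generating the page itself would be trivial. A minimal sketch, assuming the 18,000 old URLs are sitting in a plain text file, one per line (urls.txt and sitemap.html are just made-up names for the example):

<?php
// Read the old URLs (one per line) and write a bare-bones HTML page of links
// so Googlebot can recrawl them and update its index.
$urls = file('urls.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$out = fopen('sitemap.html', 'w');
fwrite($out, "<html><body>\n");
foreach ($urls as $url) {
    $safe = htmlspecialchars($url, ENT_QUOTES);
    fwrite($out, "<a href=\"$safe\">$safe</a><br>\n");
}
fwrite($out, "</body></html>\n");
fclose($out);
?>

In practice it would have to be split across a bunch of smaller pages rather than one giant 18,000-link page.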
Hmmm... I guess I should probably remove action 1, because Googlebot may not be able to see that those URLs are now redirected. OK, I just did that.
Have I done the best I can for the 32,000? What about the 18,000?
So now that I've explained the issue (I hope it was clear), does anyone have a solution to offer that will allow the quick removal of these URLs? What else can I do for either set? Any guesses you have are welcome.
BTW: this is kind of urgent, because I have dropped to position 2 for my main term, which I had maintained at 1 for over a year. I believe it may be because my site profile has been compromised, with 80% of the structure being garbage and 60% off-topic, and it's all because Googlebot went looking for data it shouldn't be looking for.