Url Removal Problem


26 replies to this topic

#1 Dantek

Dantek

    HR 2

  • Members
  • 14 posts

Posted 04 September 2008 - 05:09 PM

I have a problem with my website and am looking for advice on the quickest way to resolve it. Allow me to explain the issue and what I've done so far.

The website mydomain.com currently has over 60000 pages in Google's index. The problem is that 50,000 of those pages don't belong there and I don't want them indexed. The objective is to get them removed as quickly as possible.

I don't want them because 32000 of them are unrelated/off-topic to the theme of the website and all have the same Title and Description. The other 18000 are related but also have the same Title & Description.

These URLs (as far as I know) were discovered by Googlebot submitting random queries through two search boxes and spidering the results (stupid bot, let Matt know how stupid by sending him this post).

The problem URLs are www.mydomain.com/folder/page.htm?* (32000) and www.mydomain.com/page.php?* (18000).

Here's what I've done so far to try to get them removed:

For the 32000 URLs:

1. Added Disallow: /folder/page.htm?* to robots.txt to deny crawling.
2. Set up a 301 from www.mydomain.com/folder/page.htm?* to www.mydomain.com/folder/results/page.htm?* so that I can block the new directory in robots.txt for future crawls.
3. Added Disallow: /folder/results/ to robots.txt to deny future crawling.
4. Requested URL removal for the directory /folder/results/ in WMT (this should be useless, since these new URLs are not in the index).
5. Added meta robots noindex, nofollow, noarchive to the HTML. The noarchive was added in the hope that it will remove the cached copy faster.

The ideal solution I was looking for was to ask WMT to remove URLs using a wildcard, for example: Remove /folder/page.htm?*, but no such option exists.

Now I'm thinking I should remove action 1, because while crawling is blocked Googlebot may never follow the 301, and therefore never update to the new blocked URL or refresh the cached copy. Action 2 could then be removed as well: since I also added noindex, the 301 is no longer needed and may only cause more delays. Action 3 would then be useless. Hmmm...I just went ahead and removed actions 1 & 2, leaving only action 5 in effect.
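
To make the robots.txt reasoning above concrete, here is a minimal Python sketch of how a Google-style Disallow pattern containing `*` could be matched against a URL. The function names are hypothetical and the matching is a simplification, not Googlebot's actual implementation. It only illustrates the point of the paragraph above: a URL that matches a Disallow rule is not fetched, so the crawler never gets to see a 301 or a noindex tag served at that URL.

```python
import re

def disallow_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    Assumption: '*' matches any run of characters, a trailing '$'
    anchors the end of the URL, and everything else is literal.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

def is_blocked(path_and_query, disallow_patterns):
    """True if any Disallow pattern matches from the start of the URL path."""
    return any(disallow_to_regex(p).match(path_and_query)
               for p in disallow_patterns)

# The two rules from actions 1 and 3 in the post.
rules = ["/folder/page.htm?*", "/folder/results/"]

for url in ["/folder/page.htm", "/folder/page.htm?q=ains",
            "/folder/results/page.htm?q=ains"]:
    state = "blocked (301/noindex never seen)" if is_blocked(url, rules) else "crawlable"
    print(url, "->", state)
```

Under these assumptions the base page /folder/page.htm stays crawlable, while every query-string variant is blocked and therefore cannot be re-fetched to pick up the redirect or the noindex.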

For the 18000 URLs:

1. Added Disallow: /page.php?* to robots.txt to deny crawling.
2. Set up a 301 from www.mydomain.com/page.php?* to www.mydomain.com/old-search/results.html so that I can block the new directory in robots.txt for future crawls.
3. Added Disallow: /old-search/ to robots.txt to deny future crawling.

Didn't bother with URL removal in WMT for the directory, since it doesn't even exist yet.
I am not able to modify the HTML meta to add noindex, since the page is part of a CMS that uses a global setting.

The next best thing I could come up with is to remove action 1 and then create an HTML sitemap with all of the old 18000 URLs so that Google can update its index (that would be crazy). Also, this might not even work, since action 3 may prevent it. Even if it did, the caches will probably remain for a while.

Hmmm...I guess I should probably remove action 1, because Googlebot may not be able to see that those URLs are now redirected. OK, I just did that.

Have I done the best thing I could do for the 32000? What about the 18000?

So now that I've explained what the issue is (I hope it was clear), does anyone have a solution to offer that will allow the quick removal of these URLs? What else can I do for either set? Any guesses you have are welcome.

BTW: This is kind of urgent because I have dropped to position 2 for my main term that I have maintained at 1 for over a year now. And I believe it may be because my site profile has been compromised with 80% of the structure being garbage and 60% off-topic. And it's all because of Googlebot looking for more data that it shouldn't be looking for.

Edited by Dantek, 04 September 2008 - 05:31 PM.


#2 Randy

Randy

    Convert Me!

  • Moderator
  • 17,540 posts

Posted 04 September 2008 - 07:20 PM

Welcome, Dantek!

A couple of questions for you.

Are the /folder/page.htm and /page.php files both only used to return results of your site search?

If so, you may still be able to remove those pages via WMT. I've never tried this myself, having never run into a situation like you have, but I'm not sure you'd need the wildcard element if you removed the base page. I'm not positive, but Google may not need a wildcard or each querystring, so it might be worth a shot.

As another possibility, you could also simply move these result pages to a different filename or location, then let the old location/filename deliver a proper 404 Not Found response. This wouldn't require any robots.txt entries because the files would really be gone. Though you'd still probably want to use the removal tool since old, gone pages tend to hang around in the supplemental index for a long, long time.

Looking out for the future, have you considered blocking the page where the searches are conducted from the spiders via robots.txt? I'm assuming here that you have a page where people enter in their search data. You'd certainly want to robots.txt any results page for the future.

FWIW, I totally agree that Googlebot shouldn't be doing the stuff they're doing with site search appliances. Especially not with the random text schtuff. That said, I doubt it's going to stop. Especially considering you can now go directly to a search page via their new Chrome browser by starting to type stuff in its omnibox. I've tested it a couple of times on Amazon and it is pretty cool from the user's perspective. But for webmasters this sort of thing is going to turn into a massive headache.

#3 Dantek

Dantek

    HR 2

  • Members
  • 14 posts

Posted 04 September 2008 - 10:41 PM

Randy, thank you for your response.

The /folder/page.htm is an important page that ranks very well, with 10000 pageviews a day all on its own. When a search is performed, it outputs /folder/page.htm?*. Unfortunately, those pages have been indexed. In this case I cannot do the WMT removal on the base page (good idea). Going forward, I have already added a noindex,nofollow,noarchive tag. I can also use a wildcard in robots.txt or redirect them to /folder/results/page.htm?*.

Moving these pages to a new location and returning a 404 doesn't seem feasible (db query generated pages). Although these pages are currently showing a noindex, so that should do it as well. The problem is how do I get Google to kill them from their index, or update their index with the new noindex page (like you said, they tend to hang around). Should I 301 them to a new location? Should I add the wildcard robots.txt entry? Should I not do any of those and leave them to be rediscovered with the noindex? (They may never be re-spidered, since many are random queries.)

For the /page.php, I don't need the base page indexed. Also, this search page has been replaced by Google Custom Search on a different URL. So trying to remove just the base is definitely worth a try (I'll let you know). Hmmm, this base page on its own also 301's to the home page. What if removing it would also remove the home page...yikes. I don't think so, but that's a scary thought. Anyone have experience with this one? (removing a 301'd URL)

Yeah, the random queries are the cherry on top. I have blank pages indexed with queries like: ains, gro, ario, studen, etc...

#4 Randy

Randy

    Convert Me!

  • Moderator
  • 17,540 posts

Posted 05 September 2008 - 07:13 AM

QUOTE
For the /page.php ... Anyone have experience with this one? (removing a 301'd url)


In theory the home page shouldn't be removed if you remove the page that's doing the 301 redirect to the home page. But I'd hate to rely on theory! I'm with you on this one. I think I'd probably remove the 301 before requesting removal, just to be safe.

On the page.htm location, that's going to be a tough one! If I understand you correctly, /page.htm is a valid page that needs to stay indexed, but anything with page.htm? needs to be de-indexed. Right?

A further question, if real users go to this page with a query string do they actually get something that might be useful for them? If not, I'd probably simply set up a scripted 301 that detects a query string. When no query string is present nothing happens. When one is present do a 301 redirect back to the same base page address without any query string.
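
As a rough illustration of the scripted 301 described above, here is a minimal Python sketch. The function name `respond` and the (status, headers) return shape are assumptions made for the example; a real site would do this in whatever language or server config generates the search pages.

```python
from urllib.parse import urlsplit

def respond(url):
    """Return (status, headers) for a request to `url`.

    Hypothetical handler: with no query string nothing happens (200);
    with a query string, 301 back to the same path without it.
    """
    parts = urlsplit(url)
    if parts.query:
        # Query string present: redirect to the bare base page address.
        return 301, {"Location": parts.path}
    # No query string: serve the page as normal.
    return 200, {}

print(respond("/folder/page.htm"))
print(respond("/folder/page.htm?q=ains"))
```

Note that this version redirects every visitor who arrives with a query string, so it only suits the case where those pages have no value to real users.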

The reason I'd do this is oddly enough I've found that 301'd pages actually get removed from the index more quickly than 404's do. I know there's no sense in this concept, but I've seen it enough times to know it happens.

This idea of course won't work if real users get to valuable pages at /page.htm?whatever. I hate to say it, but this is one of those times where I might be tempted to perform a bit of user agent detection and serve Googlebot a 301 in certain circumstances while letting real users through as normal. The reason I hate to even think about it is this is most definitely UA cloaking. But it's a good kind of cloaking since it's attempting to clean up the mess Googlebot made all on its own.



#5 Dantek

Dantek

    HR 2

  • Members
  • 14 posts

Posted 05 September 2008 - 10:34 AM

QUOTE(Randy @ Sep 5 2008, 07:13 AM)
In theory the home page shouldn't be removed if you remove the page that's doing the 301 redirect to the home page. But I'd hate to rely on theory! I'm with you on this one. I think I'd probably remove the 301 before requesting removal, just to be safe.

Good idea, I don't think removing the 301 on the home page poses a problem and it is much safer.

QUOTE(Randy @ Sep 5 2008, 07:13 AM)
On the page.htm location, that's going to be a tough one! If I understand you correctly, /page.htm is a valid page that needs to stay indexed, but anything with page.htm? needs to be de-indexed. Right?

Right on.

QUOTE(Randy @ Sep 5 2008, 07:13 AM)
A further question, if real users go to this page with a query string do they actually get something that might be useful for them? If not, I'd probably simply set up a scripted 301 that detects a query string. When no query string is present nothing happens. When one is present do a 301 redirect back to the same base page address without any query string.

The reason I'd do this is oddly enough I've found that 301'd pages actually get removed from the index more quickly than 404's do. I know there's no sense in this concept, but I've seen it enough times to know it happens.

This idea of course won't work if real users get to valuable pages at /page.htm?whatever. I hate to say it, but this is one of those times where I might be tempted to perform a bit of user agent detection and serve Googlebot a 301 in certain circumstances while letting real users through as normal. The reason I hate to even think about it is this is most definitely UA cloaking. But it's a good kind of cloaking since it's attempting to clean up the mess Googlebot made all on its own.

Yes, real users get search results that are useful, except for bad queries that give zero results. My feeling on 301's vs 404's is the same. I can easily 301 them into a directory blocked by robots.txt and the users would still get their data. Right now they all have noindex in the meta. The real question is which method would be most effective in speeding up the removal (redirect, noindex, block existing pages in robots.txt, etc.).

This is one of those cases that would require a Google engineer to confirm what would work best, unless someone out there has actually experienced the different effects (and that nothing has changed in the algo since the experience).



#6 Jill

Jill

    High Rankings Advisor

  • Admin
  • 32,379 posts

Posted 05 September 2008 - 10:37 AM

It's good that you want to fix the problem, but it's got nothing to do with this:

QUOTE
BTW: This is kind of urgent because I have dropped to position 2 for my main term that I have maintained at 1 for over a year now. And I believe it may be because my site profile has been compromised with 80% of the structure being garbage and 60% off-topic. And it's all because of Googlebot looking for more data that it shouldn't be looking for.


#7 Dantek

Dantek

    HR 2

  • Members
  • 14 posts

Posted 05 September 2008 - 11:37 AM

Jill, nice to hear from you.

Just to clarify, are you saying that a website that has 60000 pages where:
- 12000 are good unique pages
- 18000 are search results scraped from the 1000 good pages (many have no results) and all have the same title & description
- 32000 are search results with external content that is unrelated to the site and all have the same title & description
- 40000 of these bad pages have been added to the index over the course of 2 weeks (last month my site index was only 20000)

have nothing to do with hurting my serps, or have nothing to do with losing a #1 spot that I've steadily had for 1 1/2 years?

Or are you saying that the serp change has to do with something that google has changed in their algo?

BTW: As far as I can tell, the competitor who's been at #2, and is now at #1, doesn't belong there. So if my profile change over the past 2 weeks has nothing to do with it, then Google's algo is making a bad call (IMO).

#8 Randy

Randy

    Convert Me!

  • Moderator
  • 17,540 posts

Posted 05 September 2008 - 01:12 PM

Hey Dantek, I think I may have just had a brain twist. See if this'll possibly work...

The pages you want to dump on the page.htm?* page are those instances where it's a crappy query that returns zero results, right?

What about handling that on the back end instead of the front end? As in code your search results so that if the query/search results in a zero return you slap a different scripted server response in there? Perhaps a 301 back to the base page.htm address, adding to it text on the screen that says the search was unsuccessful (for real users) or simply deliver the page at the address that normally gets produced with the normal text but send a 404 Not Found scripted header for Googlebot's consumption?

I would think either one of those should help with Googlebot's craziness without impacting users. You can still produce whatever you want users to see in the text, but the server response could be used to tell Googlebot to bugger off.
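
A minimal Python sketch of the second option above (normal on-screen text for users, but a 404 status when the search comes back empty). The function name `search_response` and its inputs are hypothetical; the site's real search script would decide the status wherever it already knows the result count.

```python
def search_response(query, results):
    """Return (status, body) for a site search.

    Assumption: `results` is a list of result lines already produced by
    the search backend. An empty list yields a 404 status while the body
    still carries a readable message for human visitors.
    """
    if results:
        return 200, "\n".join(results)
    body = f"No results found for '{query}'. Please try another search."
    return 404, body

print(search_response("studen", []))
print(search_response("student", ["/courses/student-guide.htm"]))
```

The visitor sees the same kind of page either way; only the status code in the response changes for zero-result queries.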

Make sense?

#9 Dantek

Dantek

    HR 2

  • Members
  • 14 posts

Posted 05 September 2008 - 06:19 PM

QUOTE(Randy @ Sep 5 2008, 01:12 PM)
The pages you want to dump on the page.htm?* page are those instances where it's a crappy query that returns zero results, right?

Not really. It's mixed because the crappy query page also outputs links with other search suggestions which are good pages, so those got spidered as well.

QUOTE(Randy @ Sep 5 2008, 01:12 PM)
Make sense?

I think so (not a programmer), but in this case, I guess it doesn't apply.

I'm still trying to determine the best course of action with the options that are currently known or available.
- redirect to a denied directory
- deny existing urls in robots
- don't deny or redirect and allow Googlebot to pick up the noindex

Jill, I would love to get some details on why my serp change has nothing to do with this. Not because I disagree (somewhat), but so that I can learn from it.


#10 Jill

Jill

    High Rankings Advisor

  • Admin
  • 32,379 posts

Posted 05 September 2008 - 06:40 PM

QUOTE
have nothing to do with hurting my serps, or have nothing to do with losing a #1 spot that I've steadily had for 1 1/2 years?


Moving from position 1 to position 2? Or moving from position 1 to nowhere?

If it's the former, then absolutely positively 100% NO NO NO NO NO. If it were a problem, you'd be gone, not 2. Why would they move you down a spot just because you got a bunch of crap URLs indexed by mistake?

The notion of it is just crazy (in my opinion). If your site was always number one, you should count yourself lucky. And actually, how do you know it was always #1? You can't check it every second of every day from every computer in every geographical location, from everyone's personalized settings.

And another thing...have you actually lost any traffic or sales?

#11 Dantek

Dantek

    HR 2

  • Members
  • 14 posts

Posted 05 September 2008 - 08:19 PM

QUOTE(Jill @ Sep 5 2008, 06:40 PM)
Moving from position 1 to position 2? Or moving from position 1 to nowhere?

It's from #1 to #2.

QUOTE(Jill @ Sep 5 2008, 06:40 PM)
If it's the former, then absolutely positively 100% NO NO NO NO NO. If it were a problem, you'd be gone, not 2. Why would they move you down a spot just because you got a bunch of crap URLs indexed by mistake?

Wow...really...100% not the case. I'm not really an SEO guy (IT), but I would figure that a site with 80% of its pages being crappy and in the supplemental index doesn't help with all the algo factors in place. IMO, this should cost the website a point or two when evaluating it.

QUOTE(Jill @ Sep 5 2008, 06:40 PM)
The notion of it is just crazy (in my opinion). If your site was always number one, you should count yourself lucky. And actually, how do you know it was always #1? You can't check it every second of every day from every computer in every geographical location, from everyone's personalized settings.

And another thing...have you actually lost any traffic or sales?

I'm not really crazy...I just think about crazy stuff. Yes, I was very lucky and happy being #1, although now I'm not so happy or lucky anymore. I don't know if I was #1 all the time with all the different variables. But I do know where I stood for the variables I did keep an eye on over the past 2 years (different geos/IPs, different G servers, different TLDs, non-personalized). For ex: currently I know that from the US (main target) I stand at #1 on 6 Google TLDs and at #2 on 10 Google TLDs using non-personalized settings in English. I also know that my Google referrals for that spot are down 10% this week from last week (relatively calculated). I don't check these things religiously because I have better things to check, so I probably missed some fluctuations, but nevertheless. As far as lost sales go, I'm down 10% this week, but this does not apply because there are way too many other variables, and sales are not online and can take up to 3 months from when the site was discovered.

I also understand what you are really saying from your excellent article: searchengineland.com/080131-071244.php, and I'm sure you understand why #1 is so important to me from your words "It can certainly be a difficult concept for some of them to grasp, as rankings are often a vanity thing for them."

Bottom line is I have at least another 267 #1 spots that I know of so it's not much of an overall traffic impact. But when the competitor is an ex-partner, it's personal, one way shout.gif or the other yahoo.gif (the other is preferred).



Edited by Randy, 05 September 2008 - 08:35 PM.
De-linked link per forum rules.


#12 Jill

Jill

    High Rankings Advisor

  • Admin
  • 32,379 posts

Posted 05 September 2008 - 08:59 PM

Whatever. I can assure you that this other stuff has nothing to do with going from #1 to #2. That's such an every day occurrence, that to even wonder about it is just not worth the time/trouble.

#13 Randy

Randy

    Convert Me!

  • Moderator
  • 17,540 posts

Posted 05 September 2008 - 09:23 PM

QUOTE
It's mixed because the crappy query page also outputs links with other search suggestions which are good pages, so those got spidered as well.


Sure, and it's to be expected that they'd find something good I suppose.

Still, if you wanted those good pages indexed you could simply provide a static link to them via some other method than your search form! Or submit them in an xml sitemap. I simply give 'em another route there when I have site search stuff I want them to see, but I guess now I'm going to have to Googlebot-proof my forms since I only check indexed pages about once every two years.

Wouldn't it make sense if they simply added a robots.txt instruction for telling Googlebot to stay away from your forms? A noforms instruction would make sense to me!

Another question for you Dantek, since I thought I remembered Matt Cutts saying something about this subject several months ago and managed to find the time to track it down. FTR it wasn't Matt who said it, but I think I probably did find it from his site since he linked to it. (Matt's post I vaguely remembered is here if you want to read it or throw the question out to him.)

The question I have is... Would there be any detriment for your users if you converted your forms from the get to the post method?

The reason I ask is the following little nugget from the Google Webmaster Blog from this past April, when they announced they were doing this after people discovered it and asked about it.
QUOTE
Only a small number of particularly useful sites receive this treatment, and our crawl agent, the ever-friendly Googlebot, always adheres to robots.txt, nofollow, and noindex directives. That means that if a search form is forbidden in robots.txt, we won't crawl any of the URLs that a form would generate. Similarly, we only retrieve GET forms and avoid forms that require any kind of user information.


Emphasis is mine.

What the above says to me, and what I thought I remembered but had to find to confirm, is Google would leave your forms alone if they used the POST method. They'll only attempt to crawl through forms using the GET method.

So if you can change your form submission method without impacting usability for your real visitors, it should certainly stop any further damage by that crazy Googlebot. Though you'd still want to exclude all of those already indexed pages via robots.txt. Or, if you didn't want to do it via robots.txt, tweak your search script a bit so that when it detects a search has occurred it slaps a meta robots tag in at the page level, set to noindex, follow.
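
The two steps above could be sketched roughly as follows in Python. Everything named here is illustrative: `SEARCH_FORM`, `render_head`, and the field name `q` are hypothetical, and the form action simply reuses the /folder/page.htm path from earlier in the thread.

```python
from html import escape

# Step 1 (assumed markup): a search form that submits via POST instead of GET,
# so a search no longer produces a crawlable /folder/page.htm?... URL.
SEARCH_FORM = (
    '<form method="post" action="/folder/page.htm">'
    '<input type="text" name="q">'
    '<input type="submit" value="Search">'
    '</form>'
)

def render_head(title, query=None):
    """Build the <head> of a page.

    Step 2: when a search query was detected, add a page-level
    meta robots tag set to noindex, follow.
    """
    tags = [f"<title>{escape(title)}</title>"]
    if query:
        tags.append('<meta name="robots" content="noindex, follow">')
    return "<head>\n" + "\n".join(tags) + "\n</head>"

print(render_head("Search results", query="ains"))
print(render_head("Search"))
```

With this arrangement the base page is rendered without the robots tag and stays indexable, while any page generated from a query carries the noindex.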

In theory those two relatively simple steps should keep googlebot away from your forms in the future and get the currently indexed junk out of their index.

Last question I have for tonight. Do you have a Google Webmaster Central account set up for the site in question? If so, and if you're not concerned about them taking a harder look at your site, I'd recommend sending them a support query through WMC to see if the folks who control this newish bot activity have an easier or better solution. After all, it's their bot that caused the problem.

If you do contact them, please lobby for a new robots.txt and/or meta robots method to tell Googlebot to stay the heck away from forms! Who knows, maybe they'll listen when they see just how far overboard their spider has gone with your site.

#14 Dantek

Dantek

    HR 2

  • Members
  • 14 posts

Posted 05 September 2008 - 10:09 PM

Thanks Jill, point taken.

Randy, a noforms tag makes a lot of sense. I won't bother changing the form method since I have recently applied some blocks and going forward should be a non-issue.

Thanks for that link to Matt's blog, I see I'm not alone. I also see I'm not the only one with the crazy notion (sorry to bring that up again):
"Also, on a small site with a well laid out structure that has relatively few static pages with lots of content and a search box then I would have thought that your ‘random search’ indexing would only serve to destroy the balance’ of the structure of the site away from being highly structure to completely random. Surely this has implications for how Google ’sees’ the site for ranking purposes?"

Yes, I recently set up a G WMC account. Huh, I can send them a support query? I looked but could not find it. Thank you for that, I'll look again. And I'll put in my vote for a noform instruction. In the end, all I want is to get the crap de-indexed ASAP, even if it has no bearing on SERPS. Still not convinced about that (my wife says I have a hard-head), but what do I know.



#15 Jill

Jill

    High Rankings Advisor

  • Admin
  • 32,379 posts

Posted 05 September 2008 - 11:00 PM

QUOTE
no bearing on SERPS. Still not convinced about that (my wife says I have a hard-head), but what do I know.


You should listen to your wife!



