Robots.txt And Duplicate Content


8 replies to this topic

#1 kookee

kookee

    HR 2

  • Members
  • 25 posts

Posted 12 June 2008 - 12:56 PM

If I have three duplicate pages: www.example.com/page1, www.example.com/category/page1 and www.example.com/foo/page1,

and I block www.example.com/category/page1 in the robots.txt file,

and only */category/* pages link to www.example.com/page1, will www.example.com/page1 be blocked from spiders too?

This is pretty much what my cart software is doing with all products. I want the spiders to only access www.example.com/foo/page1

not
www.example.com/page1
or
www.example.com/category/page1

Hope that makes sense.


#2 Randy

Randy

    Convert Me!

  • Moderator
  • 17,540 posts

Posted 12 June 2008 - 05:06 PM

Since the page you're excluding is the only one that links to the other page you want to exclude, your logic makes sense. However, I would caution you against doing it that way.

You'd be better off excluding both of the pages you want excluded via your robots.txt. It's a trivial addition, and you never know when cart software might put the other URLs in something (even like a feed) or the spiders might discover a link to one of these other pages on someone else's site. Better safe than sorry.

If you wanted to exclude everything in the /category/ directory, you can also exclude just the directory. There's an implied wildcard at the end, so it would exclude every page inside the category subdirectory. Though it's not part of the robots.txt standard, you can do the same sort of thing with filenames too, if the ones you want to exclude at the root level have something unique in their page names. There you would use the * wildcard character.
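To make the directory rule above concrete, here is a minimal sketch using Python's standard-library urllib.robotparser. The robots.txt content and the example.com URLs are assumptions based on the question, not anyone's real file. As far as I know this parser only does plain prefix matching, so it shows the implied trailing wildcard but not the non-standard * extension.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for the site in the question (an assumption).
ROBOTS_TXT = """\
User-agent: *
Disallow: /category/
"""


def is_allowed(url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a generic crawler may fetch the URL under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("*", url)


if __name__ == "__main__":
    for url in (
        "http://www.example.com/category/page1",
        "http://www.example.com/page1",
        "http://www.example.com/foo/page1",
    ):
        print(url, "allowed" if is_allowed(url) else "blocked")
```

Note that /page1 itself stays fetchable under this rule. It is only hidden for as long as no crawlable page or outside site links to it, which is why listing every unwanted URL explicitly is the safer route.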

#3 kookee

kookee

    HR 2

  • Members
  • 25 posts

Posted 13 June 2008 - 05:26 AM

Hi Randy, thanks for the info.

The cart in question is [removed to protect the innocent and guilty] but they have a real shitty duplicate content problem.

For example, see the three URLs to the same product below (this is from the demo store).

/ink-eater-krylon-bombear-destroyed-tee-1.html
/apparel/shirts/ink-eater-krylon-bombear-destroyed-tee-1.html
/catalog/product/view/id/120/s/ink-eater-krylon-bombear-destroyed-tee/category/18/

On my site, I block /catalog with robots.txt.

In this example I would also like to block all products in root, so /ink-eater-krylon-bombear-destroyed-tee-1.html

The problem is that there are too many products to block directly. The products in root are only linked to from /catalog/something/something/and_so_on

So I'm hoping (not a word I like!) that by blocking /catalog I will also block the root products. It sounds like that is the case from your last post.

(Actually there are other places which I have blocked too, but I'm trying to keep things simple here.)

I'm just hoping that there will soon be a fix for this. Magento is the best cart software that I have ever come across, but they need to sort out the duplicate products problem. They are only at version one, so they can be forgiven.

I was thinking of prefixing all products with a code in the URL, such as /hhdgf_aproduct, then blocking anything in root with hhdgf in the URL, but that seems OTT.

Would it seem that I am doing all I can for now or are there any other solutions?

Edited by Randy, 13 June 2008 - 08:34 AM.


#4 Randy

Randy

    Convert Me!

  • Moderator
  • 17,540 posts

Posted 13 June 2008 - 08:39 AM

Unfortunately your OTT solution is about the only one that'll work, given the fact that you want to block the root level urls. The reason for this being that even Google's extension of the robots.txt standard doesn't support true REGEX statements. It does support * wildcards, but that's not enough if every page you want to exclude doesn't start with the same series of characters.
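A rough sketch of why wildcards alone fall short here. The helper `wildcard_match` is hypothetical: it approximates the * and $ extensions as Google documents them and is not Google's actual matcher. The URLs are the demo-store ones from the post above, and the hhdgf_ prefix is the code kookee proposed.

```python
import re


def wildcard_match(pattern: str, path: str) -> bool:
    """Approximate Google-style robots.txt matching (hypothetical helper).

    '*' matches any run of characters, a trailing '$' anchors the end of
    the URL, and everything else is a literal prefix match from the start.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with ".*" for each wildcard.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    return re.match(regex, path) is not None


if __name__ == "__main__":
    root = "/ink-eater-krylon-bombear-destroyed-tee-1.html"
    nested = "/apparel/shirts/ink-eater-krylon-bombear-destroyed-tee-1.html"

    # A wildcard rule cannot tell the root copy from the category copy,
    # because '*' also swallows the directory part of the path.
    print(wildcard_match("/*-1.html$", root), wildcard_match("/*-1.html$", nested))

    # A shared prefix on root-level product URLs can be targeted on its own.
    print(wildcard_match("/hhdgf_", "/hhdgf_ink-eater-tee-1.html"))
    print(wildcard_match("/hhdgf_", "/apparel/shirts/hhdgf_ink-eater-tee-1.html"))
```

The wildcard rule catches both copies of the product, while the prefix rule catches only the root-level one, which is the whole point of the OTT approach.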

This is a case where it's going to be far more desirable to fix the root cause of the problems. Which means getting the software to stop linking to those pages from the category page, and possibly then tweaking the .htaccess so that calls to those pages at the root level produce a 404 or 301.
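The 301 idea can be sketched as a small decision function. This is Python rather than .htaccess, purely to show the logic; a real fix would live in the server's rewrite rules or in the cart itself. The `CANONICAL_URLS` map and `response_for` are made-up names, and choosing the /apparel/shirts/ URL as the one to keep is an assumption for illustration.

```python
from typing import Dict, Optional, Tuple

# Hypothetical map from root-level product file name to the single URL
# that should stay indexable (the choice of target is an assumption).
CANONICAL_URLS: Dict[str, str] = {
    "ink-eater-krylon-bombear-destroyed-tee-1.html":
        "/apparel/shirts/ink-eater-krylon-bombear-destroyed-tee-1.html",
}


def response_for(path: str) -> Tuple[int, Optional[str]]:
    """Return (status code, redirect target) for a requested path.

    Root-level product URLs get a permanent redirect to their canonical
    copy; every other path is served normally, shown here as (200, None).
    """
    name = path.lstrip("/")
    if "/" not in name and name in CANONICAL_URLS:
        return 301, CANONICAL_URLS[name]
    return 200, None


if __name__ == "__main__":
    print(response_for("/ink-eater-krylon-bombear-destroyed-tee-1.html"))
    print(response_for("/apparel/shirts/ink-eater-krylon-bombear-destroyed-tee-1.html"))
```

A 301 passes visitors and spiders on to the copy you want indexed, whereas a 404 simply drops the duplicate, so the redirect is usually the gentler of the two options.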

Have you tried working with the cart manufacturers to see if they have a solution? It really needs to come from there. robots.txt won't be enough for the situation you've laid out.

#5 kookee

kookee

    HR 2

  • Members
  • 25 posts

Posted 14 June 2008 - 04:54 AM

QUOTE(Randy @ Jun 13 2008, 02:39 PM)
Unfortunately your OTT solution is about the only one that'll work, given the fact that you want to block the root level urls. The reason for this being that even Google's extension of the robots.txt standard doesn't support true REGEX statements. It does support * wildcards, but that's not enough if every page you want to exclude doesn't start with the same series of characters.

This is a case where it's going to be far more desirable to fix the root cause of the problems. Which means getting the software to stop linking to those pages from the category page, and possibly then tweaking the .htaccess so that calls to those pages at the root level produce a 404 or 301.

Have you tried working with the cart manufacturers to see if they have a solution? It really needs to come from there. robots.txt won't be enough for the situation you've laid out.


Thanks for the info.
The cart manufacturers don't seem all that interested, going by the forums. Don't get me wrong, the <cart name removed> team are brilliant. They have created the best cart software I have ever seen, and I'm not the only one saying that. It's just the duplicate products that suck a bit. OK, a lot.

I think I'll just try to forget about the problem for now and optimize the hell out of the category pages.

Edited by Randy, 14 June 2008 - 06:46 AM.


#6 Randy

Randy

    Convert Me!

  • Moderator
  • 17,540 posts

Posted 14 June 2008 - 06:45 AM

I don't think I could ever put an e-commerce platform on my "Best" or even "Good" list if the developers have no interest in making their platform search engine friendly. How can it be Best or Good if such basic things are not only overlooked, but actively ignored?

It's the same thing every other shopping cart out there has had to go through at one time or another, which is understandable since cart developers tend to be code jockeys and not SEOs. They all eventually either come around, or they simply disappear because nobody will pay for software that has such huge and easy-to-fix flaws.

#7 kookee

kookee

    HR 2

  • Members
  • 25 posts

Posted 17 June 2008 - 10:55 AM

[Merged into original thread since it's a continuation of the same situation. ~Randy]

Now I'm wondering if products need indexing at all.

The shop in question sells greeting cards (real ones, not ecards), and searches tend to be more general than specific because people like to browse. For this reason I'm thinking that it may be better to just optimize the category pages and block the product pages with robots.txt. Is this a good or bad idea?

Edited by Randy, 17 June 2008 - 12:23 PM.


#8 Randy

Randy

    Convert Me!

  • Moderator
  • 17,540 posts

Posted 17 June 2008 - 12:28 PM

It's a question that cannot be answered in a vacuum, kookee.

For instance, if you provide a feed to Google Base/Froogle, I don't think you'd want to then block spiders from your product pages. If those product pages get direct traffic from the search engines, I don't think you'd want to start excluding. Etc, etc.

Have you looked at your web stats to see if traffic is landing directly on these product pages? I suspect the volumes will be quite small for each page, so you'll need to dig fairly deep. However if you take all of those 1's and 2's from the various product pages it can often add up to a significant portion of a site's traffic.
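The adding-up can be sketched in a few lines. The function `product_landing_share` is a hypothetical helper, the landing figures are invented purely for illustration, and treating any path ending in .html as a product page is an assumption about the URL scheme.

```python
from typing import Callable, Iterable, Tuple


def product_landing_share(
    landings: Iterable[Tuple[str, int]],
    is_product: Callable[[str], bool],
) -> float:
    """Return the fraction of search landings that arrive on product pages."""
    total = 0
    product = 0
    for path, visits in landings:
        total += visits
        if is_product(path):
            product += visits
    return product / total if total else 0.0


if __name__ == "__main__":
    # Invented numbers: two busy pages plus 15 product pages at 2 visits each.
    landings = [("/", 40), ("/birthday/", 30)]
    landings += [(f"/card-{i}.html", 2) for i in range(15)]

    share = product_landing_share(landings, lambda p: p.endswith(".html"))
    print(f"{share:.0%} of landings hit product pages")
```

No single product page looks important in a list like that, but together they can account for a large slice of entry traffic, which is what would be lost by blocking them.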

You'll want to be sure before you do something like this. It could have some pretty drastic negative effects if you end up cutting off a good traffic and revenue stream.

#9 kookee

kookee

    HR 2

  • Members
  • 25 posts

Posted 19 June 2008 - 08:48 AM

The shop's not live yet. I think you're right, I need some stats first. Thanks for your help :-)



