
Data Feeds and Duplicate Content

January 2, 2008
Hi Jill,

I have a question for you.  I know you are busy, so I’ll try to keep it to the point.

A lot of my clients pull data/products from semi-public databases to populate their websites, similar to how real estate agents show home listings on their sites. I've been making sure that each client has unique, valuable, professionally edited content on the rest of the site whenever possible. But I'm afraid that if I make the portions of the site that use the semi-public data accessible to the search engines, they'll find the same information on other sites and my clients' sites won't get indexed.

So, based on that "fear," I have blocked robot access (as much as can be done) to keep them from indexing the pages that contain these data feeds and the corresponding details.
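(For reference, this sort of blocking is usually done with a robots.txt rule. A minimal sketch, assuming the feed-driven pages live under a directory such as /listings/; the path is a made-up example:)

```
# robots.txt at the site root
# Block all compliant crawlers from the feed-driven section.
# The /listings/ path is a hypothetical example.
User-agent: *
Disallow: /listings/
```

(Note that Disallow only asks well-behaved crawlers not to fetch those pages; it is not access control.)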

So my question is...

Should I go to the extra length of creating unique content for each product, or would I just be spinning my wheels because the engines would detect the similarity of the pages anyway?

Thank you for your time; I have always enjoyed your articles.


++Jill's Response++

Hi Alexander,

So many people have a misunderstanding of the whole duplicate content issue.

It's fine to allow the search engines to index that content. The search engines won't drop or refuse to index an entire site just because some pages contain information that also appears on other sites. That's a common scenario they know how to deal with appropriately (for the most part).

The worst that can happen is that the search engines simply won't index those particular duplicated pages, or, if they do, that those pages won't show up in the search results for their optimized keyword phrases. However, if you block the search engines from the content via robots.txt, they definitely won't index it, and you've guaranteed the very outcome you were afraid of.

What I would recommend is wrapping your own unique content around the database-pulled content.  In other words, you’d add some copy before the listings (or whatever happens to be in the data feed), and perhaps after the feed info as well.  This would provide you with the best chance of having those pages indexed and possibly showing up in the rankings for the keyword phrases for which you choose to optimize.
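(As a rough sketch of that page structure, using a real-estate example to match the question above; all element names and copy here are invented for illustration:)

```
<!-- Hypothetical listing page: unique copy surrounds the shared feed data -->
<h1>Waterfront Homes in Gloucester, MA</h1>

<p>Unique, professionally edited introduction written just for this
   client: who these listings are for, what makes this market different.</p>

<!-- The duplicated part: details pulled from the shared database/feed -->
<div class="feed-listings">
  [database-pulled listing details go here]
</div>

<p>More unique copy after the feed: buying advice, local knowledge,
   a call to action; anything competing sites using the same feed lack.</p>
```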

Hope this helps!


 Margaret Rose-Duffy said:
Thanks so much for that information, Jill. This is an issue that I also had a misunderstanding about, and I appreciate you clearing it up.
 Shell Harris said:
I wonder how this would apply to the large amounts of content duplicated from Wikipedia. I have seen a drop in many Wikipedia-scraping sites lately, and you never see a duped Wikipedia page ranking, yet thousands of sites have taken content from it.

Do you think Wikipedia content is so overused that it is pointless to even wrap original content around it?
 Jill Whalen said:
Hi Shell,

I am not sure I'm understanding your question. Feel free to email me personally to explain further!

 Karen Thumm said:
I don't know what is meant by "semi-public databases," but what concerns me is that your clients may be lifting copyrighted materials from other sites and putting them on their own sites. Just because it's on the web doesn't mean it's free for the taking. As an artist, I've had my images stolen and even copied and passed off as the thief's own original artwork. It's a HUGE problem for us artists, and we fight back when we find infringements!

Alexander, are you making sure that the content taken from other websites is actually free for the taking and not infringing on someone's copyrights? I'd be a lot more concerned about violating the Digital Millennium Copyright Act than about duplicate content.
 Jason said:
@Karen Thumm: Republishing content does not necessarily violate copyright, as it's generally not being claimed as original content. In this context, I'm understanding 'semi-public databases' to be things like affiliate marketing feeds, or even openly available RSS feeds of news articles, product listings, or classified ads, which can easily be integrated into and republished as part of a completely new site. For example, there are numerous blogs which simply post articles from other newsfeeds, and while they ultimately link back to the originating site, they still get the benefit of the content. The aggregators can then focus on selling ad space and generating other income. For the feed owners, forming distribution partnerships with other sites is a great way to build traffic on a cheap PPC or even free basis, but it can also create some issues with your own pages getting indexed down the road. Given the same core information displayed on two different pages, which will Google consider the better site?