I was wondering if the contents of a pdf file will count as duplicate content too. I want to add a pdf information document to my website but this is almost completely the same as the version hosted on many other sites. Two questions:
- will the pdf content be seen as duplicate content if google has indexed almost the same pdf on some other sites (I'm sure google did index the content on other sites since the pdf is on many sites)
- if the pdf counts as duplicate content, what would be a good strategy to avoid penalties? I thought it might be an idea to place the pdf file(s) in a separate subdirectory and then use a robots.txt file indicating that any content in this directory should not be indexed. Would this be the best approach?
thanks a lot!
Pdf Seen As Duplicate Content?
Started by
seobarry
, May 28 2008 06:22 AM
3 replies to this topic
#1
Posted 28 May 2008 - 06:22 AM
#2
Posted 28 May 2008 - 06:45 AM
QUOTE
will the pdf content be seen as duplicate content if google has indexed almost the same pdf on some other sites (I'm sure google did index the content on other sites since the pdf is on many sites)
Yes, because it is duplicate content!
QUOTE
if the pdf counts as duplicate content, what would be a good strategy to avoid penalties?
Google doesn't really have a duplicate content penalty. Instead it's a filter. Yahoo! can be another matter if there is massive duplication, but I don't get the sense you're saying the entire site is being duplicated.
QUOTE
I thought it might be an idea to place the pdf file(s) in a separate subdirectory and then use a robots.txt file indicating that any content in this directory should not be indexed. Would this be the best approach?
This is exactly the way to handle it: exclude the pdf files via robots.txt. Whether you do this via a subdirectory exclusion or simply exclude all files that carry a .pdf extension matters not, as long as they're being excluded.
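For illustration, here is a minimal sketch of the subdirectory approach, checked with Python's standard-library robots.txt parser. The directory name /pdfs/ and the file names are assumptions made up for the example, not anything from the original question. Note that this parser only does simple prefix matching, so the sketch covers the subdirectory exclusion and not the per-extension variant.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. The /pdfs/ directory name is an
# assumption for this example; use whatever directory holds the PDFs.
ROBOTS_TXT = """\
User-agent: *
Disallow: /pdfs/
"""


def is_crawlable(url_path: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the rules above allow user_agent to fetch url_path."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url_path)


if __name__ == "__main__":
    # Hypothetical paths: one inside the excluded directory, one outside it.
    for path in ("/pdfs/info.pdf", "/index.html"):
        print(path, "->", "allowed" if is_crawlable(path) else "blocked")
```

Running a check like this before uploading the file is a cheap way to confirm that the rule blocks the PDF directory without accidentally blocking the HTML pages.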
#3
Posted 28 May 2008 - 05:46 PM
I have a client's site that is for information dissemination: data specifications, etc.
All data and spec pages are duplicated as PDFs, with a download link to each PDF from its HTML page. It's been like that for some years now. No special measures are taken, and the HTML pages show high in the SERPs. Interestingly, though, I get a lot of researchers coming from Google who use advanced search to look for PDFs.
#4
Posted 30 May 2008 - 02:20 AM
Good point, Piskie. I was thinking about the PDF duplicate content "issue" the other day and couldn't find a convincing reason to block them. What does it matter whether people find your PDF or your HTML page in the results? Each has its advantages and disadvantages.