
How To Check Pdf For Indexability?


7 replies to this topic

#1 bobmeetin

bobmeetin

    HR 6

  • Active Members
  • 533 posts
  • Location:Colorado

Posted 19 October 2009 - 12:24 PM

I have a client who has a lot of PDF docs which he wants to put online, but I would like to qualify them first.

1) It seems to me that if I can highlight text character by character, that's a good thing.
2) But shouldn't I also be able to grep for a word ("market", for example), or search for it in a text editor such as vi, and find it if that word is indeed on the page and can be highlighted?

Is there a standard techy way to check pdf documents for indexability?

BTW - I searched the forum for pdf first, but was surprised to get zero results.

#2 Randy

Randy

    Convert Me!

  • Moderator
  • 17,540 posts

Posted 19 October 2009 - 04:00 PM

Most PDF files are indeed indexable.

However, trying to grep, vi or nano search for words within the PDFs isn't going to get you very far. The issue is that PDFs are effectively binary files, and the text in them is typically stored in encoded (usually compressed) form. So the text that shows up when you view one of these files with Acrobat Reader is simply not going to be there for grep, vi or nano to see.

Try opening one of those pdf files with your vi and you'll see what I mean. Other than a few header lines where human-readable text is available, everything else is going to look like gobbledygook.
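
To illustrate the point above, here is a minimal Python sketch, assuming the PDF keeps its page text in Flate-compressed streams (common, but not universal). The toy file and the `inflate_streams` helper are made up for illustration; real PDFs can also use other filters, encryption, or font encodings that this does not handle.

```python
import re
import zlib

# Matches the raw bytes between the "stream" and "endstream" keywords.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)\nendstream", re.DOTALL)


def inflate_streams(pdf_bytes):
    """Yield the decompressed content of each stream zlib can inflate."""
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            yield zlib.decompress(match.group(1))
        except zlib.error:
            continue  # not Flate data (or damaged); skip it


# Toy stand-in for a PDF: one compressed content stream, not a valid file.
page_text = b"BT /F1 12 Tf 72 720 Td (The market is growing and the market is open) Tj ET"
toy_pdf = (
    b"%PDF-1.4\n1 0 obj\n<< /Filter /FlateDecode >>\nstream\n"
    + zlib.compress(page_text)
    + b"\nendstream\nendobj\n"
)

raw_hit = b"market" in toy_pdf  # roughly what grep would see
inflated_hit = any(b"market" in s for s in inflate_streams(toy_pdf))
print(raw_hit, inflated_hit)
```

The word only turns up once the stream has been inflated, which is why a plain grep against the file comes back empty even though the text can be highlighted in a viewer.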

#3 bobmeetin

bobmeetin

    HR 6

  • Active Members
  • 533 posts
  • Location:Colorado

Posted 19 October 2009 - 05:30 PM

Okay, that's a good enough explanation for me. I had to dig really far back to recall this, but there is a UNIX command called 'strings', i.e. '% strings document_name', which you can run against a file and it will output some information to the screen. Other than some minor document/formatting info it returns gobbledygook as well.

We know that PDFs may be a combo of text and imagery. Is the vertical text that is highlightable, letter by letter, likely to be the only indexable text, since the angled, curved stuff was probably created using an image editing utility and is graphic in nature?
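
For anyone without the command handy, a rough Python approximation of what 'strings' does is sketched below. The `printable_strings` helper and the sample bytes are hypothetical, and the real utility has more options (and treats a few more characters as printable) than this.

```python
import re


def printable_strings(data, min_len=4):
    """Return runs of at least min_len printable ASCII bytes found in data."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]


# Made-up sample: a readable header, some binary noise, a readable key.
sample = b"%PDF-1.4\n\x00\x01\x9c\xffab\x00/Title (Hi)\n"
print(printable_strings(sample))
```

On a real PDF this tends to surface the header and structural keywords, while compressed page text stays unreadable.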

What do you mean by 65 character titles? Relevance to?

#4 Randy

Randy

    Convert Me!

  • Moderator
  • 17,540 posts

Posted 20 October 2009 - 08:58 AM

No clue how that got in there, Bob. It was something I'd copied to my clipboard when replying to another question, but I certainly didn't intend to paste it here! I'll go back and remove it from the original response so it's not quite so confusing.

#5 bobmeetin

bobmeetin

    HR 6

  • Active Members
  • 533 posts
  • Location:Colorado

Posted 10 November 2009 - 10:08 AM

Indexability and findability - they are not necessarily equivalent as we know. With regular web pages you have the flexibility of massaging the title tag, meta description, etc. PDFs (should it be PDF's - although not possessive) however, lose that baggage. To access them you typically add a link to a page or a menu, nothing more.

That being said, I assume that if you set up both a web page and a pdf with similar/identical content that the web page would win, hands down.

So the question here is: how much findability value is there in adding your stock PDF product info pages to your website, as opposed to taking the time/$$ to convert them to text/html?


#6 Randy

Randy

    Convert Me!

  • Moderator
  • 17,540 posts

Posted 10 November 2009 - 01:23 PM

I don't know about the time trade-off. Given one or the other I'd probably choose having 'em in plain old html, since text pages seem to get picked up better. However, this doesn't mean you can't get PDFs to rank well if that's what you've got to work with. It can be done.

FTR, there is a property in PDF files that seems to basically function the same as a <title> tag in an html page. It's been quite a while since I last tested this, so I would need to go back and re-review all of the various files and make sure things haven't changed. (The test covered PDF security settings, read-only vs. editable text in the PDF, and the various PDF properties.)

When I last checked the parts that got indexed and ended up being searchable in Google were:
  • Text content in the PDF. It didn't appear to matter if it was Read Only text or was set up in a form so as to be editable.
  • The Title in the PDF file properties. In Acrobat Pro these are accessible via File > Document Properties.

None of the other things available in the PDF Properties area held any sway with Google, so presumably they were not indexed. Those include the Author, Subject and, yes, the Keywords properties.
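
For checking the Title property without Acrobat Pro, the following Python sketch may help. It assumes the title sits in the file as an uncompressed, unencrypted literal string such as `/Title (Some Title)`; the `guess_pdf_title` helper and the sample bytes are hypothetical. Titles stored as hex strings, in UTF-16, in XMP metadata, or inside compressed object streams will be missed, so a proper PDF library is the safer route for anything beyond a quick look.

```python
import re

# Literal string after /Title, allowing backslash escapes but not nested parens.
TITLE_RE = re.compile(rb"/Title\s*\(((?:\\.|[^\\()])*)\)", re.DOTALL)


def guess_pdf_title(pdf_bytes):
    """Return the first literal /Title string found in the raw bytes, or None."""
    match = TITLE_RE.search(pdf_bytes)
    if match is None:
        return None
    return match.group(1).decode("latin-1")


# Made-up fragment shaped like a document information dictionary.
sample_info = (
    b"5 0 obj\n<< /Title (Market Report 2009) /Author (Bob) "
    b"/Keywords (pdf, seo) >>\nendobj\n"
)
print(guess_pdf_title(sample_info))
```

To try it on a real file, read it in binary mode (for example `open(path, "rb").read()`) and pass the bytes in; a result of `None` only means this heuristic found nothing, not that the PDF has no title.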

If you have the paid Pro version of Acrobat available to you, or something else that will allow you to see the Properties, I still have the test page up on my personal site that links to the 16 different PDF versions I tested at the time, 3+ years after the fact.

#7 andreamoro

andreamoro

    HR 1

  • Members
  • 3 posts
  • Location:Letchworth

Posted 11 November 2009 - 05:29 AM

Did you make a public test available online? I set one up on the 26th of October at PDF Ranking test (www.andreamoro.eu/SEO-test-PDF/) and I'm still evaluating the results. I'll publish a blog post about my conclusions soon.

#8 Randy

Randy

    Convert Me!

  • Moderator
  • 17,540 posts

Posted 11 November 2009 - 09:07 AM

I'm 99% sure I published the general outcome here somewhere, since the test was driven by a question that had been raised in a thread on this forum. That was three years ago. The chances of me finding that again are pretty much nil.

I don't recall publishing a table of what did and didn't get indexed with each version. I might have, but if I did that's lost to my fading memory. Of course, since the test PDFs are still up there it should be fairly simple to look at each test PDF again and see what's being done with it years after the fact. Just gotta find a few hours to actually look at the specifics.



