
Ocr And Crawlers


15 replies to this topic

#1 cheapgary

cheapgary

    HR 3

  • Active Members
  • 76 posts

Posted 11 July 2005 - 12:01 AM

Does anybody know of any progress in the development of spidering scanned text documents via Optical Character Recognition (OCR)?


Regards,
Gary

Edited by cheapgary, 11 July 2005 - 01:06 AM.


#2 Jill

Jill

    High Rankings Advisor

  • Admin
  • 32,311 posts

Posted 11 July 2005 - 12:10 AM

I believe that all of the search engines read/index pdf files using that technique.

#3 cheapgary

cheapgary

    HR 3

  • Active Members
  • 76 posts

Posted 11 July 2005 - 12:32 AM

"I believe that all of the search engines read/index pdf files using that technique."
I'm trying some experiments now, but I haven't been able to nail down that this is true with scans, i.e., text that exists only as an image.

Regards,
Gary

#4 lyn

lyn

    HR 6

  • Active Members
  • 940 posts
  • Location:London, Ontario

Posted 11 July 2005 - 06:41 AM

I haven't heard anything of the SE's using OCR to read text from image files. Most PDFs embed fonts along with images and render the text as text, so the spiders are able to read it without running character recognition.

Occasionally you get a PDF that is just an image scan of a doc that has been stuffed into a PDF, but then it's just a JPEG with a PDF wrapper around it - no text to read.

If you can highlight sections of text alone with the Reader selection tool, it's a properly formed PDF that the spiders can read. If it's just a scan, the selection tool won't find any text either, just a single full-page image.

L.
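The selection-tool check described above can be roughly approximated in code. What follows is only a minimal sketch, not how any search engine does it: the function name `classify_pdf_bytes` and its return labels are made up for illustration, and the heuristic simply looks for font resources versus image objects in the raw file bytes. It assumes the PDF's dictionaries are stored uncompressed; files that pack them into compressed object streams will come back as "unknown", and a font resource does not guarantee that the text is actually extractable.

```python
import re


def classify_pdf_bytes(raw: bytes) -> str:
    """Guess whether a PDF carries embedded text or is only a wrapped scan.

    Heuristic only (an assumption of this sketch, not a PDF-spec rule):
    - a /Font resource suggests real text objects are present
    - /Subtype /Image with no fonts suggests a page image in a PDF wrapper
    """
    has_font = b"/Font" in raw
    has_image = re.search(rb"/Subtype\s*/Image", raw) is not None

    if has_font:
        return "has-text-resources"
    if has_image:
        return "image-only"
    return "unknown"


# Hypothetical usage with a file on disk:
#     with open("scan.pdf", "rb") as handle:
#         print(classify_pdf_bytes(handle.read()))
```

A proper PDF parser would give a more reliable answer; this only illustrates the difference between a PDF that embeds fonts and text and one that is a single full-page image.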

#5 Jill

Jill

    High Rankings Advisor

  • Admin
  • 32,311 posts

Posted 11 July 2005 - 07:52 AM

The engines read PDFs. As far as I know, they don't bother to convert images that are on people's pages into text.

#6 Randy

Randy

    Convert Me!

  • Moderator
  • 17,540 posts

Posted 11 July 2005 - 01:04 PM

I can confirm Lyn's take on PDFs, having done more than a bit of studying on it.

The engines are reading the text that is embedded in the PDF file. They're not using OCR for that in the strict sense: they're not analyzing what visually appears on the page, the way OCR would with an image.

I have literally thousands of examples where what appears to be text in a PDF, but is actually part of the image used to create the base PDF, is completely ignored and not indexed. But the embedded text does get indexed and does show up in the SERPs. Additionally, at least the Title in the PDF Properties does get indexed and will rank, even if it is not visible in the PDF file.
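For anyone who wants to see which Title a PDF carries in its properties, here is a minimal sketch. It makes several assumptions: the helper name `pdf_title` is hypothetical, the title is stored as an uncompressed literal string such as `/Title (Some Text)`, and the first `/Title` found is the document title (bookmark entries also use a `/Title` key, so this can pick up the wrong one). Hex-encoded or UTF-16 titles are not handled.

```python
import re
from typing import Optional

# Matches /Title (...) where the string may contain backslash-escaped characters.
# Unescaped nested parentheses, which PDF also allows, are not supported here.
_TITLE = re.compile(rb"/Title\s*\(((?:\\.|[^\\()])*)\)", re.DOTALL)


def pdf_title(raw: bytes) -> Optional[str]:
    """Return the first literal-string /Title found in the raw PDF bytes, if any."""
    match = _TITLE.search(raw)
    if match is None:
        return None
    # Drop the backslash from each escape. This is correct for \( \) and \\,
    # but deliberately simplified: \n and octal escapes are not interpreted.
    value = re.sub(rb"\\(.)", rb"\1", match.group(1), flags=re.DOTALL)
    return value.decode("latin-1")
```

Comparing that value with what is visible on the page is a quick way to spot a title that exists only in the document properties.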

#7 cheapgary

cheapgary

    HR 3

  • Active Members
  • 76 posts

Posted 11 July 2005 - 08:37 PM

There's a site I use that is a state archive of old scanned newspapers from the 1800s onward. It has its own search engine that can do keyword searches, and it delivers the scanned image of the document, all tattered and faded, with the keyword highlighted.

I've corresponded with them and got the name of the OCR program.

I'm looking into when/how this can be deployed on a larger scale. I'm thinking, couldn't Google go out and crawl images with OCR? Maybe it needs a special robots file name or something so that it knows to index an image with OCR.

Regards,
Gary

#8 qwerty

qwerty

    HR 10

  • Moderator
  • 8,287 posts
  • Location:Somerville, MA

Posted 11 July 2005 - 09:10 PM

What does Google Print use? That's searchable. For example, here are the results of a search for "april is the cruelest month" eliot

#9 cheapgary

cheapgary

    HR 3

  • Active Members
  • 76 posts

Posted 11 July 2005 - 09:35 PM

qwerty,
Now that's nice. I'll check this out some more.

Can't right click on it, doesn't download on "file save as".

Can you spot what format that document is in? Is it in any image file format?

I wonder if Google can crawl MY archives? What would I have to do? Hmmm .... It could even be the app I was describing. I have to go look for an email for the name of the thing. It was up there $$$ wise. Maybe I don't have to buy it now.

update:
Well, it DOES download as a view source/save as .txt, change it to htm and open it in DW. It's all there but there's no image in the folder and it has a blank .gif stretched across it, and it's not a bg. Hmmm...again.

Here we go ..

<style type=text/css>.theimg { background-image:url("http://print.google....iXGbouQMCBQBco");background-repeat:no-repeat;background-position:center left;background-color:white; }</style>

But this is a local file without a bg in the folder, so where is this img?

Regards,
Gary

Edited by cheapgary, 11 July 2005 - 10:31 PM.
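On the question of where the image is: the style rule points at an absolute URL, so the browser fetches the page image from the remote server even when the HTML itself is opened from a local folder. A minimal sketch for pulling such URLs out of a saved page's CSS follows. The function name `background_image_urls` and the example.com address in the usage comment are placeholders, not anything taken from the actual page.

```python
import re
from typing import List

# Captures the target of background-image:url(...), with or without quotes.
_BG_URL = re.compile(
    r"background-image\s*:\s*url\(\s*['\"]?(.*?)['\"]?\s*\)",
    re.IGNORECASE,
)


def background_image_urls(css: str) -> List[str]:
    """Return every URL referenced by a background-image declaration."""
    return _BG_URL.findall(css)


# Hypothetical usage:
#     css = '.theimg { background-image:url("http://example.com/scan-001.png"); }'
#     background_image_urls(css)
```

Any URL returned that starts with `http://` is loaded over the network, which is why no image file shows up next to the saved page.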


#10 cheapgary

cheapgary

    HR 3

  • Active Members
  • 76 posts

Posted 12 July 2005 - 02:22 AM

BINGO ....!!!

"MOUNTAIN VIEW, Calif. - January 17, 2001 - Google Inc., developer of the award-winning Google search engine, today announced that the company has named Wayne Rosing, 54, as its new vice president of Engineering. Rosing, a Silicon Valley veteran, brings over a quarter century of engineering and research experience to the Mountain View-based company.

As vice president of Engineering, Rosing will be responsible for managing Google's skilled engineering staff and guiding the company's technologic development. Rosing joins Google from his most recent position as chief technology officer and vice president of Engineering at Caere Corporation. While at Caere, Rosing managed all engineering for optical character recognition (OCR) product lines and was the driving force behind the acquisition of what has become one of Caere's key products, Omniform, a comprehensive forms application. ......."

That's the company!

#11 cheapgary

cheapgary

    HR 3

  • Active Members
  • 76 posts

Posted 12 July 2005 - 02:59 AM

Here's the email I was looking for. The program I saw in use at CHNC is even nicer than OmniPage. Now that I recall where I was in this, I found I have OmniPage on a disc that came with my Canon digital camera #2. I'm looking for Olive ActivePaper (OAP) in use on another site.

So, Google IS doing it with OCR. I wonder if I can get them to do mine?

Well, nice meetin' you all. Bye.

You can go ahead and close this thread, too.

Regards,
Gary


"Thanks for your feedback on Colorado's Historic Newspaper Collection .....

CHNC is using some fairly sophisticated (and expensive) software called
Olive ActivePaper Archive (http://www.olivesoftware.com) that was
developed for contemporary newspaper publishers and adapted for use with historic
newspapers.


To do basic OCR at home there are a number of products available on the
market and we're not in the position to endorse one or the other. The
ABBYY FineReader (www.abbyy.com) or ScanSoft OmniPage (www.omnipage.com) software is frequently used in library and archival applications. "

#12 lyn

lyn

    HR 6

  • Active Members
  • 940 posts
  • Location:London, Ontario

Posted 12 July 2005 - 11:03 AM

It looks to me like the contents of Google Print are produced by scanning books and PDFs that are submitted directly for indexing. Very cool, but not a result of spidering.

L.

#13 Jill

Jill

    High Rankings Advisor

  • Admin
  • 32,311 posts

Posted 12 July 2005 - 11:23 AM

Yes, lyn, I believe that is correct.

#14 qwerty

qwerty

    HR 10

  • Moderator
  • 8,287 posts
  • Location:Somerville, MA

Posted 12 July 2005 - 11:45 AM

It is. I'm just curious about exactly what's displayed when I click through from a result. I don't think that's a PDF, and if it's a graphical file I don't know how they highlight text. I suppose it must be their own technology.
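One common way to highlight a search term on a page image, offered here as a guess rather than as a description of what Google Print or the newspaper archive actually does, is to have the OCR step record a bounding box for every recognized word, search the recognized text, and then draw rectangles over the image at the matching coordinates. The sketch below assumes that setup; the `OcrWord` type, the `find_highlights` function, and the coordinates in the usage comment are invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height) in image pixels


@dataclass(frozen=True)
class OcrWord:
    """One word recognized by OCR, with its position on the page image."""
    text: str
    x: int
    y: int
    width: int
    height: int


def find_highlights(words: List[OcrWord], phrase: str) -> List[List[Box]]:
    """Return the boxes of each occurrence of `phrase` in reading order."""
    target = [part.lower() for part in phrase.split()]
    if not target:
        return []
    # Compare case-insensitively and ignore surrounding punctuation.
    normalized = [w.text.lower().strip(".,;:!?\"'") for w in words]
    hits = []
    for start in range(len(normalized) - len(target) + 1):
        if normalized[start:start + len(target)] == target:
            matched = words[start:start + len(target)]
            hits.append([(w.x, w.y, w.width, w.height) for w in matched])
    return hits


# Hypothetical usage:
#     page = [OcrWord("cruelest", 115, 20, 70, 12), OcrWord("month,", 190, 20, 55, 12)]
#     find_highlights(page, "cruelest month")
```

The page the visitor sees stays a plain image; the highlight is just an overlay positioned from the stored coordinates.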

#15 Randy

Randy

    Convert Me!

  • Moderator
  • 17,540 posts

Posted 12 July 2005 - 11:49 AM

True. OCR is being employed by Google in their push to scan all sorts of books from libraries and such. It's something they're interested in doing obviously.

That's completely different from what their spider can or can't do, though. They'd have to pull the files internally to do anything exotic, which isn't happening on anything approaching a massive or routine scale yet.



