Jul 11 2005, 12:01 AM
Post
#1
|
|
|
HR 3 ![]() ![]() ![]() Group: Active Members Posts: 76 Joined: 23-December 04 User's local time: Feb 9 2010, 04:37 PM Member No.: 6,037 |
Anybody know of any progress in the development of spidering scanned text documents via Optical Character Recognition (OCR)?
Regards, Gary This post has been edited by cheapgary: Jul 11 2005, 01:06 AM |
|
|
|
Jul 11 2005, 12:10 AM
Post
#2
|
|
![]() High Rankings Advisor Group: Admin Posts: 29,201 Joined: 21-July 03 User's local time: Feb 9 2010, 03:37 PM From: Ashland, MA Member No.: 2 |
I believe that all of the search engines read/index pdf files using that technique.
|
|
|
|
Jul 11 2005, 12:32 AM
Post
#3
|
|
|
HR 3 ![]() ![]() ![]() Group: Active Members Posts: 76 Joined: 23-December 04 User's local time: Feb 9 2010, 04:37 PM Member No.: 6,037 |
"I believe that all of the search engines read/index pdf files using that technique."
I'm trying some experiments now, but I haven't been able to confirm that this is true for scans, i.e. image text. Regards, Gary |
|
|
|
Jul 11 2005, 06:41 AM
Post
#4
|
|
|
HR 6 ![]() ![]() ![]() ![]() ![]() ![]() Group: Active Members Posts: 940 Joined: 28-April 04 User's local time: Feb 9 2010, 04:37 PM From: London, Ontario Member No.: 3,389 |
I haven't heard anything about the search engines using OCR to read text from image files. Most PDFs embed fonts along with images and store the text as text, so the spiders are able to read it without running character recognition.
Occasionally you get a PDF that is just an image scan of a doc that has been stuffed into a PDF, but then it's just a JPEG with a PDF wrapper around it - no text to read. If you can highlight sections of text alone with the Reader selection tool, it's a properly formed PDF that the spiders can read. If it's just a scan, the selection tool won't find any text either, just a single full-page image. L. |
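The "can you select the text?" test described above can be roughly approximated in code. The following is only a crude sketch in Python, not how any search engine actually does it: it assumes the PDF's page content sits in uncompressed or Flate-compressed streams and simply looks for the PDF text operators (`BT` together with `Tj`/`TJ`). The function names are made up for illustration. It ignores encryption, object streams and other filters, and raw binary data could in principle trigger a false positive.

```python
import re
import zlib

# Everything between "stream" and "endstream" keywords (crude, not a real PDF parser).
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def iter_streams(pdf_bytes):
    """Yield each stream body, Flate-decompressed when possible."""
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # decompressobj tolerates the trailing newline after the compressed data
            yield zlib.decompressobj().decompress(raw)
        except zlib.error:
            # Not zlib data: an uncompressed stream, a JPEG (DCTDecode), etc.
            yield raw


def has_text_layer(pdf_bytes):
    """Heuristic: True if any stream contains text-showing operators."""
    for data in iter_streams(pdf_bytes):
        if re.search(rb"\bBT\b", data) and re.search(rb"\b(?:Tj|TJ)\b", data):
            return True
    return False
```

Usage would be something like `has_text_layer(open("scan.pdf", "rb").read())`, where `scan.pdf` is a placeholder file name. A `False` result suggests the file is just a page image in a PDF wrapper, which matches the case where the Reader selection tool finds nothing to highlight.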
|
|
|
Jul 11 2005, 07:52 AM
Post
#5
|
|
![]() High Rankings Advisor Group: Admin Posts: 29,201 Joined: 21-July 03 User's local time: Feb 9 2010, 03:37 PM From: Ashland, MA Member No.: 2 |
The engines read PDFs. As far as I know, they don't bother to convert images that are on people's pages into text.
|
|
|
|
Jul 11 2005, 01:04 PM
Post
#6
|
|
![]() Convert Me! Group: Admin Posts: 17,380 Joined: 17-August 03 User's local time: Feb 9 2010, 02:37 PM Member No.: 551 |
I can confirm Lyn's take on PDFs, having done more than a bit of studying on it.
The engines are reading the text that is embedded in the PDF file. They're not using OCR for that in the strict sense: they're not looking at what visually appears on the page, the way OCR would with an image. I have literally thousands of examples where what appears to be text in a PDF, but is actually part of the image used to create the base PDF, is completely ignored and not indexed. But the embedded text does get indexed and does show up in the SERPs. Additionally, at least the Title in the PDF Properties does get indexed and will rank, even if it is not visible in the PDF file. |
|
|
|
Jul 11 2005, 08:37 PM
Post
#7
|
|
|
HR 3 ![]() ![]() ![]() Group: Active Members Posts: 76 Joined: 23-December 04 User's local time: Feb 9 2010, 04:37 PM Member No.: 6,037 |
There's a site I use that is a state archive of old scanned newspapers from the 1800s and up. It has its own search engine, can keyword search, and it delivers the scanned image of the document, all tattered and faded, with the keyword highlighted.
I've corresponded with them and got the name of the OCR program. I'm looking into when/how this can be deployed on a larger scale. I'm thinking, couldn't Google go out and crawl images with OCR? Maybe it needs a special robots file name or something so as to know to index it with OCR. Regards, Gary |
|
|
|
Jul 11 2005, 09:10 PM
Post
#8
|
|
![]() HR 10 Group: Moderator Posts: 7,489 Joined: 24-July 03 User's local time: Feb 9 2010, 03:37 PM From: Somerville, MA Member No.: 22 |
What does Google Print use? That's searchable. For example, here are the results of a search for "april is the cruelest month" eliot
|
|
|
|
Jul 11 2005, 09:35 PM
Post
#9
|
|
|
HR 3 ![]() ![]() ![]() Group: Active Members Posts: 76 Joined: 23-December 04 User's local time: Feb 9 2010, 04:37 PM Member No.: 6,037 |
qwerty,
Now that's nice. I'll check this out some more. Can't right-click on it, and it doesn't download with "File > Save As". Can you spot what format that document is in? Is it in any image file format? I wonder if Google can crawl MY archives? What would I have to do?? Hmmm .... It could even be the app I was describing. I have to go look for an email with the name of the thing. It was up there $$$ wise. Maybe I don't have to buy it now. Update: Well, it DOES download via view source/save as .txt; change it to .htm and open it in DW. It's all there, but there's no image in the folder, and it has a blank .gif stretched across it, and it's not a bg. Hmmm...again. Here we go .. <style type=text/css>.theimg { background-image:url("http://print.google.com/print?id=n01ymdF9_vsC&pg=63&img=1&q=%22april+is+the+cruelest+month%22+eliot&sig=um9G1K1a9eRtUiXGbouQMCBQBco");background-repeat:no-repeat;background-position:center left;background-color:white; }</style> But this is a local file without a bg in the folder, so where is this img? Regards, Gary This post has been edited by cheapgary: Jul 11 2005, 10:31 PM |
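On the "where is this img?" question: the style rule quoted above points at an absolute URL, so the browser would fetch the page image from the remote server rather than from the local folder. A minimal Python sketch for pulling that URL out of the CSS and splitting its query string is below. The helper names are made up for illustration, what each parameter means on Google's side is not documented in this thread (that `pg` is a page number or that `img=1` requests the image is only a guess), and the 2005 URL may well no longer resolve.

```python
import re
from urllib.parse import parse_qs, urlsplit

# The style rule from the post above, shortened to the relevant declarations.
CSS = (
    '.theimg { background-image:url("http://print.google.com/print?'
    "id=n01ymdF9_vsC&pg=63&img=1&q=%22april+is+the+cruelest+month%22+eliot"
    '&sig=um9G1K1a9eRtUiXGbouQMCBQBco");background-repeat:no-repeat; }'
)


def background_image_urls(css_text):
    """Return every URL used in a background-image declaration."""
    pattern = r'background-image\s*:\s*url\(\s*["\']?(.*?)["\']?\s*\)'
    return re.findall(pattern, css_text)


def describe(url):
    """Split a URL into host, path and a flat dict of decoded query parameters."""
    parts = urlsplit(url)
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return parts.netloc, parts.path, params


host, path, params = describe(background_image_urls(CSS)[0])
print(host, path)
for key, value in params.items():
    print(f"  {key} = {value}")
```

The decoded `q` parameter carries the original search phrase, which hints that the server builds the page image per query, possibly with the highlighting already applied.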
|
|
|
Jul 12 2005, 02:22 AM
Post
#10
|
|
|
HR 3 ![]() ![]() ![]() Group: Active Members Posts: 76 Joined: 23-December 04 User's local time: Feb 9 2010, 04:37 PM Member No.: 6,037 |
BINGO ....!!!
"MOUNTAIN VIEW, Calif. - January 17, 2001 - Google Inc., developer of the award-winning Google search engine, today announced that the company has named Wayne Rosing, 54, as its new vice president of Engineering. Rosing, a Silicon Valley veteran, brings over a quarter century of engineering and research experience to the Mountain View-based company. As vice president of Engineering, Rosing will be responsible for managing Google's skilled engineering staff and guiding the company's technologic development. Rosing joins Google from his most recent position as chief technology officer and vice president of Engineering at Caere Corporation. While at Caere, Rosing managed all engineering for optical character recognition (OCR) product lines and was the driving force behind the acquisition of what has become one of Caere's key products, Omniform, a comprehensive forms application. ......." That's the company! |
|
|
|
Jul 12 2005, 02:59 AM
Post
#11
|
|
|
HR 3 ![]() ![]() ![]() Group: Active Members Posts: 76 Joined: 23-December 04 User's local time: Feb 9 2010, 04:37 PM Member No.: 6,037 |
Here's the email I was looking for. The program I saw in use at CHNC is even nicer than OmniPage. Now that I recall where I was in this, I found I have OmniPage on a disc that came with my Canon digital camera #2. I'm looking for OAP in use on another site.
So, Google IS doing it with OCR. I wonder if I can get them to do mine? Well, nice meetin' you all. Bye. You can go ahead and close this thread, too. Regards, Gary "Thanks for your feedback on Colorado's Historic Newspaper Collection ..... CHNC is using some fairly sophisticated (and expensive) software called Olive ActivePaper Archive (http://www.olivesoftware.com) that was developed for contemporary newspaper publishers and adapted for use with historic newspapers. To do basic OCR at home there are a number of products available on the market and we're not in the position to endorse one or the other. The ABBYY FineReader (www.abbyy.com) or ScanSoft OmniPage (www.omnipage.com) software is frequently used in library and archival applications. " |
|
|
|
Jul 12 2005, 11:03 AM
Post
#12
|
|
|
HR 6 ![]() ![]() ![]() ![]() ![]() ![]() Group: Active Members Posts: 940 Joined: 28-April 04 User's local time: Feb 9 2010, 04:37 PM From: London, Ontario Member No.: 3,389 |
It looks to me like the contents of Google Print are produced by scanning books and PDFs that are submitted directly for indexing. Very cool, but not a result of spidering.
L. |
|
|
|
Jul 12 2005, 11:23 AM
Post
#13
|
|
![]() High Rankings Advisor Group: Admin Posts: 29,201 Joined: 21-July 03 User's local time: Feb 9 2010, 03:37 PM From: Ashland, MA Member No.: 2 |
Yes, Lyn, I believe that is correct.
|
|
|
|
Jul 12 2005, 11:45 AM
Post
#14
|
|
![]() HR 10 Group: Moderator Posts: 7,489 Joined: 24-July 03 User's local time: Feb 9 2010, 03:37 PM From: Somerville, MA Member No.: 22 |
It is. I'm just curious about exactly what's displayed when I click through from a result. I don't think that's a PDF, and if it's a graphical file I don't know how they highlight text. I suppose it must be their own technology.
|
|
|
|
Jul 12 2005, 11:49 AM
Post
#15
|
|
![]() Convert Me! Group: Admin Posts: 17,380 Joined: 17-August 03 User's local time: Feb 9 2010, 02:37 PM Member No.: 551 |
True. OCR is being employed by Google in their push to scan all sorts of books from libraries and such. It's something they're interested in doing obviously.
That's completely different from what their spider can or can't do, though. They'll have to pull the files in-house to do anything exotic, which isn't happening on anything approaching a massive or routine scale yet. |
|
|
|