Friday, October 31, 2008

Google Now Searches Scanned Documents.. Wow..!

If you've ever had trouble finding scanned documents on Google, it's probably because Google was not indexing them. On Thursday, that all changed: Google announced that it is now indexing scanned documents.

Google is now able to perform optical character recognition (OCR) on any scanned document it finds stored in the PDF format. OCR technology is able to "read" a scanned document and convert it into words that can be searched and indexed.

To people reading these documents, the distinction between words and pictures of words makes little difference, but for a computer the picture is almost unintelligible. Consider a circle. Should it be read as a zero, the letter 'O', just a circle, or the ring from my coffee cup? People learn to answer this kind of question very quickly, but for the computer it is a painstaking and error-prone process. Now to apply it to all scanned PDF images on the Internet? Very impressive.
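To make the circle problem a little more concrete, here is a toy Python sketch of one way a program might guess between a zero and an 'O' by looking at the characters on either side. This is purely my own illustration and not how Google's OCR (or any real OCR engine) works: the function names `read_circle` and `read_word` and the `"CIRCLE"` placeholder are all made up for the example.

```python
from typing import List


def read_circle(before: str, after: str) -> str:
    """Guess whether a circular glyph is the digit '0' or the letter 'O'.

    `before` and `after` are the already-recognized neighbouring characters
    (empty string if unknown). Toy heuristic only, not a real OCR method.
    """
    neighbours = before + after
    if any(ch.isdigit() for ch in neighbours):
        return "0"  # sitting next to a digit: probably a zero
    if any(ch.isalpha() for ch in neighbours):
        return "O"  # sitting next to a letter: probably the letter O
    return "?"      # no context: zero, O, or a coffee ring


def read_word(glyphs: List[str]) -> str:
    """Resolve every ambiguous circle (marked 'CIRCLE') in a list of glyphs."""
    out = []
    for i, glyph in enumerate(glyphs):
        if glyph != "CIRCLE":
            out.append(glyph)
            continue
        # Only use neighbours that are themselves already recognized.
        before = glyphs[i - 1] if i > 0 and glyphs[i - 1] != "CIRCLE" else ""
        after = (
            glyphs[i + 1]
            if i + 1 < len(glyphs) and glyphs[i + 1] != "CIRCLE"
            else ""
        )
        out.append(read_circle(before, after))
    return "".join(out)


print(read_word(["2", "CIRCLE", "CIRCLE", "8"]))            # 2008
print(read_word(["G", "CIRCLE", "CIRCLE", "g", "l", "e"]))  # GOOgle
print(read_word(["CIRCLE"]))                                # ?
```

Even this tiny version shows why the process is error-prone: a lone circle with no neighbours cannot be resolved at all, and a string that mixes letters and digits would fool the heuristic easily.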

Official Google Blog: A picture of a thousand words?
