Google has announced that it will now begin including scanned documents in its search results – a feat that requires an immense amount of processing power and advanced image recognition technology. Unlike standard text documents, scanned files are just images of pages, so they don’t contain any text data that Google’s spiders can index. To get around this, Google is using Optical Character Recognition (OCR) technology, which converts images of words into machine-readable text.
In the past, Google would attempt to index these image files as well as possible, but could typically search only file titles and nearby metadata – not the contents of the documents. From now on, Google searches will include the text within these scanned documents in normal search results. When you encounter a scanned document, you’ll be able to view it in its original form as a PDF, or as a converted text version (click “View As HTML”).
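To make the difference concrete, here is a rough Python sketch of searching by title and metadata only versus searching the recognized text as well. This is only an illustration, not how Google actually does it: the two documents, the field names (`title`, `metadata`, `ocr_text`) and the helper functions are all made up, and the `ocr_text` values stand in for whatever an OCR step would produce.

```python
import re
from collections import defaultdict

# Hypothetical scanned documents. "ocr_text" stands in for the output of an
# OCR step; no actual image recognition happens in this sketch.
DOCS = [
    {"id": 1, "title": "scan_0042.pdf", "metadata": "uploaded 2008",
     "ocr_text": "Repairing aluminum wiring in older homes"},
    {"id": 2, "title": "wiring-guide.pdf", "metadata": "home repair",
     "ocr_text": "Copper conductors and circuit breakers"},
]

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs, fields):
    """Build an inverted index (token -> set of doc ids) over the given fields."""
    index = defaultdict(set)
    for doc in docs:
        for field in fields:
            for token in tokenize(doc[field]):
                index[token].add(doc["id"])
    return index

def search(index, query):
    """Return the ids of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Old behaviour (roughly): only titles and nearby metadata are searchable.
metadata_index = build_index(DOCS, ["title", "metadata"])
# New behaviour (roughly): the recognized text is searchable too.
full_index = build_index(DOCS, ["title", "metadata", "ocr_text"])

print(search(metadata_index, "aluminum wiring"))
print(search(full_index, "aluminum wiring"))
```

With the metadata-only index, a query for "aluminum wiring" cannot find document 1, because those words appear only inside the scanned page. Once the recognized text is indexed as well, the same query matches it.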
Such technology has existed for quite a while, but accuracy has always been an issue – and the fact that Google is doing it on such a massive scale makes it a very impressive accomplishment. It also opens the door to much more thorough searching, especially for content that is often found only in printed documents (like academic papers).
Here’s an example (the first result is a scanned document): Repairing Aluminum Wiring
For more, check out the announcement here.