Amazon Textract brings intelligence to OCR

One of the challenges just about every business faces is converting forms to a useful digital format. This has typically meant paying data entry clerks to key the data in by hand. The state of the art has been to use OCR to read forms automatically, but AWS CEO Andy Jassy explained that OCR is basically just a dumb text reader: it doesn’t recognize what type of text it is reading. Amazon wanted to change that, and today it announced Amazon Textract, an intelligent OCR tool that moves data from forms into a more usable digital format.

In one example, Jassy showed a form containing a table. Regular OCR didn’t recognize the table and interpreted it as a single string of text. Textract is designed to recognize common page elements such as tables and to extract the data in a way that preserves their structure.

Jassy said that forms also change often, and if you are using a template as a workaround for OCR’s lack of intelligence, the template breaks as soon as anything on the form moves. To fix that, Textract is smart enough to understand common data types like Social Security numbers, dates of birth and addresses, and it interprets them correctly no matter where they fall on the page.

“We have taught Textract to recognize this set of characters is a date of birth and this is a Social Security number. If forms change, Textract won’t miss it,” Jassy explained.
