Monday, May 26, 2008

Diverse Image Sources Challenge Traditional Document OCR/ICR

I’ve seen two growing trends in document processing: the increasing use of smaller, decentralized scanners, and the loss of control over how precisely source documents are printed.

Together, these trends are causing a lot of grief for long-established industries and solutions that depend on precisely measured field positions to perform OCR/ICR.

For example, years ago, state tax agencies would lay out their tax forms and contract with print shops to produce them. The paper forms were then sent back to the agency, where they were scanned on a few “big iron” scanners.
Today, each form is still printed, perhaps at more print shops due to competitive requirements. But the “same” forms are also photocopied by individuals, in color or black-and-white, and printed from downloaded PDF files, again in color or black-and-white and at various scaling factors.

Another example is insurance agencies, where desktop scanners are used at each agency to scan and send documents to an insurance company’s central processing. Again, images may be scanned at 200 dpi or 300 dpi; in color, grayscale, or bi-tonal; and output as TIFF, PDF, JPEG, and who knows what other format-du-jour.

While no one has had to toss out their existing traditional OCR/ICR capture technology, all these variations take a huge amount of extra effort to deal with. I’ve seen traditional OCR/ICR systems where a single logical form has to be implemented four separate times to handle these differences.
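To make the problem concrete, here is a rough Python sketch of why one fixed set of field coordinates stops working once resolution and print scaling vary. Everything in it is an assumption for illustration: the `Zone` class, the `rescale_zone` function, and the field coordinates are hypothetical and not taken from any particular OCR/ICR product. It also assumes the page is scaled uniformly from its top-left corner, which real printouts often violate (margins, skew, and "fit to page" offsets are not handled).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Zone:
    """A field rectangle in pixels (hypothetical structure for illustration)."""
    left: int
    top: int
    width: int
    height: int


def rescale_zone(zone, template_dpi, image_dpi, print_scale=1.0):
    """Map a zone measured on the template to an image with a different
    scan resolution and/or print scaling factor.

    Assumes uniform scaling from the top-left corner of the page.
    """
    def scale(value):
        # Multiply before dividing to keep the arithmetic as exact as possible.
        return round(value * image_dpi * print_scale / template_dpi)

    return Zone(
        left=scale(zone.left),
        top=scale(zone.top),
        width=scale(zone.width),
        height=scale(zone.height),
    )


# Hypothetical field, measured on a 300 dpi template image.
field_at_300 = Zone(left=900, top=450, width=600, height=90)

# The same form scanned on a 200 dpi desktop scanner.
field_at_200 = rescale_zone(field_at_300, template_dpi=300, image_dpi=200)

# The same form printed from a PDF at 95% scale, then scanned at 300 dpi.
field_shrunk = rescale_zone(field_at_300, template_dpi=300, image_dpi=300,
                            print_scale=0.95)

print(field_at_200)
print(field_shrunk)
```

Even this simple case needs the scan resolution and the print scale to be known up front, and in practice the print scale usually is not. That is a big part of why a single logical form ends up with several hand-built variants.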

Does anyone else have a story to tell here, especially one with a happy ending? I know there are different approaches to OCR/ICR capture that don’t have problems with these kinds of variations. Has anyone tried them?

Paul Traite, ICP

2 comments:

Henrico Dolfing said...

First of all, I fully agree with your observations about the trends in document processing.

I think the solution to this problem is:

1) Treating "forms" as semi-structured documents instead of traditional forms. And there are some OCR/ICR engines that handle semi-structured documents pretty well.

2) Focus on image pre-processing when choosing your solution as well. Image pre-processing becomes more and more important to supply the OCR/ICR engine with the input it handles best.

Paul Traite said...

Glad I'm not the only one out here seeing this.

I'd make a distinction between OCR/ICR engines, which recognize either full pages or fields in pre-defined locations, and higher-level document capture (DC) systems.

DC systems are built on top of OCR/ICR engine(s), and use them to dynamically find data on semi-structured (and unstructured) documents.

There are a number of OCR/ICR engine suppliers, like ABBYY, OCE/Captaris, and Nuance/Scansoft. In the last few years, all of them have been trying to move "upscale" into the document processing arena. I've seen varying levels of success as this group moves from providing toolkits for DC vendors to providing fuller solutions that diverse business IT departments can use to implement capture of their own business forms.