Optical character recognition, or OCR, is a method of converting a scanned
image into text. When a page is scanned, it is typically stored as a bitmapped
file in TIFF format. When the image is displayed on the screen, we can read it.
But to the computer, it is just a series of black and white dots. The computer
does not recognize any "words" on the image.
This is where OCR comes in. OCR looks at each line of the image and attempts to
determine whether the black and white dots represent a particular letter or
number. OCR was originally developed to help sight-impaired individuals gain
access to printed information. That same technology has been updated and
improved and is now used to "read" computer files.
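To make the idea of matching dots to letters concrete, here is a toy sketch in Python. It is only an illustration under simplifying assumptions: the three 5x3 letter templates, the `recognize` function, and the sample glyph are all invented for this example, and real OCR software is far more sophisticated than a dot-by-dot comparison.

```python
# Toy illustration only: real OCR engines are far more sophisticated.
# Each glyph is a 5x3 grid of dots; "#" is a black dot, "." is a white dot.
TEMPLATES = {
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    "L": ["#..", "#..", "#..", "#..", "###"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
}

def count_matching_dots(glyph, template):
    """Count positions where the scanned glyph and the template agree."""
    return sum(
        g == t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )

def recognize(glyph):
    """Return the template letter whose dots agree with the glyph most often."""
    return max(
        TEMPLATES,
        key=lambda letter: count_matching_dots(glyph, TEMPLATES[letter]),
    )

# A scanned "T" with one stray black dot, such as a speck on the page.
scanned = ["###", ".#.", ".#.", ".##", ".#."]
print(recognize(scanned))  # prints "T"
```

The stray dot does not change the answer here because "T" still agrees with the scanned glyph in more positions than any other template. Heavier damage to the image would eventually tip the result toward the wrong letter.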
OCR can be a very powerful tool for a law firm. The key is its ability to
produce a text version of the scanned documents. Once a text file has been
created, it then becomes possible to launch a text search and locate any page
with a given word or set of words.
For example, let's say you are working on a case that has 100 boxes of
documents. You need to find every page where the name John Jones appears. How do
you do it? Well, the traditional way is to have someone sit down and read each
and every piece of paper in all 100 boxes and pick out the pages that are relevant.
There are two obvious problems here. First, the task takes an enormous amount
of time, and every hour of that time must be paid for. Given the level of
expertise that the individual reviewing the documents must have, and the
associated payroll costs, this can be a very substantial cost to the firm.
Second, there is no guarantee that a critical page will not be missed.
Manually reading all of those pages is a very boring task. Fatigue, boredom, and
human error all but ensure that a page will be missed here and there. It is a
gamble whether the pages that are missed turn out to be important. Given the
huge amount of time involved in the task, no one is going to pay
for a second pass. Firms have just had to accept the fact that pages will be missed.
With OCR, though, this whole process is simplified and made more accurate.
Once the documents have been scanned and processed through the OCR module, there
is a text version of every page available. Now someone can launch a search for
John Jones and let the computer do the searching. It will find every page of
every document where that name appears in the text. The process may take some
time, depending upon how many pages are to be searched, but no matter how many
pages there are, there is no added labor cost. No one has to dedicate any time
to the process once it starts.
When the search is completed, it will have assembled a list of every page
from every document that contains the word or words that were used in the
search. Those pages can be selected, reviewed, or even printed. OCR is a great
research tool and can provide far better access to critical information than
manual searches can.
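The search step described above can be sketched in a few lines of Python. This is a minimal illustration, not any particular product's method: the `pages` dictionary, the file names, and the `find_pages` function are hypothetical, and the sketch assumes the OCR text for each page is already available as a string.

```python
import re

# Hypothetical OCR output: (document, page number) -> recognized text.
pages = {
    ("box001_doc01.tif", 1): "Memo from John Jones regarding the lease.",
    ("box001_doc01.tif", 2): "No names appear on this page.",
    ("box047_doc12.tif", 5): "Signed by JOHN\nJONES on the final page.",
}

def find_pages(pages, phrase):
    """Return sorted (document, page) keys whose text contains the phrase.

    Matching ignores case and treats any run of whitespace (including a
    line break in the OCR text) as a single space between words.
    """
    words = [re.escape(word) for word in phrase.split()]
    pattern = re.compile(r"\b" + r"\s+".join(words) + r"\b", re.IGNORECASE)
    return sorted(key for key, text in pages.items() if pattern.search(text))

for document, page in find_pages(pages, "John Jones"):
    print(document, page)
```

Allowing any whitespace between the words is a deliberate choice in this sketch, since a name that wraps across two lines on the page would otherwise be missed.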
However, it is important to understand the limitations and capabilities of
OCR. While it is a great tool, it is not perfect. The biggest factor in the
success or failure of an OCR process is the quality of the original documents.
It has been our experience that if the original documents were clean, laser
printed pages, OCR should read 98% or more of the words correctly. Some words
may not be read correctly if there is handwriting over them, or if there are
stamps or other marks that partially cover the text.
If the original documents were faxes, or multi-generational photocopies, or
were printed with a dot matrix printer, the success rate of OCR drops off
quickly. On these types of documents, OCR may read only 60% to 80% of the words
correctly. The same is true of even laser printed documents that have lots of
lines and boxes. The lines and boxes confuse OCR, because OCR tries to read the
lines as part of the text. If the original documents were handwritten, OCR will
NOT read the information at all. In spite of the claims of some OCR or ICR
(intelligent character recognition) companies, we have yet to see any software
that will successfully read "real" handwritten material. There are some packages
that will read carefully printed block letters, but this is not generally how
documents are written in the real world.
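These read rates matter for searching, because a search for a two-word name can only find an occurrence if both words were read correctly. The rough calculation below shows what the quoted rates would imply. It rests on a simplifying assumption that is not from the text above: each word is treated as being read correctly independently of its neighbors, which real documents may not obey since damage tends to cluster. The function name `phrase_hit_rate` is invented for this illustration.

```python
def phrase_hit_rate(word_accuracy, words_in_phrase):
    """Chance that every word of a phrase was read correctly.

    Assumes each word is read correctly independently of the others,
    which is a simplification made for this illustration.
    """
    return word_accuracy ** words_in_phrase

# Word-level read rates quoted in the text above.
for label, accuracy in [
    ("clean laser print", 0.98),
    ("fax or photocopy, high end", 0.80),
    ("fax or photocopy, low end", 0.60),
]:
    rate = phrase_hit_rate(accuracy, words_in_phrase=2)
    print(f"{label}: {rate:.0%} chance a two-word name is searchable")
```

Under that assumption, a given occurrence of a two-word name would be searchable about 96% of the time on clean originals, but only somewhere between 36% and 64% of the time on poor ones. A search over poor originals should therefore not be treated as complete.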
In summary, OCR is a very powerful tool for research within a law firm. It
can turn vast amounts of scanned paper into textual data that can then be
searched. As long as the limitations of OCR are understood, it can be of
great benefit to law firms with cases of any size.