
Ex. 1007, p. 1 of 41


Executive Briefing
Document Image Analysis
Lawrence O'Gorman
Rangachar Kasturi
Los Alamitos, California


In the late 1980s, the prevalence of fast computers, large computer memory, and
inexpensive scanners fostered an increasing interest in document image analysis. With
many paper documents being sent and received via fax machines and stored digitally
in large document databases, interest grew in doing more than simply viewing and
printing these images. Research was undertaken and commercial systems built to read
text on a page, to find fields on a form, and to locate lines and symbols on a diagram.
Today, we see the results of this research and development in document processing
and optical character recognition (OCR). OCR is used by post offices to automatically
route mail; engineering diagrams are extracted from paper for computer storage and
modification; handheld computers recognize symbols and handwriting for use in
niche markets such as inventory control. In the future, applications such as these will
be improved, and other document applications will be added. For instance, the
millions of paper volumes now in libraries will be replaced by computer files of page
images that can be searched for by content and accessed by many people
simultaneously, and they will never be misshelved. Businesspeople will carry their file
cabinets in their portable computers; paper copies of new product literature, receipts, or
other random notes will be instantly filed and accessed in the computer; and
signatures will be analyzed by the computer for verification and security access.
This book describes some of the technical methods and systems used for document
processing of text and graphics images. The methods have grown out of the
fields of digital signal processing, digital image processing, and pattern recognition.
The objective is to give the reader an understanding of what approaches are used for
application to documents and how these methods apply to different situations. Since
the field of document processing is relatively new, it is also dynamic; in other words,
current methods have room for improvement, and innovations are still being made. In
addition, there are rarely definitive techniques for all cases of a certain problem.
The intended audience is executives, managers, and other decision makers whose
businesses require some acquaintance or understanding of document processing. (We
call this group "executives" in accordance with the Executive Briefing series.) Some
rudimentary knowledge of computers and computer images will be helpful
background for these readers. We begin with basic principles (such as defining pixels), but
[...] technique operates and not necessarily knowledge of picture processing. A grasp of
the terminology goes a long way toward aiding the executive in understanding the
technology and processes discussed in each chapter. For this reason, each section
begins with a list of keywords. With knowledge of the terminology and whatever
depth of method or system understanding that he or she decides to take from the text,
the executive should be well equipped to deal with document-processing issues.
In each chapter, we attempt to identify major problem areas and to describe more
than one method applied to each problem, as well as the advantages and
disadvantages of each method. This gives an understanding of the problems and also the nature
of trade-offs that so often must be made in choosing a method. With this
understanding of the problem and a knowledge of the methodology options, an executive will
have the technical background and context with which to ask questions, judge
recommendations, weigh options, and make decisions.
We include technology descriptions and references to the technical papers that
best give details on the techniques presented in the book. The technology descriptions
are detailed enough for one to understand the methods; if implementation is desired,
the references will facilitate this. Popular as well as accepted methods are presented
so that the executive can compare a variety of options. In many cases, some of the
options are advanced methods not currently used in commercial products. Depending
[...] processing. These are described from the applications viewpoint to give concrete
examples of where and how the methods are implemented.
The book is organized in the sequence that document images are usually
processed. After document input by digital scanning, pixel processing is first performed.
This level of processing includes operations that are applied to all image pixels. These
include noise removal, image enhancement, and segmentation of image components
into text and graphics (lines and symbols). Feature-level analysis treats groups of
pixels as entities and includes line and curve detection and shape description. The last [...]
Chapter 1
What Is a Document Image and What Do We Do with It?
Figure 1.1 Hierarchy of document processing subareas listing the types of document components in each sub[...]
(Figure labels: Textual Processing; Graphical Processing; Line Processing; Optical Character [...];
Region and [...]; Skew, Text Lines, Text Blocks, and [...]; Straight Lines, Corners, and Curves;
Filled Regions.)
[...] analysis techniques, the megabytes of initial data are culled to yield
[...] semantic description of the document.
It is not difficult to find examples of the need for document analysis. Look around
the workplace and you will see stacks of paper documents. Some may be computer
generated, though invariably by different computers and software, and their electronic
formats may be incompatible. Some will include both formatted text and tables, as
well as handwritten entries, and they differ in size, from 3.5 x 2 in. (8.89 x 5.08 cm)
business cards to 34 x 44 in. (86 x 111 cm) engineering drawings. In many businesses
today, imaging systems are used to store images of pages so that storage and retrieval
are more efficient. Future document analysis systems will recognize types of
documents, enable the extraction of their functional parts, and be able to translate from one
computer generated format to another. Many other examples exist of the use of and
need for document systems. Glance behind the counter in a post office at the mounds
of letters and packages. In some U.S. post offices, over a million pieces of mail must
be handled each day. Machines to perform sorting and address recognition have been
used for several decades, but there is still the need to process more mail, more quickly,
and more accurately. Examine the stacks of a library, where row after row of paper
documents are stored. Loss of material, misfiling, limited numbers of each copy, and
even degradation of materials are common problems and may be improved by
document analysis techniques. All of these examples serve as applications ripe for the
potential solutions of document image analysis.
Though document image analysis has been in use for a couple of decades
(especially in the banking business for computer reading of numeric check codes), only in
the late 1980s and early 1990s has the field grown rapidly. The predominant reason for
this is the greater speed and lower cost of hardware now available. Since fax machines
have become ubiquitous, the cost of optical scanners for document input has dropped
to the level that these are affordable to even small businesses and individuals.
Although document images contain a relatively large amount of data, even personal
computers now have adequate speed to process them. Computer main memory also is
now adequate for large document images; more importantly, however, optical memory
is now available for mass storage of large amounts of data. This improvement in
hardware, and the increased use of computers for storing paper documents, has led to
increasing interest in improving the technology of document processing and
recognition. The advancements being made in document analysis software and algorithms are
an essential complement to these hardware improvements. With OCR recognition
rates now in the mid to high 90 percent range, and other document processing
methods achieving similar improvements, these advances in research have also driven
document image analysis forward.
As improvements in technology continue, document systems will become
increasingly common. For instance, OCR systems will be more widely used to
store, search, and excerpt from paper-based documents. Page layout analysis
techniques will recognize a particular form or page format and allow its duplication.
Diagrams will be entered from pictures or by hand and logically edited.
[...] computers will translate handwritten entries into electronic documents. Archives of
paper documents in libraries and engineering companies will be electronically
converted for more efficient storage and instant delivery to a home or office computer.
Although it will be increasingly the case that documents are produced and reside on a
computer, because there are many different systems and protocols and because paper
is a very comfortable medium for us to deal with, paper documents will be with us to
some degree for many decades to come. The difference will be that they will finally be
integrated into our computerized world.
1.1 Hardware Advancements and the Evolution of
Document Image Analysis
The history of document image analysis can be traced through a computer lineage that
includes digital signal processing and digital image processing. Digital signal
process[...] computer vision for processing images of three-dimensional scenes used in robotics.
In the mid- to late-1980s, document image analysis began [...]
this was predominantly due to hardware advancements enabling processing to be
performed at a reasonable cost and speed. Whereas a speech signal [...]
in frames of 256 samples long [...]
document image is from 2,550 x 3,300 pixels for a business letter [...]
dots per inch (dpi) (12 dots per millimeter) to 34,000 x 44,000 pixels for a 34 x 44 in. [...]
systems are now available for storing business forms, performing OCR on typewritten
text, and compressing engineering drawings. Document analysis research continues to
pursue more intelligent handling of documents, better compression, especially
through component recognition, and faster processing.
[...]ment analysis
[...] that the image of the document contains only raw data that must be further analyzed
to glean the information. For example, Figure 1.3 shows the image of the letter e. This
is a pixel array of ON or OFF values whose shape is known to humans as the letter e;
however, to a computer it is just a string of bits in computer memory.
1.2.1 Pixel-Level Processing [Chapter 2]
This stage of document image analysis includes binarization, noise reduction, signal
enhancement, and segmentation. For gray-scale images with information that is inher[...]
Figure 1.2 A typical sequence of steps for document analysis, with examples of intermediate and final results
and data size.
(Figure labels: Document Page; Data Capture: 10^7 pixels; 7,500 Character Boxes, Each About
15 x 20 Pixels; 500 Line and Curve Segments Ranging from 20 to 2,000 Pixels Long; 10 Filled
Regions Ranging from 20 x 20 to 200 x 200 Pixels; 7,500 x 10 Character Features; 10 x 5 Region
Features; Text Analysis and Recognition: 1,500 Words, 10 Paragraphs, 1 Title, 2 Subtitles, and
so on; Graphics Analysis and Recognition: 2 Line Diagrams, 1 Company Logo, and so on;
Document Description.)
[...] background and whi[...]
[...]d by performing [...]
Segmentation occurs on two levels. On the first level, segmentation occurs if the
document contains both text and graphics; these are separated for subsequent
processing by different methods. On the second level, segmentation is performed on text
by locating columns, paragraphs, words, titles, and captions, and on graphics by
separating symbol and line components. For instance, in a page containing a flow chart
with an accompanying caption, text and graphics are first separated. Then the text is
separated into that of the caption and that of the chart. The graphics are separated into
rectangles, circles, connecting lines, and so on.
1.2.2 Feature-Level Analysis [Chapter 3]
In a text image, the global features describe each page and consist of skew (the tilt at
which the page has been scanned), line lengths, line spacing, and so on. Local features
describe individual characters and consist of font size, number of loops in a character,
number of crossings, accompanying dots, and so on.
In a graphical image, global features describe the skew of the page, the line
widths, range of curvature, minimum line lengths, and so on. Local features describe
each corner, curve, and straight line, as well as the rectangles, circles, and other
geometric shapes.
1.2.3 Recognition of Text and Graphics [Chapters 4 and 5]
The final step in document image analysis is recognition and description: components
are assigned a semantic label and the entire document is described as a whole.
Domain knowledge is used most extensively at this stage. The result is a description
of a document as a human would give it. For a text image, we refer, for example, not to
pixel groups or blobs of black on white, but to titles, subtitles, bodies of text, and
footnotes. Depending on the arrangement of these text blocks, a page of text may be a title
page of a paper, a table of contents of a journal, a business form, or the face of a mail
piece. For a graphical image, an electrical circuit diagram for instance, we refer not to
lines joining circles and triangles and other shapes, but to connections between AND
gates, transistors, and other electronic components. The components and their
connections describe a particular circuit that has a purpose in the known domain. It is this
semantic description that is most efficiently stored and most effectively used for
common tasks, such as indexing and modifying particular document components.
Chapter 2
Preparing the Document Image
2.1
Data capture of documents by optical scanning or by digital video yields a file of
picture elements, or pixels, that is the raw input to document analysis. These pixels are
samples of intensity values taken in a grid pattern over the document page, where the
intensity values may be OFF (0) or ON (1) for binary images, 0 to 255 for gray-scale
images, and three channels of 0 to 255 color values for color images. The first step in
document analysis is to perform processing on this image to prepare it for further
analysis. Such processing includes thresholding to reduce a gray-scale or color image
to a binary image, reduction of noise to reduce extraneous data, and thinning and
region detection to enable easier subsequent detection of pertinent features and
objects of interest. This pixel-level processing (also called preprocessing and
low-level processing in other literature) is the subject of this chapter.
2.2 Thresholding
Keywords: thresholding, binarization, global thresholding, adaptive thresholding,
intensity histogram
In this treatment of document processing, we deal with images containing text and
graphics of binary information; that is, these images contain a single foreground
level that is the text and graphics of interest and a single background level upon which
the foreground contrasts. We will also refer to the foreground as objects, regions of
interest, or components. (Of course, documents may also contain true gray-scale [or color]
information, such as in photographic figures; however, except for recognizing the
presence of a gray-scale picture in a document, we leave the analysis of pictures to the
more general fields of image analysis and machine vision.) Although the information
is binary, the data, in the form of pixels with intensity values, are not likely to have
only two levels; instead, they have a range of intensities. This may be due to
non-uniform printing or non-uniform reflectance from the page or a result of intensity
transitions at the region edges that are located between foreground and background regions.
The objective in binarization is to mark pixels that belong to true foreground regions
with a single intensity (ON) and background regions with a different intensity (OFF).
Figure 2.1 illustrates the results of binarizing a document image at different threshold
values. The ON values are shown in black in Figure 2.1, and the OFF values are in
white.
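
As a minimal sketch of the binarization step described above (not the authors' implementation), the following Python fragment marks each pixel ON or OFF against a fixed threshold. It assumes the intensity convention of Figure 2.1, in which low values are white background and high values are black foreground; the function name `binarize` and the sample patch are hypothetical.

```python
def binarize(gray, threshold):
    """Return a binary image: 1 (ON) where intensity > threshold, else 0 (OFF).

    Assumes low intensities are white background and high intensities are
    black foreground, as on the histogram axis of Figure 2.1.
    """
    return [[1 if value > threshold else 0 for value in row] for row in gray]


# Hypothetical gray-scale patch: light background (20-40), a dark vertical
# stroke (200-230), and blurred edge pixels (90-120) on either side of it.
patch = [
    [20, 30, 110, 210, 220, 100, 25],
    [25, 35, 120, 230, 215, 90, 30],
    [30, 40, 100, 200, 225, 115, 20],
]

for threshold in (60, 150, 222):
    on_pixels = sum(map(sum, binarize(patch, threshold)))
    print(f"threshold={threshold:3d}  ON pixels={on_pixels}")
```

On this patch the three thresholds give 12, 6, and 2 ON pixels: the low threshold turns the blurred edge pixels ON and thickens the stroke, while the high threshold erodes it, which is the effect Figure 2.1 illustrates.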
For documents with a good contrast of components against a uniform background,
binary scanners are available that combine digitization with thresholding to
yield binary data; however, for the many documents that have a wide range of
background and object intensities, this fixed threshold level often does not yield images
with clear separation between the foreground components and the background. For
instance, when a document is printed on differently colored paper, when the
foreground components are faded due to photocopying, or when different scanners have
different light levels, the best threshold value will also be different. For these cases,
there are two alternatives. One is to empirically determine the best binarization setting
on the scanner (most binary scanners provide this adjustment) and to do this each time
an image is poorly binarized. The other alternative is to start with gray-scale images
(having a range of intensities, usually from 0 to 255) from the digitization stage and
then to use methods for automatic threshold determination to better perform
binarization. Although the latter alternative requires more input data and processing, the
advantage is that a good threshold level can be found automatically, ensuring
consistently good images and precluding the need for time-consuming manual adjustment
and repeated digitization. The following discussion presumes initial digitization to
gray-scale images.
If the pixel values of the components and those of the background are fairly
consistent in their respective values over the entire image, a single threshold value can be
found for the image. This use of a single threshold for all image pixels is called global
thresholding. Processing methods will be described that automatically determine the
best global threshold value for different images. For many documents, however, a
single global threshold value cannot be used even for a single image due to
non-uniformities within foreground and background regions. For example, for a document
containing white background areas as well as highlighted areas of a different
background color, the best thresholds will change by area. For this type of image, different
threshold values are required for different local areas; this is adaptive thresholding
and will also be described.
Figure 2.1 Image binarization. (a) Histogram of original gray-scale image; horizontal axis shows markings
for threshold values of images below. The lower peak is for the white background pixels, and the
upper peak is for the black foreground pixels. Image binarized with: (b) threshold value too low,
(c) good threshold value, and (d) threshold value too high.
(Panel titles: a. Histogram, with horizontal axis Intensity Value running from white to 255 (black);
b. Low Threshold; c. Good Threshold; d. High Threshold.)
2.2.1 Global Thresholding
The most straightforward way to automatically select a global threshold is to use a
histogram of the pixel intensities in the image. The intensity histogram plots the
number of pixels with values at each intensity level. See Figure 2.1 for a histogram of a
document image. For an image with well-differentiated foreground and background
intensities, the histogram will have two distinct peaks. The valley between these peaks
can be found as the minimum between two maxima, and the intensity value there is
chosen as the threshold that best separates the two peaks.
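
The valley search can be sketched as follows. This is an illustrative simplification rather than a method taken from the references: the smoothing radius, the minimum peak separation, and all function names are assumptions made for the example, and the second peak is simply taken as the tallest bin far enough from the first.

```python
def intensity_histogram(gray, levels=256):
    """Count how many pixels take each intensity value."""
    hist = [0] * levels
    for row in gray:
        for value in row:
            hist[value] += 1
    return hist


def smooth(hist, radius=2):
    """Moving-average smoothing so small ripples are not mistaken for peaks."""
    n = len(hist)
    smoothed = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        smoothed.append(sum(hist[lo:hi]) / (hi - lo))
    return smoothed


def valley_threshold(hist, radius=2, min_separation=20):
    """Return the intensity at the minimum between the two tallest peaks."""
    h = smooth(hist, radius)
    bins = range(len(h))
    first_peak = max(bins, key=lambda i: h[i])
    # Assumption: the second peak lies at least min_separation bins away.
    far_bins = [i for i in bins if abs(i - first_peak) >= min_separation]
    second_peak = max(far_bins, key=lambda i: h[i])
    left, right = sorted((first_peak, second_peak))
    lowest = min(h[left:right + 1])
    # If the valley floor is flat, take the middle of it.
    valley_bins = [i for i in range(left, right + 1) if h[i] == lowest]
    return valley_bins[len(valley_bins) // 2]


# Hypothetical page: mostly light background (40) with some dark text (200).
gray = [[40] * 6 + [200] * 2 for _ in range(4)]
print("valley threshold:", valley_threshold(intensity_histogram(gray)))
```

When the foreground peak is very small relative to the background peak, the tallest "far" bin may be a shoulder of the background peak rather than the true foreground peak, so a sketch like this can fail on sparse images.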
There are a number of drawbacks to global threshold selection based on the shape
of the intensity distribution. The first drawback is that images do not always contain
well-differentiated foreground and background intensities because of poor contrast
and noise. A second drawback is that, especially for an image of sparse foreground
components, such as for most graphics images, the peak representing the foreground
will be much smaller than the peak of the background intensities. This often makes it
difficult to find the valley between the two peaks. In addition, reliable peak and valley
detection are separate problems unto themselves. One way to improve this approach is
to compile a histogram of pixel intensities that are weighted by the inverse of their
edge-strength values [1]. Region pixels with low edge values will be weighted more
highly than boundary and noise pixels with higher edge values, thus sharpening the
histogram peaks due to these regions and facilitating threshold detection between
them. An analogous technique is to highly weight intensities of pixels with high edge
values, then choose the threshold at the peak of this histogram corresponding to the
transition between regions [2]. This requires peak detection of a single maximum, and
this is often easier than valley detection between two peaks. This approach also
reduces the problem of large size discrepancy between foreground and background
region peaks because edge pixels are accumulated on the histogram instead of region
pixels; the difference between a small and large size area is a linear quantity for edges
versus a much larger squared quantity for regions. A third method uses a Laplacian
weighting. The Laplacian is the second derivative operator, which highly weights
transitions from regions into edges (the first derivative highly weights edges). This
will highly weight the border pixels of both foreground regions and their surrounding
backgrounds, and because of this the histogram will have two peaks of similar area.
Although these histogram-shape techniques offer the advantage that peak and valley
detection are intuitive, peak detection is still susceptible to error due to noise and
poorly separated regions. Furthermore, when the foreground or background region
consists of many narrow regions, such as for text, edge and Laplacian measurement
may be poor due to very abrupt transitions (narrow edges) between foreground and
background.
A number of methods determine foreground and background classes by using
formal pattern recognition techniques that optimize some measure of separation. One
method is minimum error thresholding [3, 4] (Figure 2.2). Here, the foreground and
background intensity distributions are modeled as normal (Gaussian or bell-shaped)
probability density functions. For each intensity value (from 0 to 255, or a smaller
range if the threshold is known to be limited to it), the means and variances are
calculated for the foreground and background classes, and the threshold is chosen such that
the misclassification error between the two classes is minimized. Minimum error
thresholding is classified as a parametric technique because of the assumption that the
gray-scale distribution can be modeled as a probability density function. This is a
popular method for many computer vision applications, but some experiments indicate
that documents do not adhere well to this model; thus, results with this method are
poorer than those of nonparametric approaches [5]. One nonparametric approach is Otsu's
method [6, 7]. Calculations are first made of the ratio of between-class variance to
within-class variance for each potential threshold value. The classes here are the
foreground and background pixels, and the purpose is to find the threshold that maximizes
the variance of intensities between the two classes and minimizes it within each
class. This ratio is calculated for all potential threshold levels, and the level at which
the ratio is maximum is the chosen threshold. An approach similar to Otsu's employs
an information theory measure, entropy, which is a measure of the information in the
image expressed as the average number of bits required to represent the information
[5, 8]. Here, the entropy for the two classes is calculated for each potential threshold,
and the threshold where the sum of the two entropies is largest is chosen as the best
threshold. Moment preservation is another thresholding approach [9]. This method is
less popular than the preceding ones; however, we have found it to be more effective
in binarizing document images containing text. In the moment preservation method, a
threshold is chosen that best preserves moment statistics in the resulting binary image
as compared with the initial gray-scale image. These moments are calculated from the
intensity histogram; the first four moments are required for binarization.
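
The following sketch shows one way to compute a threshold in the spirit of Otsu's method [6, 7]. Because total variance is the sum of between-class and within-class variance, the threshold that maximizes their ratio is the same one that maximizes between-class variance alone, and that simpler criterion is what the code evaluates. The function name and the sample histogram are hypothetical.

```python
def otsu_threshold(hist):
    """Return t maximizing between-class variance for classes <= t and > t."""
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_score = 0, -1.0
    w0 = 0    # number of pixels with intensity <= t
    sum0 = 0  # sum of their intensities
    for t, count in enumerate(hist):
        w0 += count
        sum0 += t * count
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty; no valid split here
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance, up to the constant factor 1 / total**2.
        score = w0 * w1 * (mean0 - mean1) ** 2
        if score > best_score:
            best_score, best_t = score, t
    return best_t


# Hypothetical bimodal histogram: background near 40, foreground near 200.
hist = [0] * 256
for level, count in {38: 10, 39: 40, 40: 100, 41: 40, 42: 10,
                     198: 5, 199: 20, 200: 50, 201: 20, 202: 5}.items():
    hist[level] = count

print("Otsu threshold:", otsu_threshold(hist))
```

Unlike the valley search, this criterion needs no peak detection, so it still returns a usable threshold when one peak is much smaller than the other.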
Many images have more than two levels. For instance, magazines often employ
boxes to highlight text; the background of the box has a different color than the white
background of the page. In this case, the image has three levels: background,
foreground text, and background of highlight box. To properly threshold an image of this
type, multithresholding must be performed. There are fewer multithresholding
methods than binarization methods. Most require that the number of levels is known (for
example, [6]). For the cases where the number of levels is not known beforehand, one
method [16] will determine the number of levels automatically and perform
appropriate thresholding. This added level of flexibility may sometimes lead to unexpected
results; for example, a magazine cover with three intensity levels may be thresholded
to four levels because of the presence of an address label that is thresholded at a
separate level.
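
For the case where the number of levels is known to be three, a between-class variance criterion can be extended to two thresholds by exhaustive search. The sketch below is an assumption-laden illustration of that idea, not a reproduction of the procedure in [6] or [16]; the function name and the sample histogram are hypothetical.

```python
def two_threshold_search(hist):
    """Return (t1, t2) maximizing between-class variance for three classes:
    intensities <= t1, t1 < intensities <= t2, and intensities > t2."""
    n = len(hist)
    # Prefix sums of pixel counts and of intensity-weighted counts.
    cum_w, cum_s = [0], [0]
    for level, count in enumerate(hist):
        cum_w.append(cum_w[-1] + count)
        cum_s.append(cum_s[-1] + level * count)
    mean_all = cum_s[-1] / cum_w[-1]

    def class_term(lo, hi):
        """Weighted squared distance of the class mean (bins lo..hi-1)
        from the overall mean, or None if the class is empty."""
        weight = cum_w[hi] - cum_w[lo]
        if weight == 0:
            return None
        mean = (cum_s[hi] - cum_s[lo]) / weight
        return weight * (mean - mean_all) ** 2

    best = (-1.0, 0, 0)
    for t1 in range(n - 2):
        low = class_term(0, t1 + 1)
        if low is None:
            continue
        for t2 in range(t1 + 1, n - 1):
            mid = class_term(t1 + 1, t2 + 1)
            high = class_term(t2 + 1, n)
            if mid is None or high is None:
                continue
            score = low + mid + high
            if score > best[0]:
                best = (score, t1, t2)
    return best[1], best[2]


# Hypothetical three-level page: white paper (30), highlight box (120), text (220).
hist = [0] * 256
hist[30], hist[120], hist[220] = 100, 60, 40
print("thresholds:", two_threshold_search(hist))
```

The search cost grows quickly with the number of thresholds, which is one practical reason such methods usually require the number of levels to be fixed in advance.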
Figure 2.2 Illustration of misclassification error in thresholding. Left, intensity histogram showing
foreground and background peaks; right, the tails of the foreground and background populations have
been extended to show the intensity overlap of the two populations. This overlap makes it
impossible to correctly classify all pixels using a single threshold. The minimum-error method of
threshold selection minimizes the total misclassification error.
(Figure labels: Number of Pixels versus Intensity; Foreground; Area of Misclassification.)
2.2.2 Adaptive Thresholding
A common way to perform adaptive thresholding is by analyzing gray-level
intensities within local windows across the image to determine local thresholds [10, 11].
White and Rohrer [12] describe an adaptive thresholding algorithm for separating
characters from background. The threshold is continuously changed through the
image by estimating the background level as a two-dimensional running average of
local pixel values taken for all pixels in the image (Figure 2.3). Mitchell and Gillies
[13] describe a similar thresholding method where background white-level
normalization is first done by estimating the white level and subtracting this level from the raw
image. Segmentation of characters is accomplished by applying a range of thresholds
and selecting the resulting image with the least noise content. Noise content is
measured as the sum of areas occupied by components that are smaller and thinner than
empirically determined parameters. From the results of binarization for different
thresholds shown in Figure 2.1, one can see that the best threshold selection yields the
least visible noise. The main problem with any adaptive binarization technique is the
choice of window size. The window size should be large enough to guarantee that
enough background pixels are included to obtain a good estimate of average value,
but not so large as to average over non-uniform background intensities. Often,
however, the features in the image vary in size, causing problems with fixed window size.
To remedy this, domain dependent information can be used to ensure that the results
of binarization give the expected features (a large blob of an ON-valued region is not
expected in a page of smaller symbols, for instance). If the result is [...]
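
The running-average idea can be illustrated with a generic local-mean threshold. This is a simplified sketch and not the algorithm of White and Rohrer [12] or of Mitchell and Gillies [13]: the window radius, the fixed offset above the local mean, and the function name are assumptions made for the example.

```python
def local_mean_threshold(gray, radius=2, offset=30):
    """Mark a pixel ON when it is darker than its local mean by more than offset.

    The local mean over a (2 * radius + 1)-wide square window stands in for
    the running-average background estimate; high values are dark ink.
    """
    rows, cols = len(gray), len(gray[0])
    # integral[r][c] holds the sum of gray[0..r-1][0..c-1].
    integral = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        row_sum = 0
        for c in range(cols):
            row_sum += gray[r][c]
            integral[r + 1][c + 1] = integral[r][c + 1] + row_sum

    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        r0, r1 = max(0, r - radius), min(rows, r + radius + 1)
        for c in range(cols):
            c0, c1 = max(0, c - radius), min(cols, c + radius + 1)
            window_sum = (integral[r1][c1] - integral[r0][c1]
                          - integral[r1][c0] + integral[r0][c0])
            local_mean = window_sum / ((r1 - r0) * (c1 - c0))
            out[r][c] = 1 if gray[r][c] > local_mean + offset else 0
    return out


# Hypothetical page: white area on the left (20) and a highlighted area on the
# right (120), each crossed by one stroke that is 100 levels darker.
row = [20, 20, 120, 20, 20, 120, 120, 220, 120, 120]
page = [list(row) for _ in range(5)]
for line in local_mean_threshold(page):
    print(line)
```

In this example the left stroke has the same intensity as the highlighted background, so no single global threshold separates both strokes, while the local-mean rule marks only columns 2 and 7 as ON. Enlarging the window until it spans both background areas would reintroduce the problem, which is the window-size trade-off described above.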

