United States Patent [19]
Froessl

[11] Patent Number: 5,109,439
[45] Date of Patent: Apr. 28, 1992

[54] MASS DOCUMENT STORAGE AND RETRIEVAL SYSTEM

Inventor: Horst Froessl, Gutenbergstrasse 2-4, D-6944 Hemsbach, Fed. Rep. of Germany

[21] Appl. No.: 536,769
[22] Filed: Jun. 12, 1990

[51] Int. Cl.5: G06K 9/00
[52] U.S. Cl.: 382/ [remainder illegible in scan]
[58] Field of Search: [illegible in scan]

[56] References Cited
U.S. PATENT DOCUMENTS
4,358,824 11/1982 Glickman et al.
4,553,261 11/1985 Froessl
4,672,683 6/1987 Matsueda
4,743,618 5/1988 Takeda et al. 382/61
4,758,380 7/1988 Tsunekawa et al. 382/61
4,760,606 7/1988 Lesnick et al. 382/61
4,811,166 3/1989 Gonzalez et al.
4,933,979 6/1990 Suzuki et al. 382/61

Primary Examiner: Leo Boudreau
Assistant Examiner: David Fox
Attorney, Agent, or Firm: Walter C. Farley

[57] ABSTRACT
A sequence of documents is delivered to an optical
scanner in which each document is scanned to form a
digital image representation of the content of the docu-
ment. In one embodiment, the image representation is
converted into code (ASCII) and is automatically ex-
amined by data processing apparatus to select search
words which meet predetermined criteria and by which
the document can subsequently be located. In another
embodiment, the image is not converted. The search
words are stored in a nonvolatile memory in code form
and the entire document content is stored in mass stor-
age, either in code or image form. Techniques for se-
lecting the search words are disclosed.

21 Claims, 7 Drawing Sheets
`
`
`
[Front-page drawing (block diagram). Legible labels: USER STATION (three instances); INPUT (KEYBOARD, MOUSE, ETC.); COMM. LINK OR NETWORK; SERVER, COMPUTER; VOLATILE & NON-VOLATILE MEMORY: RAM, TABLES, HD; SOFTWARE; HARDWARE FOR [illegible]; CHAR. CONV. PROCESSOR. Reference numerals on the drawing: 144, 145, 146, 147.]

`Page 1 of 17
`
`FIS Exhibit 1018
`
`
`
[Drawing sheet 1 of 7: flow diagram. The sheet is rotated in the scan and its text is not recoverable.]

Page 2 of 17
`
U.S. Patent    Apr. 28, 1992    Sheet 2 of 7    5,109,439

[Drawing sheet 2 of 7: flow diagram. The sheet is rotated in the scan and its text is not recoverable.]

Page 3 of 17

U.S. Patent    Apr. 28, 1992    Sheet 3 of 7    5,109,439

[Drawing sheet 3 of 7: flow diagram. The sheet is rotated in the scan and its text is not recoverable.]

Page 4 of 17

`
[Drawing sheet 4 of 7: flow diagram. The sheet is rotated in the scan and its text is not recoverable.]

Page 5 of 17

`
U.S. Patent    Apr. 28, 1992    Sheet 5 of 7    5,109,439

[FIG. 3: flow diagram of the search word selection process. Box labels legible in the scan:
START SEARCH WORD SELECTION
CONVERT TEXT (dashed box)
CHECK EACH WORD IN TEXT FOR INITIAL CAPITAL LETTER
CAPS VOCAB. TABLE
[partly illegible] CAPS VOCAB. TABLE
[partly illegible decision] BY FULL STOP?
DOES [illegible] LETTER?
WORD IN CAPS VOCAB. TABLE AS WORD NOT TO SELECT
COMPARE OTHER (NON-CAP) WORDS WITH LANGUAGE TABLE (DICTIONARY) AND SELECT AS SEARCH WORDS THOSE MARKED, INCLUDING SPECIAL AND "MUST" SELECTIONS FOR SPECIFIC BUSINESS
DO NOT STORE
STORE AS SEARCH WORD & CORRELATE WITH DOCUMENT ID
THERE MORE TEXT IN THE DOCUMENT (Y / N)
GO TO START SEARCH WORD SELECTION
NEXT DOCUMENT
Reference numerals on the sheet: 54, 94, 100, 102, 105, 107, 110, 112, 116, 119, 120, 122.]

Page 6 of 17
`
`
`
[Drawing sheet 6 of 7: block diagram. The sheet is rotated in the scan and its text is not recoverable.]

Page 7 of 17

`
U.S. Patent    Apr. 28, 1992    Sheet 7 of 7    5,109,439

[FIG. 5: flow diagram of the retrieval method. Box labels legible in the scan:
ENTER SEARCH WORDS
COMPARE SEARCH WORDS WITH STORED SEARCH WORDS
SEARCH WORDS FOUND?
DISPLAY LIST OF SEARCH WORDS
CALL UP LIST OF SEARCH WORDS MEETING CRITERIA
SELECT SEARCH WORDS FROM LIST
DISPLAY NO. OF DOCUMENTS HAVING SEARCH WORDS CHOSEN
TOO MANY TO REVIEW VISUALLY? (N)
DISPLAY DOCUMENTS
PRINT OR QUIT
CHOOSE ADDED CRITERIA TO REDUCE NUMBER OF DOCUMENTS
Reference numerals on the sheet: 150, 152, 158, 164.]

Page 8 of 17
`
`
`
MASS DOCUMENT STORAGE AND RETRIEVAL
SYSTEM
`
`This invention relates to a system for the mass storage
`of documents and to a method for automatically select-
`ing search words by which the documents can be re-
`trieved on the basis of the document content.
`
`BACKGROUND OF THE INVENTION
`
`Various systems are used for the mass storage and
`retrieval of the contents of documents including sys-
`tems such as those disclosed in my earlier U.S. Pat. Nos.
`4,273,440; 4,553,261; and 4,276,065. While these systems
`are indeed quite usable and effective, they generally
`require considerable human intervention. Other systems
`involve storage techniques which do not use the avail-
`able technology to its best advantage and which have
serious disadvantages as to speed of operation and effi-
ciency. In this context, the term "mass storage" is used
to mean storage of very large quantities of data in the
order of, e.g., multiple megabytes, gigabytes or tera-
`bytes. Storage media such as optical disks are suitable
`for such storage although other media can be used.
`Generally speaking, prior large-quantity storage sys-
`tems employ one of the following approaches:
`A. The content of each document is scanned by some
form of optical device involving character recogni-
tion (generically, OCR) so that all or major parts of
`each document are converted into code (ASCII or
`the like) which code is then stored. Systems of this
`type allow full-text code searches to be conducted
`for words which appear in the documents. An
`advantage of this type of system is that indexing is
`not absolutely required because the full text of each
`document can be searched, allowing a document
`dealing with a specific topic or naming a specific
`person to be located without having to be con-
`cerned with whether the topic or person was
named in the index. Such a system has the disad-
vantages that input tends to be rather slow because
`of the conversion time required and input also
requires human supervision and editing, usually by
`a person who is trained at least enough to under-
`stand the content of the documents for error-
`checking purposes. Searching has also been slow if
no index is established and, for that reason, index-
ing is often done. Also, non-word images (graphs,
drawings, pictorial representations) must be handled
in some way which differs from the techniques for
handling text in many OCR conversion systems. Furthermore,
`such systems have no provision for offering for
`display to the user a list of relevant search words,
`should the user have need for such assistance.
`B. The content of each document is scanned for the
`purpose of reducing the images of the document
`content to a form which can be stored as images,
`i.e., without any attempt to recognize or convert
`the content into ASCII or other code. This type of
`system has the obvious advantage that graphical
`images and text are handled together in the same
`way. Also, the content can be displayed in the same
`form as the original document, allowing one to
`display and refer to a reasonably faithful reproduc-
`tion of the original at any time. In addition, rather
`rapid processing of documents and storage of the
`contents is possible because no OCR conversion is
`
`Page 9 of 17
`needed and it is not necessary for a person to check
to see that conversion was proper. The disadvan-
tages of such a system are that some indexing tech-
nique must be used. While it would be theoretically
possible to conduct a pattern search to locate a
specific word "match" in the stored images of a
large number of documents, success is not likely
unless the "searched for" word is presented in a
`font or typeface very similar to that used in the
`original document. Since such systems have had no
`way of identifying which font might have been
`used in the original document, a pattern search has
`a low probability of success and could not be relied
`upon. Creating an index has traditionally been a
`rather time consuming, labor-intensive task. Also,
`image storage systems (i.e., storing by using bit-
`mapping or line art or using Bezier models) typi-
`cally require much more memory than storing the
equivalent text in code, perhaps 25 times as much.
Various image data banks have come into existence
`but acceptance at this time is very slow mainly due to
`input and retrieval problems. Because of the above
difficulties, mass storage systems mainly have been re-
`stricted to archive or library uses wherein retrieval
`speed is of relatively little significance or wherein the
`necessary human involvement for extensive indexing
`can be cost justified. There are, however, other contexts
`in which mass storage could be employed as a compo-
`nent of a larger and different document handling system
`if the above disadvantages could be overcome.
`SUMMARY OF THE INVENTION
`
An object of the present invention is to provide a
method of handling input documents, storing the con-
tents of the documents and automatically creating a
`selection of search words for the stored documents with
`little or no human intervention.
`A further object is to provide a method of machine-
`indexing contents of documents which are to be stored
`in image form in such a way that the documents can be
`retrieved.
`
`Another object is to provide a method to display
`search words to users in an indexed or a non-indexed
`system.
`Briefly described, the invention comprises a method
`of retrievably storing contents of a plurality of docu-
`ments having images imprinted thereon comprising
`optically scanning the documents to form a representa-
tion of the images on the documents. A unique identifi-
cation number can be assigned to each document and to
`the image representation of each document. Search
`words are automatically selected from each document
`to be used in locating the document from mass storage.
`The selected search words are converted to code, cor-
`relating the converted search words with the unique
`identification number of the document from which the
`search words were selected. The search words are
`stored in code, and the image representation of each
`document is stored in mass storage or the entire text is
converted into ASCII or other code with the search
`words being retained in separate storage for display to
`users when desired.
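
The correlation between coded search words and document identification numbers described above can be pictured as a small lookup structure. The following is only a minimal sketch under stated assumptions: the class name `SearchWordIndex`, its methods and the sample ID values are hypothetical and are not taken from the specification, which does not prescribe any particular data structure.

```python
from collections import defaultdict


class SearchWordIndex:
    """Hypothetical in-memory map from coded search words to document IDs."""

    def __init__(self):
        # Each search word points at the set of document IDs it was selected from.
        self._ids_by_word = defaultdict(set)

    def add(self, doc_id, words):
        """Correlate every selected search word with the document's ID."""
        for word in words:
            self._ids_by_word[word.lower()].add(doc_id)

    def lookup(self, *words):
        """Return the IDs of documents correlated with all of the given words."""
        sets = [self._ids_by_word.get(w.lower(), set()) for w in words]
        return set.intersection(*sets) if sets else set()


index = SearchWordIndex()
index.add("0792119140509", ["Froessl", "scanner", "invoice"])
index.add("0792119140511", ["scanner", "warranty"])
found = index.lookup("scanner", "invoice")
```

In this sketch only the search words live in the index; the document itself, whether held as an image or as code, would be fetched from mass storage by its ID.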
`It should be kept in mind that the invention contem-
`plates three possible approaches which have their own
`advantages and disadvantages. In one approach, the text
is "read" by a scanner or the like and kept in a bit-
mapped or similar digital form, as it emerges from the
`scanner rather than being converted into ASCII or
`
`
`
`other code. Search words are extracted and converted
`into code but the main body of the text is stored (in mass
storage) as an image. In the second approach, the entire
document (to the extent possible) is converted, search
words are selected and stored in code form, and the
entire text is stored in code. In the third approach, the
document is also entirely converted (to the extent possi-
ble) and search words are selected but the document is
finally stored in image form. Except for the search
words, the converted text is not saved in mass storage.
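
The differences among the three approaches can be tabulated in a few lines. This is a sketch for illustration only; the enum `Approach`, the function `storage_plan` and the dictionary keys are hypothetical names introduced here and are not part of the specification.

```python
from enum import Enum


class Approach(Enum):
    """Hypothetical labels for the three approaches described in the text."""
    IMAGE_WITHOUT_FULL_CONVERSION = 1   # first approach
    FULL_CONVERSION_STORED_AS_CODE = 2  # second approach
    FULL_CONVERSION_STORED_AS_IMAGE = 3 # third approach


def storage_plan(approach):
    """Summarize what is converted and in which form the document is kept."""
    return {
        # Only the first approach skips conversion of the entire document.
        "full_text_converted": approach is not Approach.IMAGE_WITHOUT_FULL_CONVERSION,
        # Only the second approach keeps the document body in code.
        "mass_storage_form": (
            "code" if approach is Approach.FULL_CONVERSION_STORED_AS_CODE else "image"
        ),
        # Search words are stored in code form in every approach.
        "search_words_form": "code",
    }


plans = {a.name: storage_plan(a) for a in Approach}
```

The table makes one point visible: the search words are held in code in all three cases, and the approaches differ only in whether the whole text is converted and in the form finally written to mass storage.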
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`In order to impart full understanding of the manner in
`which these and other objects are attained in accor-
`dance with the invention, particularly advantageous
`embodiments thereof will be described with reference
`to the accompanying drawings, which form part of this
`specification, and wherein:
FIGS. 1A and 1B, taken together, constitute a flow
`diagram illustrating the overall steps of a first embodi-
`ment of a document processing method in accordance
`with the invention;
FIGS. 2A and 2B, taken together, constitute a flow
`diagram illustrating the steps of a second embodiment
`of a document processing method in accordance with
`the invention;
`FIG. 3 is a flow diagram illustrating a search word
selection process in accordance with the invention;
`FIG. 4 is a block diagram of a system in accordance
`with the invention; and
`FIG. 5 is a flow diagram illustrating a retrieval
`method in accordance with the invention.
`
`DESCRIPTION OF THE PREFERRED
`EMBODIMENTS
`
`The present invention will be described in the context
`of a system for handling incoming mail in an organiza-
`tion such as a corporation or government agency which
`has various departments and employees and which re-
`ceives hundreds or thousands of pieces of correspon-
dence daily. At present, such mail is commonly handled
manually because there is no practical alternative. Ei-
ther of two approaches is followed, depending on the
size and general policies of the organization: in one
approach, mail is distributed to departments, and per-
haps even to individual addressees, before it is opened,
to the extent that its addressee can be identified from the
envelope; and in the other approach, the mail is opened
in a central mail room and then distributed to the ad-
dressees. In either case, considerable delay exists before
`the mail reaches the intended recipient. In addition,
`there is very little control over the tasks which are to be
`performed in response to the mail because a piece of
`mail may go to an individual without his or her supervi-
`sor having any way to track the response. Copying (i.e.,
`making a paper copy) of each piece of mail for the
`supervisor is, of course, unnecessarily wasteful. The
`present system can be used to store and distribute such
`incoming mail documents.
Referring first to FIG. 1, at the beginning of the
process of the present invention, each incoming docu-
ment 20 is delivered 21 to a scanner and is automatically
given a distinctive identification (ID) number which
can be used to identify the document in both the hard
copy form and in storage. The ID number can be
printed on the original of the document, in case it be-
comes necessary to refer to the original in the future.
`Preferably, the ID number is a 13 digit number of which
`
`Page 10 of 17
two digits represent the particular scanner (in the event
that the organization has more than one) or the depart-
ment in which or for which the incoming documents
are being processed, two digits represent the current
year, three digits represent the day of the year and six
digits represent the time (hour, minute and second).
`The number is automatically provided by a time
`clock as each document is fed into the system. For
`reasons which will be discussed below, it is anticipated
`that most documents will be processed in a time of
`about two seconds each which means that the time-
`based ID number will be unique for each document. As
the number is being printed on the document, it is sup-
plied to non-volatile storage, such as a hard disk, for
`cross reference use with other information about the
`document.
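
The 13 digit layout described above (two digits for scanner or department, two for the year, three for the day of the year and six for the time) can be illustrated as follows. This is a minimal sketch; the function name `make_document_id` and the choice to pass the clock reading in as a `datetime` are assumptions made for the example, not details given in the specification.

```python
from datetime import datetime


def make_document_id(unit, when):
    """Build a 13 digit ID: unit (2) + year (2) + day of year (3) + HHMMSS (6)."""
    if not 0 <= unit <= 99:
        raise ValueError("scanner/department code must fit in two digits")
    # %y = two-digit year, %j = zero-padded day of year, %H%M%S = time of day.
    return f"{unit:02d}{when:%y%j%H%M%S}"


doc_id = make_document_id(7, datetime(1992, 4, 28, 14, 5, 9))
```

Because the time field has a resolution of one second, an ID built this way is unique only as long as a given scanner does not accept two documents within the same second, which is consistent with the processing time of about two seconds per document anticipated above.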
`
`While use of the ID number is clearly preferred, it
`would be possible to group documents, as by week or
`month received, and rely on other criteria to locate
specific documents within each group. In such a case,
`the ID number would not be unique to each individual
`document but some other form of identification can
`enable reference to a specific document.
`In order for the processing to be reliable, there are
certain prerequisites for the documents, systems and
`procedures to allow the documents to be processed.
`Most of these are common to all conversion systems,
`not only those of the present invention. Currently avail-
`able hardware devices are capable of performing these
`functions. The criteria are:
a. Each document should be easily readable, i.e., have
reasonably good printing.
b. The print should be on one side of the page only.
For documents having printing on both sides, it
should be standard practice to use one side only.
c. The scanner should have a document feeder.
d. A copying machine should be available for either
copying documents darker when the original is too
light, or
copying damaged or odd-size documents not suit-
able for feeder input.
`e. Character recognition software used with the sys-
`tem must be powerful and able to convert several
`different fonts appearing on one page.
`f. Preferably the software should also be able to con-
`vert older type fonts and must be able to separate
`text and graphics appearing on the same page.
`At this preliminary stage, pre-run information 22 can
`also be supplied to the apparatus to set, for example, the
`two-digit portion indicating the department for which
`documents are being processed. This is helpful if a sin-
`gle scanner is to be used for more than one department
`or if a scanner in one department is temporarily inopera-
`tive and one for another department is being used.
The documents are fed into the scanner, after or
concurrently with assignment of the ID number, the
scanner being of a type usable in optical character rec-
ognition (OCR) but without the usual recognition hard-
ware or software. The scanner thus produces an output
which is typically an electrical signal comprising a se-
ries of bits of data representing successive lines taken
from the image on the document. Each of the successive
lines consists of a sequence of light and dark portions
(without gray scales) which can be thought of as equiv-
alent to pixels in a video display. Several of these "pixel
lines" form a single line of typed or printed text on the
document, the actual number of pixel lines (also re-
ferred to as "line art") needed or used to form a single
line of text being a function of the resolution of the
scanner.
In conventional OCR, software is commonly used to
analyze immediately the characteristics of each group
of pixel lines making up a line of text in an effort to
"recognize" the individual characters and, after recog-
nition, to replace the text line with code, such as ASCII
code, which is then stored or imported into a word
processing program. In one aspect of the present inven-
tion (FIG. 1), recognition of the full text is not at-
tempted at this stage. Rather, the data referred to above
as pixel lines is stored in that image form without con-
version. In the other approach (FIG. 2), the full text is
converted into code and is then stored in mass storage
(e.g., optical disk) while the converted search words are
stored, as suggested above, in a readily accessible form
of non-volatile memory such as a hard disk. In this
connection, memory such as random access memory,
buffer storage and similar temporary forms of memory
are referred to herein as either RAM or volatile mem-
ory and read/write memory such as hard disk, diskette,
tape or other memory which can be relied upon to
survive the deenergization of equipment is referred to as
non-volatile memory.
The pixel line image is stored in a temporary memory
such as RAM 26 and the ID number, having been gen-
erated in a code such as ASCII by the time clock or the
like concurrently with the printing, is stored in code
form and correlated in any convenient fashion with its
associated document image.
As will be recognized, the image which is stored in
this fashion includes any graphical, non-text material
imprinted on the document as well as unusually large
letters or designs, in addition to the patterns of the text.
Commonly, incoming correspondence will include a
letterhead having a company logo or initials thereon. At
this stage 26 of the process, the image can be searched
to determine if patterns indicative of a logo or other
distinctive letterhead (generically referred to herein as a
"logo") are present. This can be automatically performed
by examining the top two to three inches of the docu-
ment for characters which are larger than normal docu-
ment fonts or have other distinctive characteristics. By
"automatically" it is meant that the step can be per-
formed by machine, i.e., by a suitably constructed and
programmed computer of which examples are readily
available in the marketplace. The term "automatically"
will be used herein to mean "without human interven-
tion" in addition to meaning that the step referred to is
done routinely.
If such a logo is found, 28, a comparison 30 can be
made to see if the sender's company logo matches a
known logo from previous correspondence. This infor-
mation can be useful in subsequent retrieval. For this
purpose, a data table 32 including stored patterns of
known logos is maintained correlated with the identifi-
cation of the sending organization, the pattern informa-
tion in the table 32 being in the same form as the signals
produced by the scanner so that the scanner output can
be compared with the table to see if a pattern match
exists.
To seek a pattern match, a comparison is performed
preferably using a system of the type produced by Ben-
son Computer Research Corporation, McLean, Va.
which utilizes a search engine employing parallel pro-
cessing and in-memory data analysis for very rapid
pattern comparison. If the letterhead/logo on a docu-
ment is recognized, 34, an identification of the sender,
including address, is attached, 36, to the ID number for
that particular document for subsequent use as a search
word. If no pattern match is found, a flag can be at-
tached to the ID number for that document to indicate
that fact, allowing human intervention to determine
whether the logo pattern should be added to the exist-
ing table.
As will be discussed, the ID number and any addi-
tional information which is stored with that number, as
well as search words to be described, are ultimately
stored in code rather than image form. Such code is
preferably stored on a hard disk while the images are
ultimately stored in a mass store such as a WORM (write
once, read many times) optical disk. Meanwhile, all
such data is held in RAM.

At this stage, the system enters into a process of se-
lecting search words and other information from the
remaining parts of the document to allow immediate
electronic distribution as well as permanent storage of
the documents which have specifically designated ad-
dressees and to permit subsequent retrieval on the basis
of information contained in the document. Some of the
techniques for doing these tasks are language- and cus-
tom-dependent, as will be discussed, and the techniques
must thus be tailored to the languages and customs for
the culture in which the system is intended to be used.
A general principle in this embodiment is to attempt to
recognize portions of the document which are likely to
contain information of significance to subsequent re-
trieval before the document is converted into code and
to then convert into code only specific search words
within those recognized portions.
It is customary in many countries to have the date of
the letter and information about the addressee isolated
at the top of a letter following a logo, or in a paragraph
which is relatively isolated from the remainder of the
text. This part of the letter easily can be recognized
from the relative proportion of text space to blank space
without first converting the text into code. Once recog-
nized, 38, this portion can be converted, identified as
"date" and "addressee" information 40 and stored with
the document ID. All known arrangements for writing
a date can be stored in a data table for comparison with
the document so that the date and its characteristics can
be recognized.
If the date and addressee information cannot be rec-
ognized in a specific document, the ID for that docu-
ment is flagged 42 for human intervention so that the
date is manually added to the extent that it is available.
In this context, the "addressee" would normally be
either a specifically named person or a department
within the overall organization. To facilitate identifying
the addressee, a table can be maintained with individual
and department names for comparison.
At this stage of the process, normally about two sec-
onds or less after the document has been introduced into
the scanner, enough information will have been deter-
mined (in most cases) for the system to send to the
individual addressee, as by a conventional E-mail tech-
nique, notification 44 that a document has been re-
ceived, from whom, and that the text is available from
mass storage under a certain ID number. If desired, the
image of the entire document can be transmitted to the
addressee but a more efficient approach is to send only
notification, allowing the intended recipient to access
the image from mass storage.
In a similar fashion, the name of the individual sender,
as distinguished from a company with which the indi-

Page 11 of 17

vidual might be employed, is usually readily recogniz-
able, 46, near the end of the document page on which it
appears. If recognizable, the sender's name and/or title
is chosen routinely, 48, as one of the search words.
Additionally, it will be recognized that the presence of
the sender's name at the end is an indication that the
page on which it appears is the last page of that specific
document, while the presence of the addressee's name
near the top indicates that the page is the first page. An
indication of Attachments at the bottom can also be
chosen to show that there is more to be associated with
the letter.
`Multiple page documents can be recognized by the
`absence of letterhead information on the second and
`subsequent pages and by the presence of a signature on
`a page other than the one with address information. It is
`important to correlate all subsequent pages with the
`first page so that when a multiple page document is
`found in a search, the first page is displayed and the user
`can then "leaf through" the document by sequentially
`displaying the subsequent pages.
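
The page correlation rule just described (letterhead or address information marks a first page, a signature marks a last page) can be sketched as a simple grouping pass. The `Page` record, its field names and the function `group_pages` are hypothetical and serve only to illustrate the idea; the sketch also ignores attachments that follow a signature page.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical per-page flags produced by earlier recognition steps."""
    number: int
    has_letterhead: bool  # letterhead or address information was recognized
    has_signature: bool   # sender's name or signature was recognized


def group_pages(pages):
    """Group consecutive scanned pages into documents, first page first."""
    documents, current = [], []
    for page in pages:
        # A letterhead starts a new document if one is still open.
        if page.has_letterhead and current:
            documents.append(current)
            current = []
        current.append(page.number)
        # A signature closes the document it appears in.
        if page.has_signature:
            documents.append(current)
            current = []
    if current:
        documents.append(current)
    return documents


scanned = [
    Page(1, True, False),
    Page(2, False, True),
    Page(3, True, True),
    Page(4, True, False),
    Page(5, False, False),
]
grouped = group_pages(scanned)
```

Each inner list begins with the first page of a document, so a search hit can display that page and then step through the remaining page numbers in order.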
`If a specific document exhibits any problems with
character recognition, 50, the search words and related
`material are stored and the ID flagged for human atten-
`tion, 52. The human review 56 is for the purpose of
`determining the reasons for the problem, correcting
`them if possible and either retrying the machine pro-
`cessing or manually entering the desired information.
The next task, 54, is to identify by machine those
`words in the text of the document which are significant
`to the meaning of the document and which can be used
`as search words, apart from identification of the sender,
addressee, etc. The manner in which this task will be
accomplished is more language-dependent than the
`above. A more complete discussion of the text search
`word selection process follows with reference to FIG.
`3. The chosen search words are converted to code, 58,
`stored with, or correlated with, the ID number and the
`image itself is transferred to the mass store. If more
documents are to be processed, 60, the method starts
`again at 21.
To summarize, the documents received by a com-
pany are analyzed to identify and store important words
from various parts of each such document. In the exam-
ple of a business letter, such information should include
`the following:
`Sending organization (letterhead information)
`Date of the letter
`Addressee (company, organization)
`Reference
`Individual addressee (Dear Mr. ---)
`Search words chosen from text
`Presence of enclosure/annex
`Individual sender
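
The items listed above can be gathered into one record per document. The following is only a sketch; the class `LetterRecord`, its field names and the `needs_review` helper are hypothetical, and the rule that missing date or addressee information triggers human review is taken from the description of flagging given earlier.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LetterRecord:
    """Hypothetical coded record stored with a document's ID number."""
    doc_id: str
    sending_organization: Optional[str] = None
    letter_date: Optional[str] = None
    addressee_organization: Optional[str] = None
    reference: Optional[str] = None
    individual_addressee: Optional[str] = None
    text_search_words: List[str] = field(default_factory=list)
    has_enclosure: bool = False
    individual_sender: Optional[str] = None

    def needs_review(self):
        """True when date or addressee could not be recognized by machine."""
        no_addressee = (self.addressee_organization is None
                        and self.individual_addressee is None)
        return self.letter_date is None or no_addressee


record = LetterRecord(
    doc_id="0792119140509",
    sending_organization="Example Supplier",
    letter_date="1992-04-28",
    individual_addressee="Mr. Smith",
    text_search_words=["scanner", "delivery"],
)
unreadable = LetterRecord(doc_id="0792119140511")
```

A record for which `needs_review()` is true would correspond to an ID flagged for human intervention.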
`FIG. 2 shows an alternative embodiment in which the
input document text is converted, to the extent possible,
`at the beginning of the process while the scanning is
`being performed. This difference leads to a number of
`other changes throughout the process, although many
`of the steps are the same. The process of FIG. 2 will be
`briefly discussed with emphasis on the differences from
`FIG. 1.
`
To begin with, the feeding of documents 60 to scan-
ner 61 and the insertion of pre-run information 62 is the
same. However, after or concurrently with scanning,
the entire document is converted, 63, to code by suit-
`able conventional character recognition equipment and
`software and stored in volatile memory. As in FIG. 1,
`
`Page 12 of 17
the image of the document is stored in RAM, 64, even
though the conversion is accomplished. If there are any
OCR conversion problems, 65, the ID number is
flagged for human review, 65, and correction or manual
entry, 67.
The image is searched for a logo pattern, 70, and if a
logo is found, 74, its pattern is compared, 75, with pat-
terns stored in a logo table 76. If found, 78, the informa-
tion stored therein about the sender is added, 80, to the
ID data stored. If not, it can be added manually, 82.
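
A logo table lookup of this kind can be sketched as a comparison of bit patterns. The sketch below is an assumption-laden illustration: it uses a plain bitwise difference count (Hamming distance) with a hypothetical tolerance, whereas the specification refers to a commercial parallel pattern comparison system and does not state how similarity is measured. The names `hamming_fraction`, `match_logo` and `max_fraction`, and the sample table entries, are invented for the example.

```python
def hamming_fraction(a, b):
    """Fraction of differing bits between two equal-length byte patterns."""
    if len(a) != len(b):
        return 1.0  # patterns of different size are treated as a non-match
    if not a:
        return 0.0
    differing = sum(bin(x ^ y).count("1") for x, y in zip(a, b))
    return differing / (8 * len(a))


def match_logo(scanned, logo_table, max_fraction=0.05):
    """Return the sender whose stored pattern is closest, or None if none is close."""
    best_sender, best = None, max_fraction
    for sender, pattern in logo_table.items():
        distance = hamming_fraction(scanned, pattern)
        if distance <= best:
            best_sender, best = sender, distance
    return best_sender


logo_table = {
    "Acme Corp.": b"\xff\x00\xff\x00",
    "Globex": b"\x0f\x0f\x0f\x0f",
}
sender = match_logo(b"\xff\x00\xff\x01", logo_table)
unknown = match_logo(b"\x00\x00\x00\x00", logo_table)
```

A result of `None` corresponds to the case in which no match is found and the sender information is added manually.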
`The system can be arranged to search for addressee
`and date information in either the image in RAM or the
`converted code in RAM, but the preferred method is to
`search in code, 72. If found, 84, these data are chosen,
`86, as search words. If not, the document is flagged for
human review, 87. Notification of the receipt of a docu-
`ment, or the entire document, can then be sent to the
`addressee, 88.
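
Searching the converted code for a date against a table of known date arrangements might look like the following. This is a sketch only: the three arrangements in `DATE_FORMATS` and the function name `find_date` are assumptions chosen for illustration, a real table would hold many more arrangements, and month names are parsed here under the default English locale.

```python
import re
from datetime import datetime

# Hypothetical table of date arrangements: (pattern to locate, format to parse).
DATE_FORMATS = [
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"), "%m/%d/%Y"),
    (re.compile(r"\b\d{1,2}\.\d{1,2}\.\d{4}\b"), "%d.%m.%Y"),
    (re.compile(r"\b[A-Z][a-z]+ \d{1,2}, \d{4}\b"), "%B %d, %Y"),
]


def find_date(text):
    """Return the first recognizable date in converted text, or None."""
    for pattern, fmt in DATE_FORMATS:
        match = pattern.search(text)
        if match:
            try:
                return datetime.strptime(match.group(0), fmt).date()
            except ValueError:
                # Looked like a date but is not valid; try the next arrangement.
                continue
    return None


german_style = find_date("Hemsbach, 12.06.1990\nDear Mr. Smith")
us_style = find_date("June 12, 1990\nDear Mr. Smith")
nothing = find_date("no date in this text")
```

A `None` result corresponds to the branch in which the document is flagged for human review.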
`If date and sender information has been found, 90, it
`is added as search words, 92. The search word selection
from the text is performed, 94, chosen words are stored
and correlated with the ID number, 96, and the con-
verted image data are stored in WORM or other mass
store. As before, the ID and search word information is
stored in a non-volatile, rewritable form of memory
such as a hard disk. In this approach, storage of the
`conversion in part as well as conversion of search
`words into code. On the other hand, total conversion
`can be used only for the search for, and extraction of
`search words with, possibly, editing being performed to
`only the search words or only to the capital letters of
the search words. The search in code in this case in-
cludes, e.g., date, addressee and sender.
`Using this approach, the remainder of the converted
`text is not stored but is deleted.
`Correction of incorrectly converted search words
`and/or rejections (words which cannot be recognized
and converted) can also be reduced to two errors per
rejection, or more for any characters following a capital
`letter. The capital letter itself would have to be correct
`for later ease and reliability of searching.
`FIG. 3 illustrates a process for selecting search words
from the text of a document automatically, i.e., without
human intervention in the case of most documents,
`which is a very important part of the present