`Doner et al.
`
[54] APPARATUS AND METHOD FOR
RETRIEVING AND GROUPING IMAGES
REPRESENTING TEXT FILES BASED ON
THE RELEVANCE OF KEY WORDS
EXTRACTED FROM A SELECTED FILE TO
THE TEXT FILES
`
[75] Inventors: Christopher G. Doner, San Francisco;
Lawrence G. Miller, Saratoga; Ian D.
Emmons, Richmond; Michael R.
Barnes, Berkeley, all of Calif.
`
`[73] Assignee: Caere Corporation, Los Gatos, Calif.
`
`[21] Appl. No.: 948,669
`
[22] Filed: Sep. 22, 1992
`
[51] Int. Cl.6 ............ G06F 17/30; G06F 17/21
[52] U.S. Cl. ............. 395/605; 364/419.19; 395/348; 395/759
[58] Field of Search ...... 395/600, 159; 364/419.08, 419.19
`
[56] References Cited

U.S. PATENT DOCUMENTS

4,359,824  11/1982  Glickman et al. ........... 364/419.19
4,839,853   6/1989  Deerwester et al. ......... 395/600
4,868,733   9/1989  Fujisawa et al. ........... 395/600
5,020,019   5/1991  Ogawa ..................... 395/600
5,060,135  10/1991  Levine et al. ............. 364/200
5,062,074   9/1991  Kleinberger ............... 395/600
5,211,563   5/1993  Haga et al. ............... 434/322
5,263,159  11/1993  Mitsui .................... 395/600
5,276,616   1/1994  Kuga et al. ............... 364/419.8
5,297,042   3/1994  Morita .................... 364/419.19
`
`OTHER PUBLICATIONS
`
Salton et al., "Parallel Text Search Methods", Communications
of the ACM, vol. 31, issue 2, p. 202(14), Feb. 1988.

Kimoto et al., "A Dynamic Thesaurus and Its Application to
Associated Information Retrieval", IJCNN-91-Seattle, IEEE
Press, vol. 1, pp. 19-29, Jul. 1991.

Churbuck, "Haystack Searching", Forbes, v. 149, n. 4, Feb.
17, 1992, pp. 130 (2).

Donna Harman and Gerald Candela, "Retrieving Records
from a Gigabyte of Text on a Minicomputer Using Statistical
Ranking", Dec. 1990, pp. 581-589.

Kimoto et al., "Automatic Indexing System for Japanese
Text", 1989, Review of the Electrical Communications
Laboratories, vol. 37, no. 1, pp. 51-56.

Al-Hawamdeh, S. et al., "Compound Document Processing
System", Proc. of the Fifteenth Annual International Computer
Software and Applications Conf., pp. 640-644, Sep. 1991.

Salton, G. et al., "The SMART Automatic Document
Retrieval System: An Example", Communications of the
ACM, vol. 8, no. 6, pp. 391-398, Jun. 1965.

US005598557A
[11] Patent Number: 5,598,557
[45] Date of Patent: Jan. 28, 1997
`
Primary Examiner: Thomas G. Black
Assistant Examiner: Jack M. Choules
Attorney, Agent, or Firm: Blakely, Sokoloff, Taylor & Zafman
`
[57] ABSTRACT

An apparatus for searching and retrieving files in a database
without a user being required to provide keywords or query
terms. A user first selects and opens a reference file. A
natural language recognition algorithm is used to determine
the subject words of the selected file. Next, a statistical
comparison between the subject words and the contents of
files in a database is performed. Based on the statistical
comparison, files are assigned weighted relevancies. Relevant
files are prioritized and displayed to the user in groups.
The groups are formed based on the retrieved files' relevance
to specific subject words of the selected file. The groups of
retrieved files are displayed in association with the subject
word they are relevant to.
`
`30 Claims, 8 Drawing Sheets
`
[Representative drawing: excerpt of the FIG. 7 flowchart. Based on
the subject words of the reference document, determine weighted
relevance of documents in the database; rank and display the
relevant documents according to their weights; determine the three
most common subject words in the reference document; for each of
the three most common subject words, retrieve and prioritize
documents relevant to those subject words.]
`
`Page 1 of 15
`
`GOOGLE EXHIBIT 1036
`
`
`
U.S. Patent    Jan. 28, 1997    Sheet 1 of 8    5,598,557

[Figure 1: block diagram of computer system 100. Labeled elements:
bus 101, processor 102, main memory 104, static memory 106, mass
storage device 107, OCR 108, display 121, keyboard 122, cursor
control 123, hard copy device 124, sound recording and playback
device 125, scanner 126.]
`
`
`
[Sheet 2 of 8. Figure 2: flowchart for creating a new database.
Labeled steps: Import Files 201, Manual Input 202, Scan, Specify
Zones 204, Recognize 205, Edit, Index 207.]
`
`
`
[Sheet 3 of 8. FIG. 3: user interface window with an open search
menu 301 listing Weighted Word Search, Weighted Boolean Search,
Options, and a highlighted entry 304. Folders shown: Wonder
Company, Wonder Products. Files shown: Sales, Thingamagigs,
Gidgets, Whatsits, and one highlighted file.]
`
`
`
[Sheet 4 of 8. Drawing of a window; only the reference numerals
402, 405, 406, and 407 survive extraction.]
`
`
`
[Sheet 5 of 8. FIG. 5: window 500 titled "Current Agent Document:
Widgets", with menus File, Edit, Search, Results, Admin, Options,
Window, and Help. Folders Wonder Company and Wonder Products 502;
files Sales, Thingamagigs, Gidgets, Widgets 501, and Whatsits.
Ranked list entries 503-508. Chart 510 in section 509 with
thumbnail 511 and subject words "gadget" 515, "machines" 516, and
"product" 517; reference numeral 514 marks a row of thumbnails.]
`
`
`
[Sheet 6 of 8. FIG. 6: flowchart. Labeled steps: input a list of
keywords (601); search for documents meeting the keyword
requirements (602); compute the IDF for a document (603); compute
the relevance of the document to a keyword (604); sum the
relevances of the document to each of the keywords (606); rank
each document according to its assigned weight (608). Two branches
labeled NO appear between these steps.]
`
`
`
[Sheet 7 of 8. FIG. 7: flowchart. Labeled steps: select and open a
reference document (701); parse the reference document into
sentences (702); disregard stop words (703); determine parts of
speech for each word in the sentence (704); determine the subject
word of the sentence (705); a branch labeled NO; based on the
subject words of the reference document, determine weighted
relevance of documents in the database (707); rank and display the
relevant documents according to their weights (708); determine the
three most common subject words in the reference document (709);
for each of the three most common subject words, retrieve and
prioritize documents relevant to those subject words (710).]
`
`
`
[Sheet 8 of 8. FIG. 8: window 800 titled "Current Agent Document:
Widgets", with menus File, Edit, Search, Results, Admin, Options,
Window, and Help. Reference numerals 801 and 802. Visible labels:
Widgets, gadget, machines, product, Advertisement, Dissertation,
Newspaper, Magazine.]
`
`
`
`APPARATUS AND METHOD FOR
`RETRIEVING AND GROUPING IMAGES
`REPRESENTING TEXT FILES BASED ON
`THE RELEVANCE OF KEY WORDS
`EXTRACTED FROM A SELECTED FILE TO
`THE TEXT FILES
`
`FIELD OF THE INVENTION
`
The present invention pertains to the field of computerized
information search and retrieval systems and methods.
More particularly, the present invention relates to an apparatus
and method for searching and retrieving texts found in
a database as a function of their relevancy to a desired
subject matter.
`
`BACKGROUND OF THE INVENTION
`
Due to rapid advances made in electronic storage technology,
it is becoming ever more convenient and economically
attractive to store information electronically as a series
of digital bits of data. As such, "texts" from magazines,
newspapers, journals, encyclopedias, books, and other
printed materials are increasingly being classified and
grouped together into various databases. These texts can be
comprised of miscellaneous strings of characters, sentences,
or documents having indeterminate or varied lengths and
can be of a wide variety of data classes, such as words,
numbers, graphics, etc. Computers are then utilized to
access these databases in order to store additional new text
and to retrieve old, stored texts. One added advantage of
electronically storing information is that computers can be
programmed to search and retrieve specific texts in a database
which are of special interest to the user. In essence, a
computer can perform indexing functions, such as a card
catalog. A user can retrieve a particular text by inputting the
title, author, date of publication, or some other description
specific to that text. In response, the computer can
automatically search, retrieve, and display the desired text.
However, if the user does not know of a specific text or
wishes to conduct research on a general subject matter, the
computer can be programmed to select certain text which
might be of significance to the user. Prior art search and
retrieval systems have typically accomplished this by focusing
on "keywords" or query terms. A user who wishes to find
texts of a particular nature first specifies one or more
keywords which might be contained in the desired texts.
Typically, each text in the database is assigned a unique
reference number. All words in the text, except for trivial
words such as "a" and "the," are tagged with the unique
reference number and are placed in an alphabetical index.
Hence, all texts in the database containing a given keyword
are located by searching for that keyword in the alphabetical
index and returning a set of reference numbers. Thereby,
texts corresponding to the reference numbers are known to
contain the keyword and are accessed via the computer.
In order to provide the user with greater flexibility, many
prior art search and retrieval systems provide for "Boolean"
searches. A Boolean search involves searching for documents
containing more than one keyword. This is typically
accomplished by joining the keywords with conjunctions
such as the exclusive "AND" function and/or the inclusive
"OR" function. If two or more keywords are joined by an
AND, only those texts which contain all those joined keywords
are retrieved. If two or more keywords are joined by
the inclusive "OR" function, all texts which contain at least
one of the joined keywords are retrieved. For example, given
that a user specifies a search for (keyword 1 AND keyword
2) OR keyword 3, the computer retrieves all texts containing
keyword 3 plus those texts containing both keyword 1 and
keyword 2. Two examples of this type of text retrieval
system are the LEXIS(TM) and Dialog(TM) systems.
Even though computerized search and retrieval systems
greatly facilitate a user in locating relevant texts, there yet
remain many disadvantages with these systems. One
disadvantage of this type of prior art search and retrieval
method is that the user is required to anticipate one or more
keywords used to identify and distinguish relevant texts. In
other words, the user must guess the words used by the
author of a desired text. This problem arises because a user
typically does not have advance knowledge of how the texts
of interest are worded. If a user fails to guess appropriate
keywords, highly relevant text might be missed.
Another disadvantage with typical prior art search and
retrieval systems is that picking significant keywords is a
tricky and delicate operation. If a keyword is too common
and/or if a user utilizes an inclusive OR function to join
multiple keywords, a search request can potentially result in
the retrieval of hundreds of texts satisfying the broadly
defined search criteria. Often, only a small handful of texts
among the hundreds of retrieved texts is of actual interest to
a user. The user must then expend much time and energy to
tediously scan each text and winnow out the truly relevant
texts from the vast pool of retrieved texts. Conversely, if the
keyword is too specific or if the exclusive AND function is
used to join multiple keywords, the search might be too
restrictive. Highly relevant text which did not meet the
specific keyword criteria will not be retrieved. Hence, a user
frequently chooses different keywords and conjunctions in a
costly and time-consuming iterative process to tailor the
search request. Consequently, operating typical prior art
search and retrieval systems requires skill, training, and
expertise.
Therefore, what is needed is an apparatus and method for
determining and ranking the significance of each retrieved
document so that a user can broaden the scope of a search
to catch any relevant text without being unduly burdened by
having to wade through inconsequential texts. It would be
highly preferable for the same apparatus and method to also
provide a mechanism to easily and naturally navigate
between texts dealing with related subject matter.
`
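The prior-art keyword index and Boolean conjunctions described in the Background can be illustrated with a minimal sketch. The stop list, the sample texts, and the function names `build_index` and `lookup` are assumptions made for this illustration and are not taken from any of the systems named above.

```python
from collections import defaultdict

# Hypothetical stop list; the text names only "a" and "the" as examples.
STOP_WORDS = {"a", "the"}

def build_index(texts):
    """Map each non-trivial word to the set of reference numbers containing it."""
    index = defaultdict(set)
    for ref_no, text in texts.items():
        for word in text.lower().split():
            word = word.strip('.,;:!?"()')
            if word and word not in STOP_WORDS:
                index[word].add(ref_no)
    return index

def lookup(index, keyword):
    """Return the reference numbers of all texts containing the keyword."""
    return index.get(keyword.lower(), set())

# Sample texts keyed by their unique reference numbers (illustrative data).
texts = {
    1: "The widget is a gadget.",
    2: "A gadget for the office.",
    3: "Sales report for machines.",
}
index = build_index(texts)

# (keyword 1 AND keyword 2) OR keyword 3, using set intersection and union.
hits = (lookup(index, "widget") & lookup(index, "gadget")) | lookup(index, "machines")
print(sorted(hits))
```

Set intersection plays the role of AND and set union the role of OR, which is why a broad OR query grows the result set while an AND query shrinks it.
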
`SUMMARY OF THE INVENTION
`
In view of the problems associated with information
search and retrieval systems, one object of the present
invention is to provide an apparatus and method for ranking
retrieved documents according to their relevance.
Another object of the present invention is to provide an
information search and retrieval system which does not
require a user to specify keywords or query terms.
Another object of the present invention is to provide a
mechanism so that a user can easily and naturally navigate
between groups of files dealing with related subject matter.
These and other objects of the present invention are
implemented in an information search and retrieval computer
system. A user initiates a search by selecting and
opening a file containing subject matter of particular interest.
The computer system performs a natural language recognition
algorithm to determine the subject words of the document
corresponding to the selected file. This is accomplished by
parsing the document into sentences, determining the parts
of speech for each word in the sentence, and picking out the
subject word of the sentence based on heuristic syntactical
grammar rules.
Once all the subject words in the reference document have
been found, they are used in a statistical comparison algorithm
to determine the relevancy of each file in a database.
A file's relevancy is a function of both the frequency of
subject words occurring in that file and the distribution of the
subject words within the database. The file's relevancy is
also normalized to its length. Relevant files are then
retrieved and displayed in a list. The most relevant documents
are displayed at the top of the list, while those which
are not as relevant are displayed in descending order. Hence,
a user is not required to guess at keywords or query terms
prior to conducting a search. The user need only select a
document which is of interest, and the present invention
retrieves and prioritizes relevant documents residing in the
database.
The present invention also provides a user with a means
for navigating between files of related topics. A thumbnail
image comprising a scaled down bit-mapped representation
of the cover sheet of the reference document is displayed.
The three most commonly occurring subject words in the
reference document are displayed next to this thumbnail
image. Files in the database which have relevance to each of
the three subject words are retrieved and are prioritized
according to their degree of relevance to that particular
subject word. The thumbnail image of the most relevant file
to the first subject word is displayed adjacent to that subject
word. It is followed by the thumbnail image of the next most
relevant file to the first subject word, etc. Similar thumbnail
images of files corresponding to the second and third subject
words are also displayed.
By placing a moveable cursor over any of the thumbnail
images and clicking on it, the user can designate that file to
be the new reference file. This initiates a new search based
on the subject words of the new reference file. The search
produces a new list of files ranked according to the degree
of relevance to the new reference file. It also produces the
three most common subject words of the new reference
document and new thumbnail images of files prioritized to
those subject words. Thus, the present invention allows a
user to conduct research on a topic by successfully selecting
new reference documents based on prior search results.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention is illustrated by way of example,
and not by way of limitation, in the Figures of the
accompanying drawings and in which like reference numerals refer
to similar elements and in which:
FIG. 1 illustrates a computer system as may be utilized by
the preferred embodiment of the present invention.
FIG. 2 is a flowchart illustrating the steps for creating a
new database.
FIG. 3 illustrates a typical window displayed on a CRT
which can be used as a user interface for the present
invention.
FIG. 4 illustrates a window displaying a search dialog
box.
FIG. 5 is a window illustrating the results of a document
agent search.
FIG. 6 is a flowchart illustrating the steps for determining
and ranking the relevance of files in a database.
FIG. 7 is a flowchart illustrating the steps involved in a
document agent search.
FIG. 8 illustrates a search results window.

DETAILED DESCRIPTION

An apparatus and method for searching and retrieving
significant text from a database is described. In the following
description, for the purposes of explanation, numerous
specific details such as mathematical formulas, flowcharts,
menus, etc., are set forth in order to provide a thorough
understanding of the present invention. It will be apparent,
however, to one skilled in the art that the present invention
may be practiced without these specific details. In other
instances, well-known structures and devices are shown in
block diagram form in order to avoid unnecessarily obscuring
the present invention.
Referring to FIG. 1, the computer system upon which the
preferred embodiment of the present invention can be
implemented is shown as 100. Computer system 100 comprises a
bus or other communication means 101 for communicating
information, and a processing means 102 coupled with bus
101 for processing information. System 100 further comprises
a random access memory (RAM) or other dynamic
storage device 104 (referred to as main memory), coupled to
bus 101 for storing information and instructions to be
executed by processor 102. Main memory 104 also may be
used for storing temporary variables or other intermediate
information during execution of instructions by processor
102. Computer system 100 also comprises a read only
memory (ROM) and/or other static storage device 106
coupled to bus 101 for storing static information and
instructions for processor 102. Data storage device 107 is coupled
to bus 101 for storing information and instructions.
Furthermore, a data storage device 107 such as a magnetic
disk or optical disk and its corresponding disk drive can be
coupled to computer system 100. Computer system 100 can
also be coupled via bus 101 to a display device 121, such as
a cathode ray tube (CRT), for displaying information to a
computer user. An alphanumeric input device 122, including
alphanumeric and other keys, is typically coupled to bus 101
for communicating information and command selections to
processor 102. Another type of user input device is cursor
control 123, such as a mouse, a trackball, or cursor direction
keys for communicating direction information and command
selections to processor 102 and for controlling cursor
movement on display 121. This input device typically has
two degrees of freedom in two axes, a first axis (e.g., x) and
a second axis (e.g., y), which allows the device to specify
positions in a plane.
Moreover, data can be input by scanner 126. The scanner
126 serves to read out the contents of an original document
or photograph as digitized image information. An OCR
(Optical Character Reader) 108 can be utilized to recognize
textual portions of a scanned document. Another device
which may be coupled to bus 101 is hard copy device 124
which may be used for printing instructions, data, or other
information on a medium such as paper, film, or similar
types of media. Additionally, computer system 100 can be
coupled to a device for sound recording and/or playback 125
such as an audio digitizer coupled to a microphone for
recording information. Further, the device may include a
speaker which is coupled to a digital to analog (D/A)
converter for playing back the digitized sounds. Finally,
computer system 100 can be a terminal in a computer
network (i.e., a LAN).
The currently preferred embodiment of the present invention
can be part of an overall document management software
package. To conduct a search, a user first specifies a
particular database. Databases are usually organized so that
files stored on a particular database share a common
attribute. For example, an attorney might utilize a database
containing cases from a particular jurisdiction; a doctor
might consult a database containing files of patient histories;
a marketing manager might access a database containing
product reviews for spotting market trends; etc. The database
can be an already existing database or a newly created
database. FIG. 2 is a flowchart illustrating the steps for
creating a new database. Computer files containing useful
information can be imported by copying them over to the
database, step 201. Moreover, data in the form of documents,
reports, magazine and newspaper articles, can be
entered either manually by means of a keyboard, step 202,
or they can be entered by using an optical scanner, step 203.
Moreover, the data can already exist on the computer
system. The user can specify zones of a scanned image or
file which are of particular significance for further processing,
step 204. Textual portions of a scanned bit-map image or file
can be recognized and converted into ASCII code data, step
205. The ASCII code data can then be edited, step 206.
Finally, the processed information is indexed and saved to
the database, step 207.
Once a database has been selected, the user can select a
weighted keyword search, a weighted Boolean search, or a
document agent search. FIG. 3 illustrates a typical window
300 which can be displayed on a CRT. Window 300 is
provided as a user interface for the present invention. Window
300 is comprised of a number of pull-down menus which
can be accessed by a cursor positioning device, such as a
mouse. The search menu 301 is accessed by the user to select
the desired type of search (i.e., keyword 302, Boolean 303,
or document agent search 304). The selected type of search is
highlighted. For example, FIG. 3 illustrates the user having
selected a Document Agent Search 304.
If the user selects the weighted word search 302, a search
dialog box 401 is displayed, as illustrated in FIG. 4. The user
then types in one or more keywords and clicks on the OK
box 402 to initiate the search based on the inputted keyword(s).
When the search is completed, a Search Results
window 403 is displayed. FIG. 4 illustrates a Search Results
window 403 displaying a list of retrieved documents
405-407. The list displays those retrieved documents as a
function of their relevance. Documents having the most
significance are displayed at the top of the list, whereas
retrieved documents having less relevance are displayed
near the bottom of the list. In addition to displaying each
retrieved document according to its relevancy, a box bearing
a bar is superimposed over each document's file name. The
extension of the bar indicates that document's degree of
relevance to the keyword(s). For example, a search based on
the keyword Wonder Widget 404 might result in the retrieval
of three documents 405-407. (It is noted that Wonder Widget
and Widgets are fictitious names.) A data sheet 405 describing
the product, which is highly relevant, is displayed at the
top of the list and has a relatively long bar. A brochure 406
describing all Wonder products, including WonderWidget,
having some relevance, is displayed in the middle. It has a
medium-sized bar. A magazine article 407 of a competing
product that mentions WonderWidget has low relevance and
is ranked last in the list. Correspondingly, it has a small bar.
In the currently preferred embodiment, the bars are color
coded red, green, and blue, to respectively indicate the
documents having much, some, and less relevance. The
determination of the document's relevancy is described in
detail below.
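As a rough sketch of how a relevance score might be turned into the bar described above, the function below maps a score to a bar length and one of the three colors. The thirds used as cut-offs, the `width` parameter, and the name `relevance_bar` are assumptions for illustration; the text does not state thresholds.

```python
def relevance_bar(score, max_score, width=20):
    """Return (bar, color) for a relevance score.

    The cut-offs (top third red, middle third green, rest blue) are
    illustrative assumptions, not values given in the text.
    """
    fraction = score / max_score if max_score > 0 else 0.0
    # Always draw at least one cell so a retrieved document stays visible.
    length = max(1, round(fraction * width))
    if fraction >= 2 / 3:
        color = "red"      # much relevance
    elif fraction >= 1 / 3:
        color = "green"    # some relevance
    else:
        color = "blue"     # less relevance
    return "#" * length, color

# Hypothetical scores for the three documents in the example.
for name, score in [("data sheet", 0.9), ("brochure", 0.4), ("article", 0.1)]:
    bar, color = relevance_bar(score, max_score=0.9)
    print(f"{name:12s} {bar:20s} {color}")
```
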
For greater flexibility, a user can specify a Weighted
Boolean Search, wherein keywords are joined by conjunctions
(e.g., AND, OR, etc.). Again, any retrieved documents
are weighted and ranked according to their relevance to the
Boolean search request. Typically, a Boolean search results
in the retrieval of a few highly relevant documents, a
medium sized grouping of documents having modest relevancy,
and a large grouping of documents having little
relevancy. Note that in the present invention, a user is not
unduly penalized for using inclusive OR conjunctions.
Although more documents are likely to be retrieved, the user
can quickly scan through the most significant documents
(i.e., documents at the top of the list). The effect of adding
keywords in an inclusive OR search contributes to the
determination of a document's relevancy and influences
which documents "float" to the top of the list.
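The way added OR keywords can raise a document in the list can be sketched by summing a per-keyword relevance for each document and sorting on the total. The scoring formula here (term frequency divided by document length, multiplied by log(1 + N/df)) is a common stand-in chosen for this illustration and is an assumption, not the formula of the described system; the name `rank_by_or_query` and the sample documents are likewise hypothetical.

```python
import math
from collections import Counter

def rank_by_or_query(docs, keywords):
    """Rank documents for an inclusive-OR query by summed per-keyword relevance.

    Assumed scoring (a stand-in): (tf / doc_length) * log(1 + N / df).
    """
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    n_docs = len(tokenized)
    scores = {}
    for name, words in tokenized.items():
        counts = Counter(words)
        total = 0.0
        for kw in keywords:
            # Number of documents in the database that contain the keyword.
            df = sum(1 for other in tokenized.values() if kw in other)
            if df == 0 or counts[kw] == 0:
                continue
            total += (counts[kw] / len(words)) * math.log(1 + n_docs / df)
        if total > 0:
            scores[name] = total
    # Most relevant documents first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

docs = {
    "data_sheet": "widget widget specs",
    "brochure": "widget gadget catalog overview",
    "article": "competing gadget review mentions widget among many other products",
}
for name, score in rank_by_or_query(docs, ["widget", "gadget"]):
    print(f"{name:12s} {score:.3f}")
```

Because every matching keyword only adds to a document's total, a broad OR query retrieves more documents without displacing the strongest matches from the top of the list.
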
Alternatively, a user can opt for a Document Agent
Search, which allows the user to initiate a search for
documents which are similar to a reference document
selected by the user. First, the user selects and opens a
reference document. Next, the user selects the Document
Agent Search option from the Search pull-down menu.
Thereupon, the present invention retrieves documents from
the database which are related to the reference document.
The relevancy of each retrieved document to the reference
document is determined, and each document is ranked and
displayed according to its relevancy.
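A Document Agent Search starts from the subject words of the reference document rather than from user-supplied keywords. The sketch below shows only the overall shape of that first stage: split the document into sentences, drop stop words, pick one subject word per sentence, and keep the most common ones. Taking the first remaining word of each sentence is a deliberately crude placeholder for the part-of-speech analysis and grammar heuristics, and the stop list and function names are assumptions.

```python
import re
from collections import Counter

# Hypothetical stop list for the illustration.
STOP_WORDS = {"a", "an", "the", "of", "and", "is", "are", "to", "in", "for"}

def subject_words(document):
    """Pick one candidate subject word per sentence (placeholder heuristic)."""
    subjects = []
    for sentence in re.split(r"[.!?]+", document):
        words = [w for w in re.findall(r"[a-z]+", sentence.lower())
                 if w not in STOP_WORDS]
        if words:
            # Stand-in for the part-of-speech based subject selection.
            subjects.append(words[0])
    return subjects

def top_subject_words(document, n=3):
    """Return the n most frequently chosen subject words."""
    return [word for word, _ in Counter(subject_words(document)).most_common(n)]

reference = (
    "Gadgets are small machines. Gadgets sell well. Machines need service. "
    "Machines break. Products vary. Gadgets are popular."
)
print(top_subject_words(reference))
```

The resulting words would then be used as the query terms for a weighted search over the database.
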
FIG. 5 shows a window 500, as may be displayed on a
CRT, illustrating the results of a Document Agent Search. A
user first selects a particular file, such as Widgets 501, from
a folder Wonder Products 502. The Widgets 501 document
is designated the reference document against which other
documents in the database are compared in determining
relevancy. Note that with this type of search, the user is not
required to supply keywords. The present invention retrieves
those documents that are considered to be relevant, ranks
each retrieved document, and lists the retrieved documents
in descending order based on their degrees of relevancy. For
example, if six documents 503-508 were retrieved, the top
document entitled Data Sheet 503 is considered to have the
most relevance to the reference document Widgets 501.
Likewise, the bottom documents, such as Dissertation 507
and Advertisement 508, are considered to be the least
relevant.
A section 509 of window 500 is used to display an
organized chart 510 of relevant documents. Initially, chart
510 displays a "thumbnail" image 511 of the cover sheet of
the reference document. A thumbnail image is a bit-mapped
shrunken, miniaturized representation of a page of a document
(usually the title page). Multiple rows of thumbnail
images 512-514 are displayed to the right of the thumbnail
image of the reference document. Each row comprises
retrieved files of relevant documents. The first row corresponds
to retrieved files having relevance with respect to the
most relevant subject word in the reference document;
similarly, the second row corresponds to retrieved files
having relevance with respect to the second most relevant
word in the reference document; etc. For example, if the
three most relevant subject words in the reference document
Widgets 511 are "gadget" 515, "machines" 516, and "product"
517, those documents having relevance to the word
"gadget" are categorized into the top row. The second and
third rows comprise documents having relevance to the
subject words "machines" and "product." The documents in
a row are arranged so that the most relevant document is
placed at the left with successively decreasing relevant
documents placed to the right. Hence, document 512 has
more relevance to the subject word "gadget" 515 than
document 518.
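The row layout of the chart can be sketched as one ranked list per subject word, most relevant document first (leftmost). The per-word relevance used here, occurrences divided by document length, is an assumed stand-in, and `build_chart` and the sample documents are hypothetical.

```python
def build_chart(subject_words, docs):
    """Return {subject_word: [document names, most relevant first]}.

    Relevance to a single word is approximated as occurrences divided by
    document length; this measure is an assumption for illustration.
    """
    chart = {}
    for word in subject_words:
        scored = []
        for name, text in docs.items():
            tokens = text.lower().split()
            hits = tokens.count(word)
            if hits:
                scored.append((hits / len(tokens), name))
        # Leftmost position in the row holds the most relevant document.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        chart[word] = [name for _, name in scored]
    return chart

docs = {
    "data_sheet": "gadget gadget product",
    "brochure": "gadget product product machines",
    "article": "machines machines review",
}
chart = build_chart(["gadget", "machines", "product"], docs)
for word, row in chart.items():
    print(f"{word:10s} {row}")
```

A document can appear in more than one row when it is relevant to several subject words, as the brochure does in this sample data.
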
Chart 510 provides a user with a means for navigating
between related documents. By glancing at the thumbnail
images, the subject words, and the titles, a user can get a
general indication of those documents which are of interest.
The user can also open a document to examine its contents.
The user can then select a particularly interesting document
by positioning a cursor over that document's thumbnail
image and clicking a button. This designates that document
as the new reference document. This results in a new search,
yielding more related documents. The user can