`Holt et al.
`
`[54] ASSOCIATIVE TEXT SEARCH AND
`RETRIEVAL SYSTEM HAVING A TABLE
`INDICATING WORD POSITION IN PHRASES
`
`[75]
`
`Inventors: John Holt, Centerville; David James
`Miller, Spring Valley; X. Allan Lu,
`Springboro; Ray Daley; Minh Doan,
`both of Dayton; Richard G. Graham,
`Beavercreek; Catherine Leininger,
`Dayton; Darin W. McBeath,
`Miamisburg; Thomas Pease, Mason;
`Stephen M. Sever, Kettering; Dale
`Waddell; Franz Weckesser, both of
`Dayton, all of Ohio
`
`[73] Assignee: Reed Elsevier, Inc., Newton, Mass.
`
`[21] Appl. No.: 473,824
`
`[22] Filed:
`
`Jun. 7, 1995
`
`Related U.S. Application Data
`
`[63] Continuation of Ser. No. 155,304, Nov. 22, 1993.
`Int. Cl.6
`............................. G06F 17/30; G06F 17/21
`[51]
`[52] U.S. Cl. .......................... 395/605; 395/792; 395/793;
`395/759; 395/760; 395/606; 395/603
`[58] Field of Search ..................................... 395/605, 792,
`395/793, 603, 606, 795, 760; 364/419.19
`
`[56]
`
`References Cited
`
`U.S. PATENT DOCUMENTS
`
`................... 364/200
`4,241,402 12/1980 Mayper, Jr. et al.
`4,270,182
`5/1981 Asija ....................................... 364/900
`4,358,824 11/1982 Glickman et al. ...................... 364/200
`4,384,329
`5/1983 Rosenbaum et al. ................... 364/300
`4,471,459
`9/1984 Dickinson et al.
`..................... 364/900
`4,499,553
`2/1985 Dickinson et al.
`..................... 364/900
`4,554,631 11/1985 Reddington ............................. 364/300
`4,580,218
`4/1986 Raye ....................................... 364/300
`4,688,195
`8/1987 Thompson et al. ..................... 364/300
`4,706,212 11/1987 Toma ...................................... 364/900
`4,760,528
`7/1988 Levin.
`4,787,035 11/1988 Bourne .................................... 364/300
`
`I 1111111111111111 11111 lllll 111111111111111 11111 1111111111 111111111111111111
`US005771378A
`[11] Patent Number:
`[45] Date of Patent:
`
`5,771,378
`Jun. 23, 1998
`
`4,819,156
`4,823,306
`4,839,853
`4,852,003
`4,862,408
`
`4/1989 De Lorme et al. ................ 395/182.13
`4/1989 Barbie et al. .. ... ... .... ... ... ... ... ... 364/900
`6/1989 Deerwester et al. .................... 364/900
`7/1989 Zamora.
`8/1989 Zamora ................................... 364/900
`
`(List continued on next page.)
`
`FOREIGN PATENT DOCUMENTS
`
`WO 92/04681
`
`3/1992 WIPO .
`
`OTHER PUBLICATIONS
`
`Wong, Wai Yee Peter, et al., "Implementations of Partial
`Document Ranking Using Inverted Files," Information Pro(cid:173)
`cessing & Management, (Great Britain), vol. 29, No. 5, pp.
`647-669, 1991.
`Robert F. Jack, "Essential Guide to the Library IBM PC, vol.
`4, Data Communications: Going Online", Merkler Corp.
`©1987, pp. 62-67.
`
`(List continued on next page.)
`
`Primary Examiner-Gail 0. Hayes
`Assistant Examiner-Joseph Thomas
`Attorney, Agent, or Firm-Reid & Priest LLP
`
`[57]
`
`ABSTRACT
`
`An associative text search and retrieval system uses one or
`more front end processors to interacting with a network
`having one or more user terminals connected thereto to
`allow a user to provide information to the system and receive
`information from the system. The system also includes
`storage for a plurality of text documents, and at least one
`processor, coupled to the front end processors and the
`document storage. The processor(s) search the text docu(cid:173)
`ments according to a search request provided by the user and
`provide to the front end processor a predetermined number
`of retrieved documents containing at least one term of the
`search request. The retrieved documents have higher ranks
`than documents not provided to the front end processor. The
`ranks are calculated using a formula that varies according to
`the square of the frequency in each of the text documents of
`each of the search terms.
`
`15 Claims, 12 Drawing Sheets
`
`74,
`
`\
`
`LOC A
`LOC B
`
`LOC C
`
`LOC D
`LOC E
`
`LOC f-
`
`LOC G
`
`\ ~O~ H
`
`LOC I
`
`Page 1 of 23
`
`GOOGLE EXHIBIT 1038
`
`
`
`5,771,378
`Page 2
`
`U.S. PATENT DOCUMENTS
`
`4,868,750
`4,914,590
`4,918,588
`4,931,935
`4,965,763
`4,972,349
`4,974,191
`4,991,087
`4,991,094
`5,005,127
`5,099,425
`5,109,509
`5,117,349
`5,123,103
`5,265,065
`5,321,833
`5,323,316
`5,544,352
`
`9/1989 Kucera et al. .......................... 364/419
`4/1990 Loatman et al. ........................ 364/419
`4/1990 Barrett et al. .. ... ... ... ... .... ... ... ... 364/200
`6/1990 Ohira et al. ............................. 364/419
`10/1990 Zamora .
`11/1990 Kleinberger ... .... ... ... ... ... ... .... .. 364/900
`11/1990 Amirghodsi et al. ................... 364/900
`2/1991 Burkowski et al. .................... 364/200
`2/1991 Fagan et al. .
`4/1991 Kugimiya et al. .
`3/1992 Yuji et al. ............................... 364/419
`4/1992 Katayama et al. ...................... 395/600
`5/1992 Tirfling et al. .......................... 395/600
`6/1992 Ohtaki et al. ........................... 395/600
`11/1993 Turtle . ... ... ... ... .... ... ... ... ... ... .... .. 395 /600
`6/1994 Chang et al.
`........................... 395/605
`6/1994 Kadashevich et al. ................. 364/419
`8/1996 Egger . ... ... ... ... .... ... ... ... ... ... .... .. 395 /605
`
`OIBER PUBLICATIONS
`
`Vu/Text®User Guide, A Knight Ridder Company, Aug.
`1989 Chapter 7, p. 7.11.
`Save™Retrieval Guide,. Jun. 1, 1990, pp. 12-17 of Chapter
`12.
`Dialog OnDisc User's Guide, Dialog Information Services,
`Inc.,©1987, pp. E-4 & E-5.
`Dialog File 410, Acc.# 00035496: "New Full-Text Features
`in McGraw-Hill ... " published Nov. 1968.
`Dialog File 410, Ace, # 00044457: "KWIC and Hilight:
`How and When," published May 1988.
`Croft, et al., "A Retrieval Model Incorporating Hypertext
`Links," Hypertext '89 Proceedings, Association for Com(cid:173)
`puter Machinery, pp. 213-224 (Nov., 1989).
`
`Turtle, et al., "Inference Networks for Document Retrieval,"
`Coins Technical Report 90---07, University of Massachusetts,
`pp. 1-16, (Mar., 1990).
`Turtle, et al., "Inference Networks for Document Retrieval,"
`Sigir 90, Association for Computer Machinery, pp. 1-24
`(Sep., 1990).
`Turtle, "Inference Network for Document Retrieval," Ph.D.
`Dissertation, Coins Technical Report 90-92, University of
`Massachusetts (Feb., 1991).
`Turtle, et al., "Efficient Probabilistic Inference for Text
`Retrieval," RIAO '91 Conference Proceedings, Recherche
`d'Information Assistee' par Ordinateur, Universitat Auton(cid:173)
`sma de Burcelona, Spain, pp. 644-661 (Apr., 1991).
`Porter, "An Algorithm for Suffix Stripping," Program, vol.
`14, pp. 130-137 (1980).
`Turtle, et al., "Evaluation of an Inference Network-Based
`Retrieval Model," Transactions on Information Systems,
`Association for Computer Machinery, vol. 9, No. 3, pp.
`187-222 (Jul., 1991).
`Croft, et al., "Interactive Retrieval for Complex Docu(cid:173)
`ments," Information Proceeding and Management, vol. 26,
`No. 5, pp. 593-613 (1990).
`Haynes, "Describing a System for the Specialized User: A
`Case Study," Proceedings-1985 National Online Meeting,
`Learned Information, Inc., pp. 205-213 (Apr. 30, 1985).
`"Text Search and Retrieval Reference Manual for the Auto(cid:173)
`mated Patent System (APS)," U.S. Department of Com(cid:173)
`merce, pp. 1-5, 3-2 to 3-3, 3-8 to 3-9, 3-157-1 to 7-7,
`7-35 to 7-37 (Oct. 21, 1992).
`"Text Search and Retrieval Training Manual for the Auto(cid:173)
`mated Patent System (APS)," U.S. Department of Com(cid:173)
`merce, pp. 28, 29, 39-43 (Oct. 21, 1992).
`
`Page 2 of 23
`
`
`
`U.S. Patent
`
`Jun. 23, 1998
`
`Sheet 1 of 12
`
`5,771,378
`
`~I
`
`:.-:,:-::
`et::
`0
`5
`f-
`w
`z
`
`~I
`
`t13I
`
`~I
`
`~,
`
`<;f(cid:173)
`l[)
`
`il
`
`~I
`
`~I
`
`ff51 ~,
`
`~I
`
`~I
`
`~l
`
`~I ~,
`~,
`
`~I
`
`~I
`
`-..........
`
`(j
`~
`
`\ 0
`
`rD
`
`Page 3 of 23
`
`
`
`U.S. Patent
`
`Jun. 23, 1998
`
`Sheet 2 of 12
`
`5,771,378
`
`r70
`
`I
`
`/
`
`/
`
`/
`
`/
`
`/
`
`/
`
`/
`
`/
`
`/
`
`/
`
`/
`
`/
`
`P1 ~2-
`TERM1
`TERM2 p3' P4
`.
`
`. .
`.
`TERMN Pr Ps
`
`LOC A
`LOC B
`.
`
`LOC C
`
`LOC D
`LOC E
`
`.
`
`'
`
`LOC F
`
`'
`
`\
`\
`\
`\
`\
`\
`\
`
`LOC G
`
`LOCH
`.
`.
`.
`LOCI
`
`FIG 3
`
`DOCUMENT
`TEXT
`
`72
`-
`
`INDEX
`74
`-
`
`ANCILLARY
`DATA
`76
`-
`
`FIG 2
`
`Page 4 of 23
`
`
`
`U.S. Patent
`
`Jun. 23, 1998
`
`Sheet 3 of 12
`
`5,771,378
`
`ussc
`1700 to
`1900
`
`82
`
`ussc
`1901 to
`1960
`
`83
`
`ussc
`1961 to
`1992
`
`84
`
`ussc
`1993
`
`85
`
`80
`
`FIG 4
`
`80
`
`83
`
`ussc
`1700 to
`1900
`
`82
`
`ussc
`1901 to
`1960
`
`ussc
`1961 to
`1992
`
`84
`
`ussc
`1993
`
`85
`
`92
`
`94
`
`95
`
`FIG 5
`
`Page 5 of 23
`
`
`
`U.S. Patent
`
`Jun. 23, 1998
`
`Sheet 4 of 12
`
`5,771,378
`
`~104
`
`100
`
`102
`Enter your FREESTYLE (TM) set:'h Description.
`Enter phrases in quotation marks.
`Example : "lender liability", "28 U.S.C. @ 1332 a"
`
`Type .bool to enter a Boolean search
`
`F ex further explanation, press the H key (f<X HELP) and then the ENTER key.
`
`FIG 6
`
`SEARCH OPTIONS
`
`Press ENTER to send search.
`To use a Search Option, enter an equal sign followed by the number.
`
`Search Description:
`WHAT DIFFERENT METHODS OF "USABILITY TEllNG" AND TOM ARE USED
`BY SOFTWARE DEVELOPMENT COMPANIES?
`112a
`
`('114
`('I 15
`Enter/Edit Mandatory Terms
`Enter/Edit Restrictions (e.g., dafe)
`Thesaurus/"! 16
`117
`(I 18
`Edit Search Oescriptiorl
`Change number of documents Current setting: 25
`
`<=1>
`<=2>
`<"'3>
`<=4>
`<=5>
`
`F ex further explanation, press the H key (for HELP) and then the ENTER key.
`
`FIG 7
`
`Page 6 of 23
`
`
`
`U.S. Patent
`
`Jun. 23, 1998
`
`Sheet 5 of 12
`
`5,771,378
`
`SEARCH OPTIONS
`
`Press ENTER to send search.
`To use a Search Option, enter an equal sign followed by the number.
`
`Search Description:
`WHAT DIFFERENT METHODS OF "USABILITY TESTING" AND TOM ARE USED
`I 12o
`BY SOFTWARE DEVELOPMENT COMPANIES?
`
`Mandatory: "USABILITY TESTING", TOM
`
`122
`
`('114
`rl 15
`Enter/Edit Mandatory Terrrb
`Enter/Edit Restrictions (e.g., date)
`Thesaurus/"1I6
`117
`(118
`Edit Search Descriptior1
`Change number of documents Current setting: 25
`
`<:::1 >
`<=2>
`<m3>
`<=4>
`<=5>
`
`110
`
`For further explanation, press the H key (for HELP) and then the ENTER key.
`
`FIG 8
`
`=1
`
`SEARCH RESTRICTIONS
`
`To use restrictions, enter an equal sign followed by the number.
`Press ENTER to return to Search Options.
`
`1 132
`<•1 > DAtE:
`1133
`<=2> COURT:
`.
`134
`<"'3> JuoC
`1135
`<=4> COUNSEL:
`,r"--136
`<=5> NAME:
`
`Example: after 6/6/1993
`
`Example: Sixth Circuit
`
`130
`
`Example: Rehnquist
`
`Example: Gerry
`
`Example: Brown and Topeka Board of Education
`
`For further explanation, press the H key (for HELP) and then the ENTER key.
`
`FIG 9
`
`Page 7 of 23
`
`
`
`U.S. Patent
`
`Jun. 23, 1998
`
`Sheet 6 of 12
`
`5,771,378
`
`SEARCH OPTIONS
`
`Press ENTER to send search.
`To use a Search Option, enter an equal sign followed by the number.
`
`Search Description:
`WHAT DIFFERENT METHODS OF "USABILITY TEl!NG" AND TOM ARE USED
`BY SOFTWARE DEVELOPMENT COMPANIES?
`
`1120
`Restrictions: DAT~ (AFT 10/1/92). COURT (SIXTH CIRCUIT)
`142)
`
`('114
`Enter/Edit Mandatory Terms
`('I 15
`Enter/Edit Restrictions (e.g., date)
`Thesaurus/'l 16
`117
`(I 18
`Edit Search Descriptiori
`Change number of documents Current setting: 25
`
`<a1>
`<=2>
`<-3>
`<=<4>
`<=5>
`
`110
`
`For further explanation, press the H key (for HELP) and then the ENTER key.
`
`FIG 10
`
`SEARCH OPTIONS
`
`Press ENTER to send search.
`To use a Search Option, enter an equal sign followed by the number.
`
`Search Description:
`WHAT DIFFERENT METHODS OF "USABILITY TE~TING" AND TOM ARE USED
`li
`BY SOFTWARE DEVELOPMENT COMPANIES?
`120
`Mandatory:
`"USABILITY TESTING", TOMl22
`
`Restrictions: DATE (AFT 10/1/92). COURT (Sl~H CIRCUIT)
`~ 42
`('114
`<=1 > Enter/Edit Mandatory Terms
`('I 15
`<=2> Enter/Edit Restrictions (e.g., date)
`<-3> Thesaurusr'I 16
`117
`(118
`<""4> Edit Search Descriptiori
`<=5> Change number of documents Current setting: 25
`
`110
`
`For further explanation, press the H key (for HELP) and then the ENTER key.
`
`FIG II
`
`Page 8 of 23
`
`
`
`U.S. Patent
`
`Jun. 23, 1998
`
`Sheet 7 of 12
`
`5,771,378
`
`1, 2, 3, 6-10
`SEARCH TERMS FOUND IN THESAURUS
`
`('154
`Enter term numbers to view synonyms. Example: 1, 2, 3, 6-1 O
`
`150
`
`<=1 > Return to Search Options
`
`1234567890123456789012345678901234567890123456789012345678901234567890123
`2 Usable
`1 5 Piaget
`28 Andretti
`3 TOM
`16 Ncrman
`29 Winog-ad
`4 Models
`17 Carroll
`30 Newell
`5 Algorithms
`18 Cognitive
`31 Shank
`6 Graphics
`19 Metaphysical
`32 Feigenbaum
`20 Metamorphosis
`33 Hypoturbocharged
`7 Chomskian
`8 Transformational
`21 Carburator
`35 Knowledge
`9 Grammar
`Injection
`36 Wcrds
`22
`10 Averylongsearchss...
`23 12345678901234567...
`11 Anotheryerylongss...
`24 words
`25 words
`12 12345678901234567890
`26 words
`1 3 words
`
`37 wwororddss l,
`
`~
`
`38
`39 words
`
`52
`
`For fur1her explanation, ixess the H key (for HELP) and then the ENTER key.
`
`FIG 12
`
`1, 3
`
`('162
`Synonyms for: INCOHERENCE
`Enter synonym numbers to include in search and ixess ENTER (e.g., 1, 2,-3)
`
`<,.1 > Return to Search Options
`
`<a2> Return to Term Selection
`
`--- -- - - ---f-.1§-=!- __ --Term Variations-- - --- - - - - - - - - - - - -
`
`1 incoherent
`
`2 chaos
`5 disjunction
`8
`incomprehensibility
`11
`randomness
`14 unclearness
`17 wandering
`
`3 disconnection
`6 discrder
`9 lack of clarity
`12 ranting
`1 5 unevenness
`
`4 discontinuity
`7 illegibility
`1 O meaninglessness
`13 raving
`16 unintelligibility
`
`160
`
`For further explanation, press the H key (fcr HELP) and then the ENTER key.
`
`FIG 13
`
`Page 9 of 23
`
`
`
`U.S. Patent
`
`Jun. 23, 1998
`
`Sheet 8 of 12
`
`5,771,378
`
`Press ENTER to send search
`To use a Search Option, enter an equal sign followed by the number.
`
`SEARCH OPTIONS
`
`Search Description:
`1112b
`WHAT DIFFERENT METHODS OF "USABILll'Y TESTING" AND TOM
`("CONTINUOUS IMPROVEMENT", "QUALITY FUNCTION DEPLOYMENT")
`ARE USED BY SOFTWARE DEVELOPMENT COMPANIES?
`,122
`"USABILITY TESTING", TOM
`
`Mandatory:
`
`Restrictions: SATE (AFT 10/1/92). COU~ {SIXTH CIRCUIT)
`('114
`142
`rl 15
`<=1 > Enter/Edit Mandatory Terms
`<=2> Enter/Edit Restrictions (e.g., date)
`I 17
`<~3> Thesaurus/""! 16
`{118
`<"'4> Edit Search Descriptiorl
`<=5> Change number of documents Current setting: 25
`
`110
`
`For further explanation, press the H key {for HELP) and then the ENTER key.
`
`FIG 14
`
`Your FREESTYLE search has retrieved the top 44 documents based on
`statistical ranking. Search terms are listed in order of importa·nce.
`Terms after the • occurred too frequently to enhance your search.
`
`"PUBLIC FIGURE" DEFINED PURPO~ES DEFAMATION• FIRST CASE
`C172 ~
`173
`
`Mandatory Terms: DEFAMATION
`
`170
`
`Press ENTER to view documents in KWIC or use Full, Cite, or Segment keys.
`,174
`<=1 > Browse documents in SuperKWIC (. SK)
`r-175
`<=2> Location of search terms documents (. whereJ
`176
`<=3> Number of documents with search terms (. whyf
`< .. 4> Change document order (. sortf'l77
`
`For further explanation, press the H key (for HELP) and then the TRANSMIT key.
`
`FIG 15
`
`Page 10 of 23
`
`
`
`U.S. Patent
`
`Jun. 23, 1998
`
`Sheet 9 of 12
`
`5,771,378
`
`NUMBER OF DOCUMENTS WITH SEARCH TERMS (. why)
`
`184~
`Documents
`Retrieved
`
`186\
`Documents
`Matched
`
`188\
`Term Importance
`(0-100)
`
`DEFAMATION
`PUBLIC FIGURE
`DEFINED
`PURPOSES
`DEFAMATION
`A
`
`44
`11
`22
`29
`44
`
`Total Retrived:
`
`45
`
`<-1> Return to browse
`<=2> Location of terms (. where)
`
`80
`11
`26
`51
`80
`
`(M) Mandatcxy
`12
`8
`5
`4
`
`180
`
`Fcx further explanation, press the H key (fcx HELP) and then the TRANSMIT key.
`
`FIG /6
`
`LOCATION OF SEARCH TERMS IN DOCUMENTS (. where)
`
`Document numbers are listed accross the top of the chart.
`Terms are listed down the side in o..-der of importance.
`Asterisks (*) indicate the existence of terms in documents.
`To view a document, enter the document number.
`1
`1 2 3 4 5 6 7 8 9 0
`
`1941
`2
`1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
`
`)()(
`
`) ( ) ( ) (
`
`****
`)( * * * * *
`)(
`* * )(
`*
`********* *
`)(.)()()(-)()(
`*
`**
`)(-)(
`* *****
`)()(
`) ( - ) ( j \ - ) ( j \ ) ( - ) ( ) ( ) ( ) ( l ( - ) ( ) ( ) ( **********
`* * * *"* ** * * *"* * * ** * * * ** ** ***
`
`l ( l ( ) (
`
`FINISHED PRODUCT
`COMPONENT PART
`NON RESIDENT
`MANUFACTURER
`BRINGING
`FORUM
`INCORPORATED
`PROPER
`ACTION
`TENTH TERM
`<=1 > Return to browse
`<=2> Number of documents with terms (. why)
`Press the H key (fcx Help) and then the ENTER key.
`
`190
`
`FIG 17
`
`Page 11 of 23
`
`
`
`U.S. Patent
`
`Jun. 23, 1998
`
`Sheet 10 of 12
`
`5,771,378
`
`START
`
`202
`
`203
`
`ENTER SEARCH
`DESCRIPTION
`
`DETERMINE
`PHRASES
`
`204
`YES
`
`207
`
`, - - - - - - - - ,
`PROCESS
`OPTION
`
`NO
`
`NO
`
`YES
`
`208
`
`YES
`
`DESC.
`ORNOT
`1 ST MAND.
`
`NO
`
`PERFORM
`SEARCH
`
`DISPLAY
`RESULTS
`
`210
`
`212
`
`DONE
`
`FIG /8
`
`Page 12 of 23
`
`
`
`U.S. Patent
`
`Jun. 23, 1998
`
`Sheet 11 of 12
`
`5,771,378
`
`WORD 1
`WORD 2
`
`BITMAP 1
`BITMAP 2
`
`ID 1
`ID 2
`
`.
`
`WORD N
`
`)
`222
`
`ID N
`
`(224
`
`BITMAP N
`
`22i
`FIG 19
`
`!230
`
`FIG 20
`
`WORD A WORD B WORD C WORD D
`
`250
`FIG 21
`
`LEVEL 1
`
`LEVEL 2
`
`LEVEL 3
`
`LEVEL 8
`
`Page 13 of 23
`
`
`
`U.S. Patent
`
`Jun. 23, 1998
`
`Sheet 12 of 12
`
`5,771,378
`
`START
`
`START
`
`262
`
`PROVIDE TERMS
`ANO FILTER
`INSTRUCTIONS TO
`SR'S
`
`RECEIVE
`DOC COUNTS
`
`264
`
`CALCULATE
`MAXDFI AND DFI
`FOR EACH TERM
`
`265
`
`ELIMINATE SR'S
`WITH NO MATCH
`
`266
`
`PROVIDE DFI AND
`MAXDFI TO SR'S
`
`267
`
`- - - - - - - -
`
`MERGE RANKS
`AND REQUEST
`DOCS
`
`269
`
`- - -
`
`-·-- -
`
`271
`
`272
`
`CALCULATE
`TERM
`IMPORTANCE
`
`DISPLAY
`DOCS
`
`DONE
`
`263
`
`PERFORM
`SEARCH
`
`CALCULATE DOC
`RANKS AND
`RETURN TOP N
`RANKS
`
`RETURN
`REQUESTED
`DOCS
`
`268
`
`270
`
`DONE
`
`FIG 22
`
`Page 14 of 23
`
`
`
`1
`ASSOCIATIVE TEXT SEARCH AND
`RETRIEVAL SYSTEM HAVING A TABLE
`INDICATING WORD POSITION IN PHRASES
`
`5,771,378
`
`This is a continuation of U.S. patent application Ser. No. 5
`08/155,304, filed Nov. 22, 1993.
`
`FIELD OF THE INVENTION
`
`This invention relates to the field of searching and retriev(cid:173)
`ing text documents and more particularly to the field of using
`one or more computers to search a plurality of text docu(cid:173)
`ments in order to retrieve documents having particular terms
`and phrases.
`
`BACKGROUND OF THE INVENTION
`
`2
`one of the supplied search terms and then ranks each
`document using a ranking formula that varies according to
`the square of the term frequency of each of the search terms
`in the document. The ranking formula can also vary accord(cid:173)
`ing to the inverse document frequency of each search term.
`The formula can also use a maximum term frequency to
`estimate the size of a document and the maximum document
`frequency to estimate the number of documents in a collec(cid:173)
`tion of documents, thus reducing the amount of processing
`10 needed to determine document size and the number of
`documents in a collection. The user can provide mandatory
`terms which cause the search to only return documents that
`contain those terms.
`The system can employ a thesaurus to provide both
`15 synonyms and morphological variations of words. Phrases
`in the search description are detected using a table with a
`bitmap indicating possible positions of a word in a phrase
`and by using a tree having nodes corresponding to ID's
`associated with words in a phrase, the nodes being con-
`20 nected according to the order that the words can appear in a
`phrase. The system optimizes the search by distinguishing
`between noise words, which are not provided in an index for
`the documents, and frequently used terms, which are pro(cid:173)
`vided in the index but which are not used in the search.
`The system can provide display options for the documents
`that are retrieved by the search, including displaying a
`window of text that contains the greatest number and
`diversity of search terms and mandatory terms. The system
`can also display a screen indicating which search terms are
`in which retrieved documents and can display a screen that
`indicates the importance of each term, which varies accord(cid:173)
`ing to the inverse of the document frequency of each term.
`The documents can be sorted according to rank or according
`to a predetermined default method, such as reverse chrono(cid:173)
`logical order.
`The system can include a plurality of interconnected
`processors and appropriate data therefore wherein some of
`the processors perform searches and others of the processors
`merge the search data and interact with the user.
`
`40
`
`It is known that a large collection of text documents can
`be searched for particular keywords or phrases. A user can
`provide a single word or phrase or multiple words or phrases
`connected by Boolean connectors such as "AND" or "OR".
`However, in many cases, a user must be fairly sophisticated
`in order to perform searches of sufficient complexity in order
`to retrieve the exact class of documents that the user desires
`without having to perform an excessive number of searches.
`Associative retrieval, a technique for information retrieval 25
`developed in the 1960s by Gerard Salton, addresses some of
`the shortcomings of Boolean searching. Automatic Text
`Processing, (published by Addison Wesley, New York, N.Y.
`1988, and written by Gerard Salton) provides a description
`of associative retrieval searching. The basic formula used in 30
`associative retrieval involves calculating a term weight for
`each term within a search request, and scoring documents in
`a collection based on the sum of the weights for the search
`request terms that occur within the document. The two basic
`weighting factors are known as the term frequency-tf-and 35
`the inverse document frequency-idf.
`The term frequency is defined as the number of times the
`term occurs within a given document. Hence, the term
`frequency must be calculated for each document within the
`collection.
`The inverse document frequency is defined as the inverse
`of the number of documents in the entire collection which
`contain the term. Therefore, if df documents within a col(cid:173)
`lection of N documents contain a given term, the idf would
`be 1/df. The idf can be normalized with respect to the
`number of documents by setting it to log(N/df). The idf is
`calculated for each search request term, but is constant for
`the collection and does not vary by document. The score for
`a given document is calculated by summing the product of
`the tf and the idfs for each search request term contained in
`the document.
`However, there are many aspects of Associative Retrieval
`as described by Salton which render it impractical or
`unwieldy for large scale commercial use for searching and 55
`retrieving documents in large databases. Furthermore, most
`of the work done in the area of Associative Retrieval has
`failed to adequately address aspects relating to human
`interaction and feedback. It is desirable, therefore, to provide
`an associative text search and retrieval system that over(cid:173)
`comes the deficiencies of known systems.
`
`45
`
`50
`
`BRIEF DESCRIPTION OF DRAWINGS
`FIG. 1 is a schematic view of a document searching
`system according to the invention.
`FIG. 2 illustrates data stored in a physical document
`collection.
`FIG. 3 illustrates data stored in an index for a physical
`document collection.
`FIG. 4 illustrates a logical document collection comprised
`of a plurality of physical document collections.
`FIG. 5 illustrates a logical document collection comprised
`of a plurality of subsets of physical document collections.
`FIG. 6 is a screen illustrating entry of a search description.
`FIG. 7 is a screen illustrating entry of search options.
`FIG. 8 is a screen illustrating entry of mandatory terms.
`FIG. 9 is a screen illustrating entry of restrictions.
`FIG. 10 is a screen illustrating displaying of restrictions.
`FIG. 11 is a screen illustrating entry of both mandatory
`terms and restrictions.
`FIG. 12 is a screen illustrating a thesaurus function.
`FIG. 13 is a screen illustrating selection of synonyms
`and/or morphological variations of a term using the thesau-
`65 rus function.
`FIG. 14 is a screen illustrating mandatory terms,
`restrictions, and thesaurus entries.
`
`60
`
`SUMMARY OF THE INVENTION
`
`According to the present invention, a user provides a
`search description containing one or more search terms to an
`associative text search and retrieval system that searches a
`document database to retrieve documents containing at least
`
`Page 15 of 23
`
`
`
`5,771,378
`
`3
`FIG. 15 is a screen illustrating options for viewing docu-
`ments retrieved after a search.
`FIG. 16 is a screen illustrating a "why" function.
`FIG. 17 is a screen illustrating a "where" function.
`FIG. 18 is a flowchart illustrating overall operation of the
`system according to the invention.
`FIG. 19 shows a table used to detect phrases.
`FIG. 20 shows a tree data structure used to detect phrases.
`FIG. 21 shows a plurality of contiguous words from a
`search description.
`FIG. 22 is a flowchart illustrating operation of a search
`algorithm.
`
`DETAILED DESCRIPTION OF DRAWINGS
`Referring to FIG. 1, a document search and retrieval
`system 30 allows a user to search a subset of a plurality of
`documents for particular key words or phrases and retrieves,
`for the user to view, documents that correspond to the search
`request. The system 30 comprises a plurality of Search and
`Retrieval (SR) computers 32-35 connected via a high speed
`interconnection 38 to a plurality of Session Administrator
`(SA) computers 42-44. Each of the SR's 32-35 is connected
`to one or more document collections 46-49, each containing
`text for a plurality of documents, indexes therefor, and other
`ancillary data. More than one SR can be provided access to
`a single document collection. Also, a single SR can be
`provided access to more than one document collection. The
`SR's 32-35 can be implemented using a variety of com(cid:173)
`mercially available computers known to one of ordinary
`skill in the art, such as Model EXlO0 manufactured by
`Hitachi Data Systems of Santa Clara Calif.
`Each of the SA's 42-44 is provided access to data
`representing phrase and thesaurus dictionaries 52-54. The
`SA's 42-44 can also be implemented using a variety of
`commercially available computers, such as Models 5990
`and 5995 manufactured by Amdahl Corporation of Sunny(cid:173)
`vale Calif. The interconnection 38 between the SR's and the
`SA's can be any one of a number of two-way high-speed 40
`computer data interconnections known to one of ordinary
`skill in the art, such as the Model 7200-DX manufactured by
`Network Systems Corporation of Minneapolis Minn.
`Each of the SA's 42-44 is connected to one of a plurality
`of front end processors 56-58 . The front end processors 45
`56-58 provide a connection of the system 30 one or more
`commonly available networks 62 for accessing digital data,
`such as an X.25 network, long distance telephone lines, and
`SprintNet. Connected to the network 62 is a plurality of
`terminals 64--66 which provide user access to the system 30. 50
`The terminals 64--66 can be dumb terminals that simply
`process and display data inputs and outputs, or can be one
`of a variety of readily available stand-alone computers, such
`as an IBM or IBM-compatible Personal Computer. The front
`end processors 56-58 can be implemented by a variety of 55
`commercially available devices, such as Models 4745 and
`4705 manufactured by the Amdahl Corporation of Sunny(cid:173)
`vale Calif. Note that the number of components shown in
`FIG. 1 are for illustrative purposes only and that the system
`30 described herein can have any number of SA's, SR's, 60
`front end processors, etc. Also, the distribution of processing
`described herein may be modified and may in fact be
`performed on a single computer without departing from the
`spirit and scope of the invention.
`A user wishing to access the system 30 via one of the 65
`terminals 64-66 will use the network 62 to establish a
`connection, by means known to one of ordinary skill in the
`
`4
`art, to one of the front end processors 56-58. The front end
`processors 56-58 handle communication with the user ter(cid:173)
`minals 64--66 by providing output data for display by the
`terminals 64-66 and by processing terminal keyboard inputs
`5 entered by the user. The data output by the front end
`processors 56-58 includes text and screen commands. The
`front end processors 56-58 support screen control
`commands, such as the commonly known VTl00
`commands, which provide screen functionality to the termi-
`10 nals 64-66 such as clearing the screen and moving the cursor
`insertion point. The front end processors 56-58 can handle
`other known types of terminals and/or stand-alone comput(cid:173)
`ers by providing appropriate commands.
`Each of the front end processors 56-58 communicates
`15 bidirectionally, by means known to one of ordinary skill in
`the art, with the particular one of the SA's 42-44 connected
`thereto. It is also possible to configure the system, in a
`manner known to one of ordinary skill in the art, such that
`one or more of the front end processors can communicate
`20 with more than one of the SA's 42-44. The front end
`processors 56-58 can be configured to "load balance" the
`SA's 42-44 in response to data flow patterns. The concept
`of load balancing is known to one of ordinary skill in the art.
`Each of the SA's 42-44 contains an application program,
`25 described in more detail hereinafter, that processes search
`requests input by a user at one of the terminals 64--66 and
`passes the search request information on to one or more of
`the SR's 32-35 which perform the search and returns the
`results, including the text of the documents, to the SA's
`30 42-44. The SA's 42-44 provide the user with text docu(cid:173)
`ments corresponding to the search results via the terminals
`64--66. For a particular user session (i.e. a single user
`accessing the system via one of the terminals 64-66), a
`single one of the SA's 42-44 will interact with a user
`35 through an appropriate one of the front end processors
`56-58.
`Referring to FIG. 2, data 70 stored in each of the physical
`document collections 46-49 consists of document text 72,
`an index 74, and ancillary document information 76. The
`data 70 can be located in one or more files of a computer
`hard disk storage device. The document text 72 portion of
`the data 70 is comprised of character data representing text
`(such as ASCII or EBCDIC character data) for a plurality of
`documents. Each of the documents that are part of the
`document text 72 can be accessed individually. The index 74
`contains a list of terms (words and phrases) that are present
`in all of the documents of the document text 72 along with
`the locations in the documents of those terms. The ancillary
`document information 76, described in more detail
`hereinafter, contains other information about the documents,
`such as the dates associated with the documents, the source
`of the documents, etc.
`Referring to FIG. 3, the index 74 for a document collec(cid:173)
`tion comprises a plurality of entries that relate particular
`terms (terml-termn) to a plurality of locations (loc A-loc
`I). The table shown on the left-hand portion of FIG. 3 relates
`each term to a pair of pointers such that terml is related to
`pointers Pl and P2, term2 is related to pointers P3 and P4,
`and termn is related to pointers Pr and Ps. The right-hand
`portion of FIG. 3 represents a list of all of the locations for
`all of the terms in the physical document collection. The
`pointers associated with each term point to the first and last
`locations in the list in order to correlate the terms in the text
`of the documents of the physical collection with the loca(cid:173)
`tions of the terms. For example, FIG. 3 shows terml being
`located at locations loc A (pointed to by Pl) through loc C
`(pointed to by P2) in the list. All of the locations in the list
`
`Page 16 of 23
`
`
`
`5,771,378
`
`5
`between the entry for loc A and the entry for loc C indicate
`separate locations for terml in the document collection.
`Words and phrases which are so common as to be of little
`value in document searching, such as the word "of', are
`deemed "noise words" and are not included in the index. A 5
`list of noise words for each physical document collection is
`stored with the ancillary document information 76.
`The SR's 32-34 search documents in a physical collection
`for particular terms by accessing the index 74. Terms in the
`search request are matched with terms in the index 74 in 10
`order to find specific documents in the document text 72
`which contain the terms in the search request. Plural terms
`are depluralized and stored in their singular form. Terms that
`are submitted for a search are also depluralized. Deplural(cid:173)
`ization is known in the art and is described in Program, Vol. 15
`1, no. 3, pp. 130-137, Jul. 1980, which is incorporated by
`reference herein.
`A user does not typically search all of the documents of
`the system 30, but rather, chooses a subset of the