Some Considerations for Implementing the SMART
Information Retrieval System Under UNIX
`
Edward A. Fox*

83-560
September 1983

Department of Computer Science
Cornell University
Ithaca, New York 14853

*now at:
Department of Computer Science
Virginia Tech
Blacksburg, Virginia 24061
`
This work was supported in part by the National Science Foundation, under grant IST-81-08696.
`001
`
`Facebook Ex. 1005
`
`
`
TABLE OF CONTENTS

1 Introduction
2 History of Implementations
  2.1 Early Versions
  2.2 UNIX Implementation
3 Design
  3.1 IBM SMART System
  3.2 Use of UNIX
  3.3 Design Principles
  3.4 Retrieval Capabilities
  3.5 Data Storage Schemes
  3.6 INGRES Relations
  3.7 Program for Search and Ranking
  3.8 Retrieval Related Programs
  3.9 Indexing Package
4 Creation and Clustering of Extended Vectors
  4.1 Background
  4.2 Constructing Bibliographic Submatrices
    4.2.1 Input Data
    4.2.2 Initial Subvectors
    4.2.3 Updating Subvectors
    4.2.4 Construction Algorithm Complexity
  4.3 Implementation in SMART
  4.4 Clustering and Searching
    4.4.1 Background
    4.4.2 Clustering with Bibliographic Data
    4.4.3 Algorithms
    4.4.4 Clustering
      4.4.4.1 Parameters
      4.4.4.2 Actual Procedures in Outline Form
      4.4.4.3 Linearized Cluster Tree
    4.4.5 Searching
5 Evaluation
  5.1 Background
    5.1.1 Recall/Precision
    5.1.2 Summaries Available
    5.1.3 Key Issues
  5.2 Methods Utilized
    5.2.1 Similarity Based Ranking
    5.2.2 Query Statistics
    5.2.3 User Oriented Averages
    5.2.4 Single Value Measure
    5.2.5 Statistical Comparison
    5.2.6 Sample Evaluation Output
  5.3 S Interface
    5.3.1 S Package
    5.3.2 Use of S
  5.4 Discussion and Conclusion
6 Future Plans
  6.1 Packaging
  6.2 Cornell Projects
    6.2.1 UNIX Retrieval Tools
    6.2.2 Phrase and Thesaurus Construction
    6.2.3 Probabilistic Retrieval
    6.2.4 Office Information Systems
  6.3 Extensions
  6.4 Conclusion
References
`
`
`
SOME CONSIDERATIONS FOR IMPLEMENTING THE
SMART INFORMATION RETRIEVAL SYSTEM UNDER UNIX

Edward A. Fox†
`
`Abstract
`
`Since the early 1960's the SMART project has tested out new ideas in information science aimed
`
`at fully automatic document retrieval. Beginning in 1980 development of an enhanced and generalized
`
`version of SMART has progressed at Cornell. The current implementation is in the C language and
`
`runs under the UNIX operating system on a VAX 11/780 computer.
`
The history of SMART is outlined. Considerations that led to the current design are described.
`
Since SMART now allows multiple concept types to be manipulated in connection with an extended
`
`vector representation, storage and processing issues are discussed, including use of INGRES relations.
`
`Clustering algorithms are presented and run parameters are given for document clustering and subse-
`
`quent clustered searching. SMART experiments (e.g. with p-norm queries, or probabilistic methods)
`
`can be compared using the evaluation package. The S statistical package can be applied to performing
`
`other special analysis and descriptive tasks. Finally, to illustrate the usefulness of these facilities, an
outline is given of current SMART activities and of future plans.
`
†Department of Computer Science, Cornell University, Ithaca, NY 14853; now at Dept. of Comput-
`er Science, Virginia Tech, Blacksburg, VA 24061. This work was supported in part by the National
`Science Foundation, under grant IST-81-08696.
`
`-1-
`
`005
`
`Facebook Ex. 1005
`
`
`
`1. Introduction
`
`-2-
`
Development of the SMART system has proceeded since 1961 [Salton 1980]. This author began

implementing SMART under the UNIX¹ operating system in 1980 on Cornell University Department of

Computer Science's computers. Preliminary design notions were discussed in [Fox 1981]. This report

brings the status of SMART up to date, describing the running system as used for the experiments

detailed in [Fox 1983b].
`
`Section 2 concerns the history of SMART. Early versions are mentioned; for more elaboration see
`
`[Salton 1971a, 1975, 1980]. The current UNIX version is also introduced.
`
`The design of SMART is covered in section 3. Except for clustering and evaluation, all of the
`
`basic package is dealt with. Beginning with points gleaned from the IBM SMART system [SMART
`
`Project 1974], reasoning is given of how tools available under UNIX could be applied to retrieval tasks.
`
`Design principles are then listed, and the capabilities desired for the system are enunciated.
`
`Issues of
`
`data storage are considered,
`
`including use of the INGRES relational database package. With data
`
requirements taken care of, program modularization is discussed. Searching, ranking, retrieval, and

indexing are each separately considered.

Section 4 focuses on the formation and manipulation of extended vectors. A brief summary is

given of the model introduced in [Fox 1983b], which has been tested using two new experimental collec-

tions [Fox 1983a]. A general procedure is given for constructing similar document collections, followed

by a description of implementation issues as they relate to SMART software. Two types of manipula-
`
`tion of these extended vectors, clustering and clustered searching, are then considered.
`
The SMART evaluation package is discussed next. Of particular interest may be the integration
`
`of special software with the general purpose S statistical package.
`
`Justification is also given for the
`
¹UNIX is a trademark of Bell Laboratories.
`
`006
`
`Facebook Ex. 1005
`
`
`
`various algorithms and tests incorporated.
`
`-3-
`
This report concludes with a sketch of recent studies being made with the SMART software, and
`
`with mention of ideas for future projects. The interested reader is advised to view the latter purely as
`
`illustrating some of the potential applications for SMAR T in upcoming years.
`
2. History of Implementations
`
`2.1. Early Versions
`
In the late 1950's, automatic text analysis began to be seriously studied. In 1961, the SMART
`
`(System for Mechanical Analysis and Retrieval of Text) project began, with the object of designing a
`
`fully automatic document retrieval system and of testing out new ideas in information science.
`
In SMART, an automatic indexing component constructs stored representations of documents.
`
`Then a search component selects those documents which seem most similar to a user's query. Finally,
`
`an evaluation component allows rating of batches of results, using relevance judgments obtained from
`
the user population, thereby facilitating comparisons between various indexing and search schemes [Sal-
`
`ton 1980].
`
SMART was originally intended for operation in a time-sharing environment. In 1964 a prelim-
`
`inary design suited for such operation was suggested [Rocchio 1964] and implementation work began
`
for a version to run under M.I.T.'s CTSS [Cane 1964]. In 1966, Lesk concluded that fully automatic

documentation centers were feasible with current technology [Lesk 1966].
`
`In 1970, a detailed design for
`
an IBM 360 based conversational system was prepared [Williamson & Williamson 1970]. Later, a par-
`
`tial implementation for the IBM 370 TSO environment was undertaken.
`
`The most complete early version of SMART was developed for batch operation on the IBM 370 at
`
`Cornell, and was completed around 1974 [SMART Project 1974]. The last major project using that
`
`007
`
`Facebook Ex. 1005
`
`
`
`-4-
`
`system at Cornell was completed in 1980, and involved indexing, clustering, searching, and evaluation
`
`runs to show how SMART methods could be beneficially applied to indexing of business correspondence
`
`[Nodtvedt 1980].
`
The IBM SMART system had been programmed in FORTRAN IV with some portions in assem-

bly language. Auxiliary programs, such as for clustering, were coded in a variety of other languages,

especially in PL/I. Since running the system resulted in expensive computer charges, and since modify-

ing or extending the generally undocumented monolithic system seemed inadvisable, it was decided to

implement a generalized version of SMART using more modern methods and less expensive hardware.
`
2.2. UNIX Implementation
`
`By 1980, several components of the SMART system had been reprogrammed in the C language
`
`[Kernighan & Ritchie 1978] to run under the UNIX operating system [Ritchie & Thompson 1974] on a
`
Digital Equipment Corporation PDP² 11/60 computer. Later that year the software was transferred to

a VAX³ 11/780, and work since then has been on that hardware. There are now plans to also port the
`
`programs to a Motorola 68000 series based microcomputer running UNIX.
`
`A number of proposals and tools have been considered to aid in the development effort. On the
`
`VAX, a collection of routines implementing the operators of the relational algebra [Date 1982] were pro-
`
grammed to allow occasional manipulation of relations; many of the smaller data files were stored as
`
`relations to aid in standardization. To generalize that approach,
`
`the idea of relational processing of
`
abstract data types was advocated for retrieval operations [Fox 1981]. Though such a development
`
`scheme was theoretically appealing, there was not enough time or programming support to carry the
`
`project to completion in such a manner.
`
²PDP is a trademark of Digital Equipment Corporation.
³VAX is a trademark of Digital Equipment Corporation.
`
`008
`
`Facebook Ex. 1005
`
`
`
`-5-
`
`In 1981, an early version of the INGRES relational database management system [Stonebraker
`
et al. 1976] was obtained for use at Cornell.
`
`In 1982, the S statistical package [Becker & Chambers
`
`1981] was also installed on the VAX.
`
INGRES served as the backbone for data storage and manipula-
`
`tion in all situations where efficiency was not of crucial importance, and S became a handy tool for
`
exploratory data analysis and for statistical calculations. The histograms, scatter plots, and much of
`
the statistical reporting in [Fox 1983a, 1983b] were all done using S.
`
`Though various such tools speeded up and supplemented the retrieval experimentation, the main
`
part of SMART should be implementable on any UNIX system with sufficient storage and adequate
`
`memory management.
`
Indeed, a version of SMART, used for a number of standard operations when
`
`efficiency is essential, requires neither INGRES nor S. One possible plan for the future is to have
`
`retrieval experiments carried out on a dedicated microcomputer that runs UNIX. Then the statistical
`
evaluation and the comparison of results among runs could be done on a VAX which supports INGRES
`
`and S.
`
`3. Design
`
3.1. IBM SMART System
`
`The IBM SMART system, completed in the mid 1970's by a group of students and programmers
`
at Cornell, was designed so as to be able to carry out the necessary processing without requiring a great

deal of memory or machine resources [Williamson et al. 1971]. Further, it was programmed in FOR-
`
`TRAN, since members of the team were proficient in that language, even though the language standard
`
`then current lacked good tools for structured programming, text processing, or manipulation of files on
`
`secondary storage units.
`
`009
`
`Facebook Ex. 1005
`
`
`
`SMART was executed as a single program, which brought in appropriate procedures as overlays
`
`when needed. Communication between various subprograms was through data files on secondary
`
`storage, or, more often,
`
`through variables held in a globally accessible common area of high speed
`
`memory. Unfortunately, to save space, portions of that common area were reused for varying purposes,
`
`making the data flow and interconnections exceedingly difficult to follow.
`
In 1981 an undergraduate student studied these SMART programs in order to prepare an over-
`
view of the system. The task was rather difficult and confusing, and so it was decided that further
`
`analysis of that software package would not be encouraged. However, the general structure deduced,
`
`part of which is shown in Figure 1, did provide useful guidance for other development efforts.
`
`As can be seen in Figure 1, an executive routine leads to data initialization and then calls in, one
`
`by one, other procedures required to carry out the batch job submitted. Some of the data provided
`
`specifies values of parameters required by the processing routines; the remaining parameters are given
`
`appropriate settings by default.
`
In any case, the list of variables needing specification was useful infor-
`
`mation in helping plan the description of modules for a subsequent implementation.
`
3.2. Use of UNIX
`
In 1980, it was decided that future research studies by this author would make use of the experi-

mental facilities provided by the National Science Foundation for the Department of Computer Science
`
`at Cornell, since there were no charges for time or storage space, and since available development tools
`
were easier to work with than those on the IBM system. The UNIX⁴ operating system, in particular,
`
`encouraged interactive program development and testing of new ideas, using quick to develop prototype
`
`implementations; it was also very effective at performing the necessary text processing operations.
`
⁴UNIX is a trademark of Bell Laboratories.
`
`010
`
`Facebook Ex. 1005
`
`
`
`-7-
`
`Figure 1: Structure of IBM 370 SMART System
`
`Executive Routine - calls initialization routine
`Data Set Initialization - start or restart runs, and call other routines
`Text Input Routines
`Store Input
`Convert Text to Vectors
`Perform Partial Syntactic Analysis
Recode Vectors for New Dictionary
`Vector Input Routines
`Load in or Modify a Vector Collection
`Read in Results of Previous Searches
`List out a Vector Collection
`Search Routines
`Modify Queries for Feedback Runs
`Specify Which Documents should be Correlated
`Correlate Documents against Query
`Sort on Correlation and Assign Ranks
`Prepare Results for Printing
`Clustering Routines
`Define Newly Specified Cluster
`Cluster via Dattola's Algorithm
`Cluster via Rocchio's Algorithm
`Averaging Routines
`Compute Various Averages
`Print Output and Graphs
`Run and Print Statistical Tests
`
`For example, to index a document, one program could take a typical text input file and return a
`
`list of single words. A second could remove "stop" words, while a third could replace words with their
`
`stems. Other routines could sort lists, or number the lines of a file. By taking the output of one run
`
`and "piping" it
`
`into another program, two or more processes could execute virtually simultaneously,
`
`avoiding the need for intermediary files to be actually materialized in their entirety.
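To make the flavor of one such pipeline stage concrete, the short C filter below is offered as a sketch only; the program name, the assumed input layout of "document number followed by free text," and the one-pair-per-line output are illustrative assumptions, not the actual SMART code.

    /* words.c -- hypothetical pipeline stage: split document text into words.
     * Input:  lines of the form "<document number> <free text>"
     * Output: one "<document number> <word>" pair per line, lower-cased,
     * suitable for piping into stop-word removal, stemming, and sorting.
     */
    #include <stdio.h>
    #include <ctype.h>

    int main(void)
    {
        int c, docnum, len;
        char word[256];

        while (scanf("%d", &docnum) == 1) {       /* leading document number */
            len = 0;
            while ((c = getchar()) != EOF && c != '\n') {
                if (isalpha(c)) {
                    if (len < (int)sizeof(word) - 1)
                        word[len++] = tolower(c); /* fold case as we scan */
                } else if (len > 0) {
                    word[len] = '\0';
                    printf("%d %s\n", docnum, word);
                    len = 0;
                }
            }
            if (len > 0) {                        /* word ending the line */
                word[len] = '\0';
                printf("%d %s\n", docnum, word);
            }
            if (c == EOF)
                break;
        }
        return 0;
    }

A stop word filter and a stemmer written in the same standard-input-to-standard-output style could then be attached with pipes, for instance as "words < ca.docs | stopfilter | stem | sort", where the program and file names are again hypothetical.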
`
`With the UNIX tools provided and a moderate number of new ones, a simple system for indexing
`
`documents was devised. To make it easier for others to use, the steps were recorded in a file that could
`
`011
`
`Facebook Ex. 1005
`
`
`
`-8-
`
`thereafter be executed as a "shell script;" UNIX allows procedures to be written in a system command
`
`language or shell and those executable files can later be invoked as if they were a single program.
`
`Eventually, to provide capabilities needed to handle more complex document formats, and to allow
`
more efficient operation as required for indexing thousands of documents, several sections of the index-

ing script were redone as single programs. This methodology of connecting available tools to develop
`
`prototypes that test new ideas, and of subsequently adapting and tuning those early versions for more
`
`efficient operation, is very much in accord with the practice of stepwise refinement [Wirth 1971].
`
`3•.3. Design Principlea
`
`To formalize these general notions, a number of design principles were explicitly stated to guide
`
`further planning:
`
`(1) There should be a network of programs rather than a single large one.
`
`(2) Communication between programs should be through data files.
`
`(3) Communication between procedures and functions should be through argument
`
`lists and return
`
`values.
`
`(4) Lists of default settings of parameters should be stored in text form in data files, to facilitate
`
`making changes.
`
(5) Different shell scripts should be developed to call the various programs automatically, for all stan-

dard tasks.
`
(6) The same program (or at least many of the same procedures) should be usable, with minor exten-

sions perhaps, for experimental studies and for "production" retrieval runs. Thus, separate ver-

sions should be avoided as much as possible, and one should be able to run large batch experi-

ments, interactively carry out a test search, or try out a new type of similarity function, all with
`
`012
`
`Facebook Ex. 1005
`
`
`
the standard system.
`
`-9-
`
`(7) Old SMART data files should be converted into new forms that might be required for indexing or
`
other additional processing.
`
`(8) Emphasis should be on getting new methods to work, or reimplementing old SMART components,
`
`as expeditiously as possible. All available tools should be exploited, including INGRES and S, and
`
`coding from scratch should only be done as a last resort.
`
`(9) Tuning should be done as needed. First, sections of shell scripts should be replaced by specialized,
`
`more efficient programs to accomplish the same function. Then, algorithms and data structures
`
`should be revised to reduce processing and data manipulation costs.
`
3.4. Retrieval Capabilities
`
The actual retrieval software was also planned so as to provide a number of capabilities in a single
`
`coherent system:
`
`(1) The ranking of retrieved documents was to be done separately from the actual search.
`
`(2) Search methods should include sequential and clustered techniques first and manipulation of
`
`inverted files later.
`
`(3) Vector and Boolean systems should be easily modelled.
`
`(4) Probabilistic methods should be incorporated.
`
(5) A number of similarity functions should be included (a sketch of the cosine case, in C, is given
after this list), such as:

a- cosine correlation
b- vector inner product
c- Boolean expression evaluation
d- p-norm expression [Salton, Fox & Wu 1982] interpretation
e- probabilistic weighting where terms are assumed independent,
where term dependencies follow a tree structure, or where
the BLE scheme ([Duda & Hart 1973] and [Yu, Luk & Siu 1979])
with singles, pairs, and triples of terms are used.
`
(6) As in the SIRE system ([McGill et al. 1976], [McGill & Noreault 1977], and [Noreault, Koll &

McGill 1977]), it should be possible to rank the documents retrieved by a Boolean search according

to some other similarity function.
`
`(7) Ranking of documents should be possible for any type of search method, for both initial and feed-
`
`back searches.
`
`(8)
`
`.Ranking for feedback runs should be possible using anyone of several evaluation .methods [Chang,
`
`Cirillo & Razon 1971]:
`
`a- Residual collection
`b- Test and control
`c- Full freezing
`d- Partial freezing
`
`(9) Evaluation results should be stored for
`
`later printing and analysis. Printout options should
`
`include:
`
a- Recall-precision results for each query
b- Recall-precision averages for a batch of queries
`c- Other statistics for each query or query collection
`d- Statistical test results comparing the effectiveness of
`several retrieval trials.
`
(10) Statistical comparisons should at least include ([Siegel 1956], [Snedecor & Cochran 1980]):
`
`a- T test
`b- Sign test
`c- Wilcoxon signed rank test.
`
(11) Clustering runs should allow gathering of information about search efficiency.
`
`(12) An interactive version should provide reasonable responsiveness for:
`
`014
`
`Facebook Ex. 1005
`
`
`
`-11-
`
`a- Submission of p-norm, Boolean, or natural language query forms
`b- Clustered or sequential searches
`c- Display of portions of text from selected top-ranked documents
`d- Improvement of current query by feedback processing
`e- Subsequent searching, and repetition of steps (a)-(d).
`
`(13) Document and query representations should allow use of multiple concept types, and composite
`
`similarity functions should be specifiable.
`
`(14) Indexing should allow different processing for the various portions of incoming records, to account
`
`for formatting peculiarities and to enable separation into multiple concept types. Normalization
`
`of names (e.g., use last name followed by first initial) and dates should be automatic in appropri-
`
`ate places.
`
`(15) Words should be able to be grouped, as in phrases appearing in keyword lists, or to be stemmed,
`
or to have final "s" removed.
`
`(16) A standard stop word list should be used as the default choice, but users should be allowed to
`
`substitute other lists instead.
`
(17) Automatic identification of statistical phrases (groups of words that co-occur often enough, appear

close together in the same sentence, and each have proper collection wide frequency) should be
`
`possible. To allow this, positional information on each word in the collection must be available.
`
`(18) Auxiliary programs should be able to process data used by the retrieval system. These should
`
`include routines for:
`
`a- Clustering
`b- Construction of feedback queries
`c- Parsing and conversion from infix Boolean or p-norm expressions
`to prefix forms and later to a fast internal representation.
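To make the first entry of list item (5) concrete, the sketch below computes the cosine correlation between two sparse vectors held as sorted lists of <concept number, weight> pairs, the compact vector layout described in section 3.5. The structure and function names are illustrative only, not the actual SMART interfaces; dropping the final normalization yields the vector inner product of item (5b).

    /* cosine.c -- illustrative cosine correlation between two sparse vectors,
     * each stored as <concept number, weight> pairs sorted by concept number.
     * A sketch of the computation only, not the SMART implementation.
     */
    #include <math.h>

    struct conwt {
        int   con;      /* concept number */
        float wt;       /* concept weight */
    };

    double cosine(const struct conwt *q, int nq, const struct conwt *d, int nd)
    {
        double dot = 0.0, qq = 0.0, dd = 0.0;
        int i = 0, j = 0;

        /* merge the two sorted pair lists to accumulate the inner product */
        while (i < nq && j < nd) {
            if (q[i].con == d[j].con) {
                dot += (double)q[i].wt * d[j].wt;
                i++; j++;
            } else if (q[i].con < d[j].con) {
                i++;
            } else {
                j++;
            }
        }
        for (i = 0; i < nq; i++) qq += (double)q[i].wt * q[i].wt;
        for (j = 0; j < nd; j++) dd += (double)d[j].wt * d[j].wt;

        if (qq == 0.0 || dd == 0.0)
            return 0.0;             /* treat an empty vector as similarity 0 */
        return dot / sqrt(qq * dd);
    }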
`
`In accord with these principles,
`
`the UNIX SMART system was programmed in stages over about a
`
`three year period.
`
`In the following sections, details of the current design are given. First, underlying
`
`015
`
`Facebook Ex. 1005
`
`
`
`data handling issues are described and then the various main programs are outlined.
`
`-12-
`
`3.5. Data Storage Schemes
`
`Information retrieval calls for storing and accessing large amounts of data. Though specialized
`
`structures like compact vectors or inverted files are probably still the most appropriate ones to use for
searching, there are many intermediate results and requirements for processing of smaller data group-

ings where relations can be effectively used. Furthermore, as relational systems become more efficient,

with new hardware mechanisms and better algorithms available, and as text retrieval systems are gen-

eralized to handle structured as well as textual information, relations may be utilized even further.
`
In an earlier study [Fox 1981], relational processing was proposed for all aspects of retrieval work.
`
`For a practical implementation of SMART with existing software tools, however, several different data
`
`representations were employed.
`
First, standard UNIX text files were selected for a wide variety of purposes. The original forms of

document and query collections were provided that way. Parameters for complicated runs, such as

clustering, were kept in this form so as to be easily edited. Finally, the output of runs almost always
`
`included textual output, for ease of interpretation.
`
`Second, specialized files were employed. Compact forms of vectors, whether of queries, documents,
`
`or cluster centroids, were recorded as lists of
`
`<integer, real number>
`
`pairs in binary format to save space and speed up processing. Pointers to text files, giving the starting
`
`position and field lengths for each collection document, were stored in a fast access format to enable
`
`quick presentation of retrieved documents to on-line searchers. A final example is that co-occurrence
`
`data for probabilistic retrieval methods was also organized in a specialized format.
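As a purely illustrative sketch of how such a compact binary vector file might be read (the layout assumed here, a vector identifier and pair count followed by the pairs themselves, is an assumption for this example, not a statement of the actual SMART file format):

    /* read_vec.c -- illustrative reader for a compact binary vector file.
     * Assumed layout (an assumption, not the documented SMART format):
     * for each vector, an integer id and a pair count, followed by that
     * many <concept number, weight> pairs written with fwrite().
     */
    #include <stdio.h>
    #include <stdlib.h>

    struct conwt {
        int   con;      /* concept number       */
        float wt;       /* weight (real number) */
    };

    /* Read the next vector; returns the pair count, or -1 at end of file.
     * On success *pairs is malloc'd and must be freed by the caller.      */
    int read_vector(FILE *fp, int *vec_id, struct conwt **pairs)
    {
        int n;

        if (fread(vec_id, sizeof(int), 1, fp) != 1)
            return -1;
        if (fread(&n, sizeof(int), 1, fp) != 1 || n < 0)
            return -1;
        *pairs = malloc((size_t)n * sizeof(struct conwt));
        if (*pairs == NULL)
            return -1;
        if (fread(*pairs, sizeof(struct conwt), (size_t)n, fp) != (size_t)n) {
            free(*pairs);
            return -1;
        }
        return n;
    }

If the pairs are kept sorted by concept number, a similarity routine such as the cosine sketch given in section 3.4 can process two vectors in a single merging pass.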
`
`016
`
`Facebook Ex. 1005
`
`
`
`-13-
`
`A third scheme for handling retrieval data was that of storing relations as text files. A good
`
`example is during the indexing phase, where UNIX programs can pipe results from one process to the
`
`next and thus save on intermediate storage. Starting with a stream of data organized as tuples, each a
`
`<document number, string containing a word>
`
`pair, eventually a file containing tuples like
`
`<document number, concept type, concept number, document frequency>
`
`is produced.
`
`These textual relations are very useful, since each attribute can be of variable length, and since
`
processing textual forms is exceptionally easy under UNIX. One handy transformation is that of

compressing or expanding between relation and list forms and vice versa. Thus, a document collection
`
`of many
`
`<document number, concept number>
`
`pairs could be compressed into N groupings of form
`
`<document number, list of concept numbers, 0 to end list>
`
`where there are N distinct documents. Later, the groupings could be stored as compact vectors, or
`
`could be expanded back into first normal form relations.
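A filter performing the compressing direction of this transformation might look like the sketch below; it assumes its input pairs arrive already sorted by document number (for example, after a pass through the UNIX sort utility) and writes one grouping per line, closed by the 0 terminator mentioned above. It is an illustration, not the SMART utility itself.

    /* compress.c -- illustrative filter turning sorted
     * "<document number> <concept number>" pairs (one per input line) into
     * groupings "<document number> <con> <con> ... 0", one line per document.
     * Assumes non-negative document numbers and nonzero concept numbers.
     */
    #include <stdio.h>

    int main(void)
    {
        int doc, con, cur = -1;

        while (scanf("%d %d", &doc, &con) == 2) {
            if (doc != cur) {
                if (cur != -1)
                    printf(" 0\n");   /* close the previous grouping */
                printf("%d", doc);
                cur = doc;
            }
            printf(" %d", con);
        }
        if (cur != -1)
            printf(" 0\n");           /* close the last grouping */
        return 0;
    }

The expanding direction is just as short: read a document number, then copy out one pair per concept number until the 0 terminator is seen.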
`
`The fourth and final format of data storage employed is that of relations stored in a database of
`
`the INGRES relational system. One database was created for each document collection. Routines or
`
`procedures were generally available to convert INGRES relations to text file type relations. Some of
`
the relations stored in INGRES are shown in Figures 2 through 4.
`
`017
`
`Facebook Ex. 1005
`
`
`
`3.6. INGRES Relations
`
`-14-
`
`Much about a system of computer programs can be understood when the detailed form of the data
`
`being manipulated is known. Hence, a description of the relations utilized in the SMART search and
`
`ranking system should be of particular value. These relations are of special
`
`interest since they are
`
`simpler to specify than are other secondary storage files - one need only know the purpose of the rela-
`
`tion and details about the attributes which make it up.
`
Figure 2 covers the three relations employed for p-norm queries⁵. Utilizing these illustrates the

compromise between efficiency and ease of representation. When Boolean or p-norm queries are
`
`parsed,
`
`they are placed into an internal structure which is a tree form of the p-norm expression.
`
Figure 2: INGRES Relations Used for P-Norm Queries

1) Tree description: details of parse tree of query representation
   no_cons  - number of concepts in the query
   no_nodes - number of nodes in the tree
   bool     - whether the query is interpreted as strict Boolean
   root     - node number of the root of the tree
2) Concepts: details about each concept in the query
   qid      - query identifier
   con      - concept number
   ctype    - concept type
3) Nodes: details about each operator or concept node in tree
   type     - "concept," or else which operator (AND, NOT, OR)
   c        - if type is "concept," then concept number
   ctype    - if type is "concept," then concept type
   p        - if type is "operator," then p-value attached
   wt       - relative weight attached to this node
   child    - number of the node which is its first child
   sibl     - number of the node which is its first sibling
`
⁵P-norm queries were first defined in [Wu 1981]. Experimental validation of their usefulness is
given in [Salton, Fox & Wu 1982], [Salton, Buckley & Fox 1983], [Salton, Fox, Buckley & Voorhees
1983], and in great detail in [Fox 1983b].
`
`018
`
`Facebook Ex. 1005
`
`
`
`-15-
`
`Storage is done by recording essential data: about each query tree, about the leaves or concepts of the
`
`trees, and about
`
`the set of nodes in the tree. Modification of the trees is typically accomplished by
`
reading the tree back into an internal structure, making revisions (such as is required for Boolean feed-
`
`back) and storing the new tree instead of the old. Alternatively, the relational form can be directly
`
`changed.
`
`To carry out a search, the query is loaded from the relations back into internal form, and then is
`
`rapidly evaluated for each document. The internal structure allows fast interpretation of the query and
`
`simplifies revisions; the relational storage scheme is slow to load and store, but makes accessing of the
`
`data much simpler to specify and eases the workload when structural modifications are required.
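The in-core form just described can be pictured as an array of nodes whose fields mirror the attributes of the Nodes relation in Figure 2. The declaration and evaluation routine below are a sketch only: the evaluator follows the p-norm definitions of [Salton, Fox & Wu 1982] for finite p, node numbers are taken to be array indices with -1 marking a missing child or sibling, and term_sim is a hypothetical helper returning the normalized weight of a concept in the current document. None of this is presented as the actual SMART code.

    /* pnode.c -- illustrative in-core p-norm query tree, mirroring the Nodes
     * relation of Figure 2, with a sketch of p-norm evaluation.
     */
    #include <math.h>

    #define OP_CONCEPT 0
    #define OP_AND     1
    #define OP_OR      2
    #define OP_NOT     3

    struct pnode {
        int    type;   /* OP_CONCEPT, or which operator (AND, NOT, OR) */
        int    c;      /* if a concept node: concept number            */
        int    ctype;  /* if a concept node: concept type              */
        double p;      /* if an operator node: attached p-value        */
        double wt;     /* relative weight attached to this node        */
        int    child;  /* index of first child, or -1                  */
        int    sibl;   /* index of first sibling, or -1                */
    };

    /* Hypothetical helper: weight of the concept in the current document,
     * normalized into [0, 1].                                            */
    extern double term_sim(int c, int ctype);

    /* Evaluate node n of the query tree held in nodes[]. */
    double pnorm_eval(const struct pnode *nodes, int n)
    {
        const struct pnode *nd = &nodes[n];
        double num = 0.0, den = 0.0, s, w;
        int k;

        if (nd->type == OP_CONCEPT)
            return term_sim(nd->c, nd->ctype);
        if (nd->type == OP_NOT)
            return 1.0 - pnorm_eval(nodes, nd->child);

        /* AND and OR combine their children with the p-norm formulas */
        for (k = nd->child; k != -1; k = nodes[k].sibl) {
            s = pnorm_eval(nodes, k);
            w = pow(nodes[k].wt, nd->p);
            num += w * pow(nd->type == OP_OR ? s : 1.0 - s, nd->p);
            den += w;
        }
        if (den == 0.0)
            return 0.0;
        s = pow(num / den, 1.0 / nd->p);
        return nd->type == OP_OR ? s : 1.0 - s;
    }

Loading a query then amounts to filling the nodes[] array from the stored relations before the document loop begins.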
`
Figure 3 is about the four relations involved in searching and ranking. There are two key outputs

of the programs: a top-ranked set, and ranking information about all relevant documents. For interac-

tive searches, the user asks for, say, the ten best documents. For batch experiments with relevance
`
`feedback,
`
the top ten to twenty documents may be selected for presentation to the user. When recall-

precision charts or graphs are desired, it is crucial to have the rank of each relevant document so preci-
`
`sion at each recall level can be calculated and averaged over all queries in the set.
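To spell that calculation out in miniature: if all documents are ranked, as in a full sequential search, then when the i-th relevant document is reached the recall is i divided by the total number of relevant documents, and the precision observed at that point is i divided by that document's rank. The routine below computes these pairs for one query from the ranks recorded in the relevant ranks relation of Figure 3; interpolation to fixed recall levels and averaging over the query set would follow. It is a sketch, not the SMART evaluation code.

    /* rp.c -- illustrative recall/precision points for one query, computed
     * from the ranks of its relevant documents (cf. the relevant ranks
     * relation of Figure 3).  ranks[] must be sorted in increasing order;
     * nrel is the number of documents relevant to the query, all of which
     * are assumed to have received a rank.
     */
    #include <stdio.h>

    void recall_precision(const int *ranks, int nrel)
    {
        int i;

        for (i = 1; i <= nrel; i++) {
            double recall    = (double)i / nrel;
            double precision = (double)i / ranks[i - 1];
            printf("recall %.3f  precision %.3f\n", recall, precision);
        }
    }

    int main(void)
    {
        /* example: 4 relevant documents ranked 1, 3, 7, and 20 */
        int ranks[] = { 1, 3, 7, 20 };

        recall_precision(ranks, 4);
        return 0;
    }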
`
For cluster searches it is useful to have information about how much of the document and cen-
`
`troid files have had to be examined in order to present the user with the desired number of documents.
`
And, for query collections that are in relational form, the obvious format for storing relations is used.
`
`Figure 4 deals with the input information required to commence an experimental run. The first
`
relation connects to all other relations, so that with the "run_name" specified, one has the key required
`
`to find out about the document and query collections to use, and about other parameters which must
`
`be known.
`
`"sim_type" selects the desired similarity function, while "dwt" and "qwt" identify the
`
`chosen concept weighting schemes. When multiple concept types are present, their number is given by
`
`019
`
`Facebook Ex. 1005
`
`
`
`-16-
`
Figure 3: INGRES Relations Used for Search & Ranking

1) Top-ranked set: identifies the n most similar documents to query
   iter       - feedback iteration step now in effect
   qid        - identifier of query being processed
   sim        - similarity of document to the query
   rank       - rank of the document in retrieved set,
                where 1 ≤ rank ≤ n
   rel        - whether this document is judged relevant to query
2) Relevant ranks: ranks for all relevant documents, for each search
   iter       - feedback iteration step now in effect
   qid        - identifier of query being processed
   did        - relevant document identifier
   rank       - rank of this document after retrieval,
                where ties are randomly broken
   sim        - similarity of document to the query
3) Cluster search efficiency: counts of vectors retrieved
   run_name   - description of this experimental run
   iter       - feedback iteration number
   qid        - identifier of query being processed
   cent_total - total number of centroids examined
   cent_ll    - number of low level centroids examined
   doc_total  - total number of documents examined
   doc_ret    - number of documents actually presented to user
4) Query vectors: relational form of vector queries
   qid        - query identifier
   con        - concept number
   ctype      - concept type
   wt         - weight for current concept
`
`020
`
`Facebook Ex. 1005
`
`
`
`-17-
`
Figure 4: INGRES Relations Required to Specify Runs

1) Parameters for search run
   run_name    - name or title used to refer to this search run
   da_title    - title of document collection selected
   qa_title    - title of query collection selected
   sim_type    - what kind of similarity function to use
   dwt         - what kind of document concept weighting to use
   qwt         - what kind of query concept weighting to use
   p-value     - if given, p to override that on all operators
   thes        - name of thesaurus to use (not yet implemented)
   iter        - feedback iteration number
   no_pairs    - number of pairs to include in BLE expansion
   no_triples  - number of triples to include in BLE expansion
   fdbk_type   - what feedback evaluation (e.g., freezing) to use
   no_tr       - number of documents to remember in top ranked
                 set for feedback presentation (e.g., 10)
   sim0_opt    - should relevant documents with 0 similarity be
                 randomly assigned ranks or get worst possible
   rel_ranks   - name of relation to record ranks of relevants
   top_ranks   - name of relation to record information about
                 top-ranked documents for feedback presentation
   no_wanted   - number desired from cluster search
   numb_rels   - estimated number relevant per query, for
                 relative recall evaluations
   use_nrels   - indication to use "numb_rels", since complete
                 relevance data is unavailable

2) Document collection details
   a_title     - title of collection
   no_docs     - number of documents
   no_ctypes   - number of different concepts used in indexing
   tree        - UNIX tree path name for directory of data files
   centroid_fa - name of fast access storage file for centroids
   centroid_cw - name of concept-weight storage file for centroids
   fast_access - name of fast access document vector file
   con_wt      - name of concept-weight document vector file

3) Query collection details
   a_title     - title of collection
   no_q        - number of queries
   q_type      - query type: e.g., p-norm or vector
   q_rels      - name of relation with relevance information
`
`021
`
`Facebook Ex. 1005
`
`
`
`-18-
`
`Figure 4 continued: INGRES Relations Required to Specify Runs
`
`4) Concept type multipliers for combined retrieval formula
   run_name - name of run (matches that in relation 1 above)
   ctype    - concept type
            - multiplier or coefficient for thi