throbber
Soae Co•• ider.~io.. for Iapleae.~i.&
`~.e SKAar I.for•• ~io. Ie~riey.l
`S7.~ea V.der V.II
`
`Edward A. Fox
`
`*
`
`83-560
`September 1983
`
`.
`*
`Department of Computer SC1ence
`Cornell University
`Ithaca. New York
`
`14853
`
`now at:
`Department of Computer Science
`Virginia Tech
`Blacksburg. Virginia
`
`24061
`
`This work was supported in part by
`tion. under grant IST-81-08696.
`
`the National Science Founda(cid:173)
`
`001
`
`Facebook Ex. 1005
`
`

`
`TABLE OF CONTENTS
`
`1 Introduction .................................•................................................................................................
`
`2 History of Implementations
`
`2.1 Early Versions
`
`2.2 UNIX Implementation
`
`3 Design
`
`3.1 IBM SMART System
`
`3.2 Use of UNIX
`
`3.3 Design Principles
`
`3.4 Retrieval Capabilities
`
`3.5 Data Storage Schemes
`
`3.6 INGRES Relations
`
`3.7 Program for Search and Ranking
`
`:.::..
`
`3.8 Retrieval Related Programs
`
`3.9 Indexing Package
`
`4 Creation and Clustering of Extended Vectors
`
`4.1 Background
`
`4.2 Constructing Bibliographic Submatrices
`
`4.2.1 Input Data
`
`4.2.2 Initial subvectors
`
`-1-
`
`2
`
`3
`
`3
`
`4
`
`5
`
`5
`
`6
`
`8
`
`9
`
`12
`
`4
`
`20
`
`2
`
`24
`
`27
`
`27
`
`29
`
`29
`
`0
`
`002
`
`Facebook Ex. 1005
`
`

`
`-n-
`
`4.2.3 Updating Subvectors
`
`4.2.4 Construction Algorithm Complexity
`
`4.3 Implementation in SMAR T
`
`4.4 Clustering and Searching
`
`4.4.1 Background
`
`4.4.2 Clustering with Bibliographic Data
`
`4.4.3 Algorithms
`
`4.4.4 Clustering
`
`4.4.4.1 Parameters
`
`4.4.4.2 Actual Procedures in Outline Form
`
`~.~~..
`
`4.4.4.3 Linearized Cluster Tree
`
`4.4.5 Searching
`
`5 EVALUATION
`
`5.1 Background
`
`5.1.1 Recall/Precision
`
`5.1.2 Summaries Available
`
`5.1.3 Key Issues
`
`5.2 Methods Utilized
`
`5.2.1 Similarity Based Ranking
`
`5.2.2 Query Statistics
`
`5.2.3 User Oriented Averages
`
`5.2.4 Single Value Measure
`
`2
`
`35
`
`5
`
`41
`
`2
`
`43
`
`3
`
`44
`
`4
`
`47
`
`2
`
`53
`
`5
`
`55
`
`5
`
`56
`
`7
`
`58
`
`8
`
`59
`
`1
`
`61
`
`003
`
`Facebook Ex. 1005
`
`

`
`-iii-
`
`5.2.5 Statistical Comparison
`
`5.2.6 Sample Evaluation Output
`
`5.3 S Interface
`
`5.3.1 S Package
`
`5.3.2 Use of S
`
`5.4 Discussion and Conclusion
`
`::~:..
`
`6 Fut ure Plans
`
`6.1 Packaging
`
`6.2 Cornell Projects
`
`6.2.1 UNIX Retrieval Tools
`
`6.2.2 Phrase and Thesaurus Construction
`
`6.2.3 Probabilistic Retrieval
`
`6.2.4 Office Information Systems
`
`6.3 Extensions
`
`6.4 Conclusion
`
`References
`
`1
`
`62
`
`0
`
`70
`
`0
`
`73
`
`5
`
`76
`
`6
`
`76
`
`78
`
`78
`
`g
`
`79
`
`2
`
`83
`
`004
`
`Facebook Ex. 1005
`
`

`
`SOME CONSIDERATIONS FOR IMPLEMENTING TIm
`SMART INFORMATION RETRIEVAL SYSTEM UNDER UNIX
`•
`
`Edward A. Fox
`
`Abstract
`
`Since the early 1960's the SMART project has tested out new ideas in information science aimed
`
`at fully automatic document retrieval. Beginning in 1980 development of an enhanced and generalized
`
`version of SMART has progressed at Cornell. The current implementation is in the C language and
`
`runs under the UNIX operating system on a VAX 11/780 computer.
`
`The history of SMAR T is outlined. Considerations that led to the current design are described.
`
`Since SMAR T now allows multiple concept types to be manipulated in connection with an extended
`
`vector representation, storage and processing issues are discussed, including use of INGRES relations.
`
`Clustering algorithms are presented and run parameters are given for document clustering and subse-
`
`quent clustered searching. SMART experiments (e.g. with p-norm queries, or probabilistic methods)
`
`can be compared using the evaluation package. The S statistical package can be applied to performing
`
`other special analysis and descriptive tasks. Finally, to illustrate the usefulness of these facilities, an
`outline is given of current SMART activities and or future plans.
`
`tDepartment of Computer Science, Cornell University, Ithaca, NY 14853; now at Dept. of Comput(cid:173)
`er Science, Virginia Tech, Blacksburg, VA 24061. This work was supported in part by the National
`Science Foundation, under grant IST-81-08696.
`
`-1-
`
`005
`
`Facebook Ex. 1005
`
`

`
`1. Introduction
`
`-2-
`
`Development of the SMART 8ystem has proceeded since 1961 (Salton 1980). This author began
`
`implementing SMART under the UNIX! operating system in 1980 on Cornell University Department of
`
`Computer Science's computers. Preliminary design notions were discussed in [Fox 1981). This report
`
`brings the status of SMART up to date, describing the running system as used for the experiments
`
`detailed in (Fox 1983b).
`
`Section 2 concerns the history of SMART. Early versions are mentioned; for more elaboration see
`
`[Salton 1971a, 1975, 1980]. The current UNIX version is also introduced.
`
`The design of SMART is covered in section 3. Except for clustering and evaluation, all of the
`
`basic package is dealt with. Beginning with points gleaned from the IBM SMART system [SMART
`
`Project 1974], reasoning is given of how tools available under UNIX could be applied to retrieval tasks.
`
`Design principles are then listed, and the capabilities desired for the system are enunciated.
`
`Issues of
`
`data storage are considered,
`
`including use of the INGRES relational database package. With data
`
`requirements taken care of, program modularization is diseussed. Searching, ranking, retrieval, and
`
`index ing are each separately considered.
`
`Section 4 focuses on the formation and manipulation of extended vectors. A brief summary is
`
`given of the model introduced in [Fox 1983b), which has been tested using two new experimental collec(cid:173)
`
`tions [Fox 1983a]. A general procedure is given ror constructing similar document collections, followed
`
`by a description of implementation issues as they relate to SI\fART software. Two types of manipula(cid:173)
`
`tion of these extended vectors, clustering and clustered searching, are then considered.
`
`The SMART evaluation package is discussed next. 01 particular interest may be the integration
`
`of special software with the general purpose S statistical package.
`
`Justification is also given for the
`
`lUNIX is a trademark of Bell Laboratories.
`
`006
`
`Facebook Ex. 1005
`
`

`
`various algorithms and tests incorporated.
`
`-3-
`
`This report concludes with a sketch of recent studies being made with the SlvIART software, and
`
`with mention of ideas for future projects. The interested reader is advised to view the latter purely as
`
`illustrating some of the potential applications for SMAR T in upcoming years.
`
`2. History of Implementationa
`
`2.1. Early Versions
`
`· In the late 1950's, automatic text analysis began to be seriously studied.
`
`In 1961, the SMAR T
`
`(System for Mechanical Analysis and Retrieval of Text) project began, with the object of designing a
`
`fully automatic document retrieval system and of testing out new ideas in information science.
`
`In SMAR T, an automatic indexing component constructs stored representations or documents.
`
`Then a search component selects those documents which seem most similar to a user's query. Finally,
`
`an evaluation component allows rating of batches of results, using relevance judgments obtained from
`
`the user population, thereby facilitating comparisons between various indexing and search schemes [Sal(cid:173)
`
`ton 1980].
`
`SMART was originally intended for operation in a time-sharing environmen t.
`
`In 1964 a prelim(cid:173)
`
`inary design suited for such operation was suggested [Rocchio 1964] and implementation work began
`
`for a version to run under M.I.T. 's CTSS [Cane 1964].
`
`In 1006, Lesk concluded that fully automatic
`
`documentation centers were feasible with current technology (Lesk 1966].
`
`In 1970, a detailed design for
`
`an IBM 360 based conversational system was prepared (Williamson & Williamson 1970). Later, a par(cid:173)
`
`tial implementation for the IBM 370 TSO environment was undertaken.
`
`The most complete early version of SMART was developed for batch operation on the IBM 370 at
`
`Cornell, and was completed around 1974 [SMART Project 1974]. The last major project using that
`
`007
`
`Facebook Ex. 1005
`
`

`
`-4-
`
`system at Cornell was completed in 1980, and involved indexing, clustering, searching, and evaluation
`
`runs to show how SMART methods could be beneficially applied to indexing of business correspondence
`
`[Nodtvedt 1980].
`
`The IBM SMART system had been programmed in FOR TRAN IV with some portions in assem-
`
`bly language. Auxiliary programs, such as (or clustering, were coded in a variety of other languages,
`
`especially in PL/I. Since running the system resulted in expensive computer charges, and since modify-
`
`ing or extending the generally undocumented monolithic system seemed inadvisable, it was decided to
`
`implement a generalized version of SMAR T using more modern methods and less expensive hardware.
`
`2.2. UNIX ImplementatioD
`
`By 1980, several components of the SMART system had been reprogrammed in the C language
`
`[Kernighan & Ritchie 1978] to run under the UNIX operating system [Ritchie & Thompson 1974] on a
`
`Digital Equipment Corporation PDpl 11/60 computer. Later that year the 80ftware was transferred to
`
`a VAX' 11/780, and work since then has been OD that hardware. There are now plans to also port the
`
`programs to a Motorola 68000 series based microcomputer running UNIX.
`
`A number of proposals and tools have been considered to aid in the development effort. On the
`
`VAX, a collection of routines implementing the operators of the relational algebra [Date 1982] were pro-
`
`grammed to allow occasional manipulation of relations; many of the smaller data files were-stored as
`
`relations to aid in standardization. To generalize that approach,
`
`the idea of relational processing of
`
`abstract data types was advocated Cor retrieval operations [Fox Ig81]. Though such a development
`
`scheme was theoretically appealing, there was not enough time or programming support to carry the
`
`project to completion in such a manner.
`
`2pDP is a trademark of Digital Equipment Corporation.
`IVAX is a trademark of Digital Equipment Corporation.
`
`008
`
`Facebook Ex. 1005
`
`

`
`-5-
`
`In 1981, an early version of the INGRES relational database management system [Stonebraker
`
`et.al. 1976] was obtained for use at Cornell.
`
`In 1982, the S statistical package [Becker & Chambers
`
`1981] was also installed on the VAX.
`
`INGRES served as the backbone for data storage and manipula(cid:173)
`
`tion in all situations where efficiency was not of crucial importance, and S became a handy tool for
`
`exploratory data analysis and for statistical calculations. The histograms, scatter plots, and much .of
`
`the statist ical reporting in [Fox 1983a, 1983b) were all done using S.
`
`Though various such tools speeded up and supplemented the retrieval experimentation, the main
`
`part. of SMART should be implementable on any UNIX system with sufficient storage and adequate
`
`memory management.
`
`Indeed, a version 01 SMAR T, used for a number of etandard operations when
`
`efficiency is essential, requires neither INGRES nor S. One possible plan for the future is to have
`
`retrieval experiments carried out on a dedicated microcomputer that runs UNIX. Then the statistical
`
`evaluation and the comparison of results among runs could be done on 8 VAX which supports INGRES
`
`and S.
`
`3. Design
`
`3.1. mM SMART System
`
`The IBM SMART system, completed in the mid 1970's by a group of students and programmers
`
`at CornelJ, was designed so as to be able to carry out the necessary processing without requiring a great
`
`deal of memory or machine resources (Williamson et al. 1971). Further, it was programmed in FOR(cid:173)
`
`TRAN, since members of the team were proficient in that language, even though the language standard
`
`then current lacked good tools for structured programming, text processing, or manipulation of files on
`
`secondary storage units.
`
`009
`
`Facebook Ex. 1005
`
`

`
`SMART was executed as a single program, which brought in appropriate procedures as overlays
`
`when needed. Communication between various subprograms was through data files on secondary
`
`storage, or, more often,
`
`through variables held in a globally accessible common area of high speed
`
`memory. Unfortunately, to save space, portions of that common area were reused for varying purposes,
`
`making the data flow and interconnections exceedingly difficult to follow.
`
`In 1981 an undergraduate student studied these SMART programs in order to prepare an over(cid:173)
`
`view of the system. The task was rather difficult and eonfuaing, and 80 it was decided that further
`
`analysis of that software package would not be encouraged. However, the general structure deduced,
`
`part of which is shown in Figure 1, did provide useful guidance for other development efforts.
`
`As can be seen in Figure 1, an executive routine leads to data initialization and then calls in, one
`
`by one, other procedures required to carry out the batch job submitted. Some of the data provided
`
`specifies values of parameters required by the processing routines; the remaining parameters are given
`
`appropriate settings by default.
`
`In any case, the list of variables needing specification was useful infor(cid:173)
`
`mation in helping plan the description of modules for a subsequent implementation.
`
`3.2. Uce 01 UNIX
`
`In 1980, it was decided that future research studies by this author would make use of the experi(cid:173)
`
`mental facilities provided by the National Science Foundation ror the Department of Computer Science
`
`at Cornell, since there were no charges for time or storage space, and since available development tools
`
`were easier to work with than those on the IBM system. The UNIX4 operating system, in particular,
`
`encouraged interactive program development and testing of new ideas, using quick to develop prototype
`
`implementations; it was also very effective at performing the necessary text processing operations.
`
`·UNIX is a trademark of Bell Laboratories.
`
`010
`
`Facebook Ex. 1005
`
`

`
`-7-
`
`Figure 1: Structure of IBM 370 SMART System
`
`Executive Routine - calls initialization routine
`Data Set Initialization - start or restart runs, and call other routines
`Text Input Routines
`Store Input
`Convert Text to Vectors
`Perform Partial Syntactic Analysis
`Reeode Vectors for New Dictionary
`Vector Input Routines
`Load in or Modify a Vector Collection
`Read in Results of Previous Searches
`List out a Vector Collection
`Search Routines
`Modify Queries for Feedback Runs
`Specify Which Documents should be Correlated
`Correlate Documents against Query
`Sort on Correlation and Assign Ranks
`Prepare Results for Printing
`Clustering Routines
`Define Newly Specified Cluster
`Cluster via Dattola's Algorithm
`Cluster via Rocchio's Algorithm
`Averaging Routines
`Compute Various Averages
`Print Output and Graphs
`Run and Print Statistical Tests
`
`For example, to index a document, one program could take a typical text input file and return a
`
`list of single words. A second could remove "stop" words, while a third could replace words with their
`
`stems. Other routines could sort lists, or number the lines of a file. By taking the output of one run
`
`and "piping" it
`
`into another program, two or more processes could execute virtually simultaneously,
`
`avoiding the need for intermediary files to be actually materialized in their entirety.
`
`With the UNIX tools provided and a moderate number of new ones, a simple system for indexing
`
`documents was devised. To make it easier for others to use, the steps were recorded in a file that could
`
`011
`
`Facebook Ex. 1005
`
`

`
`-8-
`
`thereafter be executed as a "shell script;" UNIX allows procedures to be written in a system command
`
`language or shell and those executable files can later be invoked as if they were a single program.
`
`Eventually, to provide capabilities needed to handle more complex document formats, and to allow
`
`more efficient operation as required for indexing thousands of documents, several sections of the index(cid:173)
`
`ing script were redone as single programs. This methodology of connecting available tools ·to develop
`
`prototypes that test new ideas, and of subsequently adapting and tuning those early versions for more
`
`efficient operation, is very much in accord with the practice of stepwise refinement [Wirth 1971].
`
`3•.3. Design Principlea
`
`To formalize these general notions, a number of design principles were explicitly stated to guide
`
`further planning:
`
`(1) There should be a network of programs rather than a single large one.
`
`(2) Communication between programs should be through data files.
`
`(3) Communication between procedures and functions should be through argument
`
`lists and return
`
`values.
`
`(4) Lists of default settings of parameters should be stored in text form in data files, to facilitate
`
`making changes.
`
`(5) Different shell scripts should be developed to call the various programs automatically, for all stan(cid:173)
`
`dard tasks,
`
`(6) The same program (or at least many of the same procedures) should be usable, with minor exten(cid:173)
`
`sions perhaps, for experimental studies and for "production" retrieval runs. Thus, separate ver(cid:173)
`
`sions should be avoided as much as possible, and one should be able to run large batch experi(cid:173)
`
`ments, interactively carry out a test search, or try out a new type of similarity function, all with
`
`012
`
`Facebook Ex. 1005
`
`

`
`the stand ard system.
`
`-9-
`
`(7) Old SMART data files should be converted into new forms that might be required for indexing or
`
`other addition al processing.
`
`(8) Emphasis should be on getting new methods to work, or reimplementing old SMART components,
`
`as expeditiously as possible. All available tools should be exploited, including INGRES and S, and
`
`coding from scratch should only be done as a last resort.
`
`(9) Tuning should be done as needed. First, sections of shell scripts should be replaced by specialized,
`
`more efficient programs to accomplish the same function. Then, algorithms and data structures
`
`should be revised to reduce processing and data manipulation costs.
`
`3 ••• Retrieval Capabilities
`
`The actual retrieval software was also planned 80 as to provide a number of capabilities in a single
`
`coherent system:
`
`(1) The ranking of retrieved documents was to be done separately from the actual search.
`
`(2) Search methods should include sequential and clustered techniques first and manipulation of
`
`inverted files later.
`
`(3) Vector and Boolean systems should be easily modelled.
`
`(4) Probabilistic methods should be incorporated.
`
`(5) A number of similarity functions should be included, such as:
`
`a- eosine correlation
`b- vector inner product
`c- Boolean expression evaluation
`d- p-norm expression [Salton, Fox &, Wu 19S2] interpretation
`e- probabilistic weighting where terms are assumed independent,
`where term dependencies follow a tree structure, or where
`
`013
`
`Facebook Ex. 1005
`
`

`
`-10-
`
`the BLE scheme ([Duda & Hart 1973] and (Yu, Luk & Siu 1979])
`with singles, pairs, and triples of terms are used.
`
`(6) As in the SIRE system ([McGill et al, 1976], [McGill & Noreault 1977], and [Noreault, Koll &
`
`McGill 1977)), it should be possible to rank the documents retrieved by a Boolean search according
`
`to some other similarity (unction.
`
`(7) Ranking of documents should be possible for any type of search method, for both initial and feed-
`
`back searches.
`
`(8)
`
`.Ranking for feedback runs should be possible using anyone of several evaluation .methods [Chang,
`
`Cirillo & Razon 1971]:
`
`a- Residual collection
`b- Test and control
`c- Full freezing
`d- Partial freezing
`
`(9) Evaluation results should be stored for
`
`later printing and analysis. Printout options should
`
`include:
`
`a- Recall- precision results ror each query
`b- Recall-precision averages ror a batch of queries
`c- Other statistics for each query or query collection
`d- Statistical test results comparing the effectiveness of
`several retrieval trials.
`
`(10) Statistical comparisons should at least include ([Siegel 1956], (Snedecor & Cochran 1980]):
`
`a- T test
`b- Sign test
`c- Wilcoxon signed rank test.
`
`(11) Clustering runs should allow gathering or information about search efficiency.
`
`(12) An interactive version should provide reasonable responsiveness for:
`
`014
`
`Facebook Ex. 1005
`
`

`
`-11-
`
`a- Submission of p-norm, Boolean, or natural language query forms
`b- Clustered or sequential searches
`c- Display of portions of text from selected top-ranked documents
`d- Improvement of current query by feedback processing
`e- Subsequent searching, and repetition of steps (a)-(d).
`
`(13) Document and query representations should allow use of multiple concept types, and composite
`
`similarity functions should be specifiable.
`
`(14) Indexing should allow different processing for the various portions of incoming records, to account
`
`for formatting peculiarities and to enable separation into multiple concept types. Normalization
`
`of names (e.g., use last name followed by first initial) and dates should be automatic in appropri-
`
`ate places.
`
`(15) Words should be able to be grouped, as in phrases appearing in keyword lists, or to be stemmed,
`
`or to have final Us" removed.
`
`(16) A standard stop word list should be used as the default choice, but users should be allowed to
`
`substitute other lists instead.
`
`(17) Automatic identification or statistical phrases (groups of words that co-occur often enough, appear
`
`close together in the same sentence, and each have proper collection wide frequency] should be
`
`possible. To allow this, positional information on each word in the collection must be available.
`
`(18) Auxiliary programs should be able to process data used by the retrieval system. These should
`
`include routines for:
`
`a- Clustering
`b- Construction of feedback queries
`c- Parsing and conversion from infix Boolean or p-norm expressions
`to prefix forms and later to a fast internal representation.
`
`In accord with these principles,
`
`the UNIX SMART system was programmed in stages over about a
`
`three year period.
`
`In the following sections, details of the current design are given. First, underlying
`
`015
`
`Facebook Ex. 1005
`
`

`
`data handling issues are described and then the various main programs are outlined.
`
`-12-
`
`3.5. Data Storage Schemes
`
`Information retrieval calls for storing and accessing large amounts of data. Though specialized
`
`structures like compact vectors or inverted files are probably still the most appropriate ones to use for
`searching, there are many intermediate results and requirements for processing or smaller data group(cid:173)
`
`ings where relations can be effectively used. Furthermore, as relational systems become more efficient,
`
`with· new hardware mechanisms and better algorithms available, and as text retrieval systems are gen(cid:173)
`
`eralized to handle structured as well as textual informatfon, relations may be utilized even further.
`
`In an earlier study [Fox 1981), relational processing was proposed for all aspects of retrieval work.
`
`For a practical implementation of SMART with existing software tools, however, several different data
`
`representations were employed.
`
`First, standard UNIX test files were selected for a wide variety of purposes. The original form of
`
`document and query collections were provided that way. Parameters lor complicated runs, such as
`
`clustering, were kept in this form 80 as to be easily edited. Finally, the output of runs almost always
`
`included textual output, for ease of interpretation.
`
`Second, specialized files were employed. Compact forms of vectors, whether of queries, documents,
`
`or cluster centroids, were recorded as lists of
`
`<integer, real number>
`
`pairs in binary format to save space and speed up processing. Pointers to text files, giving the starting
`
`position and field lengths for each collection document, were stored in a fast access format to enable
`
`quick presentation of retrieved documents to on-line searchers. A final example is that co-occurrence
`
`data for probabilistic retrieval methods was also organized in a specialized format.
`
`016
`
`Facebook Ex. 1005
`
`

`
`-13-
`
`A third scheme for handling retrieval data was that of storing relations as text files. A good
`
`example is during the indexing phase, where UNIX programs can pipe results from one process to the
`
`next and thus save on intermediate storage. Starting with a stream of data organized as tuples, each a
`
`<document number, string containing a word>
`
`pair, eventually a file containing tuples like
`
`<document number, concept type, concept number, document frequency>
`
`is produced.
`
`These textual relations are very useful, since each attribute can be of variable length, and since
`
`processing textual Corms is exceptionally easy under UNIX. One handy transformation is that of
`
`compressing or expanding between relation and list forms .and vice versa. Thus, a document collection
`
`of many
`
`<document number, concept number>
`
`pairs could be compressed into N groupings of form
`
`<document number, list of concept numbers, 0 to end list>
`
`where there are N distinct documents. Later, the groupings could be stored as compact vectors, or
`
`could be expanded back into first normal form relations.
`
`The fourth and final format of data storage employed is that of relations stored in a database of
`
`the INGRES relational system. One database was created for each document collection. Routines or
`
`procedures were generally available to convert INGRES relations to text file type relations. Some of
`
`the relations stored in INGR ES are shown in Figures 2 through 4.
`
`017
`
`Facebook Ex. 1005
`
`

`
`3.6. INGRES Relations
`
`-14-
`
`Much about a system of computer programs can be understood when the detailed form of the data
`
`being manipulated is known. Hence, a description of the relations utilized in the SMART search and
`
`ranking system should be of particular value. These relations are of special
`
`interest since they are
`
`simpler to specify than are other secondary storage files - one need only know the purpose of the rela-
`
`tion and details about the attributes which make it up.
`
`Figure 2 covers the three relations employed for p-norm queries'', Utilizing these illustrates
`
`compromising between efficiency and ease of representation. When Boolean or p-norm queries are
`
`parsed,
`
`they are placed into an internal structure which is a tree form of the p-norm expression.
`
`Figure 2: INGRES Relations Used for P-Norm Queries
`
`1) Tree description: details of parse tree of query representation
`no_cons - number of concepts in the query
`nOJlodes - number of nodes in the tree
`boot
`- whether the query is interpreted as strict Boolean
`- node number of the root of the tree
`root
`2) Concepts: details about each concept in the query
`qid
`- query identifier
`con
`- concept number
`ctype
`- concept type
`3) Nodes: details about each operator or concept node in tree
`type
`- "concept," or else which operator (AND,NOT,OR)
`- if type is "concept," then concept number
`c
`- if type is "concept," then concept type
`ctype
`- if type is "operator," then p-value attached
`p
`- relative weight attached to this node
`wt
`- number of the node which is its first child
`child
`- number of the node which is its first sibling
`sibl
`
`5P-norm queries were first defined in [WU 1981]. Experimental validation of their usefulness is
`given in [Salton, Fox &, Wu 1982), [Salton, Buckley & Fox 1983), [Salton, Fox, Buckley 8i Voorhees
`1983], and in great detail in [Fox 1983b).
`
`018
`
`Facebook Ex. 1005
`
`

`
`-15-
`
`Storage is done by recording essential data: about each query tree, about the leaves or concepts of the
`
`trees, and about
`
`the set of nodes in the tree. Modification of the trees is typically accomplished by
`
`reading the cree back into an internal structure, making revisions (such as is required for Boolean feed(cid:173)
`
`back) and storing the new tree instead of the old. Alternatively, the relational form can be directly
`
`changed.
`
`To carry out a search, the query is loaded from the relations back into internal form, and then is
`
`rapidly evaluated for each document. The internal structure allows fast interpretation of the query and
`
`simplifies revisions; the relational storage scheme is slow to load and store, but makes accessing of the
`
`data much simpler to specify and eases the workload when structural modifications are required.
`
`Figure 3 is about the four relations involved in searehing and ranking. There are two key outputs
`
`of the programs: a top-ranked set, and ranking information about all relevant documents. For interac(cid:173)
`
`tive searches,
`
`the user asks (or, say,
`
`the ten best documents. For batch experiments with relevance
`
`feedback,
`
`the top ten to twenty documents may be selected for presentation to the user. When recall(cid:173)
`
`precision charts or graphs are desired, it is crucial to have the rank of each relevant document so preci(cid:173)
`
`sion at each recall level can be calculated and averaged over all queries in the set.
`
`For cluster searches it is useful to have information about how much of the document and cen(cid:173)
`
`troid files have had to be examined in order to present the user with the desired number of documents.
`
`And, for query collections that are in relational form, the obvious format (or storing relations is used.
`
`Figure 4 deals with the input information required to commence an experimental run. The first
`
`relation connects to all other relations, 80 that with the "run.name" specified, one has the key required
`
`to find out about the document and query collections to use, and about other parameters which must
`
`be known.
`
`"sim_type" selects the desired similarity function, while "dwt" and "qwt" identify the
`
`chosen concept weighting schemes. When multiple concept types are present, their number is given by
`
`019
`
`Facebook Ex. 1005
`
`

`
`-16-
`
`Figure 3: INGRES Relations Used for Search & Ranking
`
`1) Top-ranked set: identifies the n most similar documents to query
`iter
`- feedback iteration step now in effect
`qid
`- identifier of query being processed
`sim
`- similarity of document to the query
`rank
`- rank of the document in retrieved set
`where 1 <rank < n
`- whether this document is judged relevant to query
`rei
`2) Relevant ranks: ranks for all relevant documents, for each search
`iter
`- feedback iteration step now in effect
`qid
`- identifier of query being processed
`did
`- relevant document identifier
`rank
`- rank of this document after retrieval
`where ties are randomly broken
`- similarity of document to the query
`sim
`3) Cluster search efficiency: counts of vectors retrieved
`run_name
`- description of this experimental run
`iter
`- feedback iteration number
`qid
`- identifier of query being processed
`cent_total - total number of centroids examined
`- number of low level centroids examined
`cent_II
`doc_total
`- total number of documents examined
`doc_ret
`- number of documents actually presented to user
`4) Query vectors: relational form of vector queries
`qid
`- query identifier
`con
`- concept "number
`ctype
`- concept type
`wt
`- weight ror current concept
`
`020
`
`Facebook Ex. 1005
`
`

`
`-17-
`
`Figure 4: INGRES Relations Required to Specify Runs
`
`1) Parameters for search run
`- name or title used to refer to this search run
`run_name
`da_title
`- title of document collection selected
`qa_title
`- title of query collection selected
`- what kind of similarity function to U8e
`sim_type
`dwt
`- w hat kind of document concept weighting to use
`qwt
`- what kind of query concept weighting to use
`p-value
`- if given, p to overide that on all operators
`thes
`- name of thesaurus to use (not yet implemented)
`iter
`- feedback iteration number
`- number of pairs to include in BLE expansion
`no_pairs
`no_triples - number of triples to include in BLE expansion
`fdbk_type - what feedback evaluation (e.g., freezing) to use
`no_tr
`- number of documents to remember in top ranked
`set for feedback presentation (e.g., 10)
`- should relevant documents with 0 similarity be
`simO_opt
`randomly assigned ranks or get worst possible
`.
`rei_ranks - name of relation to record ranks of relevants
`topj-anks - name of relation to record information about
`top-ranked documents for feedback presentation
`no_wanted - number desired from cluster search
`numbj-els - estimated number relevant per query, for
`relative recall evaluations
`use_nrels - indication to use "numb_rels", since complete
`relevance data is unavailab Ie
`
`2) Document collection details
`a_title
`- title of collection
`no_docs
`- number of documents
`no_ctypes - number or different concepts used in indexing
`- UNIX tree path name for directory of data files
`tree
`centroid_fa- name of fast access storage file for centroids
`centroid_cw- name of concept-weight storage file for centroids
`rast_3Ccess- name of fast access document vector file
`- name of concept-weight document vector file
`con_wt
`
`3) Query collection details
`a_title
`- title of collection
`no_q
`- number of queries
`q..,type
`- query type: e.g., p-norm or vector
`q,.;-els
`- name of relation with relevance information
`
`021
`
`Facebook Ex. 1005
`
`

`
`-18-
`
`Figure 4 continued: INGRES Relations Required to Specify Runs
`
`4) Concept type multipliers for combined retrieval formula
`run_name
`- name of run (matches that in relation 1 above)
`ctype
`- concept type
`- multiplier or coefficient for thi

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket