`
Enhancement of Text
Representations Using Related
Document Titles‡

G. Salton*
Y. Zhang†
TR 86-728
January 1986

Department of Computer Science
Cornell University
Ithaca, NY 14853

‡ This study was supported in part by the National Science Foundation under grant IST 83-16166.
* Department of Computer Science, Cornell University, Ithaca, NY 14853.
† Institute of Computer Technology, China Academy of Railway Sciences, Beijing, China.
`
`
`
`
`
`
`
`
`
`
`
`
`
Abstract

Various attempts have been made over the years to construct enhanced
document representations by using thesauruses of related terms, term
association maps, or knowledge frameworks that can be used to extract
appropriate terms and concepts. None of the proposed methods for the
improvement of document representation has proved to be generally useful
when applied to a variety of different retrieval environments. Some recent
work by Kwok suggests that document indexing may be enhanced by using title
words taken from bibliographically related items. An evaluation of the
process shows that many useful content words can be extracted from related
document titles, as well as many terms of doubtful value. Overall, the
procedure is not sufficiently reliable to warrant incorporation into
operational automatic retrieval systems.
`
`
`
`
`
`
`
`
`
`
`
`
1. Term and Document Relations

Most existing methods for the automatic content analysis of written texts
are based in part on the extraction of certain words contained in the
original document texts. While many words appearing in ordinary text are in
fact useful for content representation, it is often believed that the use of
text words does not provide a complete description of text meaning. For this
reason, various additional content analysis tools have been introduced in
the hope of obtaining more complete text representations. Among these tools
are thesauruses that contain groupings of related words [1,2], automatically
constructed association maps based on co-occurrences of words in the texts
of documents [3,4], and knowledge frameworks representing the facts and
relationships that characterize particular subject areas. [5-7]
`
Various methodologies have been suggested to help in the construction of
the content analysis tools, including for example probabilistic theories of
information processing that account for the use of term relationships and
associations [8-10], methods that include syntactic considerations for the
construction of term phrases [11-13], and finally interactive procedures in
which individual users may suggest term relationships of importance in their
application based on a dialogue between user and system conducted from a
user terminal. [14-16]
`
Two main problems arise when term associations are proposed for text
identification and processing:

a) No theory exists which would help in distinguishing valuable term
   associations from less valuable ones, and no obvious help is available
   to aid in the construction of useful thesauruses, association maps,
   and knowledge bases. In practice, these tools are often tailored to
   particular text collections, and built for the occasion using ad-hoc,
   relatively nontransparent methods.

b) The available evaluation results indicate that the single term text
   identification methods which use no term association theories at all
   are preferable, on average, to the more complex indexing methods that
   include various kinds of term relationships. [17,18] More specifically,
   it appears easy to show advantages for certain term association
   methods when specially constructed vocabulary tools are available that
   have been tailored to specific collections or environments.
   Unfortunately, most of the term relationships appear to be of local
   value only, because the performance improvements are not maintained
   when the test conditions and retrieval procedures change. [19]
`
One conceptually simple method for the identification of term associations
consists in identifying relationships among the documents of a collection,
and using these to infer appropriate term associations. Specifically, a
document clustering operation can be performed, leading to the recognition
of groups of related documents, and these groupings can be used in turn to
determine relationships among the terms assigned to particular document
groups. Instead of performing formal clustering operations, relations
between documents might be ascertained by utilizing for this purpose certain
bibliographic citation links. In particular, two documents are
bibliographically related when a citation link exists between them, that is,
when one document cites the other, or when both documents are jointly cited
by, or jointly cite, a third item. In these circumstances, one can assume
that the items cover similar subjects, and hence the vocabulary of one item
might help in describing a bibliographically related item. This notion is
further explored in the remainder of this study.
`
2. Use of Related Title Words for Document Representation

Bibliographic citations attached to texts and documents have been used for
many years for the generation of document relationships, the determination
of influential bibliographic items, and the representation of changes in
scientific disciplines over time. [20-25] Attempts have also been made to
use bibliographic citations directly for document indexing. [26,27]

In a recent series of papers, Kwok has suggested that the content analysis
and indexing of texts might be improved by using, in addition to the
standard content identifiers, certain words extracted from the titles of
bibliographically related documents. That is, if item A is
bibliographically related to item B, then certain title words from B might
be used to index document A, or vice versa. [28-30]
`
Consider, as an example, a particular document A as represented in Fig. 1.
With respect to document A, four different types of document relationships
may be distinguished:

a) Document A may refer to another document C by having C included in
   the reference list attached to A; C is then a cited document with
   respect to A.

b) Document A may itself be cited by some other document B, if A is
   included in the reference list of B; B is then a citing document
   with respect to A.

c) Document A may be cited in common with another document A' by a
   third document B'; in that case A and A' are cocited documents.

d) Finally, documents A and A" may both refer to some common third
   document C; in that case A and A" are bibliographically coupled.

The various document relationships with respect to document A are
illustrated in Fig. 1.
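
The four relationships can be derived mechanically once every document's reference list is known. The following is a minimal sketch, not the procedure used in the study: it assumes the citation data are available as a mapping from a document identifier to the set of identifiers in its reference list, and the function name `citation_relations` and the identifiers in the usage note are hypothetical.

```python
from collections import defaultdict


def citation_relations(references):
    """Build lookup functions for the four citation relationships.

    `references` (an assumed input format) maps each document id to the
    set of document ids that appear in its reference list.
    """
    # Invert the reference lists: cited_by[C] = documents that cite C.
    cited_by = defaultdict(set)
    for doc, refs in references.items():
        for ref in refs:
            cited_by[ref].add(doc)

    def cited(a):
        # Documents C that A refers to.
        return set(references.get(a, ()))

    def citing(a):
        # Documents B whose reference lists include A.
        return set(cited_by.get(a, ()))

    def cocited(a):
        # Documents A' that appear with A in the reference list of some B'.
        result = set()
        for b in citing(a):
            result |= set(references[b])
        result.discard(a)
        return result

    def coupled(a):
        # Documents A" that share at least one reference C with A.
        result = set()
        for c in cited(a):
            result |= cited_by[c]
        result.discard(a)
        return result

    return cited, citing, cocited, coupled
```

With the hypothetical input `{"A": {"C"}, "B": {"A", "A1"}, "A2": {"C"}}`, document A has cited set {C}, citing set {B}, cocited set {A1}, and bibliographically coupled set {A2}, mirroring the layout of Fig. 1.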
`
Kwok's basic notion consists in taking each item A included in a
collection, and adding to the index terms normally used to represent A
certain new terms taken from the titles of bibliographically related
documents. In the experiments which follow, four types of citation
relationships are examined using the following modified document
collections:

a) the citing collection, where the document terms are supplemented by
   the title words from all citing documents (A is supplemented by
   terms from B);

b) the cited collection, where the document terms are supplemented by
   the title words from all cited documents (A is supplemented by
   terms from C);

c) the citing + cited collection, where the document terms are
   supplemented by the union of the two previous sets (A is supplemented
   by terms from B and C);

d) the cocited collection, where the added terms are taken from
   documents that are cocited with an item (A is supplemented by terms
   from A').
`
When terms taken from related documents are added as identifiers for
particular documents, two kinds of changes may occur in the original term
set used to identify the documents. First, a certain number of terms may
actually be found that did not appear in the original term set used for a
given document. In that case, the new term set identifying the item is
larger than the original set. Second, a number of terms may be found in
related documents that were already contained in the original set of terms
prior to the term modification. In that case, no new terms may be added to
the existing identifying term sets. However, the weights of the originally
available terms may be appropriately changed.
`
In the experiments which follow, two kinds of term weights are used:

a) a term frequency (tf) weight, where the weight of a term is defined
   as the frequency of occurrence of the term in the document in
   question (or in an appropriately defined document excerpt such as an
   abstract);

b) a term frequency times inverse document frequency (tf × idf) weight,
   where the weight of a term is defined as the product of the term
   frequency multiplied by an inverse function of the document frequency
   (the number of documents in a collection to which a term is
   assigned). A typical value for idf might be obtained as
   log((N - n_i)/n_i), where N is the collection size, and n_i is the
   number of documents with term i.
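
The two weighting schemes can be written out directly from these definitions. The sketch below is illustrative only: the function names are hypothetical, and the natural logarithm is an assumption, since the text does not state the base of the logarithm.

```python
import math


def tf_weight(term_freq):
    # Term frequency weight: the raw occurrence count of the term
    # in the document (or document excerpt).
    return float(term_freq)


def idf(collection_size, doc_freq):
    # Inverse document frequency as log((N - n_i) / n_i).
    # Assumption: natural logarithm. The value is undefined when
    # n_i == 0 or n_i == N, so callers must exclude those cases.
    return math.log((collection_size - doc_freq) / doc_freq)


def tf_idf_weight(term_freq, collection_size, doc_freq):
    # Product of the term frequency and the inverse document frequency.
    return term_freq * idf(collection_size, doc_freq)
```

Under this form of idf, a term assigned to exactly half of the collection receives a weight of zero, and terms assigned to more than half of the documents receive negative weights, so rare terms dominate the tf × idf ranking.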
`
When terms chosen from related documents are added to the term sets
characterizing particular documents, the text or text excerpts of the
related documents are merged with the text of the original document. The
frequency of occurrence of the terms in the merged documents will therefore
be different from the occurrence frequencies in the original documents, and
the term frequency weights will change. For example, if a term exhibits a
frequency weight of 3 in document A and a frequency weight of 2 in a related
document B, the final weight of the term will be 5 after merging of the
documents.
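
The merging step amounts to adding term frequency vectors. The following is a small sketch under stated assumptions: documents and related titles are represented as dictionaries from word stems to frequencies, and the function name `expand_document` and the sample stems are hypothetical. It also reports which terms are new and which existing terms changed weight, the two kinds of change counted in the collection statistics.

```python
from collections import Counter


def expand_document(doc_terms, related_title_terms):
    """Merge title-term frequencies of related documents into a document.

    `doc_terms` maps stems to frequencies for the original document;
    `related_title_terms` is a list of such mappings, one per related title.
    """
    merged = Counter(doc_terms)
    for title_terms in related_title_terms:
        merged.update(title_terms)  # adds frequencies term by term

    # Terms that were absent from the original term set.
    new_terms = set(merged) - set(doc_terms)
    # Original terms whose frequency weight changed through merging.
    changed_terms = {t for t in doc_terms if merged[t] != doc_terms[t]}
    return merged, new_terms, changed_terms
```

For a document with `{"CONCUR": 3, "PROCES": 2}` and one related title with `{"CONCUR": 2, "SYNCHRON": 1}`, the merged weight of CONCUR is 5, SYNCHRON is a new term, and PROCES is unchanged.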
`
3. Experimental Results

Two different document collections are used to evaluate the indexing
process based on the use of title words from bibliographically related
documents: the CACM collection consisting of 3204 articles originally
appearing in the Communications of the ACM between 1957 and 1979, and the
CISI collection consisting of 146 documents in the field of documentation
and library science originally received from the Institute for Scientific
Information in Philadelphia. The searches for both collections were carried
out with collections of 47 queries each, and the search results are averaged
for the 47 queries used in each case.
`
The following information was originally available for each of the two
sample collections:

CACM:  original documents plus references to cited and citing documents;
       all cited and citing documents used are included in the original
       document collection of 3204 items.

CISI:  original documents plus references to cocited documents; the
       cocited documents are not necessarily included in the base
       collection of 146 documents.
`
`
`
To perform a document expansion using title words for citing, cited and
cocited documents, it was necessary to obtain the cocitations for CACM, and
the cited and citing documents for CISI. Since all citations for CACM were
internal to the collection, the set of cocitations could be generated from
the available citing documents for that collection. For CISI, only
cocitations were originally available, and the sets of citing and cited
documents were extraneous to the basic collection. In that case, the set of
citing and cited documents pertaining to each of the 146 collection items
had to be obtained from the citation index and the source index parts of the
Social Science Citation Index, respectively. [31]
`
The collection statistics for the CACM and CISI collections are summarized
in Tables 1 and 2. For CACM, between 30 and 40 percent of the documents were
actually altered by adding title words from bibliographically related items;
one of the modification methods (citing + cited) affected over 50 percent of
the collection items. Approximately similar expansion percentages apply to
CISI, except that for the cocited method a large majority of the collection
(85 percent of the documents) was affected.

Approximately one tenth of the original document terms (normally between 3
and 5 terms per document) received changed term weights in the expansion
process. More substantial changes were introduced by addition of new title
terms not originally present in the documents. The figures shown at the
bottom of Tables 1 and 2 show that about 6 to 8 new terms were added on
average per document for CACM, except for the cocitation method which
supplied nearly 27 new terms to each altered document. More substantial term
additions occurred for CISI: over 30 terms were added to each altered
document for the citing and citing + cited methods, and over 20 terms were
added through cocitations.
`
`
`
The actual recall-precision search results appear in Tables 3 and 4 for the
CACM and CISI collections, respectively. The tables show search precision
results averaged over 47 searches in each case, computed at ten levels of
recall, from 0.1 to 1.0 in steps of 0.1. Two types of term weightings are
used: the tf weights in the upper half of each table, and the tf × idf
weights in the lower half.
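
The evaluation measure can be sketched as follows. The text does not say how precision is computed at recall levels that a given search does not hit exactly, so this example assumes the common interpolation rule (the highest precision observed at any recall point at or above the level); the function names are hypothetical.

```python
def precision_at_recall_levels(ranking, relevant, levels=None):
    """Interpolated precision of one ranked search at fixed recall levels.

    `ranking` is the list of retrieved document ids in rank order and
    `relevant` the set of ids judged relevant to the query.
    """
    if levels is None:
        levels = [round(0.1 * k, 1) for k in range(1, 11)]  # 0.1 ... 1.0

    # Record (recall, precision) each time a relevant document is retrieved.
    points = []
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))

    # Assumed interpolation: best precision at any recall >= the level.
    return [
        max((p for r, p in points if r >= level - 1e-9), default=0.0)
        for level in levels
    ]


def average_over_queries(per_query_precisions):
    # Average the precision values level by level across all queries.
    n = len(per_query_precisions)
    return [sum(column) / n for column in zip(*per_query_precisions)]
```

A search that retrieves its two relevant documents at ranks 1 and 4 yields a precision of 1.0 at recall levels 0.1 through 0.5 and 0.5 at levels 0.6 through 1.0; averaging such vectors over all queries gives one row set of a table like Table 3.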
`
The CACM results of Table 3 show clearly that the addition to the document
terms of title words from bibliographically related documents is beneficial,
since the retrieval effectiveness improves by about 30 percent on average
for the citing + cited method using tf weights, and by 15 percent for the
citing + cited method using the more powerful tf × idf weights. An average
effectiveness improvement above 10 percent is normally considered important
enough to warrant serious attention, assuming of course that the needed
bibliographically related documents are in fact accessible in practice.
`
Unfortunately, this optimistic conclusion is not maintainable when the CISI
results of Table 4 are considered. In that case, none of the bibliographic
expansion methods proved beneficial for the more effective tf × idf
weighting system, the deterioration in effectiveness ranging from about 1
percent to as much as 7 percent for the citing + cited method. Even for the
less effective tf weights, only the cocitations afford a modest performance
improvement. It is obvious that for CISI many of the terms added in the
document expansion process are in fact poor terms that do not help in
retrieval. Since most of the term expansion processes lengthen the CISI
documents by about 50 percent, one can conjecture that the changes in
document indexing are too extensive and too sweeping in that case. The
alterations in the identifying term sets were much more modest for CACM, and
the effectiveness of the procedures was correspondingly greater.
`
The performance of the document expansion process may be illustrated by
considering positive and negative examples for each of the two collections.
Examples of an ideal document expansion process are shown in Tables 5 and 6
for the CACM and CISI collections. In the example of Table 5, a relevant
document, number 2150, receives an initial retrieval rank of 700 with
respect to query 7 (that is, 699 other documents are retrieved ahead of that
particular document). In its original form, the document exhibits only two
word stems in common with the query (PROCES and CONCUR), and both terms have
low term frequency weights. When the document is expanded using
bibliographical relationships, the important initial term CONCUR (from
"concurrence") has a quadrupled weight, and another important term SYNCHRON
(from "synchronization") is added to the term set. As a result, the output
rank of document 2150 improves from 700 to 27.
`
The same phenomenon may be noted in Table 6 for CISI document 60 and query
34. In that case, the output rank improves from 48 to 15 because the
important term INDIC (from "indexing") is added to the document in the
expansion process with a high weight of 6.
`
The examples of Tables 7 and 8 demonstrate that the terms obtained from
bibliographically related items may also produce substantial deterioration
in performance. The output rank of relevant document 2902 with respect to
query 27 falls from 18 to 37, because only one new term (SYSTEM) is added to
the document, and that term is not especially useful for content
identification in the CACM collection. The term SYSTEM is again added to
document 113 of Table 8, and the expanded document has a lower retrieval
rank (33) than the original (27).
`
The evaluation results of Tables 3 to 8 lead to the conclusion that the term
association process based on bibliographically related title words is not
reliable. Important terms may be supplied in some instances, producing
substantial performance improvements; in other cases, the process adds
indifferent or poor terms to the content descriptions of the documents.
Since no obvious way exists for distinguishing the positive from the
negative effects, the citation methodology cannot be recommended for
inclusion in practical retrieval environments. It appears that any term
association process, whether based on statistical word co-occurrence
criteria or on intellectually constructed vocabulary aids, must include
strict syntactical and/or semantic controls if the generation of
inappropriate related term groups is to be prevented. Until a usable theory
of term association is developed, it appears best to maintain the single
term automatic indexing methods which are simple to implement, and which are
known to produce reasonable retrieval performance.
`
`
`
References

[ 1] D. Soergel, Indexing Languages and Thesauri: Construction and
     Maintenance, Melville Publishing Company, Los Angeles, CA, 1974.

[ 2] G. Salton, Experiments in Automatic Thesaurus Construction for
     Information Retrieval, in Information Processing 71, North Holland
     Publishing Company, Amsterdam, 1972, 115-123.

[ 3] L.B. Doyle, Semantic Road Maps for Literature Searchers, Journal of
     the ACM, 8, 1961, 553-578.

[ 4] V.E. Giuliano, Automatic Message Retrieval by Associative Techniques,
     in Joint Man-Computer Languages, Mitre Corporation Report SS-10,
     Bedford, MA, 1962, 1-44.

[ 5] M. Minsky, A Framework for Representing Knowledge, in P.H. Winston,
     editor, The Psychology of Computer Vision, McGraw Hill Book Company,
     NY, 1975, 211-277.

[ 6] R.C. Schank and R.P. Abelson, Scripts, Plans, Goals and Understanding,
     Lawrence Erlbaum Associates, Hillsdale, NJ, 1977.

[ 7] R.J. Brachman and B.C. Smith, Special Issue on Knowledge
     Representation, SIGART Newsletter, No. 70, February 1980.

[ 8] C.J. van Rijsbergen, Information Retrieval, Second Edition,
     Butterworths, London, 1979.

[ 9] C.J. van Rijsbergen, A Theoretical Basis for the Use of Cooccurrence
     Data in Information Retrieval, Journal of Documentation, 33, 1979,
     106-119.

[10] C.T. Yu, C. Buckley, K. Lam and G. Salton, A Generalized Term
     Dependence Model in Information Retrieval, Information Technology:
     Research and Development, 2, 1983, 129-154.

[11] P.H. Klingbiel, A Technique for Machine-Aided Indexing, Information
     Storage and Retrieval, 9:9, 1973, 477-494 and 9:2, 1973, 79-84.

[12] M. Dillon and A.S. Gray, FASIT: A Fully Automatic Syntactically Based
     Indexing System, Journal of the ASIS, 34:2, 1983, 99-108.

[13] G. Salton, Automatic Phrase Matching, in Readings in Automatic
     Language Processing, D.G. Hays, editor, Am. Elsevier Publishing
     Company, NY, 1966, 169-188.

[14] W.B. Croft, An Expert Assistant for a Document Retrieval System, Proc.
     RIAO-85 Conference, Grenoble, France, March 1985, 131-149.

[15] B.W. Ballard, J.C. Lusth and N.L. Tinkham, LDC-1: A Transportable,
     Knowledge Based Natural Language Processor for Office Environments,
     ACM Transactions on Office Information Systems, 2:1, January 1984,
     1-25.

[16] B.J. Grosz, TEAM: A Transportable Natural Language Interface System,
     Proc. of Applied Natural Language Processing Conference, Association
     for Computational Linguistics, Santa Monica, CA, 1983, 39-45.

[17] C.W. Cleverdon and E.M. Keen, Factors Determining the Performance of
     Indexing Systems, Vol. 1: Design, Aslib Cranfield Research Project,
     Cranfield, England, 1966.

[18] G. Salton and M.E. Lesk, Computer Evaluation of Indexing and Text
     Processing, Journal of the ACM, 15:1, January 1968, 8-36.

[19] M.E. Lesk, Word-Word Associations in Document Retrieval Systems,
     American Documentation, 20:1, January 1969, 27-38.

[20] M.M. Kessler, Bibliographic Coupling Between Scientific Papers,
     American Documentation, 14:1, January 1963, 10-25.

[21] E. Garfield, Citation Indexes for Science, Science, 122:3159, 15 July
     1955, 108-111.

[22] J. Margolis, Citation Indexing and Evaluation of Scientific Papers,
     Science, Vol. 155, 10 March 1967, 1213-1219.

[23] J.H. Westbrook, Identifying Significant Research, Science, 132:3435,
     28 October 1960, 1229-1234.

[24] H. Small, Cocitation in the Scientific Literature: A New Measure of
     the Relationship between Two Documents, Journal of the ASIS, 24:4,
     July-August 1973, 265-269.

[25] J. Bichteler and E.A. Eaton, The Combined Use of Bibliographic
     Coupling and Cocitation in Document Retrieval, Journal of the ASIS,
     31:4, July 1980, 278-282.

[26] M.M. Kessler, Comparison of Results of Bibliographic Coupling and
     Analytic Subject Indexing, Am. Documentation, 16:3, July 1965,
     223-233.

[27] G. Salton, Automatic Indexing using Bibliographic Citations, Journal
     of Documentation, 27:2, June 1971, 98-110.

[28] K.L. Kwok, A Probabilistic Theory of Indexing and Similarity Measure
     Based on Cited and Citing Documents, Journal of the ASIS, 36:5, 1985,
     342-351.

[29] K.L. Kwok, A Document-Document Similarity Measure Based on Cited
     Titles and Probability Theory and its Application to Relevance
     Feedback Retrieval, in Research and Development in Information
     Retrieval, C.J. van Rijsbergen, editor, Cambridge University Press,
     1984, 221-232.

[30] K.L. Kwok, The Use of Titles and Cited Titles as Document
     Representations for Automatic Classification, Information Processing
     and Management, Vol. 11, 1975, 201-206.

[31] E. Garfield, Citation Indexing - Its Theory and Application in
     Science, Technology and Humanities, J. Wiley and Sons, New York, 1979.
`
`
`
[Figure: base document A; citing document B (document B refers to document
A); cited document C (document C is cited by A); document A' is cited by B,
so documents A and A' are cocited; document A" refers to C, so documents A
and A" are bibliographically coupled.]

Document Pairs    Bibliographic Relation
B - A             citing-cited
A - C             citing-cited
A - A"            bibliographically coupled
A - A'            cocited

Citation Relations between Documents

Fig. 1
`
`
`
                                                Cited    Citing   Cocited  Cited + Citing
Number of Documents Which Were Altered
  by Citation Process                           1145     1111     985      1751
Proportion of Documents Which Were Altered
  by Citation Process                           35.9%    34.7%    30.1%    54.7%
Total Number of Bibliographically Related
  Documents Used for Document Vector Alteration 2652     2652     12050    5304
Number of Distinct Terms in Collection          16951    16951    16951    16951
Mean Number of Terms per Document               38.9     39.3     45.0     41.5
Total Number of Terms with Changed
  Term Weights                                  3580     3108     4375     6205
Mean Number of Terms with Changed Weights
  Among Altered Documents                       3.12     2.80     4.44     3.54
Mean Number of Terms with Changed Weights
  for All Documents                             1.12     0.97     1.37     1.94
Total Number of New Terms Added                 7154     8187     26589    15321
Mean Number of Added Terms Among
  Altered Documents                             6.23     7.38     26.99    8.75
Mean Number of Added Terms for
  All Documents                                 2.23     2.56     8.30     4.78

CACM Collection Statistics
(3204 documents, 47 queries)

Table 1
`
`
`
                                                Cited    Citing   Cocited  Cited + Citing
Number of Documents Which Were Altered
  by Citation Process                           36       54       124      54
Proportion of Documents Which Were Altered
  by Citation Process                           25%      37%      85%      37%
Total Number of Bibliographically Related
  Documents Used for Document Vector Alteration 96       555      787      651
Number of Distinct Terms in Collection          2074     2273     2044     2297
Mean Number of Terms per Document               54.8     64.1     71.0     65.5
Total Number of Terms with Changed
  Term Weights                                  138      467      529      514
Mean Number of Terms with Changed Weights
  Among Altered Documents                       3.83     8.65     4.27     9.52
Mean Number of Terms with Changed Weights
  for All Documents                             0.95     3.20     3.62     3.52
Total Number of New Terms Added                 261      1624     2628     1829
Mean Number of Added Terms Among
  Altered Documents                             7.25     30.07    21.19    33.87
Mean Number of Added Terms for
  All Documents                                 1.79     11.12    18.00    12.53

CISI Collection Statistics
(146 documents, 47 queries)

Table 2
`
`
`
Recall  Original  + Cited   + Citing  (change)  + Cocited  + Cited + Citing  (change)
0.1     .3445     .3881     .4073     +18%      .3529      .4388             +27%
0.2     .2957     .3279     .3320     +12%      .2968      .3549             +20%
0.3     .1982     .2378     .2548     +29%      .2437      .2729             +38%
0.4     .1454     .2796     .1869     +29%      .1837      .2076             +43%
0.5     .1057     .1214     .1350     +28%      .1246      .1507             +43%
0.6     .0752     .0973     .0880     +17%      .0985      .1074             +43%
0.7     .0559     .0741     .0568     +2%       .0714      .0706             +26%
0.8     .0440     .0480     .0451     +3%       .0436      .0496             +13%
0.9     .0242     .0277     .0287     +19%      .0244      .0298             +23%
1.0     .0172     .0184     .0181     +5%       .0185      .0198             +15%

a) CACM Collection (term frequency weights)
   (citing +16%, cited + citing +29%, cocited +13.6%)

Recall  Original  + Cited   + Citing  (change)  + Cocited  + Cited + Citing  (change)
0.1     .5274     .5549     .5504     +4%       .5051      .5454             +3%
0.2     .4408     .4684     .4566     +4%       .4466      .4636             +5%
0.3     .3721     .4069     .4078     +10%      .3723      .4169             +12%
0.4     .2939     .3337     .3345     +14%      .3143      .3426             +17%
0.5     .2344     .2641     .2756     +18%      .2476      .2837             +21%
0.6     .1895     .2282     .2215     +17%      .2046      .2340             +23%
0.7     .1290     .1534     .1418     +10%      .1425      .1663             +29%
0.8     .0953     .1144     .1042     +6%       .1070      .1131             +19%
0.9     .0598     .0694     .0687     +15%      .0696      .0736             +23%
1.0     .0418     .0464     .0397     -5%       .0444      .0410             -2%

b) CACM Collection (term frequency times inverse document frequency)
   (citing +9.3%, cited + citing +15%, cocited +7.1%)

Document Term Expansion - CACM

Table 3
`
`
`
Recall  Original  + Cited   + Citing  (change)  + Cocited  + Cited + Citing  (change)
0.1     .3793     .3702     .3683     -3%       .4237      .3638             -4%
0.2     .3414     .3299     .3410     0%        .3816      .3348             -2%
0.3     .3014     .2853     .2880     -4%       .3291      .2895             -4%
0.4     .2818     .2707     .2669     -5%       .2994      .2690             -5%
0.5     .2547     .2380     .2477     -3%       .2731      .2452             -4%
0.6     .1844     .1837     .1838     0%        .1911      .1842             0%
0.7     .1728     .1722     .1714     -1%       .1708      .1717             -1%
0.8     .1461     .1457     .1513     +4%       .1501      .1526             +4%
0.9     .1231     .1217     .1242     +1%       .1309      .1269             +3%
1.0     .1139     .1123     .1148     +1%       .1263      .1166             +2%

a) CISI Collection (term frequency weights)
   (citing -1%, cocited +6.9%, citing + cited -1.1%)

Recall  Original  + Cited   + Citing  (change)  + Cocited  + Cited + Citing  (change)
0.1     .4909     .4762     .4937     +1%       .4484      .4955             +1%
0.2     .4308     .4260     .4372     +1%       .3928      .4349             +1%
0.3     .3634     .3572     .3590     -1%       .3332      .3596             -1%
0.4     .3290     .3213     .3221     -2%       .3122      .3274             0%
0.5     .3092     .3045     .3028     -2%       .2898      .3100             0%
0.6     .2114     .2108     .2160     +2%       .2358      .2078             -2%
0.7     .1933     .1862     .1718     -11%      .1899      .1642             -15%
0.8     .1755     .1691     .1522     -13%      .1814      .1451             -17%
0.9     .1523     .1499     .1301     -15%      .1589      .1223             -20%
1.0     .1469     .1375     .1230     -16%      .1529      .1139             -22%

b) CISI Collection (term frequency times inverse document frequency)
   (citing -5.6%, citing + cited -7.5%, cocited -16%)

Document Term Expansion - CISI

Table 4
`
`
`
CACM Query 7, Document 2150

Query 7:

I am interested in distributed algorithms - concurrent programs in
which processes communicate and synchronize by using message passing.
Areas of particular interest include fault-tolerance and techniques
for understanding the correctness of these algorithms.

Document: 2150

.T (Title)
Concurrent Control with "Readers" and "Writers"
.W (Abstract)
The problem of the mutual exclusion of several independent processes
from simultaneous access to a "critical section" is discussed for
the case where there are two distinct classes of processes known as
"readers" and "writers." The "readers" may share the section with
each other, but the "writers" must have exclusive access. Two
solutions are presented: one for the case where we wish minimum
delay for the readers; the other for the case where we wish writing
to take place as early as possible.
.B (Citation)
CACM October, 1971
.A (Authors)
Courtois, P. J.
Heymans, F.
Parnas, D. L.
.K (Keywords)
mutual exclusion, critical section, shared access to resources
.C (Computing Reviews Categories)
4.30 4.32

Original Document (Output Rank 700)

Query Terms in Document   Weights
PROCES                    2.00
CONCUR                    1.00

Modified Document (Output Rank 27)

Query Terms in Document   Weights   Remarks
PROCES                    2.00
PROGRAM                   2.00      new indifferent term
SYNCHRON                  2.00      new good term
CONCUR                    4.00      good term with increased weight

Positive Document Modification (CACM Collection)

Table 5
`
`
`
CISI Query 34, Document 60

Query 34:

Methods of coding in computerized indexing systems

Document: 60

.T (Title)
Information Science: What Is It?
.A (Author)
Borko, H.
.W (Abstract)
In seeking a new sense of identity, we ask, in this article, the
question: What is information science? What does information
science do? Tentative answers to these questions are given in the
hope of stimulating discussion that will help clarify the nature
of our field and our work

Original Document (Output Rank 48)

Query Terms in Document   Weights
SYSTEM                    2.00
COMPUT                    2.00

Modified Document (Output Rank 15)

Query Terms in Document   Weights   Remarks
SYSTEM                    9.00      weight increase
COMPUT                    1.00
INDIC                     6.00      new term

Positive Document Modification (CISI Collection)

Table 6
`
`
`
CACM Query 27, Document 2902

Query 27:

Memory management aspects of operating systems

Document: 2902

.T (Title)
Dynamic Memory Allocation in Computer Simulation
.W (Abstract)
This paper investigates the performance of 35 dynamic memory allocation
algorithms when used to service simulation programs as represented
by 18 test cases. Algorithm performance was measured in terms of
processing time, memory usage, and external memory fragmentation.
Algorithms maintaining separate free space lists for each size of
memory block used tended to perform quite well compared with other
algorithms. Simple algorithms operating on memory ordered lists
(without any free list) performed surprisingly well. Algorithms
employing power-of-two block sizes had favorable processing
requirements but generally unfavorable memory usage. Algorithms employing
LIFO, FIFO, or memory ordered free lists generally performed poorly
compared with others.
.B (Citation)
CACM November, 1977
.A (Author)
Nielsen, N. R.
.K (Keywords)
algorithm performance, dynamic memory allocation, dynamic memory
management, dynamic storage allocation, garbage collection, list
processing, memory allocation, memory management, programming
techniques, simulation, simulation memory management, simulation
techniques, space allocation, storage allocation

Original Document (Output Rank 18)

Query Terms in Document   Weights
OPER                      1.00
MEM                       8.00

Modified Document (Output Rank 37)

Query Terms in Document   Weights   Remarks
SYSTEM                    2.00      new unimportant term
OPER                      1.00
MEM                       9.00      slight weight increase

Negative Document Modification (CACM Collection)

Table 7
`
`
`
CISI Query 19, Document 113

Query 19:

Techniques of machine matching and machine searching systems.
Coding and matching methods.

Document: 113

.T (Title)
Measuring the Quality of Sociological Research: Problems in the
Use of the Science Citation Index
.A (Author)
Cole, S.
.W (Abstract)
The problem of assessing the "quality" of scientific publications
has long been a major impediment to progress in the sociology of
science. Most researchers have typically paid homage to the belief
that quantity of output is not the equivalent of quality and have
then gone ahead and used publication counts anyway (Cole, 1963;
Crane, 1965; Prince, 1963; Wilson, 1964). There seemed to be no
practicable way to measure the quality of large numbers of papers
or the life's work of large numbers of scientists. The invention
of the Science Citation Index (SCI) a few years ago provides a
new and reliable tool