Experimental Collections in Computer
and Information Science Containing
Textual and Bibliographic Concepts
`
`*
`Edward A. Fox
`
`83-561
`September 1983
`
`* Department of Computer Science
`Cornell University
Ithaca, New York 14853
`
`now at:
`Department of Computer Science
`Virginia Tech
Blacksburg, VA 24061
`
This work was supported in part by the National Science Foundation,
under grant IST-81-08696.
`
`001
`
`Facebook Ex. 1007
`
`
`
TABLE OF CONTENTS

1 Introduction
1.1 Extended Vectors
1.2 Contrast with Other Collections
1.2.1 Document Collections
1.2.2 Query Collections
2 Multiple Concept Types
3 CACM
3.1 Documents
3.2 Illustrative Charts and Figures
3.3 Queries
3.4 Relevance Judgments
3.5 Retrieval Performance
4 ISI
4.1 Documents
4.1.1 Background
4.1.2 Tape Information Provided
4.1.3 Collection Preparation
4.2 Subvectors
4.3 Queries
4.4 Relevance Judgments
4.5 Retrieval Performance
5 Conclusion
References
`
`
`
`
CHARACTERIZATION OF TWO NEW EXPERIMENTAL
COLLECTIONS IN COMPUTER AND INFORMATION SCIENCE
CONTAINING TEXTUAL AND BIBLIOGRAPHIC CONCEPTS
*
`Edward A. Fox
`
Two new collections are described which are particularly useful for investigating the interaction
between textual and bibliographic data in the automatic indexing and retrieval of documents. An
extension to the vector space model has been proposed whereby various types of concepts are included
in the representation of such documents. Experiments using an enhanced version of the SMART system
have shown such an extended model to perform better than simpler schemes. The CACM and ISI
collections developed for this research should be of value for future related studies.

The ISI collection has author, title/abstract, and co-citation data for the 1460 most highly cited
articles and manuscripts in information science in the 1969-1977 period. The CACM collection contains
7 types of concepts for the 3204 articles published in the Communications of the ACM up through 1979.
These collections have 76 and 52 queries, respectively, along with relevance judgments.
`
*Department of Computer Science, Cornell University, Ithaca, NY 14853; now at Dept. of Computer
Science, Virginia Tech, Blacksburg, VA 24061. This work was supported in part by the National
Science Foundation, under grant IST-81-08696.
`
`
`
`
`1. Introduction
`
`
In order to retrieve documents relevant to the request of a particular user it is necessary to first
index or represent the content of articles and manuscripts. For many years this has been done by
trained indexers who assign keyword lists or sets of descriptors from a controlled vocabulary [Borko &
Bernier 1978]. Since the early 1960's an alternative method of automatic indexing has been developed
whereby word stems, words, phrases, or thesaurus category indicators are selected from the title and
abstract and a weighted vector indicating the importance of each is constructed [Salton 1980]. Part of
this report deals with the vectors derived in this fashion from two collections in information and
computer science.
`
Another source of data about documents is from their bibliographic references. Citation indexes
can be used to locate those entries referred to by an article, or which cite it [Garfield 1964, 1979].
Linkages between documents based on bibliographic coupling [Kessler 1962] and co-citation counts
[Small 1973] have been utilized for a variety of analysis and retrieval purposes (e.g., [Bichteler &
Eaton 1980], [Garfield 1970], [Kessler 1963a, 1963b, 1965], [Small & Koenig 1977], [Small 1978, 1980,
1981], [Weinberg 1974]). Preliminary experimentation has shown that the vectors produced by automatic
indexing of document texts can be usefully supplemented by bibliographic information to produce a
representation that can be more effectively searched than if either component were used alone
([Michelson et al. 1971], [Salton 1963, 1971]).
`
To facilitate exploration of the effects of extending the vector space model to include a variety of
types of concepts it was necessary to have test collections containing such concepts. One collection
containing 1460 of the most highly cited documents in information science published between 1969 and
1977 [Small 1981] was developed based on citation and co-citation data provided by the Institute for
Scientific Information. This ISI collection contains three types of concepts: author names, word stems
from the title and abstract sections, and co-citations between each of the articles. A second collection,
of the 3204 articles published in Communications of the ACM in the years up through 1979, contains
the above three types of concepts plus: categories assigned from a hierarchical subject classification
scheme, bibliographic coupling connections, date information, and direct references between articles.
The CACM collection also includes 52 user supplied queries in 3 different forms and relevance
judgments indicating which documents relate to which queries. The ISI collection has a total of 76 queries.
`
This report describes the CACM and ISI collections in detail. It supplements the theoretical and
experimental descriptions of [Fox 1983b] and the elaboration on implementation issues contained in
[Fox 1983a]. Ample discussion has been included to enable interested researchers to understand the
characteristics of these collections and to determine if they might be of use in related research
investigations.
`
The rest of this introductory section provides useful background for subsequent sections. The
notion of extended vectors is introduced, and some overview tables are shown indicating how these
collections relate to other test collections discussed in [Fox 1983b].
`
1.1. Extended Vectors
`
When only the terms of documents are considered, a simple collection representation scheme
results. A dictionary can be formed to contain the $T$ distinct word stems and so textually based
concepts can be numbered from 1 through $T$. Each of the $N$ documents $D_i$ is represented by a
vector of length $T$,

\[ \vec{D}_i = ( tm_{i1}, tm_{i2}, \ldots, tm_{iT} ) \tag{1-1} \]

so the entire collection is an $N \times T$ matrix,

\[ C = \{ tm_{ij} \}, \qquad 1 \le i \le N, \quad 1 \le j \le T. \tag{1-2} \]
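As a minimal sketch of (1-1) and (1-2), the following Python fragment builds a dictionary and an N x T matrix from two made-up texts. The stop word list, the sample texts, and the function names are assumptions for illustration; the sketch uses raw term counts and no stemming, whereas the report's vectors hold weights for word stems.

```python
from collections import Counter

# Hypothetical stop word list and sample texts, for illustration only.
STOP_WORDS = {"the", "of", "a", "and", "in", "for"}

def index_terms(text):
    """Lowercase, split on whitespace, and drop stop words (no real stemming is done)."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]

def build_term_matrix(texts):
    """Return (dictionary, matrix); matrix[i][j] counts term j in document i."""
    token_lists = [index_terms(text) for text in texts]
    # The text numbers concepts 1..T; list positions here run 0..T-1.
    vocabulary = sorted({tok for toks in token_lists for tok in toks})
    dictionary = {term: j for j, term in enumerate(vocabulary)}
    matrix = []
    for toks in token_lists:
        counts = Counter(toks)
        matrix.append([counts[term] for term in vocabulary])
    return dictionary, matrix

sample_texts = [
    "retrieval of documents",
    "indexing of documents and retrieval of documents",
]
dictionary, C = build_term_matrix(sample_texts)
print(dictionary)
print(C)
```

Each row of `C` plays the role of one document vector, and the dictionary fixes which column belongs to which concept.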
`
If other types of information are provided in addition, then each term vector (1-1) can become a
subvector of a more complete document vector. Similarly, the matrix (1-2) becomes the term
submatrix.
`
It makes sense to have a separate author submatrix indicating which authors contributed to
which articles based on a second dictionary of author names. When some subject classification scheme
is adopted, such as the category system used for Computing Reviews, then a dictionary for those entries
can also be constructed. Thus, the CACM collection has term (tm), author (au), and Computing
Reviews category (cr) submatrices, along with others.
`
More data about articles comes from the references present in the bibliographies of each
publication. An NxN matrix can be built indicating which articles refer to which others, or which ones
are cited by others. Similar matrices indicate the degree of bibliographic coupling or the number of
co-citations received by pairs of articles. Each of these matrices then becomes a submatrix of a large
N-row collection matrix.
`
The extended vector model is thus based upon the idea of having multiple concept types. Each
document vector has subvectors, one for each concept type included in the representation. Further
discussion of the model, and experimental evidence of its utility, can be found in [Fox 1983b]. More
detail regarding how the CACM and ISI collections are represented according to this model is given in
sections 2 and 3 below.
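The layout of an extended vector can be sketched as a concatenation of per-type subvectors, with a record of which columns belong to which concept type. The concept-type codes follow the text; the sizes, the values, and the function names below are hypothetical.

```python
# Hypothetical subvector sizes and values; only the concept-type codes come from the text.
def make_extended_vector(subvectors, order):
    """Concatenate per-type subvectors into one row of the collection matrix."""
    row = []
    for concept_type in order:
        row.extend(subvectors[concept_type])
    return row

def subvector_slices(sizes, order):
    """Map each concept type to the (start, end) column range it occupies."""
    slices = {}
    start = 0
    for concept_type in order:
        slices[concept_type] = (start, start + sizes[concept_type])
        start += sizes[concept_type]
    return slices

order = ["au", "tm", "cc"]
document = {"au": [1, 0], "tm": [2, 0, 1], "cc": [0, 3]}
row = make_extended_vector(document, order)
slices = subvector_slices({name: len(vec) for name, vec in document.items()}, order)
print(row)
print(slices)
```

Keeping the column ranges alongside the row lets each subvector be recovered later, for example to compare the retrieval value of one concept type against another.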
`
1.2. Contrast with Other Collections
`
In [Fox 1983b], 1 small and 4 moderate to medium size test collections were employed to test
various hypotheses. The CACM and ISI collections were two of those utilized. To provide a suitable
contrast, the following subsections present summary information about all 5 of those collections.
`
`
`
`
`1.2.1. Document Collections
`
`
Table 1 summarizes essential data about each of the document collections. All collections contain
information about documents such as monographs or journal articles. At the very least, in almost
every case, a title and abstract were available originally. They deal with a number of subjects, from
the "soft" social-science-like material that makes up a fair proportion of the ISI collection, to the
terse, medical articles used in Medlars studies.
`
Table 1: Document Collections Summary

Short     No. of   No. of   Av. No.   Subvectors                   Subject                  Years
Name      Docs.    Terms    Terms     Included                     Matter                   Covered
ADI           82      886     27.1    tm                           Documentation            1963
CACM       3,204   10,446     40.1    au, bi, bc, cc, cr, ln, tm   Computer Science         1958-1979
INSPEC    12,684   14,683     35.4    tm                           Electrical Engineering   1979
ISI        1,460    7,392    104.9    au, cc, tm                   Information Science      1969-1977
Medlars    1,033    8,750     55.8    tm                           Medicine                 to 1969

(1) The small ADI collection is about librarianship, microforms, and other topics in documentation
and information science as of 1963.

(2) The CACM documents include all articles in issues of the Communications of the ACM from the
first issue in 1958 to the last number of 1979. A considerable range of computer science literature
is covered by those 3204 entries in the publication that for many years served as the premier
periodical in the field.
`
(3) INSPEC, which stands for Information Services in Physics, Electrotechnology, Computers and
Control, covers three Science Abstracts publications: Electrical and Electronics Abstracts,
Computer and Control Abstracts, and Physics Abstracts. The content focuses mainly on electrical
engineering and computer science subjects.

(4) The 1460 ISI entries were selected based on co-citation information relating to a study conducted
by Dr. Henry Small of the Institute for Scientific Information® (ISI®). They are the items that
could be located at the Cornell University library out of a total of 1627 names listed in the field of
information science. Each was published between 1969 and 1977 and received at least five
citations.

(5) The 1033 Medlars articles were selected out of a large medical collection available at the National
Library of Medicine.
`
Thus, the five collections are from various sources, have different sizes, and deal with a number of
subjects.
`
1.2.2. Query Collections

Since considerable effort was made to study the characteristics and behavior of various query
formulations, it is worthwhile to examine the five different query collections and all the versions
present for each. Table 2 gives statistics (for the 4 larger collections) relating to the queries and the
number of relevant documents for each query.
`
Table 2: Query and Relevant Document Characteristics

Collection   No. of    Av. Length   Rels Per Query      Rels in Top 10
Name         Queries   Cos Query    Av. No.   Av. %     Av. No.   Av. %
CACM         52        11.4         15.3      0.5       1.9       19
INSPEC       77        16.0         33.0      0.3       2.0        9
ISI          35,76      8.1         49.8      3.4       1.7        4
Medlars      30        10.4         23.2      2.3       3.8       18
`
To give some idea as to the average length of each query, that value is given for the cosine version
based on the original natural language (NL). The ISI value is low, because only the 35 queries for
which both vector and Boolean logic (BL) forms are available were considered, and those queries are
rather short. The INSPEC queries are much longer; users were not very experienced and so described
their interests with entire paragraphs, filled with many unimportant words.
`
Query generality, that is the number of relevant documents per query, is illustrated in the next
two columns of the table. The first number is given in absolute terms and the second is a percentage of
the total number of documents. CACM queries have very few relevant documents, both in actual
numbers and as a percentage. INSPEC questions have roughly twice as many. Medlars queries have
slightly fewer relevants, but the percentage of documents that are relevant is much higher. And the ISI
collection seems to have too many relevant documents per query, both in absolute numbers and as a
percentage; those queries were much too vague to give small, sharply defined relevant sets.
`
`
The final two columns of Table 2 illustrate the retrieval behavior of the queries using cosine
correlation by considering the top 10 ranked documents. For Medlars, almost 4 of the first 10 are
relevant on average, indicating that the queries are likely to have high precision. Further, almost 20%
of the relevant documents are already retrieved, indicating that recall is probably good as well. For
CACM, recall seems high, but precision is reduced to roughly half. In the other two collections, around
2 relevant documents are found in the top 10 retrieved, so precision is low as well, and an even lower
percentage of the relevant documents is identified. For ISI, in particular, only 4% of the relevant
documents are found in the top 10, indicating that recall will be rather low. With so many relevant
documents per query, such behavior could be expected.
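The two "Rels in Top 10" measures can be sketched as follows: documents are ranked by cosine correlation with the query, and the top of the ranking is compared with the set of relevant documents. The vectors, document numbers, and function names are made up for illustration and are not taken from the SMART system.

```python
import math

def cosine(u, v):
    """Cosine correlation between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_documents(query, doc_vectors):
    """Return document ids by decreasing cosine; ties are broken by id."""
    scored = [(cosine(query, vec), did) for did, vec in doc_vectors.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [did for _, did in scored]

def top_k_summary(ranking, relevant, k=10):
    """Return (number of relevant documents in the top k, percent of all relevants found)."""
    hits = sum(1 for did in ranking[:k] if did in relevant)
    percent = 100.0 * hits / len(relevant) if relevant else 0.0
    return hits, percent

# Hypothetical four-document collection and query.
doc_vectors = {1: [1, 0, 1], 2: [0, 1, 0], 3: [1, 1, 0], 4: [0, 0, 1]}
ranking = rank_documents([1, 0, 0], doc_vectors)
print(ranking)
print(top_k_summary(ranking, relevant={3, 4}, k=2))
```

Averaging the first value over all queries corresponds to the "Av. No." column, and averaging the second to the "Av. %" column.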
`
`Table 3 gives descriptive details about the five different groups of query collections.
`
Table 3: Query Collections Summary

Coll.     NL Queries                              BL Queries
Name      No.  Origin                             No.  Origin
ADI       35   Written by 2 Harvard computer      35   author - 4 forms;
               science students                        librarians - 1 form each
CACM      52   Cornell & other computer           52   2 comp. sci. grad. students -
               personnel                               1 form each
INSPEC    77   Students etc. at Syracuse Univ.    77   7 searchers at Syracuse
ISI       76   ADI, ISPRA & SIGIR Forum           35   ADI part only (see ADI coll.
               abstracts                               above)
Medlars   30   NLM files                          30   NLM searchers, then expanded
                                                       using MESH
`
There is one set of natural language queries for each of the five, but in some cases there are several
Boolean forms based on each natural language query.

(1) The ADI collection originally had 35 natural language queries. Three searchers then each
rephrased the natural language questions into a Boolean representation.

(2) 52 CACM queries were submitted from a variety of sources, and two students (in the Cornell
graduate level information retrieval course) responsible for creating the query collection each
proposed Boolean forms.

(3) One set of INSPEC queries was selected from the various types of representations devised by
seven different searchers working at Syracuse University.

(4) For ISI, the same 35 natural language and Boolean queries used for ADI were employed. Multiple
concept type testing, however, required more natural language queries, so 41 additional ones were
selected from a set of queries for the ISPRA collection and from some new queries based on
abstracts published in issues of the ACM SIGIR Forum.

(5) Medlars natural language queries were as provided from the National Library of Medicine (NLM)
files and Boolean forms were subsequently constructed at Cornell. The latter were based on
Boolean expressions employed by searchers. An expansion process utilizing the Medical Subject
Headings of the MESH thesaurus [National Library of Medicine 1968] enabled all category names
to be replaced by words that might occur in document texts.
`
2. Multiple Concept Types
`
To supplement the discussion of documents in section 1.2.1, it is necessary to consider the multiple
concept model proposed in [Fox 1983b]. The word "concept" is used to indicate a basic item of
indexing information so that a collection of concepts identifies the context of the article or monograph.
Concepts can be "terms" (really approximations to stems of words from the title, abstract, or supplied
list of keywords) or they may be indexing categories or bibliographic connection markers (e.g., a
pointer to a document co-cited with the one being indexed).
`
Table 1 indicates how many distinct terms are present in the dictionary of word stems for each
collection. The dictionary is produced as part of the automatic indexing process, and is the minimum
size required to accommodate all distinct stems after a stop word list has been applied to document
texts. Vocabulary size goes up rapidly as the number of documents increases (e.g., see the jumps for
ADI and Medlars), but then gradually tapers off (e.g., see values for the large CACM and INSPEC
collections). The size is somewhat influenced by the nature of the documents; there are more distinct
stems in the 1033 Medlars medical articles, which have a specialized terminology, than in the 1460 ISI
information science records, which contain fewer technical names.
`
The multiple concept type model introduced in section 1.1 calls for additional concepts besides
terms. In the ISI collection, two additional types, authors and co-citations, were considered. Author
names were entered in the author dictionary, which had a total of 1255 items. Co-citations have no
attendant dictionary. Since each document can, a priori, be co-cited with any other, the number of
concepts equals $N$, the number of collection documents. The value of the $j^{th}$ co-citation concept
for the $i^{th}$ document, namely $cc_{ij}$, is the number of times the $i^{th}$ and $j^{th}$ documents
are co-cited.
`
Table 4 gives statistics for ISI subvectors.

Table 4: ISI Subvector Length Statistics

Statistic               Subvector
Measured                au       cc       tm
mean                    1.4      54.0     49.6
median                  1        40       47
min                     1        1        8
max                     7        276      179
stdv                    0.8      46.4     21.5
Total no. of concepts   1255     1460     7392
`
Even though there are only 1255 different authors for 1460 documents, each document has an average
of 1.4 authors. The term subvectors are much longer, averaging around 50 concepts, but the standard
deviation (stdv) indicates that the distribution is spread rather widely. Co-citation subvectors have
about the same average length, but an even larger standard deviation. Since the ISI articles were
chosen as ones with many citations, and since the collection covers the field of information science for a
number of years, it is not surprising that there are so many entries in the co-citation submatrix. Since
the lengths of term and co-citation vectors are comparable, comparisons between the two as to their
relative utility for retrieval should be of considerable interest.
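Subvector length statistics of this kind could be computed along the following lines, taking "length" to mean the number of nonzero concepts in a subvector. The sample subvectors and the function name are hypothetical, and the use of the population standard deviation is an assumption; the report does not say which form of stdv was used.

```python
import statistics

def length_statistics(subvectors):
    """Summarize the number of nonzero concepts per subvector."""
    lengths = [sum(1 for weight in vec if weight != 0) for vec in subvectors]
    return {
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "min": min(lengths),
        "max": max(lengths),
        # Assumption: population standard deviation.
        "stdv": statistics.pstdev(lengths),
    }

# Hypothetical subvectors of one concept type for four documents.
sample = [[1, 0, 2], [0, 0, 1], [1, 1, 1], [0, 4, 0]]
stats = length_statistics(sample)
print(stats)
```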
`
The CACM collection was developed in part to allow testing of other concept types besides those
found in ISI documents. Computing Reviews categories (cr) allow a manual indexing system to be
mixed in with automatic indexing entries. Bibliographic coupling (bc) indicates the number of
references shared between two documents' bibliographies. Links (ln) are references to or citations from
other articles.
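In matrix terms, these three bibliographic concept types can all be derived from a single N x N matrix recording which article cites which. The sketch below uses plain Python lists and hypothetical 0-based document indices. Leaving the diagonal of the bc and cc matrices at zero is an assumption; setting the diagonal of ln to one follows the note accompanying Table 5.

```python
def bibliographic_matrices(n, references):
    """references: (citing, cited) index pairs, 0-based, both inside the collection."""
    cites = [[0] * n for _ in range(n)]
    for citing, cited in references:
        cites[citing][cited] = 1
    bc = [[0] * n for _ in range(n)]
    cc = [[0] * n for _ in range(n)]
    ln = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                # Coupling: references that i and j share.
                bc[i][j] = sum(cites[i][k] * cites[j][k] for k in range(n))
                # Co-citation: articles that cite both i and j.
                cc[i][j] = sum(cites[k][i] * cites[k][j] for k in range(n))
            # Link: i cites j, j cites i, or i is j itself.
            ln[i][j] = 1 if (i == j or cites[i][j] or cites[j][i]) else 0
    return bc, cc, ln

# Hypothetical collection: article 2 cites 0 and 1; article 3 cites 0, 1, and 2.
bc, cc, ln = bibliographic_matrices(4, [(2, 0), (2, 1), (3, 0), (3, 1), (3, 2)])
print(bc[2][3], cc[0][1], ln[0])
```

Row i of each matrix is then the bc, cc, or ln subvector of document i.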
`
`Table 5 gives statistics for the various concept types in the CACM document collection.
`
Table 5: CACM Subvector Length Statistics

Statistic               Subvector
Measured                au      bc      cc      cr      ln      tm
mean                    1.3     4.2     3.7     1.2     2.7     25.0
median                  1       0       0       0       2       15
min                     1       0       0       0       1       1
max                     7       183     111     28      74      168
stdv                    0.7     10.8    10.7    1.9     3.1     22.7
Total no. of concepts   2647    3204    3204    200     3204    10446

Note: The minimum length for ln is one since a document is, by definition, linked to itself, and
so the diagonal of the submatrix is set to ones.
`
Since bc, ln, and cc subvectors are based only on connections among the chosen CACM articles, their
mean length is much shorter than that of the tm subvector. The cc subvectors for CACM have average
length of less than 4 concepts, while those for ISI have over 50. In a larger document collection,
with better total coverage of a subject discipline, more of the citations and references would be internal
to that collection and so these subvectors would probably be longer. Thus, while the ISI collection has
unusually long bibliographic connection subvectors, the CACM collection has abnormally short ones,
and any evidence from the CACM tests that these subvectors are useful should be suggestive of their
value in a more realistic environment with large numbers of documents.
`
It should be noted that links are relatively well bounded in number since a given article rarely has
a very long bibliography and the quantity of citations to an article is limited by how many entries in
the collection deal with that specific topic. Co-citations are somewhat more numerous, especially in
such a homogeneous collection, since many later articles may refer to a given article as well as others
appearing in the same journal. Bibliographic couplings are slightly more plentiful, perhaps because
many articles cite "classics" in their sub-area, and because there is a good deal of referencing of CACM
articles by CACM authors.
`
A final note about the various concept types is that in both the ISI and CACM collections, the
total number of terms exceeds the number of documents. For very large collections, the opposite would
be true. However, there is no reason to believe that such a consideration would affect retrieval
behavior, whereas subvector length is likely to be an important consideration, relating to the specificity
of indexing.
`
`3. CACM
`
`
`
`
`3.1. Documents
`
`
Robert Dattola of Xerox Corporation provided a magnetic tape containing the title, abstract,
author list, keywords, Computing Reviews categories, and date of publication of articles published in
the Communications of the ACM from the earliest issue, in 1958, to the last number in 1979. The
collection format was changed, document numbers (dids) were assigned, and editing was done to correct
spelling and typographical errors. Some duplicates were eliminated and missing articles added, and a
final renumbering took place.
`
Carol Fox and Jill Warner looked through printed copies of each article to locate the bibliography.
For each article, a list of the "dids" representing articles referenced in the CACM collection was
eventually obtained. Many articles were located using a chronologically ordered list of all articles in the
collection. However, since there were many errors in bibliographies, a special search program was
written to identify the nearest matches to a supplied reference, to compensate for errors in year, month,
author name, and title.
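The report does not describe how the search program worked. As one possible sketch of nearest-match lookup, the fragment below scores each catalog entry against a possibly erroneous reference using Python's standard `difflib` string similarity. The catalog entries, the field names, the weights, and the one-year tolerance are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Case-insensitive string similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def nearest_matches(reference, catalog, limit=3):
    """Return the dids of the best-scoring catalog entries, best first."""
    scored = []
    for did, entry in catalog.items():
        # Hypothetical weighting of author, title, and year agreement.
        score = (
            0.4 * similarity(reference["author"], entry["author"])
            + 0.5 * similarity(reference["title"], entry["title"])
            + 0.1 * (1.0 if abs(reference["year"] - entry["year"]) <= 1 else 0.0)
        )
        scored.append((score, did))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [did for _, did in scored[:limit]]

# Made-up catalog entries, not real CACM articles.
catalog = {
    101: {"author": "Smith", "title": "A Sorting Algorithm", "year": 1965},
    102: {"author": "Jones", "title": "Compiler Design Notes", "year": 1970},
    103: {"author": "Smyth", "title": "Sorting Algorithms Revisited", "year": 1972},
}
garbled = {"author": "Smit", "title": "A Sorting Algorthm", "year": 1966}
print(nearest_matches(garbled, catalog))
```

Returning a short ranked list rather than a single answer leaves the final choice to a person checking the printed bibliography.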
`
The purpose of obtaining that data was to form bibliographic subvectors. Based on the above
mentioned lists, a relational form was produced:

Raw_data ( citing, cited )

which contained pairs of identifiers for the citing article and the one contained in the article's
bibliography. Figure 1 shows the steps required, in QUEL like notation [Stonebraker et al. 1976], to
produce from Raw_data first normal form [Date 1982] versions of the desired relations for bc, ln, and
cc subvectors.
`
Figure 1: Relational Processing to Obtain
CACM Bibliographic Subvectors

Given: Raw_data (citing, cited)

Desired:
    BC (citing1, citing2, coupling_no)
    LN (link1, link2)
    CC (cited1, cited2, co-citing_no)

QUEL Like Statements:

1. For BC
    Modify Raw_data to hash on cited
    Range of bc1 is Raw_data
    Range of bc2 is Raw_data
    Retrieve into bc_entries
        ( citing1 = bc1.citing,
          citing2 = bc2.citing,
          counter = bc1.cited )
        where bc1.cited = bc2.cited
    Range of bce is bc_entries
    Retrieve into BC
        ( bce.citing1, bce.citing2,
          coupling_no = count ( bce.counter ) )

2. For LN
    Retrieve into LN
        ( link1 = bc1.citing,
          link2 = bc1.cited )
    Append to LN
        ( link2 = bc1.citing,
          link1 = bc1.cited )

3. For CC
    Modify Raw_data to hash on citing
    Range of cc1 is Raw_data
    Range of cc2 is Raw_data
    Retrieve into cc_entries
        ( cited1 = cc1.cited,
          cited2 = cc2.cited,
          counter = cc1.citing )
        where cc1.citing = cc2.citing
    Range of cce is cc_entries
    Retrieve into CC
        ( cce.cited1, cce.cited2,
          co-citing_no = count ( cce.counter ) )
`
`The actual processing was similar. Vectors were eventually formed from the relations after appropriate
`
`sorting and compression to unnormalized lists.
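For readers unfamiliar with QUEL, the same joins over Raw_data can be sketched in Python with dictionaries standing in for the hashed relations. This is an illustrative analogue, not the processing actually used. The function and variable names are hypothetical, and the sketch deliberately omits pairs of an article with itself, which the self-join in Figure 1 would also produce.

```python
from collections import Counter
from itertools import permutations

def derive_relations(raw_data):
    """raw_data: (citing, cited) did pairs. Returns (BC, LN, CC)."""
    pairs = set(raw_data)
    cited_by = {}  # cited did -> set of citing dids (like hashing on cited)
    refs_of = {}   # citing did -> set of cited dids (like hashing on citing)
    for citing, cited in pairs:
        cited_by.setdefault(cited, set()).add(citing)
        refs_of.setdefault(citing, set()).add(cited)

    # BC: two articles are coupled once for every reference they share.
    bc = Counter()
    for citers in cited_by.values():
        for a, b in permutations(sorted(citers), 2):
            bc[(a, b)] += 1

    # CC: two articles are co-cited once for every article citing both.
    cc = Counter()
    for refs in refs_of.values():
        for a, b in permutations(sorted(refs), 2):
            cc[(a, b)] += 1

    # LN: each reference yields a link in both directions.
    ln = set()
    for citing, cited in pairs:
        ln.add((citing, cited))
        ln.add((cited, citing))
    return bc, ln, cc

# Hypothetical dids: article 3 cites 1 and 2; article 4 cites 1, 2, and 3.
BC, LN, CC = derive_relations([(3, 1), (3, 2), (4, 1), (4, 2), (4, 3)])
print(BC[(3, 4)], CC[(1, 2)], sorted(LN))
```

Grouping the keys of `BC`, `CC`, or `LN` by their first did gives the unnormalized per-document lists from which the subvectors are formed.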
`
Since the CACM collection is the first one studied with so many different concept types, and since
the data is available for a number of years, various histograms, scatter plots, and charts are given in
the next subsection. The intention is to illustrate the form, content, and distribution of the different
types of bibliographic concepts.
`
3.2. Illustrative Charts and Figures
`
Table 5 gave statistics on the CACM subvector lengths. However, a much more detailed
understanding can be gleaned from the graphical presentations below. To begin with, Figures 2 through 6
are histograms with mean values given for various statistics for each year in the period 1958-1979.
`
Figure 2 shows the number of articles each year. Clearly, the publication grew in size during the
early years, and then there were changes in editorial policies at various later times. For example, in the
early years, computing algorithms in fairly large numbers were published, each one counting as an
article. Subsequently this practice was discontinued and algorithms were collected for a separate
publication.
`
Figure 3 shows the average number of citations to an article for each year. Lower values in later
years can be explained by the fact that there were few articles in the collection that were published
subsequently, and so citations were not possible. Early articles were not highly cited, perhaps because
the first volumes had many methodological or other reports that were not of great interest afterwards.
The peak in citations in the middle years is attributable to the fact that CACM was a key publication
in computer science during that time, and many important developments were described there,
especially after 1962. One might expect that the three bibliographic subvectors would vary in length
depending in part on the distribution just described of citations.
`
Figure 4 shows the average number of bibliographic couplings per year. In general, the number
increases as time goes by, since there are more available prior articles in the collection that can be
referred to. Thus, the number of couplings for years 1976 through 1978 is very high, since bibliographic
references can be to any previous article, even back to 1958. All in all, the distribution is a fairly
uniform one, perhaps explaining why there is such a large standard deviation in bc subvector lengths.
`
Figure 5 gives the average number of links for each year. Since links include both references and
citations, it is not surprising that the curve is a relatively flat one. Only for the early years, when there
were not many prior articles to refer to and when the subject matter was not such as to elicit many
later citations, is the level of links fairly low.
`
Figure 6, showing average number of co-citations, has a very different form than that of the other
figures shown. A high peak appears near the middle, and lower values are at either tail. Apparently,
1966 was a good year, since articles then were co-cited with many others. Referring back to Figure 3
`
Figure 2: Mean Number of CACM Articles Published Each Year

[Histogram. Vertical axis: No. of Articles (0 to 300); horizontal axis: Year of CACM Volume
Considered (1958 to 1979).]
`
Figure 3: Mean Number of Citations Per Article Per Year

[Histogram. Vertical axis: No. of Citations; horizontal axis: Year.]
`
`
`Figure 4: Mean Number of Bibliographic Couplings
`Per Article