Characterization of Two New
Experimental Collections in Computer
and Information Science Containing
Textual and Bibliographic Concepts

Edward A. Fox*

83-561
September 1983

* Department of Computer Science
Cornell University
Ithaca, New York 14853

now at:
Department of Computer Science
Virginia Tech
Blacksburg, VA 24061

This work was supported in part by the National Science Foundation, under grant IST-81-08696.
`
`Facebook Inc. Ex. 1206
`
`

`

`TABLE OF CONTENTS
`
1 Introduction
1.1 Extended Vectors
1.2 Contrast with Other Collections
1.2.1 Document Collections
1.2.2 Query Collections
2 Multiple Concept Types
3 CACM
3.1 Documents
3.2 Illustrative Charts and Figures
3.3 Queries
3.4 Relevance Judgments
3.5 Retrieval Performance
4 ISI
4.1 Documents
4.1.1 Background
4.1.2 Tape Information Provided
4.1.3 Collection Preparation
4.2 Subvectors
4.3 Queries
4.4 Relevance Judgments
4.5 Retrieval Performance
5 Conclusion
References
`
`

`

CHARACTERIZATION OF TWO NEW EXPERIMENTAL
COLLECTIONS IN COMPUTER AND INFORMATION SCIENCE
CONTAINING TEXTUAL AND BIBLIOGRAPHIC CONCEPTS

Edward A. Fox*
`
Two new collections are described which are particularly useful for investigating the interaction between textual and bibliographic data in the automatic indexing and retrieval of documents. An extension to the vector space model has been proposed whereby various types of concepts are included in the representation of such documents. Experiments using an enhanced version of the SMART system have shown such an extended model to perform better than simpler schemes. The CACM and ISI collections developed for this research should be of value for future related studies.

The ISI collection has author, title/abstract, and co-citation data for the 1460 most highly cited articles and manuscripts in information science in the 1969-1977 period. The CACM collection contains 7 types of concepts for the 3204 articles published in the Communications of the ACM up through 1979. These collections have 76 and 52 queries, respectively, along with relevance judgments.
`
*Department of Computer Science, Cornell University, Ithaca, NY 14853; now at Dept. of Computer Science, Virginia Tech, Blacksburg, VA 24061. This work was supported in part by the National Science Foundation, under grant IST-81-08696.
`
`

`

`1. Introduction
`
`
In order to retrieve documents relevant to the request of a particular user it is necessary to first index or represent the content of articles and manuscripts. For many years this has been done by trained indexers who assign keyword lists or sets of descriptors from a controlled vocabulary [Borko & Bernier 1978]. Since the early 1960's an alternative method of automatic indexing has been developed whereby word stems, words, phrases, or thesaurus category indicators are selected from the title and abstract and a weighted vector indicating the importance of each is constructed [Salton 1980]. Part of this report deals with the vectors derived in this fashion from two collections in information and computer science.
`
Another source of data about documents is from their bibliographic references. Citation indexes can be used to locate those entries referred to by an article, or which cite it [Garfield 1964, 1979]. Linkages between documents based on bibliographic coupling [Kessler 1962] and co-citation counts [Small 1973] have been utilized for a variety of analysis and retrieval purposes (e.g., [Bichteler & Eaton 1980], [Garfield 1970], [Kessler 1963a, 1963b, 1965], [Small & Koenig 1977], [Small 1978, 1980, 1981], [Weinberg 1974]). Preliminary experimentation has shown that the vectors produced by automatic indexing of document texts can be usefully supplemented by bibliographic information to produce a representation that can be more effectively searched than if either component were used alone ([Michelson et al. 1971], [Salton 1963, 1971]).
`
To facilitate exploration of the effects of extending the vector space model to include a variety of types of concepts it was necessary to have test collections containing such concepts. One collection containing 1460 of the most highly cited documents in information science published between 1969 and 1977 [Small 1981] was developed based on citation and co-citation data provided by the Institute for Scientific Information. This ISI collection contains three types of concepts: author names, word stems from the title and abstract sections, and co-citations between each of the articles. A second collection, of the 3204 articles published in Communications of the ACM in the years up through 1979, contains the above three types of concepts plus: categories assigned from a hierarchical subject classification scheme, bibliographic coupling connections, date information, and direct references between articles. The CACM collection also includes 52 user supplied queries in 3 different forms and relevance judgments indicating which documents relate to which queries. The ISI collection has a total of 76 queries.
`
This report describes the CACM and ISI collections in detail. It supplements the theoretical and experimental descriptions of [Fox 1983b] and the elaboration on implementation issues contained in [Fox 1983a]. Ample discussion has been included to enable interested researchers to understand the characteristics of these collections and to determine if they might be of use in related research investigations.

The rest of this introductory section provides useful background for subsequent sections. The notion of extended vectors is introduced, and some overview tables are shown indicating how these collections relate to other test collections discussed in [Fox 1983b].
`
1.1. Extended Vectors

When only the terms of documents are considered, a simple collection representation scheme results. A dictionary can be formed to contain the T distinct word stems and so textually based concepts can be numbered from 1 through T. Each of the N documents D_i is represented by a vector of length T,

\[ \vec{D}_i = ( tm_{i1}, tm_{i2}, \ldots, tm_{iT} ) \tag{1-1} \]

so the entire collection is an N x T matrix,

\[ C = \{ tm_{ij} \}, \quad 1 \le i \le N, \; 1 \le j \le T. \tag{1-2} \]
`
`
If other types of information are provided in addition, then each term vector (1-1) can become a subvector of a more complete document vector. Similarly, the matrix (1-2) becomes the term submatrix.

It makes sense to have a separate author submatrix indicating which authors contributed to which articles based on a second dictionary of author names. When some subject classification scheme is adopted, such as the category system used for Computing Reviews, then a dictionary for those entries can also be constructed. Thus, the CACM collection has term (tm), author (au), and Computing Reviews category (cr) submatrices, along with others.

More data about articles comes from the references present in the bibliographies of each publication. An N x N matrix can be built indicating which articles refer to which others, or which ones are cited by others. Similar matrices indicate the degree of bibliographic coupling or the number of co-citations received by pairs of articles. Each of these matrices then becomes a submatrix of a large N row collection matrix.

The extended vector model is thus based upon the idea of having multiple concept types. Each document vector has subvectors, one for each concept type included in the representation. Further discussion of the model, and experimental evidence of its utility, can be found in [Fox 1983b]. More detail regarding how the CACM and ISI collections are represented according to this model is given in sections 2 and 3 below.
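As a rough illustration of the extended vector idea, the sketch below stores one document as a set of sparse subvectors keyed by concept type and concatenates them into a single row of the collection matrix. The dictionary sizes, the weights, and the names `CONCEPT_SIZES` and `flatten` are assumptions made for the example; they do not come from the report or from the SMART system.

```python
from typing import Dict, List

# Hypothetical dictionary sizes for three concept types (illustrative only).
CONCEPT_SIZES = {"au": 3, "tm": 5, "cc": 4}

# A sparse subvector maps a concept number (1-based, as in the text) to a weight.
Subvector = Dict[int, float]
ExtendedVector = Dict[str, Subvector]


def flatten(doc: ExtendedVector, sizes: Dict[str, int]) -> List[float]:
    """Concatenate the subvectors into one dense row of the collection matrix."""
    row: List[float] = []
    for ctype, size in sizes.items():
        sub = doc.get(ctype, {})
        # Concepts absent from the sparse subvector get weight 0.
        row.extend(sub.get(j, 0.0) for j in range(1, size + 1))
    return row


# One document with an author, two terms, and one co-citation entry.
doc = {"au": {2: 1.0}, "tm": {1: 0.5, 4: 2.0}, "cc": {3: 3.0}}
row = flatten(doc, CONCEPT_SIZES)
```

Keeping the subvectors separate until the last step makes it easy to weight or compare concept types independently, which is the point of the model.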
`
1.2. Contrast with Other Collections

In [Fox 1983b], 1 small and 4 moderate to medium size test collections were employed to test various hypotheses. The CACM and ISI collections were two of those utilized. To provide a suitable contrast, the following subsections present summary information about all 5 of those collections.
`
`
`

`

`1.2.1. Document Collections
`
`
Table 1 summarizes essential data about each of the document collections. All collections contain information about documents such as monographs or journal articles. At the very least, in almost every case, a title and abstract were available originally. They deal with a number of subjects, from the "soft" social-science-like material that makes up a fair proportion of the ISI collection, to the terse medical articles used in Medlars studies.

Table 1: Document Collections Summary

Short     No. of   No. of   Av. No.   Subvectors             Subject                  Years
Name      Docs.    Terms    Terms     Included               Matter                   Covered
ADI           82      886     27.1    tm                     Documentation            1963
CACM       3,204   10,446     40.1    au,bi,bc,cc,cr,ln,tm   Computer Science         1958-1979
INSPEC    12,684   14,683     35.4    tm                     Electrical Engineering   1979
ISI        1,460    7,392    104.9    au,cc,tm               Information Science      1969-1977
Medlars    1,033    8,750     55.8    tm                     Medicine                 to 1969

(1) The small ADI collection is about librarianship, microforms, and other topics in documentation and information science as of 1963.

(2) The CACM documents include all articles in issues of the Communications of the ACM from the first issue in 1958 to the last number of 1979. A considerable range of computer science literature is covered by those 3204 entries in the publication that for many years served as the premier periodical in the field.
`
(3) INSPEC, which stands for Information Services in Physics, Electrotechnology, Computers and Control, covers three Science Abstracts publications: Electrical and Electronics Abstracts, Computer and Control Abstracts, and Physics Abstracts. The content focuses mainly on electrical engineering and computer science subjects.

(4) The 1460 ISI entries were selected based on co-citation information relating to a study conducted by Dr. Henry Small of the Institute for Scientific Information (ISI). They are the items that could be located at the Cornell University library out of a total of 1627 names listed in the field of information science. Each was published between 1969 and 1977 and received at least five citations.

(5) The 1033 Medlars articles were selected out of a large medical collection available at the National Library of Medicine.

Thus, the five collections are from various sources, have different sizes, and deal with a number of subjects.
`
1.2.2. Query Collections

Since considerable effort was made to study the characteristics and behavior of various query formulations, it is worthwhile to examine the five different query collections and all the versions present for each. Table 2 gives statistics (for the 4 larger collections) relating to the queries and the number of relevant documents for each query.
`
Table 2: Query and Relevant Document Characteristics

Collection   No. of    Av. Length   Rels Per Query      Rels in Top 10
Name         Queries   Cos Query    Av. No.   Av. %     Av. No.   Av. %
CACM         52        11.4         15.3      0.5       1.9       19
INSPEC       77        16.0         33.0      0.3       2.0        9
ISI          35,76      8.1         49.8      3.4       1.7        4
Medlars      30        10.4         23.2      2.3       3.8       18
`
To give some idea as to the average length of each query, that value is given for the cosine version based on the original natural language (NL). The ISI value is low, because only the 35 queries for which both vector and Boolean logic (BL) forms are available were considered, and those queries are rather short. The INSPEC queries are much longer; users were not very experienced and so described their interests with entire paragraphs, filled with many unimportant words.

Query generality, that is the number of relevant documents per query, is illustrated in the next two columns of the table. The first number is given in absolute terms and the second is a percentage of the total number of documents. CACM queries have very few relevant documents, both in actual numbers and as a percentage. INSPEC questions have roughly twice as many. Medlars queries have slightly fewer relevants, but the percentage of documents that are relevant is much higher. And the ISI collection seems to have too many relevant documents per query, both in absolute numbers and as a percentage; those queries were much too vague to give small, sharply defined relevant sets.
`
`
The final two columns of Table 2 illustrate the retrieval behavior of the queries using cosine correlation by considering the top 10 ranked documents. For Medlars, almost 4 of the first 10 are relevant on average, indicating that the queries are likely to have high precision. Further, almost 20% of the relevant documents are already retrieved, indicating that recall is probably good as well. For CACM, recall seems high, but precision is reduced to roughly half. In the other two collections, around 2 relevant documents are found in the top 10 retrieved, so precision is low as well. And there is even a lower percentage of the number of relevant documents that are identified. For ISI, in particular, only 4% of the relevant documents are found in the top 10, indicating that recall will be rather low. With so many relevant documents per query, such behavior could be expected.
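The quantities discussed for Table 2 can be sketched as follows: generality as a percentage of the collection, plus precision and recall over the top 10 ranked documents for a single query. The function names, the ranked list, and the relevant set are hypothetical; only the CACM figures passed to `generality` (15.3 relevant documents per query on average, 3204 documents) are taken from the tables above.

```python
from typing import Sequence, Set


def generality(avg_relevant: float, num_docs: int) -> float:
    """Relevant documents per query as a percentage of the collection size."""
    return 100.0 * avg_relevant / num_docs


def precision_at_k(ranked: Sequence[int], relevant: Set[int], k: int = 10) -> float:
    """Fraction of the top k ranked documents that are relevant."""
    hits = sum(1 for did in ranked[:k] if did in relevant)
    return hits / k


def recall_at_k(ranked: Sequence[int], relevant: Set[int], k: int = 10) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    hits = sum(1 for did in ranked[:k] if did in relevant)
    return hits / len(relevant)


# Hypothetical ranking and relevance judgments for one query.
ranked = list(range(1, 11))
relevant = {2, 5, 9, 40, 41}
p10 = precision_at_k(ranked, relevant)
r10 = recall_at_k(ranked, relevant)

# Average generality for CACM, using the values reported in Tables 1 and 2.
cacm_generality = generality(15.3, 3204)
```

In a real evaluation these per-query values would be averaged over all queries, as the "Av. No." and "Av. %" columns suggest.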
`
`Table 3 gives descriptive details about the five different groups of query collections.
`
Table 3: Query Collections Summary

Coll.     NL Queries                                            BL Queries
Name      No.  Origin                                           No.  Origin
ADI       35   Written by 2 Harvard computer science students   35   author - 4 forms; librarians - 1 form each
CACM      52   Cornell & other computer personnel               52   2 comp. sci. grad. students - 1 form each
INSPEC    77   Students etc. at Syracuse Univ.                  77   7 searchers at Syracuse
ISI       76   ADI, ISPRA & SIGIR Forum abstracts               35   ADI part only (see ADI coll. above)
Medlars   30   NLM files                                        30   NLM searchers, then expanded using MESH
`
There is one set of natural language queries for each of the five, but in some cases there are several Boolean forms based on each natural language query.

(1) The ADI collection originally had 35 natural language queries. Three searchers then each rephrased the natural language questions into a Boolean representation.

(2) 52 CACM queries were submitted from a variety of sources, and two students (in the Cornell graduate level information retrieval course) responsible for creating the query collection each proposed Boolean forms.

(3) One set of INSPEC queries was selected from the various types of representations devised by seven different searchers working at Syracuse University.
`
`
(4) For ISI, the same 35 natural language and Boolean queries used for ADI were employed. Multiple concept type testing, however, required more natural language queries, so 41 additional ones were selected from a set of queries for the ISPRA collection and from some new queries based on abstracts published in issues of the ACM SIGIR Forum.

(5) Medlars natural language queries were as provided from the National Library of Medicine (NLM) files and Boolean forms were subsequently constructed at Cornell. The latter were based on Boolean expressions employed by searchers. An expansion process utilizing the Medical Subject Headings of the MESH thesaurus [National Library of Medicine 1968] enabled all category names to be replaced by words that might occur in document texts.
`
2. Multiple Concept Types

To supplement the discussion of documents in section 1.2.1, it is necessary to consider the multiple concept model proposed in [Fox 1983b]. The word "concept" is used to indicate a basic item of indexing information so that a collection of concepts identifies the context of the article or monograph. Concepts can be "terms" (really approximations to stems of words from the title, abstract, or supplied list of keywords) or they may be indexing categories or bibliographic connection markers (e.g., a pointer to a document co-cited with the one being indexed).
`
Table 1 indicates how many distinct terms are present in the dictionary of word stems for each collection. The dictionary is produced as part of the automatic indexing process, and is the minimum size required to accommodate all distinct stems after a stop word list has been applied to document texts. Vocabulary size goes up rapidly as the number of documents increases (e.g., see the jumps for ADI and Medlars), but then gradually tapers off (e.g., see values for the large CACM and INSPEC collections). The size is somewhat influenced by the nature of the documents; there are more distinct stems in the 1033 Medlars medical articles, which have a specialized terminology, than in the 1460 ISI information science records, which contain fewer technical names.
`
The multiple concept type model introduced in section 1.1 calls for additional concepts besides terms. In the ISI collection, two additional types, authors and co-citations, were considered. Author names were entered in the author dictionary, which had a total of 1255 items. Co-citations have no attendant dictionary. Since each document can, a priori, be co-cited with any other, the number of concepts equals N, the number of collection documents. The value of the j-th co-citation concept for the i-th document, namely $cc_{ij}$, is the number of times the i-th and j-th documents are co-cited.
`
`Table 4 gives statistics for lSI subvectors.
`
Table 4: ISI Subvector Length Statistics

Statistic                      Subvector
Measured                 au      cc      tm
mean                    1.4    54.0    49.6
median                    1      40      47
min                       1       1       8
max                       7     276     179
stdv                    0.8    46.4    21.5
Total no. of concepts  1255    1460    7392
`
Even though there are only 1255 different authors for 1460 documents, each document has an average of 1.4 authors. The term subvectors are much longer, averaging around 50 concepts, but the standard deviation (stdv) indicates that the distribution is spread rather widely. Co-citation subvectors have about the same average length, but an even larger standard deviation. Since the ISI articles were chosen as ones with many citations, and since the collection covers the field of information science for a number of years, it is not surprising that there are so many entries in the co-citation submatrix. Since the lengths of term and co-citation vectors are comparable, comparisons between the two as to their relative utility for retrieval should be of considerable interest.
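The statistics reported for each subvector type (mean, median, min, max, stdv) can be computed from a list of subvector lengths, where a length is the number of concepts present in a document's subvector. The sketch below is illustrative only: the sample lengths are invented, and the use of the population standard deviation is an assumption, since the report does not say which form was used.

```python
import statistics
from typing import Dict, List


def length_stats(lengths: List[int]) -> Dict[str, float]:
    """Summarize subvector lengths with the statistics used in the tables."""
    return {
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "min": min(lengths),
        "max": max(lengths),
        # Assumption: population standard deviation; the report does not specify.
        "stdv": statistics.pstdev(lengths),
    }


# Hypothetical author-subvector lengths for five documents.
au_lengths = [1, 1, 2, 1, 3]
stats = length_stats(au_lengths)
```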
`
The CACM collection was developed in part to allow testing of other concept types besides those found in ISI documents. Computing Reviews categories (cr) allow a manual indexing system to be mixed in with automatic indexing entries. Bibliographic coupling (bc) indicates the number of references shared between two documents' bibliographies. Links (ln) are references to or citations from other articles.
`
`Table 5 gives statistics for the various concept types in the CACM document collection.
`
Table 5: CACM Subvector Length Statistics

Statistic                                Subvector
Measured                 au      bc      cc      cr      ln      tm
mean                    1.3     4.2     3.7     1.2     2.7    25.0
median                    1       0       0       0       2      15
min                       1       0       0       0       1       1
max                       7     183     111      28      74     168
stdv                    0.7    10.8    10.7     1.9     3.1    22.7
Total no. of concepts  2647    3204    3204     200    3204   10446

Note: The minimum length for ln is one since a document is, by definition, linked to itself, and so the diagonal of the submatrix is set to ones.
`
`
Since bc, ln, and cc subvectors are based only on connections among the chosen CACM articles, their mean length is much shorter than that of the tm subvector. The cc subvectors for CACM have average length of less than 4 concepts, while those for ISI have over 50. In a larger document collection, with better total coverage of a subject discipline, more of the citations and references would be internal to that collection and so these subvectors would probably be longer. Thus, while the ISI collection has unusually long bibliographic connection subvectors, the CACM collection has abnormally short ones, and any evidence from the CACM tests that these subvectors are useful should be suggestive of their value in a more realistic environment with large numbers of documents.
`
It should be noted that links are relatively well bounded in number since a given article rarely has a very long bibliography and the quantity of citations to an article is limited by how many entries in the collection deal with that specific topic. Co-citations are somewhat more numerous, especially in such a homogeneous collection, since many later articles may refer to a given article as well as others appearing in the same journal. Bibliographic couplings are slightly more plentiful, perhaps because many articles cite "classics" in their sub-area, and because there is a good deal of referencing of CACM articles by CACM authors.
`
`A final note about the various concept types is that in both the lSI and CACM collections, the
`
`total number of terms exceeds the number of documents. For very large collections, the opposite would
`
`be true. However, there is no reason to believe that such a consideration would affect retrieval
`
`behavior, whereas subvector length is likely to be an important consideration, relating to the specificity
`
`of indexing.
`
`3. CACM
`
`
`

`

`3.1. Documents
`
`
Robert Dattola of Xerox Corporation provided a magnetic tape containing the title, abstract, author list, keywords, Computing Reviews categories, and date of publication of articles published in the Communications of the ACM from the earliest issue, in 1958, to the last number in 1979. The collection format was changed, document numbers (dids) were assigned, and editing was done to correct spelling and typographical errors. Some duplicates were eliminated and missing articles added, and a final renumbering took place.
`
Carol Fox and Jill Warner looked through printed copies of each article to locate the bibliography. For each article, a list of the "dids" representing articles referenced in the CACM collection was eventually obtained. Many articles were located using a chronologically ordered list of all articles in the collection. However, since there were many errors in bibliographies, a special search program was written to identify the nearest matches to a supplied reference - to compensate for errors in year, month, author name, and title.
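The report does not describe how the special search program worked. As one possible approach, the sketch below ranks collection entries by a crude combined similarity of title, author, and year, using the standard library's `difflib.SequenceMatcher`. The weights, the field names, and the sample entries are all invented for illustration and should not be read as the method actually used.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def nearest_matches(reference: Dict, collection: List[Dict], top: int = 3) -> List[int]:
    """Return the dids of the entries most similar to a possibly erroneous reference."""
    scored = []
    for entry in collection:
        # Hypothetical weighting: title counts most, then author, then year.
        score = (
            0.6 * similarity(reference["title"], entry["title"])
            + 0.3 * similarity(reference["author"], entry["author"])
            + 0.1 * (1.0 if abs(reference["year"] - entry["year"]) <= 1 else 0.0)
        )
        scored.append((score, entry["did"]))
    scored.sort(reverse=True)
    return [did for _, did in scored[:top]]


# Invented collection entries and a reference containing typical errors.
collection = [
    {"did": 1, "title": "A Sorting Algorithm", "author": "Smith", "year": 1965},
    {"did": 2, "title": "Parsing Context Free Grammars", "author": "Jones", "year": 1970},
    {"did": 3, "title": "Garbage Collection in List Processing", "author": "Brown", "year": 1968},
]
reference = {"title": "Parsing Contxt Free Gramars", "author": "Jonse", "year": 1971}
best = nearest_matches(reference, collection, top=1)
```

Returning several candidates rather than a single one leaves the final decision to a person checking the printed bibliography.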
`
The purpose of obtaining that data was to form bibliographic subvectors. Based on the above mentioned lists, a relational form was produced:

Raw_data ( citing, cited )

which contained pairs of identifiers for the citing article and the one contained in the article's bibliography. Figure 1 shows the steps required, in QUEL like notation [Stonebraker et al. 1976], to produce from Raw_data first normal form [Date 1982] versions of the desired relations for bc, ln, and cc subvectors.
`
`
Figure 1: Relational Processing to Obtain
CACM Bibliographic Subvectors

Given: Raw_data (citing, cited)

Desired:
    BC (citing1, citing2, coupling_no)
    LN (link1, link2)
    CC (cited1, cited2, co-citing_no)

QUEL Like Statements:

1. For BC
    Modify Raw_data to hash on cited
    Range of bc1 is Raw_data
    Range of bc2 is Raw_data
    Retrieve into bc_entries
        ( citing1 = bc1.citing,
          citing2 = bc2.citing,
          counter = bc1.cited )
        where bc1.cited = bc2.cited
    Range of bce is bc_entries
    Retrieve into BC
        ( bce.citing1, bce.citing2,
          coupling_no = count ( bce.counter ) )

2. For LN
    Retrieve into LN
        ( link1 = bc1.citing,
          link2 = bc1.cited )
    Append to LN
        ( link2 = bc1.citing,
          link1 = bc1.cited )

3. For CC
    Modify Raw_data to hash on citing
    Range of cc1 is Raw_data
    Range of cc2 is Raw_data
    Retrieve into cc_entries
        ( cited1 = cc1.cited,
          cited2 = cc2.cited,
          counter = cc1.citing )
        where cc1.citing = cc2.citing
    Range of cce is cc_entries
    Retrieve into CC
        ( cce.cited1, cce.cited2,
          co-citing_no = count ( cce.counter ) )
`
`The actual processing was similar. Vectors were eventually formed from the relations after appropriate
`
`sorting and compression to unnormalized lists.
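The relational steps of Figure 1 can be approximated in a few lines of Python. This is a minimal sketch under stated assumptions, not the processing actually used: `raw_data` holds invented (citing, cited) pairs with no duplicates, `self_join_counts` is a hypothetical helper standing in for the self-join plus count, and pairs of a document with itself are dropped, whereas the join in Figure 1 as written would retain them.

```python
from collections import Counter
from itertools import product
from typing import Dict, List, Tuple

# Invented (citing, cited) pairs standing in for Raw_data.
raw_data: List[Tuple[int, int]] = [(4, 1), (4, 2), (5, 1), (5, 2), (5, 3), (6, 3)]


def self_join_counts(pairs: List[Tuple[int, int]], key_index: int) -> Counter:
    """Group pairs on one column, then count co-occurrences in the other column."""
    groups: Dict[int, List[int]] = {}
    for pair in pairs:
        groups.setdefault(pair[key_index], []).append(pair[1 - key_index])
    counts: Counter = Counter()
    for members in groups.values():
        for a, b in product(members, repeat=2):
            if a != b:  # assumption: ignore a document paired with itself
                counts[(a, b)] += 1
    return counts


# BC: join on cited, so each count is the number of shared references.
bc = self_join_counts(raw_data, key_index=1)
# CC: join on citing, so each count is the number of articles citing both.
cc = self_join_counts(raw_data, key_index=0)
# LN: every reference in both directions (references to and citations from).
ln = set(raw_data) | {(cited, citing) for citing, cited in raw_data}
```

Here documents 4 and 5 share two references, so `bc[(4, 5)]` is 2, and documents 1 and 2 are cited together by both 4 and 5, so `cc[(1, 2)]` is 2.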
`
Since the CACM collection is the first one studied with so many different concept types, and since the data is available for a number of years, various histograms, scatter plots, and charts are given in the next subsection. The intention is to illustrate the form, content, and distribution of the different types of bibliographic concepts.
`
3.2. Illustrative Charts and Figures
`
Table 5 gave statistics on the CACM subvector lengths. However, a much more detailed understanding can be gleaned from the graphical presentations below. To begin with, Figures 2 through 6 are histograms with mean values given for various statistics for each year in the period 1958-1979.

Figure 2 shows the number of articles each year. Clearly, the publication grew in size during the early years, and then there were changes in editorial policies at various later times. For example, in the early years, computing algorithms in fairly large numbers were published, each one counting as an article. Subsequently this practice was discontinued and algorithms were collected for a separate publication.
`
Figure 3 shows the average number of citations to an article for each year. Lower values in later years can be explained by the fact that there were few articles in the collection that were published subsequently, and so citations were not possible. Early articles were not highly cited, perhaps because the first volumes had many methodological or other reports that were not of great interest afterwards. The peak in citations in the middle years is attributable to the fact that CACM was a key publication in computer science during that time, and many important developments were described there, especially after 1962. One might expect that the three bibliographic subvectors would vary in length depending in part on the distribution just described of citations.
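A per-year average of this kind could be computed from the (citing, cited) pairs and each article's publication year, roughly as sketched below. The function name and the sample data are hypothetical, and only citations from within the collection are counted, in keeping with the way the bibliographic data was gathered.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def mean_citations_per_year(
    pub_year: Dict[int, int], raw_data: List[Tuple[int, int]]
) -> Dict[int, float]:
    """Average number of citations received by the articles published in each year."""
    received = Counter(cited for _, cited in raw_data)
    totals: Dict[int, int] = defaultdict(int)
    articles: Counter = Counter()
    for did, year in pub_year.items():
        articles[year] += 1
        totals[year] += received[did]  # zero for articles never cited
    return {year: totals[year] / articles[year] for year in sorted(articles)}


# Invented data: four articles and three internal citations.
pub_year = {1: 1960, 2: 1960, 3: 1965, 4: 1970}
pairs = [(3, 1), (4, 1), (4, 3)]
means = mean_citations_per_year(pub_year, pairs)
```

The most recent article in the example receives no citations simply because nothing later exists in the collection to cite it, which mirrors the explanation given for the lower values in later years.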
`
Figure 4 shows the average number of bibliographic couplings per year. In general, the number increases as time goes by, since there are more available prior articles in the collection that can be referred to. Thus, the numbers of couplings for years 1976 through 1978 are very high - bibliographic references can be to any previous article, even back to 1958. All in all, the distribution is a fairly uniform one, perhaps explaining why there is such a large standard deviation in bc subvector lengths.
`
Figure 5 gives the average number of links for each year. Since links include both references and citations, it is not surprising that the curve is a relatively flat one. Only for the early years, when there were not many prior articles to refer to and when the subject matter was not such as to elicit many later citations, is the level of links fairly low.
`
Figure 6, showing average number of co-citations, has a very different form than that of the other figures shown. A high peak appears near the middle, and lower values are at either tail. Apparently, 1966 was a good year, since articles then were co-cited with many others. Referring back to Figure 3
`
`
Figure 2: Mean Number of CACM Articles Published Each Year
[Histogram. Vertical axis: No. of Articles (0 to 300). Horizontal axis: Year of CACM Volume Considered (1958 to 1979).]
`
`
`

`

Figure 3: Mean Number of Citations Per Article Per Year
[Histogram. Vertical axis: No. of Citations.]
