Characterization of Two New
Experimental Collections in Computer
and Information Science Containing
Textual and Bibliographic Concepts

Edward A. Fox *

83-561
September 1983

* Department of Computer Science
Cornell University
Ithaca, New York 14853

now at:
Department of Computer Science
Virginia Tech
Blacksburg, VA 24061

This work was supported in part by the National Science Foundation, under grant IST-81-08696.
`
Facebook Inc. Ex. 1007
`EXHIBIT 2012
`Facebook, Inc. et al.
`v.
`Software Rights Archive, LLC
`CASE IPR2013-00480
`
`
`
TABLE OF CONTENTS

1 Introduction
    1.1 Extended Vectors
    1.2 Contrast with Other Collections
        1.2.1 Document Collections
        1.2.2 Query Collections
2 Multiple Concept Types
3 CACM
    3.1 Documents
    3.2 Illustrative Charts and Figures
    3.3 Queries
    3.4 Relevance Judgments
    3.5 Retrieval Performance
4 ISI
    4.1 Documents
        4.1.1 Background
        4.1.2 Tape Information Provided
        4.1.3 Collection Preparation
    4.2 Subvectors
    4.3 Queries
    4.4 Relevance Judgments
    4.5 Retrieval Performance
5 Conclusion
References
`
`
`
`CHARACTERIZATION OF TWO NEW EXPERIMENTAL
`COLLECTIONS IN COMPUTER AND INFORMATION SCIENCE
`CONTAINING TEXTUAL AND BIBLIOGRAPHIC CONCEPTS
`
`Edward A. Fox *
`
Abstract

Two new collections are described which are particularly useful for investigating the interaction between textual and bibliographic data in the automatic indexing and retrieval of documents. An extension to the vector space model has been proposed whereby various types of concepts are included in the representation of such documents. Experiments using an enhanced version of the SMART system have shown such an extended model to perform better than simpler schemes. The CACM and ISI collections developed for this research should be of value for future related studies.

The ISI collection has author, title/abstract, and co-citation data for the 1460 most highly cited articles and manuscripts in information science in the 1969-1977 period. The CACM collection contains 7 types of concepts for the 3204 articles published in the Communications of the ACM up through 1979. These collections have 76 and 52 queries, respectively, along with relevance judgments.

*Department of Computer Science, Cornell University, Ithaca, NY 14853; now at Dept. of Computer Science, Virginia Tech, Blacksburg, VA 24061. This work was supported in part by the National Science Foundation, under grant IST-81-08896.
`
`
`
`
1. Introduction

In order to retrieve documents relevant to the request of a particular user it is necessary to first index or represent the content of articles and manuscripts. For many years this has been done by trained indexers who assign keyword lists or sets of descriptors from a controlled vocabulary [Borko & Bernier 1978]. Since the early 1960's an alternative method of automatic indexing has been developed whereby word stems, words, phrases, or thesaurus category indicators are selected from the title and abstract and a weighted vector indicating the importance of each is constructed [Salton 1980]. Part of this report deals with the vectors derived in this fashion from two collections in information and computer science.

Another source of data about documents is their bibliographic references. Citation indexes can be used to locate those entries referred to by an article, or which cite it [Garfield 1964, 1979]. Linkages between documents based on bibliographic coupling [Kessler 1962] and co-citation counts [Small 1973] have been utilized for a variety of analysis and retrieval purposes (e.g., [Bichteler & Eaton 1980], [Garfield 1970], [Kessler 1963a, 1963b, 1965], [Small & Koenig 1977], [Small 1978, 1980, 1981], [Weinberg 1974]). Preliminary experimentation has shown that the vectors produced by automatic indexing of document texts can be usefully supplemented by bibliographic information to produce a representation that can be more effectively searched than if either component were used alone ([Michelson et al. 1971], [Salton 1963, 1971]).
`
To facilitate exploration of the effects of extending the vector space model to include a variety of types of concepts it was necessary to have test collections containing such concepts. One collection containing 1460 of the most highly cited documents in information science published between 1969 and 1977 [Small 1981] was developed based on citation and co-citation data provided by the Institute for Scientific Information. This ISI collection contains three types of concepts: author names, word stems from the title and abstract sections, and co-citations between each of the articles. A second collection, of the 3204 articles published in Communications of the ACM in the years up through 1979, contains the above three types of concepts plus: categories assigned from a hierarchical subject classification scheme, bibliographic coupling connections, date information, and direct references between articles. The CACM collection also includes 52 user supplied queries in 3 different forms and relevance judgments indicating which documents relate to which queries. The ISI collection has a total of 76 queries.

This report describes the CACM and ISI collections in detail. It supplements the theoretical and experimental descriptions of [Fox 1983b] and the elaboration on implementation issues contained in [Fox 1983a]. Ample discussion has been included to enable interested researchers to understand the characteristics of these collections and to determine if they might be of use in related research investigations.

The rest of this introductory section provides useful background for subsequent sections. The notion of extended vectors is introduced, and some overview tables are shown indicating how these collections relate to other test collections discussed in [Fox 1983b].
`
1.1. Extended Vectors

When only the terms of documents are considered, a simple collection representation scheme results. A dictionary can be formed to contain the T distinct word stems and so textually based concepts can be numbered from 1 through T. Each of the N documents $D_i$ is represented by a vector of length T,

    $D_i = (tm_{i1}, tm_{i2}, \ldots, tm_{iT})$    (1-1)

so the entire collection is an N x T matrix,

    $C = (tm_{ij}), \quad 1 \le i \le N, \; 1 \le j \le T.$    (1-2)
`
`
If other types of information are provided in addition, then each term vector (1-1) can become a subvector of a more complete document vector. Similarly, the matrix (1-2) becomes the term submatrix. It makes sense to have a separate author submatrix indicating which authors contributed to which articles based on a second dictionary of author names. When some subject classification scheme is adopted, such as the category system used for Computing Reviews, then a dictionary for those entries can also be constructed. Thus, the CACM collection has term (tm), author (au), and Computing Reviews category (cr) submatrices, along with others.

More data about articles comes from the references present in the bibliographies of each publication. An N x N matrix can be built indicating which articles refer to which others, or which ones are cited by others. Similar matrices indicate the degree of bibliographic coupling or the number of co-citations received by pairs of articles. Each of these matrices then becomes a submatrix of a large N row collection matrix.

The extended vector model is thus based upon the idea of having multiple concept types. Each document vector has subvectors, one for each concept type included in the representation. Further discussion of the model, and experimental evidence of its utility, can be found in [Fox 1983b]. More detail regarding how the CACM and ISI collections are represented according to this model is given in sections 2 and 3 below.
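
The subvector idea can be illustrated with a minimal sketch. It is not the SMART representation; the type names `Subvector` and `ExtendedVector`, the helper functions, and the sample concept numbers are all assumptions made for illustration. Each document is held as a mapping from a concept type code (tm, au, cc, and so on) to a sparse subvector.

```python
from typing import Dict, List

# A sparse subvector maps a concept number (1-based) to its weight.
Subvector = Dict[int, float]
# An extended vector maps a concept type code (e.g. "tm", "au", "cc") to a subvector.
ExtendedVector = Dict[str, Subvector]


def subvector_lengths(doc: ExtendedVector) -> Dict[str, int]:
    """Return the number of nonzero concepts in each subvector."""
    return {ctype: sum(1 for w in sub.values() if w != 0)
            for ctype, sub in doc.items()}


def to_dense(sub: Subvector, size: int) -> List[float]:
    """Expand a sparse subvector into one dense row of a submatrix."""
    row = [0.0] * size
    for concept, weight in sub.items():
        row[concept - 1] = weight  # concepts are numbered from 1
    return row


# Hypothetical document: three term stems, one author,
# and co-citations with documents 7 and 12.
doc: ExtendedVector = {
    "tm": {3: 1.0, 17: 2.0, 42: 1.0},
    "au": {5: 1.0},
    "cc": {7: 2.0, 12: 1.0},
}
lengths = subvector_lengths(doc)
term_row = to_dense(doc["tm"], size=50)
```

Keeping one sparse mapping per concept type means a collection can add or drop a concept type without renumbering the concepts of the others, which is one plausible reason to keep the dictionaries separate.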
`
1.2. Contrast with Other Collections

In [Fox 1983b], 1 small and 4 moderate to medium size test collections were employed to test various hypotheses. The CACM and ISI collections were two of those utilized. To provide a suitable contrast, the following subsections present summary information about all 5 of those collections.
`
`
`
`
1.2.1. Document Collections

Table 1 summarizes essential data about each of the document collections. All collections contain information about documents such as monographs or journal articles. At the very least, in almost every case, a title and abstract were available originally. They deal with a number of subjects, from the "soft" social science like material that makes up a fair proportion of the ISI collection, to the terse, medical articles used in Medlars studies.

Table 1: Document Collections Summary

Short     No. of   No. of   Av.No.  Subvectors            Subject                 Years
Name      Docs.    Terms    Terms   Included              Matter                  Covered
ADI           82      888    27.1   tm                    Documentation           1963
CACM       3,204   10,448    40.1   au,bi,bc,cc,cr,ln,tm  Computer Science        1958-1979
INSPEC    12,684   14,683    35.4   tm                    Electrical Engineering  1979
ISI        1,460    7,392   104.9   au,cc,tm              Information Science     1969-1977
Medlars    1,033    8,750    55.8   tm                    Medicine                to 1969

The small ADI collection is about librarianship, microforms, and other topics in documentation and information science as of 1963.

The CACM documents include all articles in issues of the Communications of the ACM from the first issue in 1958 to the last number of 1979. A considerable range of computer science literature is covered by those 3204 entries in the publication that for many years served as the premier periodical in the field.
`
INSPEC, which stands for Information Services in Physics, Electrotechnology, Computers and Control, covers three Science Abstracts publications: Electrical and Electronics Abstracts, Computer and Control Abstracts, and Physics Abstracts. The content focuses mainly on electrical engineering and computer science subjects.

The 1460 ISI entries were selected based on co-citation information relating to a study conducted by Dr. Henry Small of the Institute for Scientific Information® (ISI®). They are the items that could be located at the Cornell University library out of a total of 1827 names listed in the field of information science. Each was published between 1969 and 1977 and received at least five citations.

The 1033 Medlars articles were selected out of a large medical collection available at the National Library of Medicine.

Thus, the five collections are from various sources, have different sizes, and deal with a number of subjects.
`
1.2.2. Query Collections

Since considerable effort was made to study the characteristics and behavior of various query formulations, it is worthwhile to examine the five different query collections and all the versions present for each. Table 2 gives statistics (for the 4 larger collections) relating to the queries and the number of relevant documents for each query.
`
`
Table 2: Query and Relevant Document Characteristics

Collection  No. of    Av. Length   Rels Per Query    Rels in Top 10
Name        Queries   Cos Query    Av.No.   Av.%     Av.No.   Av.%
CACM        52        11.4         15.3     0.5      1.9      19
INSPEC      77        16.0         33.0     0.3      2.0       9
ISI         35, 76     8.1         49.8     3.4      1.7       4
Medlars     30        10.4         23.2     2.3      3.8      18
`
To give some idea as to the average length of each query, that value is given for the cosine version based on the original natural language (NL). The ISI value is low, because only the 35 queries for which both vector and Boolean logic (BL) forms are available were considered, and those queries are rather short. The INSPEC queries are much longer; users were not very experienced and so described their interests with entire paragraphs, filled with many unimportant words.

Query generality, that is the number of relevant documents per query, is illustrated in the next two columns of the table. The first number is given in absolute terms and the second is a percentage of the total number of documents. CACM queries have very few relevant documents, both in actual numbers and as a percentage. INSPEC questions have roughly twice as many. Medlars queries have slightly fewer relevant documents, but the percentage of documents that are relevant is much higher. And the ISI collection seems to have too many relevant documents per query, both in absolute numbers and as a percentage; those queries were much too vague to give small, sharply defined relevant sets.
`
`
The final two columns of Table 2 illustrate the retrieval behavior of the queries using cosine correlation by considering the top 10 ranked documents. For Medlars, almost 4 of the first 10 are relevant on average, indicating that the queries are likely to have high precision. Further, almost 20% of the relevant documents are already retrieved, indicating that recall is probably good as well. For CACM, recall seems high, but precision is reduced to roughly half. In the other two collections, around 2 relevant documents are found in the top 10 retrieved, so precision is low as well. And there is an even lower percentage of the number of relevant documents that are identified. For ISI, in particular, only 4% of the relevant documents are found in the top 10, indicating that recall will be rather low. With so many relevant documents per query, such behavior could be expected.
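
The per-query quantities behind the last four columns of Table 2 can be sketched as follows. This is an illustrative reading of the column definitions, not the code used for the report; the function names and the sample ranking are assumptions. The table values would then be averages of these figures over all queries of a collection.

```python
def top_k_stats(ranking, relevant, k=10):
    """Return (hits, precision, recall) for the first k ranked documents."""
    hits = sum(1 for did in ranking[:k] if did in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return hits, precision, recall


def generality(relevant, n_docs):
    """Percentage of the whole collection that is relevant to one query."""
    return 100.0 * len(relevant) / n_docs


# Hypothetical query: 15 relevant documents, 3 of them ranked in the top 10.
ranking = list(range(1, 21))                 # document ids in ranked order
relevant = {2, 5, 9} | set(range(101, 113))  # 3 + 12 = 15 relevant ids
hits, precision, recall = top_k_stats(ranking, relevant)
percent_relevant = generality(relevant, n_docs=3204)
```

With many relevant documents per query, recall in the top 10 is bounded above by 10 divided by the number of relevant documents, which is consistent with the low ISI percentage noted above.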
`
`Table 3 gives descriptive details about the five different groups of query collections.
`
`
Table 3: Query Collections Summary

Coll.     NL Queries                                BL Queries
Name      No.  Origin                               No.  Origin
ADI       35   Written by 2 Harvard computer        35   author - 4 forms;
               science students                          librarians - 1 form each
CACM      52   Cornell & other computer             52   2 comp. sci. grad. students -
               personnel                                 1 form each
INSPEC    77   Students etc. at Syracuse Univ.      77   7 searchers at Syracuse
ISI       76   ADI, ISPRA & SIGIR Forum             35   ADI part only
               abstracts                                 (see ADI coll. above)
Medlars   30   NLM files                            30   NLM searchers, then expanded
                                                         using MESH
`
There is one set of natural language queries for each of the five, but in some cases there are several Boolean forms based on each natural language query.

The ADI collection originally had 35 natural language queries. Three searchers then each rephrased the natural language questions into a Boolean representation.

52 CACM queries were submitted from a variety of sources, and two students (in the Cornell graduate level information retrieval course) responsible for creating the query collection each proposed Boolean forms.

One set of INSPEC queries was selected from the various types of representations devised by seven different searchers working at Syracuse University.
`
`
For ISI, the same 35 natural language and Boolean queries used for ADI were employed. Multiple concept type testing, however, required more natural language queries, so 41 additional ones were selected from a set of queries for the ISPRA collection and from some new queries based on abstracts published in issues of the ACM SIGIR Forum.

Medlars natural language queries were as provided from the National Library of Medicine (NLM) files and Boolean forms were subsequently constructed at Cornell. The latter were based on Boolean expressions employed by searchers. An expansion process utilizing the Medical Subject Headings of the MESH thesaurus [National Library of Medicine 1968] enabled all category names to be replaced by words that might occur in document texts.
`
2. Multiple Concept Types

To supplement the discussion of documents in section 1.2.1, it is necessary to consider the multiple concept model proposed in [Fox 1983b]. The word "concept" is used to indicate a basic item of indexing information so that a collection of concepts identifies the content of the article or monograph. Concepts can be "terms" - really approximations to stems of words from the title, abstract, or supplied list of keywords - or they may be indexing categories or bibliographic connection markers (e.g., a pointer to a document co-cited with the one being indexed).

Table 1 indicates how many distinct terms are present in the dictionary of word stems for each collection. The dictionary is produced as part of the automatic indexing process, and is the minimum size required to accommodate all distinct items after a stop word list has been applied to document texts. Vocabulary size goes up rapidly as the number of documents increases (e.g., see the jumps for ADI and Medlars), but then gradually tapers off (e.g., see values for the large CACM and INSPEC collections). The size is somewhat influenced by the nature of the documents; there are more distinct stems in the 1033 Medlars medical articles, which have a specialized terminology, than in the 1460 ISI information science records, which contain fewer technical names.
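
How a dictionary grows as documents are indexed can be sketched as below. This is only an illustration: the stop list is a tiny made-up sample, the tokenizer is a simple regular expression, and no stemming is applied, whereas the collections described here use word stems. The function name `build_dictionary` is an assumption.

```python
import re

# Tiny illustrative stop list; a real stop word list is far longer.
STOP_WORDS = {"a", "an", "and", "for", "in", "is", "of", "the", "to"}


def build_dictionary(texts, stop_words=STOP_WORDS):
    """Number each distinct non-stop word from 1 upward, in first-seen order."""
    dictionary = {}
    for text in texts:
        for token in re.findall(r"[a-z]+", text.lower()):
            if token in stop_words:
                continue
            if token not in dictionary:
                dictionary[token] = len(dictionary) + 1
    return dictionary


texts = [
    "The retrieval of documents",
    "Indexing of documents and retrieval",
]
dictionary = build_dictionary(texts)
```

The second text adds only one new entry, a small-scale version of the tapering vocabulary growth described above.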
`
The multiple concept type model introduced in section 1.1 calls for additional concepts besides terms. In the ISI collection, two additional types, authors and co-citations, were considered. Author names were entered in the author dictionary, which had a total of 1255 items. Co-citations have no attendant dictionary. Since each document can, a priori, be co-cited with any other, the number of concepts equals N, the number of collection documents. The value of the j-th co-citation concept for the i-th document, namely $cc_{ij}$, is the number of times the i-th and j-th documents are co-cited. Table 4 gives statistics for ISI subvectors.
`
Table 4: ISI Subvector Length Statistics

Statistic            Subvector
Measured          au       cc       tm
mean              1.4     54.0     49.8
median            1       40       47
min               1        1        8
max               7      278      179
stdv              0.8     48.4     21.5
Total no.
of concepts    1255     1460     7392
`
Even though there are only 1255 different authors for 1460 documents, each document has an average of 1.4 authors. The term subvectors are much longer, averaging around 50 concepts, but the standard deviation (stdv) indicates that the distribution is spread rather widely. Co-citation subvectors have about the same average length, but an even larger standard deviation. Since the ISI articles were chosen as ones with many citations, and since the collection covers the field of information science for a number of years, it is not surprising that there are so many entries in the co-citation submatrix. Since the lengths of term and co-citation vectors are comparable, comparisons between the two as to their relative utility for retrieval should be of considerable interest.
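
The statistics reported for each subvector type can be computed from a list of subvector lengths, as in this minimal sketch. The report does not say whether the population or the sample standard deviation was used; the population form (`statistics.pstdev`) is assumed here, and the function name and sample data are illustrative.

```python
import statistics


def length_stats(lengths):
    """Summarize subvector lengths with the five statistics used in the tables."""
    return {
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "stdv": statistics.pstdev(lengths),  # population form: an assumption
    }


# Hypothetical author subvector lengths for five documents.
author_lengths = [1, 1, 2, 1, 1]
stats = length_stats(author_lengths)
```

A large standard deviation relative to the mean, as for the co-citation subvectors, signals a long-tailed distribution in which a few documents have very many connections.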
`
The CACM collection was developed in part to allow testing of other concept types besides those found in ISI documents. Computing Reviews categories (cr) allow a manual indexing system to be mixed in with automatic indexing entries. Bibliographic coupling (bc) indicates the number of references shared between two documents' bibliographies. Links (ln) are references to or citations from other articles.

Table 5 gives statistics for the various concept types in the CACM document collection.
`
Table 5: CACM Subvector Length Statistics

Statistic                        Subvector
Measured          au      bc      cc      cr      ln      tm
mean              1.3     4.2     3.7     1.2     2.7    25.0
median            1       0       0       0       2      15
min               1       0       0       0       1       1
max               7     183     111      28      74     188
stdv              0.7    10.8    10.7     1.9     3.1    22.7
Total no.
of concepts    2847    3204    3204     200    3204   10448

Note: The minimum length for ln is one since a document is, by definition, linked to itself, and so the diagonal of the submatrix is set to ones.
`
`
Since bc, cc, and ln subvectors are based only on connections among the chosen CACM articles, their mean length is much shorter than that of the tm subvector. The cc subvectors for CACM have average length of less than 4 concepts, while those for ISI have over 50. In a larger document collection, with better total coverage of a subject discipline, more of the citations and references would be internal to that collection and so these subvectors would probably be longer. Thus, while the ISI collection has unusually long bibliographic connection subvectors, the CACM collection has abnormally short ones, and any evidence from the CACM tests that these subvectors are useful should be suggestive of their value in a more realistic environment with large numbers of documents.

It should be noted that links are relatively well bounded in number since a given article rarely has a very long bibliography and the quantity of citations to an article is limited by how many entries in the collection deal with that specific topic. Co-citations are somewhat more numerous, especially in such a homogeneous collection, since many later articles may refer to a given article as well as others appearing in the same journal. Bibliographic couplings are slightly more plentiful, perhaps because many articles cite "classics" in their sub-area, and because there is a good deal of referencing of CACM articles by CACM authors.
`
`A final note about the various concept types is that in both the IS! and CACM collections, the
`
`total number of terms exceeds the number of documents. For very large collections, the opposite would
`
`be true. However, there is no reason to believe that such a consideration would affect retrieval
`
`behavior, whereas subvector length is likely to be an important consideration, relating to the specificity
`
`of indexing.
`
`3. CACM
`
`
3.1. Documents

Robert Dattola of Xerox Corporation provided a magnetic tape containing the title, abstract, author list, keywords, Computing Reviews categories, and date of publication of articles published in the Communications of the ACM from the earliest issue, in 1958, to the last number in 1979. The collection format was changed, document numbers (dids) were assigned, and editing was done to correct spelling and typographical errors. Some duplicates were eliminated and missing articles added, and a final renumbering took place.

Carol Fox and Jill Warner looked through printed copies of each article to locate the bibliography. For each article, a list of the "dids" representing articles referenced in the CACM collection was eventually obtained. Many articles were located using a chronologically ordered list of all articles in the collection. However, since there were many errors in bibliographies, a special search program was written to identify the nearest matches to a supplied reference - to compensate for errors in year, month, author name, and title.
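
The report does not describe how the special search program scored candidates. The following is a hypothetical sketch of one way to find nearest matches, using the standard library's `difflib.SequenceMatcher`; the weights, the one-year tolerance, the function names, and the catalog entries are all invented for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive string similarity in the range 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def nearest_matches(reference, catalog, limit=3):
    """Rank catalog dids by closeness to an (author, title, year) reference.

    catalog maps did -> (author, title, year). The weights are assumptions.
    """
    ref_author, ref_title, ref_year = reference
    scored = []
    for did, (author, title, year) in catalog.items():
        score = (0.4 * similarity(ref_author, author)
                 + 0.5 * similarity(ref_title, title)
                 + 0.1 * (1.0 if abs(ref_year - year) <= 1 else 0.0))
        scored.append((score, did))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first
    return [did for _, did in scored[:limit]]


# Made-up catalog entries and a reference with spelling and year errors.
catalog = {
    1: ("Smith", "Sorting by merging", 1961),
    2: ("Jones", "A note on hashing", 1968),
    3: ("Smith", "Parsing arithmetic expressions", 1964),
}
candidates = nearest_matches(("Smyth", "Sorting by mergeing", 1962), catalog)
```

Returning a short ranked list rather than a single answer leaves the final decision to a person, which suits bibliographies known to contain many errors.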
`
The purpose of obtaining that data was to form bibliographic subvectors. Based on the above mentioned lists, a relational form was produced:

    Raw_data (citing, cited)

which contained pairs of identifiers for the citing article and the one contained in the article's bibliography. Figure 1 shows the steps required, in QUEL like notation [Stonebraker et al. 1976], to produce from Raw_data first normal form [Date 1982] versions of the desired relations for bc, ln, and cc subvectors.
`
`
Figure 1: Relational Processing to Obtain
CACM Bibliographic Subvectors

Given: Raw_data (citing, cited)

Desired:
    BC (citing1, citing2, coupling_no)
    LN (link1, link2)
    CC (cited1, cited2, co-citing_no)

QUEL Like Statements:

1. For BC
    Modify Raw_data to hash on cited
    Range of bc1 is Raw_data
    Range of bc2 is Raw_data
    Retrieve into bc_entries
        ( citing1 = bc1.citing,
          citing2 = bc2.citing,
          counter = bc1.cited )
        where bc1.cited = bc2.cited
    Range of bce is bc_entries
    Retrieve into BC
        ( bce.citing1, bce.citing2,
          coupling_no = count (bce.counter) )

2. For LN
    Retrieve into LN
        ( link1 = bc1.citing,
          link2 = bc1.cited )
    Append to LN
        ( link2 = bc1.citing,
          link1 = bc1.cited )

3. For CC
    Modify Raw_data to hash on citing
    Range of cc1 is Raw_data
    Range of cc2 is Raw_data
    Retrieve into cc_entries
        ( cited1 = cc1.cited,
          cited2 = cc2.cited,
          counter = cc1.citing )
        where cc1.citing = cc2.citing
    Range of cce is cc_entries
    Retrieve into CC
        ( cce.cited1, cce.cited2,
          co-citing_no = count (cce.counter) )
`
The actual processing was similar. Vectors were eventually formed from the relations after appropriate sorting and compression to unnormalized lists.
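
The three relations of Figure 1 can also be derived in memory, as in the sketch below. It is not the processing actually used. It assumes duplicate (citing, cited) rows are ignored and that an article is not paired with itself in BC or CC, which the self-join in Figure 1 would otherwise produce; the function name and sample pairs are illustrative.

```python
from collections import Counter, defaultdict
from itertools import permutations


def bibliographic_relations(raw_data):
    """Derive BC, LN and CC from (citing, cited) pairs."""
    pairs = set(raw_data)         # ignore duplicate rows (assumption)
    cited_by = defaultdict(set)   # cited article -> articles that cite it
    refs_of = defaultdict(set)    # citing article -> articles it cites
    for citing, cited in pairs:
        cited_by[cited].add(citing)
        refs_of[citing].add(cited)

    # BC: two citing articles are coupled once per reference they share.
    bc = Counter()
    for citers in cited_by.values():
        for a, b in permutations(sorted(citers), 2):
            bc[(a, b)] += 1

    # CC: two cited articles are co-cited once per article citing both.
    cc = Counter()
    for refs in refs_of.values():
        for a, b in permutations(sorted(refs), 2):
            cc[(a, b)] += 1

    # LN: a link holds in both directions (reference or citation).
    ln = set()
    for citing, cited in pairs:
        ln.add((citing, cited))
        ln.add((cited, citing))
    return bc, ln, cc


# Hypothetical data: articles 3, 4 and 5 cite articles 1 and 2.
raw = [(3, 1), (3, 2), (4, 1), (4, 2), (5, 2)]
bc, ln, cc = bibliographic_relations(raw)
```

Here articles 3 and 4 share two references, so their coupling count is 2, and articles 1 and 2 are cited together by both 3 and 4, so their co-citation count is also 2.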
`
Since the CACM collection is the first one studied with so many different concept types, and since the data is available for a number of years, various histograms, scatter plots, and charts are given in the next subsection. The intention is to illustrate the form, content, and distribution of the different types of bibliographic concepts.
`
3.2. Illustrative Charts and Figures

Table 5 gave statistics on the CACM subvector lengths. However, a much more detailed understanding can be gleaned from the graphical presentations below. To begin with, Figures 2 through 6 are histograms with mean values given for various statistics for each year in the period 1958-1979.

Figure 2 shows the number of articles each year. Clearly, the publication grew in size during the early years, and then there were changes in editorial policies at various later times. For example, in the early years, computing algorithms in fairly large numbers were published, each one counting as an article. Subsequently this practice was discontinued and algorithms were collected for a separate publication.

Figure 3 shows the average number of citations to an article for each year. Lower values in later years can be explained by the fact that there were few articles in the collection that were published subsequently, and so citations were not possible. Early articles were not highly cited, perhaps because the first volumes had many methodological or other reports that were not of great interest afterwards. The peak in citations in the middle years is attributable to the fact that CACM was a key publication in computer science during that time, and many important developments were described there, especially after 1962. One might expect that the three bibliographic subvectors would vary in length depending in part on the distribution of citations just described.

Figure 4 shows the average number of bibliographic couplings per year. In general, the number increases as time goes by, since there are more available prior articles in the collection that can be referred to. Thus, the number of couplings for years 1976 through 1978 are very high - bibliographic references can be to any previous article, even back to 1958. All in all, the distribution is a fairly uniform one, perhaps explaining why there is such a large standard deviation in bc subvector lengths.

Figure 5 gives the average number of links for each year. Since links include both references and citations, it is not surprising that the curve is a relatively flat one. Only for the early years, when there were not many prior articles to refer to and when the subject matter was not such as to elicit many later citations, is the level of links fairly low.
`
Figure 6, showing average number of co-citations, has a very different form from that of the other figures shown. A high peak appears near the middle, and lower values are at either tail. Apparently, 1966 was a good year, since articles then were co-cited with many others. Referring back to Figure 3, one sees that the most citations, after those to 1968, were to articles in 1966. Indeed, an examination of articles in that year does reinforce the conviction that there were many important articles that year, and that they related to a considerable number of other articles. The number of co-citations could serve as another indication of merit, to go along with the number of citations, but probably the emphasis on number of related articles rather than number of citing articles would reduce the appeal of such a proposal. In any case, one would expect the average number of co-citations to be greatest during the middle years of the range, since there are more articles later that can cite both the given article and one other (which could be either earlier or later).

[Figure 2: Mean Number of CACM Articles Published Each Year. Histogram; horizontal axis: Year of CACM Volume Considered (1958-79); vertical axis: No. of Articles (0-300).]

[Figure 3: Mean Number of Citations Per Article Per Year. Histogram; horizontal axis: Year; vertical axis: No. of Citations.]

[Figure 4: Mean Number of Bibliographic Couplings Per Article By Year. Histogram; horizontal axis: Year; vertical axis: No. of Bibliographic Couplings.]

[Figure 5: Mean Number of Links Per Article By Year. Histogram; horizontal axis: Year; vertical axis: No. of Links.]
`
Figure 7 highlights the difference in the distribution of bc and cc lengths over the years. Bibliographic couplings are greatest towards the end of the range, and co-citations peak between the beginning and middle of the period. Hence, when mean bc length is divided by mean cc length, the ratio is around 1-2 for middle years, lower beforehand, and higher after. An obvious strategy is to somehow add together the effects of bibliographic coupling and co-citations, since then the overall depth of bibliographic connection based indexing would be fairly uniform over the complete range of years of interest.
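
The ratio plotted in Figure 7 can be sketched as follows, reading its caption as the mean of the per-article BC/CC ratio, counted only when both values are positive. Whether each value is a subvector length or a sum of counts is not stated, so the inputs here are simply called lengths; the function name and data are hypothetical.

```python
from collections import defaultdict


def mean_ratio_by_year(articles):
    """articles: iterable of (year, bc_length, cc_length) triples.

    Returns {year: mean of bc/cc over articles where both are positive}.
    """
    ratios = defaultdict(list)
    for year, bc_len, cc_len in articles:
        if bc_len > 0 and cc_len > 0:  # skip articles lacking either value
            ratios[year].append(bc_len / cc_len)
    return {year: sum(vals) / len(vals)
            for year, vals in sorted(ratios.items())}


# Hypothetical articles; the third is skipped because its bc length is zero.
articles = [(1965, 2, 4), (1965, 3, 2), (1965, 0, 5), (1975, 6, 2)]
ratio_by_year = mean_ratio_by_year(articles)
```

A year in which no article has both values positive is simply absent from the result rather than reported as zero.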
`
Turning next to distributions of frequencies for the various subvectors, one sees the type of curves that are expected, in Figures 8 through 11. The number of documents for each subvector length is shown. Almost every article has fewer than 10 Computing Reviews categories assigned or 10 links to other articles. There are a goodly number of documents, however, with fairly long bc and cc subvectors, as can be seen in Figures 10 and 11.

Most articles have fewer than 4 entries in their cr subvector, since categories are usually fairly broad and few articles deal with a much wider diversity of subjects. Usually at least a few categories are assigned to an article, so Figure 8 actually shows a slight initial increase before the usual descending portion of the curve is reached.
`
[Figure 6: Mean Number of Co-Citations Per Article By Year. Histogram; horizontal axis: Year; vertical axis: No. of Co-Citations.]

[Figure 7: CACM Bibliographic Connection Ratio for 1958-79 (For Each Year, Mean Value of BC/CC, Counted When Both Values are Positive for an Article). Horizontal axis: Year; vertical axis: Mean No. of BC/CC.]

[Figure 8: CACM Frequency Distribution for cr Subvector Length. Horizontal axis: cr Subvector Length; vertical axis: No. of Documents.]

[Figure 9: CACM Frequency Distribution for ln Subvector Length. Horizontal axis: ln Subvector Length; vertical axis: No. of Documents.]

[Figure 10: CACM Frequency Distribution for bc Subvector Length. Horizontal axis: bc Subvector Length; vertical axis: No. of Documents.]

[Figure 11: CACM Frequency Distribution for cc Subvector Length. Horizontal axis: cc Subvector Length; vertical axis: No. of Documents.]
`
Figure 9, dealing with links, appears to follow the distribution predicted by Zipf's law [Salton 1975, page 189]. Figures 10 and 11 do also, except as one moves beyond frequencies of around 15. After that, those two curves seem to level off, showing occasional peaks randomly distributed.
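
A frequency distribution like those in Figures 8 through 11, and a rough check of whether it falls off like a power law, can be sketched as below. Fitting a least-squares line on log-log axes is an assumption about how one might test the Zipf-like shape; the report does not say how the comparison was made. The function names and data are illustrative.

```python
import math
from collections import Counter


def length_distribution(lengths):
    """Map each subvector length to the number of documents having it."""
    return dict(sorted(Counter(lengths).items()))


def log_log_slope(distribution):
    """Least-squares slope of log(no. of documents) against log(length)."""
    points = [(math.log(x), math.log(y))
              for x, y in distribution.items() if x > 0 and y > 0]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


# Hypothetical lengths: 8 documents of length 1, 4 of length 2, 2 of length 4.
lengths = [1] * 8 + [2] * 4 + [4] * 2
distribution = length_distribution(lengths)
slope = log_log_slope(distribution)
```

A slope near -1 on such a plot corresponds to the number of documents falling in inverse proportion to the length; a tail that levels off, as described for Figures 10 and 11, would show up as points lying above the fitted line.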
`
To better compare the subvector length distributions, scatter plots are given in Figures 12 through 15. Figure 12 shows bibliographic couplings against citations. There is no obvious linear relationship, but values are concentrated in the region close to the origin. For one to ten citations, there are usually less than 50 to 100 bibliographic couplings. There are some articles with many citations but few bibliographic