Characterization of Two New
Experimental Collections in Computer
and Information Science Containing
Textual and Bibliographic Concepts

Edward A. Fox *

83-561
September 1983

* Department of Computer Science
Cornell University
Ithaca, New York 14853

now at:
Department of Computer Science
Virginia Tech
Blacksburg, VA 24061

This work was supported in part by the National Science Foundation, under grant IST-81-08696.

Facebook Inc. Ex. 1007
EXHIBIT 2012
Facebook, Inc. et al.
v.
Software Rights Archive, LLC
CASE IPR2013-00480

TABLE OF CONTENTS

1 Introduction
1.1 Extended Vectors
1.2 Contrast with Other Collections
1.2.1 Document Collections
1.2.2 Query Collections
2 Multiple Concept Types
3 CACM
3.1 Documents
3.2 Illustrative Charts and Figures
3.3 Queries
3.4 Relevance Judgments
3.5 Retrieval Performance
4 ISI
4.1 Documents
4.1.1 Background
4.1.2 Tape Information Provided
4.1.3 Collection Preparation
4.2 Subvectors
4.3 Queries
4.4 Relevance Judgments
4.5 Retrieval Performance
5 Conclusion
References

CHARACTERIZATION OF TWO NEW EXPERIMENTAL
COLLECTIONS IN COMPUTER AND INFORMATION SCIENCE
CONTAINING TEXTUAL AND BIBLIOGRAPHIC CONCEPTS

Edward A. Fox *

Abstract

Two new collections are described which are particularly useful for investigating the interaction between textual and bibliographic data in the automatic indexing and retrieval of documents. An extension to the vector space model has been proposed whereby various types of concepts are included in the representation of such documents. Experiments using an enhanced version of the SMART system have shown such an extended model to perform better than simpler schemes. The CACM and ISI collections developed for this research should be of value for future related studies.

The ISI collection has author, title/abstract, and co-citation data for the 1460 most highly cited articles and manuscripts in information science in the 1969-1977 period. The CACM collection contains 7 types of concepts for the 3204 articles published in the Communications of the ACM up through 1979. These collections have 76 and 52 queries, respectively, along with relevance judgments.

* Department of Computer Science, Cornell University, Ithaca, NY 14853; now at Dept. of Computer Science, Virginia Tech, Blacksburg, VA 24061. This work was supported in part by the National Science Foundation, under grant IST-81-08696.

1. Introduction

In order to retrieve documents relevant to the request of a particular user it is necessary to first index or represent the content of articles and manuscripts. For many years this has been done by trained indexers who assign keyword lists or sets of descriptors from a controlled vocabulary [Borko & Bernier 1978]. Since the early 1960's an alternative method of automatic indexing has been developed whereby word stems, words, phrases, or thesaurus category indicators are selected from the title and abstract and a weighted vector indicating the importance of each is constructed [Salton 1980]. Part of this report deals with the vectors derived in this fashion from two collections in information and computer science.
`
Another source of data about documents is their bibliographic references. Citation indexes can be used to locate those entries referred to by an article, or which cite it [Garfield 1964, 1979]. Linkages between documents based on bibliographic coupling [Kessler 1962] and co-citation counts [Small 1973] have been utilized for a variety of analysis and retrieval purposes (e.g., [Bichteler & Eaton 1980], [Garfield 1970], [Kessler 1963a, 1963b, 1965], [Small & Koenig 1977], [Small 1978, 1980, 1981], [Weinberg 1974]). Preliminary experimentation has shown that the vectors produced by automatic indexing of document texts can be usefully supplemented by bibliographic information to produce a representation that can be more effectively searched than if either component were used alone ([Michelson et al. 1971], [Salton 1963, 1971]).
`
To facilitate exploration of the effects of extending the vector space model to include a variety of types of concepts it was necessary to have test collections containing such concepts. One collection containing 1460 of the most highly cited documents in information science published between 1969 and 1977 [Small 1981] was developed based on citation and co-citation data provided by the Institute for Scientific Information. This ISI collection contains three types of concepts: author names, word stems from the title and abstract sections, and co-citations between each of the articles. A second collection, of the 3204 articles published in Communications of the ACM in the years up through 1979, contains the above three types of concepts plus: categories assigned from a hierarchical subject classification scheme, bibliographic coupling connections, date information, and direct references between articles. The CACM collection also includes 52 user supplied queries in 3 different forms and relevance judgments indicating which documents relate to which queries. The ISI collection has a total of 76 queries.
`
This report describes the CACM and ISI collections in detail. It supplements the theoretical and experimental descriptions of [Fox 1983b] and the elaboration on implementation issues contained in [Fox 1983a]. Ample discussion has been included to enable interested researchers to understand the characteristics of these collections and to determine if they might be of use in related research investigations.

The rest of this introductory section provides useful background for subsequent sections. The notion of extended vectors is introduced, and some overview tables are shown indicating how these collections relate to other test collections discussed in [Fox 1983b].
`
1.1. Extended Vectors

When only the terms of documents are considered, a simple collection representation scheme results. A dictionary can be formed to contain the T distinct word stems and so textually based concepts can be numbered from 1 through T. Each of the N documents $D_i$ is represented by a vector of length T,

$$D_i = (tm_{i1}, tm_{i2}, \ldots, tm_{iT}) \tag{1-1}$$

so the entire collection is an N x T matrix,

$$C = (tm_{ij}), \quad 1 \le i \le N, \; 1 \le j \le T. \tag{1-2}$$

`
If other types of information are provided in addition, then each term vector (1-1) can become a subvector of a more complete document vector. Similarly, the matrix (1-2) becomes the term submatrix.

It makes sense to have a separate author submatrix indicating which authors contributed to which articles based on a second dictionary of author names. When some subject classification scheme is adopted, such as the category system used for Computing Reviews, then a dictionary for those entries can also be constructed. Thus, the CACM collection has term (tm), author (au), and Computing Reviews category (cr) submatrices, along with others.

More data about articles comes from the references present in the bibliographies of each publication. An N x N matrix can be built indicating which articles refer to which others, or which ones are cited by others. Similar matrices indicate the degree of bibliographic coupling or the number of co-citations received by pairs of articles. Each of these matrices then becomes a submatrix of a large N row collection matrix.

The extended vector model is thus based upon the idea of having multiple concept types. Each document vector has subvectors, one for each concept type included in the representation. Further discussion of the model, and experimental evidence of its utility, can be found in [Fox 1983b]. More detail regarding how the CACM and ISI collections are represented according to this model is given in sections 2 and 3 below.
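The layout described in this subsection can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `ExtendedVector`, the helper `submatrix`, the weights, and the 0-based concept numbering are all hypothetical (the report numbers concepts from 1), and nothing here reflects how the SMART system stores its vectors.

```python
from dataclasses import dataclass, field


@dataclass
class ExtendedVector:
    """One document: a sparse subvector (concept id -> weight) per concept type."""
    doc_id: int
    subvectors: dict = field(default_factory=dict)

    def subvector(self, ctype):
        # A concept type that is absent from a document is an empty subvector.
        return self.subvectors.get(ctype, {})


def submatrix(docs, ctype, n_concepts):
    """Dense N x n_concepts submatrix for one concept type (one row per document)."""
    rows = []
    for doc in docs:
        row = [0.0] * n_concepts
        for concept, weight in doc.subvector(ctype).items():
            row[concept] = weight
        rows.append(row)
    return rows


# Two toy documents with term (tm) and author (au) subvectors; values are made up.
docs = [
    ExtendedVector(0, {"tm": {0: 1.0, 2: 2.0}, "au": {0: 1.0}}),
    ExtendedVector(1, {"tm": {1: 1.0}, "au": {0: 1.0, 1: 1.0}}),
]
term_matrix = submatrix(docs, "tm", 3)
author_matrix = submatrix(docs, "au", 2)

# The full collection matrix places the submatrices side by side, row by row.
collection = [t + a for t, a in zip(term_matrix, author_matrix)]
```

Keeping each concept type in its own sparse mapping means a new type (for example a co-citation subvector) can be added without renumbering the existing dictionaries.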
`
1.2. Contrast with Other Collections

In [Fox 1983b], 1 small and 4 moderate to medium size test collections were employed to test various hypotheses. The CACM and ISI collections were two of those utilized. To provide a suitable contrast, the following subsections present summary information about all 5 of those collections.

1.2.1. Document Collections

Table 1 summarizes essential data about each of the document collections. All collections contain information about documents such as monographs or journal articles. At the very least, in almost every case, a title and abstract were available originally. They deal with a number of subjects, from the "soft" social science like material that makes up a fair proportion of the ISI collection, to the terse medical articles used in Medlars studies.

Table 1: Document Collections Summary

Short     No. of    No. of    Av. No.   Subvectors             Subject                  Years
Name      Docs.     Terms     Terms     Included               Matter                   Covered
ADI           82       888      27.1    tm                     Documentation            1963
CACM       3,204    10,448      40.1    au,bi,bc,cc,cr,ln,tm   Computer Science         1958-1979
INSPEC    12,884    14,683      35.4    tm                     Electrical Engineering   1979
ISI        1,460     7,392     104.9    au,cc,tm               Information Science      1969-1977
Medlars    1,033     8,750      55.8    tm                     Medicine                 to 1969

The small ADI collection is about librarianship, microforms, and other topics in documentation and information science as of 1963.

The CACM documents include all articles in issues of the Communications of the ACM from the first issue in 1958 to the last number of 1979. A considerable range of computer science literature is covered by those 3204 entries in the publication that for many years served as the premier periodical in the field.

INSPEC, which stands for Information Services in Physics, Electrotechnology, Computers and Control, covers three Science Abstracts publications: Electrical and Electronics Abstracts, Computer and Control Abstracts, and Physics Abstracts. The content focuses mainly on electrical engineering and computer science subjects.

The 1460 ISI entries were selected based on co-citation information relating to a study conducted by Dr. Henry Small of the Institute for Scientific Information (ISI). They are the items that could be located at the Cornell University library out of a total of 1827 names listed in the field of information science. Each was published between 1969 and 1977 and received at least five citations.

The 1033 Medlars articles were selected out of a large medical collection available at the National Library of Medicine.

Thus, the five collections are from various sources, have different sizes, and deal with a number of subjects.
`
1.2.2. Query Collections

Since considerable effort was made to study the characteristics and behavior of various query formulations, it is worthwhile to examine the five different query collections and all the versions present for each. Table 2 gives statistics (for the 4 larger collections) relating to the queries and the number of relevant documents for each query.
`
Table 2: Query and Relevant Document Characteristics

Collection   No. of    Av. Length   Rels Per Query     Rels in Top 10
Name         Queries   Cos Query    Av.No.   Av. %     Av.No.   Av. %
CACM         52        11.4         15.3     0.5       1.9      19
INSPEC       77        16.0         33.0     0.3       2.0       9
ISI          35, 76     8.1         49.8     3.4       1.7       4
Medlars      30        10.4         23.2     2.3       3.8      18
`
To give some idea as to the average length of each query, that value is given for the cosine version based on the original natural language (NL). The ISI value is low, because only the 35 queries for which both vector and Boolean logic (BL) forms are available were considered, and those queries are rather short. The INSPEC queries are much longer; users were not very experienced and so described their interests with entire paragraphs, filled with many unimportant words.

Query generality, that is the number of relevant documents per query, is illustrated in the next two columns of the table. The first number is given in absolute terms and the second is a percentage of the total number of documents. CACM queries have very few relevant documents, both in actual numbers and as a percentage. INSPEC questions have roughly twice as many. Medlars queries have slightly fewer relevant documents, but the percentage of documents that are relevant is much higher. And the ISI collection seems to have too many relevant documents per query, both in absolute numbers and as a percentage; those queries were much too vague to give small, sharply defined relevant sets.
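As a quick check on how the two generality columns relate, the percentage is simply the average number of relevant documents divided by the collection size. The short Python sketch below recomputes it for CACM (3204 documents, 15.3 relevant per query) and ISI (1460 documents, 49.8 relevant per query), figures taken from this report; the function name `generality_percent` is an illustrative choice, not something the report defines.

```python
# (number of documents, average relevant documents per query), as reported
collections = {
    "CACM": (3204, 15.3),
    "ISI": (1460, 49.8),
}


def generality_percent(n_docs, avg_relevant):
    """Average share of the collection that is relevant to a query, in percent."""
    return 100.0 * avg_relevant / n_docs


generality = {
    name: round(generality_percent(n_docs, avg_rel), 1)
    for name, (n_docs, avg_rel) in collections.items()
}
for name, pct in generality.items():
    print(f"{name}: {pct}% of documents relevant per query on average")
```

Rounded to one decimal place, the results agree with the percentages listed for these two collections in Table 2.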
`
The final two columns of Table 2 illustrate the retrieval behavior of the queries using cosine correlation by considering the top 10 ranked documents. For Medlars, almost 4 of the first 10 are relevant on average, indicating that the queries are likely to have high precision. Further, almost 20% of the relevant documents are already retrieved, indicating that recall is probably good as well. For CACM, recall seems high, but precision is reduced to roughly half. In the other two collections, around 2 relevant documents are found in the top 10 retrieved, so precision is low as well. And there is even a lower percentage of the number of relevant documents that are identified. For ISI, in particular, only 4% of the relevant documents are found in the top 10, indicating that recall will be rather low. With so many relevant documents per query, such behavior could be expected.
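The two top-10 measures discussed above can be illustrated with a small Python sketch. The ranked list and relevance set are invented for the example, and `precision_recall_at_k` is a hypothetical helper rather than part of SMART; the per-query values it returns would be averaged over all queries to give figures of the kind shown in Table 2.

```python
def precision_recall_at_k(ranked_doc_ids, relevant_ids, k=10):
    """Count relevant documents among the top k and derive precision and recall.

    Precision divides by k even if fewer than k documents were retrieved.
    """
    top_k = ranked_doc_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    precision = hits / k
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return hits, precision, recall


# Invented ranking (best match first) and relevance judgments for one query.
ranked = [7, 3, 12, 44, 5, 9, 21, 30, 2, 18, 40]
relevant = {3, 5, 18, 99, 100}

hits, precision, recall = precision_recall_at_k(ranked, relevant)
print(hits, precision, recall)
```

A query with many relevant documents, as in ISI, can place several of them in the top 10 and still show low recall at that cutoff, which is the pattern the paragraph above describes.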
`
`Table 3 gives descriptive details about the five different groups of query collections.
`
Table 3: Query Collections Summary

Coll.     NL Queries                                BL Queries
Name      No.   Origin                              No.   Origin
ADI       35    Written by 2 Harvard computer       35    author - 4 forms;
                science students                          librarians - 1 form each
CACM      52    Cornell & other computer            52    2 comp. sci. grad. students -
                personnel                                 1 form each
INSPEC    77    Students etc. at Syracuse Univ.     77    7 searchers at Syracuse
ISI       76    ADI, ISPRA & SIGIR Forum            35    ADI part only (see ADI coll.
                abstracts                                 above)
Medlars   30    NLM files                           30    NLM searchers, then expanded
                                                          using MESH
`
There is one set of natural language queries for each of the five, but in some cases there are several Boolean forms based on each natural language query.

The ADI collection originally had 35 natural language queries. Three searchers then each rephrased the natural language questions into a Boolean representation.

52 CACM queries were submitted from a variety of sources, and two students (in the Cornell graduate level information retrieval course) responsible for creating the query collection each proposed Boolean forms.

One set of INSPEC queries was selected from the various types of representations devised by seven different searchers working at Syracuse University.

For ISI, the same 35 natural language and Boolean queries used for ADI were employed. Multiple concept type testing, however, required more natural language queries, so 41 additional ones were selected from a set of queries for the ISPRA collection and from some new queries based on abstracts published in issues of the ACM SIGIR Forum.

Medlars natural language queries were as provided from the National Library of Medicine (NLM) files and Boolean forms were subsequently constructed at Cornell. The latter were based on Boolean expressions employed by searchers. An expansion process utilizing the Medical Subject Headings of the MESH thesaurus [National Library of Medicine 1968] enabled all category names to be replaced by words that might occur in document texts.
`
2. Multiple Concept Types

To supplement the discussion of documents in section 1.2.1, it is necessary to consider the multiple concept model proposed in [Fox 1983b]. The word "concept" is used to indicate a basic item of indexing information so that a collection of concepts identifies the context of the article or monograph. Concepts can be "terms" - really approximations to stems of words from the title, abstract, or supplied list of keywords - or they may be indexing categories or bibliographic connection markers (e.g., a pointer to a document co-cited with the one being indexed).
`
Table 1 indicates how many distinct terms are present in the dictionary of word stems for each collection. The dictionary is produced as part of the automatic indexing process, and is the minimum size required to accommodate all distinct items after a stop word list has been applied to document texts. Vocabulary size goes up rapidly as the number of documents increases (e.g., see the jumps for ADI and Medlars), but then gradually tapers off (e.g., see values for the large CACM and INSPEC collections). The size is somewhat influenced by the nature of the documents; there are more distinct stems in the 1033 Medlars medical articles which have a specialized terminology than for the 1460 ISI information science records which contain fewer technical names.
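To make the dictionary-building step concrete, here is a minimal Python sketch. It is not the SMART indexing procedure: the stop list is a toy one, whole lowercase words stand in for stems (no stemming is attempted), and the function name `build_dictionary` is hypothetical.

```python
import re

# Toy stop word list; a real stop list is far longer (assumption for illustration).
STOP_WORDS = {"the", "of", "and", "a", "in", "to", "is", "for"}


def build_dictionary(texts):
    """Number distinct non-stop words 1..T in order of first appearance."""
    dictionary = {}
    for text in texts:
        for word in re.findall(r"[a-z]+", text.lower()):
            if word not in STOP_WORDS and word not in dictionary:
                dictionary[word] = len(dictionary) + 1
    return dictionary


texts = ["The retrieval of documents", "Indexing and retrieval in the library"]
dictionary = build_dictionary(texts)
vocabulary_size = len(dictionary)  # T, the number of distinct concepts
```

Because a word is only added the first time it is seen, the vocabulary grows quickly over the first documents and more slowly afterwards, which is the tapering behavior noted above.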
`
The multiple concept type model introduced in section 1.1 calls for additional concepts beside terms.

In the ISI collection, two additional types, authors and co-citations, were considered. Author names were entered in the author dictionary, which had a total of 1255 items. Co-citations have no attendant dictionary. Since each document can, a priori, be co-cited with any other, the number of concepts equals N, the number of collection documents. The value of the j-th co-citation concept for the i-th document, namely $cc_{ij}$, is the number of times the i-th and j-th documents are co-cited.
`
Table 4 gives statistics for ISI subvectors.

Table 4: ISI Subvector Length Statistics

Statistic            Subvector
Measured             au        cc        tm
mean                 1.4       54.0      49.8
median               1         40        47
min                  1         1         8
max                  7         278       179
stdv                 0.8       48.4      21.5
Total no. of
concepts             1255      1460      7392
`
Even though there are only 1255 different authors for 1460 documents, each document has an average of 1.4 authors. The term subvectors are much longer, averaging around 50 concepts, but the standard deviation (stdv) indicates that the distribution is spread rather widely. Co-citation subvectors have about the same average length, but an even larger standard deviation. Since the ISI articles were chosen as ones with many citations, and since the collection covers the field of information science for a number of years, it is not surprising that there are so many entries in the co-citation submatrix. Since the lengths of term and co-citation vectors are comparable, comparisons between the two as to their relative utility for retrieval should be of considerable interest.
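Length statistics of the kind reported for each subvector type can be computed with a few lines of Python. The sketch below makes two assumptions: subvector length means the number of nonzero concepts in a sparse subvector, and the standard deviation is the population form (the report does not say which form was used). The data and the name `length_statistics` are invented.

```python
import statistics


def length_statistics(subvectors):
    """Summarize how many concepts each document's subvector contains."""
    lengths = [len(vector) for vector in subvectors]
    return {
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "stdv": statistics.pstdev(lengths),  # population form, an assumption
    }


# Invented author subvectors (author id -> weight) for four documents.
author_subvectors = [{1: 1}, {1: 1, 2: 1}, {3: 1}, {2: 1, 4: 1, 5: 1}]
stats = length_statistics(author_subvectors)
print(stats)
```

Running the same function once per concept type yields one column of such a table at a time.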
`
The CACM collection was developed in part to allow testing of other concept types besides those found in ISI documents. Computing Reviews categories (cr) allow a manual indexing system to be mixed in with automatic indexing entries. Bibliographic coupling (bc) indicates the number of references shared between two documents' bibliographies. Links (ln) are references to or citations from other articles.

Table 5 gives statistics for the various concept types in the CACM document collection.
`
Table 5: CACM Subvector Length Statistics

Statistic          Subvector
Measured           au       bc       cc       cr       ln       tm
mean               1.3      4.2      3.7      1.2      2.7      25.0
median             1        0        0        0        2        15
min                1        0        0        0        1        1
max                70       183      111      28       74       188
stdv               0.7      10.8     10.7     1.9      3.1      22.7
Total no. of
concepts           2847     3204     3204     200      3204     10448

Note: The minimum length for ln is one since a document is, by definition, linked to itself, and so the diagonal of the submatrix is set to ones.

`
Since bc, ln, and cc subvectors are based only on connections among the chosen CACM articles, their mean length is much shorter than that of the tm subvector. The cc subvectors for CACM have average length of less than 4 concepts, while those for ISI have over 50. In a larger document collection, with better total coverage of a subject discipline, more of the citations and references would be internal to that collection and so these subvectors would probably be longer. Thus, while the ISI collection has unusually long bibliographic connection subvectors, the CACM collection has abnormally short ones, and any evidence from the CACM tests that these subvectors are useful should be suggestive of their value in a more realistic environment with large numbers of documents.
`
It should be noted that links are relatively well bounded in number since a given article rarely has a very long bibliography and the quantity of citations to an article is limited by how many entries in the collection deal with that specific topic. Co-citations are somewhat more numerous, especially in such a homogeneous collection, since many later articles may refer to a given article as well as others appearing in the same journal. Bibliographic couplings are slightly more plentiful, perhaps because many articles cite "classics" in their sub-area, and because there is a good deal of referencing of CACM articles by CACM authors.

A final note about the various concept types is that in both the ISI and CACM collections, the total number of terms exceeds the number of documents. For very large collections, the opposite would be true. However, there is no reason to believe that such a consideration would affect retrieval behavior, whereas subvector length is likely to be an important consideration, relating to the specificity of indexing.
`
3. CACM

3.1. Documents
`
Robert Dattola of Xerox Corporation provided a magnetic tape containing the title, abstract, author list, keywords, Computing Reviews categories, and date of publication of articles published in the Communications of the ACM from the earliest issue, in 1958, to the last number in 1979. The collection format was changed, document numbers (dids) were assigned, and editing was done to correct spelling and typographical errors. Some duplicates were eliminated and missing articles added, and a final renumbering took place.
`
Carol Fox and Jill Warner looked through printed copies of each article to locate the bibliography. For each article, a list of the "dids" representing articles referenced in the CACM collection was eventually obtained. Many articles were located using a chronologically ordered list of all articles in the collection. However, since there were many errors in bibliographies, a special search program was written to identify the nearest matches to a supplied reference - to compensate for errors in year, month, author name, and title.
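The report does not say how the search program scored candidate matches, so the following Python sketch is purely an assumption about what such a nearest-match lookup could look like. It blends string similarity on author and title (using the standard library's `difflib.SequenceMatcher`) with a tolerance of one year on the date; the weights, the record fields, and the sample records are all invented.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity ratio between two strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def nearest_matches(reference, articles, top_n=3):
    """Rank collection articles by a blended author/title/year score (weights assumed)."""
    scored = []
    for article in articles:
        year_ok = abs(reference["year"] - article["year"]) <= 1
        score = (
            0.4 * similarity(reference["author"], article["author"])
            + 0.5 * similarity(reference["title"], article["title"])
            + 0.1 * (1.0 if year_ok else 0.0)
        )
        scored.append((score, article["did"]))
    # Highest score first; ties broken by the smaller document number.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:top_n]


# Invented collection records and a reference with spelling and year errors.
articles = [
    {"did": 1, "author": "Smith, J.", "title": "A Method for Sorting on Magnetic Tape", "year": 1961},
    {"did": 2, "author": "Jones, R.", "title": "Syntax Analysis by Table Lookup", "year": 1964},
]
reference = {"author": "Smyth, J.", "title": "Method for sorting on magnetic tapes", "year": 1962}

best = nearest_matches(reference, articles)
```

Returning a short ranked list rather than a single answer leaves the final decision to a person, which suits bibliographies with errors in several fields at once.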
`
The purpose of obtaining that data was to form bibliographic subvectors. Based on the above mentioned lists, a relational table was produced:

Raw_data (citing, cited)

which contained pairs of identifiers for the citing article and the one contained in the article's bibliography. Figure 1 shows the steps required, in QUEL like notation [Stonebraker et al. 1976], to produce from Raw_data first normal form [Date 1982] versions of the desired relations for bc, ln, and cc subvectors.

`
Figure 1: Relational Processing to Obtain
CACM Bibliographic Subvectors

Given: Raw_data (citing, cited)

Desired:
BC (citing1, citing2, coupling_no)
LN (link1, link2)
CC (cited1, cited2, co-citing_no)

QUEL Like Statements:

1. For BC
Modify Raw_data to hash on cited
Range of bc1 is Raw_data
Range of bc2 is Raw_data
Retrieve into bc_entries
( citing1 = bc1.citing,
citing2 = bc2.citing,
counter = bc1.cited )
where bc1.cited = bc2.cited
Range of bce is bc_entries
Retrieve into BC
( bce.citing1, bce.citing2,
coupling_no = count ( bce.counter ))

2. For LN
Retrieve into LN
( link1 = bc1.citing,
link2 = bc1.cited )
Append to LN
( link2 = bc1.citing,
link1 = bc1.cited )

3. For CC
Modify Raw_data to hash on citing
Range of cc1 is Raw_data
Range of cc2 is Raw_data
Retrieve into cc_entries
( cited1 = cc1.cited,
cited2 = cc2.cited,
counter = cc1.citing )
where cc1.citing = cc2.citing
Range of cce is cc_entries
Retrieve into CC
( cce.cited1, cce.cited2,
co-citing_no = count ( cce.counter ))
`
`The actual processing was similar. Vectors were eventually formed from the relations after appropriate
`
`sorting and compression to unnormalized lists.
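For readers without access to a QUEL system, the same three relations can be derived in memory from (citing, cited) pairs. The Python sketch below is an illustrative equivalent, not the processing actually used: the function name is hypothetical, duplicate input rows are dropped, and pairs of a document with itself are omitted from BC and CC (the self-join in Figure 1 would also produce them).

```python
from collections import Counter, defaultdict
from itertools import permutations


def bibliographic_relations(raw_data):
    """Derive BC, LN, and CC from (citing, cited) pairs."""
    pairs = set(raw_data)              # ignore duplicate rows
    cited_by = defaultdict(set)        # cited article -> articles that cite it
    cites = defaultdict(set)           # citing article -> articles it cites
    for citing, cited in pairs:
        cited_by[cited].add(citing)
        cites[citing].add(cited)

    # BC: two citing articles are coupled once per reference they share.
    bc = Counter()
    for citers in cited_by.values():
        for a, b in permutations(sorted(citers), 2):
            bc[(a, b)] += 1

    # CC: two cited articles are co-cited once per article citing both.
    cc = Counter()
    for refs in cites.values():
        for a, b in permutations(sorted(refs), 2):
            cc[(a, b)] += 1

    # LN: every reference, stored in both directions.
    ln = pairs | {(cited, citing) for citing, cited in pairs}
    return bc, ln, cc


# Invented example: articles 3, 4, and 5 cite articles 1 and 2.
raw = [(3, 1), (3, 2), (4, 1), (4, 2), (5, 2)]
bc, ln, cc = bibliographic_relations(raw)
```

Both orderings of each pair are stored, so the row for any document can be read directly as its bc or cc subvector.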
`
Since the CACM collection is the first one studied with so many different concept types, and since the data is available for a number of years, various histograms, scatter plots, and charts are given in the next subsection. The intention is to illustrate the form, content, and distribution of the different types of bibliographic concepts.
`
3.2. Illustrative Charts and Figures

Table 5 gave statistics on the CACM subvector lengths. However, a much more detailed understanding can be gleaned from the graphical presentations below. To begin with, Figures 2 through 6 are histograms with mean values given for various statistics for each year in the period 1958-1979.

Figure 2 shows the number of articles each year. Clearly, the publication grew in size during the early years, and then there were changes in editorial policies at various later times. For example, in the early years, computing algorithms in fairly large numbers were published, each one counting as an article. Subsequently this practice was discontinued and algorithms were collected for a separate publication.
`
Figure 3 shows the average number of citations to an article for each year. Lower values in later years can be explained by the fact that there were few articles in the collection that were published subsequently, and so citations were not possible. Early articles were not highly cited, perhaps because the first volumes had many methodological or other reports that were not of great interest afterwards. The peak in citations in the middle years is attributable to the fact that CACM was a key publication in computer science during that time, and many important developments were described there, especially after 1962. One might expect that the three bibliographic subvectors would vary in length depending in part on the distribution of citations just described.

Figure 4 shows the average number of bibliographic couplings per year. In general, the number increases as time goes by, since there are more available prior articles in the collection that can be referred to. Thus, the number of couplings for years 1976 through 1978 are very high - bibliographic references can be to any previous article, even back to 1958. All in all, the distribution is a fairly uniform one, perhaps explaining why there is such a large standard deviation in bc subvector lengths.

Figure 5 gives the average number of links for each year. Since links include both references and citations, it is not surprising that the curve is a relatively flat one. Only for the early years, when there were not many prior articles to refer to and when the subject matter was not such as to elicit many later citations, is the level of links fairly low.
`
[Figure 2: Mean Number of CACM Articles Published Each Year. Histogram; vertical axis: No. of Articles (0 to 300); horizontal axis: Year of CACM Volume Considered (1958 to 1979).]

[Figure 3: Mean Number of Citations Per Article Per Year. Histogram; vertical axis: No. of Citations; horizontal axis: Year.]

[Figure 4: Mean Number of Bibliographic Couplings Per Article By Year. Histogram; vertical axis: No. of Bibliographic Couplings; horizontal axis: Year.]

[Figure 5: Mean Number of Links Per Article By Year. Histogram; vertical axis: No. of Links; horizontal axis: Year.]

Figure 6, showing average number of co-citations, has a very different form than that of the other figures shown. A high peak appears near the middle, and lower values are at either tail. Apparently, 1966 was a good year, since articles then were co-cited with many others. Referring back to Figure 3 one sees that the most citations, after those to 1968, were to articles in 1966. Indeed, an examination of articles in that year does reinforce the conviction that there were many important articles that year, and that they related to a considerable number of other articles. The number of co-citations could serve as another indication of merit, to go along with the number of citations, but probably the emphasis on number of related articles rather than number of citing articles would reduce the appeal of such a proposal. In any case, one would expect the average number of co-citations to be greatest during the middle years of the range, since there are more articles later that can cite both the given article and one other (which could be either earlier or later).
`
Figure 7 highlights the difference in the distribution of bc and cc lengths over the years. Bibliographic couplings are greatest towards the end of the range, and co-citations peak between the beginning and middle of the period. Hence, when mean bc length is divided by mean cc length, the ratio is around 1-2 for middle years, lower beforehand, and higher after. An obvious strategy is to somehow add together the effects of bibliographic coupling and co-citations, since then the overall depth of bibliographic connection based indexing would be fairly uniform over the complete range of years of interest.
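The yearly ratio can be computed as in the following Python sketch, which follows the rule stated in the caption of Figure 7: for each year, average BC/CC over the articles for which both values are positive. The input tuples are invented and the function name `yearly_bc_cc_ratio` is hypothetical.

```python
from collections import defaultdict


def yearly_bc_cc_ratio(articles):
    """Mean of bc/cc per year, over articles where both lengths are positive.

    articles: iterable of (year, bc_length, cc_length) tuples.
    """
    ratios = defaultdict(list)
    for year, bc_len, cc_len in articles:
        if bc_len > 0 and cc_len > 0:
            ratios[year].append(bc_len / cc_len)
    return {year: sum(vals) / len(vals) for year, vals in sorted(ratios.items())}


# Invented subvector lengths: few couplings early, few co-citations late.
articles = [(1962, 2, 8), (1962, 0, 5), (1968, 6, 4), (1968, 3, 3), (1977, 12, 2)]
result = yearly_bc_cc_ratio(articles)
print(result)
```

In this toy data the ratio rises from below 1 in the early year to well above 1 in the late year, mirroring the trend described above.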
`
`Turning next to distributions of frequencies for the various subvectors, one sees the type of curves
`
`that are expected, in Figures 8 through 11. The number of documents for each subvector length is
`
`shown. Almost every article has fewer than 10 Computing Reviewa categories assigned or 10 links to
`
`other articles. There are a goodly number of documents, however, with fairly long 6 and
`
`subvec-
`
`tora, as can be seen in Figures 10 and 11.
`
`Most articles have fewer than 4 entries in their cr subvector, since categories are usually fairly
`
`broad and few articles deal with a much wider diversity of subjects. Usually at least a few categories
`
`are assigned to an article, so Figure 8 actually shows a slight initial increase before the usual descending
`
`portion of the curve is reached.
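`
A frequency distribution of this kind is simple to tabulate. The sketch below assumes each subvector is stored as a mapping from concept identifier to weight, so that its length is the number of entries; the function names, the storage layout, and the category codes in the toy data are hypothetical and are not taken from the collection.

```python
from collections import Counter


def length_distribution(subvectors):
    """Map each subvector length to the number of documents having that length."""
    counts = Counter(len(vec) for vec in subvectors.values())
    return dict(sorted(counts.items()))


def fraction_below(dist, limit):
    """Fraction of documents whose subvector length is strictly below limit."""
    total = sum(dist.values())
    return sum(n for length, n in dist.items() if length < limit) / total


# Hypothetical cr subvectors: document id -> {category code: weight}.
toy_cr = {
    "d1": {"3.7": 1},
    "d2": {"3.7": 1, "5.2": 1},
    "d3": {"4.1": 1, "4.2": 1},
    "d4": {},
}
toy_dist = length_distribution(toy_cr)
```

Plotting the resulting counts against length gives a curve of the kind shown in Figures 8 through 11, and `fraction_below` gives a direct check on statements such as "almost every article has fewer than 10 categories".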
`
`Figure 6: Mean Number of Co-Citations Per Article By Year
`
`[Plot omitted. Vertical axis: No. of Co-Citations; horizontal axis: Year.]
`
`Figure 7: CACM Bibliographic Connection Ratio for 1958-79
`(For Each Year, Mean Value of BC/CC, Counted When
`Both Values are Positive for an Article)
`
`[Plot omitted. Vertical axis: Mean No. of BC/CC; horizontal axis: Year.]
`
`Figure 8: CACM Frequency Distribution for cr Subvector Length
`
`[Plot omitted. Vertical axis: No. of Documents; horizontal axis: cr Subvector Length.]
`
`Figure 9: CACM Frequency Distribution for ln Subvector Length
`
`[Plot omitted. Vertical axis: No. of Documents; horizontal axis: ln Subvector Length.]
`
`Figure 10: CACM Frequency Distribution for bc Subvector Length
`
`[Plot omitted. Vertical axis: No. of Documents; horizontal axis: bc Subvector Length.]
`
`Figure 11: CACM Frequency Distribution for cc Subvector Length
`
`[Plot omitted. Vertical axis: No. of Documents; horizontal axis: cc Subvector Length.]
`
`Figure 9, dealing with links, appears to follow the distribution predicted by Zipf's law [Salton 1975,
`
`page 189]. Figures 10 and 11 do also, except as one moves beyond frequencies of around 15. After
`
`that, those two curves seem to level off, showing occasional peaks randomly distributed.
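`
One rough way to test such an observation numerically is to fit a straight line to the distribution on log-log axes. The sketch below assumes that "following Zipf's law" can be read as the number of documents falling off approximately as a power of the subvector length, so that the points lie near a line whose slope is close to -1; the text does not say how the comparison was actually made, and the function name and the `max_length` cutoff are hypothetical.

```python
import math


def loglog_slope(dist, max_length=None):
    """Least-squares slope of log(count) against log(length).

    dist maps subvector length -> number of documents. Lengths or counts of
    zero are skipped because their logarithm is undefined. max_length, if
    given, restricts the fit to the part of the curve before it levels off.
    """
    points = [
        (math.log(length), math.log(count))
        for length, count in dist.items()
        if length > 0 and count > 0
        and (max_length is None or length <= max_length)
    ]
    if len(points) < 2:
        raise ValueError("need at least two points with positive length and count")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


# Synthetic, idealized distribution in which count is exactly 1200 / length.
ideal = {length: 1200 / length for length in range(1, 11)}
ideal_slope = loglog_slope(ideal)
```

For curves like those in Figures 10 and 11, passing `max_length=15` would confine the fit to the region where the Zipf-like behaviour is said to hold and leave out the flat tail with its scattered peaks.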
`
`To better compare the subvector length distributions, scatter plots are given in Figures 12
`
`through 15. Figure 12 shows bibliographic couplings against citations. There is no obvious linear
`
`relationship, but values are concentrated in the region close to the origin. For one to ten citations, there
`
`are usually fewer than 50 to 100 bibliographic couplings. There are some articles with many citations but
`
`few bibliographic
