Bibliographic Information*

GERARD SALTON

Harvard University,† Cambridge, Massachusetts
`
Abstract. Automatic documentation systems which use the words contained in the
individual documents as a principal source of document identifications may not perform
satisfactorily under all circumstances. Methods have therefore been devised within the
last few years for computing association measures between words and between documents,
and for using such associated words, or information contained in associated documents, to
supplement and refine the original document identifications. It is suggested in this study
that bibliographic citations may provide a simple means for obtaining associated documents
to be incorporated in an automatic documentation system.
The standard associative retrieval techniques are first briefly reviewed. A computer
experiment is then described which tends to confirm the hypothesis that documents
exhibiting similar citation sets also deal with similar subject matter. Finally, a fully
automatic document retrieval system is proposed which uses bibliographic information in
addition to other standard criteria for the identification of document content, and for the
detection of relevant information.
`
1. Introduction
In recent years considerable attention has been devoted to the design of automatic
documentation systems. If the system is to operate fully automatically,
the intervention of human experts for the analysis of document content and for
the preparation of document identifications ought to be eliminated. Under these
circumstances the retrieval system must of necessity be based primarily on the
words occurring in the individual texts, and on the terms used to formulate the
search requests.
It has been suggested [1] that an acceptable system can be generated by extracting
from the texts and from the information requests those linguistic units
which are believed to be representative of document content, and by defining a
standard of comparison between words extracted from documents and words
used in the requests for documents. To determine which words are particularly
significant as an indication of document content a variety of criteria may be
used, including the position of the words in the texts, the word types, the vocabulary
size, and most importantly the frequency of occurrence of the individual
words. The most significant words are then used as "index terms" to characterize
the documents, and the most significant sentences, that is, those containing a
large number of significant words, are used as abstracts for the documents.
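As a rough illustration of the frequency criterion just described, the following Python sketch selects index terms and an abstract sentence from a short text. It is only a minimal sketch under stated assumptions: the stop list `FUNCTION_WORDS`, the function names, the `top_k` cutoff, and the sample text are all hypothetical, and the other criteria mentioned above (word position, word type, vocabulary size) are not modeled.

```python
import re
from collections import Counter

# Hypothetical stop list standing in for the high-frequency function words.
FUNCTION_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "are", "for", "on", "by"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def index_terms(text, top_k=3):
    """Return the top_k most frequent non-function words as index terms."""
    counts = Counter(w for w in tokenize(text) if w not in FUNCTION_WORDS)
    # Sort by descending frequency, then alphabetically for a stable result.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:top_k]]

def auto_abstract(text, top_k=3, n_sentences=1):
    """Pick the sentences containing the most occurrences of significant words."""
    significant = set(index_terms(text, top_k))
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    scored = [(sum(w in significant for w in tokenize(s)), i, s)
              for i, s in enumerate(sentences)]
    # Highest score first; ties keep the original sentence order.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [s for _, _, s in scored[:n_sentences]]

sample = ("Citations link documents. Documents with shared citations "
          "cover similar topics. Weather is nice.")
print(index_terms(sample))
print(auto_abstract(sample))
```

The sketch counts raw word forms; combining varying forms of similar words, for example by suffix deletion, would be an additional preprocessing step.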
A typical automatic indexing and abstracting system based on word frequency
`
* Received July, 1962; revised March, 1963. This study was supported in part by the
Air Force Cambridge Research Laboratories and in part by Sylvania Electric Products, Inc.
† Computation Laboratory.
`
`Facebook Ex. 1012
`
`
`
DOCUMENT RETRIEVAL USING BIBLIOGRAPHIC INFORMATION
`
`
[Flowchart]
Linear text
-> Itemize words in the text and assign [illegible] numbers
-> Combine varying forms of similar words, e.g., by deletion of word suffixes
-> Perform word frequency counts and eliminate high-frequency function words
-> (first branch) Compute an index of significance for all sentences based on
   number of included significant words
   -> Collect the most significant sentences to form an "automatic abstract"
-> (second branch) Compute an index of significance for remaining words based
   on frequency of occurrence
   -> Generate a list of significant words to serve as "index terms" representing
      document content

FIG. 1. Typical automatic indexing and abstracting system based on word frequency
counts.
`
counts is shown in Figure 1. The principal drawback of the system outlined in
Figure 1 is the lack of any normalization procedure designed to take into account
differences between individual authors or between individual document types.
Thus, a given set of documents covering some homogeneous subject area may
quite possibly give rise to many different index sets. Similarly, completely different
document sets may be obtained in answer to only slightly differing search
requests.
In order to reduce the importance attached to the individual words and to their
frequencies of occurrence, the introduction of a synonym dictionary, or thesaurus,
is often proposed. All words extracted from documents or search requests could
then be replaced by standard thesaurus forms before being used. This solution,
while attractive in theory, is difficult to implement because no definite criteria
exist for the construction of good or useful thesauruses, and because the generation
of any thesaurus is a complex and time-consuming undertaking. For this
reason, several workers [2, 3, 4, 5] have been interested in automatic procedures
designed to supplement the original terms extracted from the documents with
new terms related to the old ones in various ways. Indexing techniques which
make use of such "associated" terms have come to be known as "associative
indexing," and the corresponding retrieval operations are known as "associative
retrieval."
The present report suggests an extension of the usual associative retrieval
techniques by taking into account bibliographic citations and other information
peculiar to the author of a given document. It is suggested, specifically, that the
set of identifying words extracted from the documents be supplemented by new
words obtained in part from the bibliographic information provided with the
documents; these new expanded sets of index terms may then give a more accurate
representation of document content than the original ones and may thus
provide a more effective retrieval mechanism.
The standard associative indexing techniques are first briefly reviewed. Thereafter,
some properties of bibliographic citations are described, and the role of
bibliographic information as an indication of document content is evaluated. A
small computer experiment using citations is then summarized and the significance
of the numeric results is discussed. Finally, a proposed fully automatic
document retrieval system using bibliographic information in addition to other
criteria is described.
`
2. Associative Information Retrieval
Most associative retrieval systems are based on the statistical word frequency
counting procedures previously illustrated in Figure 1. Thus, given a document
collection, it is possible to extract a set of n distinct high-frequency
words $W_1, W_2, \ldots, W_n$, such that each document within the collection is initially
identified by some subset of the set of n given words.
In practical retrieval systems, it becomes useful to provide for some additional
flexibility. For example, given a search request expressed in terms of words in the
natural language, it may be convenient to alter somewhat the original request,
either by making it more specific and thus presumably reducing the size of the
document set which fulfills the request, or, alternatively, by making it more
general. In the same way, given a set of terms identifying a specified document,
it may be useful to alter somewhat the original set by deletion of old terms or
addition of new ones in such a way that documents dealing with similar subject
matter are identified by similar sets of index terms.
An analogous problem arises in connection with the document sets which are
obtained in answer to certain search requests. It is often useful to alter these
document sets by addition of further documents which may also have some
relevance or, alternatively, by deletion of documents which are not directly
relevant. Both questions can be treated by determining a measure of association
between words or index terms on the one hand and between documents on the
other, and by using this association measure for the alteration of the corresponding
index term and document subsets.
Consider first the problem of word associations. Words may be related in
`
`
[Figure 2]
(a) Typical term-document incidence matrix C, with terms $W_1, \ldots, W_n$ as rows and
documents $D_1, \ldots, D_m$ as columns ($C_j^i = n$ if and only if document $D_j$
contains term $W_i$ exactly n times)
(b) Typical term-term similarity matrix R, with terms $W_1, \ldots, W_n$ as both rows
and columns and elements $R_j^i$

FIG. 2. Matrices used for the generation of term associations
`
many different ways: for example, they may exhibit the same word stems, or
they may have similar syntactic properties, or they may be usable in the same
contexts, and so on. The criteria of association used in most automatic programs
do not normally require a determination of syntactic or semantic properties.
Rather, they are based on simple co-occurrence of words in the same texts or
sentences, or on co-occurrence with individual or joint frequencies greater than
some given threshold value.
Given a set of m documents and a set of n index terms, a typical procedure for
the generation of term associations is as follows:
`
(a) a term-document incidence matrix C is constructed which lists index terms against
    documents; matrix element $C_j^i$ is defined to be equal to k if and only if document j
    contains term i exactly k times;
(b) a coefficient of similarity between terms is then defined based on the frequency of
    co-occurrence of pairs of terms in the individual documents;
(c) a term-term similarity matrix R is then generated which exhibits all similarity
    coefficients between pairs of index terms;
(d) term associations are defined for those pairs whose associated similarity coefficient is
    greater than some stated threshold value.
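Steps (a) through (d) can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes the cosine coefficient as the similarity measure, and the toy matrix, the function names, and the threshold value of 0.5 are hypothetical choices rather than values taken from the experiment.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two numeric vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def term_similarity_matrix(C):
    """Steps (b) and (c): compare every pair of rows of C."""
    n = len(C)
    return [[cosine(C[i], C[j]) for j in range(n)] for i in range(n)]

def term_associations(R, threshold):
    """Step (d): scan the upper triangle of R for pairs above the threshold."""
    n = len(R)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if R[i][j] > threshold]

# Step (a): toy incidence matrix, rows = terms, columns = documents.
C = [
    [2, 0, 1],   # term 0
    [4, 0, 2],   # term 1 (same distribution over documents as term 0)
    [0, 3, 0],   # term 2
]
R = term_similarity_matrix(C)
print(term_associations(R, threshold=0.5))
```

Only the upper triangle of R is scanned because the matrix is symmetric.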
`
A sample term-document incidence matrix C is shown in Figure 2(a). To obtain
a coefficient of similarity between two terms based on the frequency of
co-occurrence in the documents of a given collection, it is only necessary to
perform a pairwise comparison of the corresponding rows of C. Many different
types of similarity coefficients have been suggested in the literature [2, 3, 4, 5];
a simple coefficient of similarity between rows of a numeric matrix, and one which
may be as meaningful as any of the others, is the cosine of the angle between the
corresponding m-dimensional vectors [6]. The similarity coefficients can be displayed
in an n × n symmetric term-similarity matrix R, where the coefficient of
similarity $R_j^i$ between term $W_i$ and term $W_j$ is
`
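$$R_j^i = R_i^j = \frac{\sum_{k=1}^{m} C_k^i C_k^j}{\sqrt{\sum_{k=1}^{m} (C_k^i)^2 \, \sum_{k=1}^{m} (C_k^j)^2}}.$$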
`
The term-similarity matrix R corresponding to the term-document matrix C
of Figure 2(a) is shown in Figure 2(b). Since R is symmetric, only the right
(or left) triangular part of R must be scanned in order to detect pairs of terms
with large similarity coefficients.
To generate document associations instead of term associations the same procedures
can be used, since the strength of association between documents may be
conveniently assumed to be a function of the number and frequencies of the
shared terms in their respective term lists. Document similarities are therefore
obtained by comparing pairs of columns (instead of rows) of the term-document
matrix C, and a document-document similarity matrix is constructed and used
in the same way as the previously described term-term matrix R.
Consider now a typical system for document retrieval using term and document
associations as shown in Figure 3. A list of high-frequency terms is first
generated for each document by word frequency counting procedures. Normalization
may or may not be effected by thesaurus lookup. A term-term similarity
matrix is then constructed by using co-occurrence of terms within sentences,
rather than within documents, as a criterion. It should be noted that as new term
associations are defined, the original incidence matrix can be revised by inclusion
in some of the matrix columns of new, associated terms which are not originally
contained in the respective sentences or documents. The revised incidence matrix
then gives rise to a new term-term similarity matrix, incorporating second-order
associations, and so on. This feedback process is represented by an upward-pointing
arrow in Figure 3.
To retrieve documents in answer to search requests, the programs already
available can be used by adding to the term-document matrix C a new column
$C_{m+1}$, representing the request terms. Specifically, element $C_{m+1}^k$ is set equal to
w if term $W_k$ is used in the search request with weight w; if word $W_k$ is not used
in the given search request $C_{m+1}^k$ is set equal to 0. If no weights are specified by
the requestor the values of the elements of column $C_{m+1}$ are restricted to 0 and 1.
An estimate of document relevance is then obtained by computing for each document
the similarity coefficient between the request column $C_{m+1}$ and the respective
document column. The documents can be arranged in decreasing order of
similarity coefficients, and all documents with a sufficiently large coefficient
can be judged to be relevant to the given request. Clearly, the final relevance
criterion depends not only on the terms assigned to the various documents or
on the words used in the documents and search requests, but also on other terms
associated with the original ones through co-occurrence in a given document
collection.
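The request-as-extra-column idea can be illustrated with a short Python sketch. It is a minimal sketch under assumptions: the request is passed as a hypothetical dictionary from term index to weight, the cosine coefficient is used as the similarity measure, and the `cutoff` parameter stands in for the unspecified "sufficiently large" coefficient.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two numeric vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_documents(C, request_weights, cutoff=0.0):
    """Rank the document columns of C by similarity to the request column."""
    n_terms, n_docs = len(C), len(C[0])
    # The request plays the role of column m+1 of the term-document matrix:
    # weight w for terms used in the request, 0 for all other terms.
    request = [request_weights.get(k, 0) for k in range(n_terms)]
    scored = []
    for j in range(n_docs):
        column = [C[k][j] for k in range(n_terms)]
        scored.append((cosine(request, column), j))
    # Decreasing order of similarity; ties broken by document number.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [(j, round(s, 3)) for s, j in scored if s > cutoff]

# Toy matrix: rows = terms, columns = documents.
C = [
    [2, 0, 1],
    [1, 0, 0],
    [0, 3, 1],
]
# Unweighted request using terms 0 and 1 (weights restricted to 0 and 1).
ranked = rank_documents(C, {0: 1, 1: 1})
print(ranked)
```

Documents that share no term with the request receive a coefficient of zero and are dropped by the default cutoff.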
`
`
[Flowchart; boxes as labeled in the figure]
- For each document generate set of high-frequency words to serve as "index terms"
  (see Fig. 1)
- Thesaurus look-up and substitution of thesaurus heads for high-frequency terms
- Construct term-sentence incidence matrix listing sentences against included terms
- Compute term-term similarity matrix and generate term associations for document
  identification
- Construct term-document incidence matrix listing documents against included terms
- Compute document-document similarity matrix and generate document associations
- Compare vector of request terms with term-document incidence matrix and identify
  relevant documents

FIG. 3. Typical automatic document retrieval system using term and document associations
(the two arrow styles in the figure distinguish optional paths from compulsory paths)
`
3. Bibliographic Information as a Factor in Content Analysis
In the preceding section procedures were described for expanding the set of
terms used as document identifications by inclusion of information derived
from the texts of other documents in the same collection. Since the retrieval
effectiveness depends to a large measure on the accuracy and completeness of
the content identifications, it is of interest to inquire whether additional pertinent
data available with many documents might not also be used to provide important
content indications. In particular, it may be conjectured that information associated
with the author of a given document, for example data contained in
related publications of the same author, may furnish usable content indicators.
The same considerations may also apply to information obtained from publications
cited by a given author in his list of references, or from those citing the given
document.
An attempt is therefore made in the next few sections to evaluate the utility
of bibliographic citations as an aid to automatic content analysis. When this
problem is first considered, the initial reaction must clearly be one of skepticism.
Indeed, it is well known that many different practices are followed by individual
authors of technical papers in the construction of bibliographies. Most important
as a controlling element is the document type. Survey and tutorial articles carry
more extensive bibliographies than specific research reports. Similarly, articles
covering a wide variety of topics may be cited more frequently than others
which are more specialized. As a result, two articles which cover identical topics
from somewhat different points of view may include quite distinct bibliographies.
A second important criterion is the availability of the cited document. Thus
reports included in certain books or in important journals are likely to be cited
more often than those not generally available to the public. By the same token,
unclassified papers are cited more freely than classified ones. The date of publication
is a related factor which also affects the probability of being cited. Very
recent documents which have not had a chance to circulate, and very old ones
which no longer circulate are, in general, cited more rarely than current articles
which have been distributed within the recent past.
A final consideration pertains specifically to the author of a document. In
many cases personal preferences are evident both as to number and type of
papers cited; authors have varying backgrounds, and there may also exist a
tendency toward self-citation regardless of relevancy.
Because of these and other variations, citation and reference lists¹ have not
generally been used as an indication of document content. Rather, such lists
are used to detect trends in the literature as a whole, and to serve as adjuncts to
certain kinds of literature searches [7, 8]. Citation indexes have, for example, been
used in attempts to identify significant research by equating frequency of citation
with relative significance of the subject matter [9]; they have also served to
trace the flow of information and to measure the relative importance of various
journals to the scientific community [10].
There exists considerable evidence, however, that in addition to being useful
for the above-mentioned standard applications, bibliographic information might
also help in content analysis. First, it is clear that for the large majority
of authors, at least some of the items included in a reference list will be highly
pertinent documents whose subject matter overlaps drastically with that of the
citing document. The same is true of other documents published by the same
author whether they appear as part of the references or not. Second, it has been
stated that many experts who are reasonably familiar with research and development
in their field of activity use a citation index in preference to a subject index
by following up the references to the standard, well-known works which are
judged important. In such cases bibliographic references are then in fact used as
content indicators. Third, citations are relatively simple to incorporate in an
automatic system, since they are directly available with most technical documents,
and since they can be manipulated automatically nearly as easily as
ordinary running text.
If it could be shown that citations were usable as content indicators, then the
associative techniques described in Section 2 could be further refined by adding
¹ A citation index consists of a set of bibliographic references (the set of cited documents),
each followed by a list of all those documents (the citing documents) which include the
given cited document as a reference. A reference index, on the other hand, lists all cited
documents under each citing document.
`
`
[Figure 4: square logical matrix X, with the cited documents $D_1, \ldots, D_m$ as rows and
the citing documents $D_1, \ldots, D_m$ as columns]
($X_j^i = 1$ if document $D_i$ is cited by document $D_j$)
FIG. 4. Matrix X exhibiting direct citations
`
to the term-document matrix illustrated in Figure 2(a) further document
columns representing cited documents, citing documents, or documents written
by the same author. These new documents would then provide new associated
terms which might be equally as important as the term associations derived from
other documents in the same collection.
To test the significance of bibliographic citations, a comparison was made
between citation similarities and index term similarities for an indexed document
collection. Specifically, a measure of similarity was computed between each pair
of documents in the collection, based on the number of overlapping index terms;
a similar measure was then computed for the same pairs of documents, based on
the number of overlapping citations; finally, the similarity measures obtained
from index terms and citations respectively were compared by calculating a
similarity index between citation similarities and index term similarities. An
overall measure was also computed for the complete document collection by
taking into account the similarity measures between all document pairs.
If use were to be made of bibliographic information for purposes of content
analysis, one would hope that large similarity measures between specific document
pairs due to overlapping index terms would also be reflected in large measures
between the same document pairs due to overlapping citations, or vice versa.
To test this hypothesis, the data obtained for the actual document collection
were compared with data obtained from a randomly constructed, fictitious collection
for which no correlation should exist between index terms and citations. The
actual procedures used in the experiment are summarized in the next section.²
`
4. Comparison of Citation Similarities with Index Term Similarities
Consider a collection of m documents each of which is characterized by the
property of being cited by one or more of the other documents in the same collection.
Each document can then be represented by an m-dimensional logical vector
$X^i$, where $X_j^i = 1$ if and only if document i is cited by document j, and $X_j^i = 0$
otherwise. If these m vectors are arranged in rows one below the other a square
logical incidence matrix is formed similar to the matrix exhibited in Figure 4.
The number of ones in the row vectors of X represents the degree of "citedness"
for the documents listed at the head of the rows. Similarly, the number of ones
`
² A more complete exposition of the experiment is given in [oj.
`
`
in the column vectors of X measures the amount of "citing" for the documents
listed at the head of the columns. For a closed document collection, the set of row
identifiers is the same as the set of column identifiers as shown in Figure 4.
A measure of similarity between row (column) vectors can be obtained by
calculating the cosine factor, previously exhibited in Section 2, for each pair of
rows (columns). The result of such a computation can again be represented by
a similarity matrix R, similar to that shown in Figure 2(b), where $R_j^i$ is the value
of the similarity coefficient between the ith and jth rows (columns) of X.
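The row and column operations on the citation matrix can be made concrete with a small Python sketch. The four-document matrix and the function names are hypothetical and serve only to illustrate the definitions; the cosine coefficient is the measure named in the text.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two numeric vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def citedness(X):
    """Number of ones in each row: how often each document is cited."""
    return [sum(row) for row in X]

def citing_counts(X):
    """Number of ones in each column: how many documents each document cites."""
    return [sum(col) for col in zip(*X)]

def similarity(vectors):
    """Cosine similarity between every pair of the given vectors."""
    return [[cosine(u, v) for v in vectors] for u in vectors]

# Toy closed collection: X[i][j] = 1 if document i is cited by document j.
X = [
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]
cited_sim = similarity(X)                # row comparison
citing_sim = similarity(list(zip(*X)))   # column comparison
print(citedness(X), citing_counts(X))
```

In this toy matrix documents 0 and 3 are cited by exactly the same documents, so their row similarity is 1.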
The coefficients of R now represent a measure of similarity between documents
based on the number of overlapping direct citations. This concept may be extended
by using as a basis for the calculation of similarity coefficients not the
existence of direct links between documents (links of length one), but links of
length two, three, four, or more. Consider, as an example, a document collection
in which document A cites document B, or B cites A. The corresponding documents
are then said to be linked directly. On the other hand, if A does not cite B
but A cites (or is cited by) C which in turn cites (or is cited by) B, no direct
link exists between A and B. Instead, A and B are then linked by a path of length
two, since an extraneous document C exists between documents A and B. Similarly,
if the path between two documents includes two extraneous documents,
they are linked by a path of length three, and so on.
Given a square citation matrix X it is possible by matrix multiplication to
obtain matrices X', X'', etc., exhibiting respectively the existence of paths of
length two, three, and so on [11]. Specifically,

$$[X']_j^i = \bigvee_{k=1}^{m} \left( X_k^i \wedge X_j^k \right),$$

$$[X'']_j^i = \bigvee_{k=1}^{m} \left( X_k^i \wedge (X')_j^k \right), \text{ and so on.}$$
`
Boolean multiplication is used, since the new connection matrices X', X'', etc.,
are again defined as logical matrices. $(X')_j^i$ is then equal to 1 if and only if at
least one path of length two exists between documents $D_i$ and $D_j$; otherwise,
$(X')_j^i$ is equal to 0. It may be noted that X', unlike X, can have nonzero diagonal
elements, corresponding to the case where two documents mutually cite each
other.
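The Boolean products can be sketched in Python as follows. This is a minimal sketch assuming list-of-lists 0/1 matrices; the function names and the three-document example are hypothetical.

```python
def boolean_product(A, B):
    """Boolean matrix product: result[i][j] = OR over k of (A[i][k] AND B[k][j])."""
    m = len(A)
    return [[int(any(A[i][k] and B[k][j] for k in range(m))) for j in range(m)]
            for i in range(m)]

def path_matrices(X, max_length):
    """Return [X, X', X'', ...] for path lengths 1 through max_length."""
    powers = [X]
    for _ in range(max_length - 1):
        # Each new matrix is the Boolean product of X with the previous one.
        powers.append(boolean_product(X, powers[-1]))
    return powers

# Toy citation matrix: X[i][j] = 1 if document i is cited by document j.
# Documents 0 and 1 cite each other; document 1 is also cited by document 2.
X = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 0, 0],
]
X1, X2, X3 = path_matrices(X, 3)
print(X2)
```

In this example the mutual citation between documents 0 and 1 produces ones on the diagonal of X', as noted in the text.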
As before, the cosine measure can again be used to obtain a row or column
similarity matrix R' from citation matrix X', or R'' from X'', and so on. These
new correlation matrices measure document similarity based on common citation
links of length two, three, and so on. While it is theoretically possible to
work with correlation matrices $R^n$ based on overlapping citation links of length n,
it is likely that the similarity between subject matter and citations will diminish
rapidly as the length of the citation links increases. For present purposes, the
investigation of overlapping citations is therefore restricted to the consideration
of direct links and links of length two, three, and four.
A measure of similarity based on the number of overlapping index terms between
documents can be obtained as explained previously, by starting with a
term-document incidence matrix C similar to that shown in Figure 2(a), and
using the cosine measure to compute the elements $S_j^i$ of an m by m symmetric
similarity matrix S. If the documents appear as column headings as in Figure
2(a), the similarity coefficients are obtained by matching pairs of columns of C.
Since the term-document matrix C is not in general a square matrix, matrix
multiplication cannot be used to obtain second order effects, similar to the citation
links of length two or more. Instead, it is first necessary to compare the
index terms by performing a row comparison of the rows of C. This produces a
new n by n symmetric term matrix C* which displays similarity between index
terms. This matrix can be used to eliminate from the set of index terms those
terms which exhibit a large number of joint occurrences with other terms. A
reduced set of index terms can then be formed and a new term-document matrix
C' constructed, from which a new correlation matrix S' is formed. Higher order
term similarity can of course be obtained if desired by squaring or cubing C*.
The last step needed in the testing procedure is a comparison between the
document similarity coefficients obtained from the citations and the coefficients
obtained from the index terms. Specifically, it is necessary to compare the elements
of matrices R, R', R'', and so on, with the equivalent ones in S or S'. Since
the comparison of individual pairs of equivalent coefficients may not be very
meaningful, one coefficient will be computed for each document by comparing
equivalent document rows in R and S. The cosine measure can again be used for
that purpose in the form

$$x_i = \cos(R^i, S^i)$$

to obtain a "cross-correlation vector" x. Each element of a cross-correlation
vector is thus a measure of similarity for a given document, derived by comparing
similarity coefficients obtained from citations with the corresponding similarity
coefficients obtained from index terms for that given document. Large values of
the vector element $x_i$ will indicate a close similarity between the measures obtained
from citations and those obtained from index terms, since then nonzero
terms in $R^i$ will correspond to nonzero terms in $S^i$. Small values of the element
$x_i$, on the other hand, will indicate no similarity in the two types of measures.
A single "overall" cross-correlation coefficient can also be obtained for the
complete document set, in addition to the cross-correlation vector, by comparing
the complete matrix R with the complete matrix S. This is done by considering
all elements within a given matrix as belonging to a single vector of dimension
$m^2$, and comparing the two resulting vectors by means of the cosine measure.
The "overall" cross-correlation coefficient will be large if many of the
cross-correlation vector elements are large, that is, if for many documents large
similarity coefficients obtained from citations correspond to large coefficients obtained
from index terms.
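Both cross-correlation measures reduce to cosine computations, as the following Python sketch suggests. The two 3 by 3 similarity matrices are invented for illustration, the function names are hypothetical, and the diagonal elements are included in the comparison as an assumption, since the text does not say whether they are excluded.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two numeric vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def cross_correlation_vector(R, S):
    """x_i = cosine between row i of R and row i of S, one value per document."""
    return [cosine(r_row, s_row) for r_row, s_row in zip(R, S)]

def overall_cross_correlation(R, S):
    """Cosine between R and S, each flattened into a vector of dimension m*m."""
    flat_r = [value for row in R for value in row]
    flat_s = [value for row in S for value in row]
    return cosine(flat_r, flat_s)

# Hypothetical citation-based (R) and term-based (S) document similarity matrices.
R = [[1.0, 0.8, 0.0],
     [0.8, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
S = [[1.0, 0.6, 0.0],
     [0.6, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
x = cross_correlation_vector(R, S)
overall = overall_cross_correlation(R, S)
print(x, overall)
```

Here the nonzero entries of R and S fall in the same positions, so every element of x and the overall coefficient are close to 1.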
The complete procedure is summarized in the flow-chart of Figure 5. For the
`
`
[Flowchart; boxes as labeled in the figure]
- Consider with each document the set of applicable index terms and the set of
  applicable citations
- Construct a term-document incidence matrix C listing documents against included terms
- Construct a document-document similarity matrix S based on overlapping index terms
- Construct a citation incidence matrix X listing each cited document against all citing
  documents
- Construct a document-document similarity matrix R based on overlapping citations
- Compute a cross-correlation vector x and overall cross-correlation coefficient to measure
  similarities between document rows of R and S, and between the complete matrices,
  respectively
- Construct a term-term similarity matrix C* and use it to generate new term-document
  matrices C', C'', and so on
- Construct squared, cubed, and higher-power incidence matrices X', X'', and so on,
  exhibiting citation links of length two, three, and so on

FIG. 5. Comparison of citation similarities with index term similarities
`
actual experiment, a collection of sixty-two documents dealing with linguistics
and machine translation was chosen. A set of fifty-six index terms was used for
manual indexing of the documents. The two basic inputs used for the computer
experiments were thus logical matrices of dimension 62 by 62 and 62 by 56,
listing, respectively, cited versus citing documents, and documents versus terms.
These two input matrices correspond in format to the examples of Figures 4
and 2(a). From the two logical input matrices a set of three principal and six
auxiliary similarity matrices was obtained by performing comparisons between
pairs of rows or columns. The similarity matrices all correspond in format to the
example of Figure 2(b). The CITED and CITNG similarity matrices of dimension
62 by 62 were obtained from the original citation matrix by row and column
comparisons, respectively. The TDCMP similarity matrix, also of dimension 62
by 62, was similarly obtained by column comparisons from the original
term-document matrix. Additional citation similarity matrices, designated CTD2,
CTD3, CTD4, and CNG2, CNG3, CNG4 were obtained from the squared, cubed,
and fourth power logical citation matrices, as previously explained.
`
`
[Plot: overall coefficients (vertical axis, 0 to 0.5) for the actual document collection
(CITED and CITING curves) and for the pseudo-collection. Horizontal axis: the citation
similarity matrices CITED, CTD2, CTD3, CTD4, CITNG, CNG2, CNG3, and CNG4, each compared
with TDCMP.]

FIG. 6. Comparison of overall similarity coefficients
`
Eight cross-correlation operations were performed by correlating each of the
`eight citat