Associative Document Retrieval Techniques Using
Bibliographic Information*

GERARD SALTON

Harvard University,† Cambridge, Massachusetts

Abstract. Automatic documentation systems which use the words contained in the
individual documents as a principal source of document identifications may not perform
satisfactorily under all circumstances. Methods have therefore been devised within the
last few years for computing association measures between words and between documents,
and for using such associated words, or information contained in associated documents, to
supplement and refine the original document identifications. It is suggested in this study
that bibliographic citations may provide a simple means for obtaining associated documents
to be incorporated in an automatic documentation system.
The standard associative retrieval techniques are first briefly reviewed. A computer
experiment is then described which tends to confirm the hypothesis that documents
exhibiting similar citation sets also deal with similar subject matter. Finally, a fully
automatic document retrieval system is proposed which uses bibliographic information in
addition to other standard criteria for the identification of document content, and for the
detection of relevant information.

1. Introduction
In recent years considerable attention has been devoted to the design of automatic
documentation systems. If the system is to operate fully automatically,
the intervention of human experts for the analysis of document content and for
the preparation of document identifications ought to be eliminated. Under these
circumstances the retrieval system must of necessity be based primarily on the
words occurring in the individual texts, and on the terms used to formulate the
search requests.
It has been suggested [1] that an acceptable system can be generated by
extracting from the texts and from the information requests those linguistic units
which are believed to be representative of document content, and by defining a
standard of comparison between words extracted from documents and words
used in the requests for documents. To determine which words are particularly
significant as an indication of document content a variety of criteria may be
used, including the position of the words in the texts, the word types, the
vocabulary size, and most importantly the frequency of occurrence of the individual
words. The most significant words are then used as "index terms" to characterize
the documents, and the most significant sentences, that is, those containing a
large number of significant words, are used as abstracts for the documents.
A typical automatic indexing and abstracting system based on word frequency
counts is shown in Figure 1. The principal drawback of the system outlined in
Figure 1 is the lack of any normalization procedure designed to take into account
differences between individual authors or between individual document types.
Thus, a given set of documents covering some homogeneous subject area may
quite possibly give rise to many different index sets. Similarly, completely
different document sets may be obtained in answer to only slightly differing search
requests.

* Received July, 1962; revised March, 1963. This study was supported in part by the
Air Force Cambridge Research Laboratories and in part by Sylvania Electric Products, Inc.
† Computation Laboratory.

Facebook Ex. 1012

[Figure 1: flowchart. Linear text; itemize words in the text; combine varying forms of
similar words, e.g., by deletion of word suffixes; perform word frequency counts and
eliminate high-frequency function words. One branch computes an index of significance
for all sentences based on the number of included significant words and collects the most
significant sentences to form an "automatic abstract"; the other computes an index of
significance for the remaining words based on frequency of occurrence and generates a
list of significant words to serve as "index terms" representing document content.]
FIG. 1. Typical automatic indexing and abstracting system based on word frequency
counts.

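As a rough illustration of the word-frequency procedure of Figure 1, the following Python sketch extracts index terms and a one-sentence abstract from a short text. The function-word list, the suffix-stripping rule, and the names `crude_stem`, `index_terms` and `abstract_sentences` are assumptions made for this example; they are not taken from the system described here.

```python
import re
from collections import Counter

# Hypothetical function-word list; a real system would use a much larger one.
FUNCTION_WORDS = {"the", "of", "a", "an", "and", "in", "to", "is", "are", "for"}

def crude_stem(word):
    """Strip a few common suffixes (illustrative only, not a real stemmer)."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_terms(text, top_k=3):
    """Return the top_k most frequent stemmed non-function words."""
    words = re.findall(r"[a-z]+", text.lower())
    stems = [crude_stem(w) for w in words if w not in FUNCTION_WORDS]
    return [term for term, _ in Counter(stems).most_common(top_k)]

def abstract_sentences(text, terms, top_n=1):
    """Rank sentences by the number of significant words they contain."""
    significant = set(terms)
    sentences = [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]

    def score(sentence):
        stems = [crude_stem(w) for w in re.findall(r"[a-z]+", sentence.lower())]
        return sum(1 for s in stems if s in significant)

    return sorted(sentences, key=score, reverse=True)[:top_n]
```

Because the sketch applies no normalization across authors or document types, it shares the drawback noted above: two texts on the same subject with different vocabularies yield different index sets.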
In order to reduce the importance attached to the individual words and to their
frequencies of occurrence, the introduction of a synonym dictionary, or thesaurus,
is often proposed. All words extracted from documents or search requests could
then be replaced by standard thesaurus forms before being used. This solution,
while attractive in theory, is difficult to implement because no definite criteria
exist for the construction of good or useful thesauruses, and because the
generation of any thesaurus is a complex and time-consuming undertaking. For this
reason, several workers [2, 3, 4, 5] have been interested in automatic procedures
designed to supplement the original terms extracted from the documents with
new terms related to the old ones in various ways. Indexing techniques which
make use of such "associated" terms have come to be known as "associative
indexing," and the corresponding retrieval operations are known as "associative
retrieval."
The present report suggests an extension of the usual associative retrieval
techniques by taking into account bibliographic citations and other information
peculiar to the author of a given document. It is suggested, specifically, that the
set of identifying words extracted from the documents be supplemented by new
words obtained in part from the bibliographic information provided with the
documents; these new expanded sets of index terms may then give a more
accurate representation of document content than the original ones and may thus
provide a more effective retrieval mechanism.
The standard associative indexing techniques are first briefly reviewed.
Thereafter, some properties of bibliographic citations are described, and the role of
bibliographic information as an indication of document content is evaluated. A
small computer experiment using citations is then summarized and the
significance of the numeric results is discussed. Finally, a proposed fully automatic
document retrieval system using bibliographic information in addition to other
criteria is described.

2. Associative Information Retrieval
Most associative retrieval systems are based on the statistical word frequency
counting procedures previously illustrated in Figure 1. Thus, given a document
collection, it is possible to extract a set of n distinct high-frequency
words $W_1, W_2, \ldots, W_n$, such that each document within the collection is initially
identified by some subset of the set of n given words.
In practical retrieval systems, it becomes useful to provide for some additional
flexibility. For example, given a search request expressed in terms of words in the
natural language, it may be convenient to alter somewhat the original request,
either by making it more specific and thus presumably reducing the size of the
document set which fulfills the request, or, alternatively, by making it more
general. In the same way, given a set of terms identifying a specified document,
it may be useful to alter somewhat the original set by deletion of old terms or
addition of new ones in such a way that documents dealing with similar subject
matter are identified by similar sets of index terms.
An analogous problem arises in connection with the document sets which are
obtained in answer to certain search requests. It is often useful to alter these
document sets by addition of further documents which may also have some
relevance or, alternatively, by deletion of documents which are not directly
relevant. Both questions can be treated by determining a measure of association
between words or index terms on the one hand and between documents on the
other, and by using this association measure for the alteration of the
corresponding index term and document subsets.
Consider first the problem of word associations. Words may be related in
many different ways: for example, they may exhibit the same word stems, or
they may have similar syntactic properties, or they may be usable in the same
contexts, and so on. The criteria of association used in most automatic programs
do not normally require a determination of syntactic or semantic properties.
Rather, they are based on simple co-occurrence of words in the same texts or
sentences, or on co-occurrence with individual or joint frequencies greater than
some given threshold value.

[Figure 2: (a) typical term-document incidence matrix C, with terms $W_1, \ldots, W_n$ as rows
and documents $D_1, \ldots, D_m$ as columns ($C_j^i = n$ if and only if document $D_j$ contains
term $W_i$ exactly n times); (b) typical term-term similarity matrix R, with terms
$W_1, \ldots, W_n$ as both rows and columns.]
FIG. 2. Matrices used for the generation of term associations

Given a set of m documents and a set of n index terms, a typical procedure for
the generation of term associations is as follows:

(a) a term-document incidence matrix C is constructed which lists index terms against
    documents; matrix element $C_j^i$ is defined to be equal to k if and only if document j
    contains term i exactly k times;
(b) a coefficient of similarity between terms is then defined based on the frequency of
    co-occurrence of pairs of terms in the individual documents;
(c) a term-term similarity matrix R is then generated which exhibits all similarity
    coefficients between pairs of index terms;
(d) term associations are defined for those pairs whose associated similarity coefficient is
    greater than some stated threshold value.

A sample term-document incidence matrix C is shown in Figure 2(a). To
obtain a coefficient of similarity between two terms based on the frequency of
co-occurrence in the documents of a given collection, it is only necessary to
perform a pairwise comparison of the corresponding rows of C. Many different
types of similarity coefficients have been suggested in the literature [2, 3, 4, 5];
a simple coefficient of similarity between rows of a numeric matrix, and one which
may be as meaningful as any of the others, is the cosine of the angle between the
corresponding m-dimensional vectors [6]. The similarity coefficients can be
displayed in an n × n symmetric term-similarity matrix R, where the coefficient of
similarity $R_j^i$ between term $W_i$ and term $W_j$ is

$$R_j^i = R_i^j = \frac{\sum_{k=1}^{m} C_k^i C_k^j}{\sqrt{\sum_{k=1}^{m} (C_k^i)^2 \, \sum_{k=1}^{m} (C_k^j)^2}}.$$

The term-similarity matrix R corresponding to the term-document matrix C
of Figure 2(a) is shown in Figure 2(b). Since R is symmetric, only the right
(or left) triangular part of R must be scanned in order to detect pairs of terms
with large similarity coefficients.
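Steps (a) through (d), with the cosine taken as the similarity coefficient, can be sketched in Python as follows. The matrix values, the threshold, the function names, and the convention that a zero vector has similarity 0 with everything are assumptions chosen for this example.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length numeric vectors.

    Returns 0.0 when either vector is all zeros (an assumed convention)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / norm if norm else 0.0

def term_similarity_matrix(C):
    """R[i][j] = cosine between rows i and j of the term-document matrix C."""
    n = len(C)
    return [[cosine(C[i], C[j]) for j in range(n)] for i in range(n)]

def term_associations(C, threshold):
    """Scan only the upper triangle of the symmetric matrix R."""
    R = term_similarity_matrix(C)
    n = len(R)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if R[i][j] > threshold]

# Rows are terms W_1..W_3, columns are documents D_1..D_4 (made-up counts).
C = [
    [2, 0, 1, 0],
    [4, 0, 2, 0],
    [0, 3, 0, 1],
]
associated = term_associations(C, threshold=0.9)
```

Here the first two terms occur in the same documents in the same proportions, so they form the only associated pair; the third term shares no document with either.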
To generate document associations instead of term associations the same
procedures can be used, since the strength of association between documents may be
conveniently assumed to be a function of the number and frequencies of the
shared terms in their respective term lists. Document similarities are therefore
obtained by comparing pairs of columns (instead of rows) of the term-document
matrix C, and a document-document similarity matrix is constructed and used
in the same way as the previously described term-term matrix R.
Consider now a typical system for document retrieval using term and
document associations as shown in Figure 3. A list of high-frequency terms is first
generated for each document by word frequency counting procedures.
Normalization may or may not be effected by thesaurus lookup. A term-term similarity
matrix is then constructed by using co-occurrence of terms within sentences,
rather than within documents, as a criterion. It should be noted that as new term
associations are defined, the original incidence matrix can be revised by inclusion
in some of the matrix columns of new, associated terms which are not originally
contained in the respective sentences or documents. The revised incidence matrix
then gives rise to a new term-term similarity matrix, incorporating second-order
associations, and so on. This feedback process is represented by an
upward-pointing arrow in Figure 3.
To retrieve documents in answer to search requests, the programs already
available can be used by adding to the term-document matrix C a new column
$C_{m+1}$, representing the request terms. Specifically, element $C_{m+1}^k$ is set equal to
w if term $W_k$ is used in the search request with weight w; if word $W_k$ is not used
in the given search request $C_{m+1}^k$ is set equal to 0. If no weights are specified by
the requestor the values of the elements of column $C_{m+1}$ are restricted to 0 and 1.
An estimate of document relevance is then obtained by computing for each
document the similarity coefficient between the request column $C_{m+1}$ and the
respective document column. The documents can be arranged in decreasing order of
similarity coefficients, and all documents with a sufficiently large coefficient
can be judged to be relevant to the given request. Clearly, the final relevance
criterion depends not only on the terms assigned to the various documents or
on the words used in the documents and search requests, but also on other terms
associated with the original ones through co-occurrence in a given document
collection.
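The retrieval step just described can be illustrated with a short Python sketch in which the request is treated as an extra column and compared with every document column by the cosine measure. The example matrix, the request weights, the threshold, and the name `rank_documents` are assumptions for illustration only.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_documents(C, request, threshold=0.0):
    """Return (document index, coefficient) pairs in decreasing order.

    C is a term-document matrix (rows = terms, columns = documents) and
    request is the request column, one weight per term."""
    m = len(C[0])
    columns = [[row[j] for row in C] for j in range(m)]
    scored = [(j, cosine(col, request)) for j, col in enumerate(columns)]
    relevant = [(j, s) for j, s in scored if s > threshold]
    return sorted(relevant, key=lambda pair: pair[1], reverse=True)

# Made-up matrix: 3 terms by 4 documents.
C = [
    [2, 0, 1, 0],   # W_1
    [0, 1, 1, 0],   # W_2
    [0, 3, 0, 1],   # W_3
]
request = [1, 0, 0]   # unweighted request using only term W_1
ranked = rank_documents(C, request)
```

With these values only the first and third documents contain the requested term, and the first ranks higher because it contains nothing else.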

[Figure 3: flowchart. For each document generate a list of high-frequency words to serve as
"index terms" (see Fig. 1); thesaurus look-up and substitution of thesaurus heads for
high-frequency terms; construct a term-sentence incidence matrix listing sentences against
included terms; compute the term-term similarity matrix and generate term associations
for document identification; construct a term-document incidence matrix listing documents
against included terms; compute the document-document similarity matrix and generate
document associations; compare the vector of request terms with the term-document
incidence matrix and identify relevant documents.]
FIG. 3. Typical automatic document retrieval system using term and document associations
(dashed arrows: optional paths; solid arrows: compulsory paths)

3. Bibliographic Information as a Factor in Content Analysis
In the preceding section procedures were described for expanding the set of
terms used as document identifications by inclusion of information derived
from the texts of other documents in the same collection. Since the retrieval
effectiveness depends to a large measure on the accuracy and completeness of
the content identifications, it is of interest to inquire whether additional pertinent
data available with many documents might not also be used to provide important
content indications. In particular, it may be conjectured that information
associated with the author of a given document, for example data contained in
related publications of the same author, may furnish usable content indicators.
The same considerations may also apply to information obtained from
publications cited by a given author in his list of references, or from those citing the given
document.
An attempt is therefore made in the next few sections to evaluate the utility
of bibliographic citations as an aid to automatic content analysis. When this
problem is first considered, the initial reaction must clearly be one of skepticism.
Indeed, it is well known that many different practices are followed by individual
authors of technical papers in the construction of bibliographies. Most important
as a controlling element is the document type. Survey and tutorial articles carry
more extensive bibliographies than specific research reports. Similarly, articles
covering a wide variety of topics may be cited more frequently than others
which are more specialized. As a result, two articles which cover identical topics
from somewhat different points of view may include quite distinct bibliographies.
A second important criterion is the availability of the cited document. Thus
reports included in certain books or in important journals are likely to be cited
more often than those not generally available to the public. By the same token,
unclassified papers are cited more freely than classified ones. The date of
publication is a related factor which also affects the probability of being cited. Very
recent documents which have not had a chance to circulate, and very old ones
which no longer circulate are, in general, cited more rarely than current articles
which have been distributed within the recent past.
A final consideration pertains specifically to the author of a document. In
many cases personal preferences are evident both as to number and type of
papers cited; authors have varying backgrounds, and there may also exist a
tendency toward self-citation regardless of relevancy.
Because of these and other variations, citation and reference lists¹ have not
generally been used as an indication of document content. Rather, such lists
are used to detect trends in the literature as a whole, and to serve as adjuncts to
certain kinds of literature searches [7, 8]. Citation indexes have, for example, been
used in attempts to identify significant research by equating frequency of
citation with relative significance of the subject matter [9]; they have also served to
trace the flow of information and to measure the relative importance of various
journals to the scientific community [10].
There exists considerable evidence, however, that in addition to being useful
for the above-mentioned standard applications, bibliographic information might
also help in content analysis. First, it is clear that for the large majority
of authors, at least some of the items included in a reference list will be highly
pertinent documents whose subject matter overlaps drastically with that of the
citing document. The same is true of other documents published by the same
author whether they appear as part of the references or not. Second, it has been
stated that many experts who are reasonably familiar with research and
development in their field of activity use a citation index in preference to a subject index
by following up the references to the standard, well-known works which are
judged important. In such cases bibliographic references are then in fact used as
content indicators. Third, citations are relatively simple to incorporate in an
automatic system, since they are directly available with most technical
documents, and since they can be manipulated automatically nearly as easily as
ordinary running text.
If it could be shown that citations were usable as content indicators, then the
associative techniques described in Section 2 could be further refined by adding
to the term-document matrix illustrated in Figure 2(a) further document
columns representing cited documents, citing documents, or documents written
by the same author. These new documents would then provide new associated
terms which might be equally as important as the term associations derived from
other documents in the same collection.

¹ A citation index consists of a set of bibliographic references (the set of cited documents),
each followed by a list of all those documents (the citing documents) which include the
given cited document as a reference. A reference index, on the other hand, lists all cited
documents under each citing document.

[Figure 4: square logical matrix X with the cited documents $D_1, \ldots, D_m$ as rows and the
citing documents $D_1, \ldots, D_m$ as columns ($X_j^i = 1$ if document $D_i$ is cited by
document $D_j$).]
FIG. 4. Matrix X exhibiting direct citations

To test the significance of bibliographic citations, a comparison was made
between citation similarities and index term similarities for an indexed document
collection. Specifically, a measure of similarity was computed between each pair
of documents in the collection, based on the number of overlapping index terms;
a similar measure was then computed for the same pairs of documents, based on
the number of overlapping citations; finally, the similarity measures obtained
from index terms and citations respectively were compared by calculating a
similarity index between citation similarities and index term similarities. An
overall measure was also computed for the complete document collection by
taking into account the similarity measures between all document pairs.
If use were to be made of bibliographic information for purposes of content
analysis, one would hope that large similarity measures between specific
document pairs due to overlapping index terms would also be reflected in large
measures between the same document pairs due to overlapping citations, or vice versa.
To test this hypothesis, the data obtained for the actual document collection
were compared with data obtained from a randomly constructed, fictitious
collection for which no correlation should exist between index terms and citations. The
actual procedures used in the experiment are summarized in the next section.²

² A more complete exposition of the experiment is given in [oj.

4. Comparison of Citation Similarities with Index Term Similarities
Consider a collection of m documents each of which is characterized by the
property of being cited by one or more of the other documents in the same
collection. Each document can then be represented by an m-dimensional logical vector
$X^i$, where $X_j^i = 1$ if and only if document i is cited by document j, and $X_j^i = 0$
otherwise. If these m vectors are arranged in rows one below the other a square
logical incidence matrix is formed similar to the matrix exhibited in Figure 4.
The number of ones in the row vectors of X represents the degree of "citedness"
for the documents listed at the head of the rows. Similarly, the number of ones
in the column vectors of X measures the amount of "citing" for the documents
listed at the head of the columns. For a closed document collection, the set of row
identifiers is the same as the set of column identifiers as shown in Figure 4.
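The citation matrix X can be assembled directly from reference lists. The Python sketch below is a minimal illustration under assumed inputs: a reference index is represented as a dictionary from each citing document to the documents it cites, and the names `citation_index` and `citation_matrix` are hypothetical.

```python
from collections import defaultdict

def citation_index(reference_index):
    """Invert {citing: [cited, ...]} into {cited: [citing, ...]}."""
    index = defaultdict(list)
    for citing, cited_list in reference_index.items():
        for cited in cited_list:
            index[cited].append(citing)
    return {cited: sorted(citing) for cited, citing in index.items()}

def citation_matrix(reference_index, documents):
    """Build X with X[i][j] = 1 if documents[i] is cited by documents[j].

    References to documents outside the closed collection are ignored."""
    position = {doc: k for k, doc in enumerate(documents)}
    m = len(documents)
    X = [[0] * m for _ in range(m)]
    for citing, cited_list in reference_index.items():
        for cited in cited_list:
            if cited in position and citing in position:
                X[position[cited]][position[citing]] = 1
    return X

# Made-up closed collection of three documents.
references = {"A": ["B"], "B": ["A"], "C": ["B"]}
X = citation_matrix(references, ["A", "B", "C"])
citedness = [sum(row) for row in X]          # ones per row
citing = [sum(col) for col in zip(*X)]       # ones per column
```

The row sums and column sums correspond to the degree of "citedness" and the amount of "citing" described above.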
A measure of similarity between row (column) vectors can be obtained by
calculating the cosine factor, previously exhibited in Section 2, for each pair of
rows (columns). The result of such a computation can again be represented by
a similarity matrix R, similar to that shown in Figure 2(b), where $R_j^i$ is the value
of the similarity coefficient between the ith and jth rows (columns) of X.
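A small Python sketch of the row and column comparisons on a citation matrix follows. The four-document matrix is invented for illustration, the function names are hypothetical, and treating an all-zero row or column as having similarity 0 with every other vector is an assumed convention.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / norm if norm else 0.0

def row_similarity(X):
    """Similarity matrix obtained by comparing pairs of rows of X."""
    n = len(X)
    return [[cosine(X[i], X[j]) for j in range(n)] for i in range(n)]

def column_similarity(X):
    """Similarity matrix obtained by comparing pairs of columns of X."""
    columns = [list(col) for col in zip(*X)]
    return row_similarity(columns)

# X[i][j] = 1 if document i is cited by document j (made-up example).
X = [
    [0, 1, 1, 0],   # D_1 is cited by D_2 and D_3
    [0, 0, 1, 0],   # D_2 is cited by D_3
    [0, 0, 0, 0],   # D_3 is cited by nobody
    [0, 1, 1, 0],   # D_4 is cited by D_2 and D_3
]
cited_similarity = row_similarity(X)       # documents cited by the same papers
citing_similarity = column_similarity(X)   # documents citing the same papers
```

In this example the first and fourth documents are cited by exactly the same documents, so their row similarity is 1.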
The coefficients of R now represent a measure of similarity between documents
based on the number of overlapping direct citations. This concept may be
extended by using as a basis for the calculation of similarity coefficients not the
existence of direct links between documents (links of length one), but links of
length two, three, four, or more. Consider, as an example, a document collection
in which document A cites document B, or B cites A. The corresponding
documents are then said to be linked directly. On the other hand, if A does not cite B
but A cites (or is cited by) C which in turn cites (or is cited by) B, no direct
link exists between A and B. Instead, A and B are then linked by a path of length
two, since an extraneous document C exists between documents A and B.
Similarly, if the path between two documents includes two extraneous documents,
they are linked by a path of length three, and so on.
Given a square citation matrix X it is possible by matrix multiplication to
obtain matrices X', X'', etc., exhibiting respectively the existence of paths of
length two, three, and so on [11]. Specifically,

$$[X']_j^i = \bigvee_{k=1}^{m} \left( X_k^i \wedge X_j^k \right),$$
$$[X'']_j^i = \bigvee_{k=1}^{m} \left( X_k^i \wedge (X')_j^k \right), \text{ and so on.}$$

Boolean multiplication is used, since the new connection matrices X', X'', etc.,
are again defined as logical matrices. $(X')_j^i$ is then equal to 1 if and only if at
least one path of length two exists between documents $D_i$ and $D_j$; otherwise,
$(X')_j^i$ is equal to 0. It may be noted that X', unlike X, can have nonzero diagonal
elements, corresponding to the case where two documents mutually cite each
other.
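The Boolean products above can be computed with a few lines of Python. This is a minimal sketch on an invented three-document collection; the name `boolean_product` is hypothetical.

```python
def boolean_product(A, B):
    """Boolean matrix product: result[i][j] = OR over k of (A[i][k] AND B[k][j])."""
    m = len(A)
    return [[int(any(A[i][k] and B[k][j] for k in range(m)))
             for j in range(m)]
            for i in range(m)]

# X[i][j] = 1 if document i is cited by document j (made-up example).
X = [
    [0, 1, 0],   # D_1 is cited by D_2
    [1, 0, 1],   # D_2 is cited by D_1 and D_3
    [0, 0, 0],   # D_3 is cited by nobody
]
X2 = boolean_product(X, X)    # X':  paths of length two
X3 = boolean_product(X, X2)   # X'': paths of length three
```

Because the first two documents cite each other in this example, X' has ones on the diagonal for both of them, which is the case noted in the text.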
As before, the cosine measure can again be used to obtain a row or column
similarity matrix R' from citation matrix X', or R'' from X'', and so on. These
new correlation matrices measure document similarity based on common
citation links of length two, three, and so on. While it is theoretically possible to
work with correlation matrices $R^n$ based on overlapping citation links of length n,
it is likely that the similarity between subject matter and citations will diminish
rapidly as the length of the citation links increases. For present purposes, the
investigation of overlapping citations is therefore restricted to the consideration
of direct links and links of length two, three, and four.
A measure of similarity based on the number of overlapping index terms
between documents can be obtained as explained previously, by starting with a
term-document incidence matrix C similar to that shown in Figure 2(a), and
using the cosine measure to compute the elements $S_j^i$ of an m by m symmetric
similarity matrix S. If the documents appear as column headings as in Figure
2(a), the similarity coefficients are obtained by matching pairs of columns of C.
Since the term-document matrix C is not in general a square matrix, matrix
multiplication cannot be used to obtain second order effects, similar to the
citation links of length two or more. Instead, it is first necessary to compare the
index terms by performing a row comparison of the rows of C. This produces a
new n by n symmetric term matrix C* which displays similarity between index
terms. This matrix can be used to eliminate from the set of index terms those
terms which exhibit a large number of joint occurrences with other terms. A
reduced set of index terms can then be formed and a new term-document matrix
C' constructed, from which a new correlation matrix S' is formed. Higher order
term similarity can of course be obtained if desired by squaring or cubing C*.
The last step needed in the testing procedure is a comparison between the
document similarity coefficients obtained from the citations and the coefficients
obtained from the index terms. Specifically, it is necessary to compare the
elements of matrices R, R', R'', and so on, with the equivalent ones in S or S'. Since
the comparison of individual pairs of equivalent coefficients may not be very
meaningful, one coefficient will be computed for each document by comparing
equivalent document rows in R and S. The cosine measure can again be used for
that purpose in the form

$$x_i = \cos(R^i, S^i)$$

to obtain a "cross-correlation vector" x. Each element of a cross-correlation
vector is thus a measure of similarity for a given document, derived by comparing
similarity coefficients obtained from citations with the corresponding similarity
coefficients obtained from index terms for that given document. Large values of
the vector element $x_i$ will indicate a close similarity between the measures
obtained from citations and those obtained from index terms, since then nonzero
terms in $R^i$ will correspond to nonzero terms in $S^i$. Small values of the element
$x_i$, on the other hand, will indicate no similarity in the two types of measures.
A single "overall" cross-correlation coefficient can also be obtained for the
complete document set, in addition to the cross-correlation vector, by comparing
the complete matrix R with the complete matrix S. This is done by considering
all elements within a given matrix as belonging to a single vector of dimension
$m^2$, and comparing the two resulting vectors by means of the cosine measure.
The "overall" cross-correlation coefficient will be large if many of the
cross-correlation vector elements are large, that is, if for many documents large
similarity coefficients obtained from citations correspond to large coefficients obtained
from index terms.
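Both the cross-correlation vector and the overall coefficient reduce to cosine computations, as the following Python sketch shows. The two 3 by 3 similarity matrices are invented values, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / norm if norm else 0.0

def cross_correlation_vector(R, S):
    """x[i] = cosine between row i of R (citations) and row i of S (terms)."""
    return [cosine(r_row, s_row) for r_row, s_row in zip(R, S)]

def overall_cross_correlation(R, S):
    """Treat each m-by-m matrix as one vector of dimension m*m and compare."""
    flat_r = [value for row in R for value in row]
    flat_s = [value for row in S for value in row]
    return cosine(flat_r, flat_s)

# Made-up similarity matrices for a three-document collection.
R = [[1.0, 0.5, 0.0],
     [0.5, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
S = [[1.0, 0.0, 0.5],
     [0.0, 1.0, 0.0],
     [0.5, 0.0, 1.0]]
x = cross_correlation_vector(R, S)
overall = overall_cross_correlation(R, S)
```

In this made-up case the diagonal entries dominate, so all coefficients are fairly large even though the off-diagonal similarities disagree; whether diagonal elements should be included is a design choice the sketch does not settle.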
The complete procedure is summarized in the flow-chart of Figure 5. For the
actual experiment, a collection of sixty-two documents dealing with linguistics
and machine translation was chosen. A set of fifty-six index terms was used for
manual indexing of the documents. The two basic inputs used for the computer
experiments were thus logical matrices of dimension 62 by 62 and 62 by 56,
listing, respectively, cited versus citing documents, and documents versus terms.
These two input matrices correspond in format to the examples of Figures 4
and 2(a). From the two logical input matrices a set of three principal and six
auxiliary similarity matrices was obtained by performing comparisons between
pairs of rows or columns. The similarity matrices all correspond in format to the
example of Figure 2(b). The CITED and CITNG similarity matrices of dimension
62 by 62 were obtained from the original citation matrix by row and column
comparisons, respectively. The TDCMP similarity matrix, also of dimension 62
by 62, was similarly obtained by column comparisons from the original
term-document matrix. Additional citation similarity matrices, designated CTD2,
CTD3, CTD4, and CNG2, CNG3, CNG4 were obtained from the squared, cubed,
and fourth power logical citation matrices, as previously explained.

[Figure 5: flowchart. With each document consider the set of applicable index terms and
the set of applicable citations. Index-term branch: construct a term-document incidence
matrix C listing documents against included terms, then a document-document similarity
matrix S based on overlapping index terms. Citation branch: construct a citation incidence
matrix X listing each cited document against all citing documents, then a
document-document similarity matrix R based on overlapping citations. Both branches feed
the computation of a cross-correlation vector and an overall cross-correlation coefficient
measuring similarities between document rows of R and S, and between the complete
matrices, respectively. Further boxes: construct a term-term similarity matrix and use it
to generate new term-document matrices C', C'', and so on; construct squared, cubed, ...
incidence matrices X', X'', ... exhibiting citation links of length two, three, and so on.]
FIG. 5. Comparison of citation similarities with index term similarities

[Figure 6: plot of overall cross-correlation coefficients (vertical axis, 0 to 0.5) for the
actual document collection and for the pseudo-collection; the horizontal axis lists the
citation similarity matrices CITED, CITNG, CTD 2, CTD 3, CTD 4, CNG 2, CNG 3 and CNG 4,
each paired with TDCMP.]
FIG. 6. Comparison of overall similarity coefficients

Eight cross-correlation operations were performed by correlating each of the
eight citat
