throbber
83. M. David and A C. Lamer, Science 251, 813
`(1992).
`84. K.-l Igarashi. M. David, A C. lamer. D. S. FlO(cid:173)
`bloom. Mol. Cell. Bioi. 13, 3984 (1993).
`85. R. SChreiber, personal communication.
`· 86. E. H. Ascher, H. Charllonneau, N. K. Tonks,
`Science253, 401 (1991).
`87. J. Mirk<Witcti andJ. E. Darnell Jr .. MeL Bioi. Ce/13,
`1085 (1992}.
`88. H. B. Sadowski, K. Shuat, J. E. Demel Jr., M. Z.
`Gilman, Science 261, 1739 (1993).
`89. 0. Silvemoinen, C. Schindler, J. Schlessinger, D.
`E. t.s-Jy, ibid., p. 1736.
`.
`SO. S. Ruff.Jamison. K. Chen. S. Cohen, ibid .. p. 1733.
`
`91. A C. Latner et al., ibid., p. 1730.
`92. H. Kotanides and N. C. Reich, ibid. 262, 1265
`(1993).
`.
`93. A Bonn!, 0. A F.rank. C. Schindler, M. E. Green(cid:173)
`berg, ibid., p. 1575.
`94. R. Graham and M. Gilman. ibid. 251, 189 (1991).
`95. D. E. levy, in Interferon: Principles and Medical
`Applicslions, S. Baron et at:, Eds. {University of
`Texas Medical Branch at Galveslon, Galveston,
`lX, 1992), pp. 161-173.
`96. S. Holland. G. R Slark. l M. Kerr, unpublished
`obsBMifions.
`.
`.97. D. Watllng, G. R Stark. I. M. Kerr, unpubUshed
`observations.
`
`Automatic Analysis, Theme
`Generation, and Summarization
`of Machine-Readable Texts
`
`Gerard Salton, James Allan, Chris Buckley, Amit Singhal
`Vast amounts of text material are now available In machine-readable fonn for automatic
`processing: Her4;1, approaches are outlined for rTJanipulatlng and accessing texts in arbitrary
`subject areas if! accordance with user needs. In particular, methods are given for deter(cid:173)
`mining text themes, traversing teXts selectively; and extracting summary statements that
`reflect text content
`
`Many kinds of texts are currently available
`in machine-readable form and are amenable
`to automatic processing. Because the avail(cid:173)
`able databases are large and cover many
`dif!'erent subject areas, automatic aids must
`be provided. to use~ interested in accessing
`the data. It has been ruggested that links be
`placed between rein ted pieces of text, con(cid:173)
`necting, for example, particular text para(cid:173)
`gmphs to other paragraphs covering related
`. subject matter. Such a linked text sauc(cid:173)
`ture, often catled hypertext, makes it pos(cid:173)
`sible for the reeder to start with particular ·
`text passages and use the linked structure to
`find related text elements (1). Unfortunate(cid:173)
`ly, until now, viable methods for automati(cid:173)
`cally building large hypertext structures and
`for using such structures in a sophisticated
`way have not been available. Here we give·
`methods for constructing text 'relation maps
`and for using text relations to access ai)d use
`text databases. In particular, we outline
`procedures for determining text themes, tra(cid:173)
`versing texts selectively, and extracting sum(cid:173)
`~ statements that rdlect text content.
`
`Text Analysis and Retrieva1:
`The Smart System
`
`The Smart system is a sop!}isticated text
`·retrieval tool, developed over the past 30
`· years, that is based on the vector space
`
`The authors are In Jhe Department of Computer ScJ.. ·
`ence, Cornell university, Ithaca. NY 14853-7501, USA
`
`model of retrieval (2) . In the vector space
`model, all information items-stored texts
`as well as information queries-are repre(cid:173)
`sented by sets, or vecto~. of terms. A term
`is typicatly a word, a word stem, or a phrase
`associated with the text under consider(cid:173)
`ation. In principle, the teims might be
`chosen from a controlled vocabulary Ust or
`a thesaurus, but b~cause of the dilliculties of
`constructing such controlled vocabu.laries
`for unrestricted topic areas, it is- convenient
`to derive the terms directly from the texts
`the
`under consideration. Collectively,
`tenns assigned to a particular text represent
`text content.
`Becaus~ the terms are not equally useful
`for content representation, it is important to
`introauce a term-weighting system that as(cid:173)
`signs high weights to terms deemed important
`and lower weights to the less important terms.
`J:.. powerful rmn-weighting system of this .
`kind is the weD-known equatiol;i '!, x life
`(tenn !Rquency rimes inverse collection
`frequency), which favors terms with a high
`frequency, (f.) in ·particular documents but
`with a low.~cy overall in the collection
`(JJ. Such terms distin_guish the documents in
`which they occur- from the remaining items.·
`When all texts or text queries are repre(cid:173)
`sented by weighted term vectors of the form
`D1 .., (dil, du, •.. , d;j, where du. is the
`weight assigned to term k in document D,; a
`similarity measure can he computed be(cid:173)
`tween pairs of vectors that reflects text
`simUariry. Thus, given document D1 and
`SCIENCE • VOL 264 • 3 JUNE 1994
`
`query Q1 (or sample document D1), a .simi(cid:173)
`larity computation of the form sim(D,, q)
`= l:l=t d;kdp. can produce a ranked list of
`documents in decreasing order of similarity
`with a query (or with a sample document).
`When ranked retrieval output is provided
`for the user, it is easy to use relevance
`feedback procedures to build impro~ed que(cid:173)
`ries on the basis of the relevance of previ(cid:173)
`ously retrieved materials •
`In the Smart system, the terms used to
`identify the text items are entities extract(cid:173)
`ed from the document texts after elimina(cid:173)
`tion of common words and removal of
`word suffixes. When the document vocab..
`ulary itself forms the basis for text content
`repres~ntation, distinct documents with
`large overlapping vocabularies may be dif- ·
`ficult to distinguish. For example, the
`vectors covering biographies of John Fitz(cid:173)
`gerald Kennedy and·Antbony M. Kenne(cid:173)
`dy, the current Supreme Court justice,
`will show many similarities because both
`Kennedys attended Harvard University,
`were high officials of the government, and
`had close relationships with U.S. presi(cid:173)
`dents. The global vector similarity fun<:(cid:173)
`tion described earlier cannot cope with
`ambiguities of · this kind by itself. An
`l!dditional step designed to verify that the
`matching vocabulary occurs loca!J,y in sim(cid:173)
`ilar contexts must therefore be introduced
`as part of the rel;rlevai algorithm. This is
`accomplished by insisting on certain local(cid:173)
`ly matching substructures, such as text
`se~tences or text paragraplu, in addition
`to the global vector match', before accept(cid:173)
`ing two texts as legitimately similar (.3).
`Consider, as an examp)e, a
`typical
`search conducted in the 29-velume Funk
`and Wagnalls encyclopedia, using as a que(cid:173)
`the
`ry
`text of article 9667, entided
`"William Uoyd Garrison" (Garrison wa5
`the best known of the American abolition(cid:173)
`ists, who opposed slavery in the early part of
`the 19th century) (4). The upper portion of
`Table 1 shows the top 10 items retriev~ in
`.response to a global vector 'comparison.
`The top retrieved item is article 9667 itself,
`with a petfect query similarity of 1.00,
`followed by additional articles dealing with
`abolitionism and the slavery issue, retrieved
`with lower siinilarity values.
`, The upper portion of Table 1 consists of
`relevant items only, with the exception of
`article 9628, entided "Gar," retrieved in
`position eight on the ranked UsC.: Gar is a
`type of fish, obviously unrelated to the
`slavery issue but erroneously retrieved be(cid:173)
`Cause truncated tenns were used in the text
`· vectors, and the truncated form of ''Oarri·
`·son" matches !'Gar." (Removal of "·ison"
`as part of the stemming process first reduced·
`"Garrison" to "Garr," as in "comparison"
`and "compar''; removal of the duplicated
`consonant then reduc!ld "Garr" to the final
`
`1421
`
`EXHIBIT 2031
`Facebook, Inc. et al.
`v.
`Software Rights Archive, LLC
`CASE IPR2013-00479
`
`

`

`"Gar.") The lower portion of Table 1 shows
`the results obtained with an additional local
`text comparison that required at least one
`matching text sentence between the query
`article and each retrieved document. There
`are no matching sentences in documents
`9667 ("Garrison") and 9628 ("Gar"), be(cid:173)
`cause gar, meaning fish, and "Gar" derived
`'from the name Garrison are obviously not
`used in similar contexts. Hence the offend(cid:173)
`ing document 9628 was removed from the
`retrieved list. Most linguistic ambiguities
`are· similarly resolvable by this global-local
`vector-matching process. The lower por(cid:173)
`tion of Table 1 also differs from the upper in
`that certain text passages are retrieved (la(cid:173)
`beled "c" for section and "p" for paragraph)
`in addition to certain full document texts.
`The passage retrieval issue is examined in
`more detail in the next section.
`
`Text Decomposition and Structure
`
`Practical retrieval searches deal with text
`items that are heterogeneous in both sub(cid:173)
`ject matter and text length. Thus, in the
`same text environment it may be necessary
`to cope with short e-mail messages as well
`as long book-sized texts. In an encyclope(cid:173)
`dia, three-word articles representing cross(cid:173)
`references from one subject to another oc(cid:173)
`cur routinely, in addition to many long
`
`Table 1. Text retrieval strategies. Query: article
`9667, "William Lloyd Garrison." Section indicat(cid:173)
`ed by "c"; paragraph indicated by "p."
`
`Document
`number
`
`Query
`simi(cid:173)
`larity
`
`Title of retrieved item
`
`9667
`18173
`76
`21325
`827
`
`21326
`8097
`
`9628
`2883
`5584
`
`9667
`18173
`2974.c33'
`76
`21325.c8
`827
`
`Global text comparison only
`1.00 Garrison, William Uoyd
`Phillips, Wendell
`0.53
`0.48 Abolitionists
`0.40 Slavery
`0.36 American Anti-Slavery
`Society
`0.35 Slave Trade
`0.35
`Emancipation
`Proclamation
`0.30 Gar
`0.27 Birney, James Gillespie
`0.27 Clay, Cassius Marcellus
`Global-local text comparison
`and retrieval of text passages
`1.00 Garrison, William Uoyd
`0.53 Phillips, Wendell
`0.50 Blacks in Americas
`0.48 Abolitionists
`0.42 Slavery
`0.36 American Anti-Slavery
`Society
`Emancipation
`Proclamation
`United States of
`America
`0.29 Villard, Henry
`0.28 Civil War, American
`
`8097
`
`0.35
`
`23173.c97'
`
`0.31
`
`23545.p5*
`5539.c28*
`
`*New article retrieved in restricted search.
`
`1422
`
`·: ~ . ..
`
`~. ~~. ; -,.
`
`:·~., . ;-.: :·~ ;-:-:
`
`the 175-page article
`treatments such as
`entitled "United States of America." In a
`vector-processing environment, long arti(cid:173)
`cles that deal with diverse subject matter
`are difficult to retrieve in response to short,
`more specific queries, because the overall
`vector similarity is likely to be small for
`such items. Thus, the full article "United
`States of America" is not retrieved in the
`top 10 items in response to the query aboUE
`William Lloyd Garrison, even though cer(cid:173)
`tain sections in the article specifically deal
`with abolitionism.
`The rejection oflong articles can reduce
`retrieval performance in some cases. More
`generally, long articles are difficult for users
`to handle even when retrieval is possible,
`because long texts cannot easily be ab(cid:173)
`sorbed and processed. This suggests that
`long texts be broken down into smaller text
`passages and that access be provided to
`shorter text excerpts in addition to full
`texts. Various attempts have been made in
`the past to implement passage retrieval
`capabilities, but flexible .systems capable of
`handling text excerpts do not currently
`exist (5).
`The Smart system can deal with text
`segments of varying length, including text
`sections, paragraphs, groups of adjacent
`sentences, and individual sentences. The
`lower portion of Table 1 thus shows the
`results of a mixed search in which text
`sections and paragraphs are retrieved in(cid:173)
`stead of full texts whenever the query sim(cid:173)
`ilarity for a shorter text passage exceeds the
`similarity for the full article. A number of
`new items are promoted into the top 10 list
`when text passages are retrievable, includ(cid:173)
`ing section 33 of document 2974, "Blacks
`in the Americas," and section 97 of "Unit(cid:173)
`ed States of America." The text of docu(cid:173)
`ment 2974.c33 (section 33 of document
`2974) covers the founding of the American
`Anti-Slavery Society by William Lloyd
`Garrison in 1833. The relevance of this
`text to abolitionism and William Lloyd
`Garrison explains its good retrieval rank
`and high similarity coefficient of 0.50.
`The available evidence indicates that
`when searching an encyclopedia, the use of
`the combined global and local similarity
`computations improves retrieval effective(cid:173)
`ness by about 10% over the use of global
`vector similarity measurements alone. An
`additional 10% improvement is obtainable
`by use of the passage retrieval capability
`that identifies document excerpts in addi(cid:173)
`tion to full texts ( 6). The results obtained
`by extensive testing in the TREC (Text
`Retrieval Evaluation Conference) environ(cid:173)
`ment indicate that the Smart system pro(cid:173)
`duces consistently superior retrieval perfor(cid:173)
`mance (7). Furthermore, response times are
`comparable to those obtainable in commer(cid:173)
`cial retrieval environments. A Smart search
`
`SCIENCE • VOL 264
`
`• 3 JUNE 1994
`
`of the TREC collections (700,000 full-text
`documents, or 2.4 gigabytes of text) has
`typical response times of 3 s for a 10-term
`query or 6 s for a 20-term query.
`When text passages are available for
`processing and similarity measurements are
`easily computed betwet;n texts and text
`excerpts, text relation maps can be gener(cid:173)
`ated that show text similarities that exceed
`a particular threshold value. Figure 1 shows
`a relation map for four encyclopedia articles
`related to William Lloyd Garrison ("Sla(cid:173)
`very," "U.S. Civil War," "Abolitionists,"
`
`U.S. Civil War
`5539
`
`William lloyd Garrison
`9667
`
`0.23
`
`0.49
`
`Abolitionists
`76
`
`0.40
`
`0.53
`
`21325
`Slave I}'
`( Unks below 0.20 Ignored)
`
`(Lin/cs below 0.30 ignored)
`
`CLinics below0.30 ignored)
`Fig. 1 (top). Basic text relation map. Vertices
`(nodes) represent texts; lines (links between
`nodes) represent text relations above a similarity
`threshold of 0.20. In all figures, "c" indicates
`section, "p" indicates paragraph.
`Fig. 2 (mid(cid:173)
`dle). Well-connected text relation map for para(cid:173)
`Fig. 3
`graphs of article 21385, "Smoking."
`(bottom). Poorly connected text relation map for
`paragraphs of article 21933, "Symphony."
`
`

`

`and "Garrison"). The texts themselves are
`represented by nodes (vertices) of the map,
`and the pairwise text similarities are indi(cid:173)
`cated by links {branches) between the cor(cid:173)
`responding node pairs. Figure 1 shows all
`similarities between full articles exceeding a
`similarity threshold of 0.20 (8). Text link(cid:173)
`ing has been used in the past to build
`hypertext structures, but the links between
`related text pieces are normally assumed to
`be placed subjectively by individual text
`authors--a procedure manifestly impracti(cid:173)
`cal in environments where large masses of
`heterogeneous texts are stored for process(cid:173)
`ing (9).
`A study of various kinds of text relations
`between texts and text excerpts can reveal a
`good deal of information about the internal
`structure of individual texts, as well as the
`relations between different texts. Consider,
`as an example, the paragraph map for arti(cid:173)
`cle 21385, "Smoking," shown in Fig. 2,
`which includes all pairwise paragraph simi(cid:173)
`larities exceeding 0.30. In the correspond(cid:173)
`ing graph, there are no disconnected com(cid:173)
`ponents, and many similarities exist be(cid:173)
`tween adjacent paragraphs. The convex
`graph structure reflects a homogeneous
`treatment of the topic; in this case, the
`"Smoking" article emphasizes the health
`problems connected with smoking and the
`difficulties that arise when people attempt
`to quit smoking. For a homogeneous map
`such as this, it should be easy to determine
`the basic text content by looking at only a
`few carefully chosen paragraphs.
`In contrast, consider the paragraph rela(cid:173)
`tion map in Fig. 3, which shows paragraph
`similarities for article 21933, "Symphony,"
`and uses the same similarity threshold of
`0.30. This map is much less dense; there are
`many outliers consisting of a single node
`only, and there is a disconnected compo(cid:173)
`nent that includes paragraphs 2 and 3 of
`section 5. Clearly, the "Symphony" topic
`does not receive the same homogeneous
`treatment in the encyclopedia as "Smok(cid:173)
`ing," and a determination of text content
`by selectively looking at particular text
`excerpts is much more problematic in this
`case. Attempts have been made in the past
`to relate certain manually linked hypertext
`structures to the corresponding text charac(cid:173)
`teristics, but a. detailed structural text anal(cid:173)
`ysis based on automatically linked struc(cid:173)
`tures at various levels of detail has not so far
`been undertaken (10).
`In Figs. 1 to 3, the text nodes are equally
`spaced around the circumference of a circu(cid:173)
`lar structure. This makes it easy to recog(cid:173)
`nize the links between individual text ex(cid:173)
`cerpts, but the actual link location in the
`running text is obscured. In particular, it is
`difficult to tell whether a link is placed at
`the beginning, in the middle, or at the end
`of a text. An alternative display format is
`
`shown in Fig. 4, in which the space as(cid:173)
`signed to each text along the circumference
`is proportional to the text length, and each
`text link is placed in its proper position
`within the texts. Figure 4 shows a paragraph
`map for four related articles ("Mohandas
`Gandhi," "Indira Gandhi," "Nehru," and
`"India") with the use of a similarity thresh(cid:173)
`old of 0.30. It is obvious that the text of
`article 12017 ("India") is much longer than
`that of the other articles and that the
`coverage ofMohandas Gandhi (the Mahat(cid:173)
`ma) is in tum more detailed than that of
`Indira Gandhi and Nehru.
`Various kinds of topic relationships can
`be distinguished in Fig. 4, depending on the
`particular linking pattern between text ele(cid:173)
`ments. For example, when multiple links
`relate a particular (shorter) document such
`as "Indira Gandhi" (9619) and a subsection
`of a longer document such as "India"
`(12017), a narrower-broader text relation
`normally exists. Similarly, when a particu(cid:173)
`lar section of one document has multiple
`links to a particular section of another
`document, the two text items usually share
`a common subtopic. One can thus conclude
`that "Nehru" (!6579) and the two "Gan(cid:173)
`dhis" (9619 and 9620) represent subtopics
`of "India" (!20!7). Similarly, "Mohandas
`Gandhi" and "Nehru," and "Indira Gan(cid:173)
`dhi" and "Nehru," are pairs of related
`documents that share common subtopics.
`Finally, the lives of the two Gandhis appear
`to be largely unrelated-a single linked
`paragraph pair exists that refers to unrest in
`India, a condition that plagued both poli(cid:173)
`ticians. The relation between Mohandas
`and Indira Gandhi is entirely through Neh(cid:173)
`ru, who was a disciple of the Mahatma and
`also the father of Indira.
`This type of analysis gives an objective
`view of the topic coverage in individual
`texts and of the information shared among
`sets of related texts. In the rest of this
`article, we examine three kinds of text
`analysis systems in more detail, which leads
`to the identification of text themes, the
`selective traversal of texts, and the summa(cid:173)
`rization of text content by extraction of
`important text excerpts.
`
`Text Theme Identification
`
`A text theme can be defined as a specific
`subject that is discussed in some depth in a
`particular text or in a number of related
`texts. Themes represent centers of atten(cid:173)
`tion and cover subjects of principal interest
`to text authors and presumably also to text
`readers. The identification of text themes is
`useful for many purposes--for example, to
`obtain a snapshot of text content and as an
`aid in deciding whether actually to read a
`text.
`Various approaches based on linguistic
`
`SCIENCE • VOL 264
`
`• 3 JUNE 1994
`
`text analysis methods suggest themselves for
`the identification of text themes ( 11). In
`the present context, the text relation maps
`are used as inputs to a clustering process
`that is designed to identi/Y groups of text
`excerpts that are closely related w each
`other but also relatively disconnected from
`the rest of the text (12). The following
`simple process leads to text theme identifi(cid:173)
`cation: First, the triangles in the relation
`map are recognized (a triangle is a group of
`three text excerpts, each of which is related
`to the other two to a degree that is above
`the stated similarity threshold). A centroid
`vector is then constructed for each triangle,
`as the average vector for the group of three
`related items. Finally, triangles are merged
`into a common group (theme) whenever
`the corresponding centroids are sufficiently
`similar (that is, when the pairwise centroid
`similarity exceeds a stated threshold). Each
`theme may be represented by a global cen(cid:173)
`troid vector that is constructed as the aver(cid:173)
`age vector of all text excerpts included in
`the theme.
`Figure 5 shows the four themes derived by
`this method for the Gandhi-lndia subject
`area shown in Fig. 4. The following themes
`are apparent: (i) the single solid triangle
`consisting of paragraphs 9619.p5, 16579.p4,
`and 16579.p5 on the right-hand edge of Fig.
`5 (main subject: Nehru); (ii) the single
`hashed triangle consisting of paragraphs
`9619.p3, 12017.p219, and 12017.pZZO
`(main subject: Sikhs, Punjab); (iii)
`the
`group of dark triangles consisting of para(cid:173)
`graphs 9619.p7, 12017 .p211, 12017 .p216,
`1Z017.p218, and !2017.pZZ2 (main subject:
`Indira Gandhi); (iv) the group of light tri(cid:173)
`angles consisting of paragraphs 9620.p3,
`9620.p6, 9620.p8, 9620.p11, 9620.p14,
`9620.p15, 9620.pl8, 12017.p148, and
`16579.p4 (main subject: Mohandas Gan(cid:173)
`dhi). The clear separation between the two
`Gandhis already noted in the map of Fig. 4 is
`present also in the theme map of Fig. 5, in
`which no overlap exists between the dark
`and light triangle groupings.
`An alternative, less onerous but also less
`refined theme generation method is to build
`a text relation map with the use of a high
`similarity threshold (where the number of
`linked text excerpts is small). Each discon(cid:173)
`nected component of the map, consisting of
`groups of highly related text excerpts, is
`then identified with a particular theme.
`The graph obtained by use of a text simi(cid:173)
`larity threshold of 0.50 for the Gandhi(cid:173)
`India subject area is shown in Fig. 6. The
`high similarity threshold reduces the simi(cid:173)
`larity map to three areas, identified as Mo(cid:173)
`handas Gandhi (top theme), Indira Gandhi
`(middle), and Nehru {bottom). These
`themes duplicate those of Fig. 5, but the
`second theme in Fig. 5, which covers Indira
`Gandhi's problems with the Sikhs in Pun-
`
`1423
`
`

`

`jab, is no longer recognized as a separate
`subject.
`When text relation maps are used as the
`main input, themes can be generated at
`various levels of detaiL The larger the text
`excerpts used for text grouping purposes,
`the wider in general is the scope of the
`corresponding themes. Contrariwise, when
`sentences and other short excerpts are used
`in the grouping process, the theme cover(cid:173)
`age is normally narrow. Thus, when themes
`are derived from the texts of documents
`9667 and 76 ("William Lloyd Garrison"
`and-"Abolitionists," respectively), a theme
`derived from paragraph relations might cov(cid:173)
`er the "beginnings of U.S. abolitionism"; a
`more detailed theme derived from sentence
`relations might cover the "founding of the
`newspaper liberator," which was a mile(cid:173)
`stone in the early years of the abolitionist
`movement. By suitable variation of the
`scope of the theme generation process, it is
`thus possible to derive a smaller number of
`broader themes or a larger number of nar(cid:173)
`rower themes.
`
`Selective Text Traversal
`
`When large text collections are in use,
`flexible methods should be availabl~ that
`will skim the texts while concentrating on
`text passages that may be of immediate
`interest. Such a skimming operation can
`then be used both for selective text travers(cid:173)
`al, in .which only text passages deemed of
`special importance are actually retrieved or
`read, and for text summarization, in which
`summaries are constructed by extraction of
`selected text excerpts.
`In selective text traversal (13), starting
`with a text relation map and a particular
`text excerpt of special interest, a user may
`follow three different traversal strategies: (i)
`The path may cover many of the central
`nodes, which are defined as nodes with a
`large number of links to other nodes of the
`map. (ii) The path may use text excerpts
`located in strategic positions within the
`corresponding documents-for example,
`the first paragraphs in each text section or
`the first sentences in 'each paragraph. (iii)
`The path may use the link weight as the
`main path generation criterion by starting
`with the desired initial node and choosing
`as the next node the one with maximum
`similarity to the current node. This last
`strategy is known as a depth-first search.
`When individual text excerpts are se(cid:173)
`lected for path formation or summarization,
`a number of factors must receive special
`attention; among these are the coherence
`of the resulting text, that is, the ease with
`which the text can be read and understood;
`the exhaustivity of coverage of the li~l
`text, that is, the degree to which all the
`main subject areas are covered; the text
`
`the accuracy with
`that is,
`chronology,
`which timing factors are recognized; and
`finally, the amount of repetition in the
`selected text excerpts. Some of these factors
`are handled relatively easily; for example,
`text chronology is often maintained by the
`use of only forward-pointing paths and
`backtracking is not allowed (if a particular
`paragraph is included in a path, no other
`text excerpt appearing earlier in the same
`document can appear in the same path).
`In the present context, text coherence is
`used as the main criterion, and forward
`depth-first paths are used in which each
`chosen text excerpt is linked to the most
`
`similar text excerpt nor yet seen at this
`point. In a depth-first path, each chosen
`excerpt is closely related to the next one,
`and the chance of poor transitions berween
`selected paragraphs is minimized. Consider,
`as an example, the paragraph map in Fig. 7,
`which is based on six documents related to
`the Greek god Zeus (article 24674). The
`assumption is that the path starts with the
`initial text paragraph of"Zeus" (24674.p3).
`A short depth-first path may be defined as a
`single-link path that includes only the ini(cid:173)
`tial text excerpt plus the next most similar
`excerpt. In Fig. 7,
`this defines path
`24674.p3 to 17ZJZ.p4 (paragraph 4 of the
`
`Award to
`M. Gandhi
`~-----~
`
`Mohandas
`Gandhi
`
`lndir.t
`.. , p3-7 Gandhi
`(9619)
`
`Indira Gandhi
`· ·.
`...... ~~~
`·.
`.,i".;;, Prime Minister·
`;d ~~ ..... CP ~
`'§, '8)0
`Absorption of
`Kashrmr (1957)
`
`( Unks from 0.30 to 0.63 shoWII)
`Rg. 4. Paragraph similarity map for articles related to "India" (12017). Length of curved segmenls
`is proportional to text length; links are placed in correct relative position within each text.
`
`Theme4
`Mohandas Gandhi
`~
`~ ~.->:
`.,
`l?..
`%<'o
`7,.
`'IJ's
`
`Fig, 5. Text themes de(cid:173)
`rived by merging of trian(cid:173)
`gles for four articles related
`to "India" (12017).
`
`... ~
`"'

`16579: Nehru, Jawaharial
`12017: India
`9620: Gandhi, Mohandas Karamchand
`9619: Gandhi, Indira Priyadarshinl
`
`(Links below 0.35 ignored)
`
`1424
`
`SCIENCE • VOL 264
`
`• 3 JUNE 1994
`
`

`

`12017.p 147 ............ -_-_-_-_-.-:. ~:::"' ,-~~:~~:8
`Mohandas Gandhi
`
`12017.p14ii"
`
`9619
`_p?
`12017.p214 ····························_-_-_-.:o>"
`Theme 3
`Indira Gandhi
`
`12017.p216 ·--.
`
`120i"fp218
`16579.p4····
`
`_.-16579.p5
`
`Theme 1
`Nehru
`
`(Unksrelow0.50ignorerf)
`
`J~~ ~iaru,Jawaharlal
`9620: Gandhi. Mollandas Karamd1and
`9619: Gandhi, Indira Poiyadarshini
`Fig. 6. Simplified text themes derived from
`high-threshold (disconnected) text relation map
`for articles related to "India" (12017).
`
`article "Mount Olympus"). The corre(cid:173)
`sponding paragraphs introduce Zeus as the
`god of the sky and the ruler of the Olympi(cid:173)
`an gods and then proceed by identifying the
`12 major Olympian deities, including Zeus,
`his wife Hera, and his siblings and children.
`A more complete forward depth-first
`path proceeds from item 17232.p4 to in(cid:173)
`clude 10391.p6 (paragraph 6 of 10391,
`"Greek Religion and Mythology") and four
`additional paragraphs presented in detail in
`Fig. 7. The complete forward depth-first
`path includes information about Rhea,
`Zeus' mother; Cronus, Zeus' father; and the
`Titans, a race of giants that included Rhea
`and Cronus, among other gods.
`Instead of initiating the text traversal at
`the beginning of a text, it is also possible for
`a searcher to use context-dependent text(cid:173)
`traversal strategies that start with a special
`text excerpt of immediate interest. For ex(cid:173)
`ample, someone interested in the foreign
`policy of President Nixon might locate
`paragraph 622 of article 23173 ("United
`States of America") by using a standard text
`search. A depth-first path starting at
`23l73.p622 can then be used to obtain
`further information. Such a path also in(cid:173)
`cludes paragraphs 9086.p 13 (paragraph 13
`of article 9086, "Gerald· R. Ford") and
`16855.pll (paragraph 11 of 16855, "Rich(cid:173)
`ard M. Nixon"). The corresponding texts
`deal with the exchange of visits between
`President Nixon and Leonid Brezhnev; the
`continuation of detente between the Unit(cid:173)
`ed States and the Soviet Union that was
`pursued by President Ford and Secretary of
`State Kissinger; and finally, Nixon's ap(cid:173)
`proach to the People's Republic of China.
`A completely different topic will be covered
`by a depth-first path starting with paragraph
`the Watergate
`23173.p624, describing
`break-in. The corresponding coverage in(cid:173)
`cludes Nixon's presumed implication in the
`Watergate burglary (23173.p624), Vice
`President Ford's staunch defense of Nixon
`during his term as vice president (9086.p8),
`
`sometimes be replaced by a shorter excerpt
`whose similarity to the previous text ele(cid:173)
`ment is large. In the depth-first path of Fig.
`7, the long paragraph 10391.p6 that deals
`with the divine hierarchy on Mount Olym(cid:173)
`pus (node 3 in the figure) may be replaced
`by a group of three adjacent sentences
`(1 039l.g I 7) consisting of the first three
`sentences of the paragraph. Similarly, para(cid:173)
`graph 22566.pl3 is replaceable by sentence
`group 22566.g7, which includes only the
`last three sentences of the paragraph.
`An alternative way of reducing the path
`size is to use theme generation methods to
`obtain text excerpts covering the desired
`subject area. Figure 8 shows a theme gen(cid:173)
`eration map obtained by triangle merging
`for the Zeus subject area used in Fig. 7. Two
`themes are distinguished, "Olympian dei(cid:173)
`ties" and "birth of Zeus." The two text
`excerpts that are most similar to the respec(cid:173)
`tive theme centroids are 17232.p4 ("The 12
`major Olympian deities were Zeus and his
`wife Hera ••. ") and 24674.p5 ("Zeus was
`the youngest son of the Titans Cronus and
`Rhea ... "). An appropriate short path cov(cid:173)
`ering Zeus can then be obtained as 24674.p3
`(the initial paragraph of the Zeus article),
`followed by 17232.p4 and 24674.p5, repre(cid:173)
`senting the most important paragraphs in
`the two themes, respectively.
`
`Text Summarization
`
`In the absence of deep linguistic analysis
`methods that are applicable to unrestricted
`subject areas, it is not possible to build
`intellectually satisfactory text summaries
`(14). However, by judicious text extraction
`methods, collections of text passages can be
`identified that provide adequate coverage of
`the subject areas of interest. For example,
`when homogeneous text relation maps are
`available, a good summary is normally ob(cid:173)
`tainable by use of one of the longer text(cid:173)
`traversal paths in chronological (forward)
`text order.
`Consider, as an example, the paragraph
`map for document 16585 ("Horatio, Vis(cid:173)
`count Nelson") (Fig. 9). Two paths are
`shown that start with the initial text para(cid:173)
`graph 16585.p3. The dashed path traverses
`all "bushy" nodes--that is, no

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket