`(1992).
`84. K.-1. Igarashi, M. David, A C. lamer, D. S. F~n
`bloom. MDI. Cell. 8/a/. 13, 3984 (1993).
`85. R. Schreiber. personal communication.
`· 86. E. H. Asel'ler, H. CharbOMeau, N. K. Tanks,
`Science253, 401 (1991).
`87. J. Mir1<ovilcti andJ. E. Darnell Jr., Mcf. Bioi. Cel/3,
`1085 (1992).
`88. H. B. Sadowski, K. Shuat, J. E. DarneD Jr., M. 2.
`Gilman, Science 261, 1739 (1993).
`89. 0. Sllvemoinen, C. Schindler, J. Schlessinger, D.
`E. Levy, ibid., p. 1736.
`.
`SO. S. Ruff-Jamison. K. Chen, S. Cohen, ibid .• p. 1733.
`
`91. A C. lamer et al .. ibid., p. 1730.
`92. H. Kotanides and N. C. Reich, ibid. 262. 1265
`(1993).
`-
`93. A Bonn!, D. A frank. C. Schindler, M. E.. Green·
`berg, ibid .. p. 1575.
`94. R. Graham and M. Gilman. ibid. 251, 189 (1991).
`95. D. E. levy, in Interferon: Principles and Medical
`Applicalions, S. Baron et a/.~ Eds. (University of
`Texas Medical Branch at Galveslon, Galveston,
`lX, 1992). pp. 161-173.
`96. S. HOlland. G. R Stark, I. M. Kerr, unpubHshed
`observations.
`.
`-97. D. Watling, G. R Stark, I. M. Kerr, unpub6shed
`obsel'lations.
`
`Automatic Analysis, Theme
`Generation, and Summarization
`of Machine-Readable Texts
`
`Gerard Salton, James Allan, Chris Buckley, Amit Singhal
`Vast amounts of text matertal are now available in machine-readable form for automatic
`processing: Here, approaches are outlined for manipulating and access!ng texts in arf:iitrary
`subject areas ~ accordance with user needs. in particular, methods are given for deter(cid:173)
`mining text themes, traversing teXts selectively; and extracting summary statements that
`reflect text content
`
`Many kinds of texts are currendy available
`in machine-readable form and are amenable
`to automatic processing. Because the avail(cid:173)
`able databases are large and cover many
`different subject areas, au[Omatic aids must
`be provided. to use~ interested in accessing
`the data. It has been suggested that links be
`placed between related pieces of text, con(cid:173)
`necting, for example, particular text para(cid:173)
`gxaphs to other paragraphs covering relat~
`. subject matter. Such a linked text strue·
`ture, often called hypertext, makes it pos(cid:173)
`sible for the reader to start with particular ·
`text passages and use the linked structure to
`find related text elements (1). Unfortunate(cid:173)
`ly, until now, viable methods for automati(cid:173)
`cally building large hypertext structures and
`for using such structures in a sophisticated
`way have not been available. Here we give·
`methods for constructing text 'relation maps
`and for using text relations to access at)d use
`text darabases. In particular, we oudine
`procedures for determining text themes, tra(cid:173)
`versing texts selectively, and extracting sum(cid:173)
`~ statements that reflect text content.
`
`Text Analysis and Retrieval:
`The Smart System
`
`The Smart system is a sopQisticated text
`·retrieval tool, developed over the past 30
`· years, that is based on the vector space
`
`The authors are in lhe Department of Computer Sci- ·
`enca, CornelllJniverslly, Ithaca. NY 14853-7501, USA
`
`model of rerrieval (2) . In the vector space
`model, all information irems-stored texts
`as well as information queries-are repre(cid:173)
`sented by sets, or vecto~, of terms. A term
`is typically a word, a word stem, or a phrase
`associated with the text under consider(cid:173)
`ation. In principle, the teims might be
`chosen from a controlled vocabulary list or
`a thesaurus, but b~cause of the difliculties of
`constructing such controlled vocabuJaries
`for unrestricted topic areas, it is· convenient
`to derive the tenns directly from the texts
`under consideration. Collectively,
`the
`tenns assigned to a particular text represent
`text content.
`Becaus~ the terms are not equally useful
`for content representation, it is important to
`introduce a term-weighting system that as(cid:173)
`signs high weights to terms deemed important
`and lower weights to the less important terms.
`A powerful Wm.-weighting system of this .
`kind is the weD-known equatlmj f. X life
`times inverse collection
`(term frequency
`frequency), which favors terms with a high
`frequency, (f,) in ·particular documents but
`with a low.frequ~ncy overall in the collection
`(fJ. Such terms distinguish the documents in
`which they occur. from the remaining items.·
`When all texts or text queries are repre(cid:173)
`sented by weighted term vectors of the form
`D1 = (~J, du, • . ., clJ, where du. is the
`weight assigned to term k in document D,; a
`similarity measure can he computed be(cid:173)
`tween pairs of vectors that reflects text
`similarity. Thus, given document D1 and
`SCIENCE • VOL 264 • 3 jUNE 1994
`
`query Q1 (or sample document D1), a .simi(cid:173)
`larity computation of the form sim(D,, Q1)
`= ~1= 1 d;kdp. can produce a ranked list of
`documents in decreasing order of similarity
`with a query (or with a sample document).
`When ranked retrieval output is provided
`for the user, it is easy to use relevance
`feedback procedures to build improyed que(cid:173)
`ries on the basis of the relevance of previ(cid:173)
`ously retrieved materials.
`In the Smart system, the tetms used to
`identify the text items are entities extract(cid:173)
`ed from the document texts after elimina(cid:173)
`tion of common words and removal of
`word suffixes. When the document vocab(cid:173)
`ulary itself forms the basis for text content
`repres(:ntation, distinct documents with
`large overlapping vocabularies may be dif- ·
`ficult to distinguish. For example, the
`vectors covering biographies of John Fitz(cid:173)
`gerald Kennedy and·Antbony M. Kenne·
`dy, the current Supreme Court justice,
`will show many similarities because both
`Kennedys attended Harvard University,
`were high officials of the government, and
`had close relationships with U.S. presi·
`dents. The global vector similarity func(cid:173)
`tion described earlier cannot cope with
`ambiguities of · tbis kind by itself. An
`l!dditional step designed to verify that the
`matching vocabulary occurs locallY in sim(cid:173)
`ilar contexts must therefore be introduced
`as part of the rel;rievai algorithm. This is
`accomplished by insisting on certain local(cid:173)
`ly matching substructures, such as text
`se~tences or text paragraphs, in addition
`to the global vector match', before accept(cid:173)
`ing two texts as legitimately similar (3).
`Consider, as an examp)e, a typical
`search conducted in the 29-vcilume Funk
`and Wagnalls encyclopedia, using as a que(cid:173)
`the
`text of article 9667, entided
`ry
`"William lloyd Garrison" (Garrison wa5
`the best known of the American abolition(cid:173)
`ists, who opposed slavery in the early part of
`the 19th century) (4). The upper portiOn of
`Table lshows the top 10 items retrieved in
`.response to a global vector ·comparison.
`The top retrieved item is article 9667 itself,
`with a petfect query similarity of LOO,
`followed by additional articles dealing with
`abolitionism and the slavery issue, retrieved
`with lower siini.larity values.
`.. The upper portion of Table 1 consists of
`relevant items only, with the exception of
`article 9628, entided "Gart retrieved in
`position eight on the ranked list.'. Oar is a
`type of fish, obviously unrelated to the
`slavery issue but euoneously retrieved be(cid:173)
`Cause truncated terms were used in the text
`· vectors, and the truncated form of "Garri(cid:173)
`. son" matches !'Gar." (Removal of "-ison"
`as part of the stemming process first reduced·
`"Garrison" to uGarr.,'' as in "comparison»
`and "compar"; removal of the duplicated
`consonant then reduc~ "Garr" to the final
`
`1421
`
`EXHIBIT 2031
`Facebook, Inc. et al.
`v.
`Software Rights Archive, LLC
`CASE IPR2013-00481
`
`
`
`"Gar.") The lower portion of Table 1 shows
`the results obtained with an additional local
`text comparison that required at least one
`matching text sentence between the query
`article and each retrieved document. There
`are no matching sentences in documents
`9667 ("Garrison") and 9628 ("Gar"), be(cid:173)
`cause gar, meaning fish, and "Gar" derived
`'from the name Garrison are obviously not
`used in similar contexts. Hence the offend(cid:173)
`ing document 9628 was removed from the
`retrieved list. Most linguistic ambiguities
`are· similarly resolvable by this global-local
`vector-matching process. The lower por(cid:173)
`tion of Table 1 also differs from the upper in
`that certain text passages are retrieved (la(cid:173)
`beled "c" for section and "p" for paragraph)
`in addition to certain full document texts.
`The passage retrieval issue is examined in
`more detail in the next section.
`
`Text Decomposition and Structure
`
`Practical retrieval searches deal with text
`items that are heterogeneous in both sub(cid:173)
`ject matter and text length. Thus, in the
`same text environment it may be necessary
`to cope with short e-mail messages as well
`as long book-sized texts. In an encyclope(cid:173)
`dia, three-word articles representing cross(cid:173)
`references from one subject to another oc(cid:173)
`cur routinely, in addition to many long
`
`Table 1. Text retrieval strategies. Query: article
`9667, "William Lloyd Garrison." Section indicat(cid:173)
`ed by "c"; paragraph indicated by "p."
`
`Document
`number
`
`Query
`simi(cid:173)
`larity
`
`Title of retrieved item
`
`9667
`18173
`76
`21325
`827
`
`21326
`8097
`
`9628
`2883
`5584
`
`9667
`18173
`2974.c33'
`76
`21325.c8
`827
`
`Global text comparison only
`1.00 Garrison, William Uoyd
`Phillips, Wendell
`0.53
`0.48 Abolitionists
`0.40 Slavery
`0.36 American Anti-Slavery
`Society
`0.35 Slave Trade
`0.35
`Emancipation
`Proclamation
`0.30 Gar
`0.27 Birney, James Gillespie
`0.27 Clay, Cassius Marcellus
`Global-local text comparison
`and retrieval of text passages
`1.00 Garrison, William Uoyd
`0.53 Phillips, Wendell
`0.50 Blacks in Americas
`0.48 Abolitionists
`0.42 Slavery
`0.36 American Anti-Slavery
`Society
`Emancipation
`Proclamation
`United States of
`America
`0.29 Villard, Henry
`0.28 Civil War, American
`
`8097
`
`0.35
`
`23173.c97'
`
`0.31
`
`23545.p5*
`5539.c28*
`
`*New article retrieved in restricted search.
`
`1422
`
`·: ~ . ..
`
`~. ~~. ; -,.
`
`:·~., . ;-.: :·~ ;-:-:
`
`the 175-page article
`treatments such as
`entitled "United States of America." In a
`vector-processing environment, long arti(cid:173)
`cles that deal with diverse subject matter
`are difficult to retrieve in response to short,
`more specific queries, because the overall
`vector similarity is likely to be small for
`such items. Thus, the full article "United
`States of America" is not retrieved in the
`top 10 items in response to the query aboUE
`William Lloyd Garrison, even though cer(cid:173)
`tain sections in the article specifically deal
`with abolitionism.
`The rejection oflong articles can reduce
`retrieval performance in some cases. More
`generally, long articles are difficult for users
`to handle even when retrieval is possible,
`because long texts cannot easily be ab(cid:173)
`sorbed and processed. This suggests that
`long texts be broken down into smaller text
`passages and that access be provided to
`shorter text excerpts in addition to full
`texts. Various attempts have been made in
`the past to implement passage retrieval
`capabilities, but flexible .systems capable of
`handling text excerpts do not currently
`exist (5).
`The Smart system can deal with text
`segments of varying length, including text
`sections, paragraphs, groups of adjacent
`sentences, and individual sentences. The
`lower portion of Table 1 thus shows the
`results of a mixed search in which text
`sections and paragraphs are retrieved in(cid:173)
`stead of full texts whenever the query sim(cid:173)
`ilarity for a shorter text passage exceeds the
`similarity for the full article. A number of
`new items are promoted into the top 10 list
`when text passages are retrievable, includ(cid:173)
`ing section 33 of document 2974, "Blacks
`in the Americas," and section 97 of "Unit(cid:173)
`ed States of America." The text of docu(cid:173)
`ment 2974.c33 (section 33 of document
`2974) covers the founding of the American
`Anti-Slavery Society by William Lloyd
`Garrison in 1833. The relevance of this
`text to abolitionism and William Lloyd
`Garrison explains its good retrieval rank
`and high similarity coefficient of 0.50.
`The available evidence indicates that
`when searching an encyclopedia, the use of
`the combined global and local similarity
`computations improves retrieval effective(cid:173)
`ness by about 10% over the use of global
`vector similarity measurements alone. An
`additional 10% improvement is obtainable
`by use of the passage retrieval capability
`that identifies document excerpts in addi(cid:173)
`tion to full texts ( 6). The results obtained
`by extensive testing in the TREC (Text
`Retrieval Evaluation Conference) environ(cid:173)
`ment indicate that the Smart system pro(cid:173)
`duces consistently superior retrieval perfor(cid:173)
`mance (7). Furthermore, response times are
`comparable to those obtainable in commer(cid:173)
`cial retrieval environments. A Smart search
`
`SCIENCE • VOL 264
`
`• 3 JUNE 1994
`
`of the TREC collections (700,000 full-text
`documents, or 2.4 gigabytes of text) has
`typical response times of 3 s for a 10-term
`query or 6 s for a 20-term query.
`When text passages are available for
`processing and similarity measurements are
`easily computed betwet;n texts and text
`excerpts, text relation maps can be gener(cid:173)
`ated that show text similarities that exceed
`a particular threshold value. Figure 1 shows
`a relation map for four encyclopedia articles
`related to William Lloyd Garrison ("Sla(cid:173)
`very," "U.S. Civil War," "Abolitionists,"
`
`U.S. Civil War
`5539
`
`William lloyd Garrison
`9667
`
`0.23
`
`0.49
`
`Abolitionists
`76
`
`0.40
`
`0.53
`
`21325
`Slave I}'
`( Unks below 0.20 Ignored)
`
`(Lin/cs below 0.30 ignored)
`
`CLinics below0.30 ignored)
`Fig. 1 (top). Basic text relation map. Vertices
`(nodes) represent texts; lines (links between
`nodes) represent text relations above a similarity
`threshold of 0.20. In all figures, "c" indicates
`section, "p" indicates paragraph.
`Fig. 2 (mid(cid:173)
`dle). Well-connected text relation map for para(cid:173)
`Fig. 3
`graphs of article 21385, "Smoking."
`(bottom). Poorly connected text relation map for
`paragraphs of article 21933, "Symphony."
`
`
`
`and "Garrison"). The texts themselves are
`represented by nodes (vertices) of the map,
`and the pairwise text similarities are indi(cid:173)
`cated by links {branches) between the cor(cid:173)
`responding node pairs. Figure 1 shows all
`similarities between full articles exceeding a
`similarity threshold of 0.20 (8). Text link(cid:173)
`ing has been used in the past to build
`hypertext structures, but the links between
`related text pieces are normally assumed to
`be placed subjectively by individual text
`authors--a procedure manifestly impracti(cid:173)
`cal in environments where large masses of
`heterogeneous texts are stored for process(cid:173)
`ing (9).
`A study of various kinds of text relations
`between texts and text excerpts can reveal a
`good deal of information about the internal
`structure of individual texts, as well as the
`relations between different texts. Consider,
`as an example, the paragraph map for arti(cid:173)
`cle 21385, "Smoking," shown in Fig. 2,
`which includes all pairwise paragraph simi(cid:173)
`larities exceeding 0.30. In the correspond(cid:173)
`ing graph, there are no disconnected com(cid:173)
`ponents, and many similarities exist be(cid:173)
`tween adjacent paragraphs. The convex
`graph structure reflects a homogeneous
`treatment of the topic; in this case, the
`"Smoking" article emphasizes the health
`problems connected with smoking and the
`difficulties that arise when people attempt
`to quit smoking. For a homogeneous map
`such as this, it should be easy to determine
`the basic text content by looking at only a
`few carefully chosen paragraphs.
`In contrast, consider the paragraph rela(cid:173)
`tion map in Fig. 3, which shows paragraph
`similarities for article 21933, "Symphony,"
`and uses the same similarity threshold of
`0.30. This map is much less dense; there are
`many outliers consisting of a single node
`only, and there is a disconnected compo(cid:173)
`nent that includes paragraphs 2 and 3 of
`section 5. Clearly, the "Symphony" topic
`does not receive the same homogeneous
`treatment in the encyclopedia as "Smok(cid:173)
`ing," and a determination of text content
`by selectively looking at particular text
`excerpts is much more problematic in this
`case. Attempts have been made in the past
`to relate certain manually linked hypertext
`structures to the corresponding text charac(cid:173)
`teristics, but a. detailed structural text anal(cid:173)
`ysis based on automatically linked struc(cid:173)
`tures at various levels of detail has not so far
`been undertaken (10).
`In Figs. 1 to 3, the text nodes are equally
`spaced around the circumference of a circu(cid:173)
`lar structure. This makes it easy to recog(cid:173)
`nize the links between individual text ex(cid:173)
`cerpts, but the actual link location in the
`running text is obscured. In particular, it is
`difficult to tell whether a link is placed at
`the beginning, in the middle, or at the end
`of a text. An alternative display format is
`
`shown in Fig. 4, in which the space as(cid:173)
`signed to each text along the circumference
`is proportional to the text length, and each
`text link is placed in its proper position
`within the texts. Figure 4 shows a paragraph
`map for four related articles ("Mohandas
`Gandhi," "Indira Gandhi," "Nehru," and
`"India") with the use of a similarity thresh(cid:173)
`old of 0.30. It is obvious that the text of
`article 12017 ("India") is much longer than
`that of the other articles and that the
`coverage ofMohandas Gandhi (the Mahat(cid:173)
`ma) is in tum more detailed than that of
`Indira Gandhi and Nehru.
`Various kinds of topic relationships can
`be distinguished in Fig. 4, depending on the
`particular linking pattern between text ele(cid:173)
`ments. For example, when multiple links
`relate a particular (shorter) document such
`as "Indira Gandhi" (9619) and a subsection
`of a longer document such as "India"
`(12017), a narrower-broader text relation
`normally exists. Similarly, when a particu(cid:173)
`lar section of one document has multiple
`links to a particular section of another
`document, the two text items usually share
`a common subtopic. One can thus conclude
`that "Nehru" (!6579) and the two "Gan(cid:173)
`dhis" (9619 and 9620) represent subtopics
`of "India" (!20!7). Similarly, "Mohandas
`Gandhi" and "Nehru," and "Indira Gan(cid:173)
`dhi" and "Nehru," are pairs of related
`documents that share common subtopics.
`Finally, the lives of the two Gandhis appear
`to be largely unrelated-a single linked
`paragraph pair exists that refers to unrest in
`India, a condition that plagued both poli(cid:173)
`ticians. The relation between Mohandas
`and Indira Gandhi is entirely through Neh(cid:173)
`ru, who was a disciple of the Mahatma and
`also the father of Indira.
`This type of analysis gives an objective
`view of the topic coverage in individual
`texts and of the information shared among
`sets of related texts. In the rest of this
`article, we examine three kinds of text
`analysis systems in more detail, which leads
`to the identification of text themes, the
`selective traversal of texts, and the summa(cid:173)
`rization of text content by extraction of
`important text excerpts.
`
`Text Theme Identification
`
`A text theme can be defined as a specific
`subject that is discussed in some depth in a
`particular text or in a number of related
`texts. Themes represent centers of atten(cid:173)
`tion and cover subjects of principal interest
`to text authors and presumably also to text
`readers. The identification of text themes is
`useful for many purposes--for example, to
`obtain a snapshot of text content and as an
`aid in deciding whether actually to read a
`text.
`Various approaches based on linguistic
`
`SCIENCE • VOL 264
`
`• 3 JUNE 1994
`
`text analysis methods suggest themselves for
`the identification of text themes ( 11). In
`the present context, the text relation maps
`are used as inputs to a clustering process
`that is designed to identi/Y groups of text
`excerpts that are closely related w each
`other but also relatively disconnected from
`the rest of the text (12). The following
`simple process leads to text theme identifi(cid:173)
`cation: First, the triangles in the relation
`map are recognized (a triangle is a group of
`three text excerpts, each of which is related
`to the other two to a degree that is above
`the stated similarity threshold). A centroid
`vector is then constructed for each triangle,
`as the average vector for the group of three
`related items. Finally, triangles are merged
`into a common group (theme) whenever
`the corresponding centroids are sufficiently
`similar (that is, when the pairwise centroid
`similarity exceeds a stated threshold). Each
`theme may be represented by a global cen(cid:173)
`troid vector that is constructed as the aver(cid:173)
`age vector of all text excerpts included in
`the theme.
`Figure 5 shows the four themes derived by
`this method for the Gandhi-lndia subject
`area shown in Fig. 4. The following themes
`are apparent: (i) the single solid triangle
`consisting of paragraphs 9619.p5, 16579.p4,
`and 16579.p5 on the right-hand edge of Fig.
`5 (main subject: Nehru); (ii) the single
`hashed triangle consisting of paragraphs
`9619.p3, 12017.p219, and 12017.pZZO
`(main subject: Sikhs, Punjab); (iii)
`the
`group of dark triangles consisting of para(cid:173)
`graphs 9619.p7, 12017 .p211, 12017 .p216,
`1Z017.p218, and !2017.pZZ2 (main subject:
`Indira Gandhi); (iv) the group of light tri(cid:173)
`angles consisting of paragraphs 9620.p3,
`9620.p6, 9620.p8, 9620.p11, 9620.p14,
`9620.p15, 9620.pl8, 12017.p148, and
`16579.p4 (main subject: Mohandas Gan(cid:173)
`dhi). The clear separation between the two
`Gandhis already noted in the map of Fig. 4 is
`present also in the theme map of Fig. 5, in
`which no overlap exists between the dark
`and light triangle groupings.
`An alternative, less onerous but also less
`refined theme generation method is to build
`a text relation map with the use of a high
`similarity threshold (where the number of
`linked text excerpts is small). Each discon(cid:173)
`nected component of the map, consisting of
`groups of highly related text excerpts, is
`then identified with a particular theme.
`The graph obtained by use of a text simi(cid:173)
`larity threshold of 0.50 for the Gandhi(cid:173)
`India subject area is shown in Fig. 6. The
`high similarity threshold reduces the simi(cid:173)
`larity map to three areas, identified as Mo(cid:173)
`handas Gandhi (top theme), Indira Gandhi
`(middle), and Nehru {bottom). These
`themes duplicate those of Fig. 5, but the
`second theme in Fig. 5, which covers Indira
`Gandhi's problems with the Sikhs in Pun-
`
`1423
`
`
`
`jab, is no longer recognized as a separate
`subject.
`When text relation maps are used as the
`main input, themes can be generated at
`various levels of detaiL The larger the text
`excerpts used for text grouping purposes,
`the wider in general is the scope of the
`corresponding themes. Contrariwise, when
`sentences and other short excerpts are used
`in the grouping process, the theme cover(cid:173)
`age is normally narrow. Thus, when themes
`are derived from the texts of documents
`9667 and 76 ("William Lloyd Garrison"
`and-"Abolitionists," respectively), a theme
`derived from paragraph relations might cov(cid:173)
`er the "beginnings of U.S. abolitionism"; a
`more detailed theme derived from sentence
`relations might cover the "founding of the
`newspaper liberator," which was a mile(cid:173)
`stone in the early years of the abolitionist
`movement. By suitable variation of the
`scope of the theme generation process, it is
`thus possible to derive a smaller number of
`broader themes or a larger number of nar(cid:173)
`rower themes.
`
`Selective Text Traversal
`
`When large text collections are in use,
`flexible methods should be availabl~ that
`will skim the texts while concentrating on
`text passages that may be of immediate
`interest. Such a skimming operation can
`then be used both for selective text travers(cid:173)
`al, in .which only text passages deemed of
`special importance are actually retrieved or
`read, and for text summarization, in which
`summaries are constructed by extraction of
`selected text excerpts.
`In selective text traversal (13), starting
`with a text relation map and a particular
`text excerpt of special interest, a user may
`follow three different traversal strategies: (i)
`The path may cover many of the central
`nodes, which are defined as nodes with a
`large number of links to other nodes of the
`map. (ii) The path may use text excerpts
`located in strategic positions within the
`corresponding documents-for example,
`the first paragraphs in each text section or
`the first sentences in 'each paragraph. (iii)
`The path may use the link weight as the
`main path generation criterion by starting
`with the desired initial node and choosing
`as the next node the one with maximum
`similarity to the current node. This last
`strategy is known as a depth-first search.
`When individual text excerpts are se(cid:173)
`lected for path formation or summarization,
`a number of factors must receive special
`attention; among these are the coherence
`of the resulting text, that is, the ease with
`which the text can be read and understood;
`the exhaustivity of coverage of the li~l
`text, that is, the degree to which all the
`main subject areas are covered; the text
`
`the accuracy with
`that is,
`chronology,
`which timing factors are recognized; and
`finally, the amount of repetition in the
`selected text excerpts. Some of these factors
`are handled relatively easily; for example,
`text chronology is often maintained by the
`use of only forward-pointing paths and
`backtracking is not allowed (if a particular
`paragraph is included in a path, no other
`text excerpt appearing earlier in the same
`document can appear in the same path).
`In the present context, text coherence is
`used as the main criterion, and forward
`depth-first paths are used in which each
`chosen text excerpt is linked to the most
`
`similar text excerpt nor yet seen at this
`point. In a depth-first path, each chosen
`excerpt is closely related to the next one,
`and the chance of poor transitions berween
`selected paragraphs is minimized. Consider,
`as an example, the paragraph map in Fig. 7,
`which is based on six documents related to
`the Greek god Zeus (article 24674). The
`assumption is that the path starts with the
`initial text paragraph of"Zeus" (24674.p3).
`A short depth-first path may be defined as a
`single-link path that includes only the ini(cid:173)
`tial text excerpt plus the next most similar
`excerpt. In Fig. 7,
`this defines path
`24674.p3 to 17ZJZ.p4 (paragraph 4 of the
`
`Award to
`M. Gandhi
`~-----~
`
`Mohandas
`Gandhi
`
`lndir.t
`.. , p3-7 Gandhi
`(9619)
`
`Indira Gandhi
`· ·.
`...... ~~~
`·.
`.,i".;;, Prime Minister·
`;d ~~ ..... CP ~
`'§, '8)0
`Absorption of
`Kashrmr (1957)
`
`( Unks from 0.30 to 0.63 shoWII)
`Rg. 4. Paragraph similarity map for articles related to "India" (12017). Length of curved segmenls
`is proportional to text length; links are placed in correct relative position within each text.
`
`Theme4
`Mohandas Gandhi
`~
`~ ~.->:
`.,
`l?..
`%<'o
`7,.
`'IJ's
`
`Fig, 5. Text themes de(cid:173)
`rived by merging of trian(cid:173)
`gles for four articles related
`to "India" (12017).
`
`... ~
`"'
`§
`16579: Nehru, Jawaharial
`12017: India
`9620: Gandhi, Mohandas Karamchand
`9619: Gandhi, Indira Priyadarshinl
`
`(Links below 0.35 ignored)
`
`1424
`
`SCIENCE • VOL 264
`
`• 3 JUNE 1994
`
`
`
`12017.p 147 ............ -_-_-_-_-.-:. ~:::"' ,-~~:~~:8
`Mohandas Gandhi
`
`12017.p14ii"
`
`9619
`_p?
`12017.p214 ····························_-_-_-.:o>"
`Theme 3
`Indira Gandhi
`
`12017.p216 ·--.
`
`120i"fp218
`16579.p4····
`
`_.-16579.p5
`
`Theme 1
`Nehru
`
`(Unksrelow0.50ignorerf)
`
`J~~ ~iaru,Jawaharlal
`9620: Gandhi. Mollandas Karamd1and
`9619: Gandhi, Indira Poiyadarshini
`Fig. 6. Simplified text themes derived from
`high-threshold (disconnected) text relation map
`for articles related to "India" (12017).
`
`article "Mount Olympus"). The corre(cid:173)
`sponding paragraphs introduce Zeus as the
`god of the sky and the ruler of the Olympi(cid:173)
`an gods and then proceed by identifying the
`12 major Olympian deities, including Zeus,
`his wife Hera, and his siblings and children.
`A more complete forward depth-first
`path proceeds from item 17232.p4 to in(cid:173)
`clude 10391.p6 (paragraph 6 of 10391,
`"Greek Religion and Mythology") and four
`additional paragraphs presented in detail in
`Fig. 7. The complete forward depth-first
`path includes information about Rhea,
`Zeus' mother; Cronus, Zeus' father; and the
`Titans, a race of giants that included Rhea
`and Cronus, among other gods.
`Instead of initiating the text traversal at
`the beginning of a text, it is also possible for
`a searcher to use context-dependent text(cid:173)
`traversal strategies that start with a special
`text excerpt of immediate interest. For ex(cid:173)
`ample, someone interested in the foreign
`policy of President Nixon might locate
`paragraph 622 of article 23173 ("United
`States of America") by using a standard text
`search. A depth-first path starting at
`23l73.p622 can then be used to obtain
`further information. Such a path also in(cid:173)
`cludes paragraphs 9086.p 13 (paragraph 13
`of article 9086, "Gerald· R. Ford") and
`16855.pll (paragraph 11 of 16855, "Rich(cid:173)
`ard M. Nixon"). The corresponding texts
`deal with the exchange of visits between
`President Nixon and Leonid Brezhnev; the
`continuation of detente between the Unit(cid:173)
`ed States and the Soviet Union that was
`pursued by President Ford and Secretary of
`State Kissinger; and finally, Nixon's ap(cid:173)
`proach to the People's Republic of China.
`A completely different topic will be covered
`by a depth-first path starting with paragraph
`the Watergate
`23173.p624, describing
`break-in. The corresponding coverage in(cid:173)
`cludes Nixon's presumed implication in the
`Watergate burglary (23173.p624), Vice
`President Ford's staunch defense of Nixon
`during his term as vice president (9086.p8),
`
`sometimes be replaced by a shorter excerpt
`whose similarity to the previous text ele(cid:173)
`ment is large. In the depth-first path of Fig.
`7, the long paragraph 10391.p6 that deals
`with the divine hierarchy on Mount Olym(cid:173)
`pus (node 3 in the figure) may be replaced
`by a group of three adjacent sentences
`(1 039l.g I 7) consisting of the first three
`sentences of the paragraph. Similarly, para(cid:173)
`graph 22566.pl3 is replaceable by sentence
`group 22566.g7, which includes only the
`last three sentences of the paragraph.
`An alternative way of reducing the path
`size is to use theme generation methods to
`obtain text excerpts covering the desired
`subject area. Figure 8 shows a theme gen(cid:173)
`eration map obtained by triangle merging
`for the Zeus subject area used in Fig. 7. Two
`themes are distinguished, "Olympian dei(cid:173)
`ties" and "birth of Zeus." The two text
`excerpts that are most similar to the respec(cid:173)
`tive theme centroids are 17232.p4 ("The 12
`major Olympian deities were Zeus and his
`wife Hera ••. ") and 24674.p5 ("Zeus was
`the youngest son of the Titans Cronus and
`Rhea ... "). An appropriate short path cov(cid:173)
`ering Zeus can then be obtained as 24674.p3
`(the initial paragraph of the Zeus article),
`followed by 17232.p4 and 24674.p5, repre(cid:173)
`senting the most important paragraphs in
`the two themes, respectively.
`
`Text Summarization
`
`In the absence of deep linguistic analysis
`methods that are applicable to unrestricted
`subject areas, it is not possible to build
`intellectually satisfactory text summaries
`(14). However, by judicious text extraction
`methods, collections of text passages can be
`identified that provide adequate coverage of
`the subject areas of interest. For example,
`when homogeneous text relation maps are
`available, a good summary is normally ob(cid:173)
`tainable by use of one of the longer text(cid:173)
`traversal paths in chronological (forward)
`text order.
`Consider, as an example, the paragraph
`map for document 16585 ("Horatio, Vis(cid:173)
`count Nelson") (Fig. 9). Two paths are
`shown that start with the initial text para(cid:173)
`graph 16585.p3. The dashed path traverses
`all "bushy" nodes--that is, nodes in which
`the number of