In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 130-136.

Little Words Can Make a Big Difference for Text Classification

Ellen Riloff
Department of Computer Science
University of Utah
Salt Lake City, UT 84112
E-mail: riloff@cs.utah.edu
Abstract

Most information retrieval systems use stopword lists and stemming algorithms. However, we have found that recognizing singular and plural nouns, verb forms, negation, and prepositions can produce dramatically different text classification results. We present results from text classification experiments that compare relevancy signatures, which use local linguistic context, with corresponding indexing terms that do not. In two different domains, relevancy signatures produced better results than the simple indexing terms. These experiments suggest that stopword lists and stemming algorithms may remove or conflate many words that could be used to create more effective indexing terms.
Introduction

Most information retrieval systems use a stopword list to prevent common words from being used as indexing terms. Highly frequent words, such as determiners and prepositions, are not considered to be content words because they appear in virtually every document. Stopword lists are almost universally accepted as a necessary part of an information retrieval system. For example, consider the following quote from a recent information retrieval textbook:

"It has been recognized since the earliest days of information retrieval (Luhn 1957) that many of the most frequently occurring words in English (like "the", "of", "and", "to", etc.) are worthless indexing terms." ([Frakes and Baeza-Yates, 1992], p. 113)
Many information retrieval systems also use a stemming algorithm to conflate morphologically related words into a single indexing term. The motivation behind stemming algorithms is to improve recall by generalizing over morphological variants. Stemming algorithms are commonly used, although experiments to determine their effectiveness have produced mixed results (e.g., see [Harman, 1991; Krovetz, 1993]).

One benefit of stopword lists and stemming algorithms is that they significantly reduce the storage requirements of inverted files. But at what price? We have found that some types of words, which would be removed by stopword lists or merged by stemming algorithms, play an important role in making certain domain discriminations. For example, similar expressions containing different prepositions and auxiliary verbs behave very differently. We have also found that singular and plural nouns produce dramatically different text classification results.

First, we will describe a text classification algorithm that uses linguistic expressions called "relevancy signatures" to classify texts. Next, we will present results from text classification experiments in two domains which show that similar signatures produce substantially different classification results. Finally, we discuss the implications of these results for information retrieval systems.
Relevancy Signatures

Relevancy signatures represent linguistic expressions that can be used to classify texts for a specific domain (i.e., topic). The linguistic expressions are extracted from texts automatically using an information extraction system called CIRCUS. The next section gives a brief introduction to information extraction and the CIRCUS sentence analyzer, and the following section describes relevancy signatures and how they are used to classify texts.
Information Extraction

CIRCUS [Lehnert, 1991] is a conceptual sentence analyzer that extracts domain-specific information from text. For example, in the domain of terrorism, CIRCUS can extract the names of perpetrators, victims, targets, weapons, dates, and locations associated with terrorist incidents. Information is extracted using a dictionary of domain-specific structures called concept nodes. Each concept node recognizes a specific linguistic pattern and uses the pattern as a template for extracting information.

For example, a concept node dictionary for the terrorism domain contains a concept node called $murder-passive-victim$, which is triggered by the pattern "<X> was murdered" and extracts X as a murder victim. A similar concept node called $murder-active-perpetrator$ is triggered by the pattern "<X> murdered ..." and extracts X as the perpetrator of a murder. A concept node is activated during sentence processing when it recognizes its pattern in a text.

Figure 1 shows a sample sentence and the instantiated concept nodes produced by CIRCUS. Two concept nodes are generated in response to the passive form of the verb "murdered". One concept node, $murder-passive-victim$, extracts "three peasants" as murder victims, and a second concept node, $murder-passive-perpetrator$, extracts the "guerrillas" as perpetrators.
Linguistic Pattern            Example
<subject> passive-verb        <entity> was formed
<subject> active-verb         <entity> linked
<subject> verb dobj           <entity> completed acquisition
<subject> verb infinitive     <entity> agreed to form
<subject> auxiliary noun      <entity> is conglomerate
active-verb <dobj>            acquire <entity>
infinitive <dobj>             to acquire <entity>
verb infinitive <dobj>        agreed to establish <entity>
gerund <dobj>                 producing <product>
noun auxiliary <dobj>         partner is <entity>
noun prep <np>                partnership between <entity>
active-verb prep <np>         buy into <entity>
passive-verb prep <np>        was signed between <entity>
infinitive prep <np>          to collaborate on <product>

Figure 2: Concept node patterns and examples from the joint ventures domain
Sentence: Three peasants were murdered by guerrillas.

$murder-passive-victim$
  victim = "three peasants"

$murder-passive-perpetrator$
  perpetrator = "guerrillas"

Figure 1: Two instantiated concept nodes
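CIRCUS is a full conceptual sentence analyzer, so the following is only a minimal sketch of the concept node idea from Figure 1, not the CIRCUS implementation. The ConceptNode class, the activate method, and the regex triggers are hypothetical simplifications standing in for CIRCUS's linguistic patterns.

```python
import re
from dataclasses import dataclass

@dataclass
class ConceptNode:
    """Simplified stand-in for a CIRCUS concept node (hypothetical API)."""
    name: str     # e.g., "$murder-passive-victim$"
    pattern: str  # regex approximating the node's linguistic pattern
    slot: str     # role assigned to the extracted string

    def activate(self, sentence):
        """Instantiate the node if its pattern appears in the sentence."""
        m = re.search(self.pattern, sentence, re.IGNORECASE)
        return (self.name, self.slot, m.group(1).strip()) if m else None

# Two concept nodes triggered by the passive form of "murdered".
dictionary = [
    ConceptNode("$murder-passive-victim$", r"^(.+?) (?:was|were) murdered", "victim"),
    ConceptNode("$murder-passive-perpetrator$", r"murdered by (.+?)[.,;]", "perpetrator"),
]

for node in dictionary:
    instantiated = node.activate("Three peasants were murdered by guerrillas.")
    if instantiated:
        print(instantiated)
# ('$murder-passive-victim$', 'victim', 'Three peasants')
# ('$murder-passive-perpetrator$', 'perpetrator', 'guerrillas')
```

Real concept nodes operate over the analyzer's syntactic buffers rather than raw strings, so the regexes above only approximate the trigger-and-extract behavior.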
Theoretically, concept nodes can be arbitrarily complex but, in practice, most of them recognize simple linguistic constructs. Most concept nodes represent one of the general linguistic patterns shown in Figure 2.

All of the information extraction done by CIRCUS happens through concept nodes, so it is crucial to have a good concept node dictionary for a domain. Multiple concept nodes may be generated for a sentence, or no concept nodes may be generated at all. Sentences that do not activate any concept nodes are effectively ignored.

Building a concept node dictionary by hand can be extremely time-consuming and tedious. We estimate that it took approximately 1500 person-hours for two experienced system developers to build a concept node dictionary by hand for the terrorism domain. However, we have since developed a system called AutoSlog [Riloff, 1994; Riloff, 1993] that creates concept node dictionaries automatically using an annotated training corpus. Given a training corpus for the terrorism domain, a dictionary created by AutoSlog achieved 98% of the performance of the hand-crafted dictionary and required only 5 person-hours to build.

[Footnote: In principle, a single concept node can extract more than one item. However, concept nodes produced by AutoSlog [Riloff, 1994; Riloff, 1993] extract only one item at a time. The joint venture results presented in this paper are based on a concept node dictionary produced by AutoSlog.]

[Footnote: The patterns in Figure 2 are the linguistic patterns used by AutoSlog to create the joint ventures dictionary (see [Riloff, 1994; Riloff, 1993] for details). The concept node dictionary for the terrorism domain was hand-crafted and contains some more complicated patterns as well.]
Relevancy Signatures

Motivation. Most information retrieval systems classify texts on the basis of multiple words and phrases. However, for some classification tasks, classifying texts on the basis of a single linguistic expression can be effective. Although single words do not usually provide enough context to be reliable indicators for a domain, slightly larger phrases can be reliable. For example, the word "dead" is not a reliable keyword for murder because people die in many ways that have nothing to do with murder. However, some expressions containing the word "dead" are reliable indicators of murder. Figure 3 shows several expressions involving the words "dead" and "fire", and the percentage of occurrences of each expression that appeared in relevant texts. These results are based on 1500 texts from the MUC-4 corpus. The texts in the MUC-4 corpus were retrieved from a general database because they contain one or more words related to terrorism, but only half of them actually describe a relevant terrorist incident.

Figure 3 shows that every occurrence of the expression "was found dead" appeared in a relevant text.

[Footnote: MUC-4 was the Fourth Message Understanding Conference, held in 1992 [MUC-4 Proceedings, 1992].]

[Footnote: The MUC-4 organizers defined terrorism according to a complicated set of guidelines but, in general, a relevant event was a specific incident that occurred in Latin America involving a terrorist perpetrator and a civilian target.]
Expression        Rel. %
was found dead      100%
left dead            61%
<number> dead        47%
set on fire         100%
opened fire          87%
<weapon> fire        59%

Figure 3: Strength of Associations for Related Expressions
However, only 61% of the occurrences of the expression "left dead" and 47% of the occurrences of "<number> dead" (e.g., "there were 61 dead") appeared in relevant texts. This is because the expression "was found dead" has an implicit connotation of foul play, which suggests that murder is suspected. In contrast, the expressions "left dead" and "<number> dead" often refer to military casualties that are not terrorist in nature.

Figure 3 also shows that several expressions involving the word "fire" have different correlations with relevance. The expression "set on fire" was strongly correlated with relevant texts describing arson incidents, and the expression "opened fire" was highly correlated with relevant texts describing terrorist shooting incidents. However, the expression "<weapon> fire" (e.g., "rifle fire" or "gun fire") was not highly correlated with terrorist texts because it often appears in texts describing military incidents.

These results show that similar linguistic expressions can have very different associations with relevance for a domain. Furthermore, many of these distinctions would be difficult, if not impossible, for a human to anticipate. Based on these observations, we developed a text classification algorithm that automatically identifies linguistic expressions that are strongly associated with a domain and uses them to classify new texts. Our approach uses an underlying information extraction system, CIRCUS, to recognize linguistic context.
The Relevancy Signatures Algorithm. A signature is defined as a pair consisting of a word and a concept node triggered by that word. Each signature represents a unique set of linguistic expressions. For example, the signature <murdered, $murder-passive-victim$> represents all expressions of the form "was murdered", "were murdered", "have been murdered", etc. Signatures are generated automatically by applying CIRCUS to a text corpus.

A relevancy signature is a signature that is highly correlated with relevant texts in a preclassified training corpus. To generate relevancy signatures for a domain, the training corpus is processed by CIRCUS, which produces a set of instantiated concept nodes for each text. Each concept node is then transformed into a signature by pairing the name of the concept node with the word that triggered it. Once a set of signatures has been acquired from the corpus, for each signature we estimate the conditional probability that a text is relevant given that it contains the signature. The formula is:
Pr(relevant text | signature_i) = rel-freq_i / total-freq_i

where total-freq_i is the number of occurrences of signature_i in the training corpus, and rel-freq_i is the number of occurrences of signature_i that appeared in relevant texts in the training corpus. ("Appeared in" is used loosely here to mean the number of occurrences of the signature that came from relevant texts.)
Finally, two thresholds are used to identify the signatures that are most highly correlated with relevant texts. A relevance threshold R selects signatures with conditional probability ≥ R, and a frequency threshold M selects signatures that have appeared at least M times in the training corpus. For example, R = .85 specifies that at least 85% of the occurrences of a signature in the training corpus appeared in relevant texts, and M = 3 specifies that the signature must have appeared at least 3 times in the training corpus.
To classify a new text, the text is analyzed by CIRCUS and the resulting concept nodes are transformed into signatures. Then the signatures are compared with the list of relevancy signatures for the domain. If any of the relevancy signatures are found, then the text contains an expression that is strongly associated with the domain, so it is classified as relevant. If no relevancy signatures are found, then the text is classified as irrelevant. The presence of a single relevancy signature is enough to produce a relevant classification.
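The training and classification procedure just described can be summarized in a short sketch. It assumes CIRCUS has already reduced each text to its set of signatures; the function names and the toy data are illustrative, not part of the original system.

```python
from collections import Counter

def learn_relevancy_signatures(training_texts, R=0.85, M=3):
    """training_texts: list of (signatures, is_relevant) pairs, one per text,
    where each signature is a (trigger_word, concept_node_name) tuple."""
    total_freq, rel_freq = Counter(), Counter()
    for signatures, is_relevant in training_texts:
        for sig in signatures:            # count every occurrence
            total_freq[sig] += 1
            if is_relevant:
                rel_freq[sig] += 1
    # Keep signatures with Pr(relevant | signature) >= R and frequency >= M.
    return {sig for sig, n in total_freq.items()
            if n >= M and rel_freq[sig] / n >= R}

def classify(signatures, relevancy_signatures):
    """One relevancy signature is enough to label a text relevant."""
    return any(sig in relevancy_signatures for sig in signatures)

# Toy run with invented data: the murder signature passes both thresholds.
train = [([("murdered", "$murder-passive-victim$")], True)] * 4 + \
        [([("dead", "$number-dead$")], False)] * 2
rel_sigs = learn_relevancy_signatures(train)
print(classify([("murdered", "$murder-passive-victim$")], rel_sigs))  # True
```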
Experimental Results for Similar Expressions

Previous experiments demonstrated that the relevancy signatures algorithm can achieve high-precision text classification and performed better than an analogous word-based algorithm in two domains: terrorism and joint ventures (see [Riloff, 1994; Riloff and Lehnert, 1994] for details). In this paper, we focus on the effectiveness of similar linguistic expressions for classification. In many cases, similar signatures generated substantially different conditional probabilities. In particular, we found that recognizing singular and plural nouns, different verb forms, negation, and prepositions was critically important in both the terrorism and joint ventures domains. The results are based on 1500 texts from the MUC-4 terrorism corpus and 1080 texts from a joint ventures corpus. In both corpora, roughly 50% of the texts were relevant to the targeted domain. Although most general-purpose corpora contain a much smaller percentage of relevant texts, our goal is to simulate a pipelined system in which a traditional information retrieval system is first applied to a general-purpose corpus to identify potentially relevant texts. This prefiltered corpus is then used by our system to make more fine-grained domain discriminations.

[Footnote: These texts were randomly selected from a corpus of 1200 texts, of which 719 came from the MUC-5 joint ventures corpus [MUC-5 Proceedings, 1993] and 481 came from the Tipster detection corpus [Tipster Proceedings, 1993; Harman, 1992] (see [Riloff, 1994] for details of how these texts were chosen).]

[Footnote: In fact, the MUC-4 and MUC-5 corpora were constructed by applying a keyword search to large databases of news articles.]
Singular and Plural Nouns

Figures 4 and 5 show signatures that represent singular and plural forms of the same noun, and their conditional probabilities in the terrorism and joint ventures corpora, respectively. Singular and plural words produced dramatically different correlations with relevant texts in both domains. For example, Figure 4 shows that 83.9% of the occurrences of the singular noun "assassination" appeared in relevant texts, but only 51.3% of the occurrences of the plural form "assassinations" appeared in relevant texts. Similarly, in the joint ventures domain, 100% of the occurrences of "venture between" appeared in relevant texts, but only 75% of the occurrences of "ventures between" appeared in relevant texts. And these were not isolated cases; Figures 4 and 5 show many more examples of this phenomenon.

Signature                               Rel. %
<assassination, $murder$>                83.9%
<assassinations, $murder$>               51.3%
<car_bomb, $weapon-vehicle-bomb$>       100.0%
<car_bombs, $weapon-vehicle-bomb$>       75.0%
<corpse, $dead-body$>                   100.0%
<corpses, $dead-body$>                   50.0%
<disappearance, $disappearance$>         83.3%
<disappearances, $disappearance$>        22.2%
<grenade, $weapon-grenade$>              81.3%
<grenades, $weapon-grenade$>             34.1%
<murder, $murder$>                       83.8%
<murders, $murder$>                      56.7%

Figure 4: Singular/plural terrorism signatures

Signature                               Rel. %
<tie-up, $entity-tie-up-with$>          100.0%
<tie-ups, $entity-tie-ups-with$>          0.0%
<venture, $entity-venture-between$>     100.0%
<ventures, $entity-ventures-between$>    75.0%
<venture, $entity-venture-of$>           95.4%
<ventures, $entity-ventures-of$>         50.0%
<venture, $entity-venture-with$>         96.0%
<ventures, $entity-ventures-with$>       52.4%

Figure 5: Singular/plural joint ventures signatures

[Footnote: CIRCUS uses a phrasal lexicon to represent important phrases as single words. The underscore indicates that the phrase "car bomb" was treated as a single lexical item.]

[Footnote: The signature <tie-ups, $entity-tie-ups-with$> appeared only once in the corpus.]

The reason revolves around the fact that singular nouns usually referred to a specific incident, while the plural nouns often referred to general types of incidents. For example, the word "assassination" usually referred to the assassination of a specific person or group of people, such as "the assassination of John Kennedy" or "the assassination of three diplomats." In contrast, the word "assassinations" often referred to assassinations in general, such as "there were many assassinations in 1980", or "assassinations often have political ramifications." In both domains, a text was considered to be relevant only if it referred to a specific incident of the appropriate type.

[Footnote: The relevance criteria are based on the MUC-4 and Tipster guidelines [MUC-4 Proceedings, 1992; Tipster Proceedings, 1993].]

Verb Forms

We also observed that different verb forms (active, passive, infinitive) behaved very differently. Figures 6 and 7 show the statistics for various verb forms in both domains. In general, passive verbs were more highly correlated with relevance than active verbs in the terrorism domain. For example, 77.8% of the occurrences of "was bombed by <X>" appeared in relevant texts, but only 54.1% of the occurrences of "<X> bombed ..." appeared in relevant texts. In the MUC-4 corpus, passive verbs were most frequently used to describe terrorist events, while active verbs were equally likely to describe military events. Two possible reasons are that (1) the perpetrator is often not known in terrorist events, which makes the passive form more appropriate, and (2) the passive form connotes a sense of victimization, which news reporters might have been trying to convey.
Signature                                     Rel. %
<blamed, $suspected-or-accused-active$>        84.6%
<blamed, $suspected-or-accused-passive$>       33.3%
<bombed, $actor-passive-bombed-by$>            77.8%
<bombed, $actor-active-bomb$>                  54.1%
<broke, $damage-active$>                       80.0%
<broken, $damage-passive$>                     62.5%
<burned, $arson-passive$>                     100.0%
<burned, $arson-active$>                       76.9%
<charged, $perpetrator-passive$>               68.4%
<charged, $perpetrator-active$>                37.5%
<left, $location-passive$>                     87.5%
<left, $location-active$>                      20.0%

Figure 6: Terrorism signatures with different verb forms
However, the active verb form was more highly correlated with relevant texts for the words "blamed" and "broke". Terrorists were often actively "blamed" for an incident, while all kinds of people "were blamed" or "have been blamed" for other types of things. The active form "broke" was often used to describe damage to physical targets, while the passive form was often used in irrelevant phrases such as "talks were broken off" or "a group was broken up".
Signature                                          Rel. %
<assemble, $entity-active-assemble$>                87.5%
<assemble, $prod-infinitive-to-assemble$>           68.8%
<construct, $entity-active-construct$>             100.0%
<constructed, $facility-passive-constructed$>       63.6%
<form, $entity-infinitive-to-form$>                 83.1%
<form, $entity-obj-active-form$>                    69.2%
<put, $entity-passive-put-up-by$>                   84.2%
<put, $entity-active-put-up$>                       50.0%
<manufacture, $prod-infinitive-to-manufacture$>     86.7%
<manufacture, $prod-active-manufacture$>            53.8%
<manufactured, $prod-passive-manufactured$>         52.6%
<operate, $facility-active-operate$>                85.0%
<operated, $facility-passive-operated$>             66.7%
<supplied, $entity-active-supplied$>                83.3%
<supplied, $entity-passive-supplied-by$>            65.0%

Figure 7: Joint venture signatures with different verb forms
In the joint ventures domain, Figure 7 also shows significant differences in the relevancy rates of different verb forms. In most cases, active verbs were more relevant than passive verbs because active verbs often appeared in the future tense. This makes sense when describing joint venture activities because, by definition, companies are planning events in the future. For example, many texts reported that a joint venture company "will assemble" a new product, or "will construct" a new facility. In contrast, passive verbs usually represent the past tense and don't necessarily mention the actor (e.g., the company). For example, the phrase "a facility was constructed" implies that the construction has already happened and does not indicate who was responsible for the construction. Infinitive verbs were also common in this domain because companies often intend to do things as part of joint venture agreements.
Prepositions

In the next set of experiments, we investigated the role of prepositions as part of the text representation. First, we probed the joint ventures corpus with joint venture keywords and computed the recall and precision rates for these words, which appear in Figure 8. For example, we retrieved all texts containing the word "consortium" and found that 69.7% of them were relevant and 3.6% of the relevant texts were retrieved. Some of the keywords achieved high recall and precision rates. For example, 88.9% of the texts containing the words "joint" and "venture" were relevant. But only 73.2% of the texts containing the hyphenated word "joint-venture" were relevant. This is because the hyphenated form "joint-venture" is often used as a modifier, as in "joint-venture law" or "joint-venture proposals", where the main concept is not a specific joint venture. Figure 8 also shows much higher precision for the singular forms "venture" and "joint venture" than for the plural forms, which is consistent with our previous results for singular and plural nouns.

[Footnote: These results are from the full joint ventures corpus of 1200 texts.]

[Footnote: The words "joint" and "venture" did not have to appear in adjacent positions.]
Words              Recall   Precision
joint, venture      93.3%     88.9%
tie-up               2.5%     84.2%
venture             95.5%     82.8%
jointly             11.0%     78.9%
joint-venture        6.4%     73.2%
consortium           3.6%     69.7%
joint, ventures     19.3%     66.7%
partnership          7.0%     64.3%
ventures            19.8%     58.8%

Figure 8: Recall and precision scores for joint venture words
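A minimal sketch of the keyword probe behind Figure 8: retrieve every text containing all of the query words (not necessarily adjacent) and score the retrieval against the relevance labels. The corpus representation and function name are simplifications introduced here, not part of the original system.

```python
def keyword_scores(corpus, keywords):
    """corpus: list of (word_set, is_relevant) pairs. Retrieve every text
    containing all keywords (not necessarily adjacent) and score it."""
    retrieved = [rel for words, rel in corpus
                 if all(k in words for k in keywords)]
    relevant_total = sum(rel for _, rel in corpus)
    if not retrieved or not relevant_total:
        return 0.0, 0.0
    recall = sum(retrieved) / relevant_total      # relevant texts retrieved
    precision = sum(retrieved) / len(retrieved)   # retrieved texts relevant
    return recall, precision

# Toy usage with invented labels:
corpus = [({"joint", "venture", "toyota"}, True),
          ({"venture", "capitalists"}, False),
          ({"consortium"}, True)]
print(keyword_scores(corpus, ["joint", "venture"]))  # (0.5, 1.0)
```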
But perhaps the most surprising result was that most of the keywords did not do very well. The phrase "joint venture" achieved both high recall and high precision, but even this obviously important phrase produced less than 90% precision. And virtually all of the other keywords achieved modest precision; only "tie-up" and "venture" achieved greater than 80% precision.

When we add prepositions to these keywords, we produce more effective indexing terms. Figure 9 shows several signatures for the joint ventures domain that represent verbs and nouns paired with different prepositions. For example, Figure 9 shows that pairing the noun "venture" with the preposition "between" produces a signature that achieves 100% precision. Similarly, pairing the word "venture" with the prepositions "with" and "of" produces signatures that achieve over 95% precision. And pairing the word "tie-up" with the preposition "with" increases precision from 84.2% to 100%. Figure 9 also shows substantially different precision rates for the same word paired with different prepositions. For example, "project between" performs much better than "project with", and "set up with" performs much better than "set up by".
Signature                              Rel. %
<project, $entity-project-between$>    100.0%
<project, $entity-project-with$>        75.0%
<set, $entity-set-up-with$>             94.7%
<set, $entity-set-up-by$>               66.7%
<tie-up, $entity-tie-up-with$>         100.0%
<venture, $entity-venture-between$>    100.0%
<venture, $entity-venture-with$>        95.9%
<venture, $entity-venture-of$>          95.4%
<venture, $entity-venture-by$>          90.9%

Figure 9: Joint venture signatures with different prepositions
It is important to note that the signatures are generated by CIRCUS, which is a natural language processing system, so the preposition does not have to be adjacent to the noun. A prepositional phrase with the appropriate preposition only has to follow the noun and be extracted by a concept node triggered by the noun. For example, the sentence "Toyota formed a joint venture in Japan on March 12 with Nissan" would produce the signature <venture, $entity-venture-with$>, even though the preposition "with" is six words away from the word "venture".
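A crude approximation of this behavior is to pair a trigger noun with the first domain preposition that follows it anywhere in the clause. This is only a stand-in for CIRCUS's concept-node-driven prepositional phrase handling; the function name and the preposition set are illustrative assumptions.

```python
PREPS = {"with", "between", "of", "by"}  # only domain prepositions of interest

def preposition_signature(tokens, trigger="venture"):
    """Pair a trigger noun with the first domain preposition that follows it,
    even when other words intervene (non-domain prepositions are skipped)."""
    if trigger not in tokens:
        return None
    for tok in tokens[tokens.index(trigger) + 1:]:
        if tok in PREPS:
            return (trigger, f"$entity-{trigger}-{tok}$")
    return None

tokens = "toyota formed a joint venture in japan on march 12 with nissan".split()
print(preposition_signature(tokens))  # ('venture', '$entity-venture-with$')
```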
These results show that prepositions represent distinctions that are important in the joint ventures domain. The words "venture" and "joint venture" by themselves do not necessarily represent a specific joint venture activity. For example, they are often used as modifiers, as in "venture capitalists" or "joint venture legislation", or are used in a general sense, as in "one strategy is to form a joint venture". The presence of the preposition "with", however, almost always implies that a specific partner is mentioned. Similarly, the preposition "between" usually indicates that more than one partner is mentioned. In some sense, the prepositions act as pointers which indicate that one or more partners exist.

As a result, keywords paired with these prepositions (e.g., "tie-up with", "venture with", "venture between", "project between") almost always refer to a specific joint venture agreement and are much more effective indexing terms than the keywords alone. Furthermore, some prepositions are better indicators of the domain than others. For example, the preposition "between" suggests that multiple partners are mentioned and is therefore more highly associated with the domain than other prepositions.

[Footnote: We did not conduct the preposition experiments in the terrorism domain because the concept nodes in the hand-crafted terrorism dictionary included multiple prepositions, so it was difficult to compute statistics for prepositions individually. However, we expect that the same phenomenon occurs in the terrorism domain as well.]
Negation

We also found that negated phrases performed differently than similar phrases without negation. Figure 10 shows several examples of this phenomenon. In the terrorism domain, negative expressions were more relevant than their positive counterparts. For example, the expression "no casualties" was more relevant than the word "casualties", "was not injured" was more relevant than "was injured", and "no injuries" was more relevant than "injuries". The reason for this is subtle. Reports of injuries and casualties are common in military event descriptions as well as terrorist event descriptions. However, negative reports of "no injuries" or "no casualties" are much more common in terrorist event descriptions. In most cases, these expressions implicitly suggest that there were no civilian casualties as a result of a terrorist attack. This is another example of a phenomenon that would be difficult, if not impossible, for a human to predict. One of the main strengths of the relevancy signatures algorithm is that these associations are identified automatically using statistics generated from a training corpus.

[Footnote: We did not have any concept nodes representing negation in the joint ventures domain.]
Signature                            Rel. %
<casualties, $no-injury$>             84.7%
<casualties, $injury$>                46.1%
<injured, $no-injury-passive$>       100.0%
<injured, $injury-passive$>           76.7%
<injuries, $no-injury$>               83.3%
<injuries, $injury$>                  70.6%

Figure 10: Negative/positive terrorism signatures
Conclusions

To summarize, we have found that similar linguistic expressions produced dramatically different text classification results. In particular, singular nouns often represented specific incidents while plural nouns referred to general events. Different verb forms (active, passive, infinitive) distinguished different tenses (past, future) and the presence or absence of objects (e.g., passive verbs do not require a known actor). And prepositions played a major role by implicitly acting as pointers to actors and objects. Finally, negated expressions behaved differently than their non-negated counterparts, not because the negated concept was explicitly required for the domain, but because the negated expressions conveyed subtle connotations about the event.

Researchers in the information retrieval community have experimented with phrase-based indexing to create more effective indexing terms (e.g., [Croft et al., 1991; Dillon, 1983; Fagan, 1989]). However, most of these systems build complex phrases from combinations of nouns, noun modifiers, and occasionally verbs. Function words, such as prepositions and auxiliary verbs, are almost always ignored. Stopword lists typically throw away function words, preventing them from being considered during the generation of indexing terms. Many information retrieval systems also lose the ability to distinguish singular and plural nouns when they use a stemming algorithm.

Stopword lists and stemming algorithms perform valuable functions for many information retrieval systems. Stopword lists substantially reduce the size of inverted files, and stemming algorithms allow the system to generalize over different morphological variants. However, some common stopwords do contribute substantially to the meaning of phrases. We believe that certain types of common stopwords, such as prepositions and auxiliary verbs, should be available for use in building complex phrases. Similarly, stemming algorithms may be appropriate for some terms but not for others. Users would likely benefit from being able to specify whether a term should be stemmed or not. Automated indexing systems might also produce better results if both stemmed and non-stemmed indexing terms were available to them.
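To make the cost concrete, this hypothetical indexing sketch (with a deliberately naive suffix-stripping stemmer standing in for a real one) collapses "venture between" and "ventures of", whose relevancy rates differed sharply in Figures 5 and 9, into the identical indexing term; only with both steps disabled do the distinctions survive.

```python
STOPWORDS = {"between", "with", "of", "by", "the", "a"}

def naive_stem(word):
    # Deliberately naive stand-in for a real stemmer: strips a final "s",
    # conflating "ventures" with "venture".
    return word[:-1] if word.endswith("s") else word

def index_terms(text, use_stoplist=True, use_stemming=True):
    """Produce indexing terms conventionally, or with both steps disabled."""
    terms = text.lower().split()
    if use_stoplist:
        terms = [t for t in terms if t not in STOPWORDS]
    if use_stemming:
        terms = [naive_stem(t) for t in terms]
    return terms

# Conventional indexing erases exactly the distinctions studied in this paper:
print(index_terms("venture between"))                # ['venture']
print(index_terms("ventures of"))                    # ['venture']
# Keeping the "little words" preserves them:
print(index_terms("venture between", False, False))  # ['venture', 'between']
print(index_terms("ventures of", False, False))      # ['ventures', 'of']
```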
As disk space becomes cheaper, space considerations are not nearly as important as they once were, and we should think twice before throwing away potentially valuable words simply for the sake of space. Although many function words do not represent complex concepts on their own, their presence provides important clues about the information surrounding them. As text corpora grow in size and scope, user queries will become more specific and information retrieval systems will need to make more subtle domain discriminations. Our results suggest that information retrieval systems would likely benefit from including these small words as part of larger phrases. We have shown that the effectiveness of slightly different linguistic expressions and word forms can vary substantially, and we believe that these differences can be exploited to produce more effective indexing terms.
References

Croft, W. B.; Turtle, H. R.; and Lewis, D. D. 1991. The Use of Phrases and Structured Queries in Information Retrieval. In Proceedings, SIGIR 1991. 32-45.

Dillon, M. 1983. FASIT: A Fully Automatic Syntactically Based Indexing System. Journal of the American Society for Information Science 34(2):99-108.

Fagan, J. 1989. The Effectiveness of a Nonsyntactic Approach to Automatic Phrase Indexing for Document Retrieval. Journal of the American Society for Information Science 40(2):115-132.

Frakes, W. B. and Baeza-Yates, R., editors 1992. Information Retrieval: Data Structures and Algorithms. Prentice Hall, Englewood Cliffs, NJ.

Harman, D. 1991. How Effective is Suffixing? Journal of the American Society for Information Science 42(1):7-15.

Krovetz, R. 1993. Viewing Morphology as an Inference Process. In Proceedings, SIGIR 1993. 191-202.

Lehnert, W. 1991. Symbolic/Subsymbolic Sentence Analysis: Exploiting the Best of Two Worlds. In Advances in Connectionist and Neural Computation Theory, Vol. 1. Ablex Publishers, Norwood, NJ. 135-164.

Luhn, H. P. 1957. A Statistical Approach to Mechanized Encoding and Searching of Literary Information. IBM Journal of Research and Development 1(4):309-317.

MUC-4 Proceedings 1992. Proceedings of the Fourth Message Understanding Conference (MUC-4). Morgan Kaufmann, San Mateo, CA.

MUC-5 Proceedings 1993. Proceedings of the Fifth Message Understanding Conference (MUC-5). Morgan Kaufmann, San Francisco, CA.

Riloff, E. 1993. Automatically Constructing a Dictionary for Information Extraction Tasks. In Proceedings of the Eleventh National Conference on Artificial Intelligence. AAAI Press/The MIT Press. 811-816.

Riloff, E. 1994. Information Extraction as a Basis for Portable Text Classification Systems. Ph.D. Dissertation, Department of Computer Science, University of Massachusetts Amherst.

Riloff, E. and Lehnert, W. 1994. Information Extraction as a Basis for High-Precision Text Classification. ACM Transactions on Information Systems 12(3):296-333.

Tipster Proceedings 1993. Proceedings of the TIPSTER Text Program (Phase I). Morgan Kaufmann, San Francisco, CA.
