`
`Little Words Can Make a Big Difference for Text Classification
`
`Ellen Riloff
`Department of Computer Science
`University of Utah
`Salt Lake City, UT 84112
`E-mail: riloff@cs.utah.edu
`
`Abstract
`
`Most information retrieval systems use stopword lists
`and stemming algorithms. However, we have found
`that recognizing singular and plural nouns, verb forms,
`negation, and prepositions can produce dramatically
`different text classification results. We present results
`from text classification experiments that compare rel-
`evancy signatures, which use local linguistic context,
`with corresponding indexing terms that do not. In two
`different domains, relevancy signatures produced better
`results than the simple indexing terms. These experi-
`ments suggest that stopword lists and stemming algo-
`rithms may remove or conflate many words that could
`be used to create more effective indexing terms.
`
`Introduction
`Most information retrieval systems use a stopword list to
`prevent common words from being used as indexing terms.
`Highly frequent words, such as determiners and preposi-
`tions, are not considered to be content words because they
`appear in virtually every document. Stopword lists are al-
`most universally accepted as a necessary part of an informa-
`tion retrieval system. For example, consider the following
`quote from a recent information retrieval textbook:
`
“It has been recognized since the earliest days of information retrieval (Luhn 1957) that many of the most frequently occurring words in English (like “the”, “of”, “and”, “to”, etc.) are worthless indexing terms.” ([Frakes and Baeza-Yates, 1992], p. 113)
`
`Many information retrieval systems also use a stemming
algorithm to conflate morphologically related words into a single indexing term. The motivation behind stemming al-
`gorithms is to improve recall by generalizing over morpho-
`logical variants. Stemming algorithms are commonly used,
`although experiments to determine their effectiveness have
`produced mixed results (e.g., see [Harman, 1991; Krovetz,
`1993]).
`One benefit of stopword lists and stemming algorithms is
`that they significantly reduce the storage requirements of in-
`verted files. But at what price? We have found that some
`types of words, which would be removed by stopword lists
`or merged by stemming algorithms, play an important role in
`making certain domain discriminations. For example, simi-
`lar expressions containing different prepositions and auxil-
`iary verbs behave very differently. We have also found that
`singular and plural nouns produce dramatically different text
`classification results.
`First, we will describe a text classification algorithm that
uses linguistic expressions called “relevancy signatures” to
`classify texts. Next, we will present results from text clas-
`sification experiments in two domains which show that sim-
`ilar signatures produce substantially different classification
`results. Finally, we discuss the implications of these results
`for information retrieval systems.
`
`Relevancy Signatures
`Relevancy signatures represent linguistic expressions that
`can be used to classify texts for a specific domain (i.e., topic).
`The linguistic expressions are extracted from texts automat-
`ically using an information extraction system called CIR-
`CUS. The next section gives a brief introduction to informa-
`tion extraction and the CIRCUS sentence analyzer, and the
`following section describes relevancy signatures and how
`they are used to classify texts.
`
`Information Extraction
`CIRCUS [Lehnert, 1991] is a conceptual sentence analyzer
`that extracts domain-specific information from text. For ex-
`ample, in the domain of terrorism, CIRCUS can extract the
`names of perpetrators, victims, targets, weapons, dates, and
`locations associated with terrorist incidents. Information is
`
`
`extracted using a dictionary of domain-specific structures
`called concept nodes. Each concept node recognizes a spe-
`cific linguistic pattern and uses the pattern as a template for
`extracting information.
For example, a concept node dictionary for the terrorism domain contains a concept node called $murder-passive-victim$, which is triggered by the pattern “X was murdered” and extracts X as a murder victim. A similar concept node called $murder-active-perpetrator$ is triggered by the pattern “X murdered ...” and extracts X as the perpetrator of a murder.
`
`tor of a murder. A concept node is activated during sentence
`processing when it recognizes its pattern in a text.
Figure 1 shows a sample sentence and instantiated concept nodes produced by CIRCUS. Two concept nodes are generated in response to the passive form of the verb “murdered”. One concept node, $murder-passive-victim$, extracts the “three peasants” as murder victims, and a second concept node, $murder-passive-perpetrator$, extracts the “guerrillas” as perpetrators.
`
Linguistic Pattern           Example
<subject> passive-verb       <entity> was formed
<subject> active-verb        <entity> linked
<subject> verb dobj          <entity> completed acquisition
<subject> verb infinitive    <entity> agreed to form
<subject> auxiliary noun     <entity> is conglomerate
active-verb <dobj>           acquire <entity>
infinitive <dobj>            to acquire <entity>
verb infinitive <dobj>       agreed to establish <entity>
gerund <dobj>                producing <product>
noun auxiliary <dobj>        partner is <entity>
noun prep <np>               partnership between <entity>
active-verb prep <np>        buy into <entity>
passive-verb prep <np>       was signed between <entity>
infinitive prep <np>         to collaborate on <product>

Figure 2: Concept node patterns and examples from the joint ventures domain
`
`Sentence:
`Three peasants were murdered by guerrillas.
`
`$murder-passive-victim$
`victim = “three peasants”
`
`$murder-passive-perpetrator$
`perpetrator = “guerrillas”
`
`Figure 1: Two instantiated concept nodes
`
`Theoretically, concept nodes can be arbitrarily complex
`but, in practice, most of them recognize simple linguistic
`constructs. Most concept nodes represent one of the general
`linguistic patterns shown in Figure 2.
`All of the information extraction done by CIRCUS hap-
`pens through concept nodes, so it is crucial to have a good
`concept node dictionary for a domain. Multiple concept
`nodes may be generated for a sentence, or no concept nodes
`may be generated at all. Sentences that do not activate any
`concept nodes are effectively ignored.
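To make the mechanism concrete, the concept-node behavior described above can be sketched as follows. This is a simplified illustration of our own invention, not CIRCUS itself: the ConceptNode class, the regex patterns, and the analyze function are all hypothetical stand-ins for full conceptual sentence analysis.

```python
import re
from dataclasses import dataclass

# Minimal sketch of concept-node matching (hypothetical stand-in for
# CIRCUS, which uses conceptual sentence analysis, not regexes).
@dataclass
class ConceptNode:
    name: str      # e.g. "$murder-passive-victim$"
    pattern: str   # regex with one capture group for the extracted item
    slot: str      # role assigned to the extracted text

DICTIONARY = [
    ConceptNode("$murder-passive-victim$",
                r"^(.+?) (?:was|were) murdered", "victim"),
    ConceptNode("$murder-passive-perpetrator$",
                r"(?:was|were) murdered by (\w+)", "perpetrator"),
]

def analyze(sentence):
    """Instantiate every concept node whose pattern appears in the sentence."""
    return [(node.name, node.slot, m.group(1))
            for node in DICTIONARY
            if (m := re.search(node.pattern, sentence))]

print(analyze("Three peasants were murdered by guerrillas"))
# [('$murder-passive-victim$', 'victim', 'Three peasants'),
#  ('$murder-passive-perpetrator$', 'perpetrator', 'guerrillas')]
```

As in the sample sentence of Figure 1, the passive form of “murdered” instantiates two nodes, one filling the victim slot and one the perpetrator slot.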
Building a concept node dictionary by hand can be extremely time-consuming and tedious. We estimate that it took approximately 1500 person-hours for two experienced system developers to build a concept node dictionary by hand for the terrorism domain. However, we have since developed a system called AutoSlog [Riloff, 1994; Riloff, 1993] that creates concept node dictionaries automatically using an annotated training corpus. Given a training corpus for the terrorism domain, a dictionary created by AutoSlog achieved 98% of the performance of the hand-crafted dictionary and required only 5 person-hours to build.

[Footnote: In principle, a single concept node can extract more than one item. However, concept nodes produced by AutoSlog [Riloff, 1994; Riloff, 1993] extract only one item at a time. The joint venture results presented in this paper are based on a concept node dictionary produced by AutoSlog.]

[Footnote: These are the linguistic patterns used by AutoSlog to create the joint ventures dictionary (see [Riloff, 1994; Riloff, 1993] for details). The concept node dictionary for the terrorism domain was hand-crafted and contains some more complicated patterns as well.]
`
`Relevancy Signatures
`Motivation Most information retrieval systems classify
`texts on the basis of multiple words and phrases. However,
`for some classification tasks, classifying texts on the basis
`of a single linguistic expression can be effective. Although
`single words do not usually provide enough context to be
`reliable indicators for a domain, slightly larger phrases can
be reliable. For example, the word “dead” is not a reliable keyword for murder because people die in many ways that have nothing to do with murder. However, some expressions containing the word “dead” are reliable indicators of murder. Figure 3 shows several expressions involving the words “dead” and “fire”, and the percentage of occurrences of each expression that appeared in relevant texts. These results are based on 1500 texts from the MUC-4 corpus. The texts in the MUC-4 corpus were retrieved from a general database because they contain one or more words related to terrorism, but only half of them actually describe a relevant terrorist incident.
Figure 3 shows that every occurrence of the expression “was found dead” appeared in a relevant text. However, only 61% of the occurrences of the expression “left dead” and 47% of the occurrences of “<number> dead” (e.g., “there were 61 dead”) appeared in relevant texts. This is because the expression “was found dead” has an implicit connotation of foul play, which suggests that murder is suspected. In contrast, the expressions “left dead” and “<number> dead” often refer to military casualties that are not terrorist in nature.

[Footnote: MUC-4 was the Fourth Message Understanding Conference held in 1992 [MUC-4 Proceedings, 1992].]

[Footnote: The MUC-4 organizers defined terrorism according to a complicated set of guidelines but, in general, a relevant event was a specific incident that occurred in Latin America involving a terrorist perpetrator and civilian target.]

Expression        Rel. %
was found dead    100%
left dead         61%
<number> dead     47%
set on fire       100%
opened fire       87%
<weapon> fire     59%

Figure 3: Strength of Associations for Related Expressions
Figure 3 also shows that several expressions involving the word “fire” have different correlations with relevance. The expression “set on fire” was strongly correlated with relevant texts describing arson incidents, and the expression “opened fire” was highly correlated with relevant texts describing terrorist shooting incidents. However, the expression “<weapon> fire” (e.g., “rifle fire” or “gun fire”) was not highly correlated with terrorist texts because it often appears in texts describing military incidents.
`These results show that similar linguistic expressions can
`have very different associations with relevance for a domain.
`Furthermore, many of these distinctions would be difficult,
`if not impossible, for a human to anticipate. Based on these
`observations, we developed a text classification algorithm
`that automatically identifies linguistic expressions that are
`strongly associated with a domain and uses them to classify
`new texts. Our approach uses an underlying information ex-
`traction system, CIRCUS, to recognize linguistic context.
`
The Relevancy Signatures Algorithm  A signature is defined as a pair consisting of a word and a concept node triggered by that word. Each signature represents a unique set of linguistic expressions. For example, the signature <murdered, $murder-passive-victim$> represents all expressions of the form “was murdered”, “were murdered”, “have been murdered”, etc. Signatures are generated automatically by applying CIRCUS to a text corpus.
`A relevancy signature is a signature that is highly corre-
`lated with relevant texts in a preclassified training corpus. To
`generate relevancy signatures for a domain, the training cor-
`pus is processed by CIRCUS, which produces a set of instan-
`tiated concept nodes for each text. Each concept node is then
`transformed into a signature by pairing the name of the con-
`cept node with the word that triggered it. Once a set of signa-
`tures has been acquired from the corpus, for each signature
`we estimate the conditional probability that a text is relevant
given that it contains the signature. The formula is:

    Pr(relevant text | sig_i) = freq(sig_i ∈ relevant texts) / freq(sig_i)

where freq(sig_i) is the number of occurrences of the signature sig_i in the training corpus, and freq(sig_i ∈ relevant texts) is the number of occurrences of the signature sig_i in relevant texts in the training corpus. The “∈” is used loosely to denote the number of occurrences of the signature that “appeared in” relevant texts.
Finally, two thresholds are used to identify the signatures that are most highly correlated with relevant texts. A relevance threshold R selects signatures with conditional probability ≥ R, and a frequency threshold M selects signatures that have appeared at least M times in the training corpus. For example, R = .85 specifies that at least 85% of the occurrences of a signature in the training corpus appeared in relevant texts, and M = 3 specifies that the signature must have appeared at least 3 times in the training corpus.
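As a sketch, the selection procedure just described (estimate the conditional probability from training counts, then apply the R and M thresholds) might look like this in Python. The corpus layout and the signature pairs are invented for illustration:

```python
from collections import Counter

def relevancy_signatures(training, R=0.85, M=3):
    """Select signatures with Pr(relevant | sig) >= R and frequency >= M.

    `training` is a list of (signatures, is_relevant) pairs, one per text.
    """
    total, relevant = Counter(), Counter()
    for sigs, is_relevant in training:
        for sig in sigs:
            total[sig] += 1          # occurrences in the training corpus
            if is_relevant:
                relevant[sig] += 1   # occurrences in relevant texts
    return {sig for sig in total
            if total[sig] >= M and relevant[sig] / total[sig] >= R}

# Toy corpus: the murder signature occurs 10 times (9 in relevant texts)
# and passes both thresholds; the second signature occurs only twice,
# so it fails the frequency threshold M.
corpus = ([([("murdered", "$murder-passive-victim$")], True)] * 9
          + [([("murdered", "$murder-passive-victim$")], False)]
          + [([("dead", "$number-dead$")], True)] * 2)
print(relevancy_signatures(corpus))
# {('murdered', '$murder-passive-victim$')}
```

The frequency threshold M guards against signatures whose high conditional probability is an artifact of very few observations.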
`To classify a new text, the text is analyzed by CIRCUS
`and the resulting concept nodes are transformed into signa-
`tures. Then the signatures are compared with the list of rele-
`vancy signatures for the domain. If any of the relevancy sig-
`natures are found, then the text contains an expression that is
`strongly associated with the domain so it is classified as rel-
`evant. If no relevancy signatures are found, then the text is
`classified as irrelevant. The presence of a single relevancy
`signature is enough to produce a relevant classification.
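The classification step itself reduces to a set-membership test; a minimal sketch, with invented signature names:

```python
# Sketch of the classification rule: one matching relevancy signature is
# sufficient to label a text relevant. Signature pairs are invented examples.
def classify(text_signatures, relevancy_sigs):
    if any(sig in relevancy_sigs for sig in text_signatures):
        return "relevant"
    return "irrelevant"

relevancy_sigs = {("murdered", "$murder-passive-victim$"),
                  ("dead", "$found-dead-passive$")}

print(classify([("bombed", "$actor-active-bomb$"),
                ("murdered", "$murder-passive-victim$")], relevancy_sigs))
# relevant
print(classify([("left", "$location-active$")], relevancy_sigs))
# irrelevant
```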
`
`Experimental Results for Similar Expressions
Previous experiments demonstrated that the relevancy signatures algorithm can achieve high-precision text classification and performed better than an analogous word-based algorithm in two domains: terrorism and joint ventures (see [Riloff, 1994; Riloff and Lehnert, 1994] for details).
`In this paper, we focus on the effectiveness of similar lin-
`guistic expressions for classification. In many cases, sim-
`ilar signatures generated substantially different conditional
`probabilities. In particular, we found that recognizing sin-
`gular and plural nouns, different verb forms, negation, and
`prepositions was critically important in both the terrorism
and joint ventures domains. The results are based on 1500 texts from the MUC-4 terrorism corpus and 1080 texts from a joint ventures corpus. In both corpora, roughly 50% of the texts were relevant to the targeted domain. Although most general-purpose corpora contain a much smaller percentage of relevant texts, our goal is to simulate a pipelined system in which a traditional information retrieval system is first applied to a general-purpose corpus to identify potentially relevant texts. This prefiltered corpus is then used by our system to make more fine-grained domain discriminations.

[Footnote: These texts were randomly selected from a corpus of 1200 texts, of which 719 came from the MUC-5 joint ventures corpus [MUC-5 Proceedings, 1993] and 481 came from the Tipster detection corpus [Tipster Proceedings, 1993; Harman, 1992] (see [Riloff, 1994] for details of how these texts were chosen).]
`Singular and Plural Nouns
`Figures 4 and 5 show signatures that represent singular and
`plural forms of the same noun, and their conditional prob-
`abilities in the terrorism and joint ventures corpora, respec-
`tively. Singular and plural words produced dramatically dif-
`ferent correlations with relevant texts in both domains. For
example, Figure 4 shows that 83.9% of the occurrences of the singular noun “assassination” appeared in relevant texts, but only 51.3% of the occurrences of the plural form “assassinations” appeared in relevant texts. Similarly, in the joint ventures domain, 100% of the occurrences of “venture between” appeared in relevant texts, but only 75% of the occurrences of “ventures between” appeared in relevant texts. And these were not isolated cases; Figures 4 and 5 show many more examples of this phenomenon.
`
The reason revolves around the fact that singular nouns usually referred to a specific incident, while the plural nouns often referred to general types of incidents. For example, the word “assassination” usually referred to the assassination of a specific person or group of people, such as “the assassination of John Kennedy” or “the assassination of three diplomats.” In contrast, the word “assassinations” often referred to assassinations in general, such as “there were many assassinations in 1980”, or “assassinations often have political ramifications.” In both domains, a text was considered to be relevant only if it referred to a specific incident of the appropriate type.

Signature                               Rel. %
<assassination, $murder$>               83.9%
<assassinations, $murder$>              51.3%
<car_bomb, $weapon-vehicle-bomb$>       100.0%
<car_bombs, $weapon-vehicle-bomb$>      75.0%
<corpse, $dead-body$>                   100.0%
<corpses, $dead-body$>                  50.0%
<disappearance, $disappearance$>        83.3%
<disappearances, $disappearance$>       22.2%
<grenade, $weapon-grenade$>             81.3%
<grenades, $weapon-grenade$>            34.1%
<murder, $murder$>                      83.8%
<murders, $murder$>                     56.7%

Figure 4: Singular/plural terrorism signatures

Signature                               Rel. %
<tie-up, $entity-tie-up-with$>          100.0%
<tie-ups, $entity-tie-ups-with$>        0.0%
<venture, $entity-venture-between$>     100.0%
<ventures, $entity-ventures-between$>   75.0%
<venture, $entity-venture-of$>          95.4%
<ventures, $entity-ventures-of$>        50.0%
<venture, $entity-venture-with$>        96.0%
<ventures, $entity-ventures-with$>      52.4%

Figure 5: Singular/plural joint ventures signatures

[Footnote: In fact, the MUC-4 and MUC-5 corpora were constructed by applying a keyword search to large databases of news articles.]

[Footnote: CIRCUS uses a phrasal lexicon to represent important phrases as single words. The underscore indicates that the phrase “car bomb” was treated as a single lexical item.]

[Footnote: The <tie-ups, $entity-tie-ups-with$> signature only appeared once in the corpus.]

Verb Forms
We also observed that different verb forms (active, passive, infinitive) behaved very differently. Figures 6 and 7 show the statistics for various verb forms in both domains. In general, passive verbs were more highly correlated with relevance than active verbs in the terrorism domain. For example, 77.8% of the occurrences of “was bombed by X” appeared in relevant texts but only 54.1% of the occurrences of “X bombed ...” appeared in relevant texts. In the MUC-4 corpus, passive verbs were most frequently used to describe terrorist events, while active verbs were equally likely to describe military events. Two possible reasons are that (1) the perpetrator is often not known in terrorist events, which makes the passive form more appropriate, and (2) the passive form connotes a sense of victimization, which news reporters might have been trying to convey.
`
Signature                                   Rel. %
<blamed, $suspected-or-accused-active$>     84.6%
<blamed, $suspected-or-accused-passive$>    33.3%
<bombed, $actor-passive-bombed-by$>         77.8%
<bombed, $actor-active-bomb$>               54.1%
<broke, $damage-active$>                    80.0%
<broken, $damage-passive$>                  62.5%
<burned, $arson-passive$>                   100.0%
<burned, $arson-active$>                    76.9%
<charged, $perpetrator-passive$>            68.4%
<charged, $perpetrator-active$>             37.5%
<left, $location-passive$>                  87.5%
<left, $location-active$>                   20.0%

Figure 6: Terrorism signatures with different verb forms
`
However, the active verb form was more highly correlated with relevant texts for the words “blamed” and “broke”. Terrorists were often actively “blamed” for an incident, while all kinds of people “were blamed” or “have been blamed” for other types of things. The active form “broke” was often used to describe damage to physical targets while the passive form was often used in irrelevant phrases such as “talks were broken off”, or “a group was broken up”.

[Footnote: The relevance criteria are based on the MUC-4 and Tipster guidelines [MUC-4 Proceedings, 1992; Tipster Proceedings, 1993].]
`
`
`
`
`
`
`
`
Signature                                        Rel. %
<assemble, $entity-active-assemble$>             87.5%
<assemble, $prod-infinitive-to-assemble$>        68.8%
<construct, $entity-active-construct$>           100.0%
<constructed, $facility-passive-constructed$>    63.6%
<form, $entity-infinitive-to-form$>              83.1%
<form, $entity-obj-active-form$>                 69.2%
<put, $entity-passive-put-up-by$>                84.2%
<put, $entity-active-put-up$>                    50.0%
<manufacture, $prod-infinitive-to-manufacture$>  86.7%
<manufacture, $prod-active-manufacture$>         53.8%
<manufactured, $prod-passive-manufactured$>      52.6%
<operate, $facility-active-operate$>             85.0%
<operated, $facility-passive-operated$>          66.7%
<supplied, $entity-active-supplied$>             83.3%
<supplied, $entity-passive-supplied-by$>         65.0%

Figure 7: Joint venture signatures with different verb forms
`
`In the joint ventures domain, Figure 7 also shows sig-
`nificant differences in the relevancy rates of different verb
`forms. In most cases, active verbs were more relevant than
`passive verbs because active verbs often appeared in the fu-
`ture tense. This makes sense when describing joint ven-
`ture activities because, by definition, companies are planning
events in the future. For example, many texts reported that a joint venture company “will assemble” a new product, or “will construct” a new facility. In contrast, passive verbs usually represent the past tense and don’t necessarily mention the actor (e.g., the company). For example, the phrase “a facility was constructed” implies that the construction has already happened and does not indicate who was responsible for the construction. Infinitive verbs were also common in this domain because companies often intend to do things as part of joint venture agreements.
`
Prepositions
In the next set of experiments, we investigated the role of prepositions as part of the text representation. First, we probed the joint ventures corpus with joint venture keywords and computed the recall and precision rates for these words, which appear in Figure 8. For example, we retrieved all texts containing the word “consortium” and found that 69.7% of them were relevant and 3.6% of the relevant texts were retrieved. Some of the keywords achieved high recall and precision rates. For example, 88.9% of the texts containing the words “joint” and “venture” were relevant. But only 73.2% of the texts containing the hyphenated word “joint-venture” were relevant. This is because the hyphenated form “joint-venture” is often used as a modifier, as in “joint-venture law” or “joint-venture proposals”, where the main concept is not a specific joint venture. Figure 8 also shows much higher precision for the singular forms “venture” and “joint venture” than for the plural forms, which is consistent with our previous results for singular and plural nouns.

[Footnote: These results are from the full joint ventures corpus of 1200 texts.]

[Footnote: The words “joint” and “venture” were not necessarily in adjacent positions.]
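The recall and precision measurements above follow from a straightforward computation; this sketch uses an invented toy corpus of (text, is_relevant) pairs rather than the joint ventures corpus:

```python
def probe(keyword, corpus):
    """Recall/precision of retrieving every text containing `keyword`.

    Precision: fraction of retrieved texts that are relevant.
    Recall: fraction of all relevant texts that were retrieved.
    """
    retrieved = [rel for text, rel in corpus if keyword in text]
    relevant_total = sum(rel for _, rel in corpus)
    hits = sum(retrieved)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / relevant_total if relevant_total else 0.0
    return recall, precision

corpus = [("a new joint venture was formed", 1),
          ("joint-venture law was debated", 0),
          ("the consortium met", 1),
          ("storms hit the coast", 0)]
print(probe("venture", corpus))
# (0.5, 0.5): "venture" retrieves two texts, one of which is relevant,
# and catches one of the two relevant texts.
```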
`
Words             Recall    Precision
joint, venture    93.3%     88.9%
tie-up            2.5%      84.2%
venture           95.5%     82.8%
jointly           11.0%     78.9%
joint-venture     6.4%      73.2%
consortium        3.6%      69.7%
joint, ventures   19.3%     66.7%
partnership       7.0%      64.3%
ventures          19.8%     58.8%

Figure 8: Recall and precision scores for joint venture words
`
But perhaps the most surprising result was that most of the keywords did not do very well. The phrase “joint venture” achieved both high recall and precision, but even this obviously important phrase produced less than 90% precision. And virtually all of the other keywords achieved modest precision; only “tie-up” and “venture” achieved greater than 80% precision.
When we add prepositions to these keywords, we produce more effective indexing terms. Figure 9 shows several signatures for the joint ventures domain that represent verbs and nouns paired with different prepositions. For example, Figure 9 shows that pairing the noun “venture” with the preposition “between” produces a signature that achieves 100% precision. Similarly, pairing the word “venture” with the prepositions “with” and “of” produces signatures that achieve over 95% precision. And pairing the word “tie-up” with the preposition “with” increases precision from 84.2% to 100%. Figure 9 also shows substantially different precision rates for the same word paired with different prepositions. For example, “project between” performs much better than “project with”, and “set up with” performs much better than “set up by”.
`
Signature                              Rel. %
<project, $entity-project-between$>    100.0%
<project, $entity-project-with$>       75.0%
<set, $entity-set-up-with$>            94.7%
<set, $entity-set-up-by$>              66.7%
<tie, $entity-tie-up-with$>            100.0%
<venture, $entity-venture-between$>    100.0%
<venture, $entity-venture-with$>       95.9%
<venture, $entity-venture-of$>         95.4%
<venture, $entity-venture-by$>         90.9%

Figure 9: Joint venture signatures with different prepositions
`
It is important to note that the signatures are generated by CIRCUS, which is a natural language processing system, so the preposition does not have to be adjacent to the noun. A prepositional phrase with the appropriate preposition only has to follow the noun and be extracted by a concept node triggered by the noun. For example, the sentence “Toyota formed a joint venture in Japan on March 12 with Nissan” would produce the signature <venture, $entity-venture-with$>, even though the preposition “with” is six words away from the word “venture”.
These results show that prepositions represent distinctions that are important in the joint ventures domain. The words “venture” and “joint venture” by themselves do not necessarily represent a specific joint venture activity. For example, they are often used as modifiers, as in “venture capitalists” or “joint venture legislation”, or are used in a general sense, as in “one strategy is to form a joint venture”. The presence of the preposition “with”, however, almost always implies that a specific partner is mentioned. Similarly, the preposition “between” usually indicates that more than one partner is mentioned. In some sense, the prepositions act as pointers which indicate that one or more partners exist.
As a result, keywords paired with these prepositions (e.g., “tie-up with”, “venture with”, “venture between”, “project between”) almost always refer to a specific joint venture agreement and are much more effective indexing terms than the keywords alone. Furthermore, some prepositions are better indicators of the domain than others. For example, the preposition “between” suggests that multiple partners are mentioned and is therefore more highly associated with the domain than other prepositions.
Negation
We also found that negated phrases performed differently than similar phrases without negation. Figure 10 shows several examples of this phenomenon. In the terrorism domain, negative expressions were more relevant than their positive counterparts. For example, the expression “no casualties” was more relevant than the word “casualties”, “was not injured” was more relevant than “was injured”, and “no injuries” was more relevant than “injuries”. The reason for this is subtle. Reports of injuries and casualties are common in military event descriptions as well as terrorist event descriptions. However, negative reports of “no injuries” or “no casualties” are much more common in terrorist event descriptions. In most cases, these expressions implicitly suggest that there were no civilian casualties as a result of a terrorist attack. This is another example of a phenomenon that would be difficult if not impossible for a human to predict. One of the main strengths of the relevancy signatures algorithm is that these associations are identified automatically using statistics generated from a training corpus.

[Footnote: We did not conduct the preposition experiments in the terrorism domain because the concept nodes in the hand-crafted terrorism dictionary included multiple prepositions so it was difficult to compute statistics for prepositions individually. However, we expect that the same phenomenon occurs in the terrorism domain as well.]

[Footnote: We did not have any concept nodes representing negation in the joint ventures domain.]
`
Signature                         Rel. %
<casualties, $no-injury$>         84.7%
<casualties, $injury$>            46.1%
<injured, $no-injury-passive$>    100.0%
<injured, $injury-passive$>       76.7%
<injuries, $no-injury$>           83.3%
<injuries, $injury$>              70.6%

Figure 10: Negative/positive terrorism signatures
`
`Conclusions
`To summarize, we have found that similar linguistic expres-
`sions produced dramatically different text classification re-
`sults. In particular, singular nouns often represented specific
`incidents while plural nouns referred to general events. Dif-
`ferent verb forms (active, passive, infinitive) distinguished
`different tenses (past, future) and the presence or absence of
`objects (e.g., passive verbs do not require a known actor).
`And prepositions played a major role by implicitly acting as
`pointers to actors and objects. Finally, negated expressions
`behaved differently than their non-negated counterparts, not
`because the negated concept was explicitly required for the
`domain, but because the negated expressions conveyed sub-
`tle connotations about the event.
`Researchers in the information retrieval community have
`experimented with phrase-based indexing to create more ef-
`fective indexing terms (e.g., [Croft et al., 1991; Dillon, 1983;
`Fagan, 1989]). However, most of these systems build com-
`plex phrases from combinations of nouns, noun modifiers,
`and occasionally verbs. Function words, such as preposi-
`tions and auxiliary verbs, are almost always ignored. Stop-
`word lists typically throw away function words, preventing
`them from being considered during the generation of index-
`ing terms. Many information retrieval systems also lose the
`ability to distinguish singular and plural nouns when they use
`a stemming algorithm.
`Stopword lists and stemming algorithms perform valuable
`functions for many information retrieval systems. Stopword
`lists substantially reduce the size of inverted files and stem-
`ming algorithms allow the system to generalize over differ-
`ent morphological variants. However, some common stop-
`words do contribute substantially to the meaning of phrases.
`We believe that certain types of common stopwords, such
`as prepositions and auxiliary verbs, should be available for
`use in building complex phrases. Similarly, stemming algo-
`rithms may be appropriate for some terms but not for oth-
`ers. Users would likely benefit from being able to spec-
`ify whether a term should be stemmed or not. Automated
`
indexing systems might also produce better results if both stemmed and non-stemmed indexing terms were available to them.
As disk space becomes cheaper, space considerations are not nearly as important as they once were, and we should think twice before throwing away potentially valuable words simply for the sake of space. Although many function words do not represent complex concepts on their own, their presence provides important clues about the information surrounding them. As text corpora grow in size and scope, user queries will be more specific and information retrieval systems will need to make more subtle domain discriminations. Our results suggest that information retrieval systems would likely benefit from including these small words as part of larger phrases. We have shown that the effectiveness of slightly different linguistic expressions and word forms can vary substantially, and believe that these differences can be exploited to produce more effective indexing terms.

References
Croft, W. B.; Turtle, H. R.; and Lewis, D. D. 1991. The Use of Phrases and Structured Queries in Information Retrieval. In Proceedings, SIGIR 1991. 32–45.

Dillon, M. 1983. FASIT: A Fully Automatic Syntactically Based Indexing System. Journal of the American Society for Information Science 34(2):99–108.

Fagan, J. 1989. The Effectiveness of a Nonsyntactic Approach to Automatic Phrase Indexing for Document Retrieval. Journal of the American Society for Information Science 40(2):115–132.

Frakes, William B. and Baeza-Yates, Ricardo, editors 1992. Information Retrieval: Data Structures and Algorithms. Prentice Hall, Englewood Cliffs, NJ.

Harman, D. 1991. How Effective is Suffixing? Journal of the American Society for Information Science

Riloff, E. and Lehnert, W. 1994. Information Extraction as a Basis for High-Precision Text Classification. ACM Transactions on Information Systems 12(3):296–333.

Riloff, E. 1993. Automatically Constructing a Dictionary for Information Extraction Tasks. In Proceedings of the Eleventh National Conference on Artificial Intelligence. AAAI Press/The MIT Press. 811–816.

Riloff, E. 1994. Information Extraction as a Basis for Portable Text Classification Systems. Ph.D. Dissertation, Department of Computer Science, University of Massachusetts Amherst.

Proceedings of the TIPSTER Text Program (Phase I), San Francisco, CA. Morgan Kaufmann.