Phrasal Translation and Query Expansion Techniques for Cross-Language Information Retrieval

`Lisa Ballesteros and W. Bruce Croft
`balleste@cs.umass.edu,
`croft@cs.umass.edu
`Center for Intelligent Information Retrieval
`Computer Science Department
`University of Massachusetts
`Amherst, MA 01003-4610 USA
`
`Abstract
`
Dictionary methods for cross-language information retrieval give performance below that for mono-lingual retrieval. Failure to translate multi-term phrases has been shown to be one of the factors responsible for the errors associated with dictionary methods. First, we study the importance of phrasal translation for this approach. Second, we explore the role of phrases in query expansion via local context analysis and local feedback and show how they can be used to significantly reduce the error associated with automatic dictionary translation.
`
1 Introduction
`
The development of IR systems for languages other than English has focused on building mono-lingual systems. Increased availability of on-line text in languages other than English and increased multi-national collaboration have motivated research in cross-language information retrieval (CLIR), the development of systems to perform retrieval across languages.

There have been three main approaches to CLIR: translation via machine translation techniques [Rad94]; parallel or comparable corpora-based methods [DD95a, LL90, SB96]; and dictionary-based methods [Sal72, Pev72, HG96, BC96]. Each of these approaches has shown promise, but also has disadvantages associated with it. Results suggest that improvements gained via machine translation techniques may not outweigh the cost of linguistic analysis. One disadvantage of methods based on the use of parallel and aligned corpora is lack of resources: parallel corpora are not always readily available and those that are available tend to be relatively small or to cover only a small number of subjects. Performance is also dependent on how well the corpora are aligned. Our work takes the third approach and applies dictionary-based methods.

Automatic machine readable dictionary (MRD) query translation leads to a drop in effectiveness of 40-60% below that of mono-lingual retrieval [HG96, BC96]. This is due primarily to three factors. First, specialized vocabulary not contained in the dictionary will not be translated. Second, dictionary translations are inherently ambiguous and add extraneous terms to the query. Third, failure to translate multi-term concepts as phrases reduces effectiveness.

We are developing strategies for reducing the errors associated with dictionary-based methods and focus on strategies which have a low processing cost and do not require scarce resources. This paper explores the identification of phrases in queries and the effectiveness of simple phrasal translation. In addition, we investigate the role of phrases in query expansion by comparing two approaches, local feedback [AF77] and Local Context Analysis [XC96], to expanding queries at various stages of the "translation" process.

Permission to make digital/hard copies of all or part of this material for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication and its date appear, and notice is given that copyright is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires specific permission and/or fee.
SIGIR 97 Philadelphia PA, USA
Copyright 1997 ACM 0-89791-836-3/97/7..$3.50
`
2 Previous Work
`
Effective systems for mono-lingual information retrieval have been available for several years. Typically, research in the area of multi-lingual information retrieval has focused on incorporating new languages into existing systems to allow them to run in several mono-language retrieval modes. Recently, greater interest in retrieval across languages has motivated more work to study the factors involved in building a CLIR system.

Salton [Sal72] showed early on that with carefully constructed thesauri, cross-language retrieval was nearly as effective as mono-lingual retrieval. The study was sound; however, the test collection was very small by current standards and it is unrealistic to manually index larger databases.
Landauer and Littman [LL90] have also proposed a method for cross-language retrieval. Latent Semantic Indexing (LSI) [FDD+88] was used to create a multidimensional indexing space for a parallel corpus of English documents and their French translations. Their method has been successful at the task of retrieving a query's translation in response to that query. However, the collection used was small, containing 2482 paragraph-length documents from Canadian Parliamentary proceedings, and no results of its effectiveness on the traditional retrieval task have been reported. The method also relies on the use of parallel corpora which are not always readily available.

AOL Ex. 1031

Another method that relies on parallel and aligned corpora has been suggested by Dunning and Davis [DD93]. Their method is based on the vector space model and involves the linear transformation of the representation of a query in one language to its corresponding representation in another language. The transformation is done by reduction of the document space to generate a translation matrix. They have had some success in efficiently estimating the translation matrix and results of tests to estimate its quality are promising. Further tests of the effectiveness of the method have been limited by its computational complexity.

Davis and Dunning [DD95a, DD95b] have also developed several other approaches to query translation, which they tested on the TREC ISM Spanish queries and collection. Two of these rely on the use of a Spanish-English parallel corpus and one uses evolutionary programming for query optimization. In the first of the parallel corpus approaches English queries were translated by replacing the original query terms with the 100 most frequent terms in the top 100 retrieved documents from the Spanish side of the parallel corpus. The second approach replaces the original query terms with terms found to be statistically significant. The evolutionary programming method starts with a query generated by the high frequency approach. It then modifies queries by randomly adding or deleting query terms. Optimization is done by evaluating query fitness after each round of mutations, and selecting the "most fit" to continue to the next generation. The evolutionary programming approach was the most effective, but results were disappointing, with each of the methods performing well below the word-by-word translation baseline.
More recently, Davis [Dav96] uses part-of-speech tagging to select the best Spanish translations for English query terms. A parallel corpus is then used to further disambiguate the translated queries by choosing the Spanish terms that retrieve documents most like those retrieved for the English query. This approach is more effective than previous ones, achieving up to 73.5% of monolingual performance.

Sheridan and Ballerini [SB96] performed "translations" using co-occurrence thesauri generated from a comparable corpus. Cross-language experiments suggest that using co-occurrence thesauri generated with this type of data yields a translation effect. However, performance measured by average precision is still considerably below that of mono-lingual retrieval. Disadvantages of the approach are that it relies on time-sensitive documents, that queries are constrained to referencing specific events, and that it requires a strict definition of the notion of relevance. This is a side effect of the way in which the test data was constructed and in theory should not be a problem inherent to the approach, but this has yet to be shown experimentally.

Previous work has been done to recognize and translate phrases in text, for example [SWH96, Kup93]. These approaches identify source language phrases and rely upon the use of parallel corpora to identify the context in which target language translations should be found. Although these approaches work well, we use simple dictionary translation because we are interested in exploring what can be done when scarce resources such as parallel corpora are unavailable.
`
`3 Dictionary Translation and Query Expansion
`
Previous studies [HG96, BC96] have shown that automatic word-by-word (WBW) translation of queries via MRD results in a 40-60% loss in effectiveness below that of mono-lingual retrieval. One of the factors causing this drop in effectiveness is ambiguity caused by the transfer of extraneous terms. What may be more important, however, is the failure to translate multi-term concepts as phrases. We have shown [BC96] that, despite the loss of phrases, query expansion via "local feedback" could reduce the errors such an approach normally makes. Relevance feedback [SB90] is a method by which a query is modified by the addition of terms found in documents known to be relevant to the query. Local feedback [AF77] differs from classic relevance feedback in that it assumes the top retrieved documents are relevant.

Local feedback modification before or after automatic query translation via MRD significantly improves performance. Pre-translation feedback expansion creates a stronger base for translation and improves precision. Local feedback after MRD translation introduces terms which de-emphasize irrelevant translations to reduce ambiguity and improve recall. Combining pre- and post-translation feedback is most effective, and reduces translation error by up to 36%. Improvement appears to be due to the removal of error caused by the addition of extraneous terms via the translation process.
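
To make the local feedback idea described above concrete, the following Python sketch expands a query with terms drawn from the top retrieved documents, which are simply assumed to be relevant. It is an illustration only: the function name `local_feedback_expand`, the ranking of candidate terms by raw frequency, and the toy documents are assumptions, not the procedure or parameters used in [BC96].

```python
from collections import Counter


def local_feedback_expand(query_terms, ranked_docs, top_docs=20, top_terms=5):
    """Add frequent terms from the top-ranked documents to the query.

    ranked_docs is a list of token lists, best match first (hypothetical input).
    The top documents are assumed relevant; no relevance judgments are used.
    """
    counts = Counter()
    for doc in ranked_docs[:top_docs]:
        counts.update(t for t in doc if t not in query_terms)
    # Rank candidates by frequency; break ties alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    expansion = [term for term, _ in ranked[:top_terms]]
    return list(query_terms) + expansion


ranked = [
    ["trade", "agreement", "mexico", "canada"],
    ["trade", "tariff", "mexico"],
    ["weather", "report"],
]
expanded = local_feedback_expand(["mexico", "canada"], ranked, top_docs=2, top_terms=2)
print(expanded)
```

Because the third document falls outside the `top_docs=2` cutoff, its terms never become expansion candidates, which mirrors the assumption that only the top retrieved documents are treated as relevant.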
In this paper, we look at another method of query expansion known as local context analysis (LCA) [XC96] to find words and phrases related to each query. LCA is a query expansion method that uses both global and local document analysis, and has been shown to be more effective than simple local feedback. The reason for this study is two-fold. First, we are interested in exploring the effectiveness of simple phrasal translation. Second, we want to compare these two methods of query expansion, local feedback and local context analysis (LCA), for addressing the error associated with dictionary translation of words and phrases.
`
`4 Experiments
`
The experiments in this study were limited to two languages: Spanish and English. The Spanish queries consisted of TREC topics SP26-45. Evaluation was performed on the 208 MB TREC ISM (El Norte) Spanish collection with provided relevance judgments. Training data for the pre-translation LCA experiments consisted of the documents in the 301 MB San Jose Mercury News (SJMN) database from the TREC collection.
Each Spanish query has relevance judgments. In order to use these judgments, we need to test the effectiveness of MRD translations to Spanish. To do this, we created base queries by manually translating the Spanish queries to English (herein referred to as BASE). The automatic translations of the base queries could then be evaluated using the relevance judgments of the original queries. The manual translation of the Spanish queries was performed by a bilingual graduate student whose native language is English.
Phrases were identified in BASE queries in the following way. First, queries were tagged with the BBN part-of-speech tagger. Sequences of nouns and adjective-noun pairs were taken to be phrases. Automatic translations were performed by translating individual terms word-by-word and phrases as multi-term concepts. The word-by-word translations were done by replacing query terms in the source language with the dictionary definition of those terms in the target language. Words that were not found in the dictionary were added to the new query without translation. The Collins English-Spanish bilingual MRD was used for the translations. For a more detailed description of this process, see [BC96]. Phrasal translations were performed using information on phrases and word usage contained in the Collins MRD. This allowed the replacement of a source phrase with its multi-term representation in the target language. When a phrase could not be defined using this information, it was translated word-by-word as described above. Stop words and stop phrases such as "A relevant document will" were also removed.
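
The phrase identification rule above (sequences of nouns and adjective-noun pairs) can be sketched as follows. This is one possible reading of the rule, not the authors' code: the tag names `ADJ` and `NOUN` are placeholders rather than the BBN tagger's actual tag set, and allowing one adjective in front of a longer noun run is an assumption.

```python
def identify_phrases(tagged):
    """Group (word, tag) pairs into single words and multi-word phrases.

    A phrase is taken to be an optional leading adjective followed by a run
    of nouns, provided the whole unit spans at least two words.
    """
    units, i = [], 0
    while i < len(tagged):
        j = i
        # Optional single adjective, only if a noun follows it.
        if tagged[j][1] == "ADJ" and j + 1 < len(tagged) and tagged[j + 1][1] == "NOUN":
            j += 1
        # Consume the run of nouns.
        while j < len(tagged) and tagged[j][1] == "NOUN":
            j += 1
        if j - i >= 2:
            units.append(tuple(word for word, _ in tagged[i:j]))
            i = j
        else:
            units.append(tagged[i][0])
            i += 1
    return units


tagged_query = [
    ("the", "DET"), ("economic", "ADJ"), ("and", "CONJ"),
    ("commercial", "ADJ"), ("relations", "NOUN"), ("between", "PREP"),
    ("mexico", "NOUN"), ("and", "CONJ"), ("canada", "NOUN"),
]
units = identify_phrases(tagged_query)
print(units)
```

In this toy input only "commercial relations" is grouped as a phrase; isolated nouns and adjectives remain single terms and would be translated word-by-word.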
Non-interpolated average precision on the top 1000 retrieved documents is used as the basis of evaluation for all experiments. CLIR would be useful for people who can only afford to have a small number of documents translated or who do not speak a foreign language well enough to formulate a good query, but who can read it well enough to judge a document's relevance. However, it is unrealistic to expect the user to read many retrieved foreign documents to find a relevant one, so in some cases we also report precision at low recall levels. The following sections describe our experiments. In section 5 we analyze and discuss the importance of phrasal translation. Next we present a comparison of LCA and local feedback expansion. Sections 6.1, 6.2, and 6.3 describe how pre-translation, post-translation, and combined pre- and post-translation expansion methods help to improve performance (see Fig. 1 for a flow chart of query processing for the experiments). Finally, section 7 presents conclusions and future work.
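
For readers unfamiliar with the evaluation measures named at the start of the preceding paragraph, the sketch below computes non-interpolated average precision with a rank cutoff and precision after a fixed number of retrieved documents. The function names and the toy ranking are illustrative assumptions; this is not the evaluation software used in the experiments.

```python
def average_precision(ranked_ids, relevant_ids, cutoff=1000):
    """Non-interpolated average precision over the top `cutoff` documents."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids[:cutoff], start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant document
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def precision_at(ranked_ids, relevant_ids, k):
    """Fraction of the top k retrieved documents that are relevant."""
    relevant = set(relevant_ids)
    return sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant) / k


ranking = ["d1", "d2", "d3", "d4", "d5"]
judged_relevant = {"d1", "d3", "d9"}
ap = average_precision(ranking, judged_relevant)
p5 = precision_at(ranking, judged_relevant, 5)
print(ap, p5)
```

Here two of three relevant documents are retrieved, at ranks 1 and 3, so the average precision is (1/1 + 2/3) / 3; the unretrieved document lowers the score without appearing in the ranking.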
All work in this study was performed using the INQUERY information retrieval system. INQUERY is based on the Bayesian inference net model and is described elsewhere [TC91b, TC91a, CCB95].

[Figure 1 is a flow chart: the original Spanish TREC queries are translated by a human into English (BASE) queries; these are translated back into Spanish by automatic dictionary (MRD) translation, with optional query expansion before and/or after translation; the resulting Spanish queries are run in INQUERY.]

Figure 1: Flow chart of query processing.
`
5 Phrasal Translation
`
Failure to translate multi-term concepts as phrases greatly reduces the effectiveness of dictionary translation. In experiments where query phrases were manually translated [BC96], performance improved by up to 25% over automatic word-by-word (WBW) query translation. Our hypothesis is that automatically identifying phrases and defining them as such would improve effectiveness.

To test this hypothesis, we compare performance of automatically translated queries both with and without phrasal identification and translation. Phrasal translations are based on a database of phrasal and word usage information extracted from the Collins Spanish-English MRD. During phrase translation, the database is searched for English phrases. A hit returns the Spanish translation of the English phrase. If more than one translation is found, each of them is added to the query. Table 1 gives some examples of phrasal translations.
`
Phrase            Translation
united nations    Naciones Unidas
                  Organización de las Naciones Unidas
trade agreement   convenio comercial
south africa      Unión Sudafricana
                  Africa del Sur
member country    los países miembros
                  los países afiliados
                  los países participantes
                  los países pertenecientes

Table 1: Phrasal translations.
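
The lookup behaviour described in this section, together with the word-by-word fallback from Section 4, can be sketched as below. The dictionaries are tiny hand-made stand-ins populated with entries that appear in this paper's tables; the names `PHRASES`, `WORDS`, and `translate` are hypothetical, and the real system draws on the Collins MRD rather than in-memory dictionaries.

```python
# Toy stand-ins for the phrasal and word-level dictionary data.
PHRASES = {
    "united nations": ["Naciones Unidas", "Organización de las Naciones Unidas"],
    "trade agreement": ["convenio comercial"],
}
WORDS = {
    "trade": ["comercio", "negocio", "trafico", "industria"],
    "zone": ["zona"],
}


def translate(units, phrases=PHRASES, words=WORDS):
    """Translate query units, preferring phrasal entries over word-by-word."""
    translated = []
    for unit in units:
        tokens = unit.split()
        if len(tokens) > 1 and unit in phrases:
            # Every phrasal translation found is added to the query.
            translated.extend(phrases[unit])
            continue
        for token in tokens:
            # Words missing from the dictionary pass through untranslated.
            translated.extend(words.get(token, [token]))
    return translated


result = translate(["trade agreement", "trade zone", "cuba"])
print(result)
```

"trade agreement" is replaced by its single phrasal entry, while "trade zone" has no phrasal entry and so picks up every word-level sense of "trade", which illustrates how the fallback reintroduces ambiguity.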
`
The results in Table 2 suggest that in this case, phrasal translation does not improve effectiveness. Table 2 gives average precision values for a baseline of automatic WBW translation vs automatic WBW with phrasal translation. A closer look at individual queries reveals that phrasal translation is not ineffective, but that results are sensitive to poor translations. Average precision drops 40% below a baseline of automatic WBW translation for TREC [Har95] query SP30 when phrasal translations are included. However, the problem for this query is that "sports program" is translated as "emisión deportiva", meaning televised sports program. When the poor phrasal translation is replaced with a WBW translation, results improve considerably (+150% over the baseline). Table 3 shows 5 representations of SP30: Original, BASE, automatic WBW translation, automatic phrasal + WBW translation, and automatic WBW translation + "good" phrasal translations. Parentheses enclose recognized phrases and brackets enclose phrasal translations. Results for the last three queries are given in Table 4.
`
        WBW      Phrasal
Avg     0.0823   0.0826

Table 2: Average precision of WBW vs phrasal translation.
`
Original:             programas y intercambios deportivos entre Mexico y los Estados Unidos
BASE:                 (Sports programs) and (exchange programs) between Mexico and the (United States)
WBW:                  deporte caza deporte juego diversión victima juguete programs canje intercambio programs Mejico Mexico States
Phrasal + WBW:        [emisión deportiva] cambio canje intercambio programs [Estados Unidos] [el coloso del norte] [Estados Unidos de America] Mejico Mexico
"Good" phrasal + WBW: deporte caza deporte juego diversión victima juguete programs cambio canje intercambio programs [Estados Unidos] [el coloso del norte] [Estados Unidos de America] Mejico Mexico

Table 3: Five query representations for SP30: original, BASE, MRD translation of BASE, MRD WBW + phrasal translation of BASE, MRD WBW + "good" phrasal translations of BASE.
`
             WBW      Phrasal   Good Phrasal
Avg          0.0244   0.0148    0.0610
% Change:             -39.3     150.3

Table 4: Average precision for WBW vs two different phrasal translations for query SP30.
`
These results suggest that well-translated phrases can greatly improve effectiveness, but that poorly translated phrases may negate the improvements. Translation accuracy may be more important for phrases than for terms.
`
`6 Local Context Analysis vs Local Feedback
`
In experiments similar to those from our earlier work, we translated queries automatically via MRD. Query expansion via LCA was performed either prior to or after translation in the following way. A query set is evaluated and the top ranked passages for each query are retrieved. Queries are then expanded by the addition of the top ranked concepts from the top passages. Recall that concepts may be single or multi-term.
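
The expansion loop just described (retrieve top passages, rank concepts, add the best ones) can be illustrated with a deliberately simplified stand-in. The scoring below, a product over query terms of a smoothed passage co-occurrence count, is an assumption chosen for brevity; it does not reproduce the LCA scoring function of [XC96], and the names `rank_concepts` and `delta` are hypothetical.

```python
def rank_concepts(query_terms, passages, delta=0.1, top_k=3):
    """Rank candidate concepts by co-occurrence with query terms.

    passages is a list of sets of concepts (single or multi-term strings)
    taken from the top retrieved passages. A concept scores highly only if
    it co-occurs with every query term, because the per-term factors are
    multiplied; delta keeps a missing term from zeroing the score.
    """
    candidates = {c for passage in passages for c in passage} - set(query_terms)
    scores = {}
    for concept in candidates:
        score = 1.0
        for term in query_terms:
            co = sum(1 for p in passages if concept in p and term in p)
            score *= delta + co
        scores[concept] = score
    # Highest score first; alphabetical order breaks ties.
    return sorted(scores, key=lambda c: (-scores[c], c))[:top_k]


top_passages = [
    {"mexico", "canada", "trade agreement", "salinas"},
    {"mexico", "trade agreement", "cuba"},
    {"canada", "trade agreement", "hockey"},
    {"mexico", "salinas"},
]
best = rank_concepts(["mexico", "canada"], top_passages, top_k=2)
print(best)
```

The multi-term concept "trade agreement" ranks first because it co-occurs with both query terms in two passages each, whereas "cuba" and "hockey" each co-occur with only one query term and are pushed down.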
`
6.1 Pre-translation
`
In this first set of experiments, we wanted to compare the effectiveness of query expansion prior to automatic translation via LCA to previous results using local feedback. Recall that the queries were manually translated into English, so the Spanish ISM database cannot be used for pre-translation expansion. We chose to use the SJMN database, described above, as a training corpus from which to choose English expansion concepts. Multi-term concepts are translated as phrases. In the event that no phrasal translation is found, phrases are translated WBW. Table 5 shows 4 representations of TREC query SP29. First is the original query, second is the manual translation (BASE) including automatically identified phrases, third is the LCA expanded query, and fourth is the automatic translation of the third. Parentheses surround LCA expansion phrases and phrases automatically identified in the BASE query. Brackets surround the translation of each term or phrase.
`
Original:      las relaciones económicas y comerciales entre Mexico y Canada
BASE:          the economic and (commercial relations) between mexico and canada
LCA expanded:  economic (commercial relations) mexico canada mexico (trade agreement) (trade zone) cuba salinas
Translation:   [económico equitativo] [comercio negocio trafico industria] [narración relato relación] [Mejico Mexico] Canada [Mejico Mexico] [convenio comercial] [comercio negocio trafico industria] zona cuba salinas

Table 5: Four query representations: original, BASE (with identified phrases), LCA expanded BASE, WBW + phrasal translation of LCA expanded BASE.
`
First, we look at the effects of LCA expansion without phrasal recognition in the base query and compare a straight WBW translation of all concepts with a combination of phrasal and WBW translation. We then combine phrasal recognition in BASE with LCA expansion followed by both WBW and phrasal translation. Translations of multi-term LCA concepts were wrapped in the INQUERY #passage25 and #phrase operators. For example, #passage25(#phrase(North American Free Trade Agreement)). Terms within a #phrase operator are evaluated to see whether they co-occur frequently in the collection. If they do, co-occurrences within 3 terms of each other are considered when calculating belief. If not, the terms are treated as having equal influence on the final result in order to allow for the possibility that individual occurrences are evidence of relevance. The #passage25 operator looks for the elements to occur within a window of 25. This operator ensures that terms which do not co-occur frequently are still found within a limited distance of each other.
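
As a small illustration of the wrapping step, the helper below builds the nested operator string for a multi-term concept. Only the two operator names and the nesting shown in the example above come from the text; the helper name `wrap_concept` and the choice to leave single terms unwrapped are assumptions, and nothing here exercises INQUERY itself.

```python
def wrap_concept(translation, window=25):
    """Wrap a multi-term translation in #passageN(#phrase(...)).

    Single-term translations are returned unchanged (an assumption; the
    text only describes wrapping for multi-term concepts).
    """
    terms = translation.split()
    if len(terms) < 2:
        return translation
    return "#passage{}(#phrase({}))".format(window, " ".join(terms))


wrapped = wrap_concept("North American Free Trade Agreement")
single = wrap_concept("cuba")
print(wrapped)
print(single)
```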
The best results for automatic translations to Spanish are shown in Table 6. Descriptions of query processing for rows 2-7 follow. Row 2 (MRD) is the automatic word-by-word translation of BASE (original TREC queries manually translated). For row 3, phrases were identified in the BASE queries and then WBW translation was augmented by phrasal translation (MRD + Phr). Row 4 shows results for pre-translation LCA expanded BASE queries translated word-by-word (MRD + LCA-WBW). Row 5 represents pre-translation LCA expanded BASE queries translated word-by-word with phrasal translation where possible (MRD + LCA-Phr). In row 6 (MRD + Phr + LCA-Phr), after phrase identification in BASE queries, they were expanded via LCA prior to translation. The expanded queries were then translated word-by-word with phrasal translation where possible. Finally, row 7 shows results for pre-translation local feedback expanded BASE queries after word-by-word translation (LF).
`
Method             Avg      %Change
MRD                0.0823
MRD+Phr            0.0826   0.3
MRD+LCA-WBW        0.0969   17.7
MRD+LCA-phr        0.1009   22.7
MRD+Phr+LCA-phr    0.1053   27.9
LF                 0.1099   33.5

Table 6: Average precision for pre-translation expansion results.
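
The %Change column in Table 6 is the relative difference from the MRD baseline. The short check below recomputes it from the published averages; the helper name `percent_change` is an assumption. Because the published averages are rounded to four decimals, the recomputed values can differ from the reported ones in the last digit, presumably because the reported figures were derived from unrounded scores.

```python
BASELINE = 0.0823  # MRD row of Table 6

# method -> (average precision, reported % change), copied from Table 6
REPORTED = {
    "MRD+Phr": (0.0826, 0.3),
    "MRD+LCA-WBW": (0.0969, 17.7),
    "MRD+LCA-phr": (0.1009, 22.7),
    "MRD+Phr+LCA-phr": (0.1053, 27.9),
    "LF": (0.1099, 33.5),
}


def percent_change(value, baseline=BASELINE):
    """Relative change of value over baseline, in percent."""
    return 100.0 * (value - baseline) / baseline


recomputed = {name: percent_change(avg) for name, (avg, _) in REPORTED.items()}
for name, (avg, reported) in REPORTED.items():
    print(f"{name:18s} reported {reported:5.1f}  recomputed {recomputed[name]:5.1f}")
```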
`
The best results were gained after adding the top 30 concepts from the top 20 documents. They show that LCA expansion is effective, but WBW translation of LCA concepts yields only a 17% increase. This is probably due to the ambiguity introduced through the loss of multi-term concepts. Further improvements are given when phrases are identified in the BASE queries and when multi-term concepts are translated as phrases. If multi-term concepts are translated as phrases, effectiveness goes up by 5% (from 17.7% to 22.7% over the MRD baseline). The addition of phrasal recognition in the BASE queries boosts effectiveness by an additional 5% (to 27.9%). These results show that the use of phrasal translation can indeed improve effectiveness.
Pre-translation LCA expansion results are still not as good as those for pre-translation local feedback. This is surprising since comparisons of local feedback and LCA in the mono-lingual environment [XC96] have shown LCA to be more robust for query expansion.

We hypothesized that although most phrases added by LCA appear to be good phrases, they may lose their effectiveness when taken as individual terms. This happens when a phrasal translation fails and we are forced to translate the phrase word-by-word. In addition, poor phrases will also tend to be ineffective when translated word-by-word. To test this, we performed LCA expansion returning only the best single-term concepts. Results in section 5 show that query effectiveness is highly sensitive to the accuracy of phrasal translation. Expansion by individual terms eliminates the negative effects of poor phrasal translations.
We found that our hypothesis is supported in some cases, but not consistently. Table 7 gives a few examples of LCA expansion with single- and multi-term concepts compared to expansion with only single-term concepts. In this table, each of the expansions was done using the top 20 passages and the top 5 or 30 concepts. Automatic translation is given as a baseline. We believe the inconsistency is related to the types of multi-term concepts that are included in the expansion and to translation accuracy.
`
Method           Avg prec   %Change
MRD              0.0823
LCA5-Phrasal     0.0819     -0.5
LCA5-Single      0.1051     27.7
LCA30-Phrasal    0.1053     27.9
LCA30-Single     0.1010     22.7

Table 7: Average precision for multi-term and single-term concept expansion.
`
`Table 8 shows the best pre-translation results for expan-
`sion via local feedback and for single-term expansion via
`LCA. This shows that LCA can be more effective than local
`feedback when used prior to translation, however the choice
`of expansion concepts is critical.
`
             MRD      LF       LCA10-Single
Avg prec     0.0823   0.1099   0.1139
% Change:             33.5     38.5
Precision:
 5 docs:     0.2000   0.2500   0.3100
10 docs:     0.2100   0.2300   0.2750
15 docs:     0.1867   0.2400   0.2600
20 docs:     0.1975   0.2375   0.2350

Table 8: Best pre-translation local feedback and single-term LCA expansion results.
`
6.2 Post-translation Expansion
`
In experiments where post-translation LCA expansion was performed, multi-term concepts were wrapped in INQUERY #PHRASE operators. The top ranked concept was added to a query with a weight of 1.0. Each additional concept was down-weighted by 1/100 with respect to the weight given its predecessor. This weighting scheme was shown to be effective in LCA experiments for the TREC5 evaluations [Har96]. Table 10 shows the best results for post-translation expansion via local feedback and LCA. In this table, local feedback expansion was done by addition of the top 20 terms from the top 50 documents. LCA expansion was done by addition of the top 100 concepts from the top 20 passages. Table 9 shows 2 representations of one of these queries. First is the BASE and second the automatic translation of BASE. The last row gives the top 20 expansion concepts that were added to this query, with multi-term concepts in parentheses. Note that all terms are stemmed.
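
The concept weighting described at the start of the preceding paragraph admits more than one reading. The sketch below takes the subtractive one, in which each concept receives 0.01 less weight than its predecessor; a multiplicative reading (each weight 1% smaller than the previous one) is equally consistent with the wording. The function name `concept_weights` and both parameter names are assumptions.

```python
def concept_weights(n_concepts, top_weight=1.0, step=0.01):
    """Weights for ranked expansion concepts under the subtractive reading.

    The first concept gets top_weight; each later concept gets `step` less
    than the one before it. Values are rounded to suppress float noise.
    """
    return [round(top_weight - rank * step, 4) for rank in range(n_concepts)]


weights = concept_weights(5)
print(weights)
```

Under this reading, a list of 100 expansion concepts ends with a weight of 0.01 for the last concept, so all added concepts keep a positive weight.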
`
BASE:              economic commercial relations mexico european countries
MRD translation:   comerc narr relat rei econom equit rentabl pai patri camp region tierr mej mex europ
Expansion:         (est un) canada pai europ franci (diversific comerc) mex polit pais alemani rentabl oportun product apoy australi (mere europ) agricultor bancarrot region (comun econom europ)

Table 9: Two query representations for TREC query SP26: BASE and MRD translation of BASE. Row 3 gives the top 20 post-translation LCA expansion concepts for this query.
`
             MRD      LF       LCA20
Avg prec     0.0823   0.0916   0.1022
% Change:             11.3     24.1
Precision:
 5 docs:     0.2000   0.1800   0.2200
10 docs:     0.2100   0.1850   0.2100
15 docs:     0.1867   0.1800   0.2167
20 docs:     0.1975   0.1575   0.2050

Table 10: Best post-translation local feedback and LCA expansion results.
`
The best post-translation LCA expansion is 11.6% more effective than the best post-translation local feedback expansion. Eleven of 20 queries do better with LCA as compared to 7 which do better with LF. A paired sign test shows this difference to be significant at p = .01. This supports earlier work by Xu which showed LCA to be a more effective query expansion technique than local feedback.
`
6.3 Combined Pre- and Post-translation Expansion
`
The combination experiments start with the pre-translation LCA expansion of the BASE queries. After the expanded queries are translated automatically, they are expanded again via LCA multi-term expansion. The base query set for the post-translation expansion phase in these experiments is the best pre-translation, single-term concept LCA expanded query set, as described in Section 6.1. Table 11 shows 4 representations of one of these queries. First is the original query, second is the manual translation (BASE) including automatically identified phrases, third is the pre-translation LCA single-term expanded query, and fourth is the automatic translation of the third. The last row gives the top 20 expansion concepts that were added to this query, with multi-term concepts in parentheses. Note that all terms are stemmed. Parentheses surround LCA expansion phrases and phrases automatically identified in the BASE query. Brackets surround the translation of each term or phrase.
`
Original:      las relaciones económicas y comerciales entre Mexico y Canada
BASE:          the economic and (commercial relations) between mexico and canada
LCA expanded:  economic (commercial relations) mexico canada mexico free-trade canada trade mexican salinas cuba pact economies barriers
Translation:   [económico equitativo] [comercio negocio trafico industria] [narración relato relación] [Mejico Mexico] Canada [Mejico Mexico] [convenio comercial] [comercio negocio trafico industria] zona cuba salinas
Expansion:     canada (libr comerci) trat ottaw dosm (acuerd paralel) norteamer (est un) (tres pais) import eu (vit econom) comerci (centr econom) (barrer comerc) (increment subit) superpot rel acuerd negoci

Table 11: Four query representations: original, BASE (with identified phrases), LCA expanded BASE, WBW + phrasal translation of LCA expanded BASE.
`
The combined approach is more effective than either pre- or post-translation LCA expansion alone. This was also shown to be the case for local feedback expansion. Table 12 gives results for automatic translation, the best combined pre- and post-translation local feedback expansion, and the best combined LCA expansion. In this experiment, queries were expanded by the top 50 terms from the top 20 passages in the post-translation LCA phase. Fourteen and eleven queries show improvement over MRD translation alone for LCA and LF, respectively. The LCA approach shows a 9% greater improvement than the local feedback approach, but this difference is not statistically significant. When the two methods are compared, 9 queries do better with LCA expansion as compared to 10 that do better with LF expansion. However, it is interesting to compare the effects of LCA and local feedback expansion on precision. The LCA expansion has higher precision at low recall levels. This is important in a CLIR environment. The user may not be proficient at reading a foreign language, so could not be expected to look through more than the top retrieved documents.

             MRD      LF       LCA20-50
Avg prec     0.0823   0.1242   0.1358
% Change:             51.0     65.0
Precision:
 5 docs:     0.2000   0.2600   0.3700
10 docs:     0.2100   0.2200   0.2850
15 docs:     0.1867   0.2000   0.2767
20 docs:     0.1975   0.2125   0.2600

Table 12: Best combined pre- and post-translation local feedback and LCA expansion results.

7 Conclusions and Future Work

Automatic dictionary translations are attractive because they are cost effective and easy to perform, resources are readily available, and performance is similar to that of other CLIR methods. Ambiguity from failure to translate phrases is largely responsible for the large drops in effectiveness below monolingual performance.
`Phrasal translation can greatly improve effectiveness; however,
`improvements are sensitive to the quality of the translations.
`The effect of one poor translation can counteract any improvement
`gained by the correct translation of several phrases and may
`cause additional drops in effectiveness. Certain types of
`multi-term concepts, such as proper noun phrases, are easily
`translated via MRD. However, dictionaries do not provide enough
`context for accurate phrasal translation in other cases.
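
The contrast between phrasal and word-by-word (WBW) dictionary translation can be illustrated with a small sketch. It assumes a tokenized query and two lookup tables; the function name `translate_query`, the longest-match strategy, and the toy English-to-Spanish entries are all hypothetical and are not taken from the system described in this paper.

```python
def translate_query(terms, phrase_dict, word_dict):
    """Translate a tokenized query, preferring phrase entries over WBW lookup."""
    translated = []
    i = 0
    while i < len(terms):
        # Try the longest multi-term phrase starting at position i first.
        for n in range(len(terms) - i, 1, -1):
            phrase = " ".join(terms[i:i + n])
            if phrase in phrase_dict:
                translated.extend(phrase_dict[phrase])
                i += n
                break
        else:
            # No phrase entry matched: fall back to word-by-word translation,
            # keeping every listed sense. Unknown words pass through unchanged.
            translated.extend(word_dict.get(terms[i], [terms[i]]))
            i += 1
    return translated


# Toy dictionaries, for illustration only.
phrase_dict = {"united nations": ["naciones unidas"]}
word_dict = {
    "united": ["unido", "unidos"],
    "nations": ["naciones"],
    "report": ["informe", "reporte"],
}

query = ["united", "nations", "report"]
print(translate_query(query, phrase_dict, word_dict))  # phrase kept as a unit
print(translate_query(query, {}, word_dict))           # pure WBW translation
```

With the phrase entry available, the proper noun phrase is translated as a single unit. Without it, each word contributes all of its senses separately, which is the source of the ambiguity discussed above.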
`Query expansion via local feedback and LCA can be used to
`significantly reduce the error associated with dictionary
`translation. LCA expansion gives higher precision at low recall
`levels, which is important in a CLIR environment. Table 13 shows
`the performance of each method as measured by average precision
`and percentage of monolingual performance. LCA, which typically
`expands queries with multi-term phrases, is more sensitive to
`translation effects when pre-translation expansion is performed.
`This is because phrases that must be translated WBW are not as
`effective when separated into individual terms. Pre-translation
`LCA expansion with single-term concepts can reduce this problem.
`Pre-translation LCA expansion with single terms is also more
`effective than pre-translation local feedback and improves both
`precision and recall. Post-translation LCA is more effective
`than post-translation local feedback and tends to improve
`precision. Combining pre- and post-translation expansion is most
`effective and improves precision and recall. It can reduce
`translation error by 45% over automatic translation, bringing
`CLIR performance up from 42% to 68% of monolingual performance.
`This is still well below a monolingual baseline, but improved
`phrasal translations should hel
