`INFORMATION RETRIEVAL
`
`edited by
`
`Gregory Grefenstette
`Xerox Research Centre Europe
`Grenoble, France
`
`KLUWER ACADEMIC PUBLISHERS
Boston / Dordrecht / London
`
`AOL Ex. 1020
`Page 1 of 12
`
`
`
Distributors for North America:
`Kluwer Academic Publishers
`101 Philip Drive
`Assinippi Park
`Norwell, Massachusetts 02061 USA
`
Distributors for all other countries:
`Kluwer Academic Publishers Group
`Distribution Centre
`Post Office Box 322
`3300 AH Dordrecht, THE NETHERLANDS
`
Library of Congress Cataloging-in-Publication Data
`
`A C.I.P. Catalogue record for this book is available
`from the Library of Congress.
`
`The publisher offers discounts on this book when ordered in bulk quantities. For
`more information contact:
`Sales Department, Kluwer Academic Publishers,
`101 Philip Drive, Assinippi Park, Norwell, MA 02061
`
`Copyright © 1998 by Kluwer Academic Publishers
`
All rights reserved. No part of this publication may be reproduced, stored in a
retrieval system or transmitted in any form or by any means, mechanical,
photo-copying, recording, or otherwise, without the prior written permission of
the publisher, Kluwer Academic Publishers, 101 Philip Drive, Assinippi Park,
Norwell, Massachusetts 02061
`
`Printed on acid-free paper.
`
`Printed in the United States of America
`
`4
`
`DISTRIBUTED CROSS-LINGUAL
`INFORMATION RETRIEVAL
`Christian Fluhr
`Dominique Schmit
`Philippe Ortet
`Faza Elkateb
`Karine Gurtner
`Khaled Radwan
`
`DIST/SMTI
`CEA-Saclay
91191 Gif/Yvette Cedex,
`France
`
`ABSTRACT
Many text databases in Europe contain English documents and documents in at
least one of the local languages. This is the case in the publications databases
of large organizations, of documents used and produced by European projects, of
library catalogues, and more recently, in information destined for the Internet.
For efficient and easy searching, it is necessary to ask a query in one language
(a user usually finds their mother tongue more flexible and usable) and to get
documents in their original language, providing full-text content or summaries
or keywords. Two problems have to be faced: (1) to be able to locate information
in a language other than the original query language, and (2) to be able to
manage databases containing documents (and even parts of documents) in several
different languages. A solution to the first problem has been addressed by the
EMIR (European Multilingual Information Retrieval) ESPRIT project which produced
a system that enables a user to pose a French, English or German query to access
documents in any of a French, English or German database. The second problem can
be approached by splitting the multilingual database into as many databases as
languages. In this approach, the problem of multilinguality can be solved in the
same way as above if the problem of multibase (even distributed) interrogation
can be solved.
`
`
[Figure: diagram linking "Text introduced into the database" and "Research topic
described in natural language" to "Documents ranked in order of relevance"]

Figure 1 The EMIR Approach to Cross Language Information Retrieval.
`
`1 THE EMIR PROJECT APPROACH
`
EMIR (European Multilingual Information Retrieval) [RFF91] [EMI94] is based
on the SPIRIT system [DFR89], which is a text information retrieval system
that can be interrogated in natural language. It can manage homogeneous
databases in French, English or German. Russian has recently been added
[ASE+96]. SPIRIT uses morphosyntactic linguistic processing to recognize and
normalize the words found in the documents. Compounds are recognized, as are
domain-independent synonyms. Some homographs are resolved, as well. SPIRIT
also uses statistical processing to weight intersections between queries
and documents.
`
The linguistic processing is composed of several steps: splitting the character
string into candidate words using a finite state automaton; morphological
analysis using a full-form dictionary; idiomatic expression recognition that can
identify expressions having several inflected forms and non-contiguous elements
(ex: he switches the light off = to switch off); part-of-speech disambiguation
that uses local syntactic knowledge learned from a corpus; syntactic parsing
that provides dependency relations; normalization that turns the recognized
words into a normalized form (lemmatized) and that eliminates stop words
according to their part of speech. In SPIRIT all these steps use linguistic
knowledge separated from the programming code. This is absolutely necessary
in an adaptive multilingual setting, the system switching from one language
to another simply by loading the appropriate rules.
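
The separation between program code and linguistic knowledge can be illustrated
with a minimal Python sketch. Everything in it is an assumption made for
illustration: the `LanguageRules` container, the toy full-form dictionaries and
the stop-word tag sets are hypothetical, and only three of the steps (splitting,
full-form lookup, normalization with stop-word removal by part of speech) are
imitated. It does not describe SPIRIT's actual implementation.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class LanguageRules:
    """Hypothetical container for the linguistic data of one language."""
    full_forms: dict          # inflected form -> (lemma, part of speech)
    stop_tags: frozenset      # parts of speech dropped during normalization


# Toy resources; real full-form dictionaries are far larger.
ENGLISH = LanguageRules(
    full_forms={
        "he": ("he", "PRON"),
        "switches": ("switch", "V"),
        "the": ("the", "DET"),
        "lights": ("light", "N"),
    },
    stop_tags=frozenset({"PRON", "DET"}),
)
FRENCH = LanguageRules(
    full_forms={
        "les": ("le", "DET"),
        "dechets": ("dechet", "N"),
        "radioactifs": ("radioactif", "J"),
    },
    stop_tags=frozenset({"DET"}),
)


def normalize(text, rules):
    """Split text, look each form up, and keep the lemmas of content words."""
    lemmas = []
    for form in re.findall(r"\w+", text.lower()):
        # Unknown forms are kept unchanged and tagged "?" (an assumption).
        lemma, tag = rules.full_forms.get(form, (form, "?"))
        if tag not in rules.stop_tags:
            lemmas.append(lemma)
    return lemmas


# The same code serves both languages; only the loaded rules differ.
english_lemmas = normalize("He switches the lights", ENGLISH)
french_lemmas = normalize("Les dechets radioactifs", FRENCH)
```

Switching language here means passing a different `LanguageRules` value; the
`normalize` function itself contains no language-specific logic.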
`
Once words have been identified and normalized by the linguistic processing, a
statistical model is applied. The role of this statistical model is to attribute
a weight to each word according to the information the word provides in choosing
the documents relevant to a query. Roughly speaking, the weight is maximum for
a word appearing in one single document and minimum for words appearing in
all the documents. According to Shannon's theory of information, the rarest
words bring more information than the more widely used words. This weight
is used to compare the intersections between query and documents containing
different words. The weight is database dependent. Take, for example, a query
about VAT on alcohol. If the database is about tax law, the system should
return documents containing alcohol before documents talking about the value
added tax (VAT). The reverse would be true in a database about food products.
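
A database-dependent weight of this kind can be sketched as follows. The chapter
does not give SPIRIT's formula, so the `weight` function below is an assumption:
it uses log(N / df), chosen only because it is zero for a word present in every
document and largest for a word found in a single document. The two toy
databases are also invented for the illustration.

```python
import math


def weight(word, documents):
    """Assumed weight: log(N / df), where df counts documents containing word."""
    df = sum(1 for doc in documents if word in doc)
    if df == 0:
        return 0.0  # the word is absent from this database and cannot discriminate
    return math.log(len(documents) / df)


# Toy databases: each document is reduced to its set of normalized words.
tax_law = [
    {"vat", "rate"},
    {"vat", "alcohol"},
    {"vat", "exemption"},
    {"vat", "income"},
]
food_products = [
    {"alcohol", "wine"},
    {"alcohol", "vat"},
    {"alcohol", "beer"},
    {"alcohol", "cider"},
]

for name, db in (("tax law", tax_law), ("food products", food_products)):
    print(name, {w: round(weight(w, db), 2) for w in ("vat", "alcohol")})
```

With these toy data "alcohol" outweighs "vat" in the tax law database and the
order is reversed in the food products database, which is the behaviour the
VAT example describes.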
`
The final text processing tool is a reformulation program that can infer new
words [Flu90a] from the original query words according to a lexical semantic
knowledge base. The reformulation tool can be used to increase the quality of
the retrieval in a monolingual interrogation. It can also be used to infer words
in another language [Flu90b]. This latter use of reformulation was widely tested
in EMIR experiments [RFF91]. The boundary condition that EMIR tried to respect
was that the inference rules for reformulation must not be domain dependent.
That meant using a general language dictionary so that for each word all
possible translations were proposed.
`
Of course, very specific translations can be added for a given domain; these
can be added to the general ones if necessary. This attitude is opposed to
that found in most commercial translation programs that give preference to
the specific domain translations, when these are present.
`
Reformulation rules can be applied to all instances of a word or to a word
only when it is playing a specific part-of-speech. Semantic relations can also
be selected: translation, synonyms, words derived from the same root, etc. The
general reformulation rules at present contain from 30 000 to 50 000 entries.
There is a monolingual reformulation set for each language (French, English,
German, Russian) and the following bilingual sets: French-English,
English-French, German-French. Rules for the specific domain of Energy are being
implemented to improve access to catalogs for databases managed by the CEA
(French Atomic Energy Agency). Figure 2 shows rules for bilingual reformulation
from French to English. Figure 3 shows an example of monolingual
reformulation rules.
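
How such rules might be stored and applied can be sketched as follows. The
`Rule` record and the `reformulate` function are hypothetical names, and the
rule table copies only a handful of adjective and verb targets from Figures 2
and 3. The sketch shows the two selection mechanisms described above:
restriction to one part of speech and selection of semantic relations.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    """Hypothetical reformulation rule: one entry, one part of speech, one relation."""
    entry: str
    pos: str        # J=adjective, N=noun, V=verb
    relation: str   # T=transfer (translation), S=synonyms
    targets: tuple


# A few entries copied from Figures 2 and 3; the real sets hold tens of thousands.
RULES = [
    Rule("stock", "J", "T", ("classique", "banal")),
    Rule("stock", "V", "T", ("approvisionner", "peupler", "empoissonner")),
    Rule("sterile", "J", "S", ("infertile", "bland", "barren")),
]


def reformulate(word, pos=None, relations=("T", "S")):
    """Return every target inferred from word, optionally restricted by POS."""
    inferred = []
    for rule in RULES:
        if rule.entry != word or rule.relation not in relations:
            continue
        if pos is not None and rule.pos != pos:
            continue
        inferred.extend(rule.targets)
    return inferred


all_senses = reformulate("stock")            # every part of speech is tried
verb_only = reformulate("stock", pos="V")    # the parser tagged the word as a verb
synonyms = reformulate("sterile", relations=("S",))
```

When no part of speech is supplied, all alternatives are returned, which matches
the general-dictionary policy of proposing every possible translation.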
`
Evaluations performed during the EMIR project have shown that using current
translation systems to translate the query gives less relevant results than
using
`
entry   POS#sem   reformulations
stock   J#T       classique_J; banal_J;
        S#T       billot_N; fut_N; bois_N; manche_N; gaule_N;
                  jas_N; cheptel_N; betail_N; materiel roulant_N;
                  matiere premiere_N;
                  talon_N; reserve_N; provision_N;
                  stock_N; reserve_N; souche_N;
                  lignee_N; famille_N;
                  tronc_N; souche_N; porte-greffe_N; giroflee_N; matthiole_N;
                  valeur_N; titre_N; fonds public_N; fonds d'Etat_N;
                  action_N; palanque_N; palissade_N; stockade_N;
        V#T       approvisionner_V; peupler_V; empoissonner_V;
                  rendre_V; palanquer_V;

Figure 2 Bilingual French-English reformulation rules for the English word
"stock." Part-of-Speech Tags: J=adjective, N=noun, V=verb. Semantic Tags:
T=transfer rule, S=Synonyms. ";" is the separator between meanings.
`
entry     POS#sem   reformulations
sterile   J#S       infertile_J; bland_J; barren_J; antiseptic_J;
                    sterility_N; sterilization_N; sterilize_V;
                    soulless_J; sterilizer_N; unproductive_J;

Figure 3 Monolingual English-English reformulation for the word "sterile."
Part-of-Speech Tags: J=adjective, N=noun, V=verb. Semantic Tags:
T=transfer rule, S=Synonyms. ";" is the separator between meanings.
`
multilingual reformulation, see Figure 4. The reason for this is that in the
case of query translation only one solution is proposed. If there is a word that
has a wrong translation, the result of the interrogation is unpredictable. In
the case of multilingual reformulation, all possible solutions are tried. It is
the full-text database itself that is used as a semantic filter to give the
relevant documents. As a supplement, the text itself finds the right
translations [FR93, RF95].
`
These results have been obtained on the Cranfield¹ information retrieval
testbed, containing documents and queries in the aerospace domain. French
versions of the queries were created by specialists in the domain. The SYSTRAN
machine translation system [GLY98] was used to translate these queries back into
English in order to compare results. SYSTRAN used its specific aerospace
transfer dictionary, in which words were searched before falling back to a
general translation dictionary. Evaluation was performed using the same tools as
those used in the TREC full-text information retrieval competition.
`
¹ ftp://ftp.cs.cornell/pub/smart/cran

[Figure: precision plotted against recall (10 to 90) for three curves labelled
EMIR, SYSTRAN and SPIRIT anglais, under the title "Comparison of a monolingual
interrogation by SPIRIT, a bilingual interrogation by EMIR and a translation of
the query using SYSTRAN"]

Figure 4 Machine translation of queries performs less well than retaining
multiple translations of query terms.

When EMIR treated the queries, many translation alternatives could be
eliminated since they do not occur in the database. For the remaining
alternatives, we found that it was possible to filter some out because relevant
documents often contained many of the translated query concepts. In the most
relevant documents, there was often at least one translation of each word from
the query. It seemed that, in this case, the translations found in the most
relevant documents are the right translations. We believe that the cooccurrence
(or dependency relations) of translations of each query word is sufficient in
most cases both to find the relevant document and to give the right translation.
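
The two filtering steps can be sketched in Python. This is an illustrative
sketch under stated assumptions, not EMIR's code: documents are reduced to sets
of normalized words, the "best" document is taken to be the one covering the
most query words with at least one translation, and the function names and toy
data are invented for the example. Dependency relations are not modelled.

```python
def filter_by_lexicon(candidates, lexicon):
    """Drop translation alternatives that never occur in the database."""
    return {src: [t for t in alts if t in lexicon] for src, alts in candidates.items()}


def best_document(candidates, documents):
    """Pick the document covering the most query words with some translation.

    Returns (document id, {query word: translations found in that document}).
    """
    best_id, best_found = None, {}
    for doc_id, words in documents.items():
        found = {}
        for src, alts in candidates.items():
            hits = [t for t in alts if t in words]
            if hits:
                found[src] = hits
        if len(found) > len(best_found):
            best_id, best_found = doc_id, found
    return best_id, best_found


# Toy data (assumed): two French query words and three English documents.
candidates = {
    "effet": ["effect", "result", "working", "break"],
    "choc": ["shock", "impact", "bump"],
}
documents = {
    "d1": {"shock", "effect", "wing"},
    "d2": {"result", "pressure"},
    "d3": {"impact", "test"},
}
lexicon = set().union(*documents.values())

kept = filter_by_lexicon(candidates, lexicon)
doc_id, chosen = best_document(kept, documents)
print(kept)
print(doc_id, chosen)
```

In this toy run the lexicon removes "working", "break" and "bump", and the only
document containing a translation of both query words selects "effect" and
"shock" as the translations to retain.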
`
Of course the problem of multiword terms is difficult. There are idiomatic
expressions that are translated globally. Some compounds must also be translated
globally; others can be translated word for word but a transformation
must be applied to restore the right word order. The most difficult problem is
the recognition of split derivable idiomatic expressions that must be translated
globally, such as verbs with postposition in English (ex: take off), verbs with
a particle in German (ex: abdecken which can appear as decken ... ab) or verbal
idiomatic expressions in French (ex: prendre part = to
`
Here is an example of an interrogation on the database in the aerospace domain.
The French query is Effet de choc, and the documents in the database are
in English. The morphosyntactic parser gives one part of speech for each word
and specifies compounds (terms linked with syntactic dependency relations).
Bilingual reformulation gives the following results according to the results of
the morphosyntactic parsing:
`
effet   effect, result, action, operation, working, spin, break, impression
choc    shock, impact, bump, collision
`
After filtering by the database lexicon, the remaining translation ambiguities
are as follows:

effet   effect, result, action, operation, spin, impression
choc    shock, impact, bump, collision
`
After filtering by the database lexicon of known compounds processed by
word-for-word translation and transformation of the word order, the following
translation alternatives are retained:

effet (de) choc   effect (of) shock, result (of) shock, shock result,
                  shock operation
`
In this case good translations are obtained. Even in more complicated situations
and with longer queries, if there is a document that has the cooccurrence of at
least one translation of each concept of the query, the obtained translations
are generally the right ones and the document is relevant to the query. Figure 5
gives an example of filtering query translations using the database and best
document, over a library catalog in the nuclear domain.
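
The compound filtering step of the Effet de choc example can be imitated with a
short sketch. It is an assumption about how the step could be implemented, not
a description of EMIR: every pairing of the remaining word translations is
generated in two shapes, "head of modifier" and the reordered "modifier head",
and only candidates present in a compound lexicon are kept. The contents of
`compound_lexicon` are hypothetical and were chosen so that the retained
alternatives match those listed above.

```python
from itertools import product


def compound_candidates(head_alts, modifier_alts):
    """Word-for-word translations of 'head de modifier' in both English orders."""
    for head, modifier in product(head_alts, modifier_alts):
        yield f"{head} of {modifier}"   # order kept: "effect of shock"
        yield f"{modifier} {head}"      # order transformed: "shock effect"


def filter_compounds(head_alts, modifier_alts, compound_lexicon):
    """Keep only the generated candidates known as compounds in the database."""
    return [c for c in compound_candidates(head_alts, modifier_alts)
            if c in compound_lexicon]


effet = ["effect", "result", "action", "operation", "spin", "impression"]
choc = ["shock", "impact", "bump", "collision"]

# Hypothetical compound lexicon extracted from the indexed documents.
compound_lexicon = {"effect of shock", "result of shock", "shock result",
                    "shock operation", "boundary layer"}

retained = filter_compounds(effet, choc, compound_lexicon)
n_generated = sum(1 for _ in compound_candidates(effet, choc))
```

Six translations of effet and four of choc already produce 48 candidate
compounds in this sketch, of which four survive the lexicon filter.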
`
`2 THE DISTRIBUTED MULTILINGUAL
`CLIENT-SERVER ARCHITECTURE
`
To give access to SPIRIT databases from standard Internet clients (such as
Netscape or MS Explorer), an interface between a WWW server and a SPIRIT
server has been developed. Users are faced with two kinds of problems:
1. They would like to consider a set of databases as one whole logical database
in order to have a better coverage in the search.

2. The databases could be in different languages and even in mixed languages.
`
[Figure: flow of the query words traitement, dechets and radioactifs through
four stages: Transfer, Filtering by Database, Monolingual Reformulation, and
Filtering by Best Document. Candidates shown include treating, processing,
treatment and salary for traitement; decrease, diminution, waste, refuse and
failure for dechets; radioactive and radioactivity for radioactifs.]

Figure 5 Filtering the translations of the French query "Traitement des
dechets radioactifs" over an English library catalog in the nuclear domain.
`
By mixed language we do not mean that we manage documents whose text
contains several languages; applications of this kind are rare. We mean
that the information linked to the documents can be in more than one language.
For example, a catalog can contain for each document a title in French and
in English, a summary in French and in English and the text only in English.
The problem of the identification of the language [Gre95] has not been treated
because it is supposed that at the moment the document is introduced the
information in different languages is put in different fields.
`
The solution that is under development, see Figure 6, is to add a new layer that
enables the user (or database manager) to define a logical database composed of
a set of existing databases whatever the location of the database in the world.
If this approach is followed, the problem of accessing databases that contain
documents in different languages or documents that have parts in various
languages can be solved by splitting such databases into as many bases as
languages according to the structure description that gives the language for
each field. After this operation only monolingual databases have to be queried.
`
[Figure: architecture diagram; the recoverable labels are "Interface SPIRIT-W3"
and "WWW Server"]

Figure 6 General architecture of the distributed multilingual WEB text
retrieval system
`
`The definition of the logical database to be queried must describe not a set of
`databases but a set of clusters of databases. Each cluster is composed of the
`various language-specific fields of the same original database.
`
The problem of merging results from the various physical databases united
in a logical one is rather complicated. In the case of merging information
from databases containing different document sets, the problem of eliminating
duplicates can be ignored. If this is not the case, a duplicate elimination
procedure must be undertaken. The problem of merging word weights from the
various origins can be simple if one adopts the hypothesis that there will be
few duplicates.
`
In the case of merging information from the databases representing the
different language translations of the same original database, there will be
many duplicates in the query answer, especially in the case of documents which
contain several languages in the same document. The computation of the word
weights is rather complex because the frequencies in each language can be very
different depending on the distribution of the languages in the database. A good
solution to this problem, if possible, is to identify the concepts and their
different representations in all of the languages considered and to recompute a
weight for this pivot concept on the basis of a fictive database made of only
the documents retrieved from the queried databases. This solution gives results
acceptable from the user's point of view.
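
The pivot-concept idea can be sketched as follows. The concept table, the toy
result sets and the weight formula log(N / df) are assumptions; the chapter does
not specify how the weight is recomputed, only that it is computed over the
fictive database of retrieved documents.

```python
import math

# Pivot concepts and their representations in each language (toy data).
CONCEPTS = {
    "WASTE": {"dechet", "waste"},
    "RADIOACTIVE": {"radioactif", "radioactive", "radioactivity"},
}


def to_concepts(words):
    """Map language-specific words onto the pivot concepts they express."""
    return {c for c, forms in CONCEPTS.items() if forms & set(words)}


def fictive_database(*retrieved):
    """Union the retrieved documents of all databases, merging duplicates by id."""
    merged = {}
    for results in retrieved:
        for doc_id, words in results.items():
            merged.setdefault(doc_id, set()).update(to_concepts(words))
    return merged


def concept_weights(merged):
    """Assumed weight log(N / df), computed only over the retrieved documents."""
    n = len(merged)
    weights = {}
    for concept in CONCEPTS:
        df = sum(1 for concepts in merged.values() if concept in concepts)
        weights[concept] = math.log(n / df) if df else 0.0
    return weights


french_hits = {1: ["dechet", "radioactif"], 2: ["radioactif"]}
english_hits = {1: ["waste", "radioactive"], 3: ["radioactivity"], 4: ["waste"]}

merged = fictive_database(french_hits, english_hits)
weights = concept_weights(merged)
```

Document 1 is retrieved from both databases but counted once, so the weights no
longer depend on how the two languages are distributed in the original
databases.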
`
3 EXPERIMENTATION ON A LIBRARY CATALOG
`
The catalog of our libraries is built by downloading OCLC records that are
updated to add local information. In this catalog, titles of documents are
mainly in French and in English. The same catalog entry can have a list of
keywords in English and another in French but they are not the direct
translation of each other. This database is really a mixed language database in
the sense that in the same entry there can be more than one language. For this
reason, we adopted this database to experiment with the architecture described
in the previous section.
`
Here is an example of a multilingual interrogation of this database illustrating
the functioning of the system. The test database is a sample of 1500 documents.
Titles are either in English or French but each language is stored in a
different field, and can thus be automatically identified. Keywords are in one
or both languages, each language separated, again, in a different keyword field.
The multilingual database was split into two monolingual databases. The same
document can have an English part in one database and a French part in the
second database. The French query is sent to the French database with only a
monolingual reformulation and to the English database with a cross-lingual
reformulation followed by a monolingual target language reformulation.
`
The results of a SPIRIT interrogation are presented as a sequence of document
classes ordered by decreasing relevance. Each class is characterized by the
same intersection of concepts with the original query. A document is present
in the best class it can go into, best being understood in terms of relevance.
An example of an actual query (we found that our users prefer to query in their
mother tongue, French) is traitement des dechets radioactifs. Here is the
result of posing this query on the French part of the database. Documents are
named by their internal number.
`
Result over French part of databases
Class    match                          documents
first    dechets radioactifs compound   1215
second   traitement and radioactifs     1192, 1216
third    radioactif                     950, 951, 952, 953, 1397, 1442

Result over English part of databases
Class    match                          documents
first    waste treatment compound       1215
second   decrease from dechets          339 eliminated by Best Doc. filtering
third    radioactive, radioactivity     42, 950, 951, 953, 1397, 1442

Merging of Results: calculating new classes, new weights
Class    match                          documents
first    dechets radioactifs compound   1215
second   traitement and radioactifs     1192, 1216
third    radioactif                     42, 950, 951, 952, 953, 1397, 1442
`
A side result that can be very important in other applications, like machine
translation, is that this comparison process has been able to choose the right
English translation "waste treatment."
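
The merging of the two result lists can be imitated with a small sketch. It
rests on a simplifying assumption that is not in the source: classes from the
two databases are aligned by their rank, and each document is kept in the best
class it reaches in either database. The system described here recomputes
classes and weights instead, so the `merge_classes` function below only
reproduces the outcome for this particular example.

```python
def merge_classes(*rankings):
    """Merge per-database rankings; each document keeps its best (lowest) class.

    A ranking maps a class rank (1 = most relevant) to a list of document ids.
    """
    best = {}
    for ranking in rankings:
        for rank, doc_ids in ranking.items():
            for doc_id in doc_ids:
                best[doc_id] = min(rank, best.get(doc_id, rank))
    merged = {}
    for doc_id, rank in best.items():
        merged.setdefault(rank, []).append(doc_id)
    return {rank: sorted(ids) for rank, ids in sorted(merged.items())}


# Document numbers from the tables above; document 339 is left out because
# best-document filtering eliminated it from the English results.
french = {1: [1215], 2: [1192, 1216], 3: [950, 951, 952, 953, 1397, 1442]}
english = {1: [1215], 3: [42, 950, 951, 953, 1397, 1442]}

merged = merge_classes(french, english)
```

Document 1215 appears in the first class of both databases and is listed once;
document 42, found only through the English part, joins the third class.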
`
`4 CONCLUSION
`
The first conclusion is that this approach seems to give readily exploitable
results for the problem of the interrogation of mixed language databases. Of
course, some problems remain. The quality of the linguistic processing and of
the monolingual and transfer rules is critical for the quality of the answers.
This forced us to begin strong quality assurance of this linguistic data.
`
Another point is that the addition of compounds specific to the domain has to
be helped by automatic processing. It is especially important that the
reformulation cycle know compounds that cannot be translated word for word.
Translating frequent compounds in the domain globally would, in addition, save
CPU resources by avoiding the current generation of all possible combinations.
`
A final remark is that, with this kind of information retrieval based on
multilingual reformulation, a translation memory would no longer need a
bilingual corpus [LL90] but only a large collection of texts in the target
language.
`
`
`