CROSS-LANGUAGE
`INFORMATION RETRIEVAL
`
`edited by
`
`Gregory Grefenstette
`Xerox Research Centre Europe
`Grenoble, France
`
`KLUWER ACADEMIC PUBLISHERS
`Boston / Dordrecht
`/ London
`
`AOL Ex. 1020
`Page 1 of 12
Distributors for North America:
`Kluwer Academic Publishers
`101 Philip Drive
`Assinippi Park
`Norwell, Massachusetts 02061 USA
`
Distributors for all other countries:
`Kluwer Academic Publishers Group
`Distribution Centre
`Post Office Box 322
`3300 AH Dordrecht, THE NETHERLANDS
`
Library of Congress Cataloging-in-Publication Data
`
`A C.I.P. Catalogue record for this book is available
`from the Library of Congress.
`
`The publisher offers discounts on this book when ordered in bulk quantities. For
`more information contact:
`Sales Department, Kluwer Academic Publishers,
`101 Philip Drive, Assinippi Park, Norwell, MA 02061
`
`Copyright © 1998 by Kluwer Academic Publishers
`
All rights reserved. No part of this publication may be reproduced, stored in a
retrieval system or transmitted in any form or by any means, mechanical,
photocopying, recording, or otherwise, without the prior written permission of
the publisher, Kluwer Academic Publishers, 101 Philip Drive, Assinippi Park,
Norwell, Massachusetts 02061
`
`Printed on acid-free paper.
`
`Printed in the United States of America
`
`4
`
`DISTRIBUTED CROSS-LINGUAL
`INFORMATION RETRIEVAL
`Christian Fluhr
`Dominique Schmit
`Philippe Ortet
`Faza Elkateb
`Karine Gurtner
`Khaled Radwan
`
`DIST/SMTI
`CEA-Saclay
91191 Gif/Yvette Cedex,
`France
`
`ABSTRACT
Many text databases in Europe contain English documents and documents in at least
one of the local languages. This is the case in the publications databases of large
organizations, of documents used and produced by European projects, of library
catalogues, and more recently, in information destined for the Internet. For
efficient and easy searching, it is necessary to ask a query in one language (a
user usually finds their mother tongue more flexible and usable) and to get
documents in their original language, providing full-text content or summaries or
keywords. Two problems have to be faced: (1) to be able to locate information in a
language other than the original query language, and (2) to be able to manage
databases containing documents (and even parts of documents) in several different
languages. A solution to the first problem has been addressed by the EMIR
(European Multilingual Information Retrieval) ESPRIT project which produced a
system that enables a user to pose a French, English or German query to access
documents in any of a French, English or German database. The second problem can
be approached by splitting the multilingual database into as many databases as
languages. In this approach, the problem of multilinguality can be solved in the
same way as above if the problem of multibase (even distributed) interrogation can
be solved.
`
[Figure 1: diagram. Recoverable labels: "Text introduced into the database";
"research topic described in natural language"; "Documents ranked in order of
Relevance".]

Figure 1 The EMIR Approach to Cross Language Information Retrieval.
`
`1 THE EMIR PROJECT APPROACH
`
EMIR (European Multilingual Information Retrieval) [RFF91] [EMI94] is based
on the SPIRIT system [DFRS9], which is a text information retrieval system
that can be interrogated in natural language. It can manage homogeneous
databases in French, English or German. Russian has been recently added
[ASE+96]. SPIRIT uses morphosyntactic linguistic processing to recognize and
normalize the words found in the documents. Compounds are recognized, as
are domain-independent synonyms. Some homographs are resolved, as well.
SPIRIT also uses statistical processing to weight intersections between queries
and documents.
`
The linguistic processing is composed of several steps: splitting the character
string into candidate words using a finite state automaton; morphological
analysis using a full-form dictionary; idiomatic expression recognition that can
identify expressions having several inflectional forms and non-contiguous
elements (ex: he switches the light off = to switch off); part-of-speech
disambiguation that uses local syntactic knowledge learned from a corpus;
syntactic parsing that provides dependency relations; normalization that turns
the recognized words into a normalized (lemmatized) form and that eliminates
stop words according to their part of speech. In SPIRIT all these steps use
linguistic knowledge separated from the programming code. This is absolutely
necessary in an adaptive multilingual setting, the system switching from one
language to another simply by loading the appropriate rules.
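
The separation of linguistic knowledge from code can be illustrated with a
minimal Python sketch. It covers only word splitting, full-form lookup and
stop-word elimination by part of speech; idiom recognition, disambiguation and
parsing are left out. The names `LanguageRules` and `normalize`, the tag names
and the toy dictionaries are assumptions made for illustration, not SPIRIT
resources.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class LanguageRules:
    """Linguistic knowledge for one language, kept apart from the code."""
    full_forms: dict      # inflected form -> (lemma, part of speech)
    stop_pos: frozenset   # parts of speech treated as stop words


# Toy rule sets (hypothetical data, not SPIRIT dictionaries).
ENGLISH = LanguageRules(
    full_forms={"the": ("the", "DET"), "switches": ("switch", "V"),
                "lights": ("light", "N"), "off": ("off", "PREP")},
    stop_pos=frozenset({"DET", "PREP"}),
)
FRENCH = LanguageRules(
    full_forms={"les": ("le", "DET"), "effets": ("effet", "N"),
                "de": ("de", "PREP"), "choc": ("choc", "N")},
    stop_pos=frozenset({"DET", "PREP"}),
)


def normalize(text, rules):
    """Split text into words, lemmatize them, drop stop words by POS."""
    lemmas = []
    for word in re.findall(r"\w+", text.lower()):
        # Words missing from the dictionary are kept unchanged.
        lemma, pos = rules.full_forms.get(word, (word, "UNKNOWN"))
        if pos not in rules.stop_pos:
            lemmas.append(lemma)
    return lemmas
```

The same `normalize` function serves both languages; only the rule object
passed to it changes, which is the property the paragraph above insists on.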
`
Once words have been identified and normalized by the linguistic processing, a
statistical model is applied. The role of this statistical model is to attribute a
weight to each word according to the information the word provides in choosing
the documents relevant to a query. Roughly speaking, the weight is maximum for
a word appearing in one single document and minimum for words appearing in
all the documents. According to Shannon's theory of information, the rarest
words bring more information than the more widely used words. This weight
is used to compare the intersections between the query and documents containing
different words. The weight is database dependent. Take, for example, a query
about VAT on alcohol. If the database is about tax law, the system should
return documents containing alcohol before documents talking about the value
added tax (VAT). The reverse would be true in a database about food products.
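
The chapter does not give the weighting formula. As a rough illustration, the
sketch below uses log(N / df), the logarithm of the number of documents divided
by the number of documents containing the word. This is an assumed stand-in: it
is zero for a word present in every document and largest for a word present in
a single one. The two toy databases are invented to mirror the VAT example.

```python
import math


def word_weights(documents):
    """Map each word to log(N / df), df = number of documents containing it."""
    n_docs = len(documents)
    doc_freq = {}
    for words in documents:
        for word in set(words):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


# Two toy databases (hypothetical contents) for the query "VAT alcohol".
tax_law = [["vat", "rate"], ["vat", "alcohol"],
           ["vat", "exemption"], ["vat", "import"]]
food = [["alcohol", "wine"], ["alcohol", "vat"],
        ["alcohol", "beer"], ["alcohol", "cider"]]

tax_weights = word_weights(tax_law)
food_weights = word_weights(food)
```

In the tax law database "vat" occurs everywhere and carries no weight, so
"alcohol" decides the ranking; in the food database the roles are reversed.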
`
The final text processing tool is a reformulation program that can infer new
words [Flu90a] from the original query words according to a lexical semantic
knowledge base. The reformulation tool can be used to increase the quality of
the retrieval in a monolingual interrogation. It can also be used to infer words
in another language [Flu90b]. This latter use of reformulation was widely tested
in EMIR experiments [RFF91]. The limit conditions of the problem that EMIR
tried to address were that the inference rules for reformulation must not be
domain dependent. That meant using a general language dictionary so that for
each word all possible translations were proposed.
`
Of course, very specific translations can be added for a given domain; these
can be added to the general ones if necessary. This attitude is opposed to
that found in most commercial translation programs, which give preference to
the specific domain translations when these are present.
`
Reformulation rules can be applied to all instances of a word or to a word
only when it is playing a specific part-of-speech. Semantic relations can also
be selected: translation, synonyms, words derived from the same root, etc. The
general reformulation rules at present contain from 30 000 to 50 000 entries.
There is a monolingual reformulation set for each language (French, English,
German, Russian) and the following bilingual sets: French-English,
English-French, German-French. Rules for the specific domain of Energy are being
implemented to improve access to catalogs for databases managed by the CEA
(French Atomic Energy Agency). Figure 2 shows rules for bilingual reformulation
from French to English. Figure 3 shows an example of monolingual
reformulation rules.
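
One possible in-memory form for such rules is sketched below in Python. The
`Rule` record and the `reformulate` function are hypothetical names, and the
three rules are abridged from the entries for "stock" and "sterile" shown in
the chapter's figures; the actual SPIRIT rule format is not described in the
text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    entry: str        # source word (normalized form)
    pos: str          # part of speech of the entry: J, N or V
    relation: str     # semantic tag: T (transfer) or S (synonyms)
    targets: tuple    # inferred (word, part of speech) pairs


# A few abridged, illustrative rules.
RULES = [
    Rule("stock", "J", "T", (("classique", "J"), ("banal", "J"))),
    Rule("stock", "V", "T", (("approvisionner", "V"), ("peupler", "V"))),
    Rule("sterile", "J", "S", (("infertile", "J"), ("barren", "J"),
                               ("sterility", "N"), ("sterilize", "V"))),
]


def reformulate(word, pos=None, relations=("T", "S"), rules=RULES):
    """Return the words inferred from `word`.

    With pos=None the rules apply to every instance of the word; otherwise
    only rules whose entry has that part of speech are used. `relations`
    selects which semantic relations may fire.
    """
    inferred = []
    for rule in rules:
        if rule.entry != word or rule.relation not in relations:
            continue
        if pos is not None and rule.pos != pos:
            continue
        inferred.extend(target for target, _ in rule.targets)
    return inferred
```

Calling `reformulate("stock", pos="V")` uses only the verb rule, while
`reformulate("stock")` proposes every translation regardless of part of speech.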
`
entry   POS#sem   reformulations
stock   J#T       classique_J; banal_J;
        N#T       billot_N; fût_N; bois_N; manche_N; gaule_N;
                  jas_N; cheptel_N; bétail_N; matériel roulant_N;
                  matière première_N;
                  talon_N; réserve_N; provision_N;
                  stock_N; réserve_N; souche_N;
                  lignée_N; famille_N;
                  tronc_N; souche_N; porte-greffe_N; giroflée_N; matthiole_N;
                  valeur_N; titre_N; fonds public_N; fonds d'État_N;
                  action_N; palanque_N; palissade_N; stockade_N;
        V#T       approvisionner_V; peupler_V; empoissonner_V;
                  rendre_V; palanquer_V;

Figure 2 Bilingual French-English reformulation rules for the English word
"stock." Part-of-Speech Tags: J=adjective, N=noun, V=verb. Semantic Tags:
T=transfer rule, S=Synonyms. ";" is the separator between meanings.

entry     POS#sem   reformulations
sterile   J#S       infertile_J; bland_J; barren_J; antiseptic_J;
                    sterility_N; sterilization_N; sterilize_V;
                    soulless_J; sterilizer_N; unproductive_J;

Figure 3 Monolingual English-English reformulation for the word "sterile."
Part-of-Speech Tags: J=adjective, N=noun, V=verb. Semantic Tags:
T=transfer rule, S=Synonyms. ";" is the separator between meanings.

Evaluations performed during the EMIR project have shown that using current
translation systems to translate the query gives less relevant results than using
multilingual reformulation (see Figure 4). The reason for this is that in the
case of query translation only one solution is proposed. If there is a word that
has a wrong translation, the result of the interrogation is unpredictable. In
the case of multilingual reformulation, all possible solutions are tried. It is the
full-text database itself that is used as a semantic filter to give the relevant
documents. As a supplement, the text itself finds the right translations [FR93,
RF95].
`
These results have been obtained on the Cranfield¹ information retrieval testbed,
containing documents and queries in the aerospace domain. French versions of
the queries were created by specialists in the domain. The SYSTRAN machine
translation system [GLY98] was used to translate these queries back into
English in order to compare results. SYSTRAN used its specific aerospace transfer
dictionary, in which words were searched before falling back to a general
translation dictionary. Evaluation was performed using the same tools as those
used in the TREC full-text information retrieval competition.
`
¹ ftp://ftp.cs.cornell/pub/smart/cran

[Figure 4: precision-recall plot titled "Comparison of a monolingual
interrogation by SPIRIT, a bilingual interrogation by EMIR and a translation of
the query using SYSTRAN". Horizontal axis: Recall (0 to 90); vertical axis:
Precision (0 to 0,8). Curves: EMIR, SYSTRAN, SPIRIT anglais.]

Figure 4 Machine translation of queries performs less well than retaining
multiple translations of query terms.

When EMIR treated the queries, many translation alternatives could be
eliminated since they do not occur in the database. For the remaining
alternatives, we found that it was possible to filter some out because relevant
documents often contained many of the translated query concepts. In the most
relevant documents, there was often at least one translation of each word from
the query. It seemed that, in this case, the translations found in the most
relevant documents are the right translations. We believe that the cooccurrence
(or dependency relations) of translations of each query word is sufficient in
most cases both to find the relevant document and to give the right translation.
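
The cooccurrence idea can be sketched in a few lines of Python. The function
name `best_document_translations` and the two toy documents are assumptions;
the candidate translations are those of the chapter's "Effet de choc" example.
The sketch picks the document that covers the most query words and reads the
retained translations off that document, which is a simplification of the
weighted comparison the system performs.

```python
def best_document_translations(candidates, documents):
    """Pick the document covering the most query words.

    candidates: query word -> set of candidate translations.
    documents: document id -> set of normalized words.
    Returns (document id, {query word: translations found in it}),
    or (None, {}) when no document contains any candidate.
    """
    best_id, best_found = None, {}
    for doc_id, words in documents.items():
        found = {query_word: sorted(translations & words)
                 for query_word, translations in candidates.items()
                 if translations & words}
        if len(found) > len(best_found):
            best_id, best_found = doc_id, found
    return best_id, best_found


candidates = {
    "effet": {"effect", "result", "action", "operation", "spin", "impression"},
    "choc": {"shock", "impact", "bump", "collision"},
}
documents = {  # hypothetical toy documents
    "d1": {"shock", "wave", "effect", "boundary", "layer"},
    "d2": {"spin", "stabilization", "satellite"},
}
doc_id, chosen = best_document_translations(candidates, documents)
```

Here document "d1" contains one translation of each query word, so "effect" and
"shock" are retained while "spin" is discarded along with "d2".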
`
Of course the problem of multiword terms is difficult. There are idiomatic
expressions that are translated globally. Some compounds must also be
translated globally; others can be translated word for word, but a transformation
must be applied to restore the right word order. The most difficult problem is
the recognition of split derivable idiomatic expressions that must be translated
globally, such as verbs with postposition in English (ex: take off), verbs with
a separable particle in German (ex: abdecken which can appear as decken ... ab)
or verbal idiomatic expressions in French (ex: prendre part = to
Here is an example of an interrogation on the database in the aerospace domain.
The French query is Effet de choc, and the documents in the database are
in English. The morphosyntactic parser gives one part of speech for each word
and specifies compounds (terms linked with syntactic dependency relations).
Bilingual reformulation gives the following results according to the results of
the morphosyntactic parsing:

effet   effect, result, action, operation, working, spin, break, impression
choc    shock, impact, bump, collision

After filtering by the database lexicon, the remaining translation ambiguities
are as follows:

effet   effect, result, action, operation, spin, impression
choc    shock, impact, bump, collision

After filtering by the database lexicon of known compounds processed by
word-for-word translation and transformation of the word order, the following
translation alternatives are retained:

effet (de) choc   effect (of) shock, result (of) shock, shock result,
                  shock operation
`
In this case good translations are obtained. In more complicated situations
and longer queries, if there is a document that has the cooccurrence of at
least one translation of each concept of the query, the obtained translations
are generally the right ones and the document is relevant to the query. Figure 5
gives an example of filtering query translations using the database and best
document, over a library catalog in the nuclear domain.
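
The two filtering steps of the "Effet de choc" example can be replayed with a
short Python sketch. The function names and the two lexicon sets are
assumptions standing in for the database indexes; the sets were chosen so that
the output reproduces the alternatives listed above. Only two word-order
patterns are tried, which is a simplification.

```python
from itertools import product


def filter_by_lexicon(candidates, lexicon):
    """Keep only the candidate translations that occur in the database."""
    return {word: [t for t in translations if t in lexicon]
            for word, translations in candidates.items()}


def compound_candidates(head_trans, modifier_trans, compound_lexicon):
    """Translate 'head de modifier' word for word, try both the
    'head of modifier' and the reordered 'modifier head' patterns,
    and keep only compounds known to the database."""
    kept = []
    for head, modifier in product(head_trans, modifier_trans):
        for compound in (f"{head} of {modifier}", f"{modifier} {head}"):
            if compound in compound_lexicon:
                kept.append(compound)
    return kept


candidates = {
    "effet": ["effect", "result", "action", "operation", "working",
              "spin", "break", "impression"],
    "choc": ["shock", "impact", "bump", "collision"],
}
# Hypothetical lexicons standing in for the database indexes.
lexicon = {"effect", "result", "action", "operation", "spin", "impression",
           "shock", "impact", "bump", "collision"}
compound_lexicon = {"effect of shock", "result of shock",
                    "shock result", "shock operation"}

filtered = filter_by_lexicon(candidates, lexicon)
compounds = compound_candidates(filtered["effet"], filtered["choc"],
                                compound_lexicon)
```

Generating every head and modifier pair is the combinatorial step that makes
long queries expensive, which is why filtering against the database matters.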
`
`2 THE DISTRIBUTED MULTILINGUAL
`CLIENT-SERVER ARCHITECTURE
`
To give access to SPIRIT databases from standard Internet clients (such as
Netscape or MS Explorer), an interface between a WWW server and a SPIRIT
server has been developed. Users are faced with two kinds of problems:

1. They would like to consider a set of databases as one whole logical database
in order to have a better coverage in the search.

2. The databases could be in different languages and even in mixed languages.
`
[Figure 5: diagram with four stages: Transfer, Filtering by Database,
Monolingual Reformulation, Filtering by Best Document. Candidate translations
of "traitement" (treating, processing, treatment), "déchets" (decrease,
diminution, waste, refuse, failure) and "radioactifs" (radioactive) are
filtered stage by stage; treatment, waste and radioactive remain after
filtering by the best document.]

Figure 5 Filtering the translations of the French query "Traitement des
déchets radioactifs" over an English library catalog in the nuclear domain.
`
By mixed language we do not mean that we manage documents whose text
contains several languages; applications of this kind are rare. We mean
that the information linked to the documents can be in more than one language.
For example, a catalog can contain for each document a title in French and
in English, a summary in French and in English and the text only in English.
The problem of the identification of the language [Gre95] has not been treated
because it is supposed that at the moment the document is introduced the
information in different languages is put in different fields.
`
The solution that is under development (see Figure 6) is to add a new layer that
enables the user (or database manager) to define a logical database composed of
a set of existing databases, whatever the location of the databases in the world.
If this approach is followed, the problem of accessing databases that contain
documents in different languages or documents that have parts in various
languages can be solved by splitting such databases into as many bases as
languages according to the structure description that gives the language for
each field. After this operation only monolingual databases have to be queried.
`
[Figure 6: architecture diagram. Recoverable labels: "Interface SPIRIT·W3";
"WWW Server".]

Figure 6 General architecture of the distributed multilingual WEB text
retrieval system
`
`The definition of the logical database to be queried must describe not a set of
`databases but a set of clusters of databases. Each cluster is composed of the
`various language-specific fields of the same original database.
`
The problem of merging results from the various physical databases united
in a logical one is rather complicated. In the case of merging information
from databases containing different document sets, the problem of eliminating
doubles (duplicate documents) can be ignored. If this is not the case, an
elimination procedure of doubles must be undertaken. The problem of merging
word weights from the various origins can be simple if one adopts the
hypothesis that there will be few doubles.
`
In the case of merging information from the databases representing the
different language translations of the same original database, there will be many
doubles in the query answer, especially in the case of documents which contain
several languages in the same document. The computation of the word weights
is rather complex because the frequencies in each language can be very different
depending on the distribution of the languages in the database. A good solution
to this problem, if possible, is to identify the concepts and their different
representations in all of the languages considered and to recompute a weight for
this pivot concept on the basis of a fictive database made of the sole retrieved
documents from the queried databases. This solution gives results acceptable
from the user's point of view.
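
A minimal Python sketch of this recomputation is given below. The `PIVOT`
table, the function name and the toy hits are hypothetical, and log(N / df) is
again an assumed stand-in for the system's weighting, which the chapter does
not specify. Parts of the same document retrieved from different language
databases are merged before counting, so doubles do not inflate frequencies.

```python
import math

# Hypothetical table mapping language-specific words to pivot concepts.
PIVOT = {"déchet": "WASTE", "waste": "WASTE",
         "radioactif": "RADIOACTIVE", "radioactive": "RADIOACTIVE"}


def pivot_weights(retrieved, pivot=PIVOT):
    """Recompute concept weights over the retrieved documents only.

    retrieved: list of (database name, document id, words) hits.
    Hits sharing a document id are parts of one document and are merged.
    Weight = log(N / df) over the merged documents (assumed formula).
    """
    concepts_by_doc = {}
    for _database, doc_id, words in retrieved:
        concepts = {pivot[w] for w in words if w in pivot}
        concepts_by_doc.setdefault(doc_id, set()).update(concepts)
    n_docs = len(concepts_by_doc)
    doc_freq = {}
    for concepts in concepts_by_doc.values():
        for concept in concepts:
            doc_freq[concept] = doc_freq.get(concept, 0) + 1
    return {c: math.log(n_docs / df) for c, df in doc_freq.items()}


hits = [
    ("french", 1, ["déchet", "radioactif"]),
    ("english", 1, ["waste", "radioactive"]),   # same document, English part
    ("english", 2, ["radioactive"]),
    ("french", 3, ["radioactif"]),
]
weights = pivot_weights(hits)
```

With these hits the fictive database holds three documents; RADIOACTIVE occurs
in all of them and WASTE in one, so WASTE receives the higher weight.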
`
3 EXPERIMENTATION ON A LIBRARY CATALOG
`
The catalog of our libraries is built by downloading OCLC notices that are
updated to add local information. In this catalog, titles of documents are mainly
in French and in English. The same catalog entry can have a list of keywords
in English and another in French, but they are not the direct translation of each
other. This database is really a mixed language database in the sense that in
the same entry there can be more than one language. For this reason, we adopted
this database to experiment with the architecture described in the previous
section.
`
Here is an example of a multilingual interrogation of this database illustrating
the functioning of the system. The test database is a sample of 1500 documents.
Titles are either in English or French but each language is stored in a different
field, and can thus be automatically identified. Keywords are in one or both
languages, each language separated, again, in a different keyword field. The
multilingual database was split into two monolingual databases. The same
document can have an English part in one database and a French part in the
second database. The French query is sent to the French database with only a
monolingual reformulation and to the English database with a cross-lingual
reformulation followed by a monolingual target language reformulation.
`
The results of a SPIRIT interrogation are presented as a suite of document
classes ordered by decreasing relevance. Each class is characterized by the
same intersection of concepts with the original query. A document is present
in the best class it can go into, best being understood in terms of relevance.
An example of an actual query (we found that our users prefer to query in their
mother tongue, French) is traitement des déchets radioactifs. Here is the
result of posing this query on the French part of the database. Documents are
named by their internal number.
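
The grouping into classes can be sketched in Python as follows. The function
name, the concept weights and the toy documents are hypothetical, and ordering
classes by the sum of their concept weights is an assumption: the chapter only
states that classes are ordered by decreasing relevance.

```python
def rank_classes(query_concepts, documents, weights):
    """Group documents by the set of query concepts they contain and
    order the groups by decreasing total weight of that set (assumed
    ordering criterion)."""
    classes = {}
    for doc_id, concepts in documents.items():
        shared = frozenset(query_concepts & concepts)
        if shared:
            classes.setdefault(shared, []).append(doc_id)
    ordered = sorted(classes.items(),
                     key=lambda item: sum(weights[c] for c in item[0]),
                     reverse=True)
    return [(sorted(shared), sorted(doc_ids)) for shared, doc_ids in ordered]


query = {"traitement", "déchet", "radioactif"}
weights = {"traitement": 1.0, "déchet": 2.0, "radioactif": 0.5}  # hypothetical
documents = {  # hypothetical concept sets
    10: {"déchet", "radioactif", "stockage"},
    11: {"traitement", "radioactif"},
    12: {"radioactif"},
    13: {"radioactif", "mesure"},
}
classes = rank_classes(query, documents, weights)
```

Because a document is filed under its exact intersection with the query, it
appears in one class only, the best one it qualifies for.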
`
Result over French part of databases
Class    match                          documents
first    déchets radioactifs compound   1215
second   traitement and radioactifs     1192, 1216
third    radioactif                     950, 951, 952, 953, 1397, 1442

Result over English part of databases
Class    match                          documents
first    waste treatment compound       1215
second   decrease from déchets          339 eliminated by Best Doc. filtering
third    radioactive, radioactivity     42, 950, 951, 953, 1397, 1442

Merging of Results: calculating new classes, new weights
Class    match                          documents
first    déchets radioactifs compound   1215
second   traitement and radioactifs     1192, 1216
third    radioactif                     42, 950, 951, 952, 953, 1397, 1442
`
A side result that can be very important in other applications, like machine
translation, is that this comparison process has been able to choose the right
English translation "waste treatment."
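
The merged table can be reproduced with a deliberately simplified Python
sketch. It aligns classes by rank and keeps each document once, in its best
class; the system itself calculates new classes and new weights, which this
sketch does not attempt. The function name `merge_classes` is hypothetical; the
document numbers come from the result tables of this section.

```python
def merge_classes(*results, eliminated=()):
    """Merge ranked classes from several databases.

    Each result maps class rank (1 = best) to document ids. A document
    found in several databases is kept once, in its best class.
    Documents listed in `eliminated` are dropped.
    """
    best_rank = {}
    for result in results:
        for rank, doc_ids in result.items():
            for doc_id in doc_ids:
                if doc_id in eliminated:
                    continue
                best_rank[doc_id] = min(rank, best_rank.get(doc_id, rank))
    merged = {}
    for doc_id, rank in best_rank.items():
        merged.setdefault(rank, []).append(doc_id)
    return {rank: sorted(ids) for rank, ids in sorted(merged.items())}


french = {1: [1215], 2: [1192, 1216], 3: [950, 951, 952, 953, 1397, 1442]}
english = {1: [1215], 2: [339], 3: [42, 950, 951, 953, 1397, 1442]}
merged = merge_classes(french, english, eliminated={339})
```

Document 42, found only through the English part, joins the third class, and
the doubles retrieved from both parts appear once.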
`
`4 CONCLUSION
`
The first conclusion is that this approach seems to give readily exploitable
results for the problem of the interrogation of mixed language databases. Of
course, some problems remain. The quality of the linguistic processing and of
the monolingual and transfer rules is critical for the quality of the answers.
This forced us to begin strong quality assurance of this linguistic data.
`
Another point is that the addition of compounds specific to the domain has to
be helped by automatic processing. It is especially important that the
reformulation cycle know compounds that cannot be translated word for word.
Translating frequent compounds in the domain globally would, in addition, save
CPU resources by avoiding the current generation of all possible combinations.
`
A final remark is that, with this kind of information retrieval based on
multilingual reformulation, a translation memory would no longer need a bilingual
corpus [LL90] but only a large collection of texts in the target language.
`