Cross-Language Text & Speech Retrieval

Papers from the 1997 AAAI Symposium

David Hull & Doug Oard, Cochairs

March 24-26, Stanford, California
Technical Report SS-97-05

AAAI Press
Menlo Park, California

AOL Ex. 1019
Page 1 of 7

Copyright © 1997, AAAI Press

The American Association for Artificial Intelligence
445 Burgess Drive
Menlo Park, California 94025

ISBN 1-57735-040-5

SS-97-05

Manufactured in the United States of America

Multilingual database and crosslingual interrogation in a real internet application

Architecture and problems of implementation

Christian Fluhr, Dominique Schmit, Faiza Elkateb, Philippe Ortet, Karine Gurtner

Commissariat à l'Energie Atomique
Direction de l'Information Scientifique et Technique
CEA/Saclay
91191 Gif sur Yvette cedex
France
E-mail: fluhr@tabarly.saclay.cea.fr

Abstract
The EMIR European project demonstrated the feasibility of a crosslingual interrogation of fulltext databases using a monolingual and a bilingual general reformulation of the query. New developments have been made to accept multilingual databases, even if there is more than one language in the same document. Problems of implementation of full-scale applications are discussed.

Background

Our crosslingual text retrieval technology is based on the results of the EMIR (European Multilingual Information Retrieval) project (Fluhr 1994), carried out in the framework of the ESPRIT European Program. The project covered three languages, English, French and German, with partners from France, Germany and Belgium. It ran from October 1990 to April 1994.

The project had as technological support an already existing multimonolingual (French and English) text retrieval software: SPIRIT (Syntactic and Probabilistic Indexing and Retrieval of Information in Texts). SPIRIT has been marketed since 1980 on IBM mainframes, since 1985 on various platforms (from PCs to mainframes) and since 1993 in a client-server architecture.

At this time, part of the EMIR results have been introduced into the SPIRIT system. Applications are mainly for the French-English pair. A new language, Russian, has been added, and Dutch is on the way.

Main principles of the approach of crosslingual interrogation

The EMIR approach is based on the use of a general transfer dictionary as a set of reformulation rules. That means that all possible translations are proposed as possible keys for retrieving the documents.

This approach is the opposite of the one consisting of a translation of the query followed by a monolingual interrogation.

There are 3 differences:

* the reformulation is based on the translation of the concepts in the query and does not need to build a syntactically and semantically correct translated query.

* the translation of concepts is used to obtain documents: if the word inferred in the target language is not the exact translation but a relevant document is obtained, we can consider that the system works properly (for example, inference of a hypernym or of a word of a different part of speech).

* the translation and retrieval processes are mixed, so that the fulltext database is used as semantic knowledge for solving translation ambiguities, which are the main problem to solve in this approach of domain-independent reformulation. One of the main results of this approach is that if an answer to the query exists in the database, the system can, in most cases, select the right translation by looking in the most relevant documents.

Principle of the method
We suppose here that the database is monolingual; the problems specific to multilingual databases are discussed further on.

Database processing
The database is processed by the morphosyntactic parser. The results are normalized single words and compounds. Normalization is mainly based on lemmatization, but general synonymies can be taken into account. For example, « logiciel » and « software » in French are normalized to « logiciel »; « harbour » and « harbor » in English are normalized to « harbor ».

By single words we mean both true single words and idiomatic expressions like « monkey wrench » in British English or « clé anglaise » in French. Compounds are words in dependency relations, like « multilingual database ». For each normalized word or compound, a semantic weight is computed according to the information it brings for choosing the relevant documents.
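The indexing step described above can be illustrated by a minimal sketch. The paper does not give the formula of the semantic weight, so the inverse-document-frequency weight used here is only an assumption; the names NORMALIZATION, normalize and semantic_weights are hypothetical, and the small table merely stands in for the lemmatizer and the synonym dictionary.

```python
import math
from collections import Counter

# Hypothetical normalization table standing in for lemmatization plus
# general synonymies (for example "harbour" -> "harbor").
NORMALIZATION = {
    "harbour": "harbor",
    "harbours": "harbor",
    "harbors": "harbor",
    "databases": "database",
}


def normalize(word):
    """Return the normalized form of a word (lowercased, then table lookup)."""
    word = word.lower()
    return NORMALIZATION.get(word, word)


def semantic_weights(documents):
    """Weight each normalized word by log(N / document frequency).

    This IDF-style weight is an assumption, not the actual SPIRIT formula:
    a word found in few documents discriminates more, so it weighs more.
    """
    doc_freq = Counter()
    for text in documents:
        # Count each normalized word once per document.
        doc_freq.update({normalize(w) for w in text.split()})
    total = len(documents)
    return {word: math.log(total / df) for word, df in doc_freq.items()}


docs = [
    "multilingual databases",
    "harbour databases",
    "harbors",
    "databases",
]
weights = semantic_weights(docs)
```

In this toy collection « multilingual » occurs in one document out of four and therefore receives a larger weight than « database », which occurs in three.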

Query processing
The query is processed by the same morphosyntactic parser as the database. Normalized words and compounds are produced with their part of speech.

For each of these units, we try to infer all possible translations that agree with the part of speech. For example, « light » as an adjective is translated by the French adjective « léger », but « light » as a noun is translated by the noun « lumière ».
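A minimal sketch of this part-of-speech agreement, assuming the transfer dictionary can be represented as a table keyed by the normalized word and its part of speech. The structure TRANSFER and the functions translations and reformulate are hypothetical; only the two entries for « light » come from the text.

```python
# Hypothetical transfer dictionary keyed by (normalized word, part of speech).
# Only the two "light" entries are taken from the example in the text.
TRANSFER = {
    ("light", "adjective"): ["léger"],
    ("light", "noun"): ["lumière"],
}


def translations(word, part_of_speech, transfer=TRANSFER):
    """Return every translation recorded for the word with this part of speech."""
    return list(transfer.get((word, part_of_speech), []))


def reformulate(parsed_query, transfer=TRANSFER):
    """Map each (word, part of speech) unit of a parsed query to its candidates."""
    return {unit: translations(*unit, transfer) for unit in parsed_query}
```

For instance, `reformulate([("light", "noun")])` proposes only « lumière », because the adjective entry is keyed separately.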

Compounds can be translated globally or word for word. In this last case the word order is rearranged to fit the result of the target language normalization of compounds. Some compounds cannot be translated word for word, but it is not necessary to consider them as idiomatic expressions. A compound like « seat belt » is really a belt on a seat, and in French « ceinture de sécurité » is a belt for security.

Generally, especially for single words, there are a lot of translations.

Example: « talon » (French) → (English) « heel », « crust », « spur », « stub », « counterfoil », « talon »

At this level the results of the multilingual inference are filtered by the database lexicon, and a lot of translations that are incompatible with the domain are eliminated.

The filtering by the database lexicon is not sufficient to eliminate all wrong translations. So it is possible to keep only the translations contained in the most relevant documents (that is, the ones that contain the maximum number of query words, especially the ones where the words have the same dependency relations as in the query).

Before performing this optimization it is necessary to be quite sure that the « best » documents are relevant, that is to say that they contain a sufficient number of the most heavily weighted words.

If it is decided that the most relevant documents are really relevant, a feedback can be applied to the transfer process. In a second step, only words compatible with the most relevant documents are proposed.

This process is very effective at increasing relevance, but it has a bad effect on recall because it can eliminate synonyms of the chosen words that occur only in less relevant documents. So it is useful to follow this feedback by a monolingual reformulation in the target language. We are then in the same situation as with a well-formed query written directly in the target language, or with a well-translated query, both of which need a monolingual reformulation to obtain good recall.

Example of translation and filtering

Query: « spectroscopie de masse par temps de vol » on a base of 655000 titles of reports on energy

Transfer rules:

spectroscopie (Noun) : spectroscopy
masse (Noun) : mass, bulk, ground, sledgehammer
temps (Noun) : stroke, tense, beat, time, weather, days
vol (Noun) : flight, flock, theft

After filtering by the database lexicon:

spectroscopie (Noun) : spectroscopy
masse (Noun) : mass, bulk, ground
temps (Noun) : stroke, beat, time, weather, days
vol (Noun) : flight, theft

After filtering by the best document, which contains « time of flight mass spectroscopy » and in which the system has recognized 2 compounds, « time of flight » and « mass spectroscopy »:

spectroscopie (Noun) : spectroscopy
masse (Noun) : mass
temps (Noun) : time
vol (Noun) : flight

This means that the system has dynamically produced the following compound translations:

temps de vol : time of flight
spectroscopie de masse : mass spectroscopy
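The two filtering passes of this example can be replayed with a short sketch. The transfer rules and the best document are those of the example; the set DATABASE_LEXICON is a hypothetical extract reconstructed from the outcome of the first pass, and the function name filter_candidates is an assumption. The preliminary check that the best document is really relevant is left out.

```python
# Transfer rules from the example.
TRANSFER_RULES = {
    "spectroscopie": ["spectroscopy"],
    "masse": ["mass", "bulk", "ground", "sledgehammer"],
    "temps": ["stroke", "tense", "beat", "time", "weather", "days"],
    "vol": ["flight", "flock", "theft"],
}

# Hypothetical extract of the database lexicon, reconstructed from the
# result of the first filtering pass shown in the example.
DATABASE_LEXICON = {
    "spectroscopy", "mass", "bulk", "ground",
    "stroke", "beat", "time", "weather", "days",
    "flight", "theft",
}

BEST_DOCUMENT = "time of flight mass spectroscopy"


def filter_candidates(candidates, allowed):
    """Keep, for each source word, only the translations found in `allowed`."""
    return {
        source: [t for t in targets if t in allowed]
        for source, targets in candidates.items()
    }


# First pass: the database lexicon removes out-of-domain translations.
by_lexicon = filter_candidates(TRANSFER_RULES, DATABASE_LEXICON)
# Second pass: only words present in the best document survive.
by_best_document = filter_candidates(by_lexicon, set(BEST_DOCUMENT.split()))
```

The same function serves both passes; only the set of allowed target words changes, from the whole lexicon to the words of the best document.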

The rearrangement of words after word-for-word translation is obtained using rules that depend on the language pair and on the syntactic structure of the compounds.
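The paper does not list these rules. The sketch below is one possible reading, under the assumption that a compound is normalized to a (head, modifier) pair independent of surface word order, so that a word-for-word translation of the pair can be matched against the compounds recognized in the target database. The two normalization rules and all function names are hypothetical.

```python
from itertools import product

# Word translations left after filtering by the database lexicon (from the example).
WORD_TRANSLATIONS = {
    "spectroscopie": ["spectroscopy"],
    "masse": ["mass", "bulk", "ground"],
    "temps": ["stroke", "beat", "time", "weather", "days"],
    "vol": ["flight", "theft"],
}


def normalize_french(compound):
    """'A de B' -> (head A, modifier B). Hypothetical rule for French."""
    head, _, modifier = compound.partition(" de ")
    return head, modifier


def normalize_english(compound):
    """'A of B' -> (A, B); 'B A' -> (A, B). Hypothetical rules for English."""
    if " of " in compound:
        head, _, modifier = compound.partition(" of ")
    else:
        modifier, head = compound.split()
    return head, modifier


def translate_compound(compound, target_compounds):
    """Translate word for word, keeping pairs attested in the target database.

    `target_compounds` maps a normalized (head, modifier) pair to its surface form.
    """
    head, modifier = normalize_french(compound)
    found = []
    for pair in product(WORD_TRANSLATIONS[head], WORD_TRANSLATIONS[modifier]):
        if pair in target_compounds:
            found.append(target_compounds[pair])
    return found


# Compounds recognized in the best document of the example.
recognized = ["time of flight", "mass spectroscopy"]
target_compounds = {normalize_english(c): c for c in recognized}
```

With these assumptions, `translate_compound("spectroscopie de masse", target_compounds)` yields « mass spectroscopy »: the word order of the surface form comes from the target database, not from the source compound.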

Architecture for multilingual databases on INTERNET

The EMIR project has demonstrated that it is possible to query a monolingual database in a language different from the database language. But in our countries, where we use our own language and English for scientific documents, the problem is that most databases contain documents in 2 or more languages. In some cases, even the information about a single document is in several languages. For example, in our library catalog, a document can have an original title in English, a translated title in French, keywords in English and French, and a summary in English and French.

Generally it is not possible to assume that the information is redundant between the languages. For example, keywords in English and in French are taken from different indexing systems and are not translations of each other.

Mixed language databases were not taken into account by EMIR, nor are they in the current version of the SPIRIT system. That is the reason why the problem has been solved through a new architecture based on a more standard Web architecture.

To permit access to the SPIRIT server from standard INTERNET clients (Netscape, MS Explorer, tango, etc.), a WWW-SPIRIT interface was built at the end of 1995.

This interface is being extended to support distributed multibase multilingual databases. That means that the system can manage a logical database composed of several physical ones. The problem of mixed language databases is handled by this architecture in the following way:

* the mixed logical database is split into as many physical databases as there are languages in the logical one.

* the same document can have parts in several monolingual physical databases.

* during the interrogation, the query in one language (for example the user's mother tongue) is sent to each physical database composing the logical mixed language one.

* each database server receives the query together with an indication of the query language. The database server, knowing the language of its own database, performs either a monolingual or a bilingual interrogation and sends the results to the interface.

* the interface must merge the results. For that, it computes the weight of each concept of the query on a hypothetical database composed of the documents retrieved from the various language parts. This processing is necessary because the weights of a concept computed in each part can be very different and cannot represent the weight of the global use of the concept in the multilingual logical database.

  In SPIRIT the answers are grouped by intersection classes, and the classes are sorted according to the weight of the concepts in the query-document intersection. The merging of results can produce new classes and suppress others.

* at the end, when a user asks to see a document, the parts in different languages are obtained from the various monolingual physical databases. The locations of the words, also obtained from the various monolingual physical databases, are used to highlight the words used to extract the document. This functionality is especially important in the case of crosslingual interrogation, to check why a document has been retrieved.
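The merging step of the list above can be sketched as follows. The answers of the physical database servers are given here as plain data, since the actual exchange format between servers and interface is not described; the weight formula log(1 + N / df) and the names merge_results, global_weights and intersection_classes are assumptions made for this illustration.

```python
import math
from collections import defaultdict


def merge_results(results_per_database):
    """Merge hits from several monolingual physical databases.

    Each hit is (document id, set of query concepts found in that part).
    Parts of the same document held in different databases are united.
    """
    concepts_by_doc = defaultdict(set)
    for hits in results_per_database:
        for doc_id, concepts in hits:
            concepts_by_doc[doc_id] |= set(concepts)
    return dict(concepts_by_doc)


def global_weights(concepts_by_doc):
    """Weight each concept on the hypothetical database of retrieved documents.

    The formula log(1 + N / df) is an assumption made for this sketch.
    """
    total = len(concepts_by_doc)
    doc_freq = defaultdict(int)
    for concepts in concepts_by_doc.values():
        for concept in concepts:
            doc_freq[concept] += 1
    return {c: math.log(1 + total / df) for c, df in doc_freq.items()}


def intersection_classes(concepts_by_doc, weights):
    """Group documents by query-document intersection, heaviest class first."""
    classes = defaultdict(list)
    for doc_id, concepts in concepts_by_doc.items():
        classes[frozenset(concepts)].append(doc_id)
    ranked = sorted(
        classes.items(),
        key=lambda item: sum(weights[c] for c in item[0]),
        reverse=True,
    )
    return [(set(key), sorted(docs)) for key, docs in ranked]


# Hypothetical answers of a French and an English physical database.
french_hits = [("doc1", {"mass"}), ("doc2", {"mass", "flight"})]
english_hits = [("doc1", {"spectroscopy"}), ("doc3", {"mass"})]

merged = merge_results([french_hits, english_hits])
classes = intersection_classes(merged, global_weights(merged))
```

In this toy case doc1 matches « mass » in one database and « spectroscopy » in the other, so the merge creates the class {mass, spectroscopy}, which neither server reported on its own.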

First applications
At this time our library catalogs are put on the web using the first version of the SPIRIT/W3 interface, which can manage neither multilingual databases nor multibase interrogation.

Our documents are mainly in French and English. We are obliged to choose only one indexing language, for example French. For this indexing language, all the linguistic support is given (lemmatization); for the other language, all words are considered as proper nouns and are not lemmatized.

It is easy to show that, to be sure of a sufficient answer, it is necessary to ask a simple query in French and a query with all possible variants of the words in English.
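A rough sketch of what this means for the user: a single lemma is enough in the indexing language, whereas every surface variant must be listed for the language that is not lemmatized. The variant generator below is deliberately naive and purely illustrative; the function names are hypothetical.

```python
def english_variants(word):
    """Naive inflected variants of an English word (illustration only)."""
    if word.endswith("e"):
        stem = word[:-1]
        return sorted({word, word + "s", word + "d", stem + "ing"})
    return sorted({word, word + "s", word + "ed", word + "ing"})


def build_query(french_lemma, english_word):
    """One lemma suffices in the indexing language; the other needs every variant."""
    return [french_lemma] + english_variants(english_word)
```

For example, `english_variants("retrieve")` lists « retrieve », « retrieved », « retrieves » and « retrieving », all of which the user would otherwise have to type by hand.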

Thank God, most of our users are happy with partial answers and have not noticed the problems. But it is not a good idea to expect that this situation will continue.

That is the reason why we decided to implement the prototype of the SPIRIT/W3 multilingual interface in an operational way. We aim at beginning the service in late March 1997 on our library catalogs, on catalogs of reports on energy from the IAEA and from ETDE, and on a catalog of the publications by personnel belonging to our organization. All these applications will be visible from the whole internet. Unfortunately, fulltext databases will be reserved for intranet use.

Problems to solve

Going from a feasibility prototype to operational software is a hard task, especially when the problem to solve is crosslingual interrogation.

Such a system involves a lot of tools and a lot of linguistic data. If something is missing somewhere, the result is strongly perturbed. That is the reason why very strong quality control is necessary to minimize the possibility of a discontinuity between the query character string and the words in another language to search for in the database index.

Quality control on linguistic data
There are many causes of discontinuity in the linguistic process:

* the word is not in the dictionary of the source language

* the word belongs to an idiomatic expression which is not in the source language dictionary

* the word is in the source language dictionary but not with all its possible parts of speech (example: « processing » as a verb but not as a noun)

* the normalized word has no entry in the transfer dictionary

* the word cannot be translated word for word: a translation of the compound must be put into the transfer dictionary

* the normalization of words in the source language is different from the normalization of the left part of the transfer dictionary

* compounds are treated in different parts of the linguistic data either as simple compounds (words in dependency relations) or as idiomatic expressions, and in SPIRIT their normalization can be different

This list is not exhaustive.

In a linguistic process which goes through many steps, a lack of coherence can break the inference line. Managing a huge amount of linguistic data requires employing many people of different origins, even working in different parts of the world.

So it is not possible to rely only on human behavior. The solution is to implement a control system which verifies everything that can be verified automatically and suggests corrections or additions of information to the human.

Acquisition of new compounds in the transfer dictionary
Another need is the fast addition of new words for new databases. It is very easy to know which single words are missing from the source language dictionary and from the transfer dictionary. The main problem is to determine which compounds cannot be translated word for word and must therefore be introduced into the transfer dictionary.

Various strategies have been followed to tackle this problem:

* automatic extraction of terminology and treatment by a specialist.

* treatment of bilingual corpora in the same domain, extraction of terminology in the two language versions and determination of the compounds that are not translatable word for word. This way seems promising but cannot be followed at this time because of the problems of consistency in the linguistic data mentioned above. At this time too many false detections are obtained to permit operational use.

* processing of multilingual thesauri in the domain of the database. We have begun to process the INIS (IAEA) and ETDE (OECD) thesauri. The first one is on atomic energy, the second one is on all kinds of energy.
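Some of the causes of discontinuity in the linguistic process lend themselves to automatic verification. The sketch below is not the control system of the authors, only an illustration of the kind of check such a system could run; the data layout (a source dictionary mapping each word to its normalized form and parts of speech, a transfer dictionary keyed by normalized form and part of speech) and the function name check_linguistic_data are assumptions, and the sample entries are invented.

```python
def check_linguistic_data(source_dictionary, transfer_dictionary, corpus_units):
    """Report gaps that would break the chain from query word to index word.

    source_dictionary: word -> (normalized form, set of parts of speech)
    transfer_dictionary: (normalized form, part of speech) -> translations
    corpus_units: (word, part of speech) pairs observed in new texts
    """
    problems = []
    normalized_forms = {norm for norm, _ in source_dictionary.values()}
    for word, pos in corpus_units:
        if word not in source_dictionary:
            problems.append((word, "missing from source dictionary"))
            continue
        norm, known_pos = source_dictionary[word]
        if pos not in known_pos:
            problems.append((word, f"part of speech '{pos}' not recorded"))
        elif not transfer_dictionary.get((norm, pos)):
            problems.append((word, "no entry in transfer dictionary"))
    # The left part of the transfer dictionary must use the same
    # normalization as the source language dictionary.
    for norm, _pos in transfer_dictionary:
        if norm not in normalized_forms:
            problems.append((norm, "transfer entry is not a normalized form"))
    return problems


# Invented sample data, for illustration only.
source = {
    "processing": ("process", {"verb"}),
    "harbour": ("harbor", {"noun"}),
}
transfer = {
    ("process", "verb"): ["traiter"],
    ("harbour", "noun"): ["port"],  # keyed by a non-normalized form
}
observed = [("processing", "noun"), ("harbour", "noun"), ("thesaurus", "noun")]

report = check_linguistic_data(source, transfer, observed)
```

Each reported pair is a suggestion for a human to correct or complete the data; the check itself changes nothing.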

Conclusion

It is difficult at this moment to give general conclusions. Our evaluation during the EMIR project was that, even with a lack of consistency, without adding any word to the dictionaries and without translation feedback, the results decreased by about 10 % in comparison with a monolingual interrogation with the same system.

In comparison with the current situation, where databases are indexed in only one of the languages, the new architecture will increase the quality of the answers using only one query in one language. How the results will compare with those of a really good translation into each language remains to be evaluated.

In fact, even if the level of performance is not the best possible, given the behavior of our users (they ask only one question in one language), the results will be strongly improved in comparison with the current performance of the database.

References

Debili F., Fluhr C., Radasoa P., About reformulation in fulltext IRS, Conference RIAO 88, MIT Cambridge, March 1988. A modified text has been published in Information Processing and Management, Vol. 25, N° 6, 1989, pp 647-657.

Fluhr C., Multilingual Information, Pacific Rim International Conference on Artificial Intelligence (RICAN), "AI and Large-Scale Information", Nagoya, 14-16 November 1990.

Fluhr C., Radwan Kh., Fulltext databases as lexical semantic knowledge for multilingual interrogation and machine translation, EWAIC'93 Conference, Moscow, 7-9 September 1993.

Fluhr C., Mordini P., Moulin A., Stegentritt E., EMIR Final report, ESPRIT project 5312, DG II, Commission of the European Union, October 1994.

Fluhr C., Schmit D., Ortet P., Elkateb F., Gurtner K., Semenova V., Distributed multilingual information retrieval, MULSAIC Workshop, ECAI96 Conference, Budapest, 12-16 August 1996.

Gachot D., Lange E., Yang J., The SYSTRAN NLP Browser: An Application of Machine Translation Technology in Multilingual Information Retrieval, Cross-Linguistic Information Retrieval Workshop, SIGIR'96, August 18-22, Zurich, Switzerland.

Landauer T. K., Littman M. L., Fully Automatic Cross-Language Document Retrieval Using Latent Semantic Indexing, (1990) in Proceedings of the Sixth Annual Conference of the UW Centre for the New Oxford English Dictionary and Text Research, UW Centre for the New OED and Text Research, Waterloo, Ontario, Canada.

Radwan Kh., Foussier F., Fluhr C., Multilingual access to textual databases, RIAO'91 Conference, April 1991, Barcelona.

Radwan Kh., Fluhr C., Textual database lexicon used as a filter to resolve semantic ambiguity, application on multilingual information retrieval, 4th Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, 24-26 April 1995.