`David Farwell
`Eduard Hovy (Eds.)
`
`Machine Translation
`and the
`Information Soup
`
`Third Conference of the Association
`for Machine Translation in the Americas
`AMTA'98
`Langhorne, PA, USA, October 28-31, 1998
`Proceedings
`
`Springer
`
`AOL Ex. 1022
`Page 1 of 18
`
`
`
Series Editors
Jaime G. Carbonell, Carnegie Mellon University, Pittsburgh
Jörg Siekmann, University of Saarland
`
`Volume Editors
`David Farwell
`New Mexico State University, Computing Research
Box 30001/3CRL, Las Cruces, NM 88003, USA
`E-mail: david@crl.nmsu.edu
`
`Laurie Gerber
`SYSTRAN Inc.
`7855 Fay Avenue, Suite 300
P.O. Box 907, La Jolla, CA 92037, USA
E-mail: lgerber@systransoft.com
`
`Eduard Hovy
`University of Southern California, Information Sciences Institute
`4676 Admiralty Way, Marina del Rey, CA 90292-6695, USA
E-mail: hovy@isi.edu
`
`Cataloging-in-Publication Data applied for
`
`Die Deutsche Bibliothek - CIP-Einheitsaufnahme
`
Machine translation and the information soup: proceedings; Langhorne, PA,
USA, October 28-31, 1998 / David Farwell ... (ed.). - Berlin; Heidelberg;
New York; Barcelona; Hong Kong; London; Milan; Paris; Singapore; Tokyo:
Springer, 1998
(... Conference of the Association for Machine Translation in the Americas, AMTA ... ; 3)
(Lecture notes in computer science; Vol. 1529 : Lecture notes in artificial
intelligence)
`ISBN 3-540-65259-0
`
CR Subject Classification (1998): I.2.7, H.3, F.4.3, H.5, I.5
`
`ISBN 3-540-65259-0 Springer-Verlag Berlin Heidelberg New York
`
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting,
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965,
in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are
liable for prosecution under the German Copyright Law.
© Springer-Verlag Berlin Heidelberg 1998
Printed in Germany
Typesetting: Camera-ready by author
06/3142 - 5 4 3 2 1 0
`SPIN 106921)06
`
Printed on acid-free paper
`
`
`
`
Integrating Query Translation and Document Translation
in a Cross-Language Information Retrieval System
`
`Guo-Wei Bian and Hsin-Hsi Chen
`Department of Computer Science and Information Engineering
`National Taiwan University
`Taipei, Taiwan, R.O.C.
Email: gwbian@nlg.csie.ntu.edu.tw, hh_chen@csie.ntu.edu.tw
`http://nlg3.csie.ntu.edu.tw
`
Abstract. Due to the explosive growth of the WWW, very large multilingual
textual resources have motivated research in Cross-Language Information
Retrieval and online Web Machine Translation. In this paper, the integration
of language translation and text processing systems is proposed to build a
multilingual information system. A distributed English-Chinese system on the
WWW is introduced to illustrate how to integrate query translation, search
engines, and a web translation system. Since July 1997, more than 46,000 users
have accessed our system and about 250,000 English web pages have been
translated to pages in Chinese or bilingual English-Chinese versions. The
average satisfaction degree of users at document level is 67.47%.
`
1 Introduction
In the past few years, the World Wide Web (WWW) has grown explosively and has
become the most useful and powerful information retrieval and accessing system on
the Internet. The WWW breaks the boundaries of countries and provides very large
online documents (more than 10 million documents) in multiple languages. These
multilingual textual resources have motivated research in Cross-Language
Information Retrieval (CLIR) and online Machine Translation (MT) to build the
[...] locate interesting and relevant information, the major barrier becomes the
language problem for people to search, retrieve, and understand WWW documents
in different languages. That decreases the dissemination power of the WWW to some
extent.
To alleviate this barrier, some information providers and WWW servers keep
multiple copies of their information in different languages for multilingual
service. Due to the dynamic nature of the WWW environment, the provided
information is updated frequently. This approach therefore suffers from the data
inconsistency problem and the management problem of multilingual documents. How
to incorporate the capability of language translation into the WWW becomes
indispensable for multilingual service. Recently, several on-line machine
translation systems [1-4] have been
`
`
presented. Traditional MT systems cannot be employed on the WWW directly
because they are usually used to translate written documents in off-line batch
mode, where translation quality is the most important criterion. In on-line and
real-time applications, speed performance is also an important factor.
In this paper, we will focus on the following problems to alleviate the language
barrier on the WWW:
1. Language translation techniques
2. The integration of language translation system and text processing system
The language translation system is proposed to be incorporated with different kinds
of text processing systems (e.g., searching engines, text summarization systems, etc.).
A system integrating MT and IR technologies for the WWW (abbreviated as MTIR) is
introduced to illustrate our solutions for the mentioned problems. Section 2
describes a general model of the multilingual information system and introduces the
architecture of our bilingual English-Chinese system for the WWW. We discuss the
Chinese-English query translation in Section 3. Section 4 specifies how to integrate
the query translation of CLIR with several searching engines on the WWW. Section 5
describes the online and real-time web translation. Section 6 evaluates such a
multilingual information system from different users' viewpoints. Section 7
gives concluding remarks.
2 Multilingual Information System
Some of the multilingual requirements for computer systems are as follows:
1. Data Representation: character sets and coding systems
2. Data Input: input methods and transliterated input
3. Data Display and Output: font mapping
4. Data Manipulation: the application must be able to handle the different
coding characters
5. Query Translation: to translate the information need of users
6. Document Translation using Machine Translation (MT): to translate
documents
The first three requirements have been resolved by system applications in several
computer operating systems. Some applications and packages can also handle
both single-byte and multiple-byte coding systems for Indo-European and Eastern-
Asian languages. However, the language barrier becomes the major problem for
people to access the multilingual documents. How to incorporate the capability of
language translation to meet requirements 5 and 6 becomes indispensable for
multilingual systems.
`
2.1 Four-Layer Multilingual Information System (MLIS)
Fig. 1 shows a four-layer multilingual information system. We put the different
processing systems on the four layers:
Layer 1: Language Identification (LI)
Layer 2: Text Processing Systems
Layer 3: Language Translation Systems
Layer 4: User Interface (UI)
`
`
[Figure: layered diagram leading from Multilingual Resources (Multiple Languages) through Text Processing and Language Translation to the user's Native Language(s)]
Fig. 1. A Four-Layer Model of Multilingual Information System (MLIS)

Fig. 2. The Overall Architecture of MTIR System
`
`
Because most natural language processing techniques (e.g., lexical analysis,
parsing, etc.) depend on the language of the processed document, layer 1
resolves the language identification problem before text processing. The language
identification system employs cues from the different character sets and coding
systems of languages. At layer 2, the systems may perform information extraction,
information filtering, information retrieval, text classification, text summarization, or
other text processing tasks. Some of the text processing systems may interact
with one another. For example, the relevant documents retrieved by an IR
system can be summarized for users. Additionally, a multilingual text processing
system should be able to handle the different coding characters to meet
requirement 4 (data manipulation). Several searching engines (e.g., AltaVista,
Infoseek, etc.) have the ability to index documents in multiple languages. The
language translation systems at layer 3 are used to translate the information need of
users for the text processing systems and to translate the resultant documents from
the text processing systems into the users' native languages. The user interface is the
layer closest to users. It gets the user's information need (including parameters,
query and user profile) and displays the resultant document to the user.
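As a rough illustration of layer 1, the sketch below guesses whether a text is Chinese or English from its characters. It is only a stand-in: the source says the system uses cues from character sets and coding systems but gives no details, so the function names, the Unicode range check and the 0.3 threshold are all assumptions made for this example.

```python
def cjk_ratio(text: str) -> float:
    """Fraction of non-space characters in the CJK Unified Ideographs block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    cjk = sum(1 for c in chars if "\u4e00" <= c <= "\u9fff")
    return cjk / len(chars)


def identify_language(text: str, threshold: float = 0.3) -> str:
    """Return "zh" when enough characters are Chinese, otherwise "en".

    The threshold is a hypothetical value chosen for illustration only.
    """
    return "zh" if cjk_ratio(text) >= threshold else "en"
```

A caller could route `identify_language(query)` to decide whether query translation is needed before the text processing layer runs.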
`
2.2 Bilingual English-Chinese Information System for WWW
On the WWW, distinct systems can be easily integrated into a larger distributed
system using the HTTP protocol. Each system can be invoked using the URL of a
CGI program. First, the CGI program gets input data from the caller. Then the
caller gets the resultant document from the server system. Fig. 2 shows the basic
architecture of the MTIR system. Users express their intention by inputting URLs of
web pages or queries in Chinese/English. A Chinese query is translated into its
English counterpart using the query translation mechanism. The translations of query
terms are disambiguated using word co-occurrence relationships. Then the system
sends the translated query to the searching engine that the user selected in the user
interface. The query subsystem takes care of the user interface part.
The subsequent navigation on the WWW is under the control of the
communication subsystem. To minimize Internet traffic, a caching module is
present in this subsystem and some proxy systems are used to process the request.
The objects in the cache are checked when a request is received. If the requested
object is not found, the communication system fetches the HTML file (.htm or .html
file) or text file (.txt or .text file) from the neighboring proxy systems or the original
server.
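The cache check described above can be sketched as follows. This is a minimal illustration, not the MTIR implementation: the class name, the injected `fetch` callable (standing in for a proxy or origin-server request) and the `misses` counter are assumptions introduced for the example.

```python
from typing import Callable, Dict
from urllib.parse import urlsplit

# File suffixes the text names as translatable HTML or text files.
TEXT_SUFFIXES = (".htm", ".html", ".txt", ".text")


def is_text_document(url: str) -> bool:
    """True when the URL path ends with one of the suffixes listed in the text."""
    return urlsplit(url).path.lower().endswith(TEXT_SUFFIXES)


class CachingFetcher:
    """Check the cache first; call the (hypothetical) fetcher only on a miss."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self._fetch = fetch          # stands in for a proxy or origin request
        self._cache: Dict[str, str] = {}
        self.misses = 0

    def get(self, url: str) -> str:
        if url not in self._cache:
            self.misses += 1
            self._cache[url] = self._fetch(url)
        return self._cache[url]
```

Injecting the fetcher keeps the sketch free of network access; a real system would also need expiry, since the text notes that web content is updated frequently.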
`
The HTML analyzer examines the retrieved file. It divides the whole file into
several translation segments for the machine translation subsystem. The HTML tags
such as title, headings, unordered lists, ordered lists, definition lists, forms and tables
play roles similar to those of punctuation marks like full stop, question mark and
exclamation mark. In contrast to the above tags, the font style elements, e.g., bold,
italic, superscripts, subscripts and font styles, may produce many unknown words
because a whole word is split into several parts. Thus these font style elements
should be hidden from the attributed words during translation processing.
After receiving the first translated document, users may access other information
`
through the hyperlinks. We attach our system's URL to those URLs that link to
HTML files or text files. This guarantees that successive browsing stays linked
with our system. The other URLs, including inline images and external MIME
objects, are changed into their absolute URLs. In other words, the non-textual
information is received from the original servers. Our experimental system is
accessible at the following URL:
http://mtir.csie.ntu.edu.tw
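The link handling described above can be sketched as follows. The host name comes from the text, but the `/translate?url=` endpoint and the function name are hypothetical, since the source does not give the gateway's request format. Treating only the listed file suffixes as translatable is also a simplification (a bare directory URL is left untouched here).

```python
from urllib.parse import quote, urljoin, urlsplit

# Hypothetical gateway endpoint; only the host name appears in the source.
GATEWAY = "http://mtir.csie.ntu.edu.tw/translate?url="
TEXT_SUFFIXES = (".htm", ".html", ".txt", ".text")


def rewrite_link(page_url: str, href: str) -> str:
    """Route HTML/text links through the gateway; make other links absolute."""
    absolute = urljoin(page_url, href)
    if urlsplit(absolute).path.lower().endswith(TEXT_SUFFIXES):
        # Successive browsing of textual pages stays inside the system.
        return GATEWAY + quote(absolute, safe="")
    # Images and other objects are fetched from the original server.
    return absolute
```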
3 Query Translation
Several approaches have been proposed for CLIR recently. There are four main
approaches for query translation:
1. Dictionary-based approach [5-8]
2. Corpus-based approach [9-10]
3. Hybrid approach (combined dictionary-based and corpus-based) [6]
4. Machine Translation based approach (MT-based) [11]
Because large parallel Chinese-English corpora are not available, the
dictionary-based approach is adopted in our system. The query translation for
Chinese-English CLIR consists of three major steps:
1. Word segmentation: To identify the word boundaries of the input stream of
Chinese characters.
2. Query translation: To construct the translated English query using the
bilingual dictionary. The translation disambiguation is done using the
monolingual corpus.
3. Monolingual IR: To search the relevant documents using the translated
queries.
The segmentation and the query translation use the same bilingual dictionary in
this design. That speeds up the dictionary lookup and avoids the inconsistencies
resulting from two dictionaries (i.e., segmentation dictionary and transfer dictionary).
This bilingual dictionary has approximately 90,000 terms. The longest-matching
method is adopted in Chinese segmentation. The segmentation processing searches
for a dictionary entry corresponding to the longest sequence of Chinese characters
from left to right. After identification of Chinese terms, the system selects some of
the translation equivalents for each query term from the bilingual dictionary. The
terms of a query can be translated at two different levels of dictionary translation:
word-level (word-by-word) and phrase-level translations. Those terms missing
from the transfer dictionary are passed unchanged to the final query.
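The longest-matching segmentation described above can be sketched as follows. The three-entry dictionary is a toy stand-in for the 90,000-term bilingual dictionary, and the function name is an assumption. Characters with no dictionary entry are emitted as single-character tokens, which mirrors the statement that missing terms are passed through unchanged.

```python
from typing import List, Optional, Set


def segment(text: str, dictionary: Set[str], max_len: Optional[int] = None) -> List[str]:
    """Greedy left-to-right longest-match segmentation."""
    if max_len is None:
        max_len = max((len(w) for w in dictionary), default=1)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until an entry matches.
        for n in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + n]
            if n == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += n
                break
    return tokens


# Toy dictionaries built from the paper's running example.
word_level = {"奇異", "值", "分解"}
phrase_level = word_level | {"奇異值分解"}
```

With `word_level` the string 奇異值分解 is split into three terms; adding the multi-term concept to the dictionary makes the same string match as a single phrase, which is the difference between word-level and phrase-level translation.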
3.1 Selection Strategies
When there is more than one translation equivalent in a dictionary entry, the following
selection strategies are explored.
(1) Select-All (SA): The system looks up each term in the bilingual
dictionary and constructs a translated query by concatenating all the senses of the
terms.
(2) Select-Highest-Frequency (SHF): The system selects the sense with the
highest frequency in the target language corpus for each term. Because the translation
probabilities of senses for each term are unavailable without a large-scale word-
aligned bilingual corpus, the translation probabilities are reduced to the probabilities
of senses in the target language corpus. So, the frequently-used transferring sense of
a term is used instead of the frequently-translated sense.
(3) Select-N-POS-Highest-Frequency (SNHF): This strategy selects the
highest-frequency sense of each POS candidate of the term. If the term has N POS
candidates, the system will select N translation senses. In contrast,
strategy (2) always selects only one sense for each term.
(4) Word co-occurrence (WCO): This method classifies words on the basis of
their co-occurrence with other words. The translation of a query term can be
disambiguated with the co-occurrence of its translation equivalents and other words'
equivalents. The mutual information (MI) of word pairs reflects the word
association norms in one language. If two words x and y have probabilities P(x) and
P(y), their mutual information [12] is defined to be

$$I(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}$$

This method considers the content around the translation equivalents within the text
collection to decide the best target equivalent. The mutual information of word pairs
is trained using a window size of 3 in the CACM text collection [13]. In total, there are
247,864 word pairs.
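The mutual information formula and a co-occurrence-based choice can be sketched as follows. The source does not spell out how MI scores are combined across query terms, so picking the combination with the largest sum of pairwise MI is an assumption, as are the function names. Any scores passed in are the caller's own; none are taken from the paper.

```python
import math
from itertools import product
from typing import Dict, FrozenSet, List, Tuple


def mutual_information(p_xy: float, p_x: float, p_y: float) -> float:
    """I(x, y) = log2( P(x, y) / (P(x) * P(y)) )."""
    return math.log2(p_xy / (p_x * p_y))


def select_by_cooccurrence(
    candidates: List[List[str]],
    mi: Dict[FrozenSet[str], float],
) -> Tuple[str, ...]:
    """Pick one equivalent per term so that the summed pairwise MI is largest.

    Pairs missing from `mi` count as 0 (no co-occurrence relation).
    """
    def score(combo: Tuple[str, ...]) -> float:
        return sum(
            mi.get(frozenset((a, b)), 0.0)
            for i, a in enumerate(combo)
            for b in combo[i + 1:]
        )

    return max(product(*candidates), key=score)
```

Enumerating every combination is exponential in the number of terms, which is tolerable for the short queries discussed here but would need pruning for long ones.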
Table 1 illustrates an example of the different translations. The Chinese concept
'奇異值分解' (jiyi zhi fenjie) and its phrase-level translation 'singular value
decomposition' are employed. Four translated representations using different
selection strategies on the word-level translation are shown in Table 1 (a). Column 3
shows the translation equivalents in the transfer dictionary for the query terms at word-
level. Table 1 (b) lists the mutual information of some word pairs of translation
equivalents. Most word pairs have no co-occurrence relations. In this
example, the equivalent 'singular' of the term '奇異' (jiyi) has the largest MI score
with all translation equivalents of the other two words.
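The first three selection strategies can be sketched on the entry for '奇異' (jiyi) from Table 1 (a). The dictionary entry follows the table, but the frequency counts are hypothetical placeholders chosen for this example (the source does not list corpus frequencies), and the function names are assumptions.

```python
from typing import Dict, List

# POS -> translation equivalents, following the Table 1(a) row for jiyi.
ENTRY: Dict[str, List[str]] = {
    "N": ["oddity", "singularity"],
    "ADJ": ["singular"],
}

# Hypothetical target-corpus frequencies, for illustration only.
FREQ: Dict[str, int] = {"oddity": 1, "singularity": 3, "singular": 10}


def select_all(entry: Dict[str, List[str]]) -> List[str]:
    """SA: keep every sense of the term."""
    return [sense for senses in entry.values() for sense in senses]


def select_highest_frequency(entry: Dict[str, List[str]], freq: Dict[str, int]) -> str:
    """SHF: keep the single most frequent sense."""
    return max(select_all(entry), key=lambda s: freq.get(s, 0))


def select_n_pos_highest_frequency(
    entry: Dict[str, List[str]], freq: Dict[str, int]
) -> List[str]:
    """SNHF: keep the most frequent sense of each POS candidate."""
    return [max(senses, key=lambda s: freq.get(s, 0)) for senses in entry.values()]
```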
3.2 Experiments and Evaluations
In the following experiments, the word-level translations and the phrase-level
translations are examined to demonstrate the problems arising from missing terminology
and multi-term concepts. In addition, we evaluate these selection strategies with the
long and the short versions of queries. The short queries are used to simulate the
behavior of our methods for the WWW. The SMART information retrieval system [14] is
utilized to measure the similarity of the query and each document using the vector
space model. The query weights are multiplied by the traditional IDF factor. The test
collection CACM is used to evaluate the performance of different selection strategies.
This collection contains 3204 texts and 64 queries in English. Each query has relevance
judgements. The average number of words in a query is approximately 20.
In order to test the effectiveness of query translation, we create the Chinese
queries by manually translating the original English queries to Chinese ones. The
Chinese queries are regarded as the input queries later. Each Chinese query is
translated to four target queries using different selection strategies. The following
experiments compare the retrieval performances of the four translated versions of
`
Chinese queries to the results of the original English queries. One example of the
original English query, human translated Chinese version, and translated queries is
shown in Table 2. It gives the segmented Chinese string and four automatically
translated representations for the CACM Q1. Parentheses surround the English
multi-term concepts and the brackets surround the translation equivalents of each
term.
To compare the performances of the word-level translation and phrase-level
translation, the CACM English queries are manually checked to find the multi-term
concepts that are not contained in our bilingual dictionary. These concepts and their
translations are added into the bilingual dictionary for the phrase-level experiments.
In total, 102 multi-word concepts (e.g., remote procedure call ([...]),
singular value decomposition (奇異值分解), etc.) are identified in the CACM queries.
`
Table 1. Different translations of the Chinese concept '奇異值分解' (singular value decomposition)

Table 1(a). Translated representations based on different strategies
Term | POS | SA | SHF | SNHF | WCO
奇異 (jiyi) | N | oddity singularity | | singularity |
 | ADJ | singular | singular | singular | singular
值 (zhi) | N | value worth | value | value | value
分解 (fenjie) | N | decomposition analysis dissociation cracking disintegration | | decomposition | decomposition
 | V | analyze anatomize decompose decompound disassemble dismount resolve | analyze | analyze |
 | XV | (split up) (break up) | | (split up) |

Table 1(b). The mutual information for some word pairs
Word | Equivalent | Index
奇異 (jiyi) | oddity | w11
 | singular | w12
 | singularity | w13
值 (zhi) | value | w21
 | worth | w22
分解 (fenjie) | analysis | w31
 | decomposition | w32
 | analyze | w33
 | decompose | w34
 | decompound | w35
 | resolve | w36
[MI matrix over w11 ... w36: symmetric, with most cells empty (-); the surviving scores are 6.099, 4.115, 6.669, 1.823 and 4.377, whose cell positions are not recoverable]
`
Table 2. The Chinese query and four translated representations for CACM Q1
Original Query: What articles exist which deal with TSS 'Time Sharing System', an operating
system for IBM computers?
Chinese Query: [Chinese text illegible]
1 Segmentation: [Chinese text illegible]
2.1 SA: those article [be yes yah yep] about TSS '[minute cent apportion deal dissever
sharing] time [formation lineage succession system]', [a ace mono] [class
seed] IBM [computer computing] of [(operating system) (operation system)
OS]
2.2 SHF: those article be about TSS 'deal time system', a class IBM computer of
(operating system)
2.3 SNHF: those article [be yes] about TSS '[minute deal] time system', [a mono] class
IBM computer of [(operating system) OS]
2.4 WCO: those article be about TSS 'sharing time system', a class IBM computer of
(operating system)
`
Over a wide range of operational environments, the average length of user-supplied
queries is 1.5 - 2 words and rarely more than 4 words. Hull and Grefenstette [7]
work with the short versions of queries (average length of seven words) from French
to English in TREC experiments. But no comparison of the short and long queries is
available. To evaluate the behavior of users' short queries, we make additional
experiments to compare with the results of the original long queries. Three
researchers helped us to create the English and Chinese versions of short queries from
the original English queries of CACM. For example, the short version of CACM Q1
is "TSS Timing Sharing System". On average, the short query has nearly 4 words,
including single-word terms and multi-term concepts. The short version of the English
queries is regarded as the baseline to compare the results of translated queries of the
short Chinese queries.
The overall results are shown in Fig. 3. The 11-point average precision [15] of
the monolingual short English queries is 29.85%. It achieves 83.42% of the
performance of the original English queries. In word-level experiments, the best
strategy, WCO (word co-occurrence), reaches 72.96% of the performance of the
monolingual English short version and 65.18% of the monolingual original English
version. At phrase-level, WCO achieves 87.14% and 74.71% respectively. The SHF,
SNHF, and WCO selection strategies perform better on the long queries than on the
short ones. However, the simple SA strategy has the opposite result. Because users
give more specific terms in short queries, the SA strategy introduces fewer extraneous
terms into the query. Alternatively, the phrase-level translation improves the
performance by 14~31% over the word-level translation for Chinese-English CLIR.
Combining the phrase dictionary and co-occurrence disambiguation can bring the
performance of CLIR up to 87% of monolingual retrieval in short queries. Recall
that the multi-word concepts and their translations are added to the dictionary in our
experiments after the domain experts have examined the queries. Hence the coverage
of the bilingual phrasal dictionary will affect the performance of CLIR. Even when the
bilingual dictionary does not contain these multi-word concepts, the WCO method
still achieves nearly 70% of monolingual effectiveness for different lengths of query at
word-level translation.
`
11-point average precision (%)
Translation level | SA | SHF | SNHF | WCO | Monolingual
word-level | 16.39 | 21.89 | 19.33 | 23.32 | 35.78
phrase-level | 20.45 | 26.41 | 23.62 | 26.73 | 35.78
word-level (short query) | 18.28 | 19.57 | 17.42 | 21.78 | 29.85
phrase-level (short query) | 23.36 | 24.93 | 22.92 | 26.01 | 29.85

Fig. 3. The comparison of retrieval performances of query translations for the long
queries and short queries in different levels of translations
`
4 Search Engines
Six popular search engines are integrated with language translation in our MTIR
system. The user inputs a query and selects one of the search engines in the user
interface. The Chinese query terms will be translated to English ones. After the
processing of query translation, our system will send an HTTP request composed of the
translated query to the chosen search engine. The retrieved results from the search
engine will be translated to the user's native language (Chinese). In general, the CGI
program of the searching engine processes the HTTP request of the query. For instance,
"machine translation" is the translated query of the Chinese query "機器翻譯" (jiqi
fanyi). The HTTP requests for the CGI programs of several search engines are listed
in Table 3. The query words should be separated with the symbol '+' for the
standard URL encoding.
Hull and Grefenstette [7] give five different definitions for multilingual
information retrieval. Type 4 is "IR on a multilingual document collection,
where queries can retrieve documents in multiple languages". How to merge and
rank the retrieved documents in different languages is a problem in CLIR. Among
these systems, AltaVista and Infoseek have indexed both English and Chinese
web pages. If a bilingual query ("機器翻譯+machine+translation") is invoked, the
two systems will list the relevant documents of both languages. However, the
ranking of documents in different languages does not seem good. It is still a problem
for multilingual IR.
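Building such a request can be sketched with the Python standard library, whose `urlencode` joins query words with '+' as described above. The helper name and its parameters are assumptions for this example; the Excite URL pattern is the one listed in Table 3, and those 1998 endpoints are shown for illustration rather than as working services.

```python
from typing import Dict, Optional
from urllib.parse import urlencode


def build_search_url(
    base: str,
    query_param: str,
    query: str,
    extra: Optional[Dict[str, str]] = None,
) -> str:
    """Compose the HTTP GET request for a search engine's CGI program."""
    params = {query_param: query}
    if extra:
        # Engine-specific fixed parameters, e.g. result count.
        params.update(extra)
    # urlencode uses quote_plus, so spaces between query words become '+'.
    return base + "?" + urlencode(params)


url = build_search_url("http://search.excite.com/search.gw", "search", "machine translation")
```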
`
`
Table 3. HTTP Requests for the CGI Programs of Searching Engines
Search Engine | Chinese Indexing | HTTP Request for the CGI Program
AltaVista | Yes | http://www.altavista.digital.com/cgi-bin/query?pg=q&what=web&kl=XX&q=machine+translation&search.x=35&search.y=9
Excite | No | http://search.excite.com/search.gw?search=machine+translation
Infoseek | Yes | http://www.infoseek.com/Titles?qt=machine+translation&col=WW&sv=IS&lk=noframes&nh=10
Lycos | No | http://www.lycos.com/cgi-bin/pursuit?matchmode=and&cat=lycos&query=machine+translation&x=30&y=4
MetaCrawler | No | http://www.metacrawler.com/crawler?general=machine+translation&method=0&target=&region=0&rpp=20&timeout=5&hpe=10
Yahoo | No | http://search.yahoo.com/bin/search?p=machine+translation
`
5 Document Translation
The requirement for an online machine translation system for users to navigate on
the WWW is different from that of traditional off-line batch MT systems. An assisted
MT system should help users quickly understand the Web pages and find the documents
of interest during navigation on very large information resources. That is,
different users' behaviors affect the requirements of machine translation systems.
From the users' viewpoint, a high-quality and high-speed online machine translation
is required. However, several steps should be performed after a query is issued. It
takes time for the transfer of the query, the query translation, the retrieval of the
document satisfying the query, the transfer of the retrieved document and the
document translation. How to find the tradeoff between the speed performance and
the translation performance on the WWW is an important issue. Besides this issue,
our previous work [1] addressed four other issues in designing online machine
translation systems for the WWW, including which material is translated, what roles
the HTML tags play in translation, what form the translated result is presented in,
and where the translation capability is implemented.
Many different approaches to machine translation design have been proposed [16-
21]. These include rule-based, example-based, statistics-based, knowledge-based,
and glossary-based approaches. A hybrid approach [22] integrates the advantages of
these approaches and tries to get rid of their disadvantages. A rule-based partial
parsing method is adopted and the translation process is performed chunk by chunk.
We follow this design strategy and consider the characteristics of web translation. The
following sections depict the details of the analysis, transfer and synthesis modules.
5.1 Analysis Module
At first, we identify the sentence types of source sentences using sentence delimiters.
Some structural transfer rules can only be applied to some types of sentences. Then,
we perform a morphological analysis. The words in morphological forms (e.g., +ed,
+ing, +ly, +s, etc.) are tagged with morphological tags, which are useful for part-
of-speech tagging, word sense disambiguation and the generation of the target sense
using the sense of the root word.
After morpheme processing, the words in root forms are searched from various
`
dictionaries using the longest-matching strategy. There are about 67,000 word
entries in an English-Chinese general dictionary and 5,500 idioms in a phrasal
dictionary. In addition, some domain specific dictionaries are required for better
translation performance. After dictionary lookup, the idioms and the compound
words are treated as complete units for POS tagging and sense translation.
In consideration of speed and robustness, a three-stage hybrid method is
adopted to deal with part-of-speech tagging. It treats the certain cases using
heuristic rules, and disambiguates the uncertain cases using a statistical model. At
stage 1, the words with specific morphological tags can be tagged without ambiguities.
For example, a word of the pattern ADJ+ly is tagged with RB. The tagging of
some morphological words depends on the morphological tag and the POS of the root
form. For example, if the dictionary tag of the root of a word (root-er) is JJ, then
this word is an adjective. Otherwise, it is a noun. Besides, if a word does not have
any morphological tags and has only one POS candidate in the dictionary, then the
unique POS is assigned to this word. At stage 2, a pattern matching method that
considers the morphological tags of the current and the next words, as well as the
POS of the next word, is employed to do the POS tagging. Stage 3 deals with the
remaining words, which have not been tagged up to now. A statistical bigram HMM
model is then applied to resolve the uncertain cases.
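The stage-1 heuristics can be sketched as a small rule function. The three rules follow the examples in the text, but the function signature, the morphological tag strings and the use of "JJ" and "NN" as output tags are assumptions made for illustration; words that no rule covers return `None` and would fall through to stages 2 and 3.

```python
from typing import List, Optional


def tag_stage1(
    morph_tag: Optional[str],
    root_pos: Optional[str],
    candidates: List[str],
) -> Optional[str]:
    """Return a POS tag when a stage-1 rule applies, otherwise None."""
    # ADJ+ly is tagged as an adverb.
    if morph_tag == "+ly" and root_pos == "JJ":
        return "RB"
    # root-er: adjective if the root is an adjective, otherwise a noun.
    if morph_tag == "+er":
        return "JJ" if root_pos == "JJ" else "NN"
    # No morphological tag and a single dictionary candidate: take it.
    if morph_tag is None and len(candidates) == 1:
        return candidates[0]
    return None
```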
To reduce the cost of full parsing in a real-time service, we adopt a partial parser
to get the skeletons of sentences. An NP/ADJP finite state machine (FSM) is used to
segment the source sentence into a sequence of chunks. This FSM analyzes the tag
sequence, and recognizes the fundamental noun phrases and adjective phrases in
linear time. Then a predicate-argument detector is applied to analyze the skeleton
of the sentence [23]. The determination of PP attachment is based on the rule templates
[24].
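A linear-time chunker over a tag sequence can be sketched as a two-state machine. The source does not give the FSM's actual pattern, so the base noun phrase shape used here (an optional DT, any number of JJ, then one or more NN* tags) and the function name are assumptions, and adjective phrases are left out for brevity.

```python
from typing import List, Optional, Tuple


def chunk_noun_phrases(tags: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs of base noun phrases, end exclusive.

    Each tag is examined once, so the scan is linear in the sentence length.
    """
    chunks: List[Tuple[int, int]] = []
    start: Optional[int] = None   # None means we are outside any chunk
    seen_noun = False
    for i, tag in enumerate(tags + ["<END>"]):   # sentinel flushes the last chunk
        is_noun = tag.startswith("NN")
        if start is None:
            if tag in ("DT", "JJ") or is_noun:
                start, seen_noun = i, is_noun
        elif is_noun:
            seen_noun = True
        elif tag == "JJ" and not seen_noun:
            continue                              # still in the modifier prefix
        else:
            if seen_noun:
                chunks.append((start, i))
            # The current tag may itself open a new chunk.
            start = i if tag in ("DT", "JJ") else None
            seen_noun = False
    return chunks
```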
5.2 Transfer Module
The structural transfer, the tense transfer and the lexical selection deal with the
differences between the source and target languages. The major structural transfers
occur in the comparative clauses, the question sentences and the modifications of noun
phrases. The structure of noun phrases is left-recursive in English, but right-
recursive in Chinese. Due to the recursion in the noun phrases, the transferred target
structure is treated as a whole chunk for the subsequent processing. For different
tenses, the words "have" and "be" have different senses in Chinese.
Phrases and idioms are treated as complete units during lexical selection. A
bilingual phrase dictionary is employed to produce phrase-by-phrase translation. For
the remaining words, several word selection algorithms, such as select-first,
select-the-highest-frequency-word and the mutual information method, may be adopted
to select the target sense. The select-first method always selects the first translation
sense from the candidates with the matched POSes. The second method chooses the target
sense with the highest occurrence probability, trained from a large-scale corpus of the
target language. The mutual information model considers the content around the
words to decide the best combination of target words. Different models access
various training tables. The larger the table is, the more time it takes. Section 6
will discuss the time complexity, the table space and the translation accuracy.
`
`
5.3 Synthesis Module
The synthesis module deals with word insertion, deletion and word order refinement.
For example, if the source word with morpheme tag +ly is tagged as adverb (RB)
and derived from the adjective (JJ) word form, the target sense will be generated by
deleting the character "的" (de) and appending "地" (di). The character
"的" (de) always appears at the end of Chinese adjectives, and the character "地" (di)
at the end of adverbs. In addition, if the present participle or the past participle is
tagged as adjective, the character "的" (de) is inserted into the target sense.
Our previous work [1] introduced the generation of bilingual aligned document for
web translation s