`
`
`AOL Ex. 1024
`Page 1 of 17
`
`
`
`
`
`Information Processing and Management 36 (2000) 275-289
`www.elsevier.com/locate/infoproman
`
`
A system for supporting cross-lingual information retrieval

Joanne Capstick a,*, Abdel Kader Diagne a, Gregor Erbach a, Hans Uszkoreit a,
Anne Leisenberg b, Manfred Leisenberg b
a German Research Center for Artificial Intelligence, Language Technology Lab, Stuhlsatzenhausweg 3, 66123 Saarbrücken, Germany
b Bertelsmann Online-Media-Service, Carl-Bertelsmann-Str. 16] 0, 33311 Gütersloh, Germany
`
`
`
`
`
Abstract

In this paper, we present MULINEX, a fully implemented system which supports cross-lingual search of the WWW. Users can formulate, expand and disambiguate queries, filter the search results and read the retrieved documents by using their native language alone. This multilingual functionality is achieved by the use of dictionary-based query translation, multilingual document categorisation and automatic translation of summaries and documents.
The system supports French, German and English and has been installed and tested in the online services of two European internet content and service provider companies.
This paper focuses on the techniques and algorithms used in the MULINEX system, explaining how each component works and how it contributes to the overall functionality of the integrated system. The primary system functionalities are outlined from the user perspective, followed by a description of the document database used in the system. The technologies and linguistic resources used in the various system components are then described in detail. © 2000 Published by Elsevier Science Ltd. All rights reserved.
`
`
* Corresponding author.
E-mail address: mulinext@dfki.de (J. Capstick).

0306-4573/00/$ - see front matter © 2000 Published by Elsevier Science Ltd. All rights reserved.
PII: S0306-4573(99)00058-8

1. Introduction

With the steady increase of internet users outside the US, English is losing its dominant position in the internet and we are witnessing the emergence of a truly multilingual medium. The interest in supporting a multilingual internet is demonstrated through web internationalisation initiatives and through the addition of language technologies, such as language identification and machine translation, by leading search engines. Cross-lingual retrieval takes this trend further and provides the means for accessing multilingual internet content, by enabling queries made in one language to retrieve documents in one or more other languages.
In this paper, we present the system MULINEX, whose objective is to enable users to search in multilingual document collections using their native language, supported by an effective combination of linguistic and information retrieval technologies.
Both monolingual and cross-lingual full-text retrieval are faced with the problem of understanding the actual intention of a user query. Two complementary strategies can help alleviate this situation: firstly, query formulation support to aid the user in making more focussed queries, and, secondly, tools for filtering and navigating through search results, providing users with accurate and efficient access to those documents which satisfy their information needs. For dictionary-based query translation, support for interactive query translation disambiguation is crucial to avoid a loss of precision through inaccurate translations.
The MULINEX system described in this paper combines current information retrieval technology with state-of-the-art language technologies. The system emphasises user-friendly interaction, which supports the user by offering query translation and expansion; by presenting search results along with information about language, thematic category, and automatically generated summaries; and by allowing the user to filter results according to multiple criteria. The basic components are embedded in an object-oriented, manager-based architecture, providing a flexible system with potential for extendibility and re-usability.
This paper focuses on the techniques and algorithms used in the MULINEX system,¹ explaining how each component works and how it contributes to the functionality of the overall integrated system. The primary system functionalities are outlined from the user perspective, followed by a description of the document database. The technologies and linguistic resources used in the various system components are then described in detail.
`
`2. Functionality for the user
`
MULINEX is aimed at users who want to retrieve information from the WWW which may be represented in web pages in different languages. Users need not have any knowledge of a foreign language, since the cross-language retrieval process is fully supported by the translation of queries, of summaries and of retrieved documents. However, the system is equally useful for users with some knowledge of the foreign languages, since it provides convenient support for query translation, and allows filtering of the search results by language and thematic categories.
A user requirements study was carried out at the beginning of the project. The study consisted of questionnaires which were filled out by users of internet service and content providers in Germany and France (Hernandez, 1997), and of psychological experiments with 84 subjects based on a mock-up version of the system (Capstick, Erbach & Uszkoreit, 1998).

¹ (Erbach et al., 1998) provides more detailed information on the social and economic factors influencing the project, the objectives of the project consortium members, and the user requirements for the system.
`
[Figure 1: screenshot of the search interface. The query form (search terms, query language, and the document languages English, French and German) appears above a results list in which each hit shows its language, title, categories, summary, URL and size.]

Fig. 1. Query form and search results presentation.
`
`
Based on the results of the user requirements analysis, the system provides the following functionality to support the user in retrieving documents from multilingual document collections:

• translation of the user's query;
• interactive disambiguation of the query translation (optional);
• interactive query expansion (optional);
• simultaneous search in English, German and French document collections;
• informative presentation of search results, with summary, language and thematic category;
• filtering of search results by language and category;
• on-demand translation of summaries and search results.

Cross-language retrieval research started with Salton's seminal paper (Salton, 1973), and has become an active field of research over the past four years (Grefenstette, 1998; Hull & Oard, 1997; Yang, Carbonell, Brown & Frederking, 1998). In terms of Oard's classification of cross-language retrieval approaches (Oard, 1997a), the query translation approach adopted in our system is a knowledge-based approach, as opposed to the corpus-based approach based on comparable corpora adopted in Sheridan and Ballerini (1996) and to approaches which construct parallel corpora by means of automatic document translation (Hiemstra & Kraaij, 1998; Kraaij & Hiemstra, 1998; Oard, 1997b).
We use the query translation approach because of the lack of substantial parallel or comparable multilingual corpora,² and because we felt that document translation was not scalable to very large amounts of data because of the resource requirements of machine translation systems.
We will now illustrate how the user interacts with the system. Queries are formulated by keywords as in a standard WWW search engine (see the search box at the top of Fig. 1). Since automatic language identification of short queries is error-prone, the query language must be specified by the user. The user can also select the acceptable document languages.
The user interface is available in English, German and French, and can be extended to other languages. Users can switch the user interface language at any time during the interaction.
In the next step, the query is translated into the selected target languages. Since search engine queries typically do not provide enough context for automatic disambiguation,³ the "query assistant" provides the opportunity for interactive disambiguation of the query translations (see Fig. 2 for the translations and expansions of the query term fair). In order to help users who do not understand the target language with the disambiguation of query translations, the "query assistant" shows how each translated query term translates back into the original query language. As the example for the query term fair in Fig. 2 shows, the back translations assist the user in eliminating translations into German and French which are irrelevant to the intended meaning,⁴ even though the user may not have any knowledge of German and French. The precision of the German and French queries is thus improved.

² Parallel corpora contain translated documents, while comparable corpora contain texts which are not translations, but talk about the same topic (e.g., two newspaper articles about the same event written by journalists in different countries).
³ Our examination of 100,000 queries submitted to the German web search engine web.de in 1998 revealed an average query length of 1.3 words.

[Figure 2: screenshot of the query assistant, showing the translations, back translations and expansion suggestions for the query term fair.]

Fig. 2. Query assistant.
The paraphrasing of the query term which results from the query translation and back-translation is also used to provide a simple query expansion mechanism which suggests alternative query terms to the user in the original query language. In the example shown in Fig. 2, the query terms trade fair and sales activity could also be selected to expand the query in the original language (English). After query expansion, disambiguation may optionally be repeated. Following query translation and expansion, the documents for each language are searched in parallel with the search terms for this language.
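The back-translation step described above can be sketched in a few lines. This is a minimal illustration, not the MULINEX implementation: the two toy dictionaries, their entries, and the function names are all hypothetical, chosen only to mirror the fair example.

```python
# Hypothetical toy bilingual dictionaries (English -> German and back).
# Real entries would come from the system's translation dictionaries.
EN_DE = {"fair": ["Messe", "gerecht", "blond"]}
DE_EN = {
    "Messe": ["fair", "trade fair"],
    "gerecht": ["fair", "just"],
    "blond": ["fair", "blond"],
}


def translate_with_back_translations(term, forward, backward):
    """Map each translation of `term` to its back translations.

    The original term itself is dropped from the back translations,
    because it tells the user nothing about the sense of the translation.
    """
    options = {}
    for translation in forward.get(term, []):
        options[translation] = [
            back for back in backward.get(translation, []) if back != term
        ]
    return options


def expansion_candidates(term, selected, forward, backward):
    """Collect back translations of the translations the user kept.

    These are offered as expansion terms in the original query language.
    """
    options = translate_with_back_translations(term, forward, backward)
    candidates = []
    for translation in selected:
        for back in options.get(translation, []):
            if back not in candidates:
                candidates.append(back)
    return candidates


if __name__ == "__main__":
    # Show every translation of "fair" with its paraphrases.
    for translation, paraphrases in translate_with_back_translations(
        "fair", EN_DE, DE_EN
    ).items():
        print(translation, "<-", paraphrases)
```

A user who sees Messe paired with trade fair and gerecht paired with just can discard the unwanted sense without reading German, and the surviving back translations double as expansion suggestions.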
Fig. 1 shows how search results are presented to the user. The search form with the original query and options is presented on the results page. The results list contains documents in all languages requested by the user and is sorted by relevance. For each document in the list, the language, title, URL and size are displayed. The document categories are presented in the user interface language and the summary in the document language. Users may request a translation of the summary, which is displayed in a separate window. The translation icon on the right provides automatic document translation.
By selecting the corresponding language tab, the results list can be filtered by language. The category navigation tool on the left hand side of the page enables the user to filter the results by category.
In the following sections, we will specify the information stored about each document in order to support the filtering and presentation of the search results, as well as the technologies and linguistic resources used to achieve the functionality described, and to obtain the information about each document.
`
`3. Document database
`
The core of the system is a database in which certain pieces of information about all documents are stored. In the context of our search engine, a document is a unit of presentation that is accessed by a WWW user by following a hyperlink. A document may be composed of several web pages which are arranged in a frameset. Treating the entire frameset as one document has the advantage that queries made up of several terms can retrieve a frameset in which the terms occur in different frames. For example, the query travel thailand may retrieve a document in which the query terms occur in separate frames. Another advantage is that retrieved frames are presented in the context of their frameset.
Documents in which multiple languages occur are not handled explicitly. Although the language identification module could well identify different languages in a document if it were run separately for each paragraph (or other suitable unit) of a document, we assign only one language on the basis of the text which occurs at the beginning of the document. This decision was taken in order to speed up the document analysis process and because a retrieval system for the WWW cannot easily retrieve (or refer to) a passage or paragraph of a document which is written in a specific language.

⁴ Of course this method will only work to the extent that the intended meaning of the translation of the query term has alternative paraphrases in the original query language. The effectiveness of the method is improved if the underlying dictionary lists the most common translations before rare or domain specific translations.

In order to provide the functionality outlined in Section 2, the following information (see Table 1) is stored about each document.
The Fulcrum SearchServer, a state-of-the-art document management and retrieval system with an SQL-based query language, is used as the document database.
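As a rough illustration of the record structure listed in Table 1, the fields could be modelled as follows. This is only a sketch under stated assumptions: the class name, field names, Python types, and the filtering helper are hypothetical and say nothing about how the Fulcrum SearchServer actually stores or queries its data.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class DocumentRecord:
    """One database record per document, following the fields of Table 1."""
    url: str                     # uniform resource locator
    title: str                   # from the HTML <TITLE> tag
    size: int                    # size as reported via HTTP (unit assumed: bytes)
    date: datetime               # last modification date from HTTP
    language: str                # single language assigned per document
    keywords: List[str] = field(default_factory=list)
    summary: str = ""
    # category name -> similarity value
    categories: Dict[str, float] = field(default_factory=dict)
    full_text: str = ""          # indexed for retrieval


def filter_results(records: List[DocumentRecord],
                   language: Optional[str] = None,
                   category: Optional[str] = None) -> List[DocumentRecord]:
    """Keep the records matching the requested language and/or category."""
    return [
        r for r in records
        if (language is None or r.language == language)
        and (category is None or category in r.categories)
    ]
```

Storing language and categories on the record is what makes the result filtering described in Section 2 a simple selection over stored fields rather than a fresh analysis at query time.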
`
4. Technologies and resources for document analysis

In this section we describe the technologies and linguistic resources which are used to analyse documents in order to obtain the information specified in Table 1. Document analysis takes place during the gathering of documents by means of a web spider.

4.1. Document gathering

Like all WWW search engines, MULINEX makes use of a web spider for the acquisition of documents and of a core information retrieval system for supporting the search. MULINEX extends this basic functionality by performing additional document analysis steps.
Fig. 3 shows the steps of the document acquisition process.⁵ At each step, the information about a document is successively refined. The web spider obtains information that is specified in HTTP and HTML such as size, modification time, the URL, the character encoding and the full text of the document. The document analysis components analyse the content of the document to determine the language and thematic categories, and to create a document summary. All this information is then used to create or update a record in the document database.
Gathering of documents is performed by the Harvest gatherer (Bowman et al., 1994), a highly configurable system for gathering documents from the WWW which respects the Robot Exclusion Protocol.
Harvest consists of two parts: the gatherer (a web spider) and the broker (an indexing and search system based on the Glimpse retrieval engine). In the MULINEX system, we have decided to use only the gatherer and replace the broker by the Fulcrum SearchServer. Fulcrum provides certain advantages over Glimpse as a retrieval engine, notably an SQL-based query language and the capability to sort search results by relevance.

Table 1
Structure of a record in the document database

URL              Uniform resource locator of the document
Title            Title of the document, as specified in the HTML <TITLE> tag
Size             Size of the document, as provided in the HTTP protocol
Date             Last modification date of the document, as provided in the HTTP protocol
Language         Language of the document (see Section 4.2)
Keywords         Keywords, author-specified or automatically extracted (see Section 4.3)
Summary          Summary, author-specified or automatically extracted (see Section 4.3)
Categories       A list of categories and similarity values (see Section 4.4)
Full text index  Full text of the document, indexed for document retrieval

[Figure 3: diagram of the document acquisition process, with boxes labelled WWW, Web spider, Language identification, Morphological analysis and determination of word frequencies, Word frequency statistics, Summarisation and extraction of keywords, Categorisation, and Document database.]

Fig. 3. Document acquisition.
`
`4.2. Language identification
`
Information about the language of a document is a necessary prerequisite for further processing steps: document categorisation, summarisation, and machine translation are all dependent on knowing the document language. Knowledge of the language also improves indexing and retrieval performance by using appropriate stop-lists, stemming, term weights, thesauri, etc. for each language. The language of WWW documents is often not marked by the author even though HTML and HTTP allow the author to provide this information. Therefore, we use a statistical language identifier to determine the language of a document.
Language identification is performed by making use of an algorithm which compares the relative frequencies of the most frequent n-grams (from 1 to 5 characters) in a document to 40 stored language models (Cavnar & Trenkle, 1994).
For each language, the language recogniser uses a language model: the sequence of the 300 most frequent n-grams, ordered by their frequency in a training corpus.⁶ For each document whose language we want to identify, we compute the sequence of its most frequent n-grams. For each n-gram of the document, we compare its rank to the rank of the same n-gram in the language model and sum up the differences. A maximal difference value is used for n-grams which are not present in the language model. The language whose language model has the smallest total difference to the current document is assigned.

⁵ Preprocessing steps which take place prior to document acquisition are shown with a shaded background.
⁶ We used the European Corpus Initiative (ECI) CD-ROM to obtain our training data.

The language identifier was evaluated using data from the European Corpus Initiative CD-ROM in Danish, Dutch, English, French, German, Italian, Norwegian, Portuguese and Spanish, replicating the experiments reported in Grefenstette (1995), which compared language identification schemes based on trigrams and on frequent short words. The evaluation results of our n-gram language identifier and a comparison to Grefenstette's results for the MULINEX languages English, German and French are given in Table 2.

Table 2
Evaluation of three language identification methods (see text)

Language  Method        3-5 words  6-10   11-15  16-20  21 or more
English   n-gram        68.75      97.1   99.3   99.7   100
          trigram       97.2       99.5   99.9   99.9   100
          short words   87.7       97.3   99.8   99.9   100
German    n-gram        95.1       98.4   99.2   99.9   100
          trigram       97.2       99.3   99.8   99.9   100
          short words   71.6       89.6   98.2   99.8   100
French    n-gram        84.9       98     98     98.3   99.5
          trigram       93         94.5   93.6   99.8   100
          short words   81.8       96     97.2   99.8   100

It has been argued that a language identifier based on the frequencies of n-grams of length 1-5 combines the advantages of several well-known language identification methods: it takes into account the frequencies of single letters, of bigrams and trigrams and of frequent short words. However, our experiments showed significant advantages only for French text of a length between 6 and 15 words. For texts of 21 words or more, all algorithms show an almost perfect language identification accuracy.
A separate test showed that the language identification performance for the ECI corpus data did not degrade when the distinction between upper and lower case characters was ignored. For language identification of web pages, we ignore case because capitalised titles and headings often lead to errors in language identification.
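The rank-comparison procedure described in this section can be sketched as follows. It is a minimal illustration in the spirit of Cavnar and Trenkle (1994), not the MULINEX code: the underscore padding of words, the tie-breaking rule, the choice of the model length as the maximal difference value, and the three tiny training samples are all assumptions made for the example. Real models would be trained on a large corpus.

```python
from collections import Counter


def ngram_profile(text, max_n=5, top_k=300):
    """Return the top_k most frequent character n-grams (1..max_n),
    most frequent first. Case is ignored, as described for web pages."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # assumption: mark word boundaries with "_"
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending frequency; break ties alphabetically so that
    # profiles are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, model_profile):
    """Sum of rank differences between a document profile and a model.

    N-grams missing from the model get a maximal difference value
    (assumed here to be the length of the model profile)."""
    max_penalty = len(model_profile)
    model_rank = {gram: rank for rank, gram in enumerate(model_profile)}
    return sum(
        abs(rank - model_rank[gram]) if gram in model_rank else max_penalty
        for rank, gram in enumerate(doc_profile)
    )


def identify_language(text, models):
    """Return the language whose model has the smallest total difference."""
    doc_profile = ngram_profile(text)
    return min(models, key=lambda lang: out_of_place(doc_profile, models[lang]))


# Toy training samples (hypothetical, far too small for real use).
EN_SAMPLE = "the system supports the search of documents in the native language of the user"
DE_SAMPLE = "das system unterstützt die suche nach dokumenten in der muttersprache des benutzers"
FR_SAMPLE = "le système permet la recherche de documents dans la langue maternelle de l'utilisateur"

models = {
    "en": ngram_profile(EN_SAMPLE),
    "de": ngram_profile(DE_SAMPLE),
    "fr": ngram_profile(FR_SAMPLE),
}

if __name__ == "__main__":
    print(identify_language("the language of the documents", models))
```

Because only ranks are compared, the method needs no probability estimates, which keeps the stored model for each language down to a short ordered list.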
`
`4.3. Summarisation
`
Summarisation is performed by selecting the sentences⁷ which best characterise a document. We use sentence selection as our summarisation method because it shows robust performance on a collection of widely varying documents, such as those in the WWW.
There are two summarisers, a neutral (query-independent) and a tailored (query-dependent) summariser. Both work by selection of the most salient sentences. The target length of the summary can be specified, and the summariser will select the most important sentences until the target length has been reached. The query-independent summariser selects sentences which are marked up by structural (headings) and layout-oriented (boldface, italics) markup, and uses heuristics such as selecting the first sentence of a paragraph and making use of term frequency statistics. The query-specific summariser selects sentences in which the query terms (or morphological variants) occur.

⁷ We treat words between punctuation or structural HTML markup as sentences, even if they are not grammatically well-formed sentences.

In the MULINEX system, we use the query-independent summariser during document gathering to generate summaries which are stored in the document database. Use of a query-specific summariser for the search system is not practical, as we do not store the full text of the documents on the server.
In addition to sentence extraction, we extract a set of salient keywords for each document by choosing words which occur frequently in the document, but less frequently in the document collection.
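The keyword criterion above (frequent in the document, less frequent in the collection) is commonly approximated with a tf-idf style score. The sketch below uses that standard weighting as a stand-in; the exact formula used in MULINEX is not given in the text, so the scoring function, its smoothing, and the function name are assumptions.

```python
import math
from collections import Counter


def salient_keywords(doc_tokens, collection, top_k=5):
    """Rank the terms of one document by term frequency times a smoothed
    inverse document frequency computed over the whole collection.

    doc_tokens: list of tokens of the document to describe
    collection: list of token lists, one per document in the collection
    """
    n_docs = len(collection)
    doc_freq = Counter()
    for tokens in collection:
        doc_freq.update(set(tokens))  # count each term once per document

    term_freq = Counter(doc_tokens)

    def score(term):
        # Terms occurring in every document get a score of zero.
        return term_freq[term] * math.log((1 + n_docs) / (1 + doc_freq[term]))

    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(term_freq, key=lambda term: (-score(term), term))
    return ranked[:top_k]
```

With this weighting a word such as a determiner that appears in every document scores zero however often it occurs, while a topical word that is rare in the collection rises to the top.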
`
`4.4. Categorisation
`
The MULINEX system contains three different document categorisation algorithms, each suited to a different categorisation task:

1. n-gram categoriser for noisy input
2. k-nearest-neighbour (KNN) algorithm for normal documents
3. pattern-based categoriser for very short documents

The n-gram categoriser makes use of the frequencies of n-grams of characters in a document, which is compared to a category model: the sequence of most frequent n-grams ordered by their frequency in a training corpus.⁸ Our evaluation has shown that this categoriser did not perform as well for web pages as the KNN categoriser. The n-gram categoriser is more useful in situations with noisy input, such as categorising OCR output.
The k-nearest-neighbour algorithm (Yang, 1994) is a statistical algorithm which classifies a new document by combining the category assignments of the k most similar training documents, weighted by the similarity between the new document and each of the k best matching training documents. The categoriser has been trained with documents from newsgroups in French, German and English.
Although the categorisers are trained separately for each language, multilinguality is achieved by training for the same categories in different languages. For example, we gathered training material for the category politics from the German newsgroup de.soc.politik, the corresponding French group fr.soc.politique and English soc.politics. For medicine, we used all English sci.med.* groups, German de.sci.medizin and French fr.bio.medecine.

⁸ This is the algorithm that is used for language identification. It has been used successfully by Cavnar and Trenkle for the categorisation of Usenet news articles (Cavnar & Trenkle, 1994).

The two statistical categorisers (KNN and n-gram) have been evaluated with German news articles which were obtained from the Federal Press Office of the German government. The task of the Press Office is to distribute news articles to the different departments and ministries of the German government. The training data were categorised according to the government department to which they were distributed. We had a total of 8809 documents with 73 distinct categories; a document can have more than one category if it is sent to more than one government department. The categorisers assigned a number of categories with a confidence value; all category assignments above a fixed cutoff point, which was chosen to maximise the F-measure value on the training set for each category, were used in the evaluation. 80% of the documents were used for training; the rest were retained for testing. Precision and recall were evaluated for each of the 73 categories with the following results (see Table 3).

Table 3
Average precision and recall for two categorisation algorithms

                    KNN     n-gram
Average recall      0.757   0.243
Average precision   0.714   0.361
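The similarity-weighted voting of the k-nearest-neighbour categoriser can be sketched as below. This is a simplified illustration rather than the evaluated system: cosine similarity over raw term counts, the function names, and the single global cutoff (the text describes a separate cutoff per category, tuned for F-measure) are assumptions.

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def knn_categorise(tokens, training, k=3, cutoff=0.0):
    """Assign categories by similarity-weighted voting of the k nearest
    training documents.

    training: list of (token_list, category_list) pairs; a training
              document may carry more than one category
    Returns a dict mapping category -> confidence, keeping only the
    assignments whose confidence exceeds `cutoff`.
    """
    doc = Counter(tokens)
    neighbours = sorted(
        ((cosine(doc, Counter(train_tokens)), cats)
         for train_tokens, cats in training),
        key=lambda pair: -pair[0],
    )[:k]

    confidence = defaultdict(float)
    for similarity, cats in neighbours:
        for cat in cats:
            confidence[cat] += similarity  # vote weighted by similarity

    return {cat: conf for cat, conf in confidence.items() if conf > cutoff}
```

Because each neighbour votes for all of its categories, a document can receive several category assignments, matching the multi-category setting of the Press Office data.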
In addition, there is a pattern-based categorisation algorithm for narrow, specialised categories. The pattern-based categoriser recognises terms which are of interest to a certain domain, such as the names of travel agencies or airlines for the tourism domain. It is used in situations where the pages to be categorised contain only little text and are dominated by graphics and tables. The patterns, expressed in Perl regular expression syntax, have been defined manually, by abstracting over multi-word terms found in a corpus of web pages for the domain of interest.
The pattern-based categoriser has not been formally evaluated due to the lack of categorised training data, but usability testing of the system showed user satisfaction with categorisation results (cf. Section 6).
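A pattern-based categoriser of this kind can be sketched with ordinary regular expressions. The example uses Python's re module, whose syntax is close to Perl's for simple patterns like these; the categories, the patterns themselves, and the hit threshold are hypothetical and are not the manually defined MULINEX patterns.

```python
import re

# Hypothetical hand-written patterns per category.
PATTERNS = {
    "tourism": [
        re.compile(r"\btravel\s+agenc(?:y|ies)\b", re.IGNORECASE),
        re.compile(r"\b\w+\s+airlines?\b", re.IGNORECASE),
        re.compile(r"\blast[- ]minute\b", re.IGNORECASE),
    ],
    "finance": [
        re.compile(r"\binterest\s+rates?\b", re.IGNORECASE),
        re.compile(r"\bstock\s+(?:market|exchange)\b", re.IGNORECASE),
    ],
}


def pattern_categorise(text, patterns=PATTERNS, min_hits=1):
    """Return {category: number of pattern matches} for every category
    with at least `min_hits` matches in the text."""
    result = {}
    for category, compiled in patterns.items():
        hits = sum(len(pattern.findall(text)) for pattern in compiled)
        if hits >= min_hits:
            result[category] = hits
    return result
```

Since a single match is enough to assign a category, this approach still works on pages that contain only a few words of text, where frequency-based categorisers have too little evidence.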
`
`5. Technologies and resources for search and result presentation
`
`In this section, we describe the technologies and resources used to process a user’s query and
`present the search results.
`
`5.1. Query analysis
`
The MULINEX system analyses, translates and expands the users' queries. Since the retrieval performance of automatically translated queries is inferior to monolingual information retrieval, there is an (optional) step of user interaction, where the user can select terms from the translated query and add other translations.
The search syntax supports required search terms (+), excluded search terms (-), and allows the user to block the translation of search terms (!). Full boolean search syntax (AND/OR) is not supported because of the problems encountered with ambiguity in the translation of query terms.
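The search syntax just described can be illustrated with a small parser. This is a sketch under assumptions: the text does not say whether the operators are prefixes or whether they can be combined, so the code assumes whitespace-separated terms with prefix operators that may be stacked (for example +!term). Phrase queries in quotation marks are not handled, and all names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ParsedQuery:
    required: List[str] = field(default_factory=list)      # "+term"
    excluded: List[str] = field(default_factory=list)      # "-term"
    optional: List[str] = field(default_factory=list)      # plain terms
    untranslated: List[str] = field(default_factory=list)  # "!term"


def parse_query(query: str) -> ParsedQuery:
    """Split a keyword query into required, excluded and optional terms,
    and record which terms must not be translated."""
    parsed = ParsedQuery()
    for token in query.split():
        bucket = parsed.optional
        translate = True
        # Consume leading operator characters one at a time.
        while token and token[0] in "+-!":
            operator, token = token[0], token[1:]
            if operator == "+":
                bucket = parsed.required
            elif operator == "-":
                bucket = parsed.excluded
            else:  # "!" blocks translation of this term
                translate = False
        if not token:
            continue  # a bare operator carries no search term
        bucket.append(token)
        if not translate:
            parsed.untranslated.append(token)
    return parsed
```

Keeping the untranslated terms in a separate list lets the translation step skip proper names while the required and excluded markers are carried over unchanged to each target-language query.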
`