March
ISSN 0306-4573
`AOL Ex. 1024
`Page 1 of 17
An International Journal
(Incorporating INFORMATION TECHNOLOGY)
Email: ipm@scils.rutgers.edu
Web site: http://www.elsevier.nl/locate/infoproman
`
Tefko Saracevic
School of Communication, Information and Library Studies
University
Street
New Brunswick, NJ 08903, U.S.A.
tefko@scils.rutgers.edu

Nicholas J. Belkin
School of Communication, Information and Library Studies
University
Street
New Brunswick, NJ 08901-1071, U.S.A.
nick@belkin.rutgers.edu

Department of Computer and Information Science
of Massachusetts
Amherst, 01003, U.S.A.
croft@cs.umass.edu
`
Associate Editor (Europe)

Associate Editor (Book Reviews)

Founding Editor

Amanda
Associate
School of Information Sciences & Technology
The Pennsylvania State University
State College, PA 16802, U.S.A.
ahs@psu.edu

Harold Borko
Graduate School of Education and Information
University of California
102 South Hall
CA 90024-1520, U.S.A.
`
Editorial Board

Padova, Italy

Tatiana Aparac

Trudi Bellardo Hahn
University of Maryland, MD U.S.A.
th90@umail.umd.edu

Abraham Bookstein
University of Chicago, Chicago, IL U.S.A.
a-bookstein@uchicago.edu

Michael Buckland

chiara@imag.fr

Lee

Chien
Sinica, Taipei, Taiwan
lfchien@iis.sinica.edu.tw

of Washington
U.S.A.

Edward A. Fox
Blacksburg, U.S.A.
fox@fox.cs.vt.edu

H. P. Frei
Union Bank of Switzerland
Zurich, Switzerland
frei@ubilab.ubs.com

Jonathan Furner

Donna Harman
National Institute of Standards & Technology
Gaithersburg, MD U.S.A.
donna.harman@nist.gov

David Harper
The Robert Gordon University
Aberdeen, Scotland
djh@scms.rgu.ac.uk

William Hersh
Oregon Health Sciences University
Portland, OR U.S.A.
hersh@ohsu.edu

Peter Ingwersen
The Royal School of Library and Information Science
Copenhagen, Denmark

Tetsuya Ishikawa
University of Library and Information Science
Tsukuba,

Haruo Kimoto
Nippon Telegraph and Telephone Corporation
Yokosuka, Japan
kimoto@isl.ntt.co.jp

Gerhard Knorz
Fachhochschule Darmstadt
Darmstadt, Germany
knorz@www.iud.fh-darmstadt.de

Rainer Kuhlen
Universität Konstanz, Konstanz, Germany
rainer.kuhlen@uni-konstanz.de

Syracuse
Syracuse, NY U.S.A.
liddy@mailbox.syr.edu

Jessica L. Milstead
The JELEM
Indian Head,
milstead@jelem.com

Sung Hung
Chungnam
Taejon, Korea
shmyaeng@cs.chungnam.ac.kr

University

Desai Narasimhalu
National University of Singapore

Fausto Rabitti
Consiglio Nazionale delle Ricerche
Pisa, Italy
F.Rabitti@cnuce.cnr.it

U.K.

U.S.A.

Edie M. Rasmussen
University
Pittsburgh, PA
erasmus@sis.pitt.edu

E. Robertson
Research Ltd, Cambridge, U.K.
ser@microsoft.com

Kalervo Järvelin

Paul B. Kantor

Tampere, Finland

Richard S. Marcus
Massachusetts Institute of Technology
Cambridge, MA U.S.A.
marcus@lids.mit.edu

Michel J. Menou
Consultant in Information Management
Les Rosiers sur Loire, France
mmenou@imaginet.fr
`
`© 2000 Elsevier Science Ltd. All rights reserved.
Information Processing & Management (ISSN 0306-4573) is published in six issues per year by Elsevier Science,
The Boulevard, Kidlington, Oxford OX5 1GB, UK.
Distributed in the USA by Mercury Airfreight International, 365 Blair Road, Avenel, NJ 07001.
`
PERGAMON

Information Processing and Management 36 (2000) 275-289
www.elsevier.com/locate/infoproman
`
A system for supporting cross-lingual information retrieval
`
Joanne Capstick^a,*, Abdel Kader Diagne^a, Gregor Erbach^a, Hans Uszkoreit^a,
Anne Leisenberg^b, Manfred Leisenberg^b
^a German Research Center for Artificial Intelligence, Language Technology Lab, Stuhlsatzenhausweg 3, 66123
Saarbrücken, Germany
^b Bertelsmann Online-Media-Service, Carl-Bertelsmann-Str. 16] 0, 33311 Gütersloh, Germany
`
`
`
`
`
Abstract

In this paper, we present MULINEX, a fully implemented system which supports cross-lingual
search of the WWW. Users can formulate, expand and disambiguate queries, filter the search
results and read the retrieved documents by using their native language alone. This multilingual
functionality is achieved by the use of dictionary-based query translation, multilingual document
categorisation and automatic translation of summaries and documents.
The system supports French, German and English and has been installed and tested in the online
services of two European internet content and service provider companies.
This paper focuses on the techniques and algorithms used in the MULINEX system, explaining how
each component works and how it contributes to the overall functionality of the integrated system. The
primary system functionalities are outlined from the user perspective, followed by a description of the
document database used in the system. The technologies and linguistic resources used in the various
system components are then described in detail. © 2000 Published by Elsevier Science Ltd. All rights
reserved.
`
`
* Corresponding author.
E-mail address: mulinext@dfki.de (J. Capstick).

0306-4573/00/$ - see front matter © 2000 Published by Elsevier Science Ltd. All rights reserved.
PII: S0306-4573(99)00058-8

1. Introduction

With the steady increase of internet users outside the US, English is losing its dominant
position in the internet and we are witnessing the emergence of a truly multilingual medium.
The interest in supporting a multilingual internet is demonstrated through web
internationalisation initiatives and through the addition of language technologies, such as
language identification and machine translation, by leading search engines. Cross-lingual
retrieval takes this trend further and provides the means for accessing multilingual internet
content, by enabling queries made in one language to retrieve documents in one or more other
languages.
`In this paper, we present the system MULINEX, whose objective is to enable users to search
`in multilingual document collections using their native language, supported by an effective
`combination of linguistic and information retrieval technologies.
Both monolingual and cross-lingual full-text retrieval are faced with the problem of
understanding the actual intention of a user query. Two complementary strategies can help
alleviate this situation: firstly, query formulation support to aid the user in making more
focussed queries, and, secondly, tools for filtering and navigating through search results,
providing users with accurate and efficient access to those documents which satisfy their
information needs. For dictionary-based query translation, support for interactive query
translation disambiguation is crucial to avoid a loss of precision through inaccurate
translations.
The MULINEX system described in this paper combines current information retrieval
technology with state-of-the-art language technologies. The system emphasises user-friendly
interaction, which supports the user by offering query translation and expansion; by presenting
search results along with information about language, thematic category, and automatically
generated summaries; and by allowing the user to filter results according to multiple criteria.
The basic components are embedded in an object-oriented, manager-based architecture,
providing a flexible system with potential for extendibility and re-usability.
This paper focuses on the techniques and algorithms used in the MULINEX system,^1
explaining how each component works and how it contributes to the functionality of the
overall integrated system. The primary system functionalities are outlined from the user
perspective, followed by a description of the document database. The technologies and
linguistic resources used in the various system components are then described in detail.
`
`2. Functionality for the user
`
MULINEX is aimed at users who want to retrieve information from the WWW which may
be represented in web pages in different languages. Users need not have any knowledge of
foreign languages, since the cross-language retrieval process is fully supported by the translation
of queries, of summaries and of retrieved documents. However, the system is equally useful for
users with some knowledge of the foreign languages, since it provides convenient support for
query translation, and allows filtering of the search results by language and thematic
categories.
A user requirements study was carried out at the beginning of the project. The study
consisted of questionnaires which were filled out by users of internet service and content
providers in Germany and France (Hernandez, 1997), and of psychological experiments with
84 subjects based on a mock-up version of the system (Capstick, Erbach & Uszkoreit, 1998).

^1 (Erbach et al., 1998) provides more detailed information on the social and economic factors influencing the
project, the objectives of the project consortium members, and the user requirements for the system.
`
[Figure 1: screenshot of the search form (query European "monetary union", query language selector, target
languages English, French and German) above a results list showing, for each hit, its language, title,
categories, summary, URL and size.]

Fig. 1. Query form and search results presentation.
`
`
Based on the results of the user requirements analysis, the system provides the following
functionality to support the user in retrieving documents from multilingual document
collections:

• translation of the user's query;
• interactive disambiguation of the query translation (optional);
• interactive query expansion (optional);
• simultaneous search in English, German and French document collections;
• informative presentation of search results, with summary, language and thematic category;
• filtering of search results by language and category;
• on-demand translation of summaries and search results.
Cross-language retrieval research started with Salton's seminal paper (Salton, 1973), and has
become an active field of research over the past four years (Grefenstette, 1998; Hull & Oard,
1997; Yang, Carbonell, Brown & Frederking, 1998). In terms of Oard's classification of
cross-language retrieval approaches (Oard, 1997a), the query translation approach adopted in our
system is a knowledge-based approach, as opposed to the corpus-based approach based on
comparable corpora adopted in Sheridan and Ballerini (1996) and to approaches which
construct parallel corpora by means of automatic document translation (Hiemstra & Kraaij,
1998; Kraaij & Hiemstra, 1998; Oard, 1997b).
We use the query translation approach because of the lack of substantial parallel or
comparable multilingual corpora,^2 and because we feel that document translation is not
scalable to very large amounts of data because of the resource requirements of machine
translation systems.
We will now illustrate how the user interacts with the system. Queries are formulated by
keywords as in a standard WWW search engine (see the search box at the top of Fig. 1). Since
automatic language identification of short queries is error-prone, the query language must be
specified by the user. The user can also select the acceptable document languages.
The user interface is available in English, German and French, and can be extended to
other languages. Users can switch the user interface language at any time during the
interaction.
^2 Parallel corpora contain translated documents, while comparable corpora contain texts which are not
translations, but talk about the same topic (e.g., two newspaper articles about the same event written by
journalists in different countries).
^3 Our examination of 100,000 queries submitted to the German web search engine web.de in 1998 revealed an
average query length of 1.3 words.

[Figure 2: screenshot of the query assistant, showing translations, back-translations and expansions for the
query term fair.]

Fig. 2. Query assistant.

In the next step, the query is translated into the selected target languages. Since search
engine queries typically do not provide enough context for automatic disambiguation,^3 the
"query assistant" provides the opportunity for interactive disambiguation of the query
translations (see Fig. 2 for the translations and expansions of the query term fair). In order to
help users who do not understand the target language with the disambiguation of query
translations, the "query assistant" shows how each translated query term translates back into
the original query language. As the example for the query term fair in Fig. 2 shows, the back
translations assist the user in eliminating translations into German and French which are
irrelevant to the intended meaning^4 even though the user may not have any knowledge of
German and French. The precision of the German and French queries is thus improved.
`The paraphrasing of the query term which results from the query translation and back-
`translation is also used to provide a simple query expansion mechanism which suggests
`alternative query terms to the user in the original query language. In the example shown in
`Fig. 2, the query terms trade fair and sales activity could also be selected to expand the query
`in the original language (English). After query expansion, disambiguation may optionally be
`repeated. Following query translation and expansion,
`the documents for each language are
`searched in parallel with the search terms for this language.
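
The back-translation and expansion steps described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the two toy dictionaries, their entries and the function name `query_assistant` are hypothetical and are not taken from the MULINEX implementation or its lexical resources.

```python
# Hypothetical bilingual dictionaries; the entries are illustrative only.
EN_DE = {"fair": ["Messe", "gerecht"]}
DE_EN = {"Messe": ["fair", "trade fair"], "gerecht": ["fair", "just"]}


def query_assistant(term, forward, backward):
    """Translate a query term and back-translate each candidate translation.

    Returns (rows, expansions): rows pairs each translation with its
    back-translations so that a user can reject irrelevant readings without
    knowing the target language; expansions are the back-translations that
    differ from the original term, offered as additional query terms.
    """
    rows = []
    expansions = set()
    for translation in forward.get(term, []):
        back = backward.get(translation, [])
        rows.append((translation, back))
        expansions.update(b for b in back if b != term)
    return rows, sorted(expansions)


rows, expansions = query_assistant("fair", EN_DE, DE_EN)
for translation, back in rows:
    print(translation, "<-", ", ".join(back))
print("suggested expansions:", expansions)
```

A user who meant the trade-fair sense would keep the first row and discard the second; the remaining back-translation then doubles as an expansion term in the original language.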
Fig. 1 shows how search results are presented to the user. The search form with the original
query and options is presented on the results page. The results list contains documents in all
languages requested by the user and is sorted by relevance. For each document in the list, the
language, title, URL and size are displayed. The document categories are presented in the user
interface language and the summary in the document language. Users may request a
translation of the summary, which is displayed in a separate window. The translation icon on
the right provides automatic document translation.
`By selecting the corresponding language tab, the results list can be filtered by language. The
`category navigation tool on the left hand side of the page enables the user to filter the results
`by category.
In the following sections, we will specify the information stored about each document in
order to support the filtering and presentation of the search results, as well as the technologies
and linguistic resources used to achieve the functionality described, and to obtain the
information about each document.
`
`3. Document database
`
`The core of the system is a database in which certain pieces of information about all
`documents are stored. In the context of our search engine, a document is a unit of presentation
`that is accessed by a WWW user by following a hyperlink. A document may be composed of
`several web pages which are arranged in a frameset. Treating the entire frameset as one
`document has the advantage that queries made up of several terms can retrieve a frameset in
`which the terms occur in different frames. For example, the query travel thailand may retrieve a
`document
`in which the query terms occur in separate frames. Another advantage is that
`retrieved frames are presented in the context of their frameset.
Documents in which multiple languages occur are not handled explicitly. Although the
language identification module could well identify different languages in a document if it were
run separately for each paragraph (or other suitable unit) of a document, we assign only one
language on the basis of the text which occurs at the beginning of the document. This decision
was taken in order to speed up the document analysis process and because a retrieval system
for the WWW cannot easily retrieve (or refer to) a passage or paragraph of a document which
is written in a specific language.

^4 Of course this method will only work to the extent that the intended meaning of the translation of the query
term has alternative paraphrases in the original query language. The effectiveness of the method is improved if the
underlying dictionary lists the most common translations before rare or domain specific translations.

`In order to provide the functionality outlined in Section 2, the following information (see
`Table 1) is stored about each document.
`The Fulcrum SearchServer, a state-of-the-art document management and retrieval system
`with an SQL-based query language, is used as the document database.
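
To make the record structure of Table 1 and the result filtering of Section 2 concrete, here is a minimal in-memory sketch. It is an assumption-laden illustration: the class name `DocumentRecord`, the field types and the helper `filter_results` are hypothetical, and it does not reproduce the Fulcrum SearchServer schema or its SQL-based query language.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class DocumentRecord:
    """One record per document, with the fields listed in Table 1."""
    url: str
    title: str
    size: int                     # size in bytes, as reported over HTTP
    date: datetime                # last modification date
    language: str                 # e.g. "en", "de", "fr"
    keywords: list[str] = field(default_factory=list)
    summary: str = ""
    categories: dict[str, float] = field(default_factory=dict)  # category -> similarity
    full_text: str = ""


def filter_results(records, language=None, category=None):
    """Keep only records matching the selected language and/or category."""
    return [
        r for r in records
        if (language is None or r.language == language)
        and (category is None or category in r.categories)
    ]


results = [
    DocumentRecord("http://example.org/euro", "The Euro", 74000,
                   datetime(1998, 3, 1), "en", categories={"Finance": 0.8}),
    DocumentRecord("http://example.org/conseil", "Conseil européen", 52000,
                   datetime(1997, 6, 17), "fr", categories={"Politics": 0.7}),
]
print([r.title for r in filter_results(results, language="fr")])
```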
`
4. Technologies and resources for document analysis
`
`In this section we describe the technologies and linguistic resources which are used to
`analyse documents in order to obtain the information specified in Table 1. Document analysis
`takes place during the gathering of documents by means of a web spider.
`
4.1. Document gathering
`
`Like all WWW search engines, MULINEX makes use of a web spider for the acquisition of
`documents and of a core information retrieval system for supporting the search. MULINEX
`extends this basic functionality by performing additional document analysis steps.
Fig. 3 shows the steps of the document acquisition process.^5 At each step, the information
about a document is successively refined. The web spider obtains information that is specified
in HTTP and HTML such as size, modification time, the URL, the character encoding and the
full text of the document. The document analysis components analyse the content of the
document to determine the language and thematic categories, and to create a document
summary. All this information is then used to create or update a record in the document
database.
`Gathering of documents is performed by the Harvest gatherer (Bowman et al., 1994), a
`highly configurable system for gathering documents from the WWW which respects the Robot
`Exclusion Protocol.
Harvest consists of two parts: the gatherer (a web spider) and the broker (an indexing and
search system based on the Glimpse retrieval engine). In the MULINEX system, we have
decided to use only the gatherer and replace the broker by the Fulcrum SearchServer. Fulcrum
provides certain advantages over Glimpse as a retrieval engine, notably an SQL-based query
language and the capability to sort search results by relevance.

Table 1
Structure of a record in the document database

URL              Uniform resource locator of the document
Title            Title of the document, as specified in the HTML <TITLE> tag
Size             Size of the document, as provided in the HTTP protocol
Date             Last modification date of the document, as provided in the HTTP protocol
Language         Language of the document (see Section 4.2)
Keywords         Keywords, author-specified or automatically extracted (see Section 4.3)
Summary          Summary, author-specified or automatically extracted (see Section 4.3)
Categories       A list of categories and similarity values (see Section 4.4)
Full text index  Full text of the document, indexed for document retrieval

[Figure 3: diagram of the document acquisition process with the components WWW, web spider, language
identification, summarisation and extraction of keywords, categorisation and document database, plus
morphological analysis and determination of word frequencies feeding the word frequency statistics.]

Fig. 3. Document acquisition.
`
`4.2. Language identification
`
Information about the language of a document is a necessary prerequisite for further
processing steps: document categorisation, summarisation, and machine translation are all
dependent on knowing the document language. Knowledge of the language also improves
indexing and retrieval performance by using appropriate stop-lists, stemming, term weights,
thesauri, etc. for each language. The language of WWW documents is often not marked by the
author even though HTML and HTTP allow the author to provide this information.
Therefore, we use a statistical language identifier to determine the language of a document.
Language identification is performed by making use of an algorithm which compares the
relative frequencies of the most frequent n-grams (from 1 to 5 characters) in a document to 40
stored language models (Cavnar & Trenkle, 1994).
^5 Preprocessing steps which take place prior to document acquisition are shown with a shaded background.
^6 We used the European Corpus Initiative (ECI) CD-ROM to obtain our training data.

Table 2
Evaluation of three language identification methods (see text)

Language  Method       3-5 words  6-10   11-15  16-20  21 or more
English   n-gram       68.75      97.1   99.3   99.7   100
English   trigram      97.2       99.5   99.9   99.9   100
English   short words  87.7       97.3   99.8   99.9   100
German    n-gram       95.1       98.4   99.2   99.9   100
German    trigram      97.2       99.3   99.8   99.9   100
German    short words  71.6       89.6   98.2   99.8   100
French    n-gram       84.9       98     98     98.3   99.5
French    trigram      93         94.5   93.6   99.8   100
French    short words  81.8       96     97.2   99.8   100

For each language, the language recogniser uses a language model, the sequence of the
300 most frequent n-grams, ordered by their frequency in a training corpus.^6 For each
document whose language we want to identify, we compute the sequence of its most frequent
n-grams. For each n-gram of the document, we compare its rank to the rank of the same
n-gram in the language model and sum up the differences. A maximal difference value is used for
n-grams which are not present in the language model. The language whose language model has
the smallest total difference to the current document will be assigned.
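
The rank comparison described above can be sketched in a few lines. This is a minimal illustration in the spirit of Cavnar & Trenkle (1994), not the MULINEX code: the function names, the word padding with underscores, the tie-breaking rule and the two tiny training samples are assumptions, whereas a real model would be built from a large training corpus.

```python
from collections import Counter


def ngram_profile(text, max_n=5, top_k=300):
    """Return the top_k most frequent character n-grams (lengths 1..max_n)."""
    counts = Counter()
    for word in text.lower().split():       # case is ignored
        padded = f"_{word}_"                # assumption: mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Most frequent first; ties broken alphabetically to stay deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, model_profile):
    """Sum of rank differences; unseen n-grams get a maximal difference."""
    max_diff = len(model_profile)
    rank = {gram: r for r, gram in enumerate(model_profile)}
    return sum(
        abs(r - rank[gram]) if gram in rank else max_diff
        for r, gram in enumerate(doc_profile)
    )


def identify(text, models):
    """Pick the language whose model has the smallest total difference."""
    doc = ngram_profile(text)
    return min(models, key=lambda lang: out_of_place(doc, models[lang]))


# Toy training samples (hypothetical); real models need far more text.
EN_SAMPLE = "the quick brown fox jumps over the lazy dog and the cat"
DE_SAMPLE = "der schnelle braune fuchs springt über den faulen hund und die katze"
models = {"en": ngram_profile(EN_SAMPLE), "de": ngram_profile(DE_SAMPLE)}
print(identify(EN_SAMPLE, models))
```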
The language identifier was evaluated using data from the European Corpus Initiative CD-ROM
in Danish, Dutch, English, French, German, Italian, Norwegian, Portuguese and
Spanish, replicating the experiments reported in Grefenstette (1995), which compared language
identification schemes based on trigrams and on frequent short words. The evaluation results
of our n-gram language identifier and a comparison to Grefenstette's results for the
MULINEX languages English, German and French are given in Table 2.
It has been argued that a language identifier based on the frequencies of n-grams of
length 1-5 combines the advantages of several well-known language identification methods: it
takes into account the frequencies of single letters, of bigrams and trigrams and of frequent
short words. However, our experiments showed significant advantages for French text of a
length between 6 and 15 words only. Above 21 words, all algorithms show an almost perfect
language identification accuracy.
A separate test showed that the language identification performance for the ECI corpus data
did not degrade when the distinction between upper and lower case characters was ignored.
For language identification of web pages, we ignore case because capitalised titles and headings
often lead to errors in language identification.
`
`4.3. Summarisation
`
Summarisation is performed by selecting the sentences^7 which best characterise a document.
We use sentence selection as our summarisation method because it shows robust performance
on a collection of widely varying documents, such as those in the WWW.

^7 We treat words between punctuation or structural HTML markup as sentences, even if they are not
grammatically well-formed sentences.

There are two summarisers, a neutral (query-independent) and a tailored (query-dependent)
summariser. Both work by selection of the most salient sentences. The target length of the
`summary can be specified, and the summariser will select the most important sentences until
`the target length has been reached. The query-independent summariser selects sentences which
`are marked up by structural (headings) and layout-oriented (boldface, italics) markup, and uses
`heuristics such as selecting the first sentence of a paragraph and making use of term frequency
`statistics. The query-specific summariser
`selects sentences in which the query terms (or
`morphological variants) occur.
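
A rough sketch of query-independent sentence selection along these lines is shown below. The scoring weights, the input format (a sentence with two boolean flags) and the function name `summarise` are assumptions made for illustration; the source names the heuristics but does not say how they are combined.

```python
from collections import Counter


def summarise(sentences, target_words=20):
    """Select salient sentences until the target length is reached.

    sentences: list of (text, is_marked_up, is_first_in_paragraph) tuples,
    where is_marked_up means the sentence sits in a heading, boldface or
    italics. The weights below are arbitrary illustrative choices.
    """
    tf = Counter(w.lower() for text, _, _ in sentences for w in text.split())

    def score(item):
        text, marked, first = item
        words = text.split()
        avg_tf = sum(tf[w.lower()] for w in words) / len(words) if words else 0.0
        return avg_tf + (2.0 if marked else 0.0) + (1.0 if first else 0.0)

    ranked = sorted(range(len(sentences)), key=lambda i: -score(sentences[i]))
    chosen, length = [], 0
    for i in ranked:
        if length >= target_words:
            break
        chosen.append(i)
        length += len(sentences[i][0].split())
    # Present the selected sentences in their original document order.
    return [sentences[i][0] for i in sorted(chosen)]


page = [
    ("The Euro", True, True),
    ("Leave employment a national responsibility", False, True),
    ("See also the appendix", False, False),
]
print(summarise(page, target_words=7))
```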
`In the MULINEX system, we use the query-independent summariser during document
`gathering to generate summaries which are stored in the document database. Use of a query-
`specific summariser for the search system is not practical, as we do not store the full text of the
`documents on the server.
`In addition to sentence extraction, we extract a set of salient keywords for each document by
`choosing words which occur frequently in the document, but less frequently in the document
`collection.
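
One common way to realise "frequent in the document, but less frequent in the collection" is a TF-IDF style weight. The sketch below uses that weighting as an assumption; the source does not state which formula MULINEX applies, and the function name `salient_keywords` and the toy collection are hypothetical.

```python
import math
from collections import Counter


def salient_keywords(doc_tokens, collection, top_k=3):
    """Rank a document's terms by term frequency times a smoothed IDF."""
    n_docs = len(collection)
    df = Counter()
    for tokens in collection:
        df.update(set(tokens))              # document frequency per term
    tf = Counter(doc_tokens)

    def score(term):
        idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0
        return tf[term] * idf

    ranked = sorted(tf, key=lambda t: (-score(t), t))
    return ranked[:top_k]


collection = [
    ["euro", "the", "bank"],
    ["the", "travel"],
    ["the", "euro", "euro", "union"],
]
print(salient_keywords(collection[2], collection, top_k=2))
```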
`
`4.4. Categorisation
`
`The MULINEX system contains three different document categorisation algorithms, each
`suited for different categorisation tasks:
`
`1. n-gram categoriser for noisy input
`2. k-nearest-neighbour (KNN) algorithm for normal documents
`3. pattern-based categoriser for very short documents
The n-gram categoriser makes use of the frequencies of n-grams of characters in a document
which is compared to a category model, the sequence of most frequent n-grams ordered by
their frequency in a training corpus.^8 Our evaluation has shown that this categoriser did not
perform as well for web pages as the KNN categoriser. The n-gram categoriser is more useful
in situations with noisy input, such as categorising OCR output.
The k-nearest-neighbour algorithm (Yang, 1994) is a statistical algorithm which classifies a
new document by combining the category assignments of the k most similar training
documents, weighted by the similarity between the new document and each of the k best
matching training documents. The categoriser has been trained with documents from
`newsgroups in French, German and English.
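
The similarity-weighted vote can be illustrated with a small sketch. Cosine similarity over raw term counts is an assumption chosen for brevity; the source does not specify the document representation or the similarity measure, and the function names and training examples are hypothetical.

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_categorise(doc_tokens, training, k=3):
    """Score categories by summing similarities of the k nearest neighbours.

    training: list of (tokens, categories) pairs; a training document may
    carry more than one category.
    """
    doc_vec = Counter(doc_tokens)
    neighbours = sorted(
        ((cosine(doc_vec, Counter(tokens)), cats) for tokens, cats in training),
        key=lambda pair: -pair[0],
    )[:k]
    scores = defaultdict(float)
    for similarity, cats in neighbours:
        for cat in cats:
            scores[cat] += similarity
    return dict(scores)


training = [
    (["election", "vote", "party"], ["politics"]),
    (["doctor", "patient", "vote"], ["medicine"]),
    (["parliament", "vote", "election"], ["politics"]),
]
print(knn_categorise(["election", "vote"], training, k=2))
```

Because each language is trained separately on the same category labels, one such model per language would yield comparable category assignments across languages.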
Although the categorisers are trained separately for each language, multilinguality is
achieved by training for the same categories in different languages. For example, we gathered
training material for the category politics from the German newsgroup de.soc.politik, the
corresponding French group fr.soc.politique and English soc.politics. For medicine, we used all
English sci.med.* groups, German de.sci.medizin and French fr.bio.medecine.
The two statistical categorisers (KNN and n-gram) have been evaluated with German news
articles which were obtained from the Federal Press Office of the German government. The
task of the Press Office is to distribute news articles to the different departments and ministries
of the German government. The training data were categorised according to the government
department to which they were distributed. We had a total of 8809 documents with 73 distinct
categories; a document can have more than one category if it is sent to more than one
government department. The categorisers assigned a number of categories with a confidence
value; all category assignments above a fixed cutoff point, which was chosen to maximise the
f-measure value on the training set for each category, were used in the evaluation. 80% of the
documents were used for training, the rest was retained for testing. Precision and recall were
evaluated for each of the 73 categories with the following results (see Table 3).

Table 3
Average precision and recall for two categorisation algorithms

                   KNN    n-gram
Average recall     0.757  0.243
Average precision  0.714  0.361

^8 This is the algorithm that is used for language identification. It has been used successfully by Cavnar and
Trenkle for the categorisation of Usenet news articles (Cavnar & Trenkle, 1994).
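
The per-category cutoff selection can be sketched as a simple search over candidate thresholds. The sketch assumes the balanced F-measure (F1) and treats an assignment as accepted when its confidence is at or above the cutoff; the source does not specify either detail, and all names and numbers here are hypothetical.

```python
def precision_recall(assigned, relevant):
    """Precision and recall of a set of assigned documents for one category."""
    true_positives = len(assigned & relevant)
    precision = true_positives / len(assigned) if assigned else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


def f_measure(precision, recall):
    """Balanced F-measure (harmonic mean of precision and recall)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def best_cutoff(confidences, relevant):
    """Return (cutoff, f) maximising the F-measure on the training set.

    confidences: {doc_id: confidence} produced by a categoriser for one
    category; relevant: set of doc_ids that truly belong to the category.
    """
    best_f, best_c = 0.0, None
    for cutoff in sorted(set(confidences.values())):
        assigned = {d for d, c in confidences.items() if c >= cutoff}
        f = f_measure(*precision_recall(assigned, relevant))
        if f > best_f:
            best_f, best_c = f, cutoff
    return best_c, best_f


confidences = {"d1": 0.9, "d2": 0.8, "d3": 0.4, "d4": 0.2}
print(best_cutoff(confidences, relevant={"d1", "d2"}))
```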
In addition, there is a pattern-based categorisation algorithm for narrow, specialised
categories. The pattern-based categoriser recognises terms which are of interest to a certain
`domain, such as the names of travel agencies or airlines for the tourism domain. It is used in
`situations where the pages to be categorised contain only little text and are dominated by
`graphics and tables. The patterns, expressed in Perl regular expression syntax, have been
`defined manually, by abstracting over multi-word terms found in a corpus of web pages for the
`domain of interest.
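
As an illustration of the idea, the sketch below matches hand-written patterns against page text. It uses Python's `re` module, whose syntax is close to but not identical with the Perl syntax mentioned above, and the two tourism patterns are invented examples, not the patterns defined for MULINEX.

```python
import re

# Hypothetical patterns for one narrow category; real patterns would be
# abstracted manually from multi-word terms in a domain corpus.
PATTERNS = {
    "tourism": [
        re.compile(r"\b(?:travel|reise)\s*(?:agency|agencies|büro)\b", re.IGNORECASE),
        re.compile(r"\b\w+\s+air(?:lines|ways)\b", re.IGNORECASE),
    ],
}


def categorise_by_pattern(text, patterns=PATTERNS):
    """Return every category with at least one pattern matching the text."""
    return sorted(
        category
        for category, regexes in patterns.items()
        if any(regex.search(text) for regex in regexes)
    )


print(categorise_by_pattern("Book flights with Example Airlines"))
```

Because a single match is enough, this approach still works on pages that consist mostly of graphics and tables with very little running text.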
The pattern-based categoriser has not been formally evaluated due to the lack of categorised
training data, but usability testing of the system showed user satisfaction with categorisation
results (cf. Section 6).
`
`5. Technologies and resources for search and result presentation
`
`In this section, we describe the technologies and resources used to process a user’s query and
`present the search results.
`
`5.1. Query analysis
`
The MULINEX system analyses, translates and expands the users' queries. Since the
retrieval performance of automatically translated queries is inferior to monolingual information
retrieval, there is an (optional) step of user interaction, where the user can select terms from
the translated query and add other translations.
The search syntax supports required search terms (+), excluded search terms (-), and allows
the user to block the translation of search terms (!). Full boolean search syntax (AND/OR) is
not supported because of the problems encountered with ambiguity in the translation of query
`terms.
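
A parser for this small query syntax might look like the sketch below. It assumes that the operators are written as prefixes directly before a term and that the exclusion operator is an ASCII hyphen; the source gives only the three symbols, so these details, the class `QueryTerm` and the function `parse_query` are hypothetical. Quoted phrases are not handled.

```python
from dataclasses import dataclass


@dataclass
class QueryTerm:
    text: str
    required: bool = False    # "+" prefix: term must occur
    excluded: bool = False    # "-" prefix: term must not occur
    translate: bool = True    # "!" prefix: do not translate this term


def parse_query(query):
    """Split a keyword query into terms and interpret prefix operators."""
    terms = []
    for token in query.split():
        term = QueryTerm(text=token)
        # Consume leading operators in any order, e.g. "+!Bundesbank".
        while term.text and term.text[0] in "+-!":
            operator, term.text = term.text[0], term.text[1:]
            if operator == "+":
                term.required = True
            elif operator == "-":
                term.excluded = True
            else:
                term.translate = False
        if term.text:
            terms.append(term)
    return terms


for term in parse_query("+euro -dollar !Bundesbank union"):
    print(term)
```

Blocking translation is useful for proper names, which a dictionary-based translator might otherwise replace with an unrelated word.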
`