`
`
`
`
`
`IN THE UNITED STATES PATENT AND TRADEMARK OFFICE
`
`
`
`
`
`BEFORE THE PATENT TRIAL AND APPEAL BOARD
`
`
`
`
`AOL INC.
`Petitioner
`v.
`
`IMPROVED SEARCH LLC
`Patent Owner
`
`
`
`Case No. CBM2017-00038
`U.S. Patent No. 7,516,154
`
`
`DECLARATION OF DOUGLAS W. OARD, Ph.D.
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`AOL Ex. 1002
`Page 1 of 90
`
`
`
`
`
`TABLE OF CONTENTS
`
`I. INTRODUCTION
`II. QUALIFICATIONS
`III. COMPENSATION AND RELATIONSHIP TO THE PARTIES
`IV. LEGAL STANDARDS USED IN MY ANALYSIS
`    A. The Person of Ordinary Skill in the Art
`    B. Broadest Reasonable Interpretation
`    C. Means-Plus-Function Claim Elements
`    D. Enablement
`    E. Incorporation by Reference
`    F. Obviousness
`V. SUMMARY OF OPINIONS
`VI. TECHNICAL BACKGROUND
`    A. Cross-Language Search and Query Translation
`    B. Problems Inherent in the Query Translation Approach
`        1. Identifying individual query terms
`        2. Identifying possible translations for each query term
`        3. Determining how best to use those possible translations
`    C. Sponsored Search
`    D. The ’101 Patent
`    E. The Divisional Application
`    F. The ’154 Patent
`    G. The Challenged Claims
`VII. OPINIONS REGARDING CLAIM CONSTRUCTION
`    A. “Dialectal Standardization” / “Dialectally Standardizing”
`    B. “Content Word”
`    C. “Advertising Cues”
`VIII. CLAIMS 1 AND 7 ARE INVALID FOR LACK OF ENABLEMENT.
`    A. The Patent Gives No Guidance on How to Perform Dialectal Standardization.
`    B. Undue Experimentation
`IX. CLAIM 7 IS INVALID FOR INDEFINITENESS.
`    A. “A Dialectal Controller for Dialectally Standardizing a Content Word Extracted from the Query”
`    B. “Means to Search the Database of the Advertising Cues Based on the Relevancy to the Translated Content Word”
`X. CLAIMS 1 AND 7 ARE INVALID FOR OBVIOUSNESS.
`    A. Claims 1 and 7 Are Not Entitled to the Priority Date of Either Parent Application.
`    B. Claims 1 and 7 Would Have Been Obvious In Light of the ’101 Patent and Skillen.
`        1. The ’101 patent and Skillen disclose each and every element of claim 1.
`        2. The ’101 patent and Skillen disclose each and every element of claim 7.
`        3. A POSA would have found it obvious to combine the teachings of the ’101 patent and Skillen.
`        4. I am aware of no objective indicia weighing in favor of a finding of non-obviousness.
`XI. CONCLUSION
`
`I, Dr. Douglas W. Oard, hereby state the following:
`I. INTRODUCTION
`1. I have been retained on behalf of AOL Inc. (“AOL”) to provide
`
`technical assistance related to the filing of a Petition for Covered Business Method
`
`Review (“CBM Review”) of U.S. Patent No. 7,516,154 (“the ’154 patent”). I am
`
`working as a private consultant on this matter and the opinions presented here are
`
`my own.
`
`2. I have been asked to provide a written declaration, including opinions
`
`related to the following issues:
`
`• Technical background,
`
`• The qualifications of a person of ordinary skill in the art (“POSA”),
`
`• The proper interpretation of the claims under the broadest reasonable
`
`construction standard,
`
`• Whether the specification of the patent describes the invention in such
`
`full, clear, concise, and exact terms as to enable a POSA to carry out the
`
`claimed invention without undue experimentation,
`
`• Whether the specification of the patent describes sufficiently definite
`
`structure for performing functions recited in means-plus-function claims,
`
`and
`
`• Whether claims 1 and 7 of the ’154 patent would have been obvious
`
`to a POSA at the time of the alleged invention in light of U.S. Patent No.
`
`6,604,101 (“the ’101 patent”) (Ex. 1003) and U.S. Patent No. 6,098,065
`
`(“Skillen”) (Ex. 1004).
`
`In reaching my opinions on the ’154 patent, I have reviewed the documents cited
`
`herein and relied on my many years of knowledge and experience in the field of
`
`information retrieval (outlined in Section II). This Declaration sets forth the bases
`
`and reasons for my opinions, including the materials and information relied upon
`
`in forming those opinions and conclusions.
`
`II. QUALIFICATIONS
`3. I am a professor and researcher in the fields of computer science and
`
`information science. Information Retrieval (IR), which is the preferred technical
`
`term for search, has been the primary subject of my research for over twenty years.
`
`From 1994 to 2002, Cross-Language IR (CLIR) was a particular area of focus for
`
`me.
`
`4. I received two degrees from Rice University: a Master of Electrical
`
`Engineering degree in 1979 and a Bachelor of Arts degree with a double major in
`
`Electrical Engineering and Mathematical Sciences, also in 1979. I received a
`
`Ph.D. in Electrical Engineering from the University of Maryland, College Park in
`
`1996, with a dissertation on Adaptive Vector Space Text Filtering for Monolingual
`
`and Cross-Language Applications.
`
`5. After completing my Ph.D. in 1996, I was appointed in that same year
`
`as an Assistant Professor in the College of Library and Information Services at the
`
`University of Maryland, College Park. The name of the College of Library and
`
`Information Services has subsequently been changed to the College of Information
`
`Studies, reflecting a broader scope of both teaching and research. I was promoted
`
`to Associate Professor (with tenure) in 2002, and to Professor (with tenure) in
`
`2010. From 2006 to 2009 I served as Associate Dean for Research in the College
`
`of Information Studies. In 2000, I was appointed to a joint faculty position in the
`
`University of Maryland Institute for Advanced Computer Studies (UMIACS).
`
`UMIACS appointments are renewable appointments with a term of three to five
`
`years, and my UMIACS appointment has been renewed continuously. I also
`
`currently serve as an Affiliate Professor in the Computer Science Department at
`
`the University of Maryland, College Park, and as an Affiliate Professor in the
`
`Applied Mathematics, Statistics and Scientific Computation (AMSC) program at
`
`the University of Maryland, College Park.
`
`6. I have also held visiting positions while conducting research on IR
`
`during sabbatical visits (of 5-14 months duration) at the University of California
`
`Berkeley, the University of Southern California Information Sciences Institute
`
`(USC-ISI), the University of Melbourne (Australia) and (concurrently) RMIT
`
`University (Australia), and the University of Florida and (concurrently) the
`
`University of South Florida. I also am affiliated with the Johns Hopkins
`
`University Human Language Technology Center of Excellence, and I hold a
`
`Visiting Professor appointment at the National Institute of Informatics (NII) in
`
`Japan.
`
`7. From 2010 to 2012, I served as Director of the UMIACS
`
`Computational Linguistics and Information Processing (CLIP) lab. The CLIP lab’s
`
`research record is particularly strong in both computational linguistics and IR.
`
`8. As I mentioned above, I perform research in the general area of IR,
`
`with particular emphasis on the design of search systems that leverage specific
`
`technologies for the computational manipulation of human language. Examples of
`
`these technologies include translation (for CLIR), speech recognition (for speech
`
`retrieval), and optical character recognition (for document image retrieval). I have
`
`also conducted research on retrieval from informal sources of text such as email
`
`(particularly in the context of e-discovery), text chat, and microblog posts, and on
`
`recommender systems, knowledge base population, and computational social
`
`science.
`
`9. I have published more than 240 academic papers. About 100 of those
`
`papers are on CLIR, and I continue to conduct, publish, and review research on
`
`that topic. I have published peer reviewed papers on CLIR in venues such as the
`
`Journal of the Association for Information Science and Technology, Information
`
`Processing & Management, Information Retrieval, ACM Transactions on Asian
`
`Language Information Processing, Computer Speech and Language, the Annual
`
`Review of Information Science and Technology, and the ACM Special Interest
`
`Group on Information Retrieval (SIGIR) conference.
`
`10. At the University of Maryland, I teach courses on IR and on other
`
`aspects of information technology. Examples include graduate courses on
`
`Information Retrieval Systems, Creating Information Infrastructures, and
`
`Transformational Information Technologies, and an undergraduate course on
`
`Information and Knowledge Management.
`
`11. I recently completed a five-year term as co-editor of the peer reviewed
`
`journal Foundations and Trends in Information Retrieval, and I continue to serve
`
`as a Senior Associate Editor for the peer reviewed journal ACM Transactions on
`
`Information Systems. I have previously also served on the editorial boards of the
`
`peer-reviewed journals Information Processing & Management, Journal of the
`
`Association for Information Science and Technology, and Information Retrieval.
`
`In 2008, I served as Program Committee Co-Chair for the leading IR research
`
`conference, the ACM SIGIR conference. I have also helped to organize more than
`
`thirty other international research meetings; examples include the 1997 American
`
`Association for Artificial Intelligence (AAAI)1 Spring Symposium on Cross-
`
`Language Text and Speech Retrieval, the 2009 annual conference of the North
`
`American chapter of the Association for Computational Linguistics (NAACL), and
`
`seven workshops on the Discovery of Electronically Stored Information (DESI).
`
`12. I have served in leadership roles for four of the major global IR
`
`evaluation venues, including as General Co-Chair for the NII Testbeds and
`
`Community for Information Access Research (NTCIR) evaluation in Japan; as
`
`Program Committee member and as a coordinator for tracks on CLIR and e-
`
`discovery at the Text Retrieval Conference (TREC) evaluation in the United
`
`States; as a coordinator for evaluations of interactive CLIR and cross-language
`
`speech retrieval at the Cross-Language Evaluation Forum (CLEF) in Europe; and
`
`as a coordinator for tracks on speech retrieval and on retrieval of scanned
`
`documents at the Forum for Information Retrieval Evaluation (FIRE) in India.
`
`13. My research on CLIR has been supported by the National Science
`
`Foundation (NSF) and the Defense Advanced Research Projects Agency
`
`(DARPA). My research on other topics has been supported by the National
`
`Endowment for the Humanities, IBM, and the Qatar National Research Fund,
`
`among others. I regularly review research proposals for NSF, and occasionally for
`
`similar bodies in other countries (including, for example, Canada, Hong Kong,
`
`Luxembourg, and Switzerland).
`
`1 The name of AAAI has subsequently been changed to the Association for the
`
`Advancement of Artificial Intelligence.
`
`14. I have given presentations on my research in more than 30 countries,
`
`including, for example, Brazil, China, Egypt, Germany, India, New Zealand,
`
`Russia, Singapore, Spain, and the United Kingdom.
`
`15. A more detailed description of my professional qualifications,
`
`including a list of publications, teaching, and professional activities, is contained in
`
`my curriculum vitae, a copy of which is attached as Appendix A.
`
`III. COMPENSATION AND RELATIONSHIP TO THE PARTIES
`16. I am being compensated for my time on this matter at my standard
`
`consulting rate of $420 per hour plus expenses. Apart from that, I have no
`
`financial interest in AOL Inc., Google Inc., or Improved Search LLC. My
`
`compensation is in no way dependent on the substance of my opinions or the
`
`outcome of this proceeding.
`
`IV. LEGAL STANDARDS USED IN MY ANALYSIS
`17. Although I am not an attorney and do not offer any opinions on the
`
`law, I have been informed of certain legal principles that I have relied on in
`
`reaching the opinions set forth in this Declaration.
`
`A. The Person of Ordinary Skill in the Art
`18. I have been informed that a POSA is a hypothetical person who is
`
`presumed to have known all of the relevant prior art as of the priority date. I have
`
`been informed that factors that may be considered in determining the level of
`
`ordinary skill in the art may include: (a) the educational level of the inventor; (b)
`
`the type of problems encountered in the art; (c) prior art solutions to those
`
`problems; (d) the rapidity with which innovations are made; (e) the sophistication
`
`of the technology; and (f) the educational level of active workers in the field.
`
`19. I have been asked to provide my opinion as to the qualifications of the
`
`person of ordinary skill in the art to which the ’154 patent pertains. In my opinion,
`
`a POSA would have at least an undergraduate degree in computer science,
`
`information science, or a similar field and at least two years of experience in the
`
`field of CLIR, which could include academic experience (e.g., a Master’s degree
`
`with a CLIR focus). In addition, I believe a POSA would be familiar with
`
`commercial aspects of information retrieval, including search-related advertising.
`
`B. Broadest Reasonable Interpretation
`20. I have been informed that for purposes of this CBM proceeding the
`
`terms in the claims of the ’154 patent are to be given their broadest reasonable
`
`interpretation in light of the specification of the ’154 patent, as understood by a
`
`POSA. I have used this standard throughout my analysis.
`
`C. Means-Plus-Function Claim Elements
`21. I have been informed that an element of a patent claim may be
`
`expressed as a means or step for performing a specified function without the recital
`
`of structure, materials, or acts in support thereof. I have been informed that such
`
`elements are referred to as “means-plus-function” elements and are construed to
`
`cover the corresponding structure, material, or acts described in the specification
`
`and equivalents thereof. I have been informed that if the specification fails to
`
`identify any corresponding structure, material, or acts for a means-plus-function
`
`claim element, the claim is invalid because it is indefinite.
`
`22. I have been informed that, in determining whether a claim element is
`
`a means-plus-function element, use of the term “means” creates a presumption that
`
`a claim element is a means-plus-function element and that lack of the term
`
`“means” creates a presumption that a claim element is not a means-plus-function
`
`element. However, I have been informed that the essential inquiry is whether the
`
`words of the claim are understood by persons of ordinary skill in the art as having
`
`sufficiently definite meaning as the name for structure. I have been informed that
`
`the use of nonce words or generic terms such as module, mechanism, element, or
`
`device may invoke means-plus-function treatment.
`
`23. I have been informed that, where the structure disclosed by the
`
`specification for performing a particular function is a generic computer
`
`programmed to carry out an algorithm, the disclosed structure is not the general
`
`purpose computer but the special purpose computer programmed to perform the
`
`disclosed algorithm.
`
`D. Enablement
`24. I have been informed that a patent claim is invalid if the specification
`
`of the patent fails to describe the claimed invention in such full, clear, concise, and
`
`exact terms as to enable a POSA to make and use the claimed invention.
`
`25. I have been informed that a claim lacks enablement if, as of the
`
`priority date, a POSA could not practice the full scope of the claim without undue
`
`experimentation. I have been informed that courts and the Patent Office often
`
`consider eight factors in determining whether undue experimentation would be
`
`needed to practice the full scope of a patent claim:
`
`(1) the quantity of experimentation necessary,
`
`(2) the amount of direction or guidance presented,
`
`(3) the presence or absence of working examples,
`
`(4) the nature of the invention,
`
`(5) the state of the prior art,
`
`(6) the relative skill of those in the art,
`
`(7) the predictability or unpredictability of the art, and
`
`(8) the breadth of the claims.
`
`I have been informed that these factors are referred to as the Wands factors.
`
`E. Incorporation by Reference
`26. I have been informed that a patent specification may incorporate other
`
`material by reference, and that such material is considered to be part of the
`
`specification as if it had been set forth explicitly in the specification. I have been
`
`instructed to assume that the ’154 patent properly incorporates by reference the
`
`entireties of the ’101 patent and U.S. Patent App. No. 10/449,740 (Ex. 1007), the
`
`divisional application that followed the ’101 patent and preceded the ’154 patent.
`
`F. Obviousness
`27. I have been informed that a patent claim is invalid if the differences
`
`between the subject matter and the prior art are such that the subject matter as a
`
`whole would have been obvious to a POSA at the time of the alleged invention. I
`
`have been informed that an obviousness analysis involves reviewing the scope and
`
`content of the prior art, the differences between the prior art and the claims at
`
`issue, the level of ordinary skill in the pertinent art, and objective indicia of non-
`
`obviousness such as long-felt need, industry praise for the invention, and
`
`skepticism of others in the field.
`
`28. I have been informed that the following rationales, among others, may
`
`support a conclusion of obviousness:
`
`(a) the combination of familiar elements according to known
`
`methods to yield predictable results;
`
`(b) the simple substitution of one known element for another to
`
`obtain predictable results;
`
`(c) the use of known techniques to improve similar methods or
`
`apparatuses in the same way;
`
`(d) the application of a known technique to a known method or
`
`apparatus ready for improvement to yield predictable results;
`
`(e) the choice of a particular solution from a finite number of
`
`identified, predictable solutions with a reasonable expectation of success;
`
`(f) the use of known work in one field of endeavor in either the
`
`same field or a different one based on design incentives or other market forces, if
`
`the variations are predictable to one of ordinary skill in the art; and
`
`(g) the following of some teaching, suggestion, or motivation in the
`
`prior art that would have led one of ordinary skill to modify the prior art reference
`
`or to combine prior art reference teachings to arrive at the claimed invention.
`
`V. SUMMARY OF OPINIONS
`29. It is my opinion that claims 1 and 7 of the ’154 patent are
`
`unpatentable. I conclude that claims 1 and 7 are unpatentable because the
`
`specification fails to enable a POSA to carry out the claimed invention without
`
`undue experimentation, that claim 7 is unpatentable because the specification fails
`
`to disclose any structure, material, or acts associated with the claimed function of
`
`dialectal standardization, and that both claims are unpatentable because they would
`
`have been obvious to a POSA as of the priority date in light of the ’101 patent and
`
`Skillen.
`
`VI. TECHNICAL BACKGROUND
`30. In order to explain my opinion and the bases for it, I provide in this
`
`section a background on the field of cross-language search, including some of the
`
`challenges inherent in the field and some of the work that other researchers were
`
`doing in the years leading up to the ’101 and ’154 patents. I then provide a brief
`
`discussion of search-related advertising. Finally, I provide an overview of the ’154
`
`patent and its predecessor, the ’101 patent.
`
`A. Cross-Language Search and Query Translation
`31. The ’154 patent relates to the field of cross-language search, referred
`
`to in academia as CLIR for Cross-Language Information Retrieval. In the typical
`
`formulation of the CLIR problem, a user presents a query to a search engine in one
`
`human language (e.g., English) and the search engine retrieves documents written
`
`in some other human language (e.g., Chinese) that the search engine determines are
`
`most likely to be relevant to the user’s request.
`
`32. The problem of finding documents in a language different from the
`
`language used to pose the request is not new; librarians who sought to serve
`
`diverse user populations have faced the same problem for centuries. But, by the
`
`1990s, both the academic and commercial worlds recognized the wealth of digital
`
`information available in a variety of languages, and “the need to find ways of
`
`retrieving information across language boundaries, and to understand this
`
`information, once retrieved.” Ex. 1023 at 1 (Grefenstette). Since the late 1980s,
`
`corporations, government institutions, academic research centers, and others have
`
`sponsored or conducted research to develop methods and systems for accessing
`
`and understanding information written in a language other than that of the user’s
`
`query, and earlier work with smaller collections extends back to at least the 1960s.
`
`33. CLIR involves the same challenges as monolingual IR—such as
`
`determining which results are most relevant to the user’s query—plus the
`
`additional challenges posed by translation. Research on CLIR has often built upon
`
`techniques and approaches developed in monolingual IR. For example, it was well
`
`known by the late 1990s that one workable method of performing CLIR was to
`
`translate the user’s query into the target language and then perform a search in that
`
`language with a monolingual search engine. See, e.g., Ex. 1021 at 101
`
`(Yamabana); Ex. 1022 at 254, 258-59 (Bian). That approach to CLIR is known as
`
`query translation because of its focus on translating the terms2 found in a user’s
`
`query.
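
By way of illustration only, the query translation approach described above can be sketched in a few lines of Python. The toy English-to-Spanish dictionary, the three-document collection, and the function names `translate_query` and `monolingual_search` are hypothetical and are not drawn from any of the cited references. The term-count scoring is a deliberately simple stand-in for a real monolingual search engine.

```python
# Hypothetical toy bilingual dictionary (English -> Spanish); illustrative only.
BILINGUAL_DICT = {
    "house": ["casa"],
    "bank": ["banco", "orilla"],  # financial institution vs. river bank
}

# Hypothetical target-language document collection.
DOCUMENTS = {
    "d1": "el banco abre a las nueve",
    "d2": "la casa esta cerca de la orilla del rio",
    "d3": "el perro duerme",
}


def translate_query(query, bilingual_dict):
    """Replace each query term with all of its known translations.

    Terms missing from the dictionary are passed through unchanged
    (an assumption made here for simplicity).
    """
    translated = []
    for term in query.lower().split():
        translated.extend(bilingual_dict.get(term, [term]))
    return translated


def monolingual_search(terms, documents):
    """Rank documents by how many times the translated terms occur in them."""
    scores = {}
    for doc_id, text in documents.items():
        words = text.lower().split()
        score = sum(words.count(term) for term in terms)
        if score:
            scores[doc_id] = score
    # Highest score first; ties are broken by document identifier.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    target_terms = translate_query("bank", BILINGUAL_DICT)
    print(target_terms)
    print(monolingual_search(target_terms, DOCUMENTS))
```

Keeping every candidate translation, as this sketch does, is only one possible design choice; it retrieves documents about both senses of an ambiguous term such as "bank".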
`
`34. Query translation was the subject of much of the research on CLIR in
`
`the 1990s. For example, in a 1998 paper published in the book Cross-Language
`
`Information Retrieval, Kiyoshi Yamabana and colleagues presented a method for
`
`translating queries that included an interactive user interface allowing the user to
`
`select from different translations of the query so as to ensure that the term selected
`
`for use in the translated query properly reflected the intended meaning of the query
`
`term. Ex. 1021 at 9-11.
`
`
`2 “Word” and “term” are often used interchangeably in the CLIR literature. When
`
`I use “term” in this Declaration I intend to refer generically to words or multi-word
`
`expressions.
`
`In the figure above, Yamabana et al. show a prompt allowing the user to select
`
`from different translations of the Japanese word jouhou (which, in the interface, is
`
`written using Japanese characters). Id. at 10-11. The translated query is then used
`
`in a monolingual retrieval system. See id. at 11.
`
`35. Another example of the use of query translation for CLIR can be
`
`found in a 1998 paper by Guo-Wei Bian and Hsin-Hsi Chen. Ex. 1022. Bian and
`
`Chen described a system that would automatically translate a user’s query from
`
`Chinese into English, send the query to English language search engines, and
`
`automatically translate the received results back into Chinese. Id. at 11-12.
`
`36. Yet another example of query translation CLIR in the late 1990s was
`
`described in a paper by Joanne Capstick and several colleagues published in
`
`Information Processing & Management in March of 2000. Ex. 1024. Capstick et
`
`al. created a system called MULINEX that allowed a user entering a query in
`
`English, German, or French to obtain search results in any of those languages by
`
`translating their query and searching for documents in the desired language. See
`
`id. at 4.
`
`37. Alternatives to query translation, including what came to be called
`
`“document translation” (which, more precisely, refers to translating terms found in
`
`the documents as they are indexed), or interlingual techniques (in which language-
`
`neutral representations of both the queries and the documents are created) were
`
`also explored in this time frame. The focus on query translation was due, at least
`
`in part, to the fact that queries are typically short and thus can be translated
`
`quickly. This offers advantages for experimental settings (in which researchers
`
`often wish to compare results from multiple system designs quickly, sometimes
`
`over a broad range of parameter settings) and in some operational settings (e.g.,
`
`when a system needs to support numerous query languages, a setting in which it
`
`might be infeasible to build an index for each possible query language).
`
`B. Problems Inherent in the Query Translation Approach
`38. Building a complete and usable system for CLIR using query
`
`translation presents three inherent challenges that must be addressed through
`
`refined solutions in order to maximize the degree to which the search results can be
`
`expected to be relevant to the searcher’s information need: identifying individual
`
`query terms, identifying possible translations for each query term, and determining
`
`how best to use those possible translations. In this section, I discuss each of those
`
`challenges in turn.
`
`1. Identifying individual query terms
`39. The first step of the query-translation process is typically to identify
`
`the individual terms in a user’s query. Even though a person fluent in the language
`
`of the user’s query might be able to easily read a query and point out what they
`
`think of as individual words, a computer must be programmed to use some
`
`automated process to identify the terms on which the search will be based, and it
`
`must do so in some way that is well suited to the retrieval task.
`
`40. In Western languages such as English, French, and Russian in which
`
`words are by convention delimited by spaces or other recognizable characters, the
`
`simplest approach to identifying individual terms is to strip punctuation and then
`
`split the words at “white space” (e.g., spaces, tab characters, or line endings) to
`
`obtain terms (often referred to as “tokens”) that approximate words. See, e.g., Ex.
`
`1020 at 42 (Fluhr ’98). Simply segmenting a string at white space can, however,
`
`split terms that are better thought of as a single term for purposes of translation
`
`(e.g., “high school”). When there are known translations for multi-word
`
`expressions, it can be useful to segment the text in a way that treats a known multi-
`
`word expression as a single term, which requires a more complex approach.
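
The two approaches described in this paragraph can be illustrated with a minimal Python sketch. The function names, the one-entry list of multi-word expressions, and the three-token cap on expression length are assumptions made solely for illustration. The first function strips punctuation and splits at white space; the second re-joins adjacent tokens that form a known multi-word expression.

```python
import string

# Hypothetical list of known multi-word expressions (illustrative only).
MULTIWORD_TERMS = {("high", "school")}


def simple_tokens(query):
    """Strip ASCII punctuation, lowercase, and split at white space."""
    cleaned = query.translate(str.maketrans("", "", string.punctuation))
    return cleaned.lower().split()


def merge_multiword(tokens, multiword_terms=MULTIWORD_TERMS, max_len=3):
    """Join adjacent tokens that form a known multi-word expression."""
    terms = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first; a single token always succeeds.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + n])
            if n == 1 or candidate in multiword_terms:
                terms.append(" ".join(candidate))
                i += n
                break
    return terms


if __name__ == "__main__":
    tokens = simple_tokens("High school, near me?")
    print(tokens)                   # white-space tokens only
    print(merge_multiword(tokens))  # "high school" kept as one term
```

Note that removing punctuation outright, as this sketch does, also fuses hyphenated words, which is one reason practical tokenizers are more elaborate.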
`
`41. A more complex approach is also required for languages in which
`
`words are frequently not separated by white space or punctuation. Examples of
`
`such languages include Chinese, in which sentences are delimited but individual
`
`words are not, and some “freely compounding” languages (e.g., German), in which
`
`it is common to combine words into longer terms that lack internal delimiters.
`
`42. One approach to segmentation in such cases is to start with some list
`
`of words in the language (e.g., from a dictionary) and then to use an algorithm to
`
`identify the “best” way of tiling those terms onto a longer string of characters.
`
`Generally these algorithms differ not in their goal or their basic approach, but
`
`rather in the computational details of how the tiling process is performed. One
`
`particularly simple approach, a type of “greedy” segmentation, is to work from left
`
`to right, repeatedly finding the longest matching dictionary entry.3 For example,
`
`this approach will easily segment the German term “Götterdämmerung” into the
`
`strings “götter” (gods) and “dämmerung” (twilight). More sophisticated
`
`techniques are needed to deal with missing or added characters (e.g., when splitting
`
`the German word “Fahrvergnügen” into the strings “fahren” (driving) and
`
`“vergnügen” (enjoyment)). Such cases can be handled by automatically generating
`
`common transformations (such as fahren to fahr), but at the cost of introducing
`
`additional opportunities to make mistakes.
`
`3 See generally Ex. 1025 (Kwok ’99) at 3-5 (describing a greedy segmentation
`
`approach for Chinese).
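
The left-to-right, longest-match form of greedy segmentation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of any cited reference: the function name `greedy_segment`, the small word lists, and the fallback of emitting a single character when no dictionary entry matches are all hypothetical choices.

```python
def greedy_segment(text, dictionary):
    """Segment text left to right, always taking the longest dictionary match.

    If no dictionary entry starts at the current position, a single
    character is emitted so that the scan can continue (an assumption
    made for this sketch).
    """
    text = text.lower()
    max_len = max((len(word) for word in dictionary), default=1)
    segments = []
    i = 0
    while i < len(text):
        # Try the longest possible piece first, shrinking one character at a time.
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if n == 1 or piece in dictionary:
                segments.append(piece)
                i += n
                break
    return segments


if __name__ == "__main__":
    # Both parts are dictionary entries, so the compound splits cleanly.
    print(greedy_segment("Götterdämmerung", {"gott", "götter", "dämmerung"}))
    # "fahr" is not an entry (only "fahren" is), so the first part
    # falls back to single characters.
    print(greedy_segment("Fahrvergnügen", {"fahren", "vergnügen"}))
```

The second call shows why additional handling is needed when characters are dropped in compounding: without an entry or transformation rule yielding "fahr", the simple matcher cannot recover the first component.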
`
`43. Queries in languages such as Chinese that completely lack word
`
`delimiters pose particular problems. For an English analogue (with spaces
`
`removed), consider the query “tentsandstakes,” which may or may not contain the
`
`word “sand.” Properly handling such cases requires moving beyond greedy
`
`methods to generate all possible segmentations and then testing each to see which
`
`are the most likely to reflect the writer’s intent.4 To deal correctly with such
`
`complex cases, it can be useful to perform syntactic analysis (e.g., “tents and
`
`stakes” is a well formed clause), to leverage simple term counts (e.g., “and” is a
`
`very common word), or to make use of a broader range of corpus statistics (e.g.,
`
`“tent” and “stakes” might rarely be written together near “sand,” perhaps because
`
`tent stakes don’t work well in sand). When the number of options becomes too
`
`large, pruning techniques such as dynamic programming can