`INTELLIGENT
`SYSTEMS
`
`Current Research and Practice in
`Information Extraction and Retrieval
`
`Facebook, Inc. - EXHIBIT 1238
`
`Text-Based Intelligent Systems:
`Current Research and Practice in
`Information Extraction and Retrieval
`
`Edited by Paul S. Jacobs
`Artificial Intelligence Laboratory
`GE Research and Development Center
`
`LAWRENCE ERLBAUM ASSOCIATES, PUBLISHERS
`Hove and London
`Hillsdale, New Jersey
`
`1992
`
`Copyright © 1992 by Lawrence Erlbaum Associates, Inc.
`All rights reserved. No part of this book may be reproduced in any form by
`photostat, microform, retrieval system, or any other means, without the prior written
`permission of the publisher.
`
`Lawrence Erlbaum Associates, Inc., Publishers
`365 Broadway
`Hillsdale, New Jersey 07642
`
`Library of Congress Cataloging-in-Publication Data
`Text-based intelligent systems : current research and practice in information
`extraction and retrieval / edited by Paul S. Jacobs.
`p. cm.
`Includes bibliographical references and index.
`ISBN 0-8058-1188-5. -- ISBN 0-8058-1189-3 (pbk.)
`1. Text processing (Computer science) 2. Natural language processing (Computer
`science) 3. Artificial intelligence. I. Jacobs, Paul Schafran.
`QA76.9.T48T469 1992
`006.3'5--dc20
`92-17802
`
`
`Printed in the United States of America
`10 9 8 7 6 5 4 3 2 1
`
`Contents
`
`1 Introduction: Text Power and Intelligent Systems - Paul S.
`Jacobs, GE Research and Development Center
`
`Part I Broad-Scale NLP
`2 Robust Processing of Real-World Natural-Language Texts -
`Jerry R. Hobbs, Douglas E. Appelt, John Bear, Mabry Tyson, and
`David Magerman, SRI International
`
`3 Combining Weak Methods in Large-Scale Text Processing
`- Yorick Wilks, Louise Guthrie, Joe Guthrie, and Jim Cowie, New
`Mexico State University
`
`4 Mixed-Depth Representations for Natural Language Text -
`Graeme Hirst and Mark Ryan, University of Toronto
`
`5 Robust Partial-Parsing Through Incremental, Multi-Algorithm
`Processing - David D. McDonald, Brandeis University and Content
`Technologies, Inc.
`
`6 Corpus-Based Thematic Analysis
`
`Part II "Traditional" Information Retrieval
`7 Text Retrieval and Inference - W. Bruce Croft, University of
`Massachusetts, and Howard R. Turtle, West Publishing Company
`
`8 Assumptions and Issues in Text-Based Retrieval - Karen Sparck
`Jones, University of Cambridge
`
`9 Text Representation for Intelligent Text Retrieval: A Classification-Oriented
`View - David D. Lewis, University of Chicago
`
`10 Automatic Text Structuring Experiments - Gerard Salton and
`Chris Buckley, Cornell University
`
`Part III Emerging Applications
`11 Statistical Methods, Artificial Intelligence, and Information
`Retrieval - Craig Stanfill and David L. Waltz, Thinking Machines
`Corporation
`
`12 Intelligent High-Volume Text Processing Using Shallow, Domain-Specific
`Techniques - Philip J. Hayes, Carnegie Group
`
`13 Automatically Constructing Simple Help Systems from Natural
`Language Documentation - Yoëlle S. Maarek, IBM T.J.
`Watson Research Center
`
`14 Direction-Based Text Interpretation as an Information Access
`Refinement - Marti A. Hearst, University of California, Berkeley
`
`Index
`
`Preface
`
`This volume started with a Symposium in 1990, sponsored by AAAI and titled
`"Text-Based Intelligent Systems". The push for this get-together, which
`included about 50 scientists with a variety of backgrounds, was a rapidly
`emerging set of technologies for exploiting the massive quantity of textual
`information that has become increasingly available through advances in
`computing technology.
`The challenge for this group was to explore new ways to take advantage
`of the power of on-line text. We intuit that a billion words of text can be
`a lot more generally useful than a few hundred logical rules, if we can use
`advanced computation (1) to extract useful information from streams of text,
`and (2) to help find (retrieve) what we need in the sea of available material.
`The extraction task has become a hot topic for the field of Natural Language
`Processing, while the retrieval task has been solidly in the field of Information
`Retrieval. These two disciplines came together at our Symposium, and have
`been cross-breeding more than ever.
`This text has gone to press very quickly, in order to provide a "snapshot"
`of current research and practice and to help others to contribute to this new
`discipline. In fact, enough has happened since the 1990 Symposium that
`the papers in this book bear little resemblance to the original versions
`presented. Since then, there have been some new commercial applications, the
`government has undertaken a substantial research program called TIPSTER
`along with a series of formal evaluations (known as MUC and TREC) for
`testing text-processing technologies, and computer programs have scaled up
`from handling a few texts in simple domains to getting useful information
`out of millions of words of naturally occurring text. The contributors here
`are representative of the individuals, groups, and approaches that are behind
`this progress.
`Not all the contributors here like the word "intelligent" in the title. It
`is meant not to ascribe any real intelligence to our programs, but rather to
`connote the innovative nature of the work. The systems are meant to be
`fast, effective, and helpful, but "Text-Based Fast, Effective, Helpful Systems"
`does not roll off the tongue as well, so we have chosen Text-Based Intelligent
`Systems (TBIS) to represent the nature of the science and the applications.
`The book is organized in three parts. The first group of papers describes
`the current set of natural language (NL) processing techniques that are used
`for interpreting and extracting information from quantities of text. The second
`group gives some of the historical perspective, methodology, and current
`practice of Information Retrieval (IR) work. The third set covers some of the
`current and emerging applications.
`The volume is aimed at an audience of computer professionals who have
`at least some knowledge of natural language and IR, but it has also been
`prepared with advanced students in mind. While there are now good texts
`in both NL and IR, the changes in both fields have been substantial enough
`that the texts do not capture much of current practice with respect to TBIS.
`This collection of readings should give students and scientists alike a good
`idea of the current techniques as well as a general concept of how to go about
`developing and testing systems to handle volumes of text.
`This work is the result of the cooperative efforts of the contributors, to
`whom I am indebted for their timely and appropriate response. Every word
`has been prepared, submitted, reviewed, and typeset electronically (in "soft
`copy") to keep the material current and correct. I am also thankful to AAAI
`for supporting the original Symposium, to Norm Sondheimer and the
`Artificial Intelligence Laboratory at GE for helping to promote this type of
`work, and to Lisa Rau for helping me put everything together.
`
`Introduction: Text Power and
`Intelligent Systems
`
`Paul S. Jacobs
`Artificial Intelligence Program
`GE Research and Development Center
`Schenectady, NY 12301 USA
`
`1.1 A New Opportunity
`Huge quantities of readily available on-line text raise new challenges and
`opportunities for artificial intelligence systems. The ease of acquiring text
`knowledge suggests replacing, or at least augmenting, knowledge-based sys-
`tems with "text-based" intelligence wherever possible. Making use of this
`text knowledge demands more work in robust processing, retrieval, and pre-
`sentation of information, but raises a host of new applications of AI tech-
`nologies, where on-line information exists but knowledge bases do not.
`Most AI programs have failed to "scale up" because of the difficulty of
`developing large, robust knowledge bases. At the same time, rapid advances
`in networks and information storage now provide access to knowledge bases
`millions of times larger, in text form. No knowledge representation claims
`the expressive power or the compactness of this raw text. The next generation
`of AI applications, therefore, may well be "text-based" rather than knowledge
`based, deriving more power from large quantities of stored text than from
`hand-crafted rules.
`Text-based intelligent systems can combine artificial intelligence tech-
`niques with more robust but "shallower" methods. Natural language process-
`ing (NLP) research has been hampered, on the one hand, by the limitations
`of deep systems that work only on a very small number of texts (often only
`one), and, on the other hand, by the failure of more mature technologies,
`such as parsing, to apply to practical systems. Information retrieval (IR)
`systems offer a vehicle where selected NLP methods can produce useful
`results; hence, there is a natural and potentially important marriage between
`IR and NLP. This synergy extends beyond the traditional realms of either
`technology to a variety of emerging applications.
`As examples, we must consider what a knowledge-based system can offer
`in the domain of medical diagnosis, on-line operating systems, fault diagnosis
`in engines, or financial advising, that cannot be found in a medical textbook,
`a user's manual, a design specification, or a tax preparation handbook.
`Computers should help make the right information from these documents ac-
`cessible and comprehensible to the user. Harnessing the power of volumes
`of available text (through information retrieval, natural language analysis,
`knowledge representation, and conceptual information extraction) will pose
`a major challenge for AI into the next century.
`Advocates of the text-based approach to intelligent systems must accept
`its inherent limitations. Some of the traditional AI problems, such as rea-
`soning, inference, and pragmatics, must necessarily play a limited role. But
`there is evidence of substantial progress in building robust text processing
`systems that rely more heavily on shallower methods. The rest of this paper
`describes the combination of applications, methodologies and techniques that
`forms the backbone of work on Text-Based Intelligent Systems.
`
`1.2 A New Name
`To merit their own label, "text-based intelligent systems" must suggest some-
`thing distinctly different from prevailing research. As the introduction has
`implied, a text-based intelligent system (TBIS) is a program that derives
`its power from large quantities of raw text, in an intelligent manner. Such
`systems differ from traditional information retrieval systems in that they
`must be more flexible and responsive, possibly segmenting, combining, or
`synthesizing a response rather than just retrieving texts. The systems differ
`from traditional natural language programs in that they must be much more
`robust.
`The category of text-based intelligent systems includes, for example:
`
`• Text extraction systems: programs that analyze volumes of unstructured
`text, selecting certain features from the text and potentially
`storing such features in a structured form. These systems currently
`exist in limited domains. Examples of this type of system are news
`reading programs [Jacobs and Rau, 1990] (see the papers by Hobbs et
`al. and McDonald in this volume), database generation programs that
`produce fixed-field information from free text, and transaction handling
`programs, such as those that read banking transfer messages [Lytinen
`and Gershman, 1986; Young and Hayes, 1985].
`
`• Automated indexing and hypertext: knowledge-based programs that
`determine key terms and topics by which to select texts or portions
`of text [Jonak, 1984] or automatically link portions of text that relate
`to one another (see the paper by Salton and Buckley in this volume).
`
`• Summarization and abstracting: programs that integrate multiple texts
`that repeat, correct, or augment one another, as in following the course
`of a news story over time such as a corporate merger or political event
`[Rau, 1987].
`
`• Intelligent information retrieval: systems with enhanced information
`retrieval capabilities, through robust query processing, user modeling,
`or limited inference [IPM, 1987] (see also the paper by Croft and Turtle
`in this volume).
`This volume contains position papers covering all of the topics above,
`along with discussions of underlying problems in constructing TBIS's, such
`as the representation and storage of knowledge about texts or about lan-
`guage, and robust text processing techniques. Many of the positions describe
`research related to substantial systems in one of the above categories, and
`virtually all address the issue of robust processing of some sort. The next
`section describes the apparent methodological themes of this sort of research.
`
`1.3 No More "Donkeys"
`Much of this research combines the discipline of information retrieval with
`some of the techniques of natural language processing. Historically, the
`methodology of information retrieval has been to develop new methods and
`conduct experiments to compare those methods with other approaches. By
`contrast, the methodology of natural language processing has been either to
`develop theories that apply to broad but carefully selected linguistic phenom-
`ena, or to develop programs that apply to carefully selected texts. In other
`words, there has been very little effort within natural language to produce
`results such as "This program performs the following task with 95% accuracy
`on the following set of 1000 texts".
`As a result of its more theoretical orientation, natural language as a field
`has devoted much of its attention to paradigmatic but improbable examples.
`Researchers in natural language were trained to think about contrived
`sentences such as "Every man who owns a donkey beats it" or "The box is in the
`pen." These are so familiar that one might stand up with a question at the
`end of a presentation and ask, "But what about the 'donkey' sentences?"
`Researchers are acquainted enough with the examples that they needn't be
`repeated, in spite of the fact that they hardly seem representative of examples
`or problems that we might encounter.
`The current methodological shift in the experimental element of natural
`language processing (by no means the dominant segment of the field) brings
`text processing, as experimental computer science, closer to information re-
`trieval. Rather than seek out examples that support or challenge theories,
`the experimental methodology uses sets of naturally occurring examples as
`test cases, possibly ignoring certain interesting problems that simply do not
`occur in a particular task. While this approach has some disadvantages, it
`has the benefit of focussing work on the issues in natural language processing
`that inhibit robustness.
`Another example of the experimental shift is the area of language ac-
`quisition. During the 1970's and most of the 1980's, the field of language
`acquisition concentrated on the techniques through which knowledge, espe-
`cially grammatical knowledge, could be acquired. The result of this effort
`was a host of theories and techniques, but very little in the way of sizable
`knowledge bases. Recently, however, the research focus in language acquisi-
`tion has been on achieving the goal of acquisition rather than on the process,
`resulting in extensive lexicons and knowledge bases for use in processing texts
`[Zernik, 1991].
`While the methodology of natural language may be drifting toward in-
`formation retrieval, information retrieval is slowly changing in focus. The
`extreme difficulty of producing significant improvements using traditional
`document retrieval metrics suggests exploring new retrieval strategies as well
`as devising new measures. As the combined fields of natural language process-
`ing and information retrieval continue to make progress, the demand grows
`for test collections and metrics that evaluate meaningful tasks, including not
`only the accuracy of document retrieval, but also the accuracy, speed, trans-
`portability, and ease of use of systems that perform functions such as those
`outlined in the previous section. This new direction involves the constant
`interplay of two goals: (1) produce new measurable results and (2) produce
`new measures of new results.
`The resulting experimental methodology has spawned a host of research
`projects emphasizing robust processing, large-scale systems, knowledge ac-
`quisition, and performance evaluation. As the new research is still taking
`shape, one shouldn't expect any breakthroughs as yet. The next section
`considers the limited progress that has already resulted.
`
`1.4 Where We Are Now
`
`While text-based intelligent systems are very much a futuristic concept, the
`recent emphasis on experiment and performance has brought some noticeable
`changes during the last several years:
`
`• Evaluation:
`In government, academia, and industry, the desire for results has led
`to new metrics for evaluating system performance. While metrics and
`benchmarks often spark debate, they also show clear progress. For
`example, a government-sponsored message processing conference three
`years ago featured a small set of programs performing different functions
`in different domains, while a more recent similar conference included
`nine substantial programs performing a common task on a set
`of over 100 real messages, and produced meaningful results [Sundheim,
`1989] (see Hobbs et al., this volume). New evaluation metrics have also
`appeared in other tasks, such as text categorization (cf. Hayes, this
`volume).
`
`• Scale:
`Natural language programs typically have operated on a handful of
`texts; recently, programs have emerged that process streams of hun-
`dreds of thousands of words or more, depending on the level of semantic
`processing. Along with their broader capabilities, the knowledge bases
`that such programs use have been expanding. While a typical lexicon
`recently might have included 100 or 200 words, many systems now have
`real lexicons of 10,000 roots or more.
`
`• Commercialization:
`The number of industrial scientists represented in this volume is an
`indicator of the emerging commercial applications of robust text pro-
`cessing and information retrieval technology, as is the increasing num-
`ber of commercially available systems. Many commercial applications
`that formerly used relational databases or other structured knowledge
`sources are shifting to textual databases because of the availability of
`on-line text information, and many hardware and software vendors are
`packaging their products with substantial text databases. These prod-
`ucts generally do not employ the sort of technology discussed here, but
`do provide a vehicle for the ultimate application of the technology.
`
`• Cooperation and Competition:
`Until recently, schools of thought in text processing and information
`retrieval were dogmatic enough to ignore most other related work. In
`many areas, recent projects have spawned cooperative efforts in col-
`lecting data and lexical knowledge, assembling test collections, and co-
`operating between industry and academia. Competition, on the other
`hand, was never allowed because of the general lack of evaluation cri-
`teria. Now there is a growing interest in holding "showdowns" that
`objectively compare different methods.
`
`While there has been some visible progress toward text-based intelligent
`systems, we aren't very close to a desirable state of technology. The next
`section addresses some of the obstacles we must overcome.
`
`1.5 Why We Aren't There Yet
`Many of us have workstations on top of our desks that have access via computer
`networks to trillions of words of text: encyclopedias, almanacs, dictionaries,
`literature, news, and electronic bulletin boards. Ironically, we are
`loath to attempt to use most of this information because a combination of
`factors (mainly the difficulty of finding any particular bit of knowledge we
`desire) makes it a gross waste of time.
`Much of this problem in crudeness of information access boils down to
`issues that are relatively mundane, having little to do with text content: the
`speed of transmission across networks, compatibility of hardware, security,
`legal and copyright concerns, the lack of standards for storing and trans-
`mitting on-line text, etc. As the motivation for using on-line text helps to
`dissolve some of these issues, we can hope for better opportunities to use the
`advanced technologies for content analysis that are reported here.
`In addition to these mundane communication and standardization issues,
`there is a more relevant problem of how to market the technology that we
`are developing. Too often we ignore the strengths of the competition: in
`this case, simple text search, Boolean query, and keyword retrieval methods.
`While these simpler methods lack the power and intuitive appeal, say, of
`natural language analysis or concept-based information retrieval, they have
`certain features that appeal to users of large text databases: they are fast,
`portable, relatively inexpensive, and relatively easy to learn. The techniques
`are compatible with many software packages, run on many hardware plat-
`forms, and are easier to implement in hardware. By contrast, natural lan-
`guage processing can be slow, brittle, and expensive. In order to bring the
`technology to the marketplace in the near future (such as the next dozen
`or so years), we will either have to minimize these disadvantages or prove
`dramatic improvements over simpler methods.
`Some key technical barriers stand in the way of the all-knowing desktop
`librarian. These technical barriers will form some of the focal points of the
`research reported in this volume as well as the progress that is likely to be
`made in the rest of the century. Four such issues are (1) robustness of analysis,
`(2) retrieval strategy, (3) presentation of information, and (4) cultivation of
`applications. The next section will outline the technical challenges in each of
`these areas.
`
`1.6 Challenges for the 1990's
`The intelligent access to information from texts is the central theme of this
`research. The following are some of the key thrusts of this theme, including
`the topics of many of the papers here:
`
`• Robustness:
`The next generation of language analyzers must do much of the same
`sort of processing that current systems do, but must do it more ac-
`curately, faster, and with less domain-dependent knowledge. Robust-
`ness applies both to extending techniques that are already robust, such
`as parsing and morphology, and to increasing the robustness of more
`knowledge-intensive techniques, such as semantic analysis.
`
`• Retrieval Strategy:
`Current retrieval methods are oriented toward the retrieval of documents,
`not information in general. Text-based systems must address
`the broader issue of satisfying the information needs of many different
`systems and users. Within this broader information processing context,
`the concept of success must be redefined to be more than reproducing
`"relevant" texts, and new retrieval strategies must address this new no-
`tion of success. For example, if a user wants to know a specific piece of
`information and the system produces an extremely long text containing
`relevant information, this is somehow not as good as producing a direct
`answer to the user's question.
`
`• Presentation:
`A big problem with on-line text retrieval is that people do not like
`to read. On-line text is even harder to read than printed material.
`Current systems depend on users' reading skills rather than presenting
`information that satisfies a user's needs. We have only begun to
`address the many different ways textual information can be effectively
`displayed. For example, hypertext systems can link together pieces of
`text from different parts of a document or different documents, making
`it easier for the user to control the presentation. For all the "hype"
`that hypertext has received, we have a long way to go in presenting
`texts intelligently: for example, generating a summary by combining
`different portions of text, highlighting sections of text that contain in-
`formation that is asked for, or compressing a text so that only key
`portions appear. Many of these techniques must be developed to suit
`the requirements of new applications.
`
`• Applications:
`One of the limitations of information retrieval research is that it has
`narrowly defined its territory, possibly overlooking appropriate appli-
`cation areas. Many different types of content-based text applications
`have already emerged, such as routing (selective dissemination of in-
`formation), text categorization, database generation, and transaction
`handling. The range of application areas continues to grow. Some
`provocative application areas are: skimming news stories about politi-
`cal issues to determine whether a figure is "for" or "against" (cf. Hearst,
`this volume); selecting and ordering requirements from a large software
`specification; and generating a help system from on-line documentation
`(Maarek, this volume). Research in text-based systems must consider
`these new testbed applications along with the underlying technical is-
`sues.
`
`While each of these areas poses some substantive problems, text-based
`systems are bound to grow steadily in their capabilities. After all, the use of
`information retrieval systems is expanding in spite of relatively poor accu-
`racy. It's a good bet that many of the developments in text-based intelligent
`systems will pan out as they apply more robust methods to use the increasing
`power of on-line text.
`
`1.7 Summary
`The emerging field of text-based intelligent systems marries the content-based
`analysis of natural language processing with the experimental methodology of
`information retrieval. This combination can overcome many of the limitations
`of current knowledge-based systems by applying shallow methods of analysis
`to huge bodies of text. This new focus has already produced an expansion
`in robust text processing capabilities, and is likely to produce a wave of
`maturing applications in the next decade.
`
`Bibliography
`[IPM, 1987] Information Processing and Management, Special Issue on Artificial
`Intelligence for Information Retrieval, 23(4), 1987.
`
`[Jacobs and Rau, 1990] Paul Jacobs and Lisa Rau. SCISOR: Extracting information
`from on-line news. Communications of the Association for Computing
`Machinery, 33(11):88-97, November 1990.
`
`[Jonak, 1984] Zdenek Jonak. Automatic indexing of full texts. Information
`Processing and Management, 20(5-6):619-627, 1984.
`
`[Lytinen and Gershman, 1986] Steven Lytinen and Anatole Gershman.
`ATRANS: Automatic processing of money transfer messages. In Proceedings
`of the Fifth National Conference on Artificial Intelligence, pages
`1089-1093, Philadelphia, 1986.
`
`[Rau, 1987] Lisa F. Rau. Information retrieval in never-ending stories. In
`Proceedings of the Sixth National Conference on Artificial Intelligence,
`pages 317-321, Seattle, Washington, July 1987. Morgan Kaufmann Inc.
`
`[Sundheim, 1989] Beth Sundheim. Second message understanding (MUCK-
`II) report. Technical Report 1328, Naval Ocean Systems Center, San Diego,
`CA, 1989.
`
`[Young and Hayes, 1985] S. Young and P. Hayes. Automatic classification
`and summarization of banking telexes. In The Second Conference on Ar-
`tificial Intelligence Applications, pages 402-208, IEEE Press, 1985.
`
`[Zernik, 1991] U. Zernik, editor. Lexical Acquisition: Using On-Line Resources
`to Build a Lexicon. Lawrence Erlbaum Associates, Hillsdale, NJ,
`1991.
`
`Part I: Broad-Scale NLP
`
`Two forces drive the emergence of text-based systems: the power of on-
`line text and the increased ability of computers to process text. This section
`covers the techniques that have changed the way computers interpret texts
`in recent years, from increased coverage and completeness of traditional lin-
`guistic processing to the integration of statistical or "weak" methods with
`deeper interpretation.
`The paper by Hobbs et al. argues that augmenting the detailed models of
`parsing and inference that have been explored in the past can provide much
`of what's needed to extract information from quantities of real text. Wilks et
`al. and Hirst and Ryan lean more heavily on weak methods, while McDonald
`presents an alternative model of parsing. The Zernik paper gives one view
`of how weak methods can aid, rather than replace, linguistic processing.
`
`Robust Processing of Real-World
`Natural-Language Texts
`Jerry R. Hobbs, Douglas E. Appelt, John Bear,
`Mabry Tyson, and David Magerman
`
`Artificial Intelligence Center
`SRI International
`Menlo Park, California
`
`Abstract
`It is often assumed that when natural language processing meets the
`real world, the ideal of aiming for complete and correct interpretations
`has to be abandoned. However, our experience with TACITUS, espe-
`cially in the MUC-3 evaluation, has shown that principled techniques
`for syntactic and pragmatic analysis can be bolstered with methods for
`achieving robustness. We describe and evaluate a method for dealing
`with unknown words and a method for filtering out sentences irrele-
`vant to the task. We describe three techniques for making syntactic
`analysis more robust: an agenda-based scheduling parser, a recovery
`technique for failed parses, and a new technique called terminal sub-
`string parsing. For pragmatics processing, we describe how the method
`of abductive inference is inherently robust, in that an interpretation
`is always possible, so that in the absence of the required world knowl-
`edge, performance degrades gracefully. Each of these techniques has
`been evaluated, and the results of the evaluations are presented.
`
`2.1 Introduction
`If automatic text processing is to be a useful enterprise, it must be demon-
`strated that the completeness and accuracy of the information extracted is
`adequate for the application one has in mind. While it is clear that certain
`applications require only a minimal level of competence from a system, it is
`also true that many applications require a very high degree of completeness
`and accuracy, and an increase in capability in either area is a clear advantage.
`Therefore, we adopt an extremely high standard against which the perfor-
`mance of a text processing system should be measured: it should recover all
`information that is implicitly or explicitly present in the text, and it should
`do so without making mistakes.
`
`This standard is far beyond the state of the art. It is an impossibly high
`standard for human beings, let alone machines. However, progress toward
`adequate text processing is best served by setting ambitious goals. For this
`reason we believe that, while it may be necessary in the intermediate term
`to settle for results that are far short of this ultimate goal, any linguistic
`theory or system architecture that is adopted should not be demonstrably
`inconsistent with attaining this objective. However, if one is interested, as
`we are, in the potentially successful application of these intermediate-term
`systems to real problems, it is impossible to ignore the question of whether
`they can be made efficient enough and robust enough for actual applications.
`
`2.1.1 The TACITUS System
`The TACITUS text processing system has been under development at SRI
`International for the last six years. This system has been designed as a
`first step toward the realization of a system with very high completeness
`and accuracy in its ability to extract information from text. The general
`philosophy underlying the design of this system is that the system, to the
`maximum extent possible, should not discard any information that might be
`semantically or pragmatically relevant to a full, correct interpretation. The
`effect of this design philosophy on the system arch