TEXT-BASED INTELLIGENT SYSTEMS

Current Research and Practice in
Information Extraction and Retrieval

Facebook, Inc. - EXHIBIT 1031

Text-Based Intelligent Systems:
Current Research and Practice in
Information Extraction and Retrieval

Edited by Paul S. Jacobs
Artificial Intelligence Laboratory
GE Research and Development Center

LAWRENCE ERLBAUM ASSOCIATES, PUBLISHERS
Hove and London
Hillsdale, New Jersey

1992

Copyright © 1992 by Lawrence Erlbaum Associates, Inc.
All rights reserved. No part of this book may be reproduced in any form by
photostat, microform, retrieval system, or any other means, without the prior
written permission of the publisher.

Lawrence Erlbaum Associates, Inc., Publishers
365 Broadway
Hillsdale, New Jersey 07642

Library of Congress Cataloging-in-Publication Data
Text-based intelligent systems: current research and practice in information
extraction and retrieval / edited by Paul S. Jacobs.
p. cm.
Includes bibliographical references and index.
ISBN 0-8058-1188-5. -- ISBN 0-8058-1189-3 (pbk.)
1. Text processing (Computer science) 2. Natural language processing
(Computer science) 3. Artificial intelligence. I. Jacobs, Paul Schafran
QA76.9.T48T469 1992
006.3'5-dc20
92-17802
CIP

Printed in the United States of America
10 9 8 7 6 5 4 3 2 1
Contents

1 Introduction: Text Power and Intelligent Systems - Paul S. Jacobs,
  GE Research and Development Center

Part I Broad-Scale NLP
2 Robust Processing of Real-World Natural-Language Texts - Jerry R. Hobbs,
  Douglas E. Appelt, John Bear, Mabry Tyson, and David Magerman,
  SRI International
3 Combining Weak Methods in Large-Scale Text Processing - Yorick Wilks,
  Louise Guthrie, Joe Guthrie, and Jim Cowie, New Mexico State University
4 Mixed-Depth Representations for Natural Language Text - Graeme Hirst and
  Mark Ryan, University of Toronto
5 Robust Partial-Parsing Through Incremental, Multi-Algorithm Processing -
  David D. McDonald, Brandeis University and Content Technologies, Inc.
6 Corpus-Based Thematic Analysis

Part II "Traditional" Information Retrieval
7 Text Retrieval and Inference - W. Bruce Croft, University of
  Massachusetts, and Howard R. Turtle, West Publishing Company
8 Assumptions and Issues in Text-Based Retrieval - Karen Sparck Jones,
  University of Cambridge
9 Text Representation for Intelligent Text Retrieval: A
  Classification-Oriented View - David D. Lewis, University of Chicago
10 Automatic Text Structuring Experiments - Gerard Salton and Chris Buckley,
  Cornell University

Part III Emerging Applications
11 Statistical Methods, Artificial Intelligence, and Information Retrieval -
  Craig Stanfill and David L. Waltz, Thinking Machines Corporation
12 Intelligent High-Volume Text Processing Using Shallow, Domain-Specific
  Techniques - Philip J. Hayes, Carnegie Group
13 Automatically Constructing Simple Help Systems from Natural Language
  Documentation - Yoëlle S. Maarek, IBM T.J. Watson Research Center
14 Direction-Based Text Interpretation as an Information Access Refinement -
  Marti A. Hearst, University of California, Berkeley

Index

Preface

This volume started with a Symposium in 1990, sponsored by AAAI and titled
"Text-Based Intelligent Systems". The push for this get-together, which
included about 50 scientists with a variety of backgrounds, was a rapidly
emerging set of technologies for exploiting the massive quantity of textual
information that has become increasingly available through advances in
computing technology.

The challenge for this group was to explore new ways to take advantage of the
power of on-line text. We intuit that a billion words of text can be a lot
more generally useful than a few hundred logical rules, if we can use advanced
computation (1) to extract useful information from streams of text, and (2) to
help find (retrieve) what we need in the sea of available material. The
extraction task has become a hot topic for the field of Natural Language
Processing, while the retrieval task has been solidly in the field of
Information Retrieval. These two disciplines came together at our Symposium,
and have been cross-breeding more than ever.

This text has gone to press very quickly, in order to provide a "snapshot" of
current research and practice and to help others to contribute to this new
discipline. In fact, enough has happened since the 1990 Symposium that the
papers in this book bear little resemblance to the original versions
presented. Since then, there have been some new commercial applications, the
government has undertaken a substantial research program called TIPSTER along
with a series of formal evaluations (known as MUC and TREC) for testing
text-processing technologies, and computer programs have scaled up from
handling a few texts in simple domains to getting useful information out of
millions of words of naturally occurring text. The contributors here are
representative of the individuals, groups, and approaches that are behind this
progress.

Not all the contributors here like the word "intelligent" in the title: It is
meant not to ascribe any real intelligence to our programs, but rather to
connote the innovative nature of the work. The systems are meant to be fast,
effective, and helpful, but "Text-based Fast, Effective, Helpful Systems" does
not roll off the tongue as well, so we have chosen Text-Based Intelligent
Systems (TBIS) to represent the nature of the science and the applications.

The book is organized in three parts. The first group of papers describes the
current set of natural language (NL) processing techniques that are used for
interpreting and extracting information from quantities of text. The second
group gives some of the historical perspective, methodology, and current
practice of Information Retrieval (IR) work. The third set covers some of the
current and emerging applications.

The volume is aimed at an audience of computer professionals who have at least
some knowledge of natural language and IR, but it has also been prepared with
advanced students in mind. While there are now good texts in both NL and IR,
the changes in both fields have been substantial enough that the texts do not
capture much of current practice with respect to TBIS. This collection of
readings should give students and scientists alike a good idea of the current
techniques as well as a general concept of how to go about developing and
testing systems to handle volumes of text.

This work is the result of the cooperative efforts of the contributors, to
whom I am indebted for their timely and appropriate response. Every word has
been prepared, submitted, reviewed, and typeset electronically (in "soft
copy") to keep the material current and correct. I am also thankful for the
support of AAAI for the original Symposium, to Norm Sondheimer and the
Artificial Intelligence Laboratory at GE for helping to promote this type of
work, and to Lisa Rau for helping me put everything together.

Introduction: Text Power and Intelligent Systems

Paul S. Jacobs
Artificial Intelligence Program
GE Research and Development Center
Schenectady, NY 12301 USA

1.1 A New Opportunity

Huge quantities of readily available on-line text raise new challenges and
opportunities for artificial intelligence systems. The ease of acquiring text
knowledge suggests replacing, or at least augmenting, knowledge-based systems
with "text-based" intelligence wherever possible. Making use of this text
knowledge demands more work in robust processing, retrieval, and presentation
of information, but raises a host of new applications of AI technologies,
where on-line information exists but knowledge bases do not.

Most AI programs have failed to "scale up" because of the difficulty of
developing large, robust knowledge bases. At the same time, rapid advances in
networks and information storage now provide access to knowledge bases
millions of times larger, in text form. No knowledge representation claims the
expressive power or the compactness of this raw text. The next generation of
AI applications, therefore, may well be "text-based" rather than knowledge
based, deriving more power from large quantities of stored text than from
hand-crafted rules.

Text-based intelligent systems can combine artificial intelligence techniques
with more robust but "shallower" methods. Natural language processing (NLP)
research has been hampered, on the one hand, by the limitations of deep
systems that work only on a very small number of texts (often only one), and,
on the other hand, by the failure of more mature technologies, such as
parsing, to apply to practical systems. Information retrieval (IR) systems
offer a vehicle where selected NLP methods can produce useful results; hence,
there is a natural and potentially important marriage between IR and NLP. This
synergy extends beyond the traditional realms of either technology to a
variety of emerging applications.

As examples, we must consider what a knowledge-based system can offer in the
domain of medical diagnosis, on-line operating systems, fault diagnosis in
engines, or financial advising, that cannot be found in a medical textbook, a
user's manual, a design specification, or a tax preparation handbook.
Computers should help make the right information from these documents
accessible and comprehensible to the user. Harnessing the power of volumes of
available text, through information retrieval, natural language analysis,
knowledge representation, and conceptual information extraction, will pose a
major challenge for AI into the next century.

Advocates of the text-based approach to intelligent systems must accept its
inherent limitations. Some of the traditional AI problems, such as reasoning,
inference, and pragmatics, must necessarily play a limited role. But there is
evidence of substantial progress in building robust text processing systems
that rely more heavily on shallower methods. The rest of this paper describes
the combination of applications, methodologies and techniques that forms the
backbone of work on Text-Based Intelligent Systems.
`
1.2 A New Name

To merit their own label, "text-based intelligent systems" must suggest
something distinctly different from prevailing research. As the introduction
has implied, a text-based intelligent system (TBIS) is a program that derives
its power from large quantities of raw text, in an intelligent manner. Such
systems differ from traditional information retrieval systems in that they
must be more flexible and responsive, possibly segmenting, combining, or
synthesizing a response rather than just retrieving texts. The systems differ
from traditional natural language programs in that they must be much more
robust.

The category of text-based intelligent systems includes, for example:

• Text extraction systems: programs that analyze volumes of unstructured
  text, selecting certain features from the text and potentially storing such
  features in a structured form. These systems currently exist in limited
  domains. Examples of this type of system are news reading programs [Jacobs
  and Rau, 1990] (see the papers by Hobbs et al. and McDonald in this volume),
  database generation programs that produce fixed-field information from free
  text, and transaction handling programs, such as those that read banking
  transfer messages [Lytinen and Gershman, 1986; Young and Hayes, 1985].

• Automated indexing and hypertext: knowledge-based programs that determine
  key terms and topics by which to select texts or portions of text [Jonak,
  1984] or automatically link portions of text that relate to one another (see
  the paper by Salton and Buckley in this volume).

• Summarization and abstracting: programs that integrate multiple texts that
  repeat, correct, or augment one another, as in following the course of a
  news story over time such as a corporate merger or political event [Rau,
  1987].

• Intelligent information retrieval: systems with enhanced information
  retrieval capabilities, through robust query processing, user modeling, or
  limited inference [IPM, 1987] (see also the paper by Croft and Turtle in
  this volume).

This volume contains position papers covering all of the topics above, along
with discussions of underlying problems in constructing TBIS's, such as the
representation and storage of knowledge about texts or about language, and
robust text processing techniques. Many of the positions describe research
related to substantial systems in one of the above categories, and virtually
all address the issue of robust processing of some sort. The next section
describes the apparent methodological themes of this sort of research.
`
1.3 No More "Donkeys"

Much of this research combines the discipline of information retrieval with
some of the techniques of natural language processing. Historically, the
methodology of information retrieval has been to develop new methods and
conduct experiments to compare those methods with other approaches. By
contrast, the methodology of natural language processing has been either to
develop theories that apply to broad but carefully selected linguistic
phenomena, or to develop programs that apply to carefully selected texts. In
other words, there has been very little effort within natural language to
produce results such as "This program performs the following task with 95%
accuracy on the following set of 1000 texts".
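
As an illustration of the kind of result quoted above, the following is a
minimal sketch in Python of scoring a program's outputs against a fixed answer
key for a test collection. The function name `accuracy` and the toy category
labels are assumptions made up for this example; they do not come from any
system or evaluation described in this volume.

```python
def accuracy(system_outputs, answer_key):
    """Fraction of test texts for which the system output matches the key."""
    if len(system_outputs) != len(answer_key):
        raise ValueError("outputs and answer key must cover the same texts")
    if not answer_key:
        raise ValueError("the test collection is empty")
    correct = sum(1 for got, want in zip(system_outputs, answer_key)
                  if got == want)
    return correct / len(answer_key)


# Hypothetical labels for a four-text collection (illustrative only).
answer_key = ["merger", "merger", "other", "takeover"]
system_outputs = ["merger", "other", "other", "takeover"]

score = accuracy(system_outputs, answer_key)
print(f"{score:.0%} accuracy on {len(answer_key)} texts")  # 75% accuracy on 4 texts
```

The point of the sketch is only that the test set is fixed in advance and the
score is reported together with its size, so that two programs run on the same
collection can be compared directly.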

As a result of its more theoretical orientation, natural language as a field
has devoted much of its attention to paradigmatic but improbable examples.
Researchers in natural language were trained to think about contrived
sentences: "Every man who owns a donkey beats it" or "The box is in the pen."
These are so familiar that one might stand up with a question at the end of a
presentation and ask, "But what about the 'donkey' sentences?" Researchers are
acquainted enough with the examples that they needn't be repeated, in spite of
the fact that they hardly seem representative of examples or problems that we
might encounter.

The current methodological shift in the experimental element of natural
language processing (by no means the dominant segment of the field) brings
text processing, as experimental computer science, closer to information
retrieval. Rather than seek out examples that support or challenge theories,
the experimental methodology uses sets of naturally occurring examples as test
cases, possibly ignoring certain interesting problems that simply do not occur
in a particular task. While this approach has some disadvantages, it has the
benefit of focussing work on the issues in natural language processing that
inhibit robustness.

Another example of the experimental shift is the area of language acquisition.
During the 1970's and most of the 1980's, the field of language acquisition
concentrated on the techniques through which knowledge, especially grammatical
knowledge, could be acquired. The result of this effort was a host of theories
and techniques, but very little in the way of sizable knowledge bases.
Recently, however, the research focus in language acquisition has been on
achieving the goal of acquisition rather than on the process, resulting in
extensive lexicons and knowledge bases for use in processing texts [Zernik,
1991].

While the methodology of natural language may be drifting toward information
retrieval, information retrieval is slowly changing in focus. The extreme
difficulty of producing significant improvements using traditional document
retrieval metrics suggests exploring new retrieval strategies as well as
devising new measures. As the combined fields of natural language processing
and information retrieval continue to make progress, the demand grows for test
collections and metrics that evaluate meaningful tasks, including not only the
accuracy of document retrieval, but also the accuracy, speed, transportability,
and ease of use of systems that perform functions such as those outlined in
the previous section. This new direction involves the constant interplay of
two goals: (1) produce new measurable results and (2) produce new measures of
new results.

The resulting experimental methodology has spawned a host of research projects
emphasizing robust processing, large-scale systems, knowledge acquisition, and
performance evaluation. As the new research is still taking shape, one
shouldn't expect any breakthroughs as yet. The next section considers the
limited progress that has already resulted.
`
1.4 Where We Are Now

While text-based intelligent systems are very much a futuristic concept, the
recent emphasis on experiment and performance has brought some noticeable
changes during the last several years:

• Evaluation:
  In government, academia, and industry, the desire for results has led to new
  metrics for evaluating system performance. While metrics and benchmarks
  often spark debate, they also show clear progress. For example, a
  government-sponsored message processing conference three years ago featured
  a small set of programs performing different functions in different domains,
  while a more recent similar conference included nine substantial programs
  performing a common task on a set of over 100 real messages, and produced
  meaningful results [Sundheim, 1989] (see Hobbs et al., this volume). New
  evaluation metrics have appeared also in other tasks, such as text
  categorization (cf. Hayes, this volume).

• Scale:
  Natural language programs typically have operated on a handful of texts;
  recently, programs have emerged that process streams of hundreds of
  thousands of words or more, depending on the level of semantic processing.
  Along with their broader capabilities, the knowledge bases that such
  programs use have been expanding. While a typical lexicon recently might
  have included 100 or 200 words, many systems now have real lexicons of
  10,000 roots or more.

• Commercialization:
  The number of industrial scientists represented in this volume is an
  indicator of the emerging commercial applications of robust text processing
  and information retrieval technology, as is the increasing number of
  commercially available systems. Many commercial applications that formerly
  used relational databases or other structured knowledge sources are shifting
  to textual databases because of the availability of on-line text
  information, and many hardware and software vendors are packaging their
  products with substantial text databases. These products generally do not
  employ the sort of technology discussed here, but do provide a vehicle for
  the ultimate application of the technology.

• Cooperation and Competition:
  Until recently, schools of thought in text processing and information
  retrieval were dogmatic enough to ignore most other related work. In many
  areas, recent projects have spawned cooperative efforts in collecting data
  and lexical knowledge, assembling test collections, and cooperating between
  industry and academia. Competition, on the other hand, was never allowed
  because of the general lack of evaluation criteria. Now there is a growing
  interest in holding "showdowns" that objectively compare different methods.

While there has been some visible progress toward text-based intelligent
systems, we aren't very close to a desirable state of technology. The next
section addresses some of the obstacles we must overcome.
`
1.5 Why We Aren't There Yet

Many of us have workstations on top of our desks that have access via computer
networks to trillions of words of text: encyclopedias, almanacs, dictionaries,
literature, news, and electronic bulletin boards. Ironically, we are loath to
attempt to use most of this information because a combination of factors,
mainly the difficulty of finding any particular bit of knowledge we desire,
makes it a gross waste of time.

Much of this problem in crudeness of information access boils down to issues
that are relatively mundane, having little to do with text content: the speed
of transmission across networks, compatibility of hardware, security, legal
and copyright concerns, the lack of standards for storing and transmitting
on-line text, etc. As the motivation for using on-line text helps to dissolve
some of these issues, we can hope for better opportunities to use the advanced
technologies for content analysis that are reported here.

In addition to these mundane communication and standardization issues, there
is a more relevant problem of how to market the technology that we are
developing. Too often we ignore the strengths of the competition: in this
case, simple text search, Boolean query, and keyword retrieval methods. While
these simpler methods lack the power and intuitive appeal, say, of natural
language analysis or concept-based information retrieval, they have certain
features that appeal to users of large text databases: they are fast,
portable, relatively inexpensive, and relatively easy to learn. The techniques
are compatible with many software packages, run on many hardware platforms,
and are easier to implement in hardware. By contrast, natural language
processing can be slow, brittle, and expensive. In order to bring the
technology to the marketplace in the near future (such as the next dozen or so
years), we will either have to minimize these disadvantages or prove dramatic
improvements over simpler methods.
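
To show how little machinery the "competition" needs, here is a minimal sketch
in Python of Boolean keyword retrieval over an inverted index. It is an
illustration only: the function names (`build_index`, `boolean_and`,
`boolean_or`) and the three toy documents are assumptions introduced for this
example, and the tokenization is deliberately crude (lowercasing and stripping
a few punctuation characters).

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            word = token.strip(".,;:!?\"'()")
            if word:
                index[word].add(doc_id)
    return index


def boolean_and(index, terms):
    """Ids of documents that contain every query term."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()


def boolean_or(index, terms):
    """Ids of documents that contain at least one query term."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.union(*postings) if postings else set()


# Toy collection (hypothetical texts, for illustration only).
docs = {
    1: "The bank approved the corporate merger.",
    2: "A merger of two airlines was announced.",
    3: "The bank transferred money overseas.",
}
index = build_index(docs)

print(sorted(boolean_and(index, ["bank", "merger"])))  # [1]
print(sorted(boolean_or(index, ["bank", "merger"])))   # [1, 2, 3]
```

The sketch uses no linguistic knowledge at all, which is what makes methods of
this kind fast and portable; it also shows their limit, since a query only
matches documents that happen to use the same words.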

Some key technical barriers stand in the way of the all-knowing desktop
librarian. These technical barriers will form some of the focal points of the
research reported in this volume as well as the progress that is likely to be
made in the rest of the century. Four such issues are (1) robustness of
analysis, (2) retrieval strategy, (3) presentation of information, and (4)
cultivation of applications. The next section will outline the technical
challenges in each of these areas.
`
1.6 Challenges for the 1990's

The intelligent access to information from texts is the central theme of this
research. The following are some of the key thrusts of this theme, including
the topics of many of the papers here:

• Robustness:
  The next generation of language analyzers must do much of the same sort of
  processing that current systems do, but must do it more accurately, faster,
  and with less domain-dependent knowledge. Robustness applies both to
  extending techniques that are already robust, such as parsing and
  morphology, and to increasing the robustness of more knowledge-intensive
  techniques, such as semantic analysis.

• Retrieval Strategy:
  Current retrieval methods are oriented toward the retrieval of documents,
  not information in general. Text-based systems must address the broader
  issue of satisfying the information needs of many different systems and
  users. Within this broader information processing context, the concept of
  success must be redefined to be more than reproducing "relevant" texts, and
  new retrieval strategies must address this new notion of success. For
  example, if a user wants to know a specific piece of information and the
  system produces an extremely long text containing relevant information, this
  is somehow not as good as producing a direct answer to the user's question.

• Presentation:
  A big problem with on-line text retrieval is that people do not like to
  read. On-line text is even harder to read than printed material. Current
  systems depend on users' reading skills rather than presenting information
  that satisfies a user's needs. We have only begun to address the many
  different ways textual information can be effectively displayed. For
  example, hypertext systems can link together pieces of text from different
  parts of a document or different documents, making it easier for the user to
  control the presentation. For all the "hype" that hypertext has received, we
  have a long way to go in presenting texts intelligently, for example,
  generating a summary by combining different portions of text, highlighting
  sections of text that contain information that is asked for, or compressing
  a text so that only key portions appear. Many of these techniques must be
  developed to suit the requirements of new applications.

• Applications:
  One of the limitations of information retrieval research is that it has
  narrowly defined its territory, possibly overlooking appropriate application
  areas. Many different types of content-based text applications have already
  emerged, such as routing (selective dissemination of information), text
  categorization, database generation, and transaction handling. The range of
  application areas continues to grow. Some provocative application areas are:
  skimming news stories about political issues to determine whether a figure
  is "for" or "against" (cf. Hearst, this volume); selecting and ordering
  requirements from a large software specification; and generating a help
  system from on-line documentation (Maarek, this volume). Research in
  text-based systems must consider these new testbed applications along with
  the underlying technical issues.

While each of these areas poses some substantive problems, text-based systems
are bound to grow steadily in their capabilities. After all, the use of
information retrieval systems is expanding in spite of relatively poor
accuracy. It's a good bet that many of the developments in text-based
intelligent systems will pan out as they apply more robust methods to use the
increasing power of on-line text.

1.7 Summary

The emerging field of text-based intelligent systems marries the content-based
analysis of natural language processing with the experimental methodology of
information retrieval. This combination can overcome many of the limitations
of current knowledge-based systems by applying shallow methods of analysis to
huge bodies of text. This new focus has already produced an expansion in
robust text processing capabilities, and is likely to produce a wave of
maturing applications in the next decade.
`
Bibliography

[IPM, 1987] Information Processing and Management, Special Issue on
Artificial Intelligence for Information Retrieval, 23(4), 1987.

[Jacobs and Rau, 1990] Paul Jacobs and Lisa Rau. SCISOR: Extracting
information from on-line news. Communications of the Association for
Computing Machinery, 33(11):88-97, November 1990.

[Jonak, 1984] Zdenek Jonak. Automatic indexing of full texts. Information
Processing and Management, 20(5-6):619-627, 1984.

[Lytinen and Gershman, 1986] Steven Lytinen and Anatole Gershman. ATRANS:
Automatic processing of money transfer messages. In Proceedings of the Fifth
National Conference on Artificial Intelligence, pages 1089-1093,
Philadelphia, 1986.

[Rau, 1987] Lisa F. Rau. Information retrieval in never-ending stories. In
Proceedings of the Sixth National Conference on Artificial Intelligence,
pages 317-321, Seattle, Washington, July 1987. Morgan Kaufmann Inc.

[Sundheim, 1989] Beth Sundheim. Second message understanding (MUCK-II)
report. Technical Report 1328, Naval Ocean Systems Center, San Diego, CA,
1989.

[Young and Hayes, 1985] S. Young and P. Hayes. Automatic classification and
summarization of banking telexes. In The Second Conference on Artificial
Intelligence Applications, pages 402-408, IEEE Press, 1985.

[Zernik, 1991] U. Zernik, editor. Lexical Acquisition: Using On-Line
Resources to Build a Lexicon. Lawrence Erlbaum Associates, Hillsdale, NJ,
1991.

Part I: Broad-Scale NLP
`
Two forces drive the emergence of text-based systems: the power of on-line
text and the increased ability of computers to process text. This section
covers the techniques that have changed the way computers interpret texts in
recent years, from increased coverage and completeness of traditional
linguistic processing to the integration of statistical or "weak" methods with
deeper interpretation.

The paper by Hobbs et al. argues that augmenting the detailed models of
parsing and inference that have been explored in the past can provide much of
what's needed to extract information from quantities of real text. Wilks et
al. and Hirst and Ryan lean more heavily on weak methods, while McDonald
presents an alternative model of parsing. The Zernik paper gives one view of
how weak methods can aid, rather than replace, linguistic processing.

Robust Processing of Real-World Natural-Language Texts

Jerry R. Hobbs, Douglas E. Appelt, John Bear,
Mabry Tyson, and David Magerman

Artificial Intelligence Center
SRI International
Menlo Park, California
`
Abstract

It is often assumed that when natural language processing meets the real
world, the ideal of aiming for complete and correct interpretations has to be
abandoned. However, our experience with TACITUS, especially in the MUC-3
evaluation, has shown that principled techniques for syntactic and pragmatic
analysis can be bolstered with methods for achieving robustness. We describe
and evaluate a method for dealing with unknown words and a method for
filtering out sentences irrelevant to the task. We describe three techniques
for making syntactic analysis more robust: an agenda-based scheduling parser,
a recovery technique for failed parses, and a new technique called terminal
substring parsing. For pragmatics processing, we describe how the method of
abductive inference is inherently robust, in that an interpretation is always
possible, so that in the absence of the required world knowledge, performance
degrades gracefully. Each of these techniques has been evaluated, and the
results of the evaluations are presented.
`
2.1 Introduction

If automatic text processing is to be a useful enterprise, it must be
demonstrated that the completeness and accuracy of the information extracted
is adequate for the application one has in mind. While it is clear that
certain applications require only a minimal level of competence from a system,
it is also true that many applications require a very high degree of
completeness and accuracy, and an increase in capability in either area is a
clear advantage. Therefore, we adopt an extremely high standard against which
the performance of a text processing system should be measured: it should
recover all information that is implicitly or explicitly present in the text,
and it should do so without making mistakes.

This standard is far beyond the state of the art. It is an impossibly high
standard for human beings, let alone machines. However, progress toward
adequate text processing is best served by setting ambitious goals. For this
reason we believe that, while it may be necessary in the intermediate term to
settle for results that are far short of this ultimate goal, any linguistic
theory or system architecture that is adopted should not be demonstrably
inconsistent with attaining this objective. However, if one is interested, as
we are, in the potentially successful application of these intermediate-term
systems to real problems, it is impossible to ignore the question of whether
they can be made efficient enough and robust enough for actual applications.
`
2.1.1 The TACITUS System

The TACITUS text processing system has been under development at SRI
International for the last six years. This system has been designed as a first
step toward the realization of a system with very high completeness and
accuracy in its ability to extract information from text. The general
philosophy underlying the design of this system is that the system, to the
maximum extent possible, should not discard any information that might be
semantically or pragmatically relevant to a full, correct interpretation. The
effect of this design philosophy on the system arch
