Foundations of Statistical Natural Language Processing

Christopher D. Manning
Hinrich Schütze

The MIT Press
Cambridge, Massachusetts
London, England

Second printing, 1999
© 1999 Massachusetts Institute of Technology
Second printing with corrections, 2000

All rights reserved. No part of this book may be reproduced in any form by any electronic or mechanical means (including photocopying, recording, or information storage and retrieval) without permission in writing from the publisher.

Typeset in 10/13 Lucida Bright by the authors using LaTeX 2e.
Printed and bound in the United States of America.

Library of Congress Cataloging-in-Publication Information

Manning, Christopher D.
Foundations of statistical natural language processing / Christopher D. Manning, Hinrich Schütze.
p. cm.
Includes bibliographical references (p. ) and index.
ISBN 0-262-13360-1
1. Computational linguistics-Statistical methods. I. Schütze, Hinrich. II. Title.
P98.5.S83M36 1999
410'.285-dc21
99-21137
CIP

Brief Contents

I Preliminaries 1
1 Introduction 3
2 Mathematical Foundations 39
3 Linguistic Essentials 81
4 Corpus-Based Work 117

II Words 149
5 Collocations 151
6 Statistical Inference: n-gram Models over Sparse Data 191
7 Word Sense Disambiguation 229
8 Lexical Acquisition 265

III Grammar 315
9 Markov Models 317
10 Part-of-Speech Tagging 341
11 Probabilistic Context Free Grammars 381
12 Probabilistic Parsing 407

IV Applications and Techniques 461
13 Statistical Alignment and Machine Translation 463
14 Clustering 495
15 Topics in Information Retrieval 529
16 Text Categorization 575

Contents

List of Tables xv
List of Figures xxi
Table of Notations xxv
Preface xxix
Road Map xxxv

I Preliminaries 1

1 Introduction 3
  1.1 Rationalist and Empiricist Approaches to Language 4
  1.2 Scientific Content 7
    1.2.1 Questions that linguistics should answer 8
    1.2.2 Non-categorical phenomena in language 11
    1.2.3 Language and cognition as probabilistic phenomena 15
  1.3 The Ambiguity of Language: Why NLP Is Difficult 17
  1.4 Dirty Hands 19
    1.4.1 Lexical resources 19
    1.4.2 Word counts 20
    1.4.3 Zipf's laws 23
    1.4.4 Collocations 29
    1.4.5 Concordances 31
  1.5 Further Reading 34
  1.6 Exercises 35

2 Mathematical Foundations 39
  2.1 Elementary Probability Theory 40
    2.1.1 Probability spaces 40
    2.1.2 Conditional probability and independence 42
    2.1.3 Bayes' theorem 43
    2.1.4 Random variables 45
    2.1.5 Expectation and variance 46
    2.1.6 Notation 47
    2.1.7 Joint and conditional distributions 48
    2.1.8 Determining P 48
    2.1.9 Standard distributions 50
    2.1.10 Bayesian statistics 54
    2.1.11 Exercises 59
  2.2 Essential Information Theory 60
    2.2.1 Entropy 61
    2.2.2 Joint entropy and conditional entropy 63
    2.2.3 Mutual information 66
    2.2.4 The noisy channel model 68
    2.2.5 Relative entropy or Kullback-Leibler divergence 72
    2.2.6 The relation to language: Cross entropy 73
    2.2.7 The entropy of English 76
    2.2.8 Perplexity 78
    2.2.9 Exercises 78
  2.3 Further Reading 79

3 Linguistic Essentials 81
  3.1 Parts of Speech and Morphology 81
    3.1.1 Nouns and pronouns 83
    3.1.2 Words that accompany nouns: Determiners and adjectives 87
    3.1.3 Verbs 88
    3.1.4 Other parts of speech 91
  3.2 Phrase Structure 93
    3.2.1 Phrase structure grammars 96
    3.2.2 Dependency: Arguments and adjuncts 101
    3.2.3 X' theory 106
    3.2.4 Phrase structure ambiguity 107
  3.3 Semantics and Pragmatics 109
  3.4 Other Areas 112
  3.5 Further Reading 113
  3.6 Exercises 114

4 Corpus-Based Work 117
  4.1 Getting Set Up 118
    4.1.1 Computers 118
    4.1.2 Corpora 118
    4.1.3 Software 120
  4.2 Looking at Text 123
    4.2.1 Low-level formatting issues 123
    4.2.2 Tokenization: What is a word? 124
    4.2.3 Morphology 131
    4.2.4 Sentences 134
  4.3 Marked-up Data 136
    4.3.1 Markup schemes 137
    4.3.2 Grammatical tagging 139
  4.4 Further Reading 145
  4.5 Exercises 147

II Words 149

5 Collocations 151
  5.1 Frequency 153
  5.2 Mean and Variance 157
  5.3 Hypothesis Testing 162
    5.3.1 The t test 163
    5.3.2 Hypothesis testing of differences 166
    5.3.3 Pearson's chi-square test 169
    5.3.4 Likelihood ratios 172
  5.4 Mutual Information 178
  5.5 The Notion of Collocation 183
  5.6 Further Reading 187

6 Statistical Inference: n-gram Models over Sparse Data 191
  6.1 Bins: Forming Equivalence Classes 192
    6.1.1 Reliability vs. discrimination 192
    6.1.2 n-gram models 192
    6.1.3 Building n-gram models 195
  6.2 Statistical Estimators 196
    6.2.1 Maximum Likelihood Estimation (MLE) 197
    6.2.2 Laplace's law, Lidstone's law and the Jeffreys-Perks law 202
    6.2.3 Held out estimation 205
    6.2.4 Cross-validation (deleted estimation) 210
    6.2.5 Good-Turing estimation 212
    6.2.6 Briefly noted 216
  6.3 Combining Estimators 217
    6.3.1 Simple linear interpolation 218
    6.3.2 Katz's backing-off 219
    6.3.3 General linear interpolation 220
    6.3.4 Briefly noted 222
    6.3.5 Language models for Austen 223
  6.4 Conclusions 224
  6.5 Further Reading 225
  6.6 Exercises 225

7 Word Sense Disambiguation 229
  7.1 Methodological Preliminaries 232
    7.1.1 Supervised and unsupervised learning 232
    7.1.2 Pseudowords 233
    7.1.3 Upper and lower bounds on performance 233
  7.2 Supervised Disambiguation 235
    7.2.1 Bayesian classification 235
    7.2.2 An information-theoretic approach 239
  7.3 Dictionary-Based Disambiguation 241
    7.3.1 Disambiguation based on sense definitions 242
    7.3.2 Thesaurus-based disambiguation 244
    7.3.3 Disambiguation based on translations in a second-language corpus 247
    7.3.4 One sense per discourse, one sense per collocation 249
  7.4 Unsupervised Disambiguation 252
  7.5 What Is a Word Sense? 256
  7.6 Further Reading 260
  7.7 Exercises 262

8 Lexical Acquisition 265
  8.1 Evaluation Measures 267
  8.2 Verb Subcategorization 271
  8.3 Attachment Ambiguity 278
    8.3.1 Hindle and Rooth (1993) 280
    8.3.2 General remarks on PP attachment 284
  8.4 Selectional Preferences 288
  8.5 Semantic Similarity 294
    8.5.1 Vector space measures 296
    8.5.2 Probabilistic measures 303
  8.6 The Role of Lexical Acquisition in Statistical NLP 308
  8.7 Further Reading 312

III Grammar 315

9 Markov Models 317
  9.1 Markov Models 318
  9.2 Hidden Markov Models 320
    9.2.1 Why use HMMs? 322
    9.2.2 General form of an HMM 324
  9.3 The Three Fundamental Questions for HMMs 325
    9.3.1 Finding the probability of an observation 326
    9.3.2 Finding the best state sequence 331
    9.3.3 The third problem: Parameter estimation 333
  9.4 HMMs: Implementation, Properties, and Variants 336
    9.4.1 Implementation 336
    9.4.2 Variants 337
    9.4.3 Multiple input observations 338
    9.4.4 Initialization of parameter values 339
  9.5 Further Reading 339

10 Part-of-Speech Tagging 341
  10.1 The Information Sources in Tagging 343
  10.2 Markov Model Taggers 345
    10.2.1 The probabilistic model 345
    10.2.2 The Viterbi algorithm 349
    10.2.3 Variations 351
  10.3 Hidden Markov Model Taggers 356
    10.3.1 Applying HMMs to POS tagging 357
    10.3.2 The effect of initialization on HMM training 359
  10.4 Transformation-Based Learning of Tags 361
    10.4.1 Transformations 362
    10.4.2 The learning algorithm 364
    10.4.3 Relation to other models 365
    10.4.4 Automata 367
    10.4.5 Summary 369
  10.5 Other Methods, Other Languages 370
    10.5.1 Other approaches to tagging 370
    10.5.2 Languages other than English 371
  10.6 Tagging Accuracy and Uses of Taggers 371
    10.6.1 Tagging accuracy 371
    10.6.2 Applications of tagging 374
  10.7 Further Reading 377
  10.8 Exercises 379

11 Probabilistic Context Free Grammars 381
  11.1 Some Features of PCFGs 386
  11.2 Questions for PCFGs 388
  11.3 The Probability of a String 392
    11.3.1 Using inside probabilities 392
    11.3.2 Using outside probabilities 394
    11.3.3 Finding the most likely parse for a sentence 396
    11.3.4 Training a PCFG 398
  11.4 Problems with the Inside-Outside Algorithm 401
  11.5 Further Reading 402
  11.6 Exercises 404

12 Probabilistic Parsing 407
  12.1 Some Concepts 408
    12.1.1 Parsing for disambiguation 408
    12.1.2 Treebanks 412
    12.1.3 Parsing models vs. language models 414
    12.1.4 Weakening the independence assumptions of PCFGs 416
    12.1.5 Tree probabilities and derivational probabilities 421
    12.1.6 There's more than one way to do it 423
    12.1.7 Phrase structure grammars and dependency grammars 428
    12.1.8 Evaluation 431
    12.1.9 Equivalent models 437
    12.1.10 Building parsers: Search methods 439
    12.1.11 Use of the geometric mean 442
  12.2 Some Approaches 443
    12.2.1 Non-lexicalized treebank grammars 443
    12.2.2 Lexicalized models using derivational histories 448
    12.2.3 Dependency-based models 451
    12.2.4 Discussion 454
  12.3 Further Reading 456
  12.4 Exercises 458

IV Applications and Techniques 461

13 Statistical Alignment and Machine Translation 463
  13.1 Text Alignment 466
    13.1.1 Aligning sentences and paragraphs 467
    13.1.2 Length-based methods 471
    13.1.3 Offset alignment by signal processing techniques 475
    13.1.4 Lexical methods of sentence alignment 478
    13.1.5 Summary 484
    13.1.6 Exercises 484
  13.2 Word Alignment 484
  13.3 Statistical Machine Translation 486
  13.4 Further Reading 492

14 Clustering 495
  14.1 Hierarchical Clustering 500
    14.1.1 Single-link and complete-link clustering 503
    14.1.2 Group-average agglomerative clustering 507
    14.1.3 An application: Improving a language model 509
    14.1.4 Top-down clustering 512
  14.2 Non-Hierarchical Clustering 514
    14.2.1 K-means 515
    14.2.2 The EM algorithm 518
  14.3 Further Reading 527
  14.4 Exercises 528
15 Topics in Information Retrieval 529
  15.1 Some Background on Information Retrieval 530
    15.1.1 Common design features of IR systems 532
    15.1.2 Evaluation measures 534
    15.1.3 The probability ranking principle (PRP) 538
  15.2 The Vector Space Model 539
    15.2.1 Vector similarity 540
    15.2.2 Term weighting 541
  15.3 Term Distribution Models 544
    15.3.1 The Poisson distribution 545
    15.3.2 The two-Poisson model 548
    15.3.3 The K mixture 549
    15.3.4 Inverse document frequency 551
    15.3.5 Residual inverse document frequency 553
    15.3.6 Usage of term distribution models 554
  15.4 Latent Semantic Indexing 554
    15.4.1 Least-squares methods 557
    15.4.2 Singular Value Decomposition 558
    15.4.3 Latent Semantic Indexing in IR 564
  15.5 Discourse Segmentation 566
    15.5.1 TextTiling 567
  15.6 Further Reading 570
  15.7 Exercises 573

16 Text Categorization 575
  16.1 Decision Trees 578
  16.2 Maximum Entropy Modeling 589
    16.2.1 Generalized iterative scaling 591
    16.2.2 Application to text categorization 594
  16.3 Perceptrons 597
  16.4 k Nearest Neighbor Classification 604
  16.5 Further Reading 607

Tiny Statistical Tables 609

Bibliography 611

Index 657

List of Tables

1.1 Common words in Tom Sawyer. 21
1.2 Frequency of frequencies of word types in Tom Sawyer. 22
1.3 Empirical evaluation of Zipf's law on Tom Sawyer. 24
1.4 Commonest bigram collocations in the New York Times. 30
1.5 Frequent bigrams after filtering. 32
2.1 Likelihood ratios between two theories. 58
2.2 Statistical NLP problems as decoding problems. 71
3.1 Common inflections of nouns. 84
3.2 Pronoun forms in English. 86
3.3 Features commonly marked on verbs. 90
4.1 Major suppliers of electronic corpora with contact URLs. 119
4.2 Different formats for telephone numbers appearing in an issue of The Economist. 131
4.3 Sentence lengths in newswire text. 137
4.4 Sizes of various tag sets. 140
4.5 Comparison of different tag sets: adjective, adverb, conjunction, determiner, noun, and pronoun tags. 141
4.6 Comparison of different tag sets: Verb, preposition, punctuation and symbol tags. 142
5.1 Finding Collocations: Raw Frequency. 154
5.2 Part of speech tag patterns for collocation filtering. 154
5.3 Finding Collocations: Justeson and Katz' part-of-speech filter. 155
5.4 The nouns w occurring most often in the patterns 'strong w' and 'powerful w'. 156
5.5 Finding collocations based on mean and variance. 161
5.6 Finding collocations: The t test applied to 10 bigrams that occur with frequency 20. 166
5.7 Words that occur significantly more often with powerful (the first ten words) and strong (the last ten words). 167
5.8 A 2-by-2 table showing the dependence of occurrences of new and companies. 169
5.9 Correspondence of vache and cow in an aligned corpus. 171
5.10 Testing for the independence of words in different corpora using chi-square. 171
5.11 How to compute Dunning's likelihood ratio test. 172
5.12 Bigrams of powerful with the highest scores according to Dunning's likelihood ratio test. 174
5.13 Damerau's frequency ratio test. 176
5.14 Finding collocations: Ten bigrams that occur with frequency 20, ranked according to mutual information. 178
5.15 Correspondence of chambre and house and communes and house in the aligned Hansard corpus. 179
5.16 Problems for Mutual Information from data sparseness. 181
5.17 Different definitions of mutual information in (Cover and Thomas 1991) and (Fano 1961). 182
5.18 Collocations in the BBI Combinatory Dictionary of English for the words strength and power. 185
6.1 Growth in number of parameters for n-gram models. 194
6.2 Notation for the statistical estimation chapter. 197
6.3 Probabilities of each successive word for a clause from Persuasion. 200
6.4 Estimated frequencies for the AP data from Church and Gale (1991a). 203
6.5 Expected Likelihood Estimation estimates for the word following was. 205
6.6 Using the t test for comparing the performance of two systems. 209
6.7 Extracts from the frequencies of frequencies distribution for bigrams and trigrams in the Austen corpus. 214
6.8 Good-Turing estimates for bigrams: Adjusted frequencies and probabilities. 215
6.9 Good-Turing bigram frequency estimates for the clause from Persuasion. 215
6.10 Back-off language models with Good-Turing estimation tested on Persuasion. 223
6.11 Probability estimates of the test clause according to various language models. 224
7.1 Notational conventions used in this chapter. 235
7.2 Clues for two senses of drug used by a Bayesian classifier. 238
7.3 Highly informative indicators for three ambiguous French words. 239
7.4 Two senses of ash. 243
7.5 Disambiguation of ash with Lesk's algorithm. 243
7.6 Some results of thesaurus-based disambiguation. 247
7.7 How to disambiguate interest using a second-language corpus. 248
7.8 Examples of the one sense per discourse constraint. 250
7.9 Some results of unsupervised disambiguation. 256
8.1 The F measure and accuracy are different objective functions. 270
8.2 Some subcategorization frames with example verbs and sentences. 271
8.3 Some subcategorization frames learned by Manning's system. 276
8.4 An example where the simple model for resolving PP attachment ambiguity fails. 280
8.5 Selectional Preference Strength (SPS). 290
8.6 Association strength distinguishes a verb's plausible and implausible objects. 292
8.7 Similarity measures for binary vectors. 299
8.8 The cosine as a measure of semantic similarity. 302
8.9 Measures of (dis-)similarity between probability distributions. 304
8.10 Types of words occurring in the LOB corpus that were not covered by the OALD dictionary. 310
9.1 Notation used in the HMM chapter. 324
9.2 Variable calculations for O = (lem, ice_t, cola). 330
10.1 Some part-of-speech tags frequently used for tagging English. 342
10.2 Notational conventions for tagging. 346
10.3 Idealized counts of some tag transitions in the Brown Corpus. 348
10.4 Idealized counts of tags that some words occur within the Brown Corpus. 349
10.5 Table of probabilities for dealing with unknown words in tagging. 352
10.6 Initialization of the parameters of an HMM. 359
10.7 Triggering environments in Brill's transformation-based tagger. 363
10.8 Examples of some transformations learned in transformation-based tagging. 363
10.9 Examples of frequent errors of probabilistic taggers. 374
10.10 A portion of a confusion matrix for part of speech tagging. 375
11.1 Notation for the PCFG chapter. 383
11.2 A simple Probabilistic Context Free Grammar (PCFG). 384
11.3 Calculation of inside probabilities. 394
12.1 Abbreviations for phrasal categories in the Penn Treebank. 413
12.2 Frequency of common subcategorization frames (local trees expanding VP) for selected verbs. 418
12.3 Selected common expansions of NP as Subject vs. Object, ordered by log odds ratio. 420
12.4 Selected common expansions of NP as first and second object inside VP. 420
12.5 Precision and recall evaluation results for PP attachment errors for different styles of phrase structure. 436
12.6 Comparison of some statistical parsing systems. 455
13.1 Sentence alignment papers. 470
14.1 A summary of the attributes of different clustering algorithms. 500
14.2 Symbols used in the clustering chapter. 501
14.3 Similarity functions used in clustering. 503
14.4 An example of K-means clustering. 518
14.5 An example of a Gaussian mixture. 521
15.1 A small stop list for English. 533
15.2 An example of the evaluation of rankings. 535
15.3 Three quantities that are commonly used in term weighting in information retrieval. 542
15.4 Term and document frequencies of two words in an example corpus. 542
15.5 Components of tf.idf weighting schemes. 544
15.6 Document frequency (df) and collection frequency (cf) for 6 words in the New York Times corpus. 547
15.7 Actual and estimated number of documents with k occurrences for six terms. 550
15.8 Example for exploiting co-occurrence in computing content similarity. 554
15.9 The matrix of document correlations B^T B. 562
16.1 Some examples of classification tasks in NLP. 576
16.2 Contingency table for evaluating a binary classifier. 577
16.3 The representation of document 11, shown in figure 16.3. 581
16.4 An example of information gain as a splitting criterion. 582
16.5 Contingency table for a decision tree for the Reuters category "earnings." 586
16.6 An example of a maximum entropy distribution in the form of equation (16.4). 593
16.7 An empirical distribution whose corresponding maximum entropy distribution is the one in table 16.6. 594
16.8 Feature weights in maximum entropy modeling for the category "earnings" in Reuters. 595
16.9 Classification results for the distribution corresponding to table 16.8 on the test set. 595
16.10 Perceptron for the "earnings" category. 601
16.11 Classification results for the perceptron in table 16.10 on the test set. 602
16.12 Classification results for an 1NN categorizer for the "earnings" category. 606

List of Figures

1.1 Zipf's law. 26
1.2 Mandelbrot's formula. 27
1.3 Key Word In Context (KWIC) display for the word showed. 32
1.4 Syntactic frames for showed in Tom Sawyer. 33
2.1 A diagram illustrating the calculation of conditional probability P(A|B). 42
2.2 A random variable X for the sum of two dice. 45
2.3 Two examples of binomial distributions: b(r; 10, 0.7) and b(r; 10, 0.1). 52
2.4 Example normal distribution curves: n(x; 0, 1) and n(x; 1.5, 2). 53
2.5 The entropy of a weighted coin. 63
2.6 The relationship between mutual information I and entropy H. 67
2.7 The noisy channel model. 69
2.8 A binary symmetric channel. 69
2.9 The noisy channel model in linguistics. 70
3.1 An example of recursive phrase structure expansion. 99
3.2 An example of a prepositional phrase attachment ambiguity. 108
4.1 Heuristic sentence boundary detection algorithm. 135
4.2 A sentence as tagged according to several different tag sets. 140
5.1 Using a three word collocational window to capture bigrams at a distance. 158
5.2 Histograms of the position of strong relative to three words. 160
7.1 Bayesian disambiguation. 238
7.2 The Flip-Flop algorithm applied to finding indicators for disambiguation. 240
7.3 Lesk's dictionary-based disambiguation algorithm. 243
7.4 Thesaurus-based disambiguation. 245
7.5 Adaptive thesaurus-based disambiguation. 246
7.6 Disambiguation based on a second-language corpus. 249
7.7 Disambiguation based on "one sense per collocation" and "one sense per discourse." 252
7.8 An EM algorithm for learning a word sense clustering. 254
8.1 A diagram motivating the measures of precision and recall. 268
8.2 Attachments in a complex sentence. 285
8.3 A document-by-word matrix A. 297
8.4 A word-by-word matrix B. 297
8.5 A modifier-by-head matrix C. 297
9.1 A Markov model. 319
9.2 The crazy soft drink machine, showing the states of the machine and the state transition probabilities. 321
9.3 A section of an HMM for a linearly interpolated language model. 323
9.4 A program for a Markov process. 325
9.5 Trellis algorithms. 328
9.6 Trellis algorithms: Closeup of the computation of forward probabilities at one node. 329
9.7 The probability of traversing an arc. 334
10.1 Algorithm for training a Visible Markov Model Tagger. 348
10.2 Algorithm for tagging with a Visible Markov Model Tagger. 350
10.3 The learning algorithm for transformation-based tagging. 364
11.1 The two parse trees, their probabilities, and the sentence probability. 385
11.2 A Probabilistic Regular Grammar (PRG). 390
11.3 Inside and outside probabilities in PCFGs. 391
12.1 A word lattice (simplified). 408
12.2 A Penn Treebank tree. 413
12.3 Two CFG derivations of the same tree. 421
12.4 An LC stack parser. 425
12.5 Decomposing a local tree into dependencies. 430
12.6 An example of the PARSEVAL measures. 433
12.7 The idea of crossing brackets. 434
12.8 Penn trees versus other trees. 436
13.1 Different strategies for Machine Translation. 464
13.2 Alignment and correspondence. 469
13.3 Calculating the cost of alignments. 473
13.4 A sample dot plot. 476
13.5 The pillow-shaped envelope that is searched. 480
13.6 The noisy channel model in machine translation. 486
14.1 A single-link clustering of 22 frequent English words represented as a dendrogram. 496
14.2 Bottom-up hierarchical clustering. 502
14.3 Top-down hierarchical clustering. 502
14.4 A cloud of points in a plane. 504
14.5 Intermediate clustering of the points in figure 14.4. 504
14.6 Single-link clustering of the points in figure 14.4. 505
14.7 Complete-link clustering of the points in figure 14.4. 505
14.8 The K-means clustering algorithm. 516
14.9 One iteration of the K-means algorithm. 517
14.10 An example of using the EM algorithm for soft clustering. 519
15.1 Results of the search '"glass pyramid" Pei Louvre' on an internet search engine. 531
15.2 Two examples of precision-recall curves. 537
15.3 A vector space with two dimensions. 540
15.4 The Poisson distribution. 546
15.5 An example of a term-by-document matrix A. 555
15.6 Dimensionality reduction. 555
15.7 An example of linear regression. 558
15.8 The matrix T of the SVD decomposition of the matrix in figure 15.5. 560
15.9 The matrix of singular values of the SVD decomposition of the matrix in figure 15.5. 560
15.10 The matrix D of the SVD decomposition of the matrix in figure 15.5. 561
15.11 The matrix B = S2x2 D2xn of documents after rescaling with singular values and reduction to two dimensions. 562
15.12 Three constellations of cohesion scores in topic boundary identification. 569
16.1 A decision tree. 578
16.2 Geometric interpretation of part of the tree in figure 16.1. 579
16.3 An example of a Reuters news story in the topic category "earnings." 580
16.4 Pruning a decision tree. 585
16.5 Classification accuracy depends on the amount of training data available. 587
16.6 An example of how decision trees use data inefficiently from the domain of phonological rule learning. 588
16.7 The Perceptron Learning Algorithm. 598
16.8 One error-correcting step of the perceptron learning algorithm. 600
16.9 Geometric interpretation of a perceptron. 602

Table of Notations

A ∪ B   Union of sets
A ∩ B   Intersection of sets
A - B, A\B   Set difference
Ā   The complement of set A
∅   The empty set
2^A, P(A)   The power set of A
|A|   Cardinality of a set
Σ   Sum
Π   Product
p → q   p implies q (logical inference)
p ↔ q   p and q are logically equivalent
=def   Defined to be equal to (only used if "=" is ambiguous)
ℝ   The set of real numbers
ℕ   The set of natural numbers
n!   The factorial of n
∞   Infinity
|x|   Absolute value of a number
≪   Much smaller than
≫   Much greater than
f: A → B   A function f from values in A to B
max f   The maximum value of f
min f   The minimum value of f
arg max f   The argument for which f has its maximum value
arg min f   The argument for which f has its minimum value
lim(x→∞) f(x)   The limit of f as x tends to infinity
f ∝ g   f is proportional to g
∂   Partial derivative
∫   Integral
log a   The logarithm of a
exp(x), e^x   The exponential function
⌈a⌉   The smallest integer i s.t. i ≥ a
x⃗   A real-valued vector: x⃗ ∈ ℝ^n
|x⃗|   Euclidean length of x⃗
x⃗ · y⃗   The dot product of x⃗ and y⃗
cos(x⃗, y⃗)   The cosine of the angle between x⃗ and y⃗
c_ij   Element in row i and column j of matrix C
C^T   Transpose of matrix C
X̂   Estimate of X
E(X)   Expectation of X
Var(X)   Variance of X
μ   Mean
σ   Standard deviation
x̄   Sample mean
s²   Sample variance
P(A|B)   The probability of A conditional on B
X ~ p(x)   Random variable X is distributed according to p
b(r; n, p)   The binomial distribution
(n r)   Combination or binomial coefficient (the number of ways of choosing r objects from n)
n(x; μ, σ)   The normal distribution
H(X)   Entropy
I(X; Y)   Mutual information
D(p || q)   Kullback-Leibler (KL) divergence
C(·)   Count of the entity in parentheses
f_u   The relative frequency of u
w_ij, w_(i)(j)   The words w_i, w_(i+1), ..., w_j
w_i,j   The same as w_ij
w_i, ..., w_j   The same as w_ij
O(n)   Time complexity of an algorithm
*   Ungrammatical sentence or phrase or ill-formed word
?   Marginally grammatical sentence or marginally acceptable phrase

Note. Some chapters have separate notation tables for symbols that are used locally: table 6.2 (Statistical Inference), table 7.1 (Word Sense Disambiguation), table 9.1 (Markov Models), table 10.2 (Tagging), table 11.1 (Probabilistic Context-Free Grammars), and table 14.2 (Clustering).
Preface

THE NEED for a thorough textbook for Statistical Natural Language Processing hardly needs to be argued for in the age of on-line information, electronic communication and the World Wide Web. Increasingly, businesses, government agencies and individuals are confronted with large amounts of text that are critical for working and living, but not well enough understood to get the enormous value out of them that they potentially hide.

At the same time, the availability of large text corpora has changed the scientific approach to language in linguistics and cognitive science. Phenomena that were not detectable or seemed uninteresting in studying toy domains and individual sentences have moved into the center field of what is considered important to explain. Whereas as recently as the early 1990s quantitative methods were seen as so inadequate for linguistics that an important textbook for mathematical linguistics did not cover them in any way, they are now increasingly seen as crucial for linguistic theory.

In this book we have tried to achieve a balance between theory and practice, and between intuition and rigor. We attempt to ground approaches in theoretical ideas, both mathematical and linguistic, but simultaneously we try to not let the material get too dry, and try to show how theoretical ideas have been used to solve practical problems. To do this, we first present key concepts in probability theory, statistics, information theory, and linguistics in order to give students the foundations to understand the field and contribute to it. Then we describe the problems that are addressed in Statistical Natural Language Processing (NLP), like tagging and disambiguation, and a selection of important work so that students are grounded in the advances that have been made and, having understood the special problems that language poses, can move the field forward.

When we designed the basic structure of the book, we had to make a number of decisions about what to include and how to organize the material. A key criterion was to keep the book to a manageable size. (We didn't entirely succeed!) Thus the book is not a complete introduction to probability theory, information theory, statistics, and the many other areas of mathematics that are used in Statistical NLP. We have tried to cover those topics that seem most important in the field, but there will be many occasions when those teaching from the book will need to use supplementary materials for a more in-depth coverage of mathematical foundations that are of particular interest.

We also decided against attempting to present Statistical NLP as homogeneous in terms of the mathematical tools and theories that are used. It is true that a unified underlying mathematical theory would be desirable, but such a theory simply does not exist at this point. This has led to an eclectic mix in some places, but we believe that it is too early to mandate that a particular approach to NLP is right and should be given preference to others.

A perhaps surprising decision is that we do not cover speech recognition. Speech recognition began as a separate field to NLP, mainly growing out of electrical engineering departments, with separate conferences and journals, and many of its own concerns. However, in recent years there has been increasing convergence and overlap. It was research into speech recognition that inspired the revival of statistical methods within NLP, and many of the techniques that we present were developed first for speech and then spread over into NLP. In particular, work on language models within speech recognition greatly overlaps with the discussion of language models in this book. Moreover, one can argue that speech recognition is the area of language processing that currently is the most successful and the one that is most widely used in applications. Nevertheless, there are a number of practical reasons for excluding the area from this book: there are already several good textbooks for speech, it is not an area in which we have worked or are terribly expert, and this book seemed quite long enough without including speech as well. Additionally, while there is overlap, there is also considerable separation: a speech recognition textbook requires thorough coverage of issues in signal analysis and acoustic modeling which would not generally be of interest or accessible to someone from a computer science or NLP background, while in the reverse direction, most people studying speech would be uninterested in many of the NLP topics on which we focus.

Other related areas that have a somewhat fuzzy boundary with Statistical NLP are machine learning, text categorization, information retrieval, and cognitive science. For all of these areas, one can find examples of work that is not covered and which would fit very well into the book. It was simply a matter of space that we did not include important concepts, methods and problems like minimum description length, back-propagation, the Rocchio algorithm, and the psychological and cognitive-science literature on frequency effects on language processing.

The decisions that were most difficult for us to make are those that concern the boundary between statistical and non-statistical NLP. We believe that, when we started the book, there was a clear dividing line between the two, but this line has become much more fuzzy recently. An increasing number of non-statistical researchers use corpus evidence and incorporate quantitative methods. And it is now generally accepted in Statistical NLP that one needs to start with all the scientific knowledge that is available about a phenomenon when building a probabilistic or other model, rather than closing one's eyes and taking a clean-slate approach.

Many NLP researchers will therefore question the wisdom of writing a separate textbook for the statistical side. And the last thing we would want to do with this textbook is to promote the unfortunate view in some quarters that linguistic theory and symbolic computational work are not relevant to Statistical NLP. However, we believe that there is so much quite complex foundational material to cover that one simply cannot write a textbook of a manageable size that is a satisfactory and comprehensive introduction to all of NLP. Again, other good texts already exist, and we recommend using supplementary material if a more balanced coverage of statistical and non-statistical methods is desired.

A final remark is in order on the title we have chosen for this book. Calling the field Statistical Natural Language Processing might seem questionable to someone who takes their definition of a statistical method from a standard introduction to statistics. Statistical NLP as we define it comprises all quantitative approaches to automated language processing, including probabilistic modeling, information theory, and linear algebra. While probability theory is the foundation for formal statistical reasoning, we take the basic meaning of the term 'statistics' as being broader, encompassing all quantitative approaches to data (a definition which one can quickly confirm in almost any dictionary). Although there is thus some potential for ambiguity, Statistical NLP has been th
