12.7.1 N-grams

A simple solution to the problem of estimating word-sequence probabilities is to use N-grams (where N is a small number). Here it is assumed that the probability of observing any word w_k depends only on the identity of the previous N - 1 words (so that each conditional probability depends on a total of N words, including the current word). Using N-grams, Equation (12.3) can be approximated as follows:
$$P(W) = \prod_{k=1}^{K} P(w_k \mid w_1, \ldots, w_{k-1}) \approx \prod_{k=1}^{K} P(w_k \mid w_{k-N+1}, \ldots, w_{k-1}). \qquad (12.4)$$
If N = 1 the model is a unigram and just represents the probability of the word. A bigram (N = 2) models the probability of a word given the immediately preceding word, while a trigram (N = 3) takes into account the previous two words.

To illustrate the use of N-grams to estimate word-sequence probabilities, consider the phrase "ten pots fell over". For a unigram model, the probability of the sequence is obtained simply by multiplying together the word probabilities:
$$P(\text{ten pots fell over}) \approx P(\text{ten})\,P(\text{pots})\,P(\text{fell})\,P(\text{over}). \qquad (12.5)$$

In the case of a bigram model, the probability of the sequence is estimated as:

$$P(\text{ten pots fell over}) \approx P(\text{ten} \mid \text{START})\,P(\text{pots} \mid \text{ten})\,P(\text{fell} \mid \text{pots})\,P(\text{over} \mid \text{fell}), \qquad (12.6)$$

where START is used to indicate the beginning of the sequence. For a trigram model, the probability estimate becomes:

$$P(\text{ten pots fell over}) \approx P(\text{ten} \mid \text{START})\,P(\text{pots} \mid \text{START ten})\,P(\text{fell} \mid \text{ten pots})\,P(\text{over} \mid \text{pots fell}). \qquad (12.7)$$
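To make the bigram factorization of Equation (12.6) concrete, here is a minimal sketch that scores the example phrase with a small table of bigram probabilities; the probability values and the START token handling are illustrative assumptions rather than figures from the text.

```python
import math

# Hypothetical bigram probabilities P(word | previous word); the values are
# invented for illustration and do not come from any real corpus.
bigram_prob = {
    ("START", "ten"): 0.01,
    ("ten", "pots"): 0.05,
    ("pots", "fell"): 0.10,
    ("fell", "over"): 0.30,
}

def bigram_sequence_logprob(words, probs):
    """Sum the log of P(w_k | w_{k-1}), with START marking the sentence beginning."""
    logp = 0.0
    prev = "START"
    for w in words:
        logp += math.log(probs[(prev, w)])
        prev = w
    return logp

phrase = "ten pots fell over".split()
print(math.exp(bigram_sequence_logprob(phrase, bigram_prob)))  # 0.01*0.05*0.10*0.30 = 1.5e-06
```

A trigram version would simply condition each term on the two preceding words, as in Equation (12.7).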
In principle, N-grams can be estimated using simple frequency counts from training data to obtain maximum-likelihood estimates for the required probabilities. Considering the bigram (w_{k-1}, w_k) and the trigram (w_{k-2}, w_{k-1}, w_k), the conditional probabilities P(w_k | w_{k-1}) and P(w_k | w_{k-2}, w_{k-1}) could be estimated thus:
$$P(w_k \mid w_{k-1}) = \frac{C(w_{k-1}, w_k)}{C(w_{k-1})}, \qquad P(w_k \mid w_{k-2}, w_{k-1}) = \frac{C(w_{k-2}, w_{k-1}, w_k)}{C(w_{k-2}, w_{k-1})}, \qquad (12.8)$$
where the notation C(x) is used to represent the count of the number of examples of x. For the examples of the bigram (fell over) and the trigram (pots fell over) we have:

$$P(\text{over} \mid \text{fell}) = \frac{C(\text{fell over})}{C(\text{fell})}, \qquad P(\text{over} \mid \text{pots fell}) = \frac{C(\text{pots fell over})}{C(\text{pots fell})}. \qquad (12.9)$$
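The counting behind Equations (12.8) and (12.9) can be sketched as follows; the toy corpus and the function names are invented purely for illustration.

```python
from collections import Counter

def ngram_counts(sentences, n):
    """Count n-grams (as word tuples) over a list of tokenized sentences."""
    counts = Counter()
    for sent in sentences:
        for i in range(len(sent) - n + 1):
            counts[tuple(sent[i:i + n])] += 1
    return counts

# Toy training data, purely for illustration.
corpus = [
    "ten pots fell over".split(),
    "two pots fell over".split(),
    "ten pots stood upright".split(),
]

unigrams = ngram_counts(corpus, 1)
bigrams = ngram_counts(corpus, 2)
trigrams = ngram_counts(corpus, 3)

# Equation (12.9): P(over | fell) = C(fell over) / C(fell)
p_over_given_fell = bigrams[("fell", "over")] / unigrams[("fell",)]
# P(over | pots fell) = C(pots fell over) / C(pots fell)
p_over_given_pots_fell = trigrams[("pots", "fell", "over")] / bigrams[("pots", "fell")]

print(p_over_given_fell, p_over_given_pots_fell)  # 1.0 1.0 for this toy corpus
```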
12.7.2 Perplexity and evaluating language models

The task of the language model can be viewed as one of predicting words in a sequence, and a good language model will be one that provides a good predictor of
provided that the sample size is large in relation to the number of possible outcomes. Unfortunately, in the case of N-gram modelling there are a vast number of possible outcomes. A vocabulary of V words provides V^2 potential bigrams and V^3 potential trigrams. For a 20,000-word vocabulary there are thus 400 million possible word bigrams and eight million million possible trigrams! It would be completely impracticable to provide sufficient speech data to determine the language-model probabilities, and instead very large text corpora are used. Even so, while typical text corpora may contain over 100 million words, most of the possible bigrams and the vast majority of possible trigrams will not occur at all in the training data, and many others will only appear once or twice.

Data sparsity is a much greater problem for the language model than for the acoustic model, due to the much larger size of the inventory of basic units (words rather than phones). As a consequence of this severe data-sparsity problem, it is not possible to rely only on simple frequency counts for estimating language-model probabilities, as many of the bigram and trigram probabilities would be set to zero (so that it would then be impossible to recognize any word sequence containing one of these combinations) and many other probabilities would be poorly estimated. The successful use of N-grams for LVCSR is dependent upon the use of smoothing techniques for obtaining accurate, robust (non-zero) probability estimates for all the possible N-grams that can occur for a given vocabulary. Zero probabilities and low non-zero probabilities are adjusted upwards, and high probabilities are reduced. Thus the overall effect is to make the probability distributions more uniform, and hence 'smoother'. Some different aspects of smoothing algorithms are described briefly in the following sections.
12.7.4 Discounting

For any set of possible 'events', such as bigrams or trigrams, the sum of the probabilities for all the possibilities must be equal to one. When only a subset of the possible events occurs in the training data, the sum of the probabilities of all the observed events must therefore be less than one. This effect can be achieved by reducing the observed frequency counts. The process is generally known as discounting, and is often described in terms of 'freeing' probability mass from the observed events, which can then be redistributed among the unseen events.

Several different methods have been used to achieve discounting, and these methods differ in the way in which the reduced probabilities are calculated. Full coverage of discounting methods is outside the scope of this book but, for example, one simple but effective technique is absolute discounting. Here some small fixed amount (between zero and one) is subtracted from each frequency count (the numerators in Equation (12.8) or in the examples shown in Equation (12.9)). Thus the probability of every observed event is reduced, but the effect decreases as the observed frequency count increases (when the maximum-likelihood estimate should be more reliable). However, the same discount value may not be optimum for the full range of frequency counts. In particular, there is some evidence that absolute discounting imposes too great a reduction in the probability of events that occur only once or twice. A variant of the technique overcomes this problem by having separate discount values for these rare events.
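As a rough illustration of absolute discounting, the following sketch subtracts a fixed discount D from each observed bigram count and records how much probability mass is freed in each context for redistribution to unseen events; the discount value and the counts are illustrative assumptions.

```python
from collections import Counter, defaultdict

def absolute_discount(bigram_counts, D=0.5):
    """Return discounted P(w | h) for each observed bigram, plus the probability
    mass freed in each context h for redistribution to unseen words."""
    context_totals = defaultdict(int)
    for (h, w), c in bigram_counts.items():
        context_totals[h] += c

    probs = {}
    freed = defaultdict(float)
    for (h, w), c in bigram_counts.items():
        probs[(h, w)] = max(c - D, 0.0) / context_totals[h]
        freed[h] += min(c, D) / context_totals[h]
    return probs, freed

# Toy counts, for illustration only.
counts = Counter({("fell", "over"): 3, ("fell", "down"): 1, ("fell", "asleep"): 1})
p, freed = absolute_discount(counts, D=0.5)
print(p[("fell", "over")])   # (3 - 0.5) / 5 = 0.5
print(freed["fell"])         # 3 * 0.5 / 5 = 0.3 freed for unseen successors of "fell"
```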
12.7.5 Backing off in language modelling

Probability mass which has been made available as a result of discounting can be allocated to the unseen events. In estimating the probabilities of the unseen events, it is desirable to make use of any information that is available about the relative probabilities of the different events. In the case of N-grams, useful information may be available in the form of the probability according to a more general distribution. Thus it is possible to formulate a recursive procedure of 'backing off': if a trigram is not observed, the model backs off to the relevant bigram probability, and if the bigram is not available, it backs off further to the unigram probability. For any words which do not occur in the training texts at all, it is possible to back off to a uniform distribution whereby all words are assumed to be equally likely.
Backing off can be illustrated by considering the task of estimating the conditional trigram probabilities P(w_k | w_{k-2}, w_{k-1}) for word w_k in the context of the sequence of preceding words (w_{k-2}, w_{k-1}). We will assume that some appropriate discounting method has been used to assign probabilities to all observed trigrams, and these probability estimates will be denoted P(w_k | w_{k-2}, w_{k-1}). Thus, using P_s(w_k | w_{k-2}, w_{k-1}) to denote an estimate for the probability of word w_k given the sequence of preceding words (w_{k-2}, w_{k-1}), a backing-off scheme for obtaining this probability estimate is as follows:

$$P_s(w_k \mid w_{k-2}, w_{k-1}) = \begin{cases} P(w_k \mid w_{k-2}, w_{k-1}) & \text{if } C(w_{k-2}, w_{k-1}, w_k) > 0 \\ B(w_{k-2}, w_{k-1})\,P_s(w_k \mid w_{k-1}) & \text{otherwise.} \end{cases} \qquad (12.11)$$
P_s(w_k | w_{k-1}) is the estimated bigram probability of w_k given preceding word w_{k-1}. B(w_{k-2}, w_{k-1}) is a normalizing constant for trigram context (w_{k-2}, w_{k-1}), and scales the bigram probabilities so that the sum of the P_s(w_k | w_{k-2}, w_{k-1}) terms is equal to 1 when computed over all possible words w_k. The sum of all the probabilities that are calculated by backing off must be equal to the total probability mass that has been freed from discounting the relevant trigram context. The normalizing constant is therefore chosen to be equal to the appropriate fraction of this freed probability. Backing off from bigram to unigram and from unigram to uniform distributions can be accomplished in an analogous manner.

By setting the count threshold to zero, backing off only occurs for N-grams with no examples at all. However, as observing just one or two examples is unlikely to provide a reliable probability estimate, it can be beneficial to apply a higher threshold and so disregard those N-grams that occur just a few times in the training data. In this way only robustly estimated N-grams should be retained, which has the added benefit of a substantial saving in the memory needed for the language model.
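A minimal sketch of the backing-off scheme of Equation (12.11) is given below. It assumes that discounted trigram estimates, backed-off bigram estimates and the normalizing constants B(w_{k-2}, w_{k-1}) have already been computed elsewhere; all of the data structures and the count threshold are illustrative assumptions.

```python
def backed_off_trigram_prob(w2, w1, w, trigram_counts, p_trigram, p_bigram_s,
                            backoff_weight, count_threshold=0):
    """Equation (12.11): use the discounted trigram estimate if the trigram was
    seen (more than count_threshold times), otherwise back off to the scaled
    bigram estimate."""
    if trigram_counts.get((w2, w1, w), 0) > count_threshold:
        return p_trigram[(w2, w1, w)]
    # B(w2, w1) scales the bigram estimates so that, summed over all w, the
    # backed-off probabilities account for the mass freed by discounting.
    return backoff_weight[(w2, w1)] * p_bigram_s(w1, w)
```

In a complete implementation, p_bigram_s would itself back off to a unigram estimate, and ultimately to a uniform distribution, in the same way.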
12.7.6 Interpolation of language models

Backing off involves choosing between a specific and a more general distribution. An alternative is to compute a weighted average of different probability estimates, obtained for contexts that can range from very specific to very general. The idea is to improve the accuracy and robustness of a context-specific probability estimate by combining it with more general estimates for which more data are available. One option involves taking a linear combination of different probability estimates.
For example, a trigram probability could be estimated by linear interpolation between relative frequencies of the relevant trigrams, bigrams and unigrams, thus:
$$P_s(w_k \mid w_{k-2}, w_{k-1}) = \lambda_3 \frac{C(w_{k-2}, w_{k-1}, w_k)}{C(w_{k-2}, w_{k-1})} + \lambda_2 \frac{C(w_{k-1}, w_k)}{C(w_{k-1})} + \lambda_1 \frac{C(w_k)}{K}, \qquad (12.12)$$

where K is the number of different words, and the non-negative weights sum to one: λ_1 + λ_2 + λ_3 = 1. The values for the weights need to be chosen to make the best compromise between specificity and ability to generalize to new data. The training data are therefore divided into two parts. The first (larger) part of the data is used to derive the frequency counts, which are then used when finding the optimum values of the weights in order to maximize the probability for the second part of the data. Because the parameters that are estimated will tend to depend on how the data are partitioned, often alternative sets of parameters are estimated for several different ways of splitting the data, and then the individual estimates are combined. This smoothing method is often called deleted interpolation.
For simplicity, interpolation has been introduced using probability estimates obtained from simple frequency counts. However, smoothing by interpolation can also be applied to other probability estimates, including N-gram probabilities that have been obtained by, for example, the absolute discounting technique mentioned in Section 12.7.4. Interpolation schemes can also be used with probabilities from other types of language model, such as those discussed in Section 12.7.8 below.
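The linear interpolation of Equation (12.12) can be sketched as follows; the default weights are placeholders, and in practice they would be tuned on held-out data as described above.

```python
def interpolated_trigram_prob(w2, w1, w, unigrams, bigrams, trigrams, K,
                              lambdas=(0.1, 0.3, 0.6)):
    """Equation (12.12): weighted combination of trigram, bigram and unigram
    relative frequencies; lambdas = (lambda1, lambda2, lambda3) must sum to 1."""
    l1, l2, l3 = lambdas
    # If a context was never seen, its numerator is also 0, so using a default
    # denominator of 1 simply makes that term contribute nothing.
    tri = trigrams.get((w2, w1, w), 0) / bigrams.get((w2, w1), 1)
    bi = bigrams.get((w1, w), 0) / unigrams.get((w1,), 1)
    uni = unigrams.get((w,), 0) / K
    return l3 * tri + l2 * bi + l1 * uni
```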
12.7.7 Choice of more general distribution for smoothing

In the previous sections we have described ways of smoothing N-grams using lower-order N-grams. The lower-order distribution is usually defined in a way which is exactly analogous to the definition for the higher-order distribution that is being smoothed. There is, however, an alternative method which may give more reliable estimates of the lower-order probabilities. This method is best explained by describing an example. Consider a word, such as "Francisco", that almost always occurs following just one other word ("San"). If the training text contains several examples of "San Francisco", both the bigram probability P(Francisco | San) and the unigram probability P(Francisco) will be high. Thus if we then use a discounting method such as absolute discounting to derive the probability of "Francisco" occurring after some other bigram history, this probability will also tend to be fairly high. However, intuitively it seems more plausible that, if the only information we have implies that "Francisco" always follows "San", the probability of "Francisco" following any other word should be low.

It is possible to define the unigram probability used for smoothing not in terms of the number of examples of a word but rather in terms of the number of different contexts in which the word occurs. Chen and Goodman (1999) have proposed using this approach together with smoothing by interpolation, and incorporating a variant of absolute discounting that uses three separate discounts (one for N-grams occurring just once, one for N-grams occurring twice, and a third for all other N-gram counts). In comparative experiments, they demonstrated that this method performed consistently better (in terms of both perplexity and recognition performance) than a wide range of other smoothing techniques.
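A sketch of this idea is shown below: the lower-order 'unigram' distribution is based on how many distinct words precede each word, rather than on raw unigram counts. This is the continuation-count style of lower-order distribution associated with Kneser-Ney smoothing, the family of methods studied by Chen and Goodman; the counts here are invented for illustration.

```python
from collections import defaultdict

def continuation_unigram_probs(bigram_counts):
    """P_cont(w) proportional to the number of distinct words that precede w.
    The bigram count values are not used; only which bigrams occur matters."""
    preceding = defaultdict(set)
    for (prev, w) in bigram_counts:
        preceding[w].add(prev)
    total_contexts = sum(len(s) for s in preceding.values())
    return {w: len(s) / total_contexts for w, s in preceding.items()}

# "Francisco" follows only "San", so its continuation probability stays low
# even though "San Francisco" itself is frequent.
counts = {("San", "Francisco"): 50, ("the", "cat"): 3, ("a", "cat"): 2,
          ("my", "cat"): 1, ("the", "dog"): 4, ("a", "dog"): 2}
print(continuation_unigram_probs(counts))  # Francisco: 1/6, cat: 3/6, dog: 2/6
```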
12.8 DECODING

The recognition task is to find the most probable sequence of words W, as given by Equation (12.1). As the vocabulary size becomes larger, the number of different possible word sequences soon becomes prohibitive and any practical system needs an efficient means of searching to find the most probable sequence. The component of the system that performs this search operation is often referred to as the decoder.

Although recognition using whole-word HMMs is not practicable with a large vocabulary, the same principles can be applied to units of any size. In the previous sections we have explained how statistical models can be used at different levels. One level captures the relationship between phone-size units and their acoustic realization, then there are the different phone sequences that can make up a word, and finally there are the probabilities of different word sequences. Thus a language can be modelled as a network of states representing linguistic structure at many levels. At the lowest level a small network of states represents a triphone or similar unit. At the next level a network of triphones forms a state to represent a word. A complete sentence can then be generated by a network of word states, where the connections between the states correspond to the language-model probabilities. The decoding task is to find the best path through this multiple-level network, and the recognized word sequence is given by the best path at the highest level.

In order to apply the principles of DP using the Viterbi algorithm to a multiple-level network, probabilities need to be evaluated for all valid partial paths at all levels, with no decisions being reached until all earlier parts of a path enter the same highest-level state at the same time, thus delaying decisions until all relevant evidence has been used. The use of delayed decisions is a fundamental principle of this HMM approach, as it enables the knowledge and constraints at all levels to be employed simultaneously to find the best explanation of the data. In practice, even at any one point in time, in a large-vocabulary system there are so many possibilities (both at the word and at the sub-word level) that it is impossible to evaluate all of them. However, the language model provides strong constraints that act to make many of the possibilities extremely unlikely, and it is necessary to find an efficient way of applying these constraints within the decoding.

In a first-order HMM the probability of a transition from any one state to any other state depends only on the identities of the source and destination states. This model cannot accommodate a trigram (or higher-order) language model, because here the word transition probability depends on more than just the immediately preceding state. In order to use such a language model, some way of dealing with the higher-order dependencies needs to be found. An additional complication arises when using cross-word triphones because the identity of a word-final triphone, and hence the word probability, depends on the identity of the following word.
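One common way of handling the trigram dependency, not spelled out in this excerpt, is to expand the word-level search state so that each hypothesis is keyed by its last two words; the trigram probability then depends only on the expanded source state, restoring the first-order property at the cost of a larger search space. The sketch below shows the word-level bookkeeping for such a scheme, with beam pruning; the function names and data structures are illustrative assumptions, and the acoustic scores are assumed to come from the lower, phone-level network.

```python
def extend_hypotheses(hyps, candidate_words, trigram_logprob, acoustic_logprob,
                      beam=10):
    """One word-level step of a Viterbi-style search in which each hypothesis
    is keyed by its last two words, so that a trigram language model behaves
    like a first-order model over these expanded states."""
    new_hyps = {}
    for (w2, w1), score in hyps.items():
        for w in candidate_words:
            new_score = (score
                         + trigram_logprob(w2, w1, w)  # language-model constraint
                         + acoustic_logprob(w))        # score from the phone-level network
            key = (w1, w)                              # new two-word history
            if key not in new_hyps or new_score > new_hyps[key]:
                new_hyps[key] = new_score              # keep only the best path (Viterbi)
    # Beam pruning: retain only the most probable expanded states.
    best = sorted(new_hyps.items(), key=lambda kv: kv[1], reverse=True)[:beam]
    return dict(best)
```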
The issues associated with large-vocabulary decoding have been addressed in various different ways. Three main types of method are briefly described below.
12.8.1 Efficient one-pass Viterbi decoding for large vocabularies

In order to accommodate cross-word triphone models, the state network for a Viterbi search needs to include multiple entries for each word to cover all possible
Journal (WSJ), and included both a 5,000-word recognition test and a 20,000-word test. More recently, the test set was expanded to include a range of North American business (NAB) news and restrictions on training for the acoustic and language models were removed. The recognition vocabulary became essentially unlimited and performance was generally evaluated using systems designed for recognizing about 64,000 words. WSJ and NAB evaluations were conducted in the period between 1992 and 1995. Over the years the complexity and variety of the tests increased, but recognition performance still improved each year (see Figure 12.6).

Later ARPA evaluations moved away from read speech, and concentrated on transcribing 'found' speech from broadcast news (BN) shows on radio and television. Such data provided a lot of variation in speech quality, microphones and background noise, as well as non-native speech and periods containing only non-speech data such as sound effects and music. For the first BN evaluation in 1996, performance dropped considerably from that which had been obtained in the previous evaluations using read speech. However, researchers soon found ways of coping with the greater variety of material, and by the final BN evaluations in 1998 and 1999 typical error rates had roughly halved from those obtained in 1996.

ARPA have also sponsored a research programme on the recognition of conversational speech, including the collection of two different corpora. Data for one corpus, known as Switchboard (SWB), first became available in 1992. The data comprised telephone conversations between two strangers about some topic that had been specified in advance. A few years later, in 1995, collection of the Callhome (CH) corpus was initiated to provide more natural conversational speech by recording telephone conversations between family members. The CH set includes a number of different languages. Conversational speech is rather different in nature to either read speech or television and radio broadcasts. Conversations tend to include many disfluencies, false starts and so on, as well as a lot of phonological reduction and special use of prosody. There is also turn-taking behaviour and reliance on shared knowledge to assist in the communication, especially when the participants know each other very well (as in the CH data). As a result, errors are much higher than for the other tasks (see Figure 12.6), but again substantial progress has been made in improving performance on the conversational recognition tasks. As of 2001, these evaluations are still continuing.

The ARPA initiative has been very influential in the development of large-vocabulary recognizers, both by providing a focus for refining the techniques to cope with increasingly difficult problems and also by making available huge quantities of speech data for training. Over the years, the result has been a progressive improvement in performance on increasingly difficult tasks, as can be seen in Figure 12.6. There are now several research systems which give impressive performance on substantial transcription tasks. However, to perform at their best these systems generally require both a lot of memory (with typically several million parameters in both the acoustic and language models) and a lot of processing power (often operating at a few hundred times real time). In the last two BN evaluations, the effect of processing demands was tested by introducing a category for systems that operated at no slower than 10 times real time (shown as BN(10x) in Figure 12.6). Enforcing this constraint led to around 15% more errors than if there were no processing restrictions. We will consider performance in relation to the requirements of different applications in Chapter 15.
12.10 SPEECH UNDERSTANDING

In order to understand spoken language, it is necessary not only to recognize the words, but also to determine the linguistic structure, including both syntactic and semantic aspects. Given a word sequence, the field of computational linguistics provides techniques for deriving a representation of the linguistic structure. Traditionally these techniques involve a complex linguistic analysis based on qualitative models that are usually stated in the form of a grammar that is specified manually by human experts, although in recent years data-driven statistical methods have also been used. Even so, any detailed linguistic analysis will involve modelling long-range effects that may operate over the entire span of an utterance. It is very difficult to incorporate such a model into a speech recognition search.

Due to the difficulties involved in attempting to incorporate detailed linguistic analysis into the recognition search, speech understanding systems generally treat these tasks as separate components: an ASR system of the type described earlier in this chapter, and a second component that uses artificial-intelligence techniques to perform linguistic analysis in order to 'understand' the recognizer output. To reduce the problems caused by errors made at the recognition stage, the output of the recognizer should include alternatives, which can be provided in the form of a word lattice or an N-best list. The final recognized output is typically taken to be the highest-scoring alternative that is also allowed by the linguistic model.
The process of determining the linguistic structure of a sentence is known as parsing. Syntactic structure is usually expressed in terms of a formal grammar, and there are a variety of grammar formalisms and associated methods for syntactic parsing. However, traditional parsers aim to recover complete, exact parses. This goal will often not be achievable for spoken language, which tends to contain grammatical errors as well as hesitations, false starts and so on. The problem is made even worse when dealing with the output of a speech recognizer, which may misrecognize short function words even when overall recognition performance is very good. It can therefore be useful to adopt partial parsing techniques, whereby only segments of a complete word sequence are parsed (for example, noun phrases might be identified). Other useful tools include part-of-speech taggers, which aim to disambiguate parts of speech (e.g. the word "green" acts as an adjective in the phrase "green coat" but as a noun in "village green"), but without performing a parsing operation. Some taggers are rule-based, but there are also some very successful taggers that are based on HMMs, with the HMM states representing tags (or sequences of tags). Transition probabilities are probabilities of tag(s) given previous tag(s) and emission probabilities are probabilities of words given tags. Partial parsing and part-of-speech tagging enable useful linguistic information to be extracted from spoken input without requiring a comprehensive linguistic analysis.
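As an illustration of such an HMM-style tagger, the sketch below runs a Viterbi search over tags using transition probabilities P(tag | previous tag) and emission probabilities P(word | tag); the tiny probability tables are invented for illustration only.

```python
import math

# Invented toy model: P(tag | previous tag) and P(word | tag).
trans = {("<s>", "DET"): 0.6, ("<s>", "ADJ"): 0.2, ("<s>", "NOUN"): 0.2,
         ("DET", "ADJ"): 0.4, ("DET", "NOUN"): 0.6,
         ("ADJ", "NOUN"): 0.9, ("ADJ", "ADJ"): 0.1,
         ("NOUN", "NOUN"): 0.3, ("NOUN", "ADJ"): 0.1, ("NOUN", "DET"): 0.6}
emit = {("DET", "the"): 0.7,
        ("ADJ", "green"): 0.3,
        ("NOUN", "green"): 0.05, ("NOUN", "coat"): 0.1, ("NOUN", "village"): 0.1}

def viterbi_tag(words, tags=("DET", "ADJ", "NOUN")):
    """Find the most probable tag sequence under the toy bigram tag HMM."""
    # best[t] = (log probability, tag sequence) of the best path ending in tag t
    best = {"<s>": (0.0, [])}
    for w in words:
        new_best = {}
        for t in tags:
            e = emit.get((t, w), 1e-6)   # small floor for unseen word/tag pairs
            for prev, (score, seq) in best.items():
                tr = trans.get((prev, t), 1e-6)
                cand = score + math.log(tr) + math.log(e)
                if t not in new_best or cand > new_best[t][0]:
                    new_best[t] = (cand, seq + [t])
        best = new_best
    return max(best.values())[1]

print(viterbi_tag("the green coat".split()))   # ['DET', 'ADJ', 'NOUN']
```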
General semantic analysis for the derivation of meaning representations is a complex aspect of computational linguistics. However, many speech-understanding systems have used a simpler approach that is more application dependent. A popular technique uses templates, or 'frames', where each frame is associated with an action to be taken by the system. A frame has 'slots' for different items of relevant information. In a flight enquiry system, for example, one frame could represent a request for flight information. This frame could include slots for the type of information (e.g. flights, fares) and for any constraints (e.g. date, price).
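Such a frame can be represented very simply in code. The sketch below shows a hypothetical flight-enquiry frame with slots to be filled from the analysed recognizer output; the slot names and the example values are illustrative assumptions, not taken from the text.

```python
from dataclasses import dataclass, field
from typing import Optional, Dict

@dataclass
class FlightEnquiryFrame:
    """Template ('frame') for a flight-information request, with 'slots'
    to be filled from the analysed recognizer output."""
    info_type: Optional[str] = None          # e.g. "flights" or "fares"
    constraints: Dict[str, str] = field(default_factory=dict)  # e.g. date, price

    def missing_slots(self):
        """Slots still empty; a dialogue manager could prompt the user for these."""
        missing = []
        if self.info_type is None:
            missing.append("info_type")
        if "date" not in self.constraints:
            missing.append("date")
        return missing

# Filling slots from a (hypothetical) parsed utterance such as
# "are there any flights to Boston on Friday".
frame = FlightEnquiryFrame(info_type="flights",
                           constraints={"destination": "Boston", "date": "Friday"})
print(frame.missing_slots())   # [] (all required slots are filled)
```

A dialogue manager of the kind mentioned in the chapter summary could then prompt the user for any slots that remain empty.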
CHAPTER 12 SUMMARY

• Some large-vocabulary recognition tasks may require accurate transcription of the words that have been said, while others will need understanding of the semantic content (but not necessarily accurate recognition of every word).
• For speech transcription the task is to find the most likely sequence of words, where the probability for any one sequence is given by the product of acoustic-model and language-model probabilities.
• The principles of continuous speech recognition using HMMs can be applied to large vocabularies, but with special techniques to deal with the large number of different words that need to be recognized.
• It is not practical or useful to train a separate model for every word, and instead sub-word models are used. Typically phone-size units are chosen, with the pronunciation of each word being provided by a dictionary.
• Triphone models represent each phone in the context of its left and right neighbours. The large number of possible triphones is such that many will not occur in any given set of training data. Probabilities for these triphones can be estimated by 'backing off' or interpolating with biphones (dependent on only the left or the right context) or even context-independent monophones.
• Another option, which allows greater context specificity to be achieved, is to group ('cluster') similar triphones together and share ('tie') their parameters. A phonetic decision tree can be used to find the best way to cluster the triphones based on questions about phonetic context. The idea is to optimize the fit to the data while also having sufficient data available to train each tied state.
• An embedded training procedure is used, typically starting by estimating parameters for very general monophone models for which a lot of data are available. These models are used to initialize triphone models. The triphones are trained and similar states are then tied together. Multiple-component mixture distributions are introduced at the final stage.
• The purpose of the language model is to incorporate language constraints, expressed as probabilities for different word sequences. The perplexity, or average branching factor, provides a measure of how good the language model is at predicting the next word given the words that have been seen so far.
• N-grams model the probability of a word depending on just the immediately preceding N - 1 words, where typically N = 2 ('bigrams') or N = 3 ('trigrams').
• The large number of different possible words is such that data sparsity is a massive problem for language modelling, and special techniques are needed to estimate probabilities for N-grams that do not occur in the training data.
• Probabilities for N-grams that occur in the training text can be estimated from frequency counts, but some probability must be 'freed' and made available for those N-grams that do not occur. Probabilities for these unseen N-grams can then be estimated by backing off or interpolating with more general models.
• The principles of HMM recognition extend to large vocabularies, with a multiple-level structure in which phones are represented as networks of states, words as networks of phones, and sentences as networks of words. In practice the decoding task is not straightforward due to the very large size of the search space, especially if cross-word triphones are used. Special treatment is also required for language models whose probabilities depend on more than the
immediately preceding word (i.e. for models more complex than bigrams). The one-pass Viterbi search can be extended to operate with cross-word triphones and with trigram language models, but the search space becomes very large and is usually organized as a tree. Efficient pruning is essential.
• An alternative search strategy uses multiple passes. The first pass identifies a restricted set of possibilities, which are typically organized as an N-best list, a word lattice or a word graph. Later passes select between these possibilities. Another option is to use a depth-first search.
• Automatic speech understanding needs further processing of the speech recognizer output to analyse the meaning, which may involve syntactic and semantic analysis. To reduce the impact of recognition errors, it is usual to start with an N-best list or word lattice. Partial parsing techniques can be used for syntactic analysis to deal with the fact that the spoken input may be impossible to parse completely because parts do not fit the grammar, due to grammatical errors, hesitations and so on.
• Meaning is often represented using templates, which will be specific to the application and have 'slots' that are filled by means of a linguistic analysis.
• In spoken dialogue systems, a dialogue manager is used to control the interaction with the user and ensure that all necessary information is obtained.
• ARPA has been influential in promoting progress in large-vocabulary recognition and understanding, by sponsoring the collection of large databases and running series of competitive evaluations. Error rates of less than 10% have been achieved for transcribing unlimited-vocabulary read speech and for understanding spoken dialogue queries. Recognition of more casually spoken spontaneous speech is still problematic.
CHAPTER 12 EXERCISES

E12.1 Explain the different requirements and problems in speech transcription and speech understanding.
E12.2 What are the special issues for the design of the acoustic model, the language model and the search component when a recognizer needs to cope with a large vocabulary size?
E12.3 Give examples of how acoustic/phonetic knowledge can be used when choosing an acoustic model set for a large-vocabulary recognition system.
E12.4 When estimating trigram language-model probabilities based on counts in some training text, explain how probabilities can be estimated for trigrams that do not occur in the training data. How does this operation affect the probabilities that need to be assigned to trigrams that do occur?
E12.5 What are the similarities and differences between the concept of 'backing off' in language modelling and in acoustic modelling?
E12.6 Explain the relative merits of performing large-vocabulary recognition using a one-pass Viterbi search versus using a multiple-pass search strategy.
E12.7 What measures are required to evaluate the performance of a speech understanding system?