A Survey of Hardware Architectures
Designed for Speech Recognition

Hsiao-Wuen Hon
August 9, 1991
CMU-CS-91-169

School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213

Abstract

In the past few years, many special-purpose hardware designs have emerged to support speech recognition systems. This paper identifies the system requirements of the components of spoken language systems, such as search and training. Some general design criteria for speech hardware architectures are also presented. Based on these criteria, we survey a variety of notable special-purpose computer architectures designed for speech recognition systems and make a paper-and-pencil evaluation of those architecture alternatives.

This research was supported by the Defense Advanced Research Projects Agency (DOD) and monitored by the Space and Naval Warfare Systems Command under Contract N00039-85-C-0163, ARPA Order No. 5167. The views and conclusions contained in this document are those of the author and should not be interpreted as representing the official policies, either expressed or implied, of DARPA or the U.S. government.

Keywords: speech recognition, spoken language system, Viterbi search, beam search, stack decoding, GSM (Graph Search Machine), ASPEN (AT&T's Systolic Processing Ensemble), MARS (Microprogrammable Accelerator for Rapid Simulation), Search Machine, BEAM, PLUS.

Contents

1. Introduction
2. General System Requirements for Speech Recognition Systems
   2.1. Search
        2.1.1. Viterbi Search
        2.1.2. Stack Decoding
        2.1.3. Continuous vs. Discrete Density
   2.2. Training
3. Design Criteria of Speech Hardware Architectures
4. Survey of Some Notable Systems
   4.1. AT&T's Graph Search Machine
   4.2. SRI-Berkeley's Speech Search Machine
   4.3. AT&T's ASPEN Tree-Machine
   4.4. MARS
   4.5. CMU's BEAM accelerator
   4.6. CMU's PLUS

1. Introduction

Communicating with machines through natural spoken language has always been a dream of human beings. To achieve this goal, we need a speech recognition system that is not only accurate but also real-time. Speech recognition has been an active research area for quite some time, and a good deal of progress has been made in the last couple of decades. In the last three to five years, the Hidden Markov Model (HMM) [26, 13, 23, 34, 43, 48] has proven to be a popular and effective technique for the implementation of speech recognition systems. However, the ultimate goal of highly accurate large-vocabulary continuous speech recognition with real-time response is hard to achieve without special hardware support, because of the large computational and memory requirements of the HMM approach.

In order to speed up the speech recognition process, many special-purpose hardware designs for speech recognition systems have emerged in the last couple of years. They range from fully custom machines to systolic machines, pipeline machines, shared-memory multiprocessors, message-passing machines, and distributed multiprocessor machines. Of course, every architecture has its own pros and cons. Some architectures are much faster, while others are easier to program, easier to expand, cheaper to build, or more flexible across different kinds of applications. This paper attempts to describe those features and provide a survey of some notable systems developed recently. We would like this document to serve as a future reference for people who may contemplate building new speech architectures or for those who would like to choose among the architecture alternatives that are available.

The paper is organized as follows. Section 2 characterizes the general computation and memory requirements of a variety of speech-related algorithms, such as recognition and training. Section 3 addresses the general design criteria of hardware architectures for speech recognition systems. The paper-and-pencil evaluation and comparison of a variety of notable architectural alternatives are presented in Section 4.

2. General System Requirements for Speech Recognition Systems

Although this paper focuses on HMM-based systems, it doesn't necessarily exclude other approaches, like knowledge-based systems [52] or neural-network-based systems [51]. Nevertheless, since the HMM has been the predominant approach for continuous speech recognition, almost all of the hardware architecture designs for speech recognition are aimed at HMM-based systems. Actually, some architectures mentioned in this paper could be extended to non-HMM systems with small modifications.

Looking across the spectrum of a whole speech recognition system, front-end signal processing [33, 29], which transforms the input signal into a form suitable for analysis, is of no concern from the computational point of view. Although the auditory models [14, 32, 31, 1] suggested by some speech researchers require large amounts of computation, their superiority in continuous speech recognition has not been shown. Commercial digital signal processors can handily generate the two most widely used speech signal representations, FFT and LPC [33], in real time. On the other hand, the search part requires large amounts of computation and therefore becomes the most challenging part of the architecture design if a practical speech recognition system is to achieve real-time performance. Moreover, although the training process, whose responsibility is to build accurate models for recognition, doesn't require real-time performance in general, it is still important to speed up the training process so as to reduce the time required for model development.

In the following, we will describe and characterize the general computation and memory requirements of a variety of speech-related algorithms, including Viterbi search, stack decoding, and training. Since parallel (including pipeline) architectures are most frequently used to speed up intense computation, we will pay particular attention to parallel implementations of these algorithms.

2.1. Search

2.1.1. Viterbi Search

The Viterbi search [50] is the most popular search algorithm for continuous speech recognition because of its efficiency. It has been successfully applied in many speech recognition systems [27, 35, 13, 48, 34, 42]. The Viterbi search is a time-synchronous search algorithm that completely processes all input up to time t before going on to time t + 1. At time t, each state is updated by the best score from the states at time t - 1. From this, the most probable state sequence can be found by backtracking at the end of the search.

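Concretely, if S_t(j) denotes the score of the best path ending in state j at time t, a_ij the transition probability from state i to state j, and b_j(o_t) the output probability of the frame o_t observed in state j, the time-synchronous update (written here in the logarithmic representation discussed below) is

    S_t(j) = max over predecessors i of [ S_{t-1}(i) + log a_ij ] + log b_j(o_t)

Recording the maximizing predecessor i at each update is what makes the backtracking at the end of the search possible.
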
A full Viterbi search is quite efficient for moderate tasks [7]. However, for large tasks, it can be very time consuming. A straightforward way to prune the search space is the beam search [30, 49, 36]. Instead of retaining all candidates at every time frame, a threshold T is used to maintain only a group of likely candidates. The use of the beam search removes the necessity of enumerating all of the states and can lead to substantial savings in computation with little or no loss of accuracy. For example, in the implementation of SPHINX [25], it was found possible to prune 80%-90% of the hypotheses without any loss in accuracy.

There are several advantages to using the Viterbi search algorithm. First, if we use a logarithmic representation of the HMM probabilities, Viterbi search is extremely efficient because multiplications of probabilities become additions. Second, because of its time-synchronous nature, it is very easy to modify the time-synchronous Viterbi algorithm into a beam search, and it is also well suited to parallel implementation if necessary.

Although actual implementations of the Viterbi search algorithm vary, it can roughly be broken down into the following two modules (a sketch of how they interact appears below):

• The word expansion module performs Viterbi search and finds the probabilities for words in the vocabulary matching the speech signal.

• The grammar expansion module controls the word expansion module by telling it what words to process in the next time frame based on grammar constraints.

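As a concrete illustration, the following is a minimal sketch, in C, of one time frame of beam-pruned word expansion. It is not SPHINX's actual code: all names and constants (State, BEAM_WIDTH, log_output_prob, and so on) are illustrative assumptions, and integer log probabilities are used as discussed above.

#include <limits.h>

#define NUM_STATES 40000         /* ~1,000 words x 40 states (Section 2.1.1) */
#define MAX_PREDS  2             /* average predecessors per state           */
#define BEAM_WIDTH 5000          /* pruning margin in log-probability units  */
#define LOG_ZERO  (INT_MIN / 2)  /* "impossible" score, safe from overflow   */

typedef struct {
    int num_preds;
    int pred[MAX_PREDS];         /* predecessor state indices     */
    int log_trans[MAX_PREDS];    /* log transition probabilities  */
} State;

static State states[NUM_STATES];
static int old_score[NUM_STATES], new_score[NUM_STATES];

/* Stub: for discrete HMMs this is vector quantization plus table lookup. */
static int log_output_prob(int state, int frame)
{
    (void)state; (void)frame;
    return 0;
}

/* Word expansion module: update each state from its best predecessor,
 * skipping predecessors whose score falls below the beam. */
static void word_expansion(int frame, int best_prev_score)
{
    int threshold = best_prev_score - BEAM_WIDTH;
    for (int j = 0; j < NUM_STATES; j++) {
        int best = LOG_ZERO;
        for (int p = 0; p < states[j].num_preds; p++) {
            int i = states[j].pred[p];
            if (old_score[i] < threshold)    /* beam pruning */
                continue;
            int s = old_score[i] + states[j].log_trans[p];
            if (s > best)
                best = s;
        }
        new_score[j] = (best == LOG_ZERO) ? LOG_ZERO
                                          : best + log_output_prob(j, frame);
    }
}

The grammar expansion module would then scan the word-final states that survive the beam and seed the initial states of the grammatically legal successor words for the next frame.
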
The DARPA resource management task (see [25] for a description), which is a 1,000-word, perplexity 60 task, has been used as a common benchmark by the speech community since 1987. As a result, many architecture designs aim to achieve this task in real time. We will use the resource management task as our benchmark to evaluate different architecture designs in Section 4. In order to do that, we must first analyze the time and space requirements of Viterbi search on the resource management task. Let's take SPHINX's [27] implementation as an example, without loss of generality. In the SPHINX system, allophones are described with seven-state models and words are sequences of (on average) six allophones. Therefore, each word requires about 40 states. Resource management used a statistical grammar which specifies the legal word pairs. In a regular Viterbi algorithm, the word expansion module needs to update all the states at every frame (10 ms), so for the resource management task this would require updating about 40,000 states every 10 ms, or about 80,000 transitions, since there are, on average, two predecessors for each state. About 7 instructions are required for each transition in the SPHINX system, which used a discrete HMM approach. (For continuous HMM approaches, the number of instructions required is at least one order of magnitude greater than for the discrete HMM approach; please see Section 2.1.3 for details.) This is quite a lot of computation. Fortunately, because of its time-synchronous nature, the word expansion module can be implemented in parallel efficiently. The other way to reduce the computational load is to use the beam search described earlier. For example, in the SPHINX implementation, a beam search could reduce the number of states that need to be updated at every frame to about 4,000 (8,000 transitions) instead of 40,000 (80,000 transitions). On the other hand, the computational requirements of the grammar expansion module depend on the complexity of the grammar. For example, the resource management task with a word-pair grammar of perplexity 60 will need to process 1,000 * 60 grammar transitions per frame for a non-pruned Viterbi search. Of course, the beam search can again be used here to reduce the computational requirements. If the Viterbi search algorithm (with or without beam pruning) is implemented on a single-processor general-purpose machine, the word expansion module generally consumes about 70% to 80% of the total search time (and up to 95% if using continuous densities), while the grammar expansion module consumes about 20% to 30%. Therefore, most system designs are aimed at speeding up the word expansion module.

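To put these numbers in perspective, a rough, illustrative calculation: the unpruned word expansion performs 80,000 transitions * 7 instructions = 560,000 instructions per 10 ms frame, or about 56 MIPS, while the beam-pruned version at 8,000 transitions needs only about 5.6 MIPS; the 60,000 grammar transitions per frame add roughly another 6 million transition evaluations per second before pruning.
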
While the speed of commercial general-purpose processors continues to improve (at a rate of two to three times per year), the speed of memory access does not increase at the same rate [20, 39]. Unfortunately, the word expansion and grammar expansion modules require access to a large amount of memory at high bandwidth, and therefore the Viterbi search is mainly a memory-bound task. Because of the frame-based nature of the Viterbi search, it exhibits a very poor cache hit rate, so memory bandwidth is probably the major bottleneck in Viterbi search. For example, the DARPA resource management task requires an average memory bandwidth of 40 Mbytes per second for SPHINX [9] if it is executed in real time. This figure could grow even larger because most systems tend to use more sophisticated models in order to achieve better performance. Most commercial bus-based shared-memory multiprocessor machines are not a good solution for Viterbi search, because they work well only with algorithms that have a high cache hit rate, and serious memory contention will slow down the processors when more than a few (about five) run in parallel.

The other potential problem for parallel implementations of Viterbi search is the substantial amount of synchronization required, because the Viterbi algorithm exhibits very fine granularity. For example, the DARPA resource management task allows only about 2 μs between synchronizations, even with the beam search. Although beam search can substantially reduce computation, it also introduces a load-balancing problem, because it retains an unpredictable pattern of the most promising states and therefore requires more complicated synchronization among processors.

2.1.2. Stack Decoding

Stack decoding [3] is a best-first search algorithm for continuous speech recognition. It is derived from the A* algorithm [37] used in artificial intelligence. Stack decoding is not time-synchronous, but extends paths of different lengths. Stack decoding constructs a search tree from a state graph S that is specified by the language model. The states in the graph represent the states in the language, and the branches in the search tree are words that cause transitions from one language state to another. The search begins by adding the initial states of the language, with empty hypotheses, to the OPEN list. Then, each time the best hypothesis is popped from the OPEN list, all paths from it are extended, evaluated, and placed back in the OPEN list. The search continues until a complete path that is guaranteed to be no worse than all paths in the OPEN list has been found. In selecting the best hypothesis, we need an evaluation function that estimates the score of the complete path as the sum of the actual score of the partial path and the expected score of the remaining path. If the expected score of the remaining path is always an underestimate of the actual score, then the solution found will be optimal. This is called the admissibility (optimality) property of the A* search [38].

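The control loop itself is simple; what follows is a minimal sketch in C. The OPEN list here is a naive unsorted array scanned linearly (a real decoder would use a heap or a sorted stack), the extension and goal-test routines are task-specific stubs, and all names are illustrative assumptions rather than any published implementation.

#define MAX_OPEN 4096

typedef struct Hyp {
    double g;          /* actual log score of the partial path     */
    double h;          /* heuristic estimate of the remaining path */
    int    state;      /* current language state                   */
    int    last_word;  /* last word on the path (for backtrace)    */
} Hyp;

static Hyp open_list[MAX_OPEN];
static int n_open = 0;

static void insert(Hyp hyp)
{
    if (n_open < MAX_OPEN)
        open_list[n_open++] = hyp;
}

/* Pop the hypothesis with the best f = g + h; in the log domain,
 * larger is better.  A real decoder must also handle an empty OPEN list. */
static Hyp pop_best(void)
{
    int best = 0;
    for (int i = 1; i < n_open; i++)
        if (open_list[i].g + open_list[i].h >
            open_list[best].g + open_list[best].h)
            best = i;
    Hyp result = open_list[best];
    open_list[best] = open_list[--n_open];   /* swap-delete */
    return result;
}

/* Task-specific stubs: goal test and one-word extension. */
static int is_complete(Hyp h)     { (void)h; return 1; }
static int n_extensions(Hyp h)    { (void)h; return 0; }
static Hyp extend(Hyp h, int k)   { (void)k; return h; }

Hyp stack_decode(Hyp initial)
{
    insert(initial);
    for (;;) {
        Hyp best = pop_best();
        if (is_complete(best))
            return best;   /* with an underestimating h, this is optimal */
        for (int k = 0; k < n_extensions(best); k++)
            insert(extend(best, k));
    }
}
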
Viterbi search is a graph search, and paths cannot be summed because they may have different word histories. Stack decoding is a tree search, so each node in OPEN has a unique history, and paths can be summed within word hypotheses when a word is extended. Therefore, with stack decoding, it is possible to find the optimal word sequence with multiple beginning and ending times, which is more consistent with the training algorithm [8], rather than the optimal state sequence. Moreover, Viterbi search computes over the entire search space at all times (although beam search can reduce the search space), while stack decoding computes only with the best hypothesis in each expansion. Since the memory requirement of dynamic expansion is much less than that of Viterbi search, stack decoding is likely to be able to deal with very large vocabularies and more complicated grammars.

While the aforementioned properties of stack decoding are very attractive, many implementation problems arise. First, it is very hard to compare the probabilities of partial paths of different lengths. Direct comparison of probabilities will cause the search to always favor shorter paths, resulting in a full breadth-first search. Whereas the acoustic probabilities being compared in Viterbi search are always based on the same partial path, this is not the case in stack decoding, so stack decoding needs to perform some normalization on the probabilities. Unfortunately, there is no easy way to find such a normalization. Second, an evaluation function that guarantees the admissibility of the algorithm is difficult to find for speech recognition, because of uncertainty about the remaining part of the speech. A function that is guaranteed to underestimate will result in a very large OPEN list, so a heuristic function that may overestimate has to be used to evaluate the remaining paths. Of course, this invalidates the admissibility of the algorithm. Even with an overestimating evaluation function, the OPEN list will still be too large. Therefore, two other types of pruning are used to reduce the search burden:

1. A fast-match phase that extends only a small fraction of the most promising paths from a partial path in order to get a list of the most likely following words, which then await detailed matching [4].

2. The use of a stack (similar to the beam in Viterbi search) that saves only a fixed number of hypotheses in the OPEN list.

Finally, consideration of multiple beginning and ending times of words is very tricky, because it is too costly to try all possible beginning and ending times. Some heuristic must be used to determine the boundary regions.

Given the problems described above, stack decoding is considerably more difficult to implement. Thus, very limited effort has been invested in stack decoding for continuous speech recognition systems [6, 41], and there are even fewer hardware architectures aimed at stack decoding. Although all the speech architectures mentioned in this paper are aimed at Viterbi search, we would also like to analyze the possibility of exploring stack decoding. In order to do this, let's now try to characterize the potential system requirements of stack decoding.

Unlike Viterbi search, a parallel implementation of stack decoding is not so straightforward, since it is not time-synchronous. One obvious parallel implementation pops the N best hypotheses from the stack and sends each to a different processor. Once each processor completes one level of expansion (from one language state to another language state), the new hypotheses are inserted back into the stack or merged with existing states if necessary. Since stack decoding is a best-first search, some early actions taken by executing non-best hypotheses in parallel might later prove to be unnecessary, and therefore it might be a waste of time to explore the less promising hypotheses. Based on simulation results [40], the best complete path usually comes from the 10 best hypotheses. From this point of view, this kind of decomposition seems unsuitable for massively parallel implementations. A second type of parallel decomposition can be applied when performing the search on the best hypothesis. In general, there are many language transitions out of a language state, depending on the perplexity of the grammar. For example, there are on average 60 grammar transitions per language state for a grammar of perplexity 60. It is ideal to have a master processor pop the best hypothesis and hand out the different word evaluations to different slave processors according to the grammar transitions. After the slave processors finish word evaluation, the new hypotheses, along with their score information, are sent back to the master processor, which finishes the bookkeeping and merging and finally inserts the new hypotheses back into the stack. This decomposition allows not only massive parallelism but also medium granularity, because there are no time-synchronous constraints. On the other hand, in stack decoding hypotheses are extended one word at a time instead of one frame at a time, and different words tend to have different lengths, so processors are not likely to terminate at the same time when running word evaluations in parallel. Therefore, dynamic load balancing is crucial to implementing stack decoding in parallel efficiently. Moreover, the communication between processors is more complicated than in Viterbi search, because it happens almost everywhere and is no longer time-synchronous. Fortunately, memory bandwidth seems to be a less serious problem for stack decoding. Since stack decoders do word-based evaluation instead of frame-based evaluation, it is not necessary to deal with the whole search space that Viterbi search is always likely to access. Processors enjoy relatively good temporal and spatial locality, and therefore stack decoding would achieve a much higher cache hit rate than Viterbi search.

The other related issue in stack decoding is the use of the fast match. Since stack decoding cannot prune an unpromising word hypothesis until the whole word evaluation completes, it becomes increasingly important, as the vocabulary grows, to find a short list of promising candidate words quickly and cheaply (the fast match) before performing slow and expensive detailed matches. Because it must be quick and cheap, a parallel implementation is ideal for the fast match. Like parallel word evaluation, this parallelism tends to have medium granularity and a high cache hit rate. In addition, we could run, in parallel, the detailed acoustic match for the most promising hypothesis and the fast match that finds the short word list for the second most promising hypothesis (whose detailed match comes next), and so on. However, complex synchronization is the inevitable price to pay for this kind of hybrid parallelism.

2.1.3. Continuous vs. Discrete Density

The other important feature affecting the design of speech architectures is the type of density function embedded in the HMM representation. One is the discrete density function [46], which models the output probability of each output symbol explicitly. With discrete-density HMMs, each frame of speech must be represented by a symbol from a finite alphabet through vector quantization (VQ) [29]. VQ is a data compression technique that first identifies a set of prototype vectors from training data. Speech input is then converted from a multi-dimensional real feature vector of, say, FFT or LPC coefficients to a symbol that represents the best-matching prototype vector. The other is the continuous density function. By assuming certain properties of the output distributions in advance, the output probability of observed speech can be obtained from the distribution formulas. The most frequently used continuous density is the multivariate Gaussian density [46], whose output probability density function (pdf) is described by a mean vector and a covariance matrix. To reduce computation, the covariance matrix is sometimes assumed to be diagonal (all the off-diagonal terms are zero). Moreover, mixture densities [47] are often used to improve the modeling, because a single-mixture diagonal Gaussian density function does not adequately represent certain speech parameters [47].

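For reference (standard algebra, not from the original text), with a diagonal covariance the log output density of a frame x under a Gaussian with mean vector mu and variances sigma_k^2 reduces to a sum of per-dimension terms:

    log b(x) = -(1/2) * sum over k of [ (x_k - mu_k)^2 / sigma_k^2 ]
               -(1/2) * sum over k of log(2 * pi * sigma_k^2)

The first term costs one subtraction, one multiplication, and one addition per dimension, while the second term is a constant that can be precomputed once per density; this precomputed constant appears as the determinant ("Det") term in the continuous-density pseudocode below.
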
The use of continuous HMMs requires heavy computation. With discrete HMMs, computing the output probability of an observed speech frame during the search is merely vector quantization and table lookup. With continuous HMMs, on the other hand, more memory accesses and many more instructions are required for every transition, even with the simplest single-mixture diagonal Gaussian density function. For example, every single-mixture diagonal Gaussian density function in the AT&T system requires more than 72 multiplications and 36 additions. The use of multiple mixtures or full covariances would further increase the computational requirements of speech recognition with continuous HMMs.

To compare the computational requirements of discrete and continuous densities, pseudocode for computing observation probabilities from each is given below. The pseudocode for discrete densities is based on the implementation in SPHINX, and that for continuous densities is based on the semi-continuous SPHINX system [22], which resembles a 4-mixture multivariate diagonal Gaussian density function. Logarithmic probabilities are used in both cases. Suppose there are 3 different codebooks and the size of each one is 256. As can be seen in the first part (quantization), continuous densities need about twice as many multiplications and additions as the discrete ones. In the second part (output probability generation), while discrete densities need only 3 additions for each distribution, continuous densities need 3 * 4 * 2 more additions, 3 * 4 more exponentiations, and 3 more logarithms, which are very expensive, for each distribution. Note that the computation for discrete densities is fixed as long as the number of codebooks and the sizes of the codebooks remain unchanged. The computation for continuous densities, on the other hand, can grow with more mixtures or more Gaussian densities even when the number and sizes of the codebooks remain unchanged. For example, the AT&T system [24], which uses more mixtures (more than 60) and more Gaussian densities (more than 9,000), could potentially require much more computation.

Discrete Density:

for each frame {
    for each codebook K_i {
        min = INFINITY;
        B_i = 0;
        for each code C_j in K_i {
            dist = 0;
            for each dimension k in C_j {
                dist = dist + (cep[k] - C_j[k])^2;
                /* cep[] is the cepstrum vector and C_j[] is the centroid vector for C_j */
            }
            if (dist < min) {
                B_i = j;
                min = dist;
            }
        }
    }
    for each distribution i {
        Prob_i = 0;
        for each codebook j {
            Prob_i = Prob_i + Out_ij[B_j];
        }
    }
}

Continuous Density:

for each frame {
    for each codebook K_i {
        for each code C_j in K_i {
            Gauss_ij.code = j;
            Gauss_ij.prob = 0;
            for each dimension k in C_j {
                Gauss_ij.prob = Gauss_ij.prob - (cep[k] - C_j[k])^2 * Var_j[k];
                /* cep[] is the cepstrum vector, C_j[] is the mean vector, and
                   Var_j[] is the inverse covariance vector of C_j */
            }
            Gauss_ij.prob = Gauss_ij.prob + Det_ij;
            /* Det_ij is the determinant term of Gaussian pdf Gauss_ij */
        }
    }

    for each codebook K_i
        SORT Gauss[i, 1..size_of_codebook(K_i)] according to field prob;

    for each distribution i {
        Prob_i = 0;
        for each codebook j {
            temp = 0;
            for each mixture k {
                temp = temp + exp(Out_ij[Gauss_jk.code] + Gauss_jk.prob);
                /* Out_ij[] is the mixture weight vector for distribution i, codebook j */
            }
            Prob_i = Prob_i + log(temp);
        }
    }
}

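To make the comparison concrete, suppose, purely for illustration, that each code has 12 dimensions. The quantization part then costs on the order of 3 * 256 * 12 = 9,216 multiply-add pairs per frame for the discrete case, and roughly twice that for the continuous case (an extra multiplication by the inverse variance per dimension, plus the sort). In the output probability part, about 2,000 distributions (the SPHINX figure cited below) cost 2,000 * 3 = 6,000 additions per frame with discrete densities, versus 2,000 * 24 additions, 2,000 * 12 exponentiations, and 2,000 * 3 logarithms with the 4-mixture continuous densities, i.e. tens of thousands of transcendental function evaluations every 10 ms.
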
In spite of the heavy computational requirements of continuous HMMs, they do not necessarily present as tough a problem as one might think at first glance. First of all, thanks to advances in VLSI technology, multiplication can be executed as fast as addition, and exponentiation and logarithms can be computed effectively by math co-processors. Second, the computation of observation probabilities can be logically separated from the search part, so the two can be executed in parallel, with the search part simply reading an observation probability result buffer filled by the observation probability generator. This functional separation can also be used to speed up large-vocabulary Viterbi search on a single-processor machine. For example, while there are about 2,000 distributions in the SPHINX system, it needs to update about 8,000 transitions at every frame (see Section 2.1.1). Without functional separation, it would therefore need to compute 8,000 distributions instead of 2,000, many of them actually re-computed several times. The recent SPHINX system [22] used a semi-continuous density function, which resembles a 4-mixture multivariate Gaussian density function, and its recognition ran only 3-5 times slower than the discrete SPHINX; without the functional separation it would be about 3 times slower still. Moreover, the observation probability generator is a sequence of vector-style operations which can be implemented in parallel with almost no synchronization or load balancing, and can even be accelerated by vector processing machines like the Cray and the Alliant.

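A minimal sketch in C of this functional separation follows; the names and the stubbed-out density evaluation are illustrative assumptions, not SPHINX code. The point is simply that each of the roughly 2,000 distributions is evaluated exactly once per frame, after which the roughly 8,000 transition updates become cheap buffer lookups.

#define NUM_DISTS 2000    /* distinct output distributions (SPHINX-like) */
#define NUM_TRANS 8000    /* beam-surviving transitions per frame        */

static int obs_prob[NUM_DISTS];   /* result buffer, refilled every frame */

/* Stubs: density evaluation (VQ or Gaussian) and search-side plumbing. */
static int  eval_density(int d, int frame)  { (void)d; (void)frame; return 0; }
static int  dist_of_transition(int t)       { (void)t; return 0; }
static void update_transition(int t, int p) { (void)t; (void)p; }

void process_frame(int frame)
{
    /* Observation probability generator: vector-style, no synchronization,
     * no load imbalance; it could run on a co-processor in parallel with
     * the search over the previous frame's buffer. */
    for (int d = 0; d < NUM_DISTS; d++)
        obs_prob[d] = eval_density(d, frame);

    /* Search part: each transition update is now a table lookup. */
    for (int t = 0; t < NUM_TRANS; t++)
        update_transition(t, obs_prob[dist_of_transition(t)]);
}
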
2.2. Training

In comparison with recognition, training is a much less serious problem, because the paths to search during training are always known. A good implementation on a commercial single-processor workstation (a Sun, DECstation, or NeXT) can easily perform the job in real time. However, the training database normally consists of thousands of utterances constituting hours of speech. For example, the DARPA speaker-independent resource management training database [44] consists of about 4,000 utterances, and CMU's general English training database consists of more than 20,000 utterances [21]. Training may also consist of several iterations and several phases, such as monophone training and triphone training. Besides, discriminative training schemes (such as Maximum Mutual Information training and corrective training) [11, 5] can add many more phases to the training process. Therefore, it would be desirable for a speech hardware architecture to accelerate the training process as well.

As mentioned before, training on one utterance is basically a one-path parametric re-estimation, a sequential computation that cannot benefit from a parallel implementation. However, since the training database contains many utterances, it is possible to train on different utterances in parallel. This decomposition is very easy and requires only a small amount of synchronization (the granularity is relatively large: only one synchronization every 1-2 seconds of speech).

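A minimal sketch, using POSIX threads, of this utterance-level decomposition follows. The per-utterance forward-backward accumulation is left as a stub, and all names and sizes are illustrative assumptions; the point is the shape of the decomposition: a private count accumulator per worker and a single reduction per iteration.

#include <pthread.h>
#include <string.h>

#define NUM_WORKERS    8
#define NUM_UTTERANCES 4000    /* e.g., resource management training set  */
#define NUM_PARAMS     100000  /* size of the count accumulator (assumed) */

static double counts[NUM_WORKERS][NUM_PARAMS];   /* private accumulators */

/* Stub: run forward-backward on one utterance against the current models
 * and add the resulting counts into acc[]. */
static void accumulate_utterance(int utt, double *acc) { (void)utt; (void)acc; }

static void *worker(void *arg)
{
    int id = (int)(long)arg;
    /* Static interleaved partition: worker id trains utterances
     * id, id + NUM_WORKERS, id + 2*NUM_WORKERS, ... independently. */
    for (int u = id; u < NUM_UTTERANCES; u += NUM_WORKERS)
        accumulate_utterance(u, counts[id]);
    return NULL;
}

void train_one_iteration(void)
{
    pthread_t tid[NUM_WORKERS];
    memset(counts, 0, sizeof counts);
    for (long i = 0; i < NUM_WORKERS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_join(tid[i], NULL);          /* the only synchronization */

    /* Reduction: merge private counts, then re-estimate the models. */
    for (int p = 0; p < NUM_PARAMS; p++)
        for (int w = 1; w < NUM_WORKERS; w++)
            counts[0][p] += counts[w][p];
    /* reestimate_models(counts[0]);  -- hypothetical final step */
}
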
Although the memory bandwidth required for training is much less than that for recognition, the total memory needed can be much greater, because training normally needs several copies of the HMMs. Since training on different utterances requires accessing different models, keeping all the models visible to the training processors is essential for speed-up. If primary memory cannot hold all the space needed for training, disk swapping will be fatal to the training process. Fortunately, swapping between disk and primary memory, and between primary memory and cache, only happens when training completes on the current utterance and begins on the next. We could arrange the training database so that the differences between consecutive utterances are minimized, thereby requiring fewer page-ins and page-outs. Another possibility is to create a process that pre-fetches the data required for the next utterance while training on the current one [2]. Even if the pre-fetch requires paging, the fetch will not slow the training of the current utterance, since the CPU and I/O operations are interleaved.

3. Design Criteria of Speech Hardware Architectures

In this section, we would like to give some of the desired criteria for speech hardware architecture designs. A good architecture design doesn't necessarily need to satisfy all the criteria we state, because different architecture designs have their own goals. However, we would like to list some criteria which ma
