`
`A Survey of Hardware Architectures
`Designed for Speech Recognition
`
`Hsiao-Wuen Hon
`August 9, 1991
CMU-CS-91-169
`
`School of Computer Science
`Carnegie Mellon University
`Pittsburgh, PA 15213
`
`Abstract
`
In the past few years, many special-purpose hardware designs have emerged to support
speech recognition systems. This paper identifies the system requirements of the components
of spoken language systems, such as search and training. Some general design criteria
for speech hardware architectures are also presented. Based on these criteria, we survey a variety of
notable special-purpose computer architectures designed for speech recognition systems and make
a paper-and-pencil evaluation of those architectural alternatives.
`
`This research was supported by the Defense Advanced Research Projects Agency (DOD) and monitored by the
`Space and Naval Warfare Systems Command under Contract N00039-85-C-0163, ARPA Order No. 5167.
`The views and conclusions contained in this document are those of the author and should not be interpreted as
`representing the official policies, either expressed or implied, of DARPA or the U.S. government.
`
`
Keywords: speech recognition, spoken language system, Viterbi search, beam search, stack
decoding, GSM (Graph Search Machine), ASPEN (AT&T's Systolic Processing Ensemble),
MARS (Microprogrammable Accelerator for Rapid Simulation), Search Machine, BEAM,
PLUS.
`
`
`Contents
`
`1. Introduction
`
`2. General System Requirements for Speech Recognition Systems
`
`2.1. Search
`
`2.1.1. Viterbi Search
`
`2.1.2. Stack Decoding
`
`2.1.3. Continuous vs. Discrete Density
`
`2.2. Training
`
`3. Design Criteria of Speech Hardware Architectures
`
`4. Survey of Some Notable Systems
`
`4.1. AT&T's Graph Search Machine
`
`4.2. SRI-Berkeley's Speech Search Machine
`
`4.3. AT&T's ASPEN Tree-Machine
`
`4.4. MARS
`
`4.5. CMU's BEAM accelerator
`
`4.6. CMU's PLUS
`
`1
`
`1
`
`2
`
`2
`
`5
`
`8
`
`11
`
`12
`
`14
`
`14
`
`15
`
`17
`
`19
`
`20
`
`21
`
`um \ vM :>liy IJuU>Ht,5
`
`Amazon / Zentian Limited
`Exhibit 1006
`Page 5
`
`
`
`1.
`
`Introduction
`
Communicating with machines through natural spoken language has always been a dream of human
beings. To achieve this goal, we need not only an accurate, but also a real-time speech
recognition system. Speech recognition has been an active research area for quite some time, and
a good deal of progress has been made in the last couple of decades. In the last three to five
years, the Hidden Markov Model (HMM) [26, 13, 23, 34, 43, 48] has proven to be a popular and
effective technique for the implementation of speech recognition systems. However, the ultimate
goal of highly accurate large-vocabulary continuous speech recognition with real-time response is
hard to achieve without special hardware support because of the large computational and memory
requirements of the HMM approach.
`
`In order to speed up the speech recognition process, many special purpose hardware designs have
`emerged for speech recognition systems in the last couple of years. They range from fully custom
`designed machines, to systolic machines, to pipeline machines, to shared memory multiprocessor
`machines, to message passing machines, to distributed multiprocessor machines. Of course, every
architecture has its own pros and cons. Some architectures are much faster, while others are easier
to program, easier to expand, cheaper to build, or more flexible across different kinds of applications.
`This paper attempts to describe those features and provide a survey of some notable systems
`developed recently. We would like this document to serve as a future reference for people who
`may contemplate building new speech architectures or for those who would like to choose among
`the architecture alternatives that are available.
`
`The paper is organized as follows. Section 2 characterizes the general computation and memory
requirements for a variety of speech-related algorithms, such as recognition and training. Section
`3 addresses the general design criteria of hardware architectures for speech recognition systems.
`The paper-and-pencil evaluation and comparison of a variety of notable architectural alternatives
`are presented in Section 4.
`
`2. General System Requirements for Speech Recognition Systems
`
Although this paper focuses on HMM-based systems, it doesn't necessarily exclude other
approaches, like knowledge-based systems [52] or neural network based systems [51]. Nevertheless,
since the HMM has been the predominant approach for continuous speech recognition, almost all
of the hardware architecture designs for speech recognition are aimed at HMM-based systems.
In fact, some architectures mentioned in this paper could be extended to non-HMM systems with
small modifications.
`
Looking across the whole speech recognition system, front-end signal processing
[33, 29], which transforms the input signal into a form suitable for analysis, is of little concern
from the computational point of view. Although the auditory models [14, 32, 31, 1] suggested by some
speech researchers require large amounts of computation, their superiority in continuous speech
recognition has not been shown. Commercial digital signal processors can handily generate the two
most widely used speech signal representations, FFT and LPC [33], in real time. The search part,
on the other hand, requires large amounts of computation and therefore becomes the most challenging
part of the architecture design if real-time performance is to be achieved in a practical speech
recognition system. Moreover, although the training process, whose responsibility is to build
accurate models for recognition, doesn't require real-time performance in general, it is still
important to speed up the training process so as to reduce the time required for model development.
`
In the following, we will describe and characterize the general computation and memory
requirements for a variety of speech-related algorithms, including Viterbi search, stack decoding
and training. Since parallel (including pipelined) architectures are most frequently used to speed
up intense computation, we will pay particular attention to parallel implementations of these algorithms.
`
`2.1.
`
`Search
`
`2.1.1. Viterbi Search
`
The Viterbi search [50] is the most popular search algorithm for continuous speech recognition
because of its efficiency. It has been successfully applied in many speech recognition systems
[27, 35, 13, 48, 34, 42]. The Viterbi search is a time-synchronous search algorithm that completely
processes all input up to time t before going on to time t + 1. For time t, each state is updated by
the best score from states at time t - 1. From this, the most probable state sequence can be found
by backtracking at the end of the search.
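
To make the update concrete, the following C sketch performs one frame of the dynamic-programming
recurrence in the log domain, where multiplications of probabilities become additions. The state
layout and array names are illustrative assumptions, not a description of any particular system.

    #include <float.h>

    #define NSTATES 40000        /* e.g., 1,000 words x 40 states       */
    #define MAXPRED 2            /* on average two predecessors per state */

    static float score[2][NSTATES];       /* log scores at times t-1, t */
    static int   pred[NSTATES][MAXPRED];  /* predecessor state ids      */
    static float trans[NSTATES][MAXPRED]; /* log transition probs       */
    static int   back[NSTATES];           /* backpointers (a real decoder
                                             keeps one array per frame
                                             for the final traceback)   */

    /* Update all states for one 10-ms frame; obs[s] is the log output
     * probability of the current frame in state s. */
    void viterbi_frame(int t, const float *obs)
    {
        const float *prev = score[(t - 1) & 1];
        float *cur = score[t & 1];
        for (int s = 0; s < NSTATES; s++) {
            float best = -FLT_MAX;
            int from = -1;
            for (int p = 0; p < MAXPRED; p++) {
                float v = prev[pred[s][p]] + trans[s][p];
                if (v > best) { best = v; from = pred[s][p]; }
            }
            cur[s] = best + obs[s];   /* add, not multiply: log domain */
            back[s] = from;
        }
    }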
`
A full Viterbi search is quite efficient for moderate tasks [7]. However, for large tasks, it can
be very time consuming. A straightforward way to prune the search space is the beam search
[30, 49, 36]. Instead of retaining all candidates at every time frame, a threshold T is used to
retain only a group of likely candidates. The use of the beam search removes the need to
enumerate all of the states and can lead to substantial savings in computation with little or no
loss of accuracy. For example, in the implementation of SPHINX [25], it was found possible to
prune 80%-90% of the hypotheses without any loss in accuracy.
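
As a sketch, the pruning step at the end of each frame might look as follows, with the threshold T
expressed as a fixed log-probability beam below the best score of the frame (an assumed but common
formulation):

    #include <float.h>

    /* Keep states whose log score falls no more than 'beam' below the
     * best score of this frame; returns the number of survivors.  The
     * active[] list representation is an illustrative assumption. */
    int beam_prune(const float *cur, int nstates, float beam, int *active)
    {
        float best = -FLT_MAX;
        for (int s = 0; s < nstates; s++)
            if (cur[s] > best) best = cur[s];

        float thresh = best - beam;    /* the threshold T */
        int nactive = 0;
        for (int s = 0; s < nstates; s++)
            if (cur[s] >= thresh)
                active[nactive++] = s; /* only these are expanded at t+1 */
        return nactive;
    }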
`
There are several advantages to using the Viterbi search algorithm. First, if we use a logarithmic
representation of the HMM probabilities, Viterbi search is extremely efficient because multiplications
of probabilities become additions. Second, because of its time-synchronous nature, the Viterbi
algorithm is very easy to modify into a beam search, and it is also well suited to parallel
implementation if necessary.
`
`Although the actual implementations of the Viterbi search algorithm will vary, it can roughly
`be broken down into the following two modules:
`
`• The word expansion module performs Viterbi search and finds the probabilities for words
`in the vocabulary matching the speech signal.
`
`• The grammar expansion module controls the word expansion module by telling it what
`words to process in the next time frame based on grammar constraints.
`
The DARPA resource management task (see [25] for a description of this task), which is a
1,000-word, perplexity-60 task, has been used as a common benchmark task by the speech
community since 1987. As a result, many architecture designs aimed to achieve this task in
real time. We will use the resource management task as our benchmark test to evaluate different
architecture designs in Section 4. In order to do that, we must first analyze the time and space
requirements of Viterbi search on the resource management task. Let's take SPHINX's [27]
implementation as an example, without loss of generality. In the SPHINX system, allophones
are described with seven-state models and words are sequences of (on average) six allophones.
Therefore, each word requires about 40 states. Resource management used a statistical grammar
which specifies the legal word pairs. In a regular Viterbi algorithm, the word expansion module
needs to update all the states at every frame (10 ms), so for the resource management task, this
would require updating about 40,000 states every 10 ms, or about 80,000 transitions, since there
are, on average, two predecessors for each state. About 7 instructions are required for each
transition in the SPHINX system, which used a discrete HMM approach. (For continuous HMM
approaches, the number of instructions required is at least one order of magnitude greater than
that for the discrete HMM approach; see Section 2.1.3 for details.) This is quite a lot of
computation. Fortunately, because of its time-synchronous nature, the word expansion module can
be implemented in parallel efficiently. The other way to reduce the computational load is to use
the beam search described earlier. For example, in the SPHINX implementation, a beam search could
reduce the number of states that need to be updated at every frame to about 4,000 (8,000 transitions)
instead of 40,000 (80,000 transitions). On the other hand, the computational requirements
of the grammar expansion module depend on the complexity of the grammar. For example, the
resource management task with a word-pair grammar of perplexity 60 will need to process 1,000
* 60 grammar transitions per frame for non-pruned Viterbi search. Of course, the beam search can
again be used here to reduce the computational requirements. If the Viterbi search algorithm (with or
without beam pruning) is implemented on a single-processor general-purpose machine, the word
expansion module generally consumes about 70% to 80% of the total search time (and up to 95% if
using continuous densities), while the grammar expansion module consumes about 20% to 30%.
Therefore, most system designs are aimed at speeding up the word expansion module.
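
Putting the figures above together gives a back-of-the-envelope estimate of the instruction rate a
real-time word expansion module must sustain (derived from the numbers quoted above, not a measured
result, and excluding grammar expansion and observation-probability computation):

    full search:  80,000 transitions/frame x 7 instructions/transition
                  = 560,000 instructions per 10-ms frame, i.e. about 56 MIPS
    beam search:   8,000 transitions/frame x 7 instructions/transition
                  =  56,000 instructions per 10-ms frame, i.e. about 5.6 MIPS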
`
While the speed of commercial general-purpose processors continues to improve every year
(at a rate of two to three times per year), the speed of memory access does not increase at the
same rate [20, 39]. Unfortunately, the word expansion and grammar expansion modules require
access to a large amount of memory at high bandwidth, and therefore Viterbi search is
mainly a memory-bound task. Because of the frame-based nature of the Viterbi search, it exhibits
a very poor cache hit rate. Memory bandwidth is therefore probably the major bottleneck in
Viterbi search. For example, the DARPA resource management task requires an average memory
bandwidth of 40 Mbytes per second for SPHINX [9] if it is executed in real time. This figure could
grow even larger because most systems tend to use more sophisticated models in order to achieve
better performance. Most commercial bus-based shared-memory multiprocessor machines are
not a good solution for Viterbi search because they work well only with algorithms that have a
high cache hit rate, and serious memory contention will slow down the processors when more than
a few processors (more than 5) run in parallel.
`
The other potential problem for a parallel implementation of Viterbi search is the substantial
amount of synchronization required, because the Viterbi algorithm has very fine granularity. For
example, the DARPA resource management task allows only about 2 μs between synchronizations even
with the beam search. Although beam search can substantially reduce computation, it also introduces
a load-balancing problem, because it retains an unpredictable pattern of the most promising states
and therefore requires more complicated synchronization among processors.
`
`2.1.2. Stack Decoding
`
Stack decoding [3] is a best-first search algorithm for continuous speech recognition. It is derived
from the A* algorithm [37] used in artificial intelligence. Stack decoding is not time-synchronous,
but extends paths of different lengths. Stack decoding constructs a search tree from a state graph S
that is specified by the language model. The states in the graph represent the states in the language,
and the branches in the search tree are words that cause transitions from one language state to
another. The search begins by adding the initial states of the language, with empty hypotheses, to the
OPEN list. Then, each time the best hypothesis is popped from the OPEN list, all paths from it
are extended, evaluated, and placed back in the OPEN list. This search continues until a complete
path that is guaranteed to be no worse than all paths in the OPEN list has been found. In selecting
the best hypothesis, we need an evaluation function that estimates the score of the complete path
as the sum of the actual score of the partial path and the expected score of the remaining path. If
the expected score of the remaining path is always an underestimate of the actual score, then the
solution found will be optimal. This is called the admissibility (optimality) property of the
A* search [38].
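
In outline, the stack search loop looks like the following C sketch. Here scores are treated as
costs (e.g., negative log probabilities), so the best hypothesis is the one with the smallest
f = g + h; the data types and helper routines are illustrative assumptions, not part of any
published decoder.

    typedef struct Hyp {
        float g;   /* actual cost of the partial path              */
        float h;   /* estimated cost of the remaining path         */
        /* ... word history, language state, ending time ...       */
    } Hyp;

    /* Assumed helpers: a priority queue ordered by f = g + h
     * (smallest first), a completeness test, and a one-word
     * extension along the grammar transitions. */
    extern Hyp *pop_best(void);
    extern void push(Hyp *hyp);
    extern int  is_complete(const Hyp *hyp);
    extern int  extend(const Hyp *hyp, Hyp *out, int max);

    Hyp *stack_decode(void)
    {
        Hyp buf[64];
        for (;;) {
            Hyp *best = pop_best();
            if (is_complete(best))
                return best;   /* no worse than anything left in OPEN,
                                  provided h never overestimates the
                                  true remaining cost (admissibility) */
            int n = extend(best, buf, 64);
            for (int i = 0; i < n; i++)
                push(&buf[i]); /* merge with existing nodes if needed */
        }
    }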
`
Viterbi search is a graph search, and paths cannot be summed because they may have different
word histories. Stack decoding is a tree search, so each node in OPEN has a unique history, and
paths can be summed within word hypotheses when a word is extended. Therefore, with stack
decoding, it is possible to find the optimal word sequence with multiple beginning and ending
times, which is more consistent with the training algorithm [8] than the optimal state sequence.
Moreover, Viterbi search computes over the entire search space at all times (although beam search
can reduce the search space), while stack decoding only computes with the best hypothesis in
each expansion. Since the memory requirement of dynamic expansion is much less than that of
Viterbi search, stack decoding is likely to be able to deal with very large vocabularies and more
complicated grammars.
`
While the aforementioned properties of stack decoding are very attractive, many implementation
problems arise. First, it is very hard to compare the probabilities of partial paths of different
lengths. Direct comparison of probabilities will cause the search to always favor shorter paths,
resulting in a full breadth-first search. Moreover, since the acoustic probabilities being compared
in Viterbi search are always based on the same partial path, which is not the case in stack decoding,
stack decoding needs to perform some normalization on the probabilities. Unfortunately, there is no
easy way to find such a normalization. Second, an evaluation function that guarantees the
admissibility of the algorithm is difficult to find for speech recognition because of uncertainty
about the remaining part of the speech. A function that is guaranteed to underestimate will result
in a very large OPEN list, so a heuristic function that may overestimate has to be used to evaluate
the remaining paths. Of course, this invalidates the admissibility of the algorithm. Even with an
overestimating evaluation function, the OPEN list will still be too large. Therefore, two other
types of pruning are used to reduce the search burden:
`
1. A fast-match phase that extends only a small fraction of the most promising paths from a partial
path in order to get a list of the most likely following words for detailed matching [4].

2. The use of a stack (similar to the beam in Viterbi search) that saves only a fixed number of
hypotheses in the OPEN list.
`
Finally, handling multiple beginning and ending times of words is very tricky because it is
too costly to try all possible beginning and ending times. Some heuristic must be used to determine
the boundary regions.
`
Because of the problems described above, stack decoding is considerably more difficult to
implement. Thus, very limited effort has been invested in stack decoding for continuous speech
recognition systems [6, 41]. There are even fewer hardware architectures aimed at stack decoding.
Although all the speech architectures mentioned in this paper are aimed at Viterbi search, we would
also like to analyze the possibility of supporting stack decoding. To do this, let's now try to
characterize the potential system requirements of stack decoding.
`
Unlike Viterbi search, a parallel implementation of stack decoding is not so straightforward, since
it is not time-synchronous. One obvious parallel implementation pops the N best hypotheses from the
stack and sends each to a different processor. Once each processor completes one level of expansion
(from one language state to another), the new hypotheses are inserted back into the stack or merged
with existing states if necessary. Since stack decoding is a best-first search, some early work done
by expanding non-best hypotheses in parallel may later prove to have been unnecessary, so exploring
the less promising hypotheses may be a waste of time. Based on simulation results [40], the best
complete path usually comes from the 10 best hypotheses. From this point of view, this kind of
decomposition does not seem suitable for massively parallel implementations. A second type of
parallel decomposition can be applied when performing the search on the best hypothesis. In general,
there are many language transitions out of a language state, depending on the perplexity of the
grammar. For example, there are on average 60 grammar transitions per language state for a grammar
of perplexity 60. It is natural to have a master processor pop the best hypothesis and hand out the
different word evaluations to different slave processors according to the grammar transitions. After
the slave processors finish word evaluation, the new hypotheses, along with their score information,
are sent back to the master processor, which finishes the bookkeeping and merging and finally
inserts the new hypotheses back into the stack. This decomposition allows not only massive
parallelism but also medium granularity, because there are no time-synchronous constraints. On the
other hand, in stack decoding hypotheses are extended one word at a time instead of one frame at a
time, and different words tend to have different lengths, so processors are not likely to finish at
the same time when running word evaluations in parallel. Therefore, dynamic load balancing is
crucial to implementing stack decoding in parallel efficiently. Moreover, interprocessor
communication is more complicated than in Viterbi search because it happens almost everywhere and is
no longer time-synchronous. Fortunately, memory bandwidth seems to be a less serious problem for
stack decoding. Since stack decoders do word-based evaluation instead of frame-based evaluation, it
is not necessary to deal with the whole search space that Viterbi search is always likely to access.
Processors would have relatively good temporal and spatial locality, and therefore stack decoding
would achieve a much higher cache hit rate than Viterbi search.
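
A schematic of this master/slave decomposition is sketched below in C; the queue primitives and
routine names are assumptions for illustration only, not a description of any surveyed machine.

    typedef struct Hyp Hyp;                    /* partial-path hypothesis */
    typedef struct { Hyp *parent; int word; } Task;

    /* Assumed primitives: thread-safe work and result queues, plus the
     * OPEN-list operations from the discussion above. */
    extern void give_work(Task t);      extern Task take_work(void);
    extern void give_result(Hyp *h);    extern Hyp *take_result(void);
    extern Hyp *pop_best(void);         extern void push_merged(Hyp *h);
    extern int  grammar_successors(const Hyp *h, int *words, int max);
    extern Hyp *evaluate_word(const Hyp *parent, int word);

    #define MAXSUCC 256      /* e.g., ~60 successors at perplexity 60 */

    void master(void)
    {
        int words[MAXSUCC];
        for (;;) {
            Hyp *best = pop_best();
            int nw = grammar_successors(best, words, MAXSUCC);
            for (int i = 0; i < nw; i++)
                give_work((Task){ best, words[i] });  /* fan out       */
            for (int i = 0; i < nw; i++)
                push_merged(take_result());  /* bookkeeping, merging,
                                                re-insertion into OPEN */
        }
    }

    void slave(void)     /* one instance per slave processor */
    {
        for (;;) {
            Task t = take_work();
            give_result(evaluate_word(t.parent, t.word));
            /* words differ in length, so slaves finish at different
               times: this is why dynamic load balancing is crucial */
        }
    }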
`
The other related issue in stack decoding is the use of fast match. Since stack decoding cannot
prune an unpromising word hypothesis until its whole word evaluation is complete, it becomes
increasingly important, as the vocabulary grows, to find a short list of promising candidate words
quickly and cheaply (fast match) before performing slow and expensive detailed matches. Because it
must be quick and cheap, a parallel implementation is ideal for fast match. Like a parallel
implementation of word evaluation, this parallelism tends to have medium granularity and a high
cache hit rate. In addition, we could run the detailed acoustic matching for the most promising
hypothesis in parallel with the fast match that finds the short list of words for the second most
promising hypothesis, and so on. However, the cost of complex synchronization is the inevitable
price to pay for this kind of hybrid parallelism.
`
`7
`
`Amazon / Zentian Limited
`Exhibit 1006
`Page 12
`
`
`
`2.1.3. Continuous vs. Discrete Density
`
The other important feature affecting the design of speech architectures is the type of
density function embedded in the HMM representation. One is the discrete density function [46],
which models the output probability for each output symbol explicitly. With discrete-density HMMs,
each frame of speech must be represented by a symbol from a finite alphabet through
vector quantization (VQ) [29]. VQ is a data compression technique that first identifies a set of
prototype vectors from training data. Speech input is then converted from a multi-dimensional
real feature vector of, say, FFT or LPC coefficients to a symbol that represents the best-matching
prototype vector. The other is the continuous density function. By pre-assuming certain properties
of the output distributions, the output probability of observed speech can be obtained from
the distribution formulas. The most frequently used continuous density is the multivariate Gaussian
density [46]. With the multivariate Gaussian density, the output probability density function (pdf)
is described by a mean vector and a covariance matrix. To reduce computation, the covariance
matrix is sometimes assumed to be diagonal (all the off-diagonal terms are zero). Moreover,
mixture densities [47] are often used to improve the modeling because a single-mixture diagonal
Gaussian density function does not adequately represent certain speech parameters [47].
`
The use of continuous HMMs requires heavy computation. With discrete HMMs, computing
the output probability of an observed speech frame during the search is merely vector quantization
and table lookup. With continuous HMMs, on the other hand, more memory accesses
and many more instructions are required for every transition, even with the simplest single-mixture
diagonal Gaussian density function. For example, every single-mixture diagonal Gaussian density
function in the AT&T system requires more than 72 multiplications and 36 additions. The use of
multiple mixtures or full covariances would further increase the computational requirements
of speech recognition with continuous HMMs.
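
The figure of 72 multiplications and 36 additions is consistent with, for example, a 36-dimensional
feature vector and a diagonal covariance kept as precomputed inverse variances; the dimensionality
is our assumption here, since the text gives only the totals:

    per dimension:  (cep[k] - mu[k])^2 * invvar[k]  ->  2 multiplications,
                    plus 1 accumulating addition
    36 dimensions:  36 x 2 = 72 multiplications and 36 additions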
`
In order to compare the computational requirements of discrete and continuous densities,
pseudo code for computing observation probabilities from discrete and from continuous densities
is given below. The pseudo code for discrete densities is based on the implementation
in SPHINX, and that for continuous densities is based on the semi-continuous SPHINX system
[22], which resembles a 4-mixture multivariate diagonal Gaussian density function. Logarithmic
probabilities are used in both cases. Suppose there are 3 different codebooks and the size of
each one is 256. As we can see in the first part (quantization), continuous densities need about
twice as many multiplications and additions as the discrete ones. However, in the second part
(output probability generation), while discrete densities need only 3 additions for each distribution,
continuous densities need 3 * 4 * 2 more additions, 3 * 4 more exponentiations and 3 more
logarithms, which are very expensive, for each distribution. Note that the computation for
discrete densities is fixed as long as the number of codebooks and the sizes of the codebooks remain
unchanged. On the other hand, the computation for continuous densities can grow through the
use of more mixtures or more Gaussian densities even when the number of codebooks and the
sizes of the codebooks remain unchanged. For example, the AT&T system [24], which uses more
mixtures (more than 60) and more Gaussian densities (more than 9,000), could potentially require
much more computation.
`
Discrete Density:

    for each frame {
        for each codebook K_i {
            min = MAXDIST;             /* smallest distance so far      */
            B_i = 0;                   /* index of best-matching code   */
            for each code C_j in K_i {
                dist = 0;
                for each dimension k in C_j {
                    dist = dist + (cep[k] - C_j[k])^2;
                    /* cep[] is the cepstrum vector and C_j[] is the
                       centroid vector for C_j */
                }
                if (dist < min) {
                    B_i = j;
                    min = dist;
                }
            }
        }

        for each distribution i {
            Prob_i = 0;
            for each codebook j {
                Prob_i = Prob_i + Out_ij[B_j];
                /* Out_ij[] is the (log) output probability table for
                   distribution i, codebook j */
            }
        }
    }
`
`9
`
`Amazon / Zentian Limited
`Exhibit 1006
`Page 14
`
`
`
Continuous Density:

    for each frame {
        for each codebook K_i {
            for each code C_j in K_i {
                Gauss_ij.code = j;
                Gauss_ij.prob = 0;
                for each dimension k in C_j {
                    Gauss_ij.prob = Gauss_ij.prob - (cep[k] - C_j[k])^2 * Var_j[k];
                    /* cep[] is the cepstrum vector, C_j[] is the mean
                       vector, and Var_j[] holds the (inverse) covariance
                       terms of C_j */
                }
                Gauss_ij.prob = Gauss_ij.prob + Det_ij;
                /* Det_ij is the determinant term of Gaussian pdf Gauss_ij */
            }
        }

        for each codebook K_i
            SORT Gauss[i, 1..size_of_codebook_K_i] according to field prob
            /* best-scoring Gaussians first */

        for each distribution i {
            Prob_i = 0;
            for each codebook j {
                temp = 1;
                for each mixture k {
                    temp = temp + exp(Out_ij[Gauss_jk.code] + Gauss_jk.prob);
                    /* Out_ij[] is the mixture weight vector for
                       distribution i, codebook j */
                }
                Prob_i = Prob_i + log(temp);
            }
        }
    }
`
In spite of the heavy computational requirements of continuous HMMs, they do not necessarily
present as tough a problem as we might think at first glance. First of all, thanks to
advanced VLSI technology, multiplication can be executed as fast as addition; exponentiation
and logarithms can also be executed efficiently by math co-processors. Second, the
computation of observation probabilities can be logically separated from the search part, so the
two can be executed in parallel, with the search part accessing only the observation probability
result buffer filled by the observation probability generator. This functional separation can also
be used to speed up large-vocabulary Viterbi search on a single-processor machine. For example,
while there are about 2,000 distributions in the SPHINX system, it needs to update about 8,000
transitions at every frame (see Section 2.1.1). Therefore, without functional separation, it would
need to compute 8,000 distributions instead of 2,000, and many of them would actually be re-computed
several times. The recent SPHINX system [22] used a semi-continuous density function which
resembles a 4-mixture multivariate Gaussian density function and resulted in recognition only 3-5
times slower than the discrete SPHINX, whereas without the functional separation it is about 3 times
slower still. Moreover, the observation probability generator is a sequence of vector-style
operations which can be implemented in parallel with almost no synchronization or load balancing.
The vector-style observation probability generator could even be accelerated by vector processing
machines like the Cray and the Alliant.
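
The functional separation can be pictured as a producer/consumer pair sharing a per-frame result
buffer. A minimal C sketch follows, with assumed names; the text does not prescribe this interface.

    #define NDIST 2000   /* roughly the number of distributions in SPHINX */

    extern float output_log_prob(int d, const float *frame);
    /* assumed: evaluates distribution d on the current frame (VQ plus
       table lookup for discrete HMMs, mixture evaluation for continuous) */

    /* Double-buffered observation probabilities: the generator fills
     * obs[t & 1] for frame t while the search reads obs[(t - 1) & 1],
     * so each distribution is computed exactly once per frame and then
     * merely looked up by every transition that references it. */
    static float obs[2][NDIST];

    void generator(int t, const float *frame)
    {
        float *out = obs[t & 1];
        for (int d = 0; d < NDIST; d++)   /* vector-style loop: no
                                             synchronization, trivially
                                             parallel or vectorizable  */
            out[d] = output_log_prob(d, frame);
    }

    /* The search for frame t reads obs[(t - 1) & 1][d] for each of its
     * ~8,000 active transitions instead of re-computing the density.  */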
`
`2.2. Training
`
In comparison with recognition, training is a much less serious problem because the paths to
search in training are always known. A good implementation on a commercial single-processor
workstation (a Sun, DECstation, or NeXT) could easily perform the job in real time. However,
the training database normally consists of thousands of utterances constituting hours of
speech. For example, the DARPA speaker-independent resource management training database [44]
consists of about 4,000 utterances, and CMU's general English training database consists of more
than 20,000 utterances [21]. The training may also consist of several iterations and several
phases, such as monophone training and triphone training. In addition, discriminative training
schemes (such as Maximum Mutual Information training and corrective training) [11, 5] can add many
more passes to the training process. Therefore, it would be desirable for a speech hardware
architecture to accelerate the training process as well.
`
As mentioned before, training on one utterance is basically a one-path parametric re-estimation
and thus sequential, so it cannot benefit from a parallel implementation. However,
since the training database contains many utterances, it is possible to train on different utterances
in parallel. This decomposition is very easy and requires only a small amount of synchronization
(the granularity is relatively large: only one synchronization every 1-2 seconds of speech).
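
A minimal sketch of this utterance-level decomposition follows; the accumulator interface and
routine names are assumptions for illustration, not any particular trainer's API.

    #define P 8                        /* number of training processors */

    typedef struct Acc Acc;            /* re-estimation count accumulators */
    extern Acc  acc[P];                /* one private accumulator per worker */
    extern void forward_backward(int utt, Acc *a); /* one-path re-estimation
                                                      counts for one utterance */
    extern void merge_accumulators(Acc *dst, const Acc *src);
    extern void reestimate_models(const Acc *a);

    void worker(int id, int n_utts)    /* run one instance per processor */
    {
        for (int u = id; u < n_utts; u += P)  /* static interleaving    */
            forward_backward(u, &acc[id]);    /* ~1-2 s of speech each  */
    }

    /* After all workers finish (the only global synchronization point): */
    void finish_iteration(void)
    {
        for (int p = 1; p < P; p++)
            merge_accumulators(&acc[0], &acc[p]);
        reestimate_models(&acc[0]);
    }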
`
Although the memory bandwidth requirement for training is much less than that for recognition,
the total memory needed for training can be much larger because training normally needs several
copies of the HMMs. Since training on different utterances requires accessing different
models, visibility of all the models to each training processor is essential for speed-up. If
primary memory cannot hold all the space needed for training, disk swapping would be fatal
to the training process. Fortunately, swapping between disk and primary memory, and between
primary memory and cache, happens only when completing the training on the current utterance
and beginning to train on the next one. We could arrange the training database so that the
differences between consecutive utterances are minimized, thereby requiring fewer page-ins and
page-outs. The other possibility is to create a process that pre-fetches the required data for the
next utterance while training on the current utterance [2]. Even if the pre-fetch requires page-ins
or page-outs, the fetch operation will not affect the training on the current utterance, since the
CPU operations and I/O operations are interleaved.
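
The pre-fetch idea amounts to double buffering the training data; a sketch under assumed
asynchronous I/O helpers:

    /* While the CPU re-estimates on the current utterance, an asynchronous
     * read brings in the next one, hiding any page-in cost behind the
     * computation.  The I/O helpers below are assumed, not prescribed. */
    typedef struct Utt Utt;
    extern Utt *load_utterance(int u);            /* blocking read          */
    extern void start_async_load(int u, Utt **p); /* non-blocking read      */
    extern void wait_async_load(Utt **p);         /* join the pending read  */
    extern void train_on(Utt *u);                 /* one-path re-estimation */

    void train_all(int n_utts)
    {
        Utt *cur = load_utterance(0), *nxt = NULL;
        for (int u = 0; u < n_utts; u++) {
            if (u + 1 < n_utts)
                start_async_load(u + 1, &nxt);    /* overlap I/O with CPU */
            train_on(cur);
            if (u + 1 < n_utts) {
                wait_async_load(&nxt);
                cur = nxt;
            }
        }
    }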
`
`3. Design Criteria of Speech Hardware Architectures
`
`In this section, we would like to give some of the desired criteria for speech hardware architecture
`designs. A good arc