A Gaussian Probability Accelerator for SPHINX 3

Binu K. Mathew, Al Davis, Zhen Fang

UUCS-03-02

School of Computing
University of Utah
Salt Lake City, UT 84112 USA

November 18, 2002
Revised on July 22, 2003

Abstract

Accurate real-time speech recognition is not currently possible in the mobile embedded space where the need for natural voice interfaces is clearly important. The continuous nature of speech recognition coupled with an inherently large working set creates significant cache interference with other processes. Hence real-time recognition is problematic even on high-performance general-purpose platforms. This paper provides a detailed analysis of CMU's latest speech recognizer (Sphinx 3.2), identifies three distinct processing phases, and quantifies the architectural requirements for each phase. Several optimizations are then described which expose parallelism and drastically reduce the bandwidth and power requirements for real-time recognition. A special-purpose accelerator for the dominant Gaussian probability phase is developed for a 0.25μ CMOS process which is then analyzed and compared with Sphinx's measured energy and performance on a 0.13μ 2.4 GHz Pentium 4 system. The results show an improvement in power consumption by a factor of 29 at equivalent processing throughput. However, after normalizing for process, the special-purpose approach has twice the throughput and consumes 104 times less energy than the general-purpose processor. The energy-delay product is a better comparison metric due to the inherent design trade-offs between energy consumption and performance. The energy-delay product of the special-purpose approach is 196 times better than that of the Pentium 4. These results provide strong evidence that real-time large vocabulary speech recognition can be done within a power budget commensurate with embedded processing using today's technology.

1 Introduction

For ubiquitous computing to become both useful and real, the computing embedded in all aspects of our environment must be accessible via natural human interfaces. Future embedded environments need to at least support interfaces such as speech (this paper's focus), feature, and gesture recognition. A viable speech recognizer needs to be speaker independent, accurate, cover a large vocabulary, handle continuous speech, and have implementations amenable to mobile as well as tethered computing platforms. Current systems fall short of these goals primarily in accuracy, real-time performance, and power requirements. This work addresses the latter two problems. Modern approaches to large vocabulary continuous speech recognition are surprisingly similar in terms of their high-level structure [16]. Our work is based on CMU's Sphinx3 [7, 10] system. Sphinx3 uses a continuous model that is much more accurate than the previous semi-continuous Sphinx 2 system but requires significantly more compute power.

`
`Sphinx3 runs at 1.8x slower than real time on a 1.7 GHz AMD Athlon. Performance is
`hardly the problem since Moore’s Law improvement rates means all that is needed is a
`little time. A much more important problem is that the real time main memory bandwidth
`requirement of Sphinx3 is 800 MB/sec. Our 400 MHz StrongARM development system
`has a peak bandwidth capability of only 64 MB/sec and this bandwidth costs 0.47 watts of
`power. A reasonable approximation is that power varies with main memory bandwidth for
`Sphinx3 indicating that Sphinx3 is at least an order of magnitude too slow and consumes an
`order of magnitude too much power for embedded applications. This provides significant
`motivation to investigate an alternate approach.
`
`A simplistic view of Sphinx3’s high-level structure consists of 3 phases: front-end signal
`processing which transforms raw signal data into feature vectors; acoustic modeling which
`converts feature vectors into a series of phonemes; and a language model based search that
`transforms phoneme sequences into sequences of words. The process inherently considers
`multiple probable candidate phoneme and word sequences simultaneously. The final choice
`is made based on both phoneme and word context. We focus on the dominant processing
`component of the acoustic and search phases in this paper which we call GAU and HMM
`respectively.
`
`The organization of Sphinx3 is shown in Figure1. Rectangles represent algorithmic phases
`and rounded boxes represent databases. The numbers in parenthesis are the approximate
`on-disk size of the databases before they are loaded into memory and possibly expanded.
`The 3 phases are front end signal processing, Gaussian probability estimation, and Hidden
`Markov Model evaluation, subsequently referred to as FE, GAU, and HMM.
Figure 1: Anatomy of a Speech Recognizer

A more accurate and detailed view is that GAU precomputes Gaussian probabilities for sub-phonetic HMM states (senones). The output of the GAU phase is used during acoustic model evaluation and represents the probability of observing a feature vector in an HMM state. The set of feature vectors that can be observed from a state is assumed to follow a Gaussian distribution. The Gaussian probability is computed as the weighted sum of the Mahalanobis distances of the feature from a set of references used while training the recognizer. The Mahalanobis distance is a statistically significant distance-squared metric between two vectors. Given a feature vector Feat and a pair of vectors (M, V) (hereafter called a component) which represent the mean and variance from a reference, GAU spends most of its time computing the quantity:

    d = Σ_{c=1}^{8} ( FinalWeight_c + FinalScale_c · Σ_{i=1}^{39} (Feat[i] − M_c[i])² · V_c[i] )

GAU is a component of several recognizers including Sphinx3, Cambridge University's HTK, ISIP and Sirocco, to name a few [6, 7, 14, 13, 4]. For Sphinx3, the length of all three vectors is 39 and each element is an IEEE 754 32-bit floating point number. The Gaussian table contains 49,152 components.

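In C-like form, this computation is essentially the following (a minimal scalar sketch with illustrative names; the cache-optimized version actually used appears later in Figure 8):

    /* Sketch of one senone score; identifiers are illustrative. */
    float gau_score(const float Feat[39],
                    const float M[8][39], const float V[8][39],
                    const float FinalScale[8], const float FinalWeight[8])
    {
        float d = 0.0f;
        for (int c = 0; c < 8; c++) {        /* 8 components per senone */
            float sum = 0.0f;
            for (int i = 0; i < 39; i++) {   /* 39-element vectors */
                float t = Feat[i] - M[c][i];
                sum += t * t * V[c][i];      /* distance-squared term */
            }
            d += sum * FinalScale[c] + FinalWeight[c];
        }
        return d;
    }
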
Sphinx uses feedback from the HMM phase to minimize the number of components GAU needs to evaluate. In the worst case, every single component needs to be evaluated for every single frame. A real-time recognizer should have the ability to perform 4.9 million component evaluations per second. In practice, the feedback heuristic manages to reduce this number to well under 50%. The Viterbi search algorithm for HMMs is multiplication intensive, but Sphinx as well as many other speech recognizers convert it to an integer addition problem by using fixed point arithmetic in a logarithmic domain. FE and GAU are the only floating point intensive components of Sphinx.

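The log-domain trick deserves a brief illustration: probabilities are stored as scaled integer logarithms, so a probability product becomes an integer addition. A minimal sketch, using an illustrative scaling constant (not the value Sphinx uses):

    #include <math.h>
    #include <stdint.h>

    #define SCALE 1024.0  /* assumed scaling factor, for illustration only */

    static int32_t to_log(double p)    { return (int32_t)(log(p) * SCALE); }
    static double  from_log(int32_t l) { return exp(l / SCALE); }

    /* p1 * p2 in the linear domain is just l1 + l2 in the log domain. */
    static int32_t log_mul(int32_t l1, int32_t l2) { return l1 + l2; }
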
The Sphinx3 code spends less than 1% of its time on front-end processing, 57.5% of the time on the Gaussian phase and 41.5% on the HMM phase. While our work has addressed the entire application, the work reported here addresses the optimization and implementation of the dominant Gaussian phase. The contributions include an analysis of the Sphinx3 system, an algorithmic modification which exposes additional parallelism at the cost of increased work, an optimization which drastically reduces bandwidth requirements, and a special-purpose co-processor architecture which improves Sphinx3's performance while simultaneously reducing the energy requirements to the point where real-time, speaker-independent speech recognition is viable on embedded systems in today's technology.

2 Characterization and Optimization of Sphinx 3

To fully characterize the complex behavior of Sphinx, we developed several variants of the original application. In addition to the FE, GAU and HMM phases, Sphinx has a lengthy startup phase and extremely large data structures which could cause high TLB miss rates on embedded platforms with limited TLB reach. To avoid the performance characteristics being aliased by startup cost and TLB miss rate, Sphinx 3.2 was modified to support checkpointing and fast restart. For embedded platforms, the checkpointed data structures may be moved to ROM in a physically mapped segment similar to kseg0 in MIPS processors. Results in this paper are based on this low startup cost version of Sphinx, referred to as original.

Previous studies have not characterized the 3 phases separately [2, 8]. To capture the phase characteristics and to separate optimizations for embedded architectures, we developed a "phased" version of Sphinx 3. In phased, each of the FE, GAU and HMM phases can be run independently with input and output data redirected to intermediate files. In the rest of this paper, FE, GAU and HMM refer to the corresponding phase run in isolation, while phased refers to all three chained sequentially with no feedback. In phased, FE and HMM are identical to original, while GAU work is increased by the lack of dynamic feedback from HMM. Breaking this feedback path exposes parallelism in each phase and allows the phases to be pipelined. GAU OPT refers to a cache optimized version of the GAU phase alone. PAR runs each of the FE, GAU OPT and HMM phases on separate processors. It also uses the same cache optimizations as GAU OPT.

We used both simulation and native profiling tools to analyze Sphinx 3. Simulations provide flexibility and a high degree of observability, while profiled execution on a real platform provides realistic performance measures and serves as a way to validate the accuracy of the simulator. The configurations used to analyze Sphinx 3 are:

Native execution: SGI Onyx3, 32 R12K processors at 400 MHz, 32KB 2-way IL1, 32KB 2-way DL1, 8 MB L2. Software: IRIX64, MIPSpro compiler, Perfex, Speedshop.

Simulator (default configuration): SimpleScalar 3.0, out-of-order CPU model, PISA ISA, 8 KB 2-way IL1 (2 cycle latency), 32 KB 2-way DL1 (4 cycle latency), 2 MB 2-way L2 (20 cycle latency), 228 cycle DRAM latency, L1 line size 64 bytes, L2 line size 128 bytes.

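For reference, this default configuration corresponds roughly to the following SimpleScalar 3.0 sim-outorder cache flags (our reconstruction from the parameters above; set counts follow from size / (line size × associativity), and the second -mem:lat value is an assumption):

    sim-outorder \
        -cache:il1 il1:64:64:2:l    -cache:il1lat 2  \
        -cache:dl1 dl1:256:64:2:l   -cache:dl1lat 4  \
        -cache:dl2 ul2:8192:128:2:l -cache:dl2lat 20 \
        -cache:il2 dl2 \
        -mem:lat 228 2  <benchmark>
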
It appeared likely that a multi-GHz processor might be required to operate Sphinx in real time. Parameters like L1 cache hit time, memory access time, floating point latencies etc. were measured on a 1.7 GHz AMD Athlon processor using the lmbench hardware performance analysis benchmark [9]. Numbers that could not be directly measured were obtained from vendor microarchitecture references. SimpleScalar was configured to reflect these parameters. Unless mentioned otherwise, the remainder of this paper uses the default configuration.

Native profiling indicates that the original Sphinx 3 spends approximately 0.89%, 49.8% and 49.3% of its compute cycles in the FE, GAU and HMM phases respectively. Another recent study found that as much as 70% of another speech recognizer's execution time was spent in Gaussian probability computation [8]. In the phased version we found that approximately 0.74%, 55.5% and 41.3% of the time was spent in FE, GAU and HMM respectively. Since FE is such a small component of the execution time, we ignore it in the rest of this study and concentrate on GAU and HMM.

Figure 2: L1 DCache Miss Rate (miss rate in percent vs. L1 data cache size)

Figure 3: L2 Cache Miss Rate (miss rate in percent vs. L2 cache size)

Figures 2 and 3 show the L1 DCache and L2 cache miss rates for original, phased, FE, HMM and GAU for a variety of configurations. Since earlier studies showed that larger line sizes benefited Sphinx II, 64 byte L1 and 128 byte L2 cache line sizes were chosen [2]. In addition, the L2 cache experiments assume a 32KB L1 DCache. Both figures assume an 8 KB ICache. Since Sphinx has an extremely low instruction cache miss rate of 0.08% for an 8KB ICache, no other ICache experiments were done. The SGI data provide a reality check since they represent results obtained using hardware performance counters. Though the SGI memory system latency is much lower than that of the simulated processors on account of a low processor to memory clock ratio, the L2 results are very similar in character to the 8MB simulation results in spite of the effects of out-of-order execution, memory system latency, and differences in cache replacement policy. The L1 results are not directly comparable since the R12000 uses a 32 byte L1 line size and suffers from cache pollution caused by abundant DTLB misses.

Figure 4: L2 to Memory Bandwidth (MB/s vs. L2 cache size)

Figure 4 shows the average bandwidth required to process the workload in real time. This is obtained by dividing the total L2 to memory traffic while Sphinx operates on a speech file by the duration in seconds of the speech signal. The evidence suggests that bandwidth starvation leading to stalls on L2 misses is the reason this application is not able to meet real time requirements. The memory bandwidth required for this application is several times higher than what is available in practice (not theoretical peak) on most architectures. For example, a 16-fold improvement in L2 size from 256 KB (the L2 size of a 1.7 GHz Athlon) to 8 MB (SGI Onyx) produces only a very small decrease in the bandwidth requirement of GAU. This phase essentially works in stream mode, making 100 sequential passes per second over a 14 MB Gaussian table. The speech signal itself contributes only 16KB/s to the total bandwidth requirements. Some computation saving heuristics in Sphinx also have the beneficial side effect of helping to save bandwidth by not touching blocks that are deemed improbable. Until the L2 size reaches 8 MB, long term reuse of Gaussian table entries in the L2 is infrequent. It should be noted that the bandwidth requirement of GAU in isolation is more severe than if it were operating inside original, since feedback driven heuristics cannot be applied.

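As a rough consistency check, streaming the table at this rate implies

    100 passes/s × 14 MB/pass ≈ 1.4 GB/s

of worst-case Gaussian-table traffic; with the feedback heuristic evaluating well under 50% of the components, the sustained requirement falls into the several-hundred-MB/s range, consistent in magnitude with the 800 MB/sec real-time figure quoted in Section 1.
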
2.1 ILP in Sphinx

Before exploring special-purpose architecture extensions for speech, it is worthwhile to investigate the limits of modern architectures. GAU is a floating point dominant code while HMM is dominated by integer computations. GAU also appears to be easily vectorizable. We performed two simulation studies to explore possibilities for extracting ILP. For GAU, a surplus of integer ALUs was provided and the number of floating point units was varied. Since this algorithm uses an equal number of multiplies and adds, the number of floating point adders and multipliers was increased in equal numbers from 1 to 4, which corresponds to the X axis varying from 2 to 8 FPUs in Figure 5. Two different memory system hierarchies were considered: a reasonable one for a multi-GHz processor and an aggressive memory system with lower latencies. Reasonable configuration: 32KB DL1, 4 cycle latency, 2MB L2, 20 cycle latency, 2 memory ports. Aggressive configuration: 32KB DL1, 2 cycle latency, 8MB L2, 20 cycle latency, 4 memory ports.

The SGI-2+2f entry describes the measured total IPC on the R12000, which has 2 integer and 2 floating point units. The SGI-2 entry is the measured floating point IPC alone. In the case of GAU, IPC remains low because the algorithm cannot obtain sufficient memory bandwidth to keep the FPUs active. In the case of the R12000, which can issue two floating point operations per cycle, the IPC for this loop is an underwhelming 0.37. GAU OPT uncovers opportunities for ILP by virtue of its cache optimizations, thereby improving IPC greatly. However, IPC saturates at 1.2 in spite of available function units. A recently published study also indicated IPC in the range of 0.4 to 1.2 for another speech recognizer [8]. Clearly, the architecture and compiler are unable to automatically extract the available ILP, which again argues for custom acceleration strategies.

Figure 5: GAU and GAU OPT IPC (IPC vs. number of FPUs)

Figure 6 shows the corresponding experiment for the HMM phase. In this experiment, the number of integer adders and multipliers varies in equal numbers from 1 to 4. In spite of available execution resources, IPC remains low. It should be noted that in both experiments, the SGI results are indicative of cases where the CPU to memory clock ratio is low. This ratio will undoubtedly increase in the future.

Figure 6: HMM IPC (IPC vs. number of ALUs)

Figure 7: Measured Speedup on R12K

The observations from Sections 2 and 2.1 have several implications. If speech is an "always on" feature, it could cause significant L2 cache pollution and memory bandwidth degradation for the foreground application. To guarantee real time processing, it might be better to stream data around the L2 rather than pollute it. Since the L2 cache is one of the largest sources of capacitance on the chip, accessing it for stream data also incurs a large power overhead. Low power embedded platforms may not need any L2 cache at all, since dramatic increases in L2 size are not accompanied by corresponding improvements in DRAM bandwidth or performance. Bandwidth reduction is important for its own sake as well as to reduce power consumption. Bandwidth partitioning, so that each phase has independent access to its data set, is important.

2.2 Results of Software Optimizations for Sphinx

Cache Optimizations: In Section 2, GAU was shown to be bandwidth starved. The GAU code in phased was instrumented and found to require approximately twice the amount of computation as in original. However, Figure 7 shows that phased runs at only 0.85 times the speed of original on an R12000. Clearly, a large fraction of the excess computation is hidden by memory latency. With processor to memory speed ratios increasing in the future, an out of order processor can hide an even larger amount of compute overhead. The key is to improve the memory system behavior without an unreasonable increase in compute requirements.

To achieve this goal, two transformations were performed on phased. First, a blocking optimization similar in spirit to loop tiling is performed which delays the initial speech signal by 100ms, or 10 frames. The Gaussian probabilities for all 10 frames are then computed by making a single pass over the Gaussian tables. This effectively reduces the number of passes to 10 per second where original would have done 100. The blocking factor is limited to 10 to avoid a perceptible real-time lag at the decoder output. Second, Sphinx allocates the mean and variance vectors used for the Gaussian computation described in Section 1 separately, and every component evaluation consumes one mean and one variance vector. Since each table of vectors is allocated separately and each is more than 7 MB, corresponding vectors potentially conflict with each other in the cache. To avoid this, corresponding mean and variance vectors are interleaved and padded with an additional 64 bytes to be exactly 3 L2 cache lines long. The interleaving reduces cache conflicts; the padding strategy consumes bandwidth but simplifies DMA transfers for the coprocessor architecture described later. The optimized version appears in Figure 8. Note the interleaving of vectors and a blocking loop that is not present in the equation shown in Section 1. More details of how this affects a hardware implementation will be presented in the next section. The optimized version appears in Figures 2, 3, 4 and 7 as the data point GAU OPT.

    for(senone = 0; senone < N; senone++)              // Loop 0
        for(block = 0; block < 10; block++)            // Loop 1
            for(c = 0; c < 8; c++)                     // Loop 2
            {
                for(i = 0, sum = 0.0; i < 39; i++)     // Loop 3
                {
                    t = X[block][i] -
                        Gautable[senone][c].vector[i].Mean;
                    sum += t * t *
                        Gautable[senone][c].vector[i].Var;
                    sum = max(sum, MINIMUM_VALUE);
                }
                score[senone][block] += sum *
                    Gautable[senone][c].FinalScale +
                    Gautable[senone][c].FinalWeight;
            }

Figure 8: Cache Optimized Gaussian Algorithm

GAU OPT demonstrates the true streaming nature of GAU. Figure 4 shows that GAU OPT uses a factor of 4.7 to 3.9 less bandwidth than GAU in simulation, with a factor of 4.2 improvement obtained on a real machine. This supports our claim that GAU processing can be done without an L2 cache. With a 256 KB L2 cache, the GAU OPT bandwidth is 174 MB/s. We have calculated that with no heuristic and no L2 cache, GAU OPT can meet its real time requirements with 180 MB/s of main memory bandwidth. This has important implications for the scalability of servers that process speech.

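A rough accounting makes this estimate plausible: blocking reduces the table traversal to 10 passes per second, so

    10 passes/s × 14 MB/pass ≈ 140 MB/s

before padding, and the interleaving/padding overhead described above (each 312-byte Mean/Var pair stored in 3 L2 lines, i.e. 384 bytes, roughly 23% extra) brings the requirement into the 174-180 MB/s range.
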
Figures 2 and 3 show dramatic reductions in the cache miss rates in both simulation and native execution. The L2 native execution results are better than the simulation results. The large variation in the L1 results is due to the 32 byte L1 line size on the R12000 and also possibly because of an extremely large number of TLB misses. The software miss handler could easily pollute the L1 cache. The important point is that Figure 7 shows that OPT, a version of phased with our GAU OPT blocking optimizations, achieves a slight speedup over original in spite of performing a larger number of computations. In summary, to be able to extract parallelism, the feedback loop was broken, which approximately doubled the GAU workload. With cache optimizations (which are not possible with feedback), the loss due to the extra GAU workload is recovered and the exposed parallelism is now open for further optimization.

Parallelization: Based on the percentage of execution time, Amdahl's law predicts a factor of 1.97 speedup if GAU and HMM processing could be entirely overlapped. It is clear that a special-purpose architecture for GAU can have significant speedup, as well as power and scaling benefits. We parallelized Sphinx in order to see if there were any practical impediments to achieving good speedup. We developed a parallel version of Sphinx, called PAR, which runs each of the FE, GAU OPT and HMM phases on separate processors. In effect, this models an SMP version of Sphinx 3 as well as the case where each processor could be replaced by a special-purpose accelerator. As shown in Figure 7, the parallel version achieves a speedup of 1.67 over the original sequential version. A custom accelerator will likely be even better. The HMM phase was further multi-threaded to use 4 processors instead of 1, but the resulting 5 processor version was slower than the 2 processor version due to the high synchronization overhead. Our research shows that HMM processing also benefits from special-purpose acceleration, but that work is not reported here.

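One way to arrive at the 1.97 figure from the native profile in Section 2 (our arithmetic): if GAU and HMM fully overlap, execution time is bounded by the larger of the two phases plus the FE component, giving

    speedup = 1 / (max(0.498, 0.493) + 0.0089) ≈ 1 / 0.507 ≈ 1.97
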
3 A GAU Acceleration Architecture

The tight structure of the GAU computation lends itself to a high-throughput custom implementation. The key questions are how to achieve area, power and bandwidth efficiency as well as scalability. This section describes how we achieved these goals by a) reducing the floating point precision, b) restructuring the computation, and c) sharing memory bandwidth.

Sphinx's designers try hard to eliminate floating point computation wherever possible. GAU and FE are the only floating point dominant computations in Sphinx. An attempt was made to convert GAU to use fixed point integer arithmetic. This was a total failure. The computations require a very high dynamic range which cannot be provided with 32 bit scaled integer arithmetic. Fortunately, the scores of the highly probable states are typically several orders of magnitude higher than those of the less likely ones, indicating that a wide range is more important than precision.

Earlier work explored the use of special-purpose floating point formats in Gaussian estimation to save memory bandwidth [11]. Special floating point formats should be almost invisible to the application. This reduces complexity and enables the development of speech models without access to any special hardware. We conducted an empirical search for the precision requirements by creating a custom software floating point emulation library for GAU. The library supports multiplication, addition, MAC, and (a − b)² operations on IEEE 754 format floating point numbers. The approach was to experimentally reduce mantissa and exponent sizes without changing the output results of the Sphinx3 recognizer. The result is a format with a 12-bit mantissa and an 8-bit exponent, similar to an IEEE 754 number in that it has a sign bit, an 8-bit excess-127 exponent and a hidden one bit in its normalized mantissa. Unlike IEEE 754, which has 23 explicit bits in the mantissa, we only need 12 bits. Conversion between the reduced precision representation and IEEE 754 is trivial. Though our study was done independently, we subsequently found a previous study which arrived at similar conclusions based on an earlier version of Sphinx [15]. However, that study used digit serial multipliers which cannot provide the kind of throughput required for GAU computation. Hence we chose to use fully pipelined reduced precision multipliers instead.

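Conversion amounts to dropping the low 11 explicit mantissa bits. A minimal sketch in C, assuming simple truncation rather than rounding (function names are ours):

    #include <stdint.h>
    #include <string.h>

    /* Truncate an IEEE 754 single to the 1+8+12-bit reduced format. */
    static uint32_t f32_to_reduced(float f)
    {
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);  /* reinterpret without UB */
        return bits >> 11;  /* keep sign, exponent, top 12 mantissa bits */
    }

    /* Expand back to IEEE 754; the dropped mantissa bits become zero. */
    static float reduced_to_f32(uint32_t r)
    {
        uint32_t bits = r << 11;
        float f;
        memcpy(&f, &bits, sizeof f);
        return f;
    }
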
Another key insight is that current high performance microprocessors provide a fused multiply-add operation which would benefit GAU. However, GAU also needs an add-multiply (subtract-square) operation. There is scope for floating point circuit improvements relying on the nature of (a − b)² always returning a positive number. Further gains can be obtained in area, latency, power and magnitude of the numerical error by fusing the operations (a − b)² · c. This is the approach we have taken.

Figure 9 illustrates the system context for our GAU accelerator. Figure 10 shows the details of the accelerator itself.

Figure 9: Top Level Organization of Gaussian Estimator

We implement Loops 1, 2 and 3 (from the optimized GAU algorithm in Figure 8) in hardware while the outer loop is implemented in software. The max operation can be folded into the denormal floating point number handling section of our floating point adder without additional latency, but empirically it can be discarded without sacrificing recognition accuracy. The organization in Figure 9 is essentially a decoupled access/execute architecture where the outer loop runs on a host processor and instructs a DMA engine to transfer X, Mean and Var vectors into the accelerator's input memory. A set of 10 input blocks is transferred into the accelerator memory and retained for the duration of a pass over the entire interleaved Mean/Var table. The Mean/Var memory is double buffered for simultaneous access by the DMA engine and the accelerator. The accelerator sends results to an output queue from where they are read by the host processor using its coprocessor access interface.

The datapath consists of an (a − b)² · c floating point unit, followed by an adder that accumulates the sum, as well as a fused multiply-add (a · b + c) unit that performs the final scaling and accumulates the score. Given that X, Mean, and Var are 39 element vectors, a vector style architecture is suggested. The problem comes in the accumulation step since this operation depends on the sum from the previous cycle, and floating point adders have multi-cycle latencies. For a vector length of N and an addition latency of M, a straightforward implementation takes (N−1) · M cycles. Binary tree reduction (similar to an optimal merge algorithm) is possible, but even then the whole loop cannot be pipelined with unit initiation interval.

This problem is solved by reordering Loops 1, 2, 3 into a 2, 3, 1 order. This calculates an (X − M)² · V term for each input block while reading out the mean and variance values just once from the SRAM. Effectively, this is an interleaved execution of 10 separate vectors on a single function unit, which leaves enough time to do a floating point addition of a partial sum term before the next term arrives for that vector. The cost is an additional 10 internal registers to maintain partial sums. Loops 2, 3, 1 can now be pipelined with unit initiation interval. In the original algorithm the Mean/Var SRAM is accessed every cycle, whereas with the loop interchanged version this 64-bit wide SRAM is accessed only once every 10 cycles. Since SRAM read current is comparable to function unit current in the CMOS technology we use, the loop interchange also contributes significant savings in power consumption.

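In software terms, the interchanged schedule looks like the following sketch (our C rendering of the hardware behavior for a single senone, using the names from Figure 8):

    for (int c = 0; c < 8; c++) {                     /* Loop 2 */
        float partial[10] = {0};                      /* the 10 partial-sum registers */
        for (int i = 0; i < 39; i++) {                /* Loop 3 */
            float m = Gautable[senone][c].vector[i].Mean;  /* Mean/Var SRAM read: */
            float v = Gautable[senone][c].vector[i].Var;   /* once per 10 operations */
            for (int block = 0; block < 10; block++) {/* Loop 1, now innermost */
                float t = X[block][i] - m;
                partial[block] += t * t * v;          /* 10 independent add chains */
            }
        }
        for (int block = 0; block < 10; block++)      /* Final Sigma stage */
            score[senone][block] += partial[block] * Gautable[senone][c].FinalScale
                                  + Gautable[senone][c].FinalWeight;
    }
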
The Final Sigma unit in Figure 10 works in a similar manner except that instead of a floating point adder, it uses a fused multiply-add unit. It scales the sum, adds the final weight and accumulates the final score. Due to the interleaved execution this unit also requires 10 intermediate sum registers. This unit has a fairly low utilization since it receives only 8 × 10 inputs every 39 × 10 × 8 cycles. It is useful since it makes it possible for the host processor to read one combined value per block instead of having to do 8 coprocessor reads. To save power this unit is disabled when it is idle. In a multi-channel configuration it is possible to share this unit between multiple channels.

Figure 10: Gaussian Coprocessor

The datapath shown in Figure 10 has been implemented using a datapath description language (Synopsys Module Compiler Language) and subsequently synthesized for a 0.25μ CMOS process. The control sections were written in Verilog and synthesized using Synopsys Design Compiler. The gate level netlist is then annotated with worst case wire loads calculated using the same wire load model used for synthesis. The netlist is then simulated at the Spice level using Synopsys NanoSim and transistor parameters extracted for the same 0.25μ process by MOSIS. Energy consumption is estimated from the RMS supply current computed by Spice. The unoptimized fully pipelined design can operate above 300 MHz at the nominal voltage of 2.5 volts with unit initiation interval. At this frequency the performance exceeds the real time requirements for GAU, indicating an opportunity to further reduce power. A lower frequency and voltage can be used to further reduce power.

The accelerator was designed and simulated along with a low-power embedded MIPS-like processor that we could modify as needed to support special purpose co-processor accelerators. This control processor is a simple in-order design that uses a blocking L1 DCache and has no L2 cache. To support the equivalent of multiple outstanding loads, it uses the MIPS coprocessor interface to directly submit DMA requests to a low priority queue in the on-chip memory controller. The queue supports 16 outstanding low priority block read requests with block sizes being multiples of 128 bytes. A load request specifies a ROM address and a destination: one of the Feat, Mean or Var SRAMs. The memory controller initiates a queued memory read and transfers the data directly to the requested SRAM index. A more capable out of order processor could initiate the loads directly. Software running on the processor core does the equivalent of the GAU OPT phase. It accumulates 100ms or 10 frames of speech feature vectors (1560 bytes) into the Feat SRAM periodically. This transfer uses the memory controller queue interface. Next, it loads two interleaved Mean/Var vectors from ROM into the corresponding SRAM using the queue interface. A single transfer in this case is 640 bytes. The Mean/Var SRAM is double buffered to hide the memory latency. Initially, the software fills both buffers. It then queues up a series of vector execute commands to the control logic of the Gaussian accelerator. A single command corresponds to executing the interchanged Loops 2, 3, 1. The processor then proceeds to read results from the output queue of the Gaussian accelerator. When 10 results have been read, it is time to switch to the next Mean/Var vector and refill the used-up half of the Mean/Var SRAM. This process continues until the entire Mean/Var table has been processed.

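Putting this together, the host-side control flow is roughly the following sketch (dma_queue_read, gau_execute and gau_read_result are hypothetical names for the queue and coprocessor interfaces described above, and the buffering granularity is simplified):

    /* Hypothetical stand-ins for the interfaces described in the text. */
    enum { FEAT_SRAM, MEANVAR_A, MEANVAR_B };  /* DMA destinations */
    extern void     dma_queue_read(int dest, unsigned rom_addr, int bytes);
    extern void     gau_execute(int half);     /* run interchanged loops 2,3,1 */
    extern float    gau_read_result(void);     /* pop accelerator output queue */
    extern unsigned table_addr(int n);         /* ROM address of nth Mean/Var entry */

    void gau_opt_host(int num_entries, float score[][10], unsigned feat_rom)
    {
        dma_queue_read(FEAT_SRAM, feat_rom, 1560);      /* 10 frames of features */
        dma_queue_read(MEANVAR_A, table_addr(0), 640);  /* prime both halves of  */
        dma_queue_read(MEANVAR_B, table_addr(1), 640);  /* the double buffer     */

        for (int n = 0; n < num_entries; n++) {
            gau_execute(n & 1);                /* compute on one buffer half */
            for (int b = 0; b < 10; b++)       /* one combined score per block */
                score[n][b] = gau_read_result();
            if (n + 2 < num_entries)           /* refill the half just consumed */
                dma_queue_read((n & 1) ? MEANVAR_B : MEANVAR_A,
                               table_addr(n + 2), 640);
        }
    }
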
