ISCA Archive
http://www.isca-speech.org/archive

2nd European Conference on
Speech Communication and Technology
EUROSPEECH ’91
Genova, Italy, September 24-26, 1991

SPEAKER INDEPENDENT CONTINUOUS HMM-BASED RECOGNITION
OF ISOLATED WORDS ON A REAL-TIME MULTI-DSP SYSTEM

Abdulmesih Aktas and Klaus Zünkler

Siemens AG, Corporate Research & Development
Otto-Hahn-Ring 6, 8000 München 83, Germany

Abstract
This paper describes a speaker independent continuous Hidden Markov Model recognizer implemented on a real-time multi-DSP system. Training and recognition are based on continuous mixture density HMMs for phonemes. Context dependent triphone models are used and the Viterbi algorithm is applied for both training and recognition. The system is implemented on a workstation with an integrated multi-DSP based acoustic front-end employing three Texas Instruments TMS320C25 signal processors and a Siemens ASIC for vector quantization. In spite of the simplifications made in order to reduce the high computational requirements of the continuous mixture densities, the system achieves a recognition rate of 99.5% on the speaker independent German digit task with telephone quality.

Keywords: Speech Recognition, Hidden Markov Model, mixture density, speaker independent, real-time.

1. Introduction

Hidden Markov Models (HMM) with continuous mixture densities as stochastic models for speech recognition have been shown to be well-suited for modeling spectral and temporal variabilities in speech [4]. The computation of the emission probabilities required for the probability density functions is usually a highly time consuming task. Therefore most real-time applications of HMM-based recognizers on digital signal processors (DSP) use discrete models (e.g. [3]). For the implementation of continuous mixture density HMMs, powerful floating point DSPs or special ASICs have to be employed. In our approach we use a vector quantization VLSI chip (VQ) for producing the emission probabilities. For the feature extraction, integer DSPs are employed. The Viterbi search is implemented on a RISC-based workstation.
The recognizer was evaluated on two different speaker independent German digit databases, on which recognition rates of 99.5% were achieved. More details on the recognition experiments are given in [6,7].

The following section of our paper describes the structure of our models. Section 3 deals with the real-time multi-DSP front-end, while section 4 presents implementation aspects of the recognition algorithm on the hardware. The VQ realization of the emission probability computation is explained in more detail. Finally some results conclude the paper.

2. Model Structure

As experiments with word models have shown a poor ability to discriminate confusable words, phoneme models are used in our system [6]. The phonemes are modelled with context dependency because their acoustic realizations are often heavily affected by coarticulation. The HMMs that we use for each phoneme consist of three segments modeling the beginning, the middle and the end. Each segment consists of two states with tied emission probabilities. The first and the third segment account for coarticulatory effects due to the transitions to the neighboring phonemes, whereas the second segment stands for the middle part, which is less affected by the phonetic context.
Word models are concatenated from phoneme models according to a pronunciation lexicon. During recognition only phoneme sequences contained in the lexicon are used. On the segment layer, triphone models, shown in Figure 1, are the basis. The example shows the concatenation of three phonemes (/d/, /r/ and /ai/) to the German word 'drei'. The three phonemes are embedded into silence models (/si/). The second line shows the context dependent segments (si, sid, d, dr, dr and so on) with their associated indices, which indicate the adjacent or predecessor segment of the current phoneme for this word. This kind of modeling allows robust and high recognition performance.
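The concatenation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the lexicon holds the single entry from Figure 1, silence is treated as one context independent segment, and the labels `left-cur` (beginning segment, depends on the predecessor) and `cur+right` (end segment, depends on the successor) are a notation chosen here, not the paper's.

```python
SILENCE = "si"
# Hypothetical one-word pronunciation lexicon; the entry follows Figure 1.
LEXICON = {"drei": ["d", "r", "ai"]}

def word_segments(word, lexicon=LEXICON):
    """Expand a word into its sequence of segment labels."""
    phonemes = [SILENCE] + lexicon[word] + [SILENCE]
    segments = [SILENCE]
    for i in range(1, len(phonemes) - 1):
        left, cur, right = phonemes[i - 1], phonemes[i], phonemes[i + 1]
        segments.append(f"{left}-{cur}")   # beginning: depends on the predecessor
        segments.append(cur)               # middle: context independent
        segments.append(f"{cur}+{right}")  # end: depends on the successor
    segments.append(SILENCE)
    return segments

print(word_segments("drei"))
```

Each phoneme contributes three segments, so 'drei' expands to eleven segments including the two silence models. Since every segment has two states with tied emissions, a word model built this way would have twice as many states as segments.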
On the segment layer the possible number of triphones is reduced because every middle segment of a phoneme model is adjacent only to segments belonging to the same phoneme. The middle segments of the phoneme models therefore carry no context information. Each beginning segment of a phoneme is succeeded by the same context independent middle segment. Thus the triphones are reduced to a mixture of bigrams and context independent segments. With an inventory of 40 phonemes, approximately 3000 segment models would be possible. In practice, with a small digit vocabulary, this is reduced to 90 segment models.

Figure 1: Context dependent phoneme based word model of the German digit 'drei' (three). Upper row: phonetic transcription (/si/ /d/ /r/ /ai/ /si/); lower row: context dependent segments.

EUROSPEECH ’91, Genova, Italy, September 1991

1345

IPR2023-00034
Apple EX1059 Page 1

3. Multi-DSP Acoustic Front-End

The recognition system is implemented on a UNIX workstation employing a multi-DSP based acoustic front-end (AkuFE) [1] which was developed within the speaker adaptive continuous speech understanding system SPICOS [4]. In the SPICOS system the AkuFE is employed for the acoustic-phonetic decoding and the real-time recognition of subword units [2]. The AkuFE is designed for real-time speech processing tasks and is a configurable VMEbus system employing Texas Instruments TMS320C25 signal processors and a Siemens ASIC for vector quantization (VQ). A powerful system can be configured using three boards (one master and two slave boards) with a total computational power of more than 100 MIPs. In this case five signal processors and two VQ processors can work in parallel. Figure 2 shows a general configuration of an extended system.

Figure 2: The multi-DSP acoustic front-end system

The master board is the basic control system of an extendable multi-DSP architecture. It consists of an analogue subsystem (A/D and D/A converter) and a digital, Texas Instruments TMS320C25-based subsystem, whereas the slave board employs two TMS320C25 DSPs and a VLSI chip for VQ. A local data and program memory is dedicated to each signal processor. A data memory region of 1 KWord is provided on the master board as dual-ported memory enabling parallel access from both the DSP and the host processor. In the case of signal acquisition, this allows a real-time transfer of the preprocessed and analyzed data. The inter-DSP bus allows fast communication between the signal processors on the master and the slave boards without interfering with the VMEbus. A global memory scheme is realized in order to perform data transfers between the processors. For flexible VMEbus and inter-DSP communication an interrupt control is provided [1].
The VQ processor is able to handle vectors of variable dimensions and codebooks of programmable size. Each codebook vector can be extended up to 64 dimensions of 8 bit components without noticeable loss of time performance. Four software selectable codebook memory banks, each with 1 Mbit, are dedicated to the processor. The chip can be controlled either from one of the DSP subsystems of the AkuFE or from the host. The on-chip cache RAM of the VQ processor allows a vector to be loaded in parallel during the search. Within 1 second the VQ can compare 10^7 vector components of the codebook to a given vector. Two different distance measures (City-block and Euclidean) are supported. The chip delivers the codebook entry and the distance of the best codebook candidate.

4. Real-Time Implementation

Real-Time Considerations

Training and recognition in our system are based on continuous mixture density HMMs for phonemes as used in the SPICOS system [4]. The main difference, besides recording conditions and sampling frequencies, is that context dependent models are used. The Viterbi algorithm is implemented for both training and recognition. Our models are left-to-right first order Markov models with a fixed A matrix, an N-dimensional feature vector and a continuous mixture density function p(x) with M modes, which is based on Gaussian density functions:

$$p(x_t) = \sum_{m=1}^{M} c_m \, \mathcal{N}(x_t; \mu_m, C_m) \qquad (1)$$

where $x_t$ represents the input frame vector, $c_m$ is the mixture weight for the m-th mode, $\mu_m$ is the mean vector of the mixture mode m and $C_m$ is the covariance matrix.
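As a point of reference for the simplifications that follow, the unsimplified mixture density of formula (1) can be written out directly. The sketch below is a floating point NumPy illustration, not the paper's implementation; the function names and the argument layout (lists of weights, mean vectors and full covariance matrices) are assumptions made here.

```python
import numpy as np

def gaussian_logpdf(x, mean, cov):
    """Log of a multivariate Gaussian density with a full covariance matrix."""
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)   # squared Mahalanobis distance
    return -0.5 * (len(x) * np.log(2.0 * np.pi) + logdet + maha)

def mixture_density(x, weights, means, covs):
    """p(x) = sum over m of c_m * N(x; mu_m, C_m), as in formula (1)."""
    return sum(c * np.exp(gaussian_logpdf(x, mu, C))
               for c, mu, C in zip(weights, means, covs))

# Usage: a two-mode mixture in two dimensions.
p = mixture_density(np.array([0.5, -0.2]),
                    weights=[0.3, 0.7],
                    means=[np.zeros(2), np.ones(2)],
                    covs=[np.eye(2), 2.0 * np.eye(2)])
```

Every frame requires one linear solve and one exponential per mode, which illustrates why the full computation is costly on integer DSPs.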
In order to achieve real-time processing some assumptions concerning the emission probabilities are made [6]:
• The covariance matrix C is assumed to be diagonal.
• It is assumed that the contribution of the nearest mixture density is much larger than the contribution of the others; thus the probability is computed only for the best fitting mode k for a given input vector $x_t$. Additionally the variances of all mixture densities are considered to be equal.
This results in the reduction of formula (1) to:

$$p_k(x_t) = c_k \exp\Big(-K_1 - K_2 \sum_{n=1}^{N} (x_{tn} - \mu_{kn})^2\Big) \qquad (2)$$

where $K_1$ and $K_2$ are constants which depend on the variance of the distributions. Simulation experiments have shown that this is no limitation if there are enough mixtures available for modeling the density functions. With these simplifications the emission probability depends only on the mean of the best fitting mixture density and its mixture weight. In the logarithmic domain, the computation of the probability simplifies to a Euclidean distance

$$-\ln p_k(x_t) = K_1 - \ln c_k + K_2 \sum_{n=1}^{N} (x_{tn} - \mu_{kn})^2 \qquad (3)$$

which can be easily computed by the special hardware.
With the assumption of Laplacian density functions the mixture computation can be further reduced to the City-block distance and modified to a distance measure $d_{kt}$ derived from the probabilities:

$$d_{kt} = \frac{-\ln p_k(x_t) - K_1}{K_2} = C_k + \sum_{n=1}^{N} |x_{tn} - \mu_{kn}|, \quad \text{with } C_k = -\frac{\ln c_k}{K_2} \qquad (4)$$

$C_k$ is here a mode dependent constant.
Previous experiments with both distance measures have not revealed significant differences in the recognition results.

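The two simplified scores, equations (3) and (4), can be sketched as follows. This is a floating point illustration under assumptions: the function names are hypothetical, `means` is an array with one row per mode, and the constants `k1` and `k2` are passed in rather than derived from a variance.

```python
import numpy as np

def euclidean_score(x, means, weights, k1, k2):
    """-ln p_k(x) of equation (3), minimised over the modes k."""
    costs = k1 - np.log(weights) + k2 * np.sum((x - means) ** 2, axis=1)
    k = int(np.argmin(costs))
    return costs[k], k

def city_block_distance(x, means, weights, k2):
    """d_kt of equation (4), minimised over the modes k."""
    offsets = -np.log(weights) / k2                    # mode dependent constant C_k
    d = offsets + np.sum(np.abs(x - means), axis=1)
    k = int(np.argmin(d))
    return d[k], k

# Usage: two modes in two dimensions.
means = np.array([[0.0, 0.0], [3.0, 4.0]])
weights = np.array([0.5, 0.5])
d, k = city_block_distance(np.array([3.0, 3.0]), means, weights, k2=1.0)
```

In both cases only additions, subtractions and a minimum search remain per frame; the logarithm of the mixture weight can be precomputed and stored with the codebook entry.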
The Viterbi search algorithm is split into two parts which run on different processors. The production of the emission probabilities for each frame is performed on the VQ processor, while the calculation of the accumulated word model probabilities is processed on the host machine. Figure 3 shows the organization of the system modules on the hardware.
In the current system three DSPs and one VQ processor are employed for the preprocessing, feature extraction and emission probability calculation. The DSP on the master board and the VQ control processor work in multitasking modes. The first DSP additionally handles the double-buffered organization of the real-time signal acquisition. Every 10 ms the host machine receives an interrupt which initiates a buffer read on the mailbox of the master board to get the distances for the current frame.

Figure 3: The organization of the recognition modules on the hardware.

Feature Extraction
Two boards of the AkuFE system are employed for the recognition task. The first board performs the preprocessing steps: amplification, low-pass filtering at a corner frequency of 3.4 kHz, sampling at 8 kHz and A/D conversion (12 bit resolution). After the preemphasis, a 256-point Hamming window is applied to the speech frame with a step size of 10 ms. Further tasks of this board are the control of the slave DSPs and of the data flow over the mailbox and global memory.
Mel-scaled filter bank energies and their differences are used as feature vector. The time consuming computation of this vector is implemented in integer arithmetic on the slave board employing both of the DSP subsystems. A 256-point FFT is applied to the windowed signal. The spectrum is passed to a filter bank. The Mel filter bank basically performs a summation of the FFT-channel energies within the selected frequency ranges, which vary according to the Mel scale. The filter bank energies are then transformed to the log domain. This allows a coding of each dimension with 8 bit resolution. In order to eliminate irrelevant energy variations between neighboring frequencies which are caused by the pitch, a cepstral smoothing is applied to the 24-dimensional spectra. The smoothing is done by a convolution with a liftering window transformed into the frequency domain. Additionally the total frame energy and the first and second derivatives of the Mel vector are computed. After condensation of the derivatives the feature vector has 52 dimensions.

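The first stages of this chain (preemphasis, windowing, FFT, band summation, logarithm) can be sketched in floating point NumPy. Several details are assumptions made for illustration and are not given in the paper: the preemphasis coefficient, the Mel formula used to place the band edges, and the use of non-overlapping rectangular bands. The cepstral smoothing, the frame energy, the derivatives and the 8 bit integer coding are left out.

```python
import numpy as np

SAMPLE_RATE = 8000      # Hz, from the text
FRAME_LEN = 256         # window and FFT length, from the text
STEP = 80               # 10 ms at 8 kHz
N_BANDS = 24            # Mel channels, from the text
PREEMPHASIS = 0.95      # assumed coefficient, not given in the paper

def hz_to_mel(f):
    # A common Mel formula, assumed here
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def band_edges():
    """FFT-bin boundaries of N_BANDS bands equally spaced on the Mel scale."""
    mel_points = np.linspace(0.0, hz_to_mel(SAMPLE_RATE / 2.0), N_BANDS + 1)
    bins = np.round(mel_to_hz(mel_points) * FRAME_LEN / SAMPLE_RATE).astype(int)
    for i in range(1, len(bins)):          # keep every band at least one bin wide
        bins[i] = max(bins[i], bins[i - 1] + 1)
    return bins

def log_mel_frames(signal):
    """Return an array of shape (n_frames, N_BANDS) of log band energies."""
    signal = np.asarray(signal, dtype=float)
    emphasized = np.append(signal[0], signal[1:] - PREEMPHASIS * signal[:-1])
    window = np.hamming(FRAME_LEN)
    edges = band_edges()
    frames = []
    for start in range(0, len(emphasized) - FRAME_LEN + 1, STEP):
        frame = emphasized[start:start + FRAME_LEN] * window
        power = np.abs(np.fft.rfft(frame)) ** 2            # FFT-channel energies
        bands = [power[lo:hi].sum() for lo, hi in zip(edges[:-1], edges[1:])]
        frames.append(np.log(np.maximum(bands, 1e-10)))    # floor avoids log(0)
    return np.array(frames)

# Usage: one second of a 1 kHz tone gives one 24-dimensional vector per 10 ms step.
tone = np.sin(2.0 * np.pi * 1000.0 * np.arange(SAMPLE_RATE) / SAMPLE_RATE)
features = log_mel_frames(tone)
```

Taking the logarithm compresses the dynamic range of the band energies, which is what makes the 8 bit coding mentioned in the text feasible.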
Computation of the emission probabilities
The feature vector, which is computed in real-time every 10 ms, is passed to the VQ stage in order to compute the local emission probabilities for the current frame. The organization of the codebook is as follows: for each segment there are about ten codebook entries containing the mean vector and the mixture weight of each density. The VQ processor performs in real-time a search over all codebook segments. As each segment is assigned to a subword model, the vector quantizer chip computes the distance to the mean of the best fitting mixture density. This is based on the simplifications given above in this section (equation (4)). The minimum distance for each frame and segment model is passed to the workstation, which evaluates the word HMMs in parallel and in real-time. As the local emission probabilities are represented in the log domain, the calculation of the word probabilities is reduced to a summation.
The VQ processor is controlled by one of the slave DSPs. After each search within a segment the VQ generates an interrupt, causing a distance and index read by the DSP.

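The segment-wise codebook search can be emulated in software as shown below. This is a sketch under assumptions: the array layout, the function name and the separate `offsets` array for the mode dependent constants are choices made here and say nothing about how the VQ chip stores its codebook.

```python
import numpy as np

def segment_distances(x, codebook, offsets):
    """Minimum city-block distance and best mode index for every segment.

    x:        feature vector of 8 bit components, shape (n_dims,)
    codebook: mean vectors, shape (n_segments, n_modes, n_dims)
    offsets:  mode dependent constants, shape (n_segments, n_modes)
    """
    # Widen to int32 so that the 8 bit subtraction cannot wrap around
    diff = np.abs(codebook.astype(np.int32) - x.astype(np.int32))
    d = offsets + diff.sum(axis=2)
    return d.min(axis=1), d.argmin(axis=1)

# Usage: two segments with two modes each, two dimensions.
codebook = np.array([[[0, 0], [10, 10]],
                     [[5, 5], [200, 200]]], dtype=np.uint8)
offsets = np.zeros((2, 2))
dist, mode = segment_distances(np.array([9, 9], dtype=np.uint8), codebook, offsets)
```

Only `dist` is needed by the host for recognition; the best mode indices are what the text later mentions as useful for training.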
Viterbi Search
As our system is a phoneme based recognition system, each word in the lexicon is built up from a sequence of phoneme units. For the purpose of word boundary detection a silence model is added to the beginning and end of each model. The number of states in the word models defines the search space for each word. This remains static for each vocabulary. In our case of small vocabularies the search space for all words is prepared in advance. For the digit task the maximum search space includes 256 states (sum of all states in the word models). For each search space the word model probability has to be calculated. The computation can be done in a parallel manner by considering the current feature vector in each search space of each word. This requires about 8 bytes of memory per state. A 10-MIPs workstation is necessary in order to handle this task in real-time.
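A frame synchronous update for one word model can be sketched as below. It is an illustration under assumptions: the fixed transition probabilities are taken to be equal for staying and advancing, so they drop out of the comparison; states cannot be skipped; and `state_segment` is a hypothetical table mapping each state to the segment whose distance it uses.

```python
import numpy as np

def viterbi_step(scores, frame_dist, state_segment):
    """Update the accumulated cost of every state for one frame."""
    stay = scores
    advance = np.concatenate(([np.inf], scores[:-1]))   # come from the previous state
    return np.minimum(stay, advance) + frame_dist[state_segment]

def word_cost(distances, state_segment):
    """Cost of the best path that starts in the first and ends in the last state.

    distances: array (n_frames, n_segments) of per-frame segment distances
    """
    scores = np.full(len(state_segment), np.inf)
    scores[0] = distances[0, state_segment[0]]
    for frame_dist in distances[1:]:
        scores = viterbi_step(scores, frame_dist, state_segment)
    return scores[-1]

# Usage: two segments with two tied states each, five frames.
state_segment = [0, 0, 1, 1]
distances = np.array([[1.0, 5.0], [1.0, 5.0], [1.0, 5.0], [5.0, 1.0], [5.0, 1.0]])
cost = word_cost(distances, state_segment)
```

Because the distances are negative log probabilities up to constants, the path score is a running sum and the best word is the one with the smallest accumulated cost.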
The end detection of the utterances is checked automatically during the search. The best word candidate is evaluated at the final pause state of each word model, and if a candidate is the winner for at least 300 ms it is selected as the best word.
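With a frame step of 10 ms, the 300 ms rule corresponds to 30 consecutive frames with the same winner. A minimal sketch, assuming the search reports the current best word per frame (or `None` while no word has reached its final pause state); the function name and this interface are hypothetical:

```python
def detect_word(best_per_frame, frame_ms=10, hold_ms=300):
    """Return (word, frame index) once a word has won for hold_ms in a row."""
    needed = hold_ms // frame_ms
    run_word, run_len = None, 0
    for t, word in enumerate(best_per_frame):
        if word is not None and word == run_word:
            run_len += 1
        else:
            run_word = word
            run_len = 1 if word is not None else 0
        if run_len >= needed:
            return run_word, t
    return None, None

# Usage: five undecided frames followed by a stable winner.
result = detect_word([None] * 5 + ["drei"] * 30)
```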
The multi-DSP system is also well-suited for training the Hidden Markov Models, because the computation of the indices of the best modes by the VQ processor allows us to reestimate the mean values of the distributions.

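One plausible reading of this remark is the usual Viterbi-style update, in which each mean becomes the average of the frames for which its mode was the best one. The paper does not spell out the update rule, so the sketch below is an assumption, as are the function name and the array layout.

```python
import numpy as np

def reestimate_means(frames, best_modes, old_means):
    """Average the frames assigned to each mode; modes without frames keep their mean.

    frames:     array (n_frames, n_dims) of feature vectors aligned to one segment
    best_modes: array (n_frames,) of best mode indices, as delivered by the VQ search
    old_means:  array (n_modes, n_dims)
    """
    new_means = old_means.astype(float).copy()
    for k in range(len(old_means)):
        assigned = frames[best_modes == k]
        if len(assigned) > 0:
            new_means[k] = assigned.mean(axis=0)
    return new_means

# Usage: three frames, three modes, two dimensions.
frames = np.array([[0.0, 0.0], [2.0, 2.0], [10.0, 10.0]])
best_modes = np.array([0, 0, 1])
updated = reestimate_means(frames, best_modes, np.zeros((3, 2)))
```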
5. Results and Conclusion

The recognition experiments for this work are described in [6]. The German digit database we use contains 11 words (0-9 and an alternative pronunciation for the German digit 'two' (zwo)). The training set consisted of utterances spoken by 15 male and 15 female speakers, each with four versions on average. Speech data of another 30 male and female speakers (440 utterances) was used for test experiments. The recording was performed over an ISDN telephone device with a sampling rate of 8 kHz. The recognition rate is 99.5% for the test database.
In the current demonstration system the Mel vector calculation requires about 20 MIPs. The next generation DSP (e.g. TMS320C50) will allow us to perform the whole Mel vector computation on one DSP. With a faster FFT, which now runs on the first slave DSP, it is possible to reorganize the modules and extend the number of segments. That means new words can be added to the lexicon. Currently the VQ requires 6 ms and about 40 MIPs for the search, given a segment based codebook. A possibility to double the vocabulary is to employ two VQ processors at a time. The time intensive implementation of the feature extraction algorithms is justified with a view to transferring the algorithms to VLSI chips. Expensive silicon area can be saved in this way.
As our system is phoneme based, the emission probability production can be easily extended to continuous speech recognition applications. For the 1000 word vocabulary of SPICOS around 130 context dependent segments have to be considered in real-time.

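The quoted search time can be cross-checked with a rough count of the vector components compared per frame. The sketch combines figures from different sections (90 segment models, about ten codebook entries per segment, 52 dimensions) and assumes that the VQ throughput given in Section 3 reads 10^7 components per second. Control overhead such as the interrupt after each segment search is not modelled.

```python
VQ_COMPONENTS_PER_SECOND = 1e7   # assumed reading of the throughput in Section 3
N_SEGMENTS = 90                  # segment models for the digit vocabulary
MODES_PER_SEGMENT = 10           # "about ten codebook entries" per segment
N_DIMS = 52                      # feature vector dimensions

def vq_search_ms(n_segments, modes=MODES_PER_SEGMENT, dims=N_DIMS):
    """Pure comparison time per frame in milliseconds."""
    components = n_segments * modes * dims
    return 1000.0 * components / VQ_COMPONENTS_PER_SECOND

print(vq_search_ms(N_SEGMENTS))       # digit vocabulary
print(vq_search_ms(2 * N_SEGMENTS))   # twice as many segments
```

Under these assumptions the digit codebook takes a little under 5 ms of pure comparison time per frame, the same order as the 6 ms quoted above. Doubling the number of segments would bring a single VQ processor close to the 10 ms frame period, which fits the suggestion to employ two VQ processors.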
References

[1] A. Aktas and H. Höge: "Multi-DSP and VQ-ASIC Based Acoustic Front-End for Real-Time Speech Processing Tasks", Proc. EUROSPEECH '89, Paris, pp. 586-589
[2] A. Aktas and H. Höge: "Real-Time Recognition of Subword Units on a Hybrid Multi-DSP/ASIC Based Acoustic Front-End", Proc. of the ICASSP '89, Glasgow, pp. 101-103
[3] A. Ciaremella et al.: "A PC-housed speaker independent large vocabulary continuous telephonic speech recognizer", in these Proceedings
[4] H. Höge: "SPICOS II - A speech understanding dialogue system", Proc. of the Int. Conf. on Spoken Language Processing (ICSLP), Kobe, Japan (1990)
[5] H. Ney and A. Noll: "Phoneme modeling using continuous mixture densities", Proc. of the ICASSP '88, New York, pp. 437-440
[6] K. Zünkler: "An ISDN speech server based on speaker independent continuous Hidden Markov Models", Proc. of NATO-ASI, Cetraro, July 1990
[7] K. Zünkler: "A Discriminative Recognizer for Isolated and Continuous Speech Recognition Using Statistical Information Measures", in these Proceedings