`
`IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, VOL. 17, NO. 1, JANUARY 1999
`
`Quantization of Cepstral Parameters for Speech
`Recognition over the World Wide Web
`
`Vassilios V. Digalakis, Leonardo G. Neumeyer, Member, IEEE, and Manolis Perakakis
`
Abstract—We examine alternative architectures for a client-server model of speech-enabled applications over the World Wide Web (WWW). We compare a server-only processing model, in which the client encodes and transmits the speech signal to the server, with a model in which the recognition front end runs locally at the client and encodes and transmits the cepstral coefficients to the recognition server over the Internet. We follow a novel encoding paradigm, trying to maximize recognition performance instead of perceptual reproduction, and we find that by transmitting the cepstral coefficients we can achieve significantly higher recognition performance at a fraction of the bit rate required when encoding the speech signal directly. We find that the bit rate required to achieve the recognition performance of high-quality unquantized speech is just 2000 bits per second.
`
`Index Terms—Speech coding, speech recognition, World Wide
`Web.
`
`I. INTRODUCTION
`
MOTIVATED by the explosive growth of the Internet, speech researchers have been working on the integration of speech technologies into the World Wide Web (WWW) [1]–[8]. Applications include Internet telephony, speech-enabled browsers, speech and natural language understanding systems, and speaker verification. Developers have successfully adapted existing systems, or created new ones, that can be deployed over the WWW.
`In this paper we consider a client-server speech recognition
`system. We assume that communication channels between the
`client and the server may have limited bandwidth. That would
`be a realistic assumption in applications that communicate over
the Internet or through wireless channels. The architecture of the client-server speech recognition system is shown in Fig. 1.
`A central server provides speech recognition services. The
`clients are deployed on heterogeneous environments, such as
`personal computers, smart devices, and mobile devices. Speech
`is captured by the clients and, after some local processing, the
`information is sent to the server. The server recognizes the
`speech according to an application framework and sends the
`result string or action back to the client.
`Essentially, this system uses two major speech technologies:
`speech recognition and speech coding. In a complex dialog
`
`Manuscript received February 1998; revised July 1998. This work was
`supported in part by Telia Research of Sweden and SRI.
`V. V. Digalakis is with the Department of Electronics and Computer
`Engineering, Technical University of Crete, Chania 73100, Greece and with
`SRI International, Menlo Park, CA 94025 USA.
`L. G. Neumeyer is with SRI International, Menlo Park, CA 94025 USA.
`M. Perakakis is with the Department of Electronics and Computer Engi-
neering, Technical University of Crete, Chania 73100, Greece.
`Publisher Item Identifier S 0733-8716(99)00005-0.
`
`Fig. 1. Client-server speech recognition system.
`
system, coding would be required to present audio prompts to
`the user. Standard coding techniques can be used to send the
`speech over low-bandwidth channels and produce perceptually
`acceptable speech to the user. In this paper, however, we focus
`on the opposite path; that is, the speech data sent from the
`client to the server.
`Traditional speech coding research focuses on the perfor-
`mance tradeoff between transmission rates and perceptual
`reproduction quality. To achieve high perceptual quality at low
`transmission rates, several successful techniques have been
`developed, resulting in dramatic technological advances. The
`data compression problem for state-of-the-art hidden Markov
`model (HMM)-based speech recognition systems differs from
`the traditional speech coding problem in that the optimization
`criterion is recognition accuracy instead of perceptual quality
`of the reproduced data. In addition to the practical goal of
`developing a client-server architecture, we also have an interest
`in understanding how much and what information is actually
`being modeled by the HMM’s. Understanding what data is
`redundant in the representation of the speech signal may open
`the door to new ideas on how to better model it for recognition.
`In Section II of this paper we briefly review HMM-based
`speech recognizers. Section III examines alternative architec-
`tures for the implementation of speech-enabled applications
`over the WWW. In Section IV we discuss techniques for
`encoding the front-end feature vectors at the client side and
`in Section V we present our experimental results. Section VI
`presents our summary and conclusions.
`
`II. HMM-BASED SPEECH RECOGNITION
`Today’s state-of-the-art speech recognizers are based on
`statistical techniques, with hidden Markov modeling being
`0733–8716/99$10.00 ª
`
`1999 IEEE
`
`Comcast - Exhibit 1006, page 82
`
`
`
`DIGALAKIS et al.: QUANTIZATION OF CEPSTRAL PARAMETERS
`
`83
`
`the dominant approach. The typical components of a speech
`recognition and understanding system are the front-end proces-
`sor, the decoder with its acoustic and language models, and
`the language understanding component. The latter component
`extracts the meaning of a decoded word sequence and is an
`essential part of a natural language system. The remainder of
`this section briefly reviews the front end and the decoder.
`The front-end processor typically performs a short-time
`Fourier analysis and extracts a sequence of observation vectors
(or acoustic vectors).
`
`
`
`84
`
`IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, VOL. 17, NO. 1, JANUARY 1999
`
`• Feature extraction is only a small part of the computation
`that takes place in a speech recognition and understanding
`application.
`• Speech recognition needs only a small part of the informa-
`tion that the speech signal carries. The representation of
`the speech signal used for recognition concentrates on the
`part of the signal that is related to the vocal-tract shape.
`The first observation implies that we can run the front-end
`processing (the feature extraction) at the client side on a much
`wider range of machines than the ones that will support the
`whole recognition process. There are additional advantages
`to client-server processing over the client-only model. The
`recognizer may need information that exists on the server
`side in order to guide the decoding process. This information
`would have to be transmitted to the client in the client-only
`model, something unnecessary in the client-server model since
`the decoding takes place at the server side. To make speech
`recognition servers available from a variety of systems, front-
`end processing and compression can be standardized. Standard
`front-end modules can be installed on the client machines as
`a system resource, a Java applet, or a browser plug-in.
`Our second observation clearly shows the advantage of
client-server processing over the server-only model. Tradi-
`tional speech coding focuses on the perceptual reproduction
quality of the coded speech. As a result, the speech coder may transmit redundant information and, because of bandwidth limitations, at the same time introduce noise into the features that are important to the recognition process. When the objective
`is to transmit the speech to a recognition server, there is a
`clear shift in the speech-coding paradigm, and the objective
`of the coding process should be recognition accuracy. If the
`information used by the recognition process is contained in
`a set of features, then only this set of features needs to be
`compressed and transmitted to the server. For example, typical
`state-of-the-art speech recognizers represent the vocal tract
`information using a set of the first few cepstral coefficients. In
`view of our objective, we should expect a significant reduction
`in bit rate if we encode this set of cepstral features, as opposed
`to encoding the speech signal itself.
`Of course, encoding and transmitting only the front-end
`processed tokens can become a disadvantage since, without
`any representation of the speech associated with these tokens,
`the input speech cannot be labeled. As a result, it may not be
`possible to monitor in-service performance or to collect labeled
`speech data for development and performance improvement.
`To overcome this limitation and collect labeled data during the
`initial deployment of the application, it is possible to transmit
`the original speech encoded using a very low bit-rate coder as
`side information. This side information can be transmitted on
`top of the encoded front-end tokens during the development
`phase only.
`The client-server model can be applied to the Internet,
`as well as to wireless channels. The Aurora Project
`is a
`joint initiative in which a number of companies, including
`Alcatel, Ericsson, IBM, Matra, Nokia, Nortel, and Siemens,
`are working to establish a global standard for distributed
`speech recognition over wireless channels [7]. The speech
`representations adopted by different speech recognition servers
`
`are usually very similar. The definition of the standard is not
`an easy task, however, since the speech-recognition devel-
`opers put a significant amount of effort into fine tuning the
`recognizer parameters to the specific representation, making it
`difficult to switch to a new, albeit similar representation. This
`task could be facilitated by the adoption of an intermediate
`representation, such as the mel-filterbank coefficients which
`exist in many front-ends. On the Internet, the client-server
`model has been adopted by the BBN Labs speech on the
`Internet (SPIN) system that was presented in [8]. Although
`the details of this system are not known, it was reported
`that it encodes speech at 3.8 kb/s in a form suitable for
`discrete-density HMM’s.
`In our work, we follow the client-server model using
`the encoding scheme that
`is described in Section IV. We
`implemented a highly modular signal processing front-end
`in Java to compute the cepstral coefficients and encode the
`parameters. We verified that the system is fast enough to
handle the feature extraction in real time using a Pentium 166
`MHz computer and a Java virtual machine (JVM) with a just-
`in-time (JIT) compiler. We also ran benchmarks to compare
`performance on the computation of the fast Fourier transform
`(FFT). We found that the optimized C code is twice as fast as
`the Java implementation. We believe that as the JVM’s become
`more efficient, the gap between C and Java performance will
`become even smaller.
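To make the benchmarking discussion concrete, the following is a minimal sketch of the kind of radix-2 FFT routine such a front end relies on, together with a crude timing loop. It is not the SRI front end or the benchmark code used in this work: the class name, frame size, and iteration count are illustrative assumptions.

```java
// Minimal sketch of an in-place radix-2 FFT with a crude timing loop.
// Illustrative only: the class name, frame size, and iteration count are
// assumptions, not the front end or benchmark actually used in this work.
public class FftBenchmark {

    // In-place iterative radix-2 FFT; re.length must be a power of two.
    static void fft(double[] re, double[] im) {
        int n = re.length;
        // Bit-reversal permutation.
        for (int i = 1, j = 0; i < n; i++) {
            int bit = n >> 1;
            for (; (j & bit) != 0; bit >>= 1) j ^= bit;
            j ^= bit;
            if (i < j) {
                double t = re[i]; re[i] = re[j]; re[j] = t;
                t = im[i]; im[i] = im[j]; im[j] = t;
            }
        }
        // Butterfly stages with an iteratively updated twiddle factor.
        for (int len = 2; len <= n; len <<= 1) {
            double ang = -2.0 * Math.PI / len;
            double wr = Math.cos(ang), wi = Math.sin(ang);
            for (int i = 0; i < n; i += len) {
                double cr = 1.0, ci = 0.0;
                for (int k = 0; k < len / 2; k++) {
                    double ur = re[i + k], ui = im[i + k];
                    double xr = re[i + k + len / 2], xi = im[i + k + len / 2];
                    double vr = xr * cr - xi * ci;
                    double vi = xr * ci + xi * cr;
                    re[i + k] = ur + vr;
                    im[i + k] = ui + vi;
                    re[i + k + len / 2] = ur - vr;
                    im[i + k + len / 2] = ui - vi;
                    double ncr = cr * wr - ci * wi;
                    ci = cr * wi + ci * wr;
                    cr = ncr;
                }
            }
        }
    }

    public static void main(String[] args) {
        int n = 512;          // assumed FFT size per analysis frame
        int frames = 10000;   // number of frames to time
        double[] re = new double[n], im = new double[n];
        long start = System.currentTimeMillis();
        for (int f = 0; f < frames; f++) {
            for (int i = 0; i < n; i++) {
                re[i] = Math.sin(0.01 * i * (f + 1));  // synthetic test frame
                im[i] = 0.0;
            }
            fft(re, im);
        }
        long elapsed = System.currentTimeMillis() - start;
        System.out.println(frames + " FFTs of size " + n + " in " + elapsed + " ms");
    }
}
```

Timing an equivalent routine in optimized C on the same frame sizes gives the kind of Java-versus-C comparison described above.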
`The Java applet is downloaded from the server. By default,
`the Java security model prevents an applet from accessing na-
tive resources. There are several possible approaches to grant an applet permission to access native resources; the handling of security policies in the Java model, however, is beyond the scope of this paper.
`
`IV. CODING OF CEPSTRAL FEATURES
`In the server-only model, toll-quality speech can be coded
`and transmitted to the server by using standard speech coding
`techniques, such as ADPCM at 32 kb/s, or newer schemes
that are used today in mobile telephony, such as Global System for Mobile Communications (GSM) coding or code-excited linear prediction (CELP) at bit rates of 13 kb/s or below. In Section V, however, we show that, in addition to the recognition performance degradation one encounters when using toll-quality instead of high-quality speech, there is a further drop in performance when hybrid coding schemes such as GSM or CELP are used at low bit rates.
`In contrast, for the client-server approach, we need only
`transmit the set of coefficients that will be used in recogni-
`tion. Mel frequency-warped cepstral coefficients (MFCC’s),
`constitute the set of features used by most state-of-the-art
`HMM-based speech recognizers today [10]. Because of their
`superior recognition performance, we have chosen to encode
`and transmit the set of cepstral coefficients rather than working
`with representations that are typically used in the speech-
`coding applications. Typical choices for the dimension of the
`feature vector and the rate at which it is computed are 13 and
`100 times per second, respectively [11]. Secondary features,
such as the first- and second-order derivatives of this feature vector, which are also used in recognition, do not have to be coded and transmitted, since this information can be obtained at the server side. Hence, one need only quantize a total of
`1300 parameters per second of speech.
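As an illustration of why the derivative features need not be transmitted, the sketch below regenerates first-order (delta) cepstra at the server from the decoded cepstral frames using a standard regression over neighboring frames. The regression half-width K and the boundary handling are illustrative assumptions, not necessarily the exact recipe of the recognizer used here.

```java
// Sketch: regenerating first-order (delta) cepstra at the server from the
// decoded cepstral frames. The regression half-width K and the boundary
// handling (frame replication) are illustrative assumptions.
public final class DeltaFeatures {

    /** cepstra[t][d]: frame t, coefficient d. Returns deltas of the same shape. */
    static double[][] computeDeltas(double[][] cepstra, int K) {
        int T = cepstra.length, D = cepstra[0].length;
        double norm = 0.0;
        for (int k = 1; k <= K; k++) norm += 2.0 * k * k;
        double[][] delta = new double[T][D];
        for (int t = 0; t < T; t++) {
            for (int d = 0; d < D; d++) {
                double acc = 0.0;
                for (int k = 1; k <= K; k++) {
                    // Replicate the first/last frame at the utterance boundaries.
                    int tPlus = Math.min(T - 1, t + k);
                    int tMinus = Math.max(0, t - k);
                    acc += k * (cepstra[tPlus][d] - cepstra[tMinus][d]);
                }
                delta[t][d] = acc / norm;
            }
        }
        return delta;
    }
}
```

Second-order (delta-delta) features can be obtained by applying the same operation to the delta frames.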
Discrete-density HMM’s also quantize the front-end features and then model the quantized features directly, using discrete densities. A common choice is to use six features—the energy, the vector of cepstral coefficients, and their first- and second-order derivatives—and to quantize them using separate vector-quantization (VQ) codebooks. In a typical discrete-density HMM [11], codebooks with 256 entries (8 bits) are used for the 12-dimensional cepstral-coefficient vector and for the corresponding vectors of cepstral derivatives, and codebooks with 32 entries (5 bits) are used for the three scalar energy features. If a discrete HMM approach is adopted for our client-server model, the required bit rate would be (3 × 8 + 3 × 5) × 100 = 3900 b/s.
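In contrast, the product-code VQ scheme examined in the remainder of this section partitions the cepstral vector into subvectors and quantizes each with its own, much smaller codebook. The sketch below illustrates the encoding of a single frame under such a scheme; the partition and the codebook contents are placeholders, and in practice the codebooks would be trained on cepstral data (for example, with the generalized Lloyd algorithm).

```java
// Sketch of product-code VQ encoding of one cepstral frame. The subvector
// partition and the codebooks are placeholders; real codebooks would be
// trained on cepstral data (e.g., with the generalized Lloyd algorithm).
public final class ProductCodeVq {

    /** Index of the nearest codeword (squared Euclidean distance). */
    static int quantizeSubvector(double[] x, double[][] codebook) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < codebook.length; i++) {
            double dist = 0.0;
            for (int d = 0; d < x.length; d++) {
                double diff = x[d] - codebook[i][d];
                dist += diff * diff;
            }
            if (dist < bestDist) { bestDist = dist; best = i; }
        }
        return best;
    }

    /**
     * Encode one frame. partition[s] lists the coefficient positions that form
     * subvector s; codebooks[s] is that subvector's codebook. Only the returned
     * codeword indices need to be transmitted to the server.
     */
    static int[] encodeFrame(double[] cepstra, int[][] partition, double[][][] codebooks) {
        int[] indices = new int[partition.length];
        for (int s = 0; s < partition.length; s++) {
            double[] sub = new double[partition[s].length];
            for (int d = 0; d < sub.length; d++) sub[d] = cepstra[partition[s][d]];
            indices[s] = quantizeSubvector(sub, codebooks[s]);
        }
        return indices;
    }
}
```

With b_s bits allocated to subvector s (a codebook of 2^{b_s} entries), the transmitted rate at the 100-frames/s analysis rate is 100 times the sum of the b_s, in b/s.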
`
`
`
`86
`
`IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, VOL. 17, NO. 1, JANUARY 1999
`
The bit-allocation algorithm starts with an initial assignment of bits to the subvectors and then increases the bit rate by adding bits to the subvector that yields the maximal incremental increase in recognition performance, as follows.
`Initialization: Allocate the initial number of bits to subvec-
`tors and evaluate speech recognition performance. Set this as
`the current configuration.
`Step One: For each subvector, increase its allocated number
`of bits by one and evaluate speech recognition performance,
`keeping the number of bits assigned to each of the remaining
`subvectors as in the current configuration. Assign the addi-
tional bit to the subvector that resulted in the maximal increase
`in recognition performance and set the new assignment as the
`current configuration.
`Step Two: If the desired recognition performance has been
`achieved, or the maximum available bit rate has been reached,
`stop. Otherwise, go to Step One.
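A minimal sketch of this greedy loop is given below. The WerEvaluator interface stands in for a full quantize/recognize cycle on development data, scored with the word error rate (WER) defined below; it is a placeholder of ours, not an API of the recognizer, and the stopping conditions mirror Step Two.

```java
// Sketch of the greedy bit-allocation loop described above. WerEvaluator is a
// placeholder for a full quantize/recognize cycle on development data with
// 2^bits[s] codewords per subvector; it is not an API of the recognizer.
public final class GreedyBitAllocation {

    interface WerEvaluator {
        double evaluateWer(int[] bitsPerSubvector);
    }

    static int[] allocate(int[] initialBits, int maxTotalBits, double targetWer,
                          WerEvaluator eval) {
        int[] current = initialBits.clone();
        double currentWer = eval.evaluateWer(current);   // Initialization
        int totalBits = 0;
        for (int b : current) totalBits += b;

        while (currentWer > targetWer && totalBits < maxTotalBits) {
            int bestSubvector = -1;
            double bestWer = Double.MAX_VALUE;
            // Step One: tentatively add one bit to each subvector in turn.
            for (int s = 0; s < current.length; s++) {
                int[] candidate = current.clone();
                candidate[s]++;
                double wer = eval.evaluateWer(candidate);
                if (wer < bestWer) { bestWer = wer; bestSubvector = s; }
            }
            // Keep the assignment that gave the largest improvement.
            current[bestSubvector]++;
            currentWer = bestWer;
            totalBits++;
            // Step Two: the loop condition decides whether to stop or repeat.
        }
        return current;
    }
}
```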
`Any available metric can be used to evaluate speech recog-
`nition performance. In this work we have used the WER,
`which is the percentage of words that were erroneously recog-
`nized (i.e., the recognizer has added, deleted, or replaced some
`of the words that have been spoken in the initial sentence).
Thus

WER = (S + D + I) / N × 100%

where S, D, and I are, respectively, the numbers of substituted, deleted, and inserted words, and N is the number of words in the spoken sentence.
`
`
`Fig. 2. Recognition performance as a function of the bit rate for speech coding (GSM) and for MFCC-coefficient encoding using nonuniform scalar
`quantization. The coefficient encoding performance is shown for both a high-quality and a toll-quality front end.
`
`TABLE III
`BIT RATES AND WER’S FOR SCALAR QUANTIZATION OF
`CEPSTRAL COEFFICIENTS IN HIGH-QUALITY SPEECH
`
`significantly better than the GSM performance, although the
`latter used a bit rate that was four times higher. In addition, we
`see that the nonuniform quantization outperforms the uniform
`quantization significantly, especially at low numbers of bits
`per cepstral coefficient.
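The difference between the two scalar schemes can be illustrated with the sketch below, which quantizes a single cepstral coefficient either uniformly over an assumed range or nonuniformly by snapping to the nearest of a set of reconstruction levels. In practice the nonuniform levels would be placed according to the empirical distribution of that coefficient (for example, by Lloyd-Max design); the ranges and levels here are placeholders.

```java
// Sketch: uniform versus nonuniform scalar quantization of one coefficient.
// The nonuniform reconstruction levels are placeholders; in practice they would
// be placed according to the coefficient's empirical distribution (e.g., Lloyd-Max).
public final class ScalarQuantizers {

    /** Uniform quantization of x over [min, max] with 2^bits levels; returns the cell index. */
    static int quantizeUniform(double x, double min, double max, int bits) {
        int levels = 1 << bits;
        double step = (max - min) / levels;
        int idx = (int) ((x - min) / step);
        return Math.max(0, Math.min(levels - 1, idx));
    }

    /** Nonuniform quantization: index of the nearest reconstruction level. */
    static int quantizeNonuniform(double x, double[] levels) {
        int best = 0;
        double bestDist = Math.abs(x - levels[0]);
        for (int i = 1; i < levels.length; i++) {
            double dist = Math.abs(x - levels[i]);
            if (dist < bestDist) { bestDist = dist; best = i; }
        }
        return best;
    }
}
```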
`A significant advantage of running the front end at the
`client side, however, is that we can use the high-quality front
`end that uses a higher sampling rate and a larger number
`of bits per waveform sample. The baseline performance for
the high-quality front end is 6.55%.
`
`
`
`88
`
`IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, VOL. 17, NO. 1, JANUARY 1999
`
`TABLE V
`PROGRESSION OF THE BIT-ALLOCATION ALGORITHM FOR THE CASE OF FIVE SUBVECTORS. THE BITS ASSIGNED TO EACH
`SUBVECTOR, THE TOTAL BIT RATE, AND THE CORRESPONDING WER ARE SHOWN AT INTERMEDIATE STEPS OF THE ALGORITHM
`
`TABLE VI
`PROGRESSION OF THE BIT-ALLOCATION ALGORITHM FOR THE CASE OF SCALAR QUANTIZATION (13 SUBVECTORS). THE BITS ASSIGNED TO EACH
`COEFFICIENT, THE TOTAL BIT RATE, AND THE CORRESPONDING WER ARE SHOWN AT INTERMEDIATE STEPS OF THE ALGORITHM
`
`allocation algorithm. The five subvectors consisted of the
`cepstral coefficients {(1, 5), (3, 9, 12, 13), (4, 6), (2, 7,
`11), (8, 10)} and {(1, 2), (3, 4), (5, 6, 7), (8, 9, 10), (11,
`12, 13)} for the correlation-based and the knowledge-based
`partition schemes, respectively. We see that the knowledge-
`based partitioning exhibits significantly better performance at
all bit rates, and converges to the unquantized WER of 6.55%.
`
`
`
`
`Fig. 3. Recognition performance as a function of the bit rate for various types of MFCC encoding. Nonuniform scalar quantization with constant and
`variable number of bits per coefficient. Product-code VQ with different numbers of subvectors.
`
The bit-allocation algorithm can also be applied to scalar quantization with a variable number of bits per coefficient, since it is a special case of product-code VQ with single-element subvectors. The progression of the algorithm in this case is shown in Table VI. The initial bit rate was 1700 b/s, obtained by assigning a total of 17 bits to the 13 coefficients, as shown in the first
`row of Table VI. The algorithm was sped up by increasing
`the number of bits at each step by two and assigning them
`to the two coefficients that decreased the WER the most.
`We can see that in this case rates of at least 2600–2800
`b/s are required before the unquantized-speech performance
`is reached. The final bit allocation uses 3 bits for the first four
`cepstral coefficients and 2 bits for the remaining coefficients.
`In Fig. 3, we have plotted speech recognition performance
`as a function of the bit rate for different numbers of subvectors
`in the product-code VQ (three and five) and for the nonuni-
`form scalar quantization with a variable number of bits per
`coefficient. In the same figure, we also show the WER for
`nonuniform scalar quantization using 2 bits per coefficient.
`The partitioning of cepstral coefficients into subvectors for the
`case of five subvectors was given above, whereas for the case
`of three subvectors the partitioning was {(1, 2, 3), (4, 5, 6,
`7, 8), (9, 10, 11, 12, 13)}. Scalar quantization with a variable
`number of bits demonstrates significantly better performance
`than the scalar quantization scheme with a fixed number
`of bits per coefficient
`that we examined in Section V-B,
reducing the WER to 6.81%.
`
`
`
`90
`
`IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, VOL. 17, NO. 1, JANUARY 1999
`
[10] S. B. Davis and P. Mermelstein, “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,” IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-28, pp. 357–366, Aug. 1980.
`[11] V. Digalakis and H. Murveit, “Genones: Optimizing the degree of
`mixture tying in a large vocabulary hidden Markov model based speech
recognizer,” IEEE Trans. Speech Audio Processing, vol. 4, pp. 281–289,
`July 1996.
`[12] A. Gersho and R. M. Gray, Vector Quantization and Signal Compression.
Norwell, MA: Kluwer, 1991.
`[13] J. Makhoul, S. Roucos, and H. Gish, “Vector quantization in speech
`coding,” Proc. IEEE, vol. 73, pp. 1551–1588, Nov. 1985.
`[14] P. Price, “Evaluation of spoken language systems: The ATIS domain,”
`in Proc. 3rd DARPA Speech and Natural Language Workshop, Hidden
`Valley, PA, June 1990, pp. 91–95.
`
`Leonardo G. Neumeyer (S’88–M’90) was born in
`Buenos Aires, Argentina. He received the Engineer
`degree in electronics from the University of Buenos
`Aires, Buenos Aires, Argentina, in 1988 and the
`M.Eng. degree from Carleton University, Ottawa,
`Canada, in 1990.
`From 1991 to 1992 he worked on the devel-
`opment of low bit rate speech coding systems at
`Novatel Communications, Calgary, Canada. Since
`March 1992 he has been with the Speech Technol-
`ogy and Research Laboratory at SRI International,
`Menlo Park, CA, where he works on speech recognition research. His research
`interests include speech recognition in noisy environments, speaker adaptation,
`speech technologies for language learning, and speech recognition in network
`environments.
`
`Vassilios V. Digalakis was born in Chania, Greece,
`on February 2, 1963. He received the Diploma in
`electrical engineering from the National Technical
`University of Athens, Athens, Greece, in 1986, the
`M.S. degree in electrical engineering from North-
`eastern University, Boston, MA, in 1988, and the
`Ph.D. degree in electrical and systems engineering
`from Boston University, Boston, MA, in 1992.
`From January 1992 to February 1995 he was with
`the Speech Technology and Research Laboratory of
`SRI International, Menlo Park, CA. At SRI he was
`a Principal Investigator for ARPA research contracts and developed new
`speech recognition and speaker adaptation algorithms for the DECIPHER
`speech recognition system. He also developed language education algorithms
`using speech recognition techniques. He is currently with the Department
`of Electronic and Computer Engineering of the Technical University of
`Crete, Chania, Greece, where he is an Assistant Professor. He teaches
`undergraduate and graduate courses on speech processing and on digital and
`analog communications. His research interests are in the areas of pattern and
`speech recognition, information theory, and digital communications.
`
`Manolis Perakakis was born in Ierapetra, Greece,
`on May 30, 1974. Since September 1993 he has
`been an undergraduate student in the Department
`of Electronics and Computer Engineering of the
`Technical University of Crete, Chania, Greece. He
`currently is pursuing the Diploma in electronics and
`computer engineering.
`His research interests include multimedia, net-
`works, and communication systems.
`
`
`