Quantization of Cepstral Parameters for Speech Recognition over the World Wide Web

Vassilios V. Digalakis, Leonardo G. Neumeyer, Member, IEEE, and Manolis Perakakis
Abstract—We examine alternative architectures for a client-server model of speech-enabled applications over the World Wide Web (WWW). We compare a server-only processing model, in which the client encodes and transmits the speech signal to the server, with a model in which the recognition front end runs locally at the client and encodes and transmits the cepstral coefficients to the recognition server over the Internet. We follow a novel encoding paradigm, trying to maximize recognition performance instead of perceptual reproduction, and we find that by transmitting the cepstral coefficients we can achieve significantly higher recognition performance at a fraction of the bit rate required when encoding the speech signal directly. We find that the bit rate required to achieve the recognition performance of high-quality unquantized speech is just 2000 bits per second.

Index Terms—Speech coding, speech recognition, World Wide Web.
I. INTRODUCTION

MOTIVATED by the explosive growth of the Internet, speech researchers have been working on the integration of speech technologies into the World Wide Web (WWW) [1]–[8]. Applications include Internet telephony, speech-enabled browsers, speech and natural language understanding systems, and speaker verification. Developers have successfully adapted existing systems, or created new ones, that can be deployed over the WWW.

In this paper we consider a client-server speech recognition system. We assume that the communication channels between the client and the server may have limited bandwidth, a realistic assumption for applications that communicate over the Internet or through wireless channels. The architecture of the client-server speech recognition system is shown in Fig. 1. A central server provides speech recognition services. The clients are deployed in heterogeneous environments, such as personal computers, smart devices, and mobile devices. Speech is captured by the clients and, after some local processing, the information is sent to the server. The server recognizes the speech according to an application framework and sends the result string or action back to the client.

Essentially, this system uses two major speech technologies: speech recognition and speech coding.
Manuscript received February 1998; revised July 1998. This work was supported in part by Telia Research of Sweden and SRI.
V. V. Digalakis is with the Department of Electronics and Computer Engineering, Technical University of Crete, Chania 73100, Greece, and with SRI International, Menlo Park, CA 94025 USA.
L. G. Neumeyer is with SRI International, Menlo Park, CA 94025 USA.
M. Perakakis is with the Department of Electronics and Computer Engineering, Technical University of Crete, Chania 73100, Greece.
Publisher Item Identifier S 0733-8716(99)00005-0.
Fig. 1. Client-server speech recognition system.
In a complex dialog system, coding would be required to present audio prompts to the user. Standard coding techniques can be used to send the speech over low-bandwidth channels and produce perceptually acceptable speech for the user. In this paper, however, we focus on the opposite path, that is, the speech data sent from the client to the server.

Traditional speech coding research focuses on the performance tradeoff between transmission rate and perceptual reproduction quality. To achieve high perceptual quality at low transmission rates, several successful techniques have been developed, resulting in dramatic technological advances. The data compression problem for state-of-the-art hidden Markov model (HMM)-based speech recognition systems differs from the traditional speech coding problem in that the optimization criterion is recognition accuracy instead of the perceptual quality of the reproduced data. In addition to the practical goal of developing a client-server architecture, we are also interested in understanding how much, and what, information is actually being modeled by the HMMs. Understanding which data are redundant in the representation of the speech signal may open the door to new ideas on how to model it better for recognition.

In Section II of this paper we briefly review HMM-based speech recognizers. Section III examines alternative architectures for the implementation of speech-enabled applications over the WWW. In Section IV we discuss techniques for encoding the front-end feature vectors at the client side, and in Section V we present our experimental results. Section VI presents our summary and conclusions.
II. HMM-BASED SPEECH RECOGNITION

Today's state-of-the-art speech recognizers are based on statistical techniques, with hidden Markov modeling being the dominant approach. The typical components of a speech recognition and understanding system are the front-end processor, the decoder with its acoustic and language models, and the language understanding component. The latter component extracts the meaning of a decoded word sequence and is an essential part of a natural language system. The remainder of this section briefly reviews the front end and the decoder.

The front-end processor typically performs a short-time Fourier analysis and extracts a sequence of observation vectors (or acoustic vectors).
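To make this step concrete, the following is a minimal single-frame sketch (our illustration, not the authors' code) of mel-cepstral analysis: magnitude spectrum, triangular mel filterbank, log compression, and a DCT yielding the cepstral coefficients used later in Section IV. The filterbank size, frame length, and naive DFT are illustrative assumptions, not the paper's settings.

```java
// Illustrative single-frame mel-cepstrum computation (not the paper's code).
// A naive O(n^2) DFT is used for clarity; a real front end would use an FFT.
public class MelCepstrumSketch {
    static double[] melCepstrum(double[] frame, double sampleRate,
                                int numFilters, int numCeps) {
        int n = frame.length;
        double[] mag = new double[n / 2 + 1];
        for (int k = 0; k < mag.length; k++) {           // magnitude spectrum
            double re = 0, im = 0;
            for (int t = 0; t < n; t++) {
                double ang = -2 * Math.PI * k * t / n;
                re += frame[t] * Math.cos(ang);
                im += frame[t] * Math.sin(ang);
            }
            mag[k] = Math.sqrt(re * re + im * im);
        }
        // Triangular filters with centers equally spaced on the mel scale.
        double melMax = 2595 * Math.log10(1 + (sampleRate / 2) / 700.0);
        double[] logE = new double[numFilters];
        for (int m = 0; m < numFilters; m++) {
            double lo  = 700 * (Math.pow(10, m * melMax / (numFilters + 1) / 2595) - 1);
            double mid = 700 * (Math.pow(10, (m + 1) * melMax / (numFilters + 1) / 2595) - 1);
            double hi  = 700 * (Math.pow(10, (m + 2) * melMax / (numFilters + 1) / 2595) - 1);
            double sum = 1e-10;
            for (int k = 0; k < mag.length; k++) {
                double f = k * sampleRate / n;
                if (f > lo && f < hi)                    // triangular weighting
                    sum += mag[k] * (f <= mid ? (f - lo) / (mid - lo)
                                              : (hi - f) / (hi - mid));
            }
            logE[m] = Math.log(sum);                     // log compression
        }
        double[] cep = new double[numCeps];              // DCT-II -> cepstra
        for (int i = 0; i < numCeps; i++)
            for (int m = 0; m < numFilters; m++)
                cep[i] += logE[m] * Math.cos(Math.PI * i * (m + 0.5) / numFilters);
        return cep;
    }
}
```

With the parameter choices discussed in Section IV, such a front end would emit on the order of 13 coefficients per 10-ms frame.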
• Feature extraction is only a small part of the computation that takes place in a speech recognition and understanding application.
• Speech recognition needs only a small part of the information that the speech signal carries. The representation of the speech signal used for recognition concentrates on the part of the signal that is related to the vocal-tract shape.

The first observation implies that we can run the front-end processing (the feature extraction) at the client side on a much wider range of machines than those that can support the whole recognition process. There are additional advantages to client-server processing over the client-only model. The recognizer may need information that exists on the server side in order to guide the decoding process. In the client-only model this information would have to be transmitted to the client; in the client-server model this is unnecessary, since the decoding takes place at the server side. To make speech recognition servers available from a variety of systems, front-end processing and compression can be standardized. Standard front-end modules can be installed on the client machines as a system resource, a Java applet, or a browser plug-in.
Our second observation clearly shows the advantage of client-server processing over the server-only model. Traditional speech coding focuses on the perceptual reproduction quality of the coded speech. As a result, the speech coder may transmit redundant information and, because of bandwidth limitations, at the same time introduce noise into the features that are important in the recognition process. When the objective is to transmit the speech to a recognition server, there is a clear shift in the speech-coding paradigm, and the objective of the coding process should be recognition accuracy. If the information used by the recognition process is contained in a set of features, then only this set of features needs to be compressed and transmitted to the server. For example, typical state-of-the-art speech recognizers represent the vocal-tract information using a set of the first few cepstral coefficients. In view of our objective, we should expect a significant reduction in bit rate if we encode this set of cepstral features, as opposed to encoding the speech signal itself.

Of course, encoding and transmitting only the front-end processed tokens can become a disadvantage since, without any representation of the speech associated with these tokens, the input speech cannot be labeled. As a result, it may not be possible to monitor in-service performance or to collect labeled speech data for development and performance improvement. To overcome this limitation and collect labeled data during the initial deployment of the application, it is possible to transmit the original speech, encoded with a very low bit-rate coder, as side information. This side information can be transmitted on top of the encoded front-end tokens during the development phase only.

The client-server model can be applied to the Internet as well as to wireless channels. The Aurora Project is a joint initiative in which a number of companies, including Alcatel, Ericsson, IBM, Matra, Nokia, Nortel, and Siemens, are working to establish a global standard for distributed speech recognition over wireless channels [7]. The speech representations adopted by different speech recognition servers are usually very similar. The definition of the standard is not an easy task, however, since speech-recognition developers put a significant amount of effort into fine-tuning the recognizer parameters to a specific representation, making it difficult to switch to a new, albeit similar, representation. This task could be facilitated by the adoption of an intermediate representation, such as the mel-filterbank coefficients that exist in many front ends. On the Internet, the client-server model has been adopted by the BBN Labs speech on the Internet (SPIN) system that was presented in [8]. Although the details of this system are not known, it was reported that it encodes speech at 3.8 kb/s in a form suitable for discrete-density HMMs.
In our work, we follow the client-server model using the encoding scheme described in Section IV. We implemented a highly modular signal processing front end in Java to compute the cepstral coefficients and encode the parameters. We verified that the system is fast enough to handle the feature extraction in real time on a 166-MHz Pentium computer using a Java virtual machine (JVM) with a just-in-time (JIT) compiler. We also ran benchmarks to compare performance on the computation of the fast Fourier transform (FFT). We found that optimized C code is twice as fast as the Java implementation. We believe that as JVMs become more efficient, the gap between C and Java performance will become even smaller.

The Java applet is downloaded from the server. By default, the Java security model prevents an applet from accessing native resources. There are various possible approaches to granting an applet permission to access native resources; the handling of security policies in the Java model is beyond the scope of this paper.
IV. CODING OF CEPSTRAL FEATURES

In the server-only model, toll-quality speech can be coded and transmitted to the server by using standard speech-coding techniques, such as ADPCM at 32 kb/s, or newer schemes used today in mobile telephony, such as global standard for mobile communication (GSM) or code-excited linear prediction (CELP) coding at bit rates of 13 kb/s or below. In Section V, however, we show that in addition to the recognition performance degradation that one encounters when using toll-quality instead of high-quality speech, there is an additional drop in performance when hybrid coding schemes such as GSM or CELP are used at low bit rates.

In contrast, in the client-server approach we need only transmit the set of coefficients that will be used in recognition. Mel frequency-warped cepstral coefficients (MFCCs) constitute the set of features used by most state-of-the-art HMM-based speech recognizers today [10]. Because of their superior recognition performance, we have chosen to encode and transmit the set of cepstral coefficients rather than working with representations that are typically used in speech-coding applications. Typical choices for the dimension of the feature vector and the rate at which it is computed are 13 and 100 times per second, respectively [11]. Secondary features, such as the first- and second-order derivatives of this feature vector, which are also used in recognition, do not have to be coded and transmitted, since this information can be obtained at the server side. Hence, one need only quantize a total of 13 × 100 = 1300 parameters per second of speech.
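To make the last point concrete, the following is a minimal sketch (our illustration, not the authors' code) of how the server can regenerate first- and second-order derivative features from the decoded cepstral stream, so they never need to be transmitted. A simple two-frame symmetric difference is assumed here; production front ends often use a longer regression window.

```java
// Illustrative sketch (not from the paper): regenerating derivative
// ("delta") features at the server from the decoded cepstra. A symmetric
// difference over adjacent frames is assumed.
public class DeltaFeatures {
    // c[t][k]: cepstral coefficient k of frame t; returns frame-wise deltas.
    static double[][] deltas(double[][] c) {
        int frames = c.length, dims = c[0].length;
        double[][] d = new double[frames][dims];
        for (int t = 0; t < frames; t++) {
            int prev = Math.max(t - 1, 0), next = Math.min(t + 1, frames - 1);
            for (int k = 0; k < dims; k++)
                d[t][k] = (c[next][k] - c[prev][k]) / 2.0;
        }
        return d;
    }

    public static void main(String[] args) {
        double[][] cep = {{1, 2}, {2, 4}, {4, 8}};  // three toy 2-D cepstral frames
        double[][] d1 = deltas(cep);                // first-order derivatives
        double[][] d2 = deltas(d1);                 // second-order derivatives
        System.out.println(java.util.Arrays.deepToString(d1));
        System.out.println(java.util.Arrays.deepToString(d2));
    }
}
```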
Discrete-density HMMs also quantize the front-end features and then model the quantized features directly, using discrete densities. A common choice is to use six features (the energy, the vector of cepstral coefficients, and their first- and second-order derivatives) and to quantize them by using separate vector-quantization (VQ) codebooks. In a typical discrete-density HMM [11], codebooks with 256 entries (8 bits) are used for the 12-dimensional cepstral-coefficient vector and the corresponding vectors of cepstral derivatives, and codebooks with 32 entries (5 bits) are used for the three scalar energy features. If a discrete-HMM approach is adopted for our client-server model, the required bit rate would be (3 × 8 + 3 × 5) × 100 = 3900 b/s.
We start with an initial bit allocation to subvectors and then increase the bit rate by adding bits to the subvectors that yield the maximal incremental increase in recognition performance, as follows.

Initialization: Allocate the initial number of bits to the subvectors and evaluate speech recognition performance. Set this as the current configuration.

Step One: For each subvector, increase its allocated number of bits by one and evaluate speech recognition performance, keeping the number of bits assigned to each of the remaining subvectors as in the current configuration. Assign the additional bit to the subvector that resulted in the maximal increase in recognition performance, and set the new assignment as the current configuration.

Step Two: If the desired recognition performance has been achieved, or the maximum available bit rate has been reached, stop. Otherwise, go to Step One.
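A compact sketch of this greedy loop follows (ours, not the authors' implementation); the recognition run that scores a candidate allocation is abstracted as a callback, since the paper evaluates each configuration with an actual recognition experiment.

```java
// Illustrative sketch of the greedy bit-allocation loop described above.
// evalWer stands in for a full recognition run that measures WER on a
// development set for a given per-subvector bit allocation.
import java.util.Arrays;
import java.util.function.ToDoubleFunction;

public class GreedyBitAllocation {
    static int[] allocate(int[] initialBits, int maxTotalBits,
                          double targetWer, ToDoubleFunction<int[]> evalWer) {
        int[] current = initialBits.clone();                 // Initialization
        double wer = evalWer.applyAsDouble(current);
        while (wer > targetWer && Arrays.stream(current).sum() < maxTotalBits) {
            int best = -1;
            double bestWer = Double.MAX_VALUE;
            for (int i = 0; i < current.length; i++) {       // Step One: trial bit
                current[i]++;
                double w = evalWer.applyAsDouble(current);
                if (w < bestWer) { bestWer = w; best = i; }
                current[i]--;                                // undo the trial
            }
            current[best]++;                                 // keep the best bit
            wer = bestWer;                                   // Step Two: loop
        }
        return current;
    }
}
```

Each call to evalWer is expensive (a full recognition run), which is why the search reported in Section V is later sped up by assigning two bits per iteration.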
Any available metric can be used to evaluate speech recognition performance. In this work we have used the word error rate (WER), which is the percentage of words that were erroneously recognized (i.e., the recognizer inserted, deleted, or substituted some of the words that were spoken in the original sentence). Thus

    WER = (N_ins + N_del + N_sub) / N × 100%

where N_ins, N_del, and N_sub are the numbers of inserted, deleted, and substituted words, respectively, and N is the number of words in the reference sentence.
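For concreteness, a small sketch (ours; the paper defines WER but gives no code) that computes WER from the minimum edit distance between the reference and hypothesized word strings:

```java
// Illustrative WER computation via Levenshtein alignment between the
// reference and hypothesis word sequences.
public class Wer {
    static double wer(String[] ref, String[] hyp) {
        int[][] d = new int[ref.length + 1][hyp.length + 1];
        for (int i = 0; i <= ref.length; i++) d[i][0] = i;   // deletions
        for (int j = 0; j <= hyp.length; j++) d[0][j] = j;   // insertions
        for (int i = 1; i <= ref.length; i++)
            for (int j = 1; j <= hyp.length; j++) {
                int sub = d[i - 1][j - 1] + (ref[i - 1].equals(hyp[j - 1]) ? 0 : 1);
                d[i][j] = Math.min(sub, Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1));
            }
        return 100.0 * d[ref.length][hyp.length] / ref.length;
    }

    public static void main(String[] args) {
        String[] ref = "show me flights to boston".split(" ");
        String[] hyp = "show flights to austin".split(" ");
        System.out.printf("WER = %.1f%%%n", wer(ref, hyp));  // one deletion, one
    }                                                        // substitution: 40.0%
}
```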
Fig. 2. Recognition performance as a function of the bit rate for speech coding (GSM) and for MFCC-coefficient encoding using nonuniform scalar quantization. The coefficient-encoding performance is shown for both a high-quality and a toll-quality front end.

TABLE III
BIT RATES AND WERs FOR SCALAR QUANTIZATION OF CEPSTRAL COEFFICIENTS IN HIGH-QUALITY SPEECH

significantly better than the GSM performance, although the latter used a bit rate that was four times higher. In addition, we see that nonuniform quantization significantly outperforms uniform quantization, especially at low numbers of bits per cepstral coefficient.
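For context on what "nonuniform" means here, the following is a sketch of the classic Lloyd-Max iteration for training a nonuniform scalar codebook from sample data. This is our illustration of one standard design method; the paper states that nonuniform scalar quantization is used but does not detail its design procedure at this point.

```java
// Illustrative Lloyd-Max design of a nonuniform scalar quantizer:
// alternate nearest-codeword assignment and centroid update.
import java.util.Arrays;

public class LloydMaxScalar {
    static double[] design(double[] samples, int levels, int iters) {
        Arrays.sort(samples = samples.clone());
        double[] codebook = new double[levels];
        for (int i = 0; i < levels; i++)                     // init at quantiles
            codebook[i] = samples[(int) ((i + 0.5) * samples.length / levels)];
        for (int it = 0; it < iters; it++) {
            double[] sum = new double[levels];
            int[] cnt = new int[levels];
            for (double x : samples) {                       // nearest codeword
                int best = 0;
                for (int i = 1; i < levels; i++)
                    if (Math.abs(x - codebook[i]) < Math.abs(x - codebook[best]))
                        best = i;
                sum[best] += x;
                cnt[best]++;
            }
            for (int i = 0; i < levels; i++)                 // centroid update
                if (cnt[i] > 0) codebook[i] = sum[i] / cnt[i];
        }
        return codebook;
    }
}
```

The resulting codewords crowd together where the coefficient's density is high, which is why nonuniform quantization wins at low bit allocations.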
A significant advantage of running the front end at the client side, however, is that we can use the high-quality front end, which uses a higher sampling rate and a larger number of bits per waveform sample. The baseline performance for the high-quality front end is 6.55%.
TABLE V
PROGRESSION OF THE BIT-ALLOCATION ALGORITHM FOR THE CASE OF FIVE SUBVECTORS. THE BITS ASSIGNED TO EACH SUBVECTOR, THE TOTAL BIT RATE, AND THE CORRESPONDING WER ARE SHOWN AT INTERMEDIATE STEPS OF THE ALGORITHM

TABLE VI
PROGRESSION OF THE BIT-ALLOCATION ALGORITHM FOR THE CASE OF SCALAR QUANTIZATION (13 SUBVECTORS). THE BITS ASSIGNED TO EACH COEFFICIENT, THE TOTAL BIT RATE, AND THE CORRESPONDING WER ARE SHOWN AT INTERMEDIATE STEPS OF THE ALGORITHM

allocation algorithm. The five subvectors consisted of the cepstral coefficients {(1, 5), (3, 9, 12, 13), (4, 6), (2, 7, 11), (8, 10)} and {(1, 2), (3, 4), (5, 6, 7), (8, 9, 10), (11, 12, 13)} for the correlation-based and the knowledge-based partition schemes, respectively. We see that the knowledge-based partitioning exhibits significantly better performance at all bit rates, and converges to the unquantized WER of 6.55%
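To illustrate the product-code structure, here is a sketch (ours, with toy codebooks; the paper's codebooks are trained on speech data) of encoding a 13-dimensional cepstral vector with the knowledge-based partition above, one codebook per subvector:

```java
// Illustrative product-code VQ encoder: each subvector of the cepstral
// vector is quantized against its own codebook, and the transmitted code
// is the tuple of per-subvector codeword indices.
public class ProductCodeVq {
    // Knowledge-based partition from the text (0-based indices):
    // {(1,2), (3,4), (5,6,7), (8,9,10), (11,12,13)}
    static final int[][] PARTITION = {{0, 1}, {2, 3}, {4, 5, 6}, {7, 8, 9}, {10, 11, 12}};

    // codebooks[s][c] is codeword c for subvector s, with PARTITION[s].length dims.
    static int[] encode(double[] cepstra, double[][][] codebooks) {
        int[] indices = new int[PARTITION.length];
        for (int s = 0; s < PARTITION.length; s++) {
            double best = Double.MAX_VALUE;
            for (int c = 0; c < codebooks[s].length; c++) {  // nearest codeword
                double dist = 0;
                for (int d = 0; d < PARTITION[s].length; d++) {
                    double diff = cepstra[PARTITION[s][d]] - codebooks[s][c][d];
                    dist += diff * diff;
                }
                if (dist < best) { best = dist; indices[s] = c; }
            }
        }
        return indices;          // one index per subvector, concatenated for transmission
    }
}
```

The per-frame bit budget is the sum of log2(codebook size) over the subvectors, which is exactly the quantity that the bit-allocation algorithm of Section IV distributes.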
Fig. 3. Recognition performance as a function of the bit rate for various types of MFCC encoding: nonuniform scalar quantization with constant and variable numbers of bits per coefficient, and product-code VQ with different numbers of subvectors.

since it is a special case of product-code VQ with single-element subvectors. The progression of the algorithm in this case is shown in Table VI. The initial bit rate was 1700 b/s, obtained by assigning 17 bits to the 13 coefficients, as shown in the first row of Table VI. The algorithm was sped up by increasing the number of bits at each step by two and assigning them to the two coefficients that decreased the WER the most. We can see that in this case rates of at least 2600–2800 b/s are required before the unquantized-speech performance is reached. The final bit allocation uses 3 bits for each of the first four cepstral coefficients and 2 bits for each of the remaining coefficients.

In Fig. 3 we have plotted speech recognition performance as a function of the bit rate for different numbers of subvectors in the product-code VQ (three and five) and for nonuniform scalar quantization with a variable number of bits per coefficient. In the same figure we also show the WER for nonuniform scalar quantization using 2 bits per coefficient. The partitioning of cepstral coefficients into subvectors for the case of five subvectors was given above, whereas for the case of three subvectors the partitioning was {(1, 2, 3), (4, 5, 6, 7, 8), (9, 10, 11, 12, 13)}. Scalar quantization with a variable number of bits demonstrates significantly better performance than the scalar quantization scheme with a fixed number of bits per coefficient that we examined in Section V-B, reducing the WER to 6.81%.
[10] S. B. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-28, pp. 357–366, Aug. 1980.
[11] V. Digalakis and H. Murveit, "Genones: Optimizing the degree of mixture tying in a large vocabulary hidden Markov model based speech recognizer," IEEE Trans. Speech Audio Processing, vol. 4, pp. 281–289, July 1996.
[12] A. Gersho and R. M. Gray, Vector Quantization and Signal Compression. Norwell, MA: Kluwer, 1991.
[13] J. Makhoul, S. Roucos, and H. Gish, "Vector quantization in speech coding," Proc. IEEE, vol. 73, pp. 1551–1588, Nov. 1985.
[14] P. Price, "Evaluation of spoken language systems: The ATIS domain," in Proc. 3rd DARPA Speech and Natural Language Workshop, Hidden Valley, PA, June 1990, pp. 91–95.
Leonardo G. Neumeyer (S'88–M'90) was born in Buenos Aires, Argentina. He received the Engineer degree in electronics from the University of Buenos Aires, Buenos Aires, Argentina, in 1988 and the M.Eng. degree from Carleton University, Ottawa, Canada, in 1990.

From 1991 to 1992 he worked on the development of low bit-rate speech coding systems at Novatel Communications, Calgary, Canada. Since March 1992 he has been with the Speech Technology and Research Laboratory at SRI International, Menlo Park, CA, where he works on speech recognition research. His research interests include speech recognition in noisy environments, speaker adaptation, speech technologies for language learning, and speech recognition in network environments.

Vassilios V. Digalakis was born in Chania, Greece, on February 2, 1963. He received the Diploma in electrical engineering from the National Technical University of Athens, Athens, Greece, in 1986, the M.S. degree in electrical engineering from Northeastern University, Boston, MA, in 1988, and the Ph.D. degree in electrical and systems engineering from Boston University, Boston, MA, in 1992.

From January 1992 to February 1995 he was with the Speech Technology and Research Laboratory of SRI International, Menlo Park, CA. At SRI he was a Principal Investigator for ARPA research contracts and developed new speech recognition and speaker adaptation algorithms for the DECIPHER speech recognition system. He also developed language education algorithms using speech recognition techniques. He is currently with the Department of Electronic and Computer Engineering of the Technical University of Crete, Chania, Greece, where he is an Assistant Professor. He teaches undergraduate and graduate courses on speech processing and on digital and analog communications. His research interests are in the areas of pattern and speech recognition, information theory, and digital communications.

Manolis Perakakis was born in Ierapetra, Greece, on May 30, 1974. Since September 1993 he has been an undergraduate student in the Department of Electronics and Computer Engineering of the Technical University of Crete, Chania, Greece, where he is pursuing the Diploma in electronics and computer engineering.

His research interests include multimedia, networks, and communication systems.