A Review of Algorithms for Perceptual Coding of Digital Audio
`Signals †
`
`Ted Painter, Student Member IEEE, and Andreas Spanias, Senior Member IEEE
`
`Department of Electrical Engineering, Telecommunications Research Center
`Arizona State University, Tempe, Arizona 85287-7206
`spanias@asu.edu, painter@asu.edu
`
`ABSTRACT
`During the last decade, CD-quality digital audio has essentially replaced analog audio. During this same period, new
`digital audio applications have emerged for network, wireless, and multimedia computing systems which face such con-
`straints as reduced channel bandwidth, limited storage capacity, and low cost. These new applications have created a de-
`mand for high-quality digital audio delivery at low bit rates. In response to this need, considerable research has been de-
`voted to the development of algorithms for perceptually transparent coding of high-fidelity (CD-quality) digital audio. As a
`result, many algorithms have been proposed, and several have now become international and/or commercial product stan-
`dards. This paper reviews algorithms for perceptually transparent coding of CD-quality digital audio, including both re-
`search and standardization activities. The paper is organized as follows. First, psychoacoustic principles are described
`with the MPEG psychoacoustic signal analysis model 1 discussed in some detail. Then, we review methodologies which
`achieve perceptually transparent coding of FM- and CD-quality audio signals, including algorithms which manipulate
`transform components and subband signal decompositions. The discussion concentrates on architectures and applications
`of those techniques which utilize psychoacoustic models to exploit efficiently masking characteristics of the human receiver.
`Several algorithms which have become international and/or commercial standards are also presented, including the
`ISO/MPEG family and the Dolby AC-3 algorithms. The paper concludes with a brief discussion of future research direc-
`tions.
`
`I. INTRODUCTION
`Audio coding or audio compression algorithms are
`used to obtain compact digital representations of high-
`fidelity (wideband) audio signals for the purpose of ef-
`ficient transmission or storage. The central objective in
`audio coding is to represent the signal with a minimum
`number of bits while achieving transparent signal re-
`production, i.e., while generating output audio which
`cannot be distinguished from the original input, even by
`a sensitive listener (“golden ears”). This paper gives a
`review of algorithms for transparent coding of high-
`fidelity audio.
`The introduction of the compact disk (CD) in the
`early eighties [1] brought to the fore all of the advan-
`tages of digital audio representation, including unprece-
`dented high-fidelity, dynamic range, and robustness.
`These advantages, however, came at the expense of
`high data rates. Conventional CD and digital audio tape
`(DAT) systems are typically sampled at 44.1 or 48 kilo-
`hertz (kHz), using pulse code modulation (PCM) with a
`sixteen bit sample resolution. This results in uncom-
`pressed data rates of 705.6/768 kilobits per second
`(kbps) for a monaural channel, or 1.41/1.54 megabits
`per second (Mbps) for a stereo pair at 44.1/48 kHz, re-
`spectively. Although high, these data rates were ac-
`
`commodated successfully in first generation digital
`audio applications such as CD and DAT. Unfortu-
`nately, second generation multimedia applications and
`wireless systems in particular are often subject to band-
`width or cost constraints which are incompatible with
`high data rates. Because of the success enjoyed by the
`first generation, however, end users have come to ex-
`pect “CD-quality” audio reproduction from any digital
`system. New network and wireless multimedia digital
`audio systems, therefore, must reduce data rates without
`compromising reproduction quality. These and other
`considerations have motivated considerable research
`during the last decade towards formulation of compres-
`sion schemes which can satisfy simultaneously the con-
`flicting demands of high compression ratios and trans-
parent reproduction quality for high-fidelity audio signals
[2][3][4][5][6][7][8][9][10][11]. As a result, several
standards have been developed [12][13][14][15],
`particularly in the last five years [16][17][18][19], and
`several are now being deployed commercially [94]
`[97][100][102] (Table 2).
`A. GENERIC PERCEPTUAL AUDIO CODING AR-
`CHITECTURE
`This review considers several classes of analysis-
`synthesis data compression algorithms, including those
`
† Portions of this work have been sponsored by a grant from the NDTC Committee of the Intel Corporation. Direct all communications to A. Spanias.
`
`
`

`
which manipulate: transform components, time-domain sequences from critically sampled banks of bandpass filters, linear predictive coding (LPC) model parameters, or some hybrid parametric set. We note here that although the enormous capacity of new storage media such as Digital Versatile Disc (DVD) can accommodate lossless audio coding [20][21], the research interest and hence all of the algorithms we describe are lossy compression schemes which seek to exploit the psychoacoustic principles described in section two. Lossy schemes offer the advantage of lower bit rates (e.g., less than 1 bit per sample) relative to lossless schemes (e.g., 10 bits per sample). Naturally, there is a debate over the quality limitations associated with lossy compression. In fact, some experts believe that uncompressed digital CD-quality audio (44.1 kHz/16b) is intrinsically inferior to the analog original. They contend that sample rates above 55 kHz and word lengths greater than 20 bits [21] are necessary to achieve transparency in the absence of any compression. It is beyond the scope of this review to address this debate.
Before considering different classes of audio coding algorithms, it is first useful to note the architectural similarities which characterize most perceptual audio coders. The lossy compression systems described throughout the remainder of this review achieve coding gain by exploiting both perceptual irrelevancies and statistical redundancies. All of these algorithms are based on the generic architecture shown in Fig. 1. The coders typically segment input signals into quasi-stationary frames ranging from 2 to 50 milliseconds in duration. A time-frequency analysis section then decomposes each analysis frame. The time/frequency analysis approximates the temporal and spectral analysis properties of the human auditory system. It transforms input audio into a set of parameters which can be quantized and encoded according to a perceptual distortion metric. Depending on overall system objectives and design philosophy, the time-frequency analysis section might contain a
• Unitary transform
• Time-invariant bank of uniform bandpass filters
• Time-varying (signal-adaptive), critically sampled bank of non-uniform bandpass filters
• Hybrid transform/filterbank signal analyzer
• Harmonic/sinusoidal analyzer
• Source-system analysis (LPC/Multipulse excitation)
The choice of time-frequency analysis methodology always involves a fundamental tradeoff between time and frequency resolution requirements. Perceptual distortion control is achieved by a psychoacoustic signal analysis section which estimates signal masking power based on psychoacoustic principles (see section two). The psychoacoustic model delivers masking thresholds which quantify the maximum amount of distortion that can be injected at each point in the time-frequency plane during quantization and encoding of the time-frequency parameters without introducing audible artifacts in the reconstructed signal. The psychoacoustic model therefore allows the quantization and encoding section to exploit perceptual irrelevancies in the time-frequency parameter set. The quantization and encoding section can also exploit statistical redundancies through classical techniques such as differential pulse code modulation (DPCM) or adaptive DPCM (ADPCM). Quantization might be uniform or pdf-optimized (Lloyd-Max), and it might be performed on either scalar or vector quantities (VQ). Once a quantized compact parametric set has been formed, remaining redundancies are typically removed through run-length (RL) and entropy (e.g. Huffman, arithmetic, LZW) coding techniques. Since the psychoacoustic distortion control model is signal adaptive, most algorithms are inherently variable rate. Fixed channel rate requirements are usually satisfied through buffer feedback schemes, which often introduce encoding delays.
[Figure 1 block diagram: Time/Frequency Analysis, Psychoacoustic Analysis, Masking Thresholds, Bit Allocation, Quantization and Encoding]
Fig. 1. Generic Perceptual Audio Encoder
`
The study of perceptual entropy (PE) suggests that transparent coding is possible in the neighborhood of 2 bits per sample [45] for most high-fidelity audio sources (~88 kbps given 44.1 kHz sampling). The lossy perceptual coding algorithms discussed in the remainder of this paper confirm this possibility. In fact, several coders approach transparency in the neighborhood of 1 bit per sample. Regardless of design details, all perceptual audio coders seek to achieve transparent quality at low bit rates with tractable complexity and manageable delay. The discussion of algorithms given in sections three through five brings to light many of the tradeoffs involved with the various coder design philosophies.
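The bit-rate figures quoted in this introduction follow from simple arithmetic. The short Python sketch below reproduces them; the helper name `pcm_rate_kbps` is introduced here purely for illustration and does not come from any standard.

```python
def pcm_rate_kbps(sample_rate_hz, bits_per_sample, channels=1):
    """Uncompressed PCM data rate in kilobits per second."""
    return sample_rate_hz * bits_per_sample * channels / 1000.0


# 16-bit PCM at the CD (44.1 kHz) and DAT (48 kHz) sampling rates.
cd_mono = pcm_rate_kbps(44100, 16)        # 705.6 kbps
dat_mono = pcm_rate_kbps(48000, 16)       # 768.0 kbps
cd_stereo = pcm_rate_kbps(44100, 16, 2)   # 1411.2 kbps, about 1.41 Mbps
dat_stereo = pcm_rate_kbps(48000, 16, 2)  # 1536.0 kbps, about 1.54 Mbps

# Rate implied by roughly 2 bits per sample at 44.1 kHz.
pe_rate = pcm_rate_kbps(44100, 2)         # 88.2 kbps
```

Relative to the 705.6 kbps monaural CD rate, a coder operating near 2 bits per sample corresponds to a compression ratio of about 8:1.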
B. PAPER ORGANIZATION
The rest of the paper is organized as follows. In section II, psychoacoustic principles are described which can be exploited for significant coding gain. Johnston's notion of perceptual entropy is presented as a measure of the fundamental limit of transparent compression for audio. Sections III through V review state-of-the-art algorithms which achieve transparent coding of FM- and CD-quality audio signals, including several techniques which are established in international standards. Transform coding methodologies are described in section III, and subband coding algorithms are addressed in section IV. In addition to methods based on uniform bandwidth filterbanks, section IV covers coding methods which utilize discrete wavelet transforms
`
`

`
[Figure 2 plot: Sound Pressure Level, SPL (dB) versus Frequency (Hz), logarithmic frequency axis]
Fig. 2. The Absolute Threshold of Hearing
`
quiet threshold is well approximated [37] by the non-linear function
$$T_q(f) = 3.64\,(f/1000)^{-0.8} - 6.5\,e^{-0.6\,(f/1000 - 3.3)^2} + 10^{-3}\,(f/1000)^4 \quad \text{(dB SPL)} \qquad (1)$$
`which is representative of a young listener with acute
`hearing. When applied to signal compression, Tq(f) can
`be interpreted as a maximum allowable energy level for
`coding distortions introduced in the frequency domain
`(Fig. 2). Algorithm designers have no a priori knowl-
`edge regarding actual playback levels, therefore the
`sound pressure level (SPL) curve is often referenced to
`the coding system by equating the lowest point on the
`curve (i.e., 4 kHz) to the energy in +/- 1 bit of signal
`amplitude. Such a practice is common in algorithms
`which utilize the absolute threshold of hearing.
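As a quick numerical check on Eq. (1), the sketch below evaluates the threshold in quiet on a coarse frequency grid and locates its most sensitive region. The function names and the grid spacing are assumptions made for illustration only.

```python
import math


def threshold_in_quiet_db_spl(f_hz):
    """Eq. (1): approximate absolute threshold of hearing in dB SPL."""
    khz = f_hz / 1000.0
    return (3.64 * khz ** -0.8
            - 6.5 * math.exp(-0.6 * (khz - 3.3) ** 2)
            + 1e-3 * khz ** 4)


def most_sensitive_frequency(freqs_hz):
    """Return the grid frequency at which the threshold is lowest."""
    return min(freqs_hz, key=threshold_in_quiet_db_spl)


# Coarse grid from 100 Hz to 16 kHz in 50 Hz steps (an arbitrary choice).
grid = range(100, 16001, 50)
f_min = most_sensitive_frequency(grid)
```

On this grid the minimum falls between 3 and 4 kHz, the region that the text uses as the reference point for the +/- 1 bit convention.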
`B. CRITICAL BANDS
`Using the absolute threshold of hearing to shape the
`coding distortion spectrum represents the first step to-
`wards perceptual coding. Next we consider how the ear
`actually does spectral analysis. It turns out that a fre-
`quency-to-place transformation takes place in the inner
`ear, along the basilar membrane. Distinct regions in the
`cochlea, each with a set of neural receptors, are “tuned”
`to different frequency bands. Empirical work by sev-
`eral observers led to the modern notion of critical bands
`[28][29][30][31] which correspond to these cochlear re-
`gions. In the experimental sense, critical bandwidth can
`be loosely defined as the bandwidth at which subjective
`responses change abruptly. For example, the perceived
`loudness of a narrowband noise source at constant
`sound pressure level remains constant even as the
`bandwidth is increased up to the critical bandwidth.
`The loudness then begins to increase. In a different ex-
`periment (Fig 3a), the detection threshold for a narrow-
`band noise source between two masking tones remains
`constant as long as the frequency separation between
`the tones remains within a critical bandwidth. Beyond
`
`
`and non-uniform filterbanks. Finally, section V is con-
`cerned with standardization activities in audio coding.
It describes recently adopted standards including the ISO/IEC MPEG family, the Philips Digital Compact Cassette (DCC), the Sony MiniDisc, and the Dolby AC-3 algorithms. The paper concludes with a brief discussion of future research directions.
`For additional information, one can also refer to in-
`formative reviews of recent progress in wideband and
`hi-fidelity audio coding which have appeared in the lit-
`erature. Discussions of audio signal characteristics and
`the application of psychoacoustic principles to audio
`coding can be found in [22],[23], and [24]. Jayant, et
`al. of Bell Labs also considered perceptual models and
`their applications to speech, video, and audio signal
`compression [25]. Noll describes current algorithms in
`[26] and [27], including the ISO/MPEG audio compres-
`sion standard.
`
`II. PSYCHOACOUSTIC PRINCIPLES
`High precision engineering models for high-fidelity
`audio currently do not exist. Therefore, audio coding
`algorithms must rely upon generalized receiver models
`to optimize coding efficiency. In the case of audio, the
`receiver is ultimately the human ear and sound percep-
`tion is affected by its masking properties. The field of
`psychoacoustics [28][29][30][31][32][33][34] has made
`significant progress toward characterizing human audi-
`tory perception and particularly the time-frequency
`analysis capabilities of the inner ear. Although apply-
`ing perceptual rules to signal coding is not a new idea
`[35], most current audio coders achieve compression by
`exploiting the fact that “irrelevant” signal information is
`not detectable by even a well trained or sensitive lis-
`tener. Irrelevant information is identified during signal
`analysis by incorporating into the coder several psy-
`choacoustic principles,
`including absolute hearing
`thresholds, critical band frequency analysis, simultane-
`ous masking, the spread of masking along the basilar
`membrane, and temporal masking. Combining these
`psychoacoustic notions with basic properties of signal
`quantization has also led to the development of percep-
`tual entropy [36], a quantitative estimate of the funda-
`mental limit of transparent audio signal compression.
`This section reviews psychoacoustic fundamentals and
`perceptual entropy, then gives as an application exam-
`ple some details of the ISO/MPEG psychoacoustic
`model one.
`A. ABSOLUTE THRESHOLD OF HEARING
`The absolute threshold of hearing is characterized by
`the amount of energy needed in a pure tone such that it
`can be detected by a listener in a noiseless environment.
`The frequency dependence of this threshold was quanti-
`fied as early as 1940, when Fletcher [28] reported test
`results for a range of listeners which were generated in
`an NIH study of typical American hearing acuity. The
`
`
`

`
`this bandwidth, the threshold rapidly decreases (Fig 3c).
[Figure 3: four panels (a) to (d). Panels (a) and (b) show sound pressure level (dB) versus frequency; panels (c) and (d) show audibility threshold versus Δf.]
Fig. 3. Critical Band Measurement Methods
[Figure 4 plots: both panels are drawn against Frequency, f (Hz); x marks denote critical band (CB) filter center frequencies]
Fig. 4. (a) Critical Band Rate, z(f), and (b) Critical Bandwidth, BW_c(f).
A similar notched-noise experiment can be constructed with masker and maskee roles reversed (Fig. 3b,d). Critical bandwidth tends to remain constant (about 100 Hz) up to 500 Hz, and increases to approximately 20% of the center frequency above 500 Hz. For an average listener, critical bandwidth (Fig. 4b) is conveniently approximated [33] by
$$BW_c(f) = 25 + 75\,[1 + 1.4\,(f/1000)^2]^{0.69} \quad \text{(Hz)} \qquad (2)$$
Although the function BW_c is continuous, it is useful when building practical systems to treat the ear as a discrete set of bandpass filters which obeys Eq. (2). Table 1 gives an idealized filterbank which corresponds to the discrete points labeled on the curve in Figs. 4a, 4b. A distance of 1 critical band is commonly referred to as "one bark" in the literature. The function [33]
$$z(f) = 13\arctan(0.00076\,f) + 3.5\arctan\left[(f/7500)^2\right] \quad \text{(Bark)} \qquad (3)$$
is often used to convert from frequency in Hertz to the bark scale (Fig 4a). Corresponding to the center frequencies of the Table 1 filterbank, the numbered points in Fig. 4a illustrate that the non-uniform Hertz spacing of the filterbank (Fig. 5) is actually uniform on a bark scale. Thus, one critical bandwidth comprises one bark. Intra-band and inter-band masking properties associated with the ear's critical band mechanisms are routinely used by modern audio coders to shape the coding distortion spectrum. These masking properties are described next.
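Equations (2) and (3) are straightforward to evaluate numerically. The following sketch does so; the function names are illustrative choices rather than part of any standard.

```python
import math


def critical_bandwidth_hz(f_hz):
    """Eq. (2): critical bandwidth in Hz at center frequency f_hz."""
    return 25.0 + 75.0 * (1.0 + 1.4 * (f_hz / 1000.0) ** 2) ** 0.69


def hz_to_bark(f_hz):
    """Eq. (3): critical band rate z(f) in Bark."""
    return (13.0 * math.atan(0.00076 * f_hz)
            + 3.5 * math.atan((f_hz / 7500.0) ** 2))


# Evaluate both functions at a few center frequencies from Table 1.
for f in (150, 1000, 4000):
    print(f, round(hz_to_bark(f), 2), round(critical_bandwidth_hz(f), 1))
```

At low frequencies `critical_bandwidth_hz` stays close to 100 Hz, in line with the behavior described in the text, and `hz_to_bark` increases monotonically with frequency.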
`
Band No.   Center Freq. (Hz)   Bandwidth (Hz)
1          50                  -100
2          150                 100-200
3          250                 200-300
4          350                 300-400
5          450                 400-510
6          570                 510-630
7          700                 630-770
8          840                 770-920
9          1000                920-1080
11         1370                1270-1480
12         1600                1480-1720
13         1850                1720-2000
14         2150                2000-2320
15         2500                2320-2700
16         2900                2700-3150
17         3400                3150-3700
18         4000                3700-4400
19         4800                4400-5300
20         5800                5300-6400
21         7000                6400-7700
22         8500                7700-9500
23         10,500              9500-12000
24         13,500              12000-15500
25         19,500              15500-
Table 1. Critical Band Filterbank [after Scharf]
`
C. SIMULTANEOUS MASKING AND THE SPREAD OF MASKING
Masking refers to a process where one sound is rendered inaudible because of the presence of another sound. Simultaneous masking refers to a frequency-domain phenomenon which has been observed within critical bands (in-band). For the purposes of shaping coding distortions it is convenient to distinguish between two types of simultaneous masking, namely tone-masking-noise [31], and noise-masking-tone [32]. In the first case, a tone occurring at the center of a critical band masks noise of any subcritical bandwidth or shape,
`
parameter K has typically been set between 3 and 5 dB. Masking thresholds are commonly referred to in the literature as (bark scale) functions of just noticeable distortion (JND). One psychoacoustic coding scenario might involve first classifying masking signals as either noise or tone, next computing appropriate thresholds, then using this information to shape the noise spectrum beneath JND. Note that the absolute threshold (T_ABS) of
[Figure 5 plot: idealized critical band filter responses versus Frequency (Hz), frequency axis scaled x 10^4]
Fig. 5. Idealized Critical Band Filterbank

[Figure 6 schematic: Sound Pressure Level (dB) versus Freq., showing the masking tone, the minimum masking threshold, levels m-1, m, m+1, the critical band, and the neighboring band]
Fig. 6. Schematic Representation of Simultaneous Masking (after [26])
`
provided the noise spectrum is below a predictable threshold directly related to the strength of the masking tone. The second masking type follows the same pattern with the roles of masker and maskee reversed. A simplified explanation of the mechanism underlying both masking phenomena is as follows. The presence of a strong noise or tone masker creates an excitation of sufficient strength on the basilar membrane at the critical band location to effectively block transmission of a weaker signal. Inter-band masking has also been observed, i.e., a masker centered within one critical band has some predictable effect on detection thresholds in other critical bands. This effect, also known as the spread of masking, is often modeled in coding applications by an approximately triangular spreading function which has slopes of +25 and -10 dB per bark. A convenient analytical expression [35] is given by:
$$SF_{dB}(x) = 15.81 + 7.5\,(x + 0.474) - 17.5\sqrt{1 + (x + 0.474)^2} \quad \text{dB} \qquad (4)$$
where x has units of barks and SF_dB(x) is expressed in dB. After critical band analysis is done and the spread of masking has been accounted for, masking thresholds in psychoacoustic coders are often established by the [38] decibel (dB) relations:
$$TH_N = E_T - 14.5 - B \qquad (5)$$
$$TH_T = E_N - K \qquad (6)$$
where TH_N and TH_T, respectively, are the noise and tone masking thresholds due to tone-masking-noise and noise-masking-tone, E_N and E_T are the critical band noise and tone masker energy levels, and B is the critical band number. Depending upon the algorithm, the
`
`
hearing is also considered when shaping the noise spectra, and that MAX(JND, T_ABS) is most often used as the permissible distortion threshold. Notions of critical bandwidth and simultaneous masking in the audio coding context give rise to some convenient terminology illustrated in Fig. 6, where we consider the case of a single masking tone occurring at the center of a critical band. All levels in the figure are given in terms of dB SPL. A hypothetical masking tone occurs at some masking level. This generates an excitation along the basilar membrane which is modeled by a spreading function and a corresponding masking threshold. For the band under consideration, the minimum masking threshold denotes the spreading function in-band minimum. Assuming the masker is quantized using an m-bit uniform scalar quantizer, noise might be introduced at the level m. Signal-to-mask ratio (SMR) and noise-to-mask ratio (NMR) denote the log distances from the minimum masking threshold to the masker and noise levels, respectively.
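The spreading function of Eq. (4) and the threshold rules of Eqs. (5) and (6) can be sketched in a few lines of Python. The function names and the default K = 4 dB (a value inside the 3 to 5 dB range mentioned in the text) are assumptions made for this illustration.

```python
import math


def spreading_db(x_bark):
    """Eq. (4): spread of masking in dB at a distance of x_bark Bark."""
    s = x_bark + 0.474
    return 15.81 + 7.5 * s - 17.5 * math.sqrt(1.0 + s * s)


def noise_threshold_db(e_tone_db, band):
    """Eq. (5): tone-masking-noise threshold, TH_N = E_T - 14.5 - B."""
    return e_tone_db - 14.5 - band


def tone_threshold_db(e_noise_db, k_db=4.0):
    """Eq. (6): noise-masking-tone threshold, TH_T = E_N - K."""
    return e_noise_db - k_db


def permissible_distortion_db(jnd_db, t_abs_db):
    """MAX(JND, T_ABS): the larger of the masking and absolute thresholds."""
    return max(jnd_db, t_abs_db)


# Asymptotic slopes of the spreading function on either side of the masker.
upper_slope = spreading_db(5.0) - spreading_db(4.0)    # close to -10 dB/Bark
lower_slope = spreading_db(-4.0) - spreading_db(-5.0)  # close to +25 dB/Bark
```

The two slope estimates approach the -10 and +25 dB per bark figures quoted above as the distance from the masker grows, and `spreading_db(0.0)` is approximately 0 dB.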
D. TEMPORAL MASKING
Masking also occurs in the time-domain. In the context of audio signal analysis, abrupt signal transients (e.g., the onset of a percussive musical instrument) create pre- and post-masking regions in time during which a listener will not perceive signals beneath the elevated audibility thresholds produced by a masker. The skirts on both regions are schematically represented in Fig. 7. In other words, absolute audibility thresholds for masked sounds are artificially increased prior to, during, and following the occurrence of a masking signal. Whereas premasking tends to last only about 5 ms,
`

`
`
to account for inter-band masking. An estimation of the tonelike or noiselike quality for C_i is then obtained using the spectral flatness measure [40] (SFM)
`
[Figure 7 schematic: pre-, simultaneous, and post-masking regions; horizontal axes are time after masker appearance (ms) and time after masker removal (ms)]
Fig. 7. Schematic Representation of Temporal Masking Properties of the Human Ear (after [33])
`
postmasking will extend anywhere from 50 to 300 ms, depending upon the strength and duration of the masker [33][39]. Temporal masking has been used in several audio coding algorithms. Pre-masking in particular has been exploited in conjunction with adaptive block size transform coding to compensate for pre-echo distortions (section III).
E. PERCEPTUAL ENTROPY
Johnston at Bell Labs has combined notions of psychoacoustic masking with signal quantization principles to define perceptual entropy (PE), a measure of perceptually relevant information contained in any audio record. Expressed in bits per sample, PE represents a theoretical limit on the compressibility of a particular signal. PE measurements reported in [36] and [6] suggest that a wide variety of CD quality audio source material can be transparently compressed at approximately 2.1 bits per sample. The PE estimation process is accomplished as follows. The signal is first windowed and transformed to the frequency domain. A masking threshold is then obtained using perceptual rules. Finally, a determination is made of the number of bits required to quantize the spectrum without injecting perceptible noise. The PE measurement is obtained by constructing a PE histogram over many frames and then choosing a worst-case value as the actual measurement.
The frequency-domain transformation is done with a Hanning window followed by a 2048-point FFT. Masking thresholds are obtained by performing critical band analysis (with spreading), making a determination of the noiselike or tonelike nature of the signal, applying thresholding rules for the signal quality, then accounting for the absolute hearing threshold. First, real and imaginary transform components are converted to power spectral components
$$P(\omega) = \mathrm{Re}^2(\omega) + \mathrm{Im}^2(\omega) \qquad (7)$$
then a discrete bark spectrum is formed by summing the energy in each critical band (Table 1)
$$B_i = \sum_{\omega = bl_i}^{bh_i} P(\omega) \qquad (8)$$
where the summation limits are the critical band boundaries. The range of the index, i, is sample rate dependent, and in particular i ∈ {1, ..., 25} for CD-quality signals. A basilar spreading function (Eq. 4) is then convolved with the discrete bark spectrum
$$C_i = B_i * SF_i \qquad (9)$$
`
$$SFM = \frac{\mu_g}{\mu_a} \qquad (10)$$
where μ_g and μ_a correspond to the geometric and arithmetic means of the PSD components for each band. The SFM has the property that it is bounded by 0 and 1. Values close to 1 will occur if the spectrum is flat in a particular band, indicating a decorrelated (noisy) band. Values close to zero will occur if the spectrum in a particular band is nearly sinusoidal. A "coefficient of tonality," α, is next derived from the SFM on a dB scale
$$\alpha = \min\left(\frac{SFM_{dB}}{-60},\ 1\right) \qquad (11)$$
and this is used to weight the thresholding rules given by Eq. (5) and Eq. (6) [with K = 5.5] as follows for each band to form an offset
$$O_i = \alpha\,(14.5 + i) + (1 - \alpha)\,5.5 \quad \text{(in dB)} \qquad (12)$$
A set of JND estimates in the frequency power domain are then formed by subtracting the offsets from the bark spectral components
$$T_i = 10^{\log_{10}(C_i) - O_i/10} \qquad (13)$$
These estimates are scaled by a correction factor to simulate deconvolution of the spreading function, then each T_i is checked against the absolute threshold of hearing and replaced by max(T_i, T_ABS(i)). As previously noted, the absolute threshold is referenced to the energy in a 4 kHz sinusoid of +/- 1 bit amplitude. By applying uniform quantization principles to the signal and associated set of JND estimates, it is possible to estimate a lower bound on the number of bits required to achieve transparent coding. In fact, it can be shown that the perceptual entropy in bits per sample is given by
$$PE = \sum_{i=1}^{25} \sum_{\omega = bl_i}^{bh_i} \left[ \log_2\left( 2\left|\mathrm{nint}\left(\frac{\mathrm{Re}(\omega)}{\sqrt{6T_i/k_i}}\right)\right| + 1 \right) + \log_2\left( 2\left|\mathrm{nint}\left(\frac{\mathrm{Im}(\omega)}{\sqrt{6T_i/k_i}}\right)\right| + 1 \right) \right] \quad \text{(bits/sample)} \qquad (14)$$
where i is the index of critical band, bl_i and bh_i are the lower and upper bounds of band i, k_i is the number of transform components in band i, T_i is the masking threshold in band i (Eq. (13)), and nint denotes rounding to the nearest integer. Note that if 0 occurs in the log we assign 0 for the result.
The masking thresholds used in the above PE computation also form the basis for a transform coding algorithm described in section III.
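The per-band steps of the PE estimate (Eqs. (10) through (14)) can be sketched as follows. This is a minimal illustration rather than Johnston's implementation: the function names are invented here, SFM_dB is assumed to be 10 log10(SFM) because the SFM is formed from power values, the PSD values are assumed strictly positive, and nint is taken to round halves away from zero.

```python
import math


def spectral_flatness(psd):
    """Eq. (10): geometric mean over arithmetic mean of positive PSD values."""
    geo = math.exp(sum(math.log(p) for p in psd) / len(psd))
    arith = sum(psd) / len(psd)
    return geo / arith


def tonality(sfm, sfm_db_max=-60.0):
    """Eq. (11): coefficient of tonality, assuming SFM_dB = 10*log10(SFM)."""
    sfm_db = 10.0 * math.log10(sfm)
    return min(sfm_db / sfm_db_max, 1.0)


def offset_db(alpha, band):
    """Eq. (12): threshold offset in dB for critical band number `band`."""
    return alpha * (14.5 + band) + (1.0 - alpha) * 5.5


def jnd_threshold(c_i, o_i):
    """Eq. (13): JND estimate in the power domain."""
    return 10.0 ** (math.log10(c_i) - o_i / 10.0)


def band_pe_bits(re, im, t_i):
    """Inner sum of Eq. (14) for one band with threshold t_i."""
    k_i = len(re)
    step = math.sqrt(6.0 * t_i / k_i)
    bits = 0.0
    for r, q in zip(re, im):
        # |nint(v / step)|, rounding halves away from zero (an assumption).
        bits += math.log2(2 * math.floor(abs(r) / step + 0.5) + 1)
        bits += math.log2(2 * math.floor(abs(q) / step + 0.5) + 1)
    return bits
```

Because of the `+ 1` inside each logarithm, a component that quantizes to zero contributes zero bits, which matches the convention stated in the text. Summing `band_pe_bits` over all bands gives the double sum of Eq. (14); any further normalization is left outside this sketch.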
`
`

`
`G. APPLICATION OF PSYCHOACOUSTIC PRINCI-
`PLES: ISO 11172-3 (MPEG-1)
`PSYCHOACOUSTIC MODEL 1
`It is useful to consider an example of how the psy-
`choacoustic principles described thus far are applied in
`actual coding algorithms.
` The ISO/IEC 11172-3
`(MPEG-1, layer 1) psychoacoustic model 1 [17] deter-
`mines the maximum allowable quantization noise en-
`ergy in each critical band such that quantization noise
`remains inaudible. In one of its modes, the model uses
`a 512-point DFT for high resolution spectral analysis
`(86.13 Hz), then estimates for each input frame individ-
`ual simultaneous masking thresholds due to the pres-
`ence of tone-like and noise-like maskers in the signal
`spectrum. A global masking threshold is then estimated
`for a subset of the original 256 frequency bins by
`(power) additive combination of the tonal and non-tonal
`individual masking thresholds. The remainder of this
`section describes the step-by-step model operations.
`Sample results are given for one frame of CD-quality
`pop music sampled at 44.1 kHz/16-bits per sample. The
`five steps leading to computation of global masking
`thresholds are as follows:
`
STEP 1: SPECTRAL ANALYSIS AND NORMALIZATION
First, incoming audio samples, s(n), are normalized according to the FFT length, N, and the number of bits per sample, b, using the relation
$$x(n) = \frac{s(n)}{N\,(2^{b-1})} \qquad (15)$$
Normalization references the power spectrum to a 0-dB maximum. The normalized input, x(n), is then segmented into 12 ms frames (512 samples) using a 1/16th-overlapped Hann window such that each frame contains 10.9 ms of new data. A power spectral density (PSD) estimate, P(k), is then obtained using a 512-point FFT, i.e.,
$$P(k) = PN + 10\log_{10}\left| \sum_{n=0}^{N-1} w(n)\,x(n)\,e^{-j\frac{2\pi k n}{N}} \right|^2, \quad 0 \le k \le \frac{N}{2} \qquad (16)$$
where the power normalization term, PN, is fixed at 90 dB and the Hann window, w(n), is defined as
$$w(n) = \frac{1}{2}\left[1 - \cos\left(\frac{2\pi n}{N}\right)\right] \qquad (17)$$
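The normalization, windowing, and PSD steps of Eqs. (15) through (17) can be sketched with a direct DFT in pure Python. The function names and the small numerical floor that guards the logarithm are assumptions for illustration; a practical encoder would use an FFT routine instead of the O(N^2) sum shown here.

```python
import cmath
import math


def hann(n_len):
    """Eq. (17): Hann window of length n_len."""
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * n / n_len))
            for n in range(n_len)]


def normalize(samples, bits):
    """Eq. (15): scale by the frame length N and by 2^(b-1)."""
    scale = len(samples) * 2 ** (bits - 1)
    return [s / scale for s in samples]


def psd_db(samples, bits=16, pn_db=90.0, floor=1e-20):
    """Eq. (16): PSD estimate in dB for bins 0..N/2 of one frame."""
    n_len = len(samples)
    w = hann(n_len)
    x = normalize(samples, bits)
    psd = []
    for k in range(n_len // 2 + 1):
        acc = sum(w[n] * x[n] * cmath.exp(-2j * math.pi * k * n / n_len)
                  for n in range(n_len))
        # The floor (hypothetical) avoids log10(0) in empty bins.
        psd.append(pn_db + 10.0 * math.log10(max(abs(acc) ** 2, floor)))
    return psd


# Example: a full-scale 16-bit cosine centered on bin 8 of a 64-sample frame.
frame = [2 ** 15 * math.cos(2.0 * math.pi * 8 * n / 64) for n in range(64)]
spectrum = psd_db(frame)
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
```

In this example the strongest spectral line lands in the bin where the cosine was placed, with the Hann window spreading some energy into the two adjacent bins.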
`Because playback levels are unknown during psychoa-
`coustic signal analysis, the normalization procedure
`(Eq. 15) and the parameter PN in Eq. (16) are used to
`estimate SPL conservatively from the input signal. For
`example, a full-scale sinusoid which is precisely re-
`
`
`F. PRE-ECHO DISTORTION
`A problem known as “pre-echo” can arise in trans-
`form coders using perceptual coding rules. Pre-echoes
`occur when a signal with a sharp attack begins near the
`end of a transform block immediately following a re-
`gion of low energy. This situation can arise when cod-
`ing recordings of percussive instruments such as the
`castanets, for example (Fig 8a). The inverse transform
`spreads quantization distortion evenly throughout the
`reconstructed block according to the relatively lax
`masking thresholds associated with the block average
`spectral estimate (Fig 8b), resulting in unmasked distor-
`tion in the low energy region preceding in time the sig-
`nal attack at the decoder. Although it has the potential
`to compensate for pre-echo, temporal premasking is
`possible only if the transform block size is sufficiently
`small (minimal coder delay). A more robust solution to
`the problem relies upon the use of adaptive transform
`block sizes. Long blocks are applied during steady-
`state audio segments, and short blocks are applied when
`pre-echo is likely. Several algorithms make use of this
`approach.
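The block-switching idea can be illustrated with a simple energy-based attack detector. The sketch below is hypothetical: the sub-block count, the energy-ratio threshold, and the short block size of 256 are arbitrary illustrative values, and the detection rule is not taken from any particular coder or standard.

```python
def choose_block_size(frame, long_size=2048, short_size=256,
                      sub_blocks=8, ratio_threshold=10.0):
    """Return short_size if a sharp energy rise is found inside the frame.

    The frame is split into equal sub-blocks; if the energy of any sub-block
    exceeds that of its predecessor by more than ratio_threshold, an attack
    is assumed and short blocks are selected to confine pre-echo.
    """
    step = len(frame) // sub_blocks
    energies = [sum(v * v for v in frame[i * step:(i + 1) * step]) + 1e-12
                for i in range(sub_blocks)]
    for prev, cur in zip(energies, energies[1:]):
        if cur / prev > ratio_threshold:
            return short_size
    return long_size


# A steady segment keeps the long block; a late attack selects short blocks.
steady = [0.1] * 2048
attack = [0.001] * 1536 + [0.9] * 512
steady_choice = choose_block_size(steady)
attack_choice = choose_block_size(attack)
```

With short blocks, the quantization noise that the inverse transform spreads across a block is confined to a much shorter interval before the attack, which is where temporal premasking has a chance of hiding it.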
`
[Figure 8 plots: Amplitude versus Sample (n), panels (a) and (b)]
Fig. 8. Pre-Echo Example: (a) Uncoded Castanets. (b) Transform Coded Castanets, 2048-Point Block Size
`

`
`vertical lines are included in the bark scale plot to show
`the associated critical band for each masker.
`
`STEP 3: DECIMATION AND REORGANIZATION OF
`MASKERS
`In this step, the number of maskers is reduced using
`two criteria. First, any tonal or noise maskers below the
`absolute threshold are discarded, i.e., only maskers
which satisfy
$$P_{TM,NM}(k) \ge T_q(k) \qquad (23)$$
are retained, where T_q(k) is the SPL of the threshold in quiet at spectral line k. In the pop music example, two
`high-frequency noise maskers identified during step 2
`(Fig. 9a) are dropped after application of Eq. 23 (Figs.
`9c-e). Next, a sliding 0.5 Bark-wide window is used to
`replace any pair of maskers occurring within a distance
`of 0.5 Bark by
