`Samsung v. Affinity
`IPR2014-01181
`Page 00001
`
`
`
`Principles of
`Digital Audio
`
Ken C. Pohlmann
`
`Fourth Edition
`
New York San Francisco Washington, D.C. Auckland Bogotá
`Caracas Lisbon London Madrid Mexico City Milan
`Montreal New Delhi San Juan Singapore
`Sydney Tokyo Toronto
`
`McGraw-Hill
`
`
`
`
Library of Congress Cataloging-in-Publication Data

Pohlmann, Ken C.
  Principles of digital audio / Ken C. Pohlmann.—4th ed.
    p. cm.
  Includes bibliographical references and index.
  ISBN 0-07-134819-0
  1. Sound—Recording and reproducing—Digital techniques. I. Title.
  TK7881.4 P63 2000
  621.389'3—dc21      99-054165
`
McGraw-Hill
A Division of The McGraw-Hill Companies
`
`Copyright © 2000 by The McGraw-Hill Companies, Inc. All rights reserved.
`Printed in the United States of America. Except as permitted under the United
`States Copyright Act of 1976, no part of this publication may be reproduced or
`distributed in any form or by any means, or stored in a data base or retrieval
`system, without the prior written permission of the publisher.
`
`1234567890 AGM/AGM 90543210
`
`ISBN 0-07-134819-0
`
`The sponsoring editor for this book was Stephen S. Chapman and the
`production supervisor was Maureen Harper. It was set in Century Schoolbook by
`Pro-Image Corporation.
`
`Printed and bound by Quebecor/Martinsburg.
`
`This book was printed on recycled, acid-free paper containing
`a minimum of 50% recycled, de-inked fiber.
`'
`
`McGraw-Hill books are available at special quantity discounts to use as premiums
`and sales promotions, or for use in corporate training programs. For more infor-
`mation, please write to the Director of Special Sales, Professional Publishing, Mc-
`Graw-Hill, Two Penn Plaza, New York, NY 10121-2298. Or contact your local
`bookstore.
`
Information contained in this work has been obtained by The McGraw-
Hill Companies, Inc. ("McGraw-Hill") from sources believed to be reli-
able. However, neither McGraw-Hill nor its authors guarantee the ac-
curacy or completeness of any information published herein, and neither
McGraw-Hill nor its authors shall be responsible for any errors,
omissions, or damages arising out of use of this information. This work
is published with the understanding that McGraw-Hill and its authors
are supplying information but are not attempting to render engineering
or other professional services. If such services are required, the assis-
tance of an appropriate professional should be sought.
`
`
`
`
`Perceptual Coding
`
`327
`
`the modulated lapped transform (MLT). In the MDCT, the length of the over-
`lapping windows is twice that of the block time (shift length of the transform).
`Frequency-domain subsampling is performed; the number of time and fre-
`quency components equals the shift length of the input time-domain sampled
`signal. MDCT also lends itself to adaptive window switching approaches with
`different window functions for the first and second half of the window; the
`time domain aliasing property must be independently valid for each window
`half. Many bands are possible with the MDCT with good efficiency, on the
order of an FFT computation. Many codecs apply a window function to blocks
`prior to transformation; this helps minimize spectral leakage of spectral co-
`efficients. A window is a time function that is multiplied by an audio block to
`provide a windowed audio block; the window shape governs the frequency
selectivity of the filter bank. The overlap/add characteristic minimizes block-
ing artifacts and leakage of spectral coefficients. Digital filters and windows
`are discussed in chapter 17.
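The windowing and overlap/add behavior described above can be sketched numerically. The following Python fragment is an illustration only (a direct matrix MDCT with a small block length and a sine window, not any standard's normative filter bank); it verifies that overlap-adding windowed inverse transforms cancels the time-domain aliasing exactly in the overlapped region:

```python
import numpy as np

def mdct(x, N):
    """Forward MDCT: a 2N-sample block yields N coefficients."""
    n = np.arange(2 * N)
    k = np.arange(N)
    basis = np.cos(np.pi / N * (n[None, :] + 0.5 + N / 2) * (k[:, None] + 0.5))
    return basis @ x

def imdct(X, N):
    """Inverse MDCT: N coefficients yield 2N samples plus aliasing terms."""
    n = np.arange(2 * N)
    k = np.arange(N)
    basis = np.cos(np.pi / N * (n[:, None] + 0.5 + N / 2) * (k[None, :] + 0.5))
    return (2.0 / N) * (basis @ X)

N = 32                                                  # shift length
w = np.sin(np.pi / (2 * N) * (np.arange(2 * N) + 0.5))  # sine window
                                                        # (Princen-Bradley)
rng = np.random.default_rng(0)
x = rng.standard_normal(4 * N)

# Windows are twice the shift length, so each output sample in the middle
# region is covered by exactly two overlapping blocks.
y = np.zeros(4 * N)
for start in (0, N, 2 * N):
    blk = w * x[start:start + 2 * N]
    y[start:start + 2 * N] += w * imdct(mdct(blk, N), N)

# Time-domain aliasing cancels: the overlapped region is reconstructed.
assert np.allclose(y[N:3 * N], x[N:3 * N])
```

Note that each 2N-sample block produces only N coefficients, so the transform remains critically sampled despite the 50-percent window overlap.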
`Hybrid filter banks use a cascade of different filter types (such as polyphase
`and MDCT) to provide different frequency resolutions at different frequencies
`with moderate complexity; for example, MPEG-1 Layer III encoders use a
`hybrid filter with a polyphase filter bank and MDCT. The ATRAC algorithm
`used in the MiniDisc, examined in chapter 12, is a hybrid coder that uses
`QMF to divide the signal into three subbands, then each subband is trans-
`formed into the frequency domain using the MDCT. Table 10.3 compares the
`properties of filter banks used in several low-bit rate coders.
`
`MPEG-1 Audio Standard
`
The International Standards Organization and the International Electro-
`technical Commission formed the Moving Pictures Expert Group (MPEG) in
`1988 to devise compression techniques for audio and video. This group has
`
TABLE 10.3 Comparison of filter-bank properties.

Feature                        Layer 1   Layer 2   Layer 3            AC-2       AC-3      ATRAC*            PAC/MPAC

Filterbank type                PQMF      PQMF      Hybrid PQMF/MDCT   MDCT/MDST  MDCT      Hybrid QMF/MDCT   MDCT
Frequency resolution at 48 kHz 750 Hz    750 Hz    41.66 Hz           93.75 Hz   93.75 Hz  46.87 Hz          23.44 Hz
Time resolution at 48 kHz      0.66 ms   0.66 ms   4 ms               1.3 ms     2.66 ms   1.3 ms            2.66 ms
Impulse response (LW)          512       512       1664               512        512       1024              2048
Impulse response (SW)          —         —         896                128        256       128               256
Frame length at 48 kHz         8 ms      24 ms     24 ms              32 ms      32 ms     10.66 ms          23 ms

*ATRAC is operating at a sampling frequency of 44.1 kHz. For comparison, the frame length and impulse response
figures are given for an ATRAC system working at 48 kHz.
(Brandenburg and Bosi)
`
`
`
`
`328
`
`Chapter Ten
`
`developed several highly successful standards. It first devised the ISO/IEC
`International Standard 11172 “Coding of Moving Pictures and Associated Au-
dio for Digital Storage Media at up to about 1.5 Mbit/s” for reduced data rate
`coding of digital video and audio signals; the standard was finalized in No-
vember, 1992. It is commonly known as MPEG-1. (The acronym is pronounced
“m-peg.”) The standard has three major parts: system (multiplexed video and
`audio), video, and audio; a fourth part defines conformance testing. The max-
`imum audio bit rate is set at 1.856 Mbps. The audio portion of the standard
`(11172-3) has found many applications such as Video CD, CD-ROM, ISDN,
`video games, and digital audio broadcasting. It supports coding of 32, 44.1,
and 48 kHz PCM data at bit rates of approximately 32 to 224 kbps/channel
`(64 to 448 kbps for stereo). (Because data networks use data rates of 64 kbps
`(8 bits sampled at 8 kHz), most coders output a data channel rate that is a
`multiple of 64.)
The ISO/MPEG-1 standard was specifically developed to support audio and
video coding for CD playback within the CD’s bandwidth of 1.41 Mbps. How-
`ever, the standard supports stereo bit rates ranging from 64 kbps to 448 kbps,
`as well as mono audio coding of 32 kbps. In addition, in the stereo modes,
`stereophonic irrelevance and redundancy can be optionally exploited to reduce
`the bit rate. Stereo audio bit rates below 256 kbps are useful for applications
requiring more than two audio channels while maintaining full screen motion
`video. Rates above 256 kbps are useful for applications requiring higher audio
`quality, and partial screen video images. In either case, the bit allocation is
`dynamically adaptable according to need. The MPEG-1 standard is based on
`a history of research and development of data reduction algorithms.
`MUSICAM (Masking-pattern Universal Subband Integrated Coding And
`Multiplexing) was an early and successful perceptual coding algorithm. De-
`rived from MASCAM (Masking-pattern Adapted Subband Coding And Mul-
`tiplexing), MUSICAM divides the input audio signal into 32 subbands and
`uses perceptual coding models of minimum hearing threshold and masking to
`achieve data reduction. With a sampling frequency of 48 kHz, the subbands
`are each 750 Hz wide. Each subband is given a 6-bit scale factor according to
`the peak value in the subband’s 12 samples and quantized with a variable
`word ranging from 0 to 15 bits. Scale factors are calculated over a 24-ms
`interval, corresponding to 36 samples. A subband is quantized only if it con-
`tains audible signals above the masking threshold. Subbands with signals
`well above the threshold are coded with more bits, yielding a higher S/N ratio.
`In other words, within a given bit rate, bits are assigned where they are most
`needed. In addition, a side-chain Fourier spectral analysis is performed on the
`input signal to assist in the masking threshold calculations. In this way, the
data rate is reduced to perhaps 128 kbps per mono channel (256 kbps for
`stereo). Extensive tests of 128 kbps MUSICAM showed that the coder achieves
`fidelity that is indistinguishable from a CD source, that it is monophonically
`compatible, that at least two cascaded codec stages produce no audible deg-
`radation, and that it is preferred to very high quality FM signals.
`The audio portion of the ISO/MPEG-1 standard can trace its origins to tests
conducted by Swedish Radio in July 1990. MUSICAM coding was judged su-
perior in complexity and coding delay; however, the ASPEC (Adaptive Spectral
`Perceptual Entropy Coding) transform coder provided superior sound quality
`at very low data rates. The architectures of these two coding methods form
`the basis for the ISO/MPEG-1 audio standard. The 11172-3 standard de-
scribes three layers of coding, each with different applications. Specifically,
Layer I describes the least sophisticated method that requires relatively high
data rates (approximately 192 kbps/channel). Layer II is based on Layer I
but is more complex and operates at somewhat lower data rates (approxi-
mately 96-128 kbps/channel). Layer IIA is a joint stereo version operating at
128 and 192 kbps per stereo pair. Layer III is somewhat conceptually different
from I and II, is the most sophisticated, and operates at the lowest data rate
(approximately 64 kbps/channel). The increased complexity from Layer I to
III is reflected in the fact that at low data rates, Layer III will perform best
for audio fidelity. Generally, Layers II, IIA, and III have been judged to be
acceptable for broadcast applications; in other words, the 128 kbps/channel
data reduction does not impair the quality of the original audio signal.
`In very general terms, all three coders operate similarly. The audio signal
`passes through a filter bank and is analyzed in the frequency domain. The
subsampled components are regarded as subband values, or spectral coeffi-
cients. The output of a side-chain transform, or the filter bank itself, is used
`to estimate masking thresholds. The subband values or spectral coefficients
`are quantized according to the psychoacoustic model. Coded mapped samples
`and bit allocation information are packed into frames prior to transmission.
In each case, the encoders are not defined by the ISO/MPEG-1 standard; only
`the decoders are specified. This forward adaptive bit allocation permits im-
`provements in encoding methods, particularly in the psychoacoustic modeling,
`provided the data output from the encoder can be decoded according to the
standard. In other words, existing coders will play data from improved encod-
`ers.
`
`The MPEG-1 layers support stereo joint coding using intensity coding. Left/
`right high frequency subband samples are summed into one channel but scale
`factors remain left/right independent. The decoder forms the envelopes of the
`original left and right channels using the scale factors. The spectral shape of
`the left and right channels is the same in these upper subbands, but their
`amplitudes differ. The bound for joint coding is selectable at four frequencies:
`3, 6, 9, and 12 kHz at a 48-kHz sampling frequency; the bound can be changed
`from one frame to another. Care must be taken to avoid aliasing between
`subbands and negative correlation between channels when joint coding. Layer
`III also supports MS (sum and difference) coding between channels, as de-
scribed below. Joint stereo coding increases coder complexity only slightly.
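The intensity-coding idea can be sketched as follows. This Python fragment is illustrative only: the bound subband, the scale-factor definition (per-subband peak), and the envelope reconstruction are simplified stand-ins for the normative Layer procedures, and no quantization is shown.

```python
import numpy as np

rng = np.random.default_rng(2)
n_subbands, n_samples, bound = 32, 12, 16   # bound subband chosen arbitrarily

L = rng.standard_normal((n_subbands, n_samples))
R = 0.5 * L + 0.1 * rng.standard_normal((n_subbands, n_samples))

# Encoder: above the bound, left and right subband samples are summed
# into one carrier channel, but scale factors remain per channel.
carrier = L[bound:] + R[bound:]
sf_L = np.abs(L[bound:]).max(axis=1)
sf_R = np.abs(R[bound:]).max(axis=1)

# Decoder: both channels receive the carrier's spectral shape; the
# per-channel scale factors restore each channel's envelope.
shape = carrier / (np.abs(carrier).max(axis=1, keepdims=True) + 1e-12)
L_hat = shape * sf_L[:, None]
R_hat = shape * sf_R[:, None]

# Reconstructed channels share one shape but differ in amplitude.
assert np.allclose(np.abs(L_hat).max(axis=1), sf_L)
```

Only one set of samples is transmitted for the upper subbands, which is where the bit-rate saving comes from; the cost is that left and right lose any phase or shape difference above the bound.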
`MPEG data is transmitted in frames, as shown in Fig. 10.16, with each
`frame being individually decodable. The length of a frame depends on the
`layer and MPEG algorithm used. In MPEG-1, Layer II and III have the same
`frame length representing 1152 audio samples. In Layer II, the audio data of
`a frame is located in the audio frame to which it corresponds. Unlike the other
`layers, in Layer III the number of bits per frame can vary; this allocation
`provides flexibility according to the coding demands of the audio signal.
`
`
`
`
[Figure 10.16. A. Layer I frame structure, valid for 384 PCM audio input samples (duration 8 ms at a 48-kHz sampling rate): header (12-bit sync, 20-bit system information), CRC, bit allocation (4-bit linear), scale factors (6-bit linear), subband samples (1 subband sample corresponds to 32 PCM audio input samples), and an auxiliary data field of unspecified length. B. Layer II frame structure, valid for 1152 PCM audio input samples (duration 24 ms at a 48-kHz sampling rate): header, CRC, bit allocation (4-bit linear for low subbands, 3-bit linear for mid subbands, 2-bit linear for high subbands), scale factor select information (SCFSI), scale factors (6-bit linear), subband samples in 12 granules of 3 subband samples each (3 subband samples correspond to 96 PCM audio input samples), and an auxiliary data field of unspecified length. C. Layer III frame structure: 32-bit header (syncword, layer, bit rate, sampling frequency, mode and mode extension, copyright, original/home, emphasis), CRC, 256-bit (stereo) side information (pointer to the beginning of this frame's main data, private bits, scale factor select information, side info for granules 0 and 1), and 24 ms of main data (scale factors, coded subband samples, auxiliary data).]
Figure 10.16 Structure of the ISO/MPEG-1 audio Layers I, II, and III bit streams. The
`header and some other fields are common, but other fields differ. Higher-level coders
`might transcode lower-level bit streams. A. Layer I bit stream format. B. Layer II bit
`stream format. C. Layer III bit stream format.
`
`
`A frame begins with a 32-bit ISO header with a 12-bit synchronizing pattern
`and 20 bits of general data on layer, bit rate index, sampling frequency, type
`of emphasis, etc. This is followed by an optional 16-bit CRCC check word with
generation polynomial x^16 + x^15 + x^2 + 1. Subsequent fields describe bit allo-
`cation data (number of bits used to code subband samples), scale factor selec-
`tion data, and scale factors themselves. This varies from layer to layer. For
`example, Layer I sends a fixed 6-bit scale factor for each coded subband. Layer
`II examines scale factors and uses dynamic scale-factor selection information
`(SCFSI) to avoid redundancy; this reduces the scale factor bit rate by a factor
`of two.
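The check word can be computed bit-serially. In the sketch below, the generator polynomial is the one named in the text; the all-ones initial state follows common MPEG audio practice, and exactly which header bits the CRC covers is defined by the standard and not reproduced here.

```python
def crc16(bits, poly=0x8005, init=0xFFFF):
    """Bit-serial CRC with generator x^16 + x^15 + x^2 + 1; the x^16
    term is implicit in the 16-bit shift register (poly 0x8005 holds
    the remaining terms)."""
    reg = init
    for bit in bits:
        msb = (reg >> 15) & 1
        reg = (reg << 1) & 0xFFFF       # shift, dropping the top bit
        if msb ^ bit:                   # divide when the bits differ
            reg ^= poly
    return reg

msg = [1, 0, 1, 1, 0, 0, 1, 0] * 4
word = crc16(msg)
# Feeding the message followed by its 16 check bits drives the
# register to zero, which is how the decoder verifies the frame:
assert crc16(msg + [(word >> i) & 1 for i in range(15, -1, -1)]) == 0
```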
`The largest part of the frame is occupied by subband samples. Again, this
`varies among layers. In Layer II, for example, samples are grouped in gran-
`ules. The length of the field is determined by a bit rate index, but the bit
`allocation determines the actual number of bits used to code the signal; if the
`frame length exceeds the number of bits allocated, the remainder of the frame
`can be occupied by ancillary data (this feature is used by MPEG-2, for ex-
`ample). Ancillary data is coded similarly to primary frame data. Frames con-
`tain 384 samples in Layer I and 1152 samples in II and III (or 8 and 24 ms
`respectively at 48 kHz).
`The similarity between the layers promotes tandem operation; for example,
`Layer III data can be transcoded to Layer II without returning to the analog
`domain (other digital processing is required however). A full MPEG-1 decoder
`must be able to decode its layer, and all layers below it. There are also layer
`X coders that only code one layer. Layer I preserves highest fidelity for ac-
`quisition and production work at high bit rates where six or more codings can
take place; Layer II distributes programs efficiently where two codings can
`occur; Layer III is most efficient, with lowest rates, with somewhat lower fi-
`delity, and a single coding.
Extensive tests have demonstrated that either Layer II or III at 2 × 128
`kbps or 192 kbps joint stereo can convey a stereo audio program with no
`audible degradation compared to a 16-bit linear system. If a higher data rate
`of 384 kbps is allowed, Layer I also achieves transparency compared to 16-bit
`linear PCM. At rates as low as 128 kbps, Layers II and III can convey stereo
material that is subjectively very close to 16-bit fidelity. Tests also have stud-
`ied the effects of cascading MPEG codecs. For example, in one experiment,
`critical audio material was passed through four Layer II codec stages at 192
`kbps and two stages at 128 kbps, and they were found to be transparent. On
`the other hand, a cascade of five codec stages at 128 kbps was not transparent
`for all music programs. More specifically, a source reduced to 384 kbps with
`MPEG-1 Layer II sustained about 15 code/decodes before noise became sig-
`nificant; however, at 192 kbps, only two codings were possible. These partic-
`ular tests did not enjoy the benefit of joint stereo coding, and as with any
`MPEG perceptual coder, overall performance can be improved by substituting
`new psychoacoustic models in the encoder. In addition, transcoding produces
no appreciable noise after multiple MPEG code/decodes.
`MPEG-2, discussed below, incorporates the three audio layers of MPEG-1,
`and adds additional features, principally surround sound. However, MPEG-2
`
`decoders can play MPEG-1 audio files, and MPEG-1 two-channel decoders can
`decode stereo information from surround sound MPEG-2 files.
`
`Psychoacoustic models
`
`The MPEG-1 standard suggests two psychoacoustic models which determine
`the minimum masking threshold for inaudibility. The models are needed only
in the encoder. Simple encoders do not employ a psychoacoustic model. The
`difference between the maximum signal level and the masking threshold is
`used by the bit allocator to set the quantization levels. Generally, model 1 is
`applied to Layers I and II and model 2 is applied to Layer III. In both cases,
`the models follow an algorithm to output signal-to-mask ratios for each sub-
`band or group of subbands. For example, model 1 performs these nine steps:
`
1. Perform time to frequency mapping: A 512- or 1024-point fast Fourier trans-
form is used, with a Hann window to reduce edge effects, to transform time
domain data to the frequency domain; in this way, precise masking thresh-
olds can be calculated.

2. Determine maximum SPL levels: This calculation is performed for each sub-
band using spectral data and scale factors. Maxima are considered to be
potential maskers, used in forming the masking threshold.

3. Determine threshold in quiet: An absolute hearing threshold is determined
in the absence of any signal; this forms the lower masking bound.

4. Identify tonal and nontonal components: Tonal (sinusoidal) and nontonal
(noiselike) components in the signal are identified and processed separately
because they provide different masking thresholds.

5. Decimation of maskers: The number of maskers is reduced to obtain only
the relevant maskers; their magnitude and distance in bark must be ap-
propriate.

6. Calculate masking thresholds: Noise masking thresholds for each subband
are determined by applying a masking function to the signal. When the
subband is wide compared to the critical band, the spectral model selects
the minimum threshold; when it is narrow, the model averages the thresholds
covering the subband.

7. Determine global masking threshold: This is the summation of the upper
and lower slopes of individual subband masking curves, as well as the
threshold in quiet, to form a composite contour.

8. Determine minimum masking threshold: These values are determined for
each subband, based on the global masking threshold.

9. Calculate signal-to-mask ratios: The difference between the maximum SPL
levels and the minimum masking threshold values determines the SMR
ratio in each subband; this value is supplied to the bit allocator.
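The flavor of these steps can be conveyed with a deliberately simplified sketch. The fragment below performs only steps 1 through 3 and step 9 (Hann-windowed FFT, per-subband SPL maxima, an approximate threshold-in-quiet curve, and their difference); steps 4 through 8 (masker identification, decimation, and spreading) are collapsed into the quiet threshold alone, and the 96-dB offset and curve constants are illustrative rather than taken from the standard.

```python
import numpy as np

FS, N_FFT, N_SUBBANDS = 48_000, 512, 32

def smr_sketch(x):
    """Grossly simplified signal-to-mask estimate, one value per subband."""
    w = np.hanning(N_FFT)                                    # step 1
    spec = np.fft.rfft(w * x[:N_FFT])
    spl = 96.0 + 20 * np.log10(np.abs(spec) / N_FFT + 1e-12)  # step 2

    freqs = np.fft.rfftfreq(N_FFT, 1 / FS)
    f = np.maximum(freqs, 20.0) / 1000.0                     # step 3:
    # Terhardt-style fit to the absolute hearing threshold (dB SPL)
    quiet = 3.64 * f**-0.8 - 6.5 * np.exp(-0.6 * (f - 3.3)**2) + 1e-3 * f**4

    edges = np.linspace(0, FS / 2, N_SUBBANDS + 1)           # 750-Hz bands
    smr = np.empty(N_SUBBANDS)
    for b in range(N_SUBBANDS):                              # step 9
        sel = (freqs >= edges[b]) & (freqs < edges[b + 1])
        smr[b] = spl[sel].max() - quiet[sel].min()
    return smr

t = np.arange(N_FFT) / FS
smr = smr_sketch(np.sin(2 * np.pi * 1000 * t))   # 1-kHz test tone
assert int(np.argmax(smr)) == 1                  # subband covering 750-1500 Hz
```

As expected, the subband containing the tone demands the highest signal-to-mask ratio, and hence the most quantizer bits.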
`
`
`Although the validity of the psychoacoustic model is crucial to the success
`of any perceptual coder, it is the actual employment of the model in the quan-
`tization process that ultimately determines the audibility of noise. In that
`respect, the interrelationship of the model and the quantizer is the most pro-
`prietary part of any codec.
`
Layer I

Layer I is a simplified version of the MUSICAM standard; block diagrams of
`a single-channel Layer I encoder and decoder (which also applies to Layer II)
`are shown in Fig. 10.17. Its aim is to provide high fidelity at low cost, at a
somewhat high data rate. A polyphase filter is used to split the wideband
`signal into 32 subbands of equal width. The filter is critically sampled; there
`is the same number of samples in the analyzed domain as in the time domain.
`Adjacent subbands overlap; a single frequency can affect two subbands. The
`filter and its inverse are not lossless; however, the error is small. The filter
`bank’s bands are all equal width, but the ear’s critical bands are not; this is
`compensated for in the bit allocation algorithm; for example, lower bands are
`
[Figure 10.17. A. Encoder: the digital audio signal (768 kbps) enters a filter bank with 32 subbands (0 through 31); a psychoacoustic model controls a linear quantizer; side information is coded; bitstream formatting and CRC check produce the coded audio signal (32 kbps to 192 kbps), with provision for auxiliary data. B. Decoder: the encoded audio bitstream (2 × 32 kbps to 2 × 192 kbps) is demultiplexed and error-checked; side information is decoded; subband samples are dequantized and applied to an inverse filter bank with 32 subbands to yield the stereophonic audio signal (2 × 768 kbps).]
`
`Figure 10.17 ISO/MPEG-1 Layer I or II audio encoder and decoder. The 32-subband filter
`bank is common to all three layers. A. Layer I or II encoder (single-channel mode). B.
`Layer I or II two-channel decoder.
`
`
`usually assigned more bits, increasing their resolution over higher bands. This
`polyphase filter bank with 32 subbands is used in all three layers; Layer III
`adds additional hybrid processing.
`>
`The filter outputs 32 samples, one sample per band, for every 32 input
`samples. In Layer I, 12 subband samples from each of the 32 subbands are
`grouped to form a frame; this represents 384 wideband samples. Each sub-
`band group of 12 samples is given a bit allocation; subbands judged inaudible
`are given a zero allocation. Based on the calculated masking threshold (just
`audible noise), the bit allocation determines the number of bits used to quan-
`tize those samples. A floating point notation is used to code samples; the man-
`tissa determines resolution and the exponent determines dynamic range. A
`fixed scale factor exponent is computed for each subband with a nonzero al-
`location; it is based on the largest sample value in the subband. Each of the
`12 subband samples in a block is normalized by dividing it by the same scale
`factor; this optimizes quantizer resolution.
Using the scale factor information and spectral analysis from a 512-sample
`FFT wideband transform, a psychoacoustic model compares the data to the
`minimum threshold curve; the normalized samples are quantized by the bit
`allocator to achieve data reduction. The subband data is coded, not the FFT
spectra. Dynamic bit allocation assigns mantissa bits to the samples in each
`coded subband, or omits coding for inaudible subbands. Each sample is coded
with one PCM codeword; the quantizer provides 2^n − 1 steps where 2 ≤ n ≤ 15.
`Subbands with a high signal-to-mask ratio are given a long word, subbands
`with a low SMR ratio are given fewer bits; in other words, the SMR ratio
`determines the minimum signal-to-noise ratio that has to be met by the quan-
`tization of the subband samples. However, quantization is performed itera-
`tively; when available, additional bits are added to codewords to increase the
`S/N ratio above the minimum. The block scale factor exponent and sample
`mantissas are output. Error correction and other information is added to the
`signal at the output of the coder.
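A toy version of the sample-coding path might look like the following; the uniform quantizer with 2^n − 1 steps matches the description above, while the scale-factor normalization, iterative allocation, and framing stages are omitted.

```python
import numpy as np

def quantize(x, n_bits):
    """Uniform quantizer with 2**n - 1 steps on [-1, 1]; returns integer
    codewords (sketch only: scale factors and framing are omitted)."""
    levels = 2 ** n_bits - 1
    return np.round((x * 0.5 + 0.5) * (levels - 1)).astype(int)

def dequantize(q, n_bits):
    """Map integer codewords back to [-1, 1]."""
    levels = 2 ** n_bits - 1
    return (q / (levels - 1) - 0.5) * 2.0

x = np.array([-0.8, -0.1, 0.0, 0.4, 0.95])
for n in (4, 8, 15):
    err = np.abs(dequantize(quantize(x, n), n) - x).max()
    # each added bit roughly halves the worst-case error (about 6 dB S/N)
    assert err <= 1.0 / (2 ** n - 2)
```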
`Decoding is performed by decoding the bit allocation information, and de-
`coding the scale factors. Samples are requantized by multiplying them with
`the correct scale factor. The scale factors provide all the information needed
`to recalculate the masking thresholds; in other words, the decoder does not
`need‘ a psychoacoustic model. Samples are applied to an inverse synthesis
`filter to output the waveform.
`
`Example of Layer I algorithm
`
`As with other perceptual coding methods, Layer I uses the ear’s audiology
`performance as its guide for audio encoding, relying on principles such as
`amplitude masking to encode a signal that is perceptually identical. Generally,
`Layer I operating at 384 kbps achieves the same quality as a Layer II coder
`operating at 256 kbps. Also, Layer I can be transcoded to Layer II. The fol-
`lowing describes a typical Layer I implementation.
`
`
`Signals input to an encoder can be analog, or PCM digital with 32-, 44.1-,
`or 48-kHz sampling frequencies. At these three sampling frequencies, the sub-
`band width is 500, 689, and 750 Hz, and the frame period is 12, 8.7, and 8
ms, respectively. The following description assumes a 48-kHz sampling fre-
`quency. The stereo audio signal is passed to the first stage in a Layer I encoder,
`as shown in Fig. 10.18. A 24-bit FIR filter with the equivalent of 512 taps
`divides the audio band into 32 subbands of equal 750-Hz width. The filter
`window is shifted by 32 samples each time (12 shifts) so all the 384 samples
in the 8-ms frame are analyzed. The filter bank outputs 32 subbands. With
`this filter, the effective sampling rate of a subband is reduced by 32 to 1, for
`example, from a frequency of 48 kHz to 1.5 kHz. Although the channels are
`bandlimited, they are still in PCM representation at this point in the algo-
rithm. The subbands are equal width, whereas the ear’s critical bands are not.
`With critical bands, the number of bits allocated may be equal in each band.
`This can be compensated for in equal subbands by unequally allocating bits
`to the subbands; more bits are given to code signals in lower-frequency sub-
`bands.
`
`The encoder analyzes the energy in each subband to determine which sub-
`bands contain audible information. This example of a Layer I encoder does
`not use an FFT side chain. The algorithm calculates average power levels in
`each subband over the 8-ms (12 sample) period. Masking levels in subbands
`and adjacent subbands are estimated. Minimum threshold levels are applied.
`Peak power levels in each subband are calculated and compared to mask-
`ing levels. The SMR ratio (difference between the maximum signal and
`the masking threshold) is calculated for each subband and is used to de-
termine the number of bits N_i assigned to a subband i such that N_i ≥
(SMR_i − 1.76)/6.02. A bit pool approach is taken to optimally code signals
`within the given bit rate. Quantized values form a mantissa, with a possible
`
[Figure 10.18: broadband audio enters the filter bank; an allocation calculation outputs allocation information; a scale factor generator outputs scale factors and scale factor indexes; the normalized subband samples are quantized, and the allocation information, scale factor indexes, and coding information form the output.]
`
`Figure 10.18 Example of an ISO/MPEG-1 Layer I encoder; the FFT side
`chain is omitted. (Philips)
`
`1
`Coding info
`
`
`range of 2 to 15 bits, thus a maximum resolution of 92 dB is available from
this part of the coding word. In practice, in addition to signal strength, man-
`tissa values also are affected by rate of change of the waveform pattern, and
`available data capacity. In any event, new mantissa values are calculated for
`every sample period.
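The word-length rule quoted above translates directly into code. This sketch applies N_i ≥ (SMR_i − 1.76)/6.02 with the 2- to 15-bit mantissa limits; the bit-pool iteration of a real encoder is omitted.

```python
import math

MIN_BITS, MAX_BITS = 2, 15

def allocate_bits(smr_db):
    """Minimum mantissa word length per subband so quantizing noise stays
    below the mask: N_i >= (SMR_i - 1.76) / 6.02. A real Layer I encoder
    then iterates against the frame's bit pool; that step is omitted."""
    alloc = []
    for smr in smr_db:
        if smr <= 0:                        # fully masked: zero allocation
            alloc.append(0)
        else:
            n = math.ceil((smr - 1.76) / 6.02)
            alloc.append(min(max(n, MIN_BITS), MAX_BITS))
    return alloc

print(allocate_bits([25.0, 12.0, -3.0, 89.0]))   # [4, 2, 0, 15]
```

A subband whose SMR is negative lies entirely below the masking threshold and is simply not transmitted, which is where most of the data reduction occurs.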
`Quantized values are normalized (scaled) to optimally use the dynamic
`range of the processor. Specifically, six exponent bits form a scale factor, which
is determined by the signal’s absolute amplitude. This scale factor covers the
`range from -118 dB to +6 dB in 2-dB steps. Because the audio signal varies
`slowly in relation to the sampling frequency, the masking threshold and scale
`factors are calculated only once for every group of 12 samples, forming a frame
`(12 samples/subband X 32 subbands = 384 samples). For every subband, the
`absolute peak value of the 12 samples is compared to a table of scale factors,
`and the closest (next highest) constant is applied; the other sample values are
`normalized to that factor, and during decoding will be used as multipliers to
compute the correct subband signal level.
A floating-point representation is used; one field contains a fixed-length 6-
`bit exponent, and another field contains a variable length 2- to 15-bit man-
`tissa. Every block of 12 subband samples may have different mantissa lengths
`and values, but would share the same exponent. Allocation information de-
`tailing the length of a mantissa is placed in a 4-bit field in each frame. Because
`the total number of bits representing each sample within a subband is con-
`stant, this allocation information (like the exponent) needs to be transmitted
`only once every 12 samples. A null allocation value is conveyed when a sub-
`band is not encoded; in this case neither exponent nor mantissa values within
`that subband are transmitted. The 15-bit mantissa yields a maximum signal-
`to-noise ratio of 92 dB. The 6-bit exponent can convey 64 values; however, a
`pattern of all 1’s is not used, and another value is used as a reference. There
`are thus 62 values, each representing 2-dB steps for an ideal total of 124 dB.
The reference is used to divide this into two ranges, one from 0 to -118 dB,
`and the other from 0 to +6 dB. The 6 dB of headroom is needed because a
`component in a single subband might have a peak amplitude 6 dB higher
`than the broadband composite audio signal. In this example, the broadband
`dynamic range is thus equivalent to 19 bits of linear coding.
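A sketch of the next-highest scale-factor selection follows. The 2-dB-step table here is constructed from the ranges described above purely for illustration; the normative table is defined in ISO/IEC 11172-3.

```python
# 62 usable 6-bit scale-factor values in 2-dB steps, largest first,
# spanning from +6 dB downward (an illustrative table, not the
# normative one from ISO/IEC 11172-3).
SCALE_FACTORS = [10 ** ((6 - 2 * i) / 20) for i in range(62)]

def pick_scale_factor(block):
    """Return (index, value) of the closest table entry at or above the
    block's peak absolute value, per the next-highest rule above."""
    peak = max(abs(s) for s in block)
    for idx in range(len(SCALE_FACTORS) - 1, -1, -1):   # smallest first
        if SCALE_FACTORS[idx] >= peak:
            return idx, SCALE_FACTORS[idx]
    return 0, SCALE_FACTORS[0]                          # clip at the top

idx, sf = pick_scale_factor([0.10, -0.45, 0.30])
assert idx == 6                       # the -6 dB entry (about 0.501)
# Normalizing by the chosen factor keeps every sample within [-1, 1]:
assert all(abs(s / sf) <= 1.0 for s in [0.10, -0.45, 0.30])
```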
`A complete frame contains synchronization information, sample bits, scale
`factors, bit allocation information, and control bits for sampling frequency
`information, emphasis, etc. The total number of bits in a frame (with 2 chan-
`nels, with 384 samples, over 8 ms, sampled at 48 kHz) is 3072. This in turn
`yields a 384-kbps transmission rate. With the addition of error detection and
`correction code, and modulation, the final bit rate to a storage medium might
`be 768 kbps. The first set of subband samples in a frame is calculated from
`512 samples by the 512-tap filter and the filter window is shifted by 32 sam-
`ples each time into 11 more positions during a frame period; thus each frame
`incorporates information from 864 broadband audio samples per channel.
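The frame-rate arithmetic for this example checks out directly:

```python
# Two-channel Layer I example from the text: 384 samples per channel
# per 8-ms frame at 48 kHz, 3072 bits per frame.
fs = 48_000
samples_per_frame = 384
bits_per_frame = 3072

frames_per_second = fs / samples_per_frame        # 125.0
bit_rate = bits_per_frame * frames_per_second     # 384_000.0 bps

assert frames_per_second == 125.0
assert bit_rate == 384_000.0
```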
`Sampling frequencies of 32 and 44.1 kHz also are supported, and because
the number of bands remains fixed at 32, the subband width becomes 689.06
Hz with a 44.1-kHz sampling frequency. Because the output bit rate is fixed
at 384 kbps, and 384 samples/channel per frame is fixed, there is a reduction
`in frame rate at sampling frequencies of 32 and 44.1 kHz, and thus an in-
`crease in the number of bits per frame. These additional bits per frame are
`used by the algorithm to further increase audio quality.
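The subband widths and frame periods quoted for the three sampling frequencies follow from the fixed 32-band split:

```python
# With 32 equal bands spanning fs/2, each subband is fs/64 wide, and a
# 384-sample frame lasts 384/fs seconds (values per the text above).
for fs, width_hz, frame_ms in ((32_000, 500.0, 12.0),
                               (44_100, 689.0625, 8.7),
                               (48_000, 750.0, 8.0)):
    assert fs / 64 == width_hz                     # subband width in Hz
    assert round(1000 * 384 / fs, 1) == frame_ms   # frame period in ms
```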
`Layer I decoding proceeds frame by frame, using the processing shown in
`Fig. 10.19. Data is reformatted to linear PCM by a subband decoder, using
`allocation information and scale factors. Received scale factors are placed in
an array with two columns of 32 rows, each six bits wide. Each column rep-
`resents an output channel, and each row represents one subband. The sub-
`band samples are multiplied by the scale factors to restore them to their quan-
`tized values; empty subbands are automatically assigned a zero value. A
`synthesis reconstruction filter recombines the 32 subbands into one broadband
`audio signal. This subband filter operates identically (but inversely) to the
input filter. As in the encoder, 384 samples/channel represent 8 ms of audio
`signal (at a sampling frequency of 48 kHz). Following this subband filtering,
`the signal is ready for reproduction through D/A converters.
`Because psychoacoustic processing, bit allocation, and other operations are
`not used in the decoder, its cost is quite low. More importantly, the decoder is
`transparent to improvements in encoder technology. If the psychoacoustic
`models used in encoders are improved, the resulting fidelity would improve
`as well. Because the encoding algorithm is a function of digital signal proc-
`essing, more sophisticated coding is possible. For example, because the num-
`ber of bits per frame varies according to sample rate, it might be expedient
`to c