`
`Sony Exhibit 1013
`Sony v. MZ Audio
`
`
`
`Supplied by the British Library 08 Sep 2022, 09:49 (BST)
`
`
`
Signal Processing 66 (1998) 337–355
`
`Robust audio watermarking using perceptual masking
`
Mitchell D. Swanson*, Bin Zhu, Ahmed H. Tewfik, Laurence Boney

Department of Electrical Engineering, University of Minnesota, Minneapolis, MN 55455, USA
École Nationale Supérieure des Télécommunications/SIG, 46 rue Barrault, 75634 Paris Cedex 13, France
`Received 10 February 1997; received in revised form 11 November 1997
`
`Abstract
`
`We present a watermarking procedure to embed copyright protection into digital audio by directly modifying the
`audio samples. Our audio-dependent watermarking procedure directly exploits temporal and frequency perceptual
`masking to guarantee that the embedded watermark is inaudible and robust. The watermark is constructed by breaking
`each audio clip into smaller segments and adding a perceptually shaped pseudo-random sequence. The noise-like
`watermark is statistically undetectable to prevent unauthorized removal. Furthermore, the author representation we
`introduce resolves the deadlock problem. We also introduce the notion of a dual watermark: one which uses the original
`signal during detection and one which does not. We show that the dual watermarking approach together with the
`procedure that we use to derive the watermarks effectively solves the deadlock problem. We also demonstrate the
`robustness of that watermarking procedure to audio degradations and distortions, e.g., those that result from colored
noise, MPEG coding, multiple watermarks, and temporal resampling. © 1998 Elsevier Science B.V. All rights reserved.
`
Zusammenfassung

We present a watermarking procedure for embedding copyright protection into digital audio by directly modifying the audio sample values. Our audio-dependent watermarking procedure directly exploits perceptual masking in the time and frequency domains to ensure that the embedded watermark is inaudible and robust. The watermark is constructed by dividing each audio clip into smaller segments and adding a perceptually shaped pseudo-random sequence. The noise-like watermark is statistically undetectable, to prevent unauthorized removal. Furthermore, the author representation we introduce resolves the deadlock problem. We also introduce the notion of dual watermarks: one that uses the original signal during detection and one that does not. We show that the dual-watermark approach, together with the procedure we use to derive the watermarks, effectively solves the deadlock problem. We also demonstrate the robustness of the watermarking procedure against audio degradations and distortions, e.g., those caused by colored noise, MPEG coding, multiple watermarks, and sampling-rate conversion. © 1998 Elsevier Science B.V. All rights reserved.
`
Résumé

We present in this paper a watermarking procedure for embedding copyright protection into digital audio data by directly modifying the audio samples. This procedure directly exploits temporal and frequency perceptual masking to guarantee that the digital watermark is inaudible and robust. The watermark is constructed by splitting each audio piece into smaller segments and adding a perceptually shaped pseudo-random sequence. The noise-like watermark is statistically undetectable, so as to prevent its unauthorized removal. Moreover, the author representation that we introduce resolves the deadlock problem. We also introduce the notion of a dual watermark: one that uses the original signal during detection and one that does not. We show that the dual-watermarking approach, combined with the procedure we use to derive the watermarks, effectively solves the deadlock problem. We also demonstrate the robustness of this watermarking procedure against audio degradations and distortions, such as those resulting from colored noise, MPEG coding, multiple watermarks, and temporal resampling. © 1998 Elsevier Science B.V. All rights reserved.

* Corresponding author.
This work was supported by AFOSR under grant AF/F49620-94-1-0461. Patent pending, Media Science, Inc., 1996.

0165-1684/98/$19.00 © 1998 Elsevier Science B.V. All rights reserved.
PII: S0165-1684(98)00014-0
`
`Keywords: Copyright protection; Masking; Digital watermarking
`
`1. Introduction
`
`Efficient distribution, reproduction, and manip-
`ulation have led to wide proliferation of digital
`media, e.g., audio, video, and images. However,
`these efficiencies also increase the problems asso-
`ciated with copyright enforcement. For this reason,
`creators and distributors of digital data are hesitant
`to provide access to their intellectual property. They
`are actively seeking reliable solutions to the prob-
`lems associated with copyright protection of multi-
`media data.
`Digital watermarking has been proposed as
`a means to identify the owner or distributor of
`digital data. Watermarking is the process of encod-
`ing hidden copyright information in digital data by
`making small modifications to the data samples.
`Unlike encryption, watermarking does not restrict
`access to the data. Once encrypted data is decrypted,
`the media is no longer protected. A watermark is
`designed to permanently reside in the host data.
`When the ownership of a digital work is in question,
`the information can be extracted to completely
`characterize the owner.
`To function as a useful and reliable intellectual
`property protection mechanism, the watermark
`must be:
• embedded within the host media;
• perceptually inaudible within the host media;
• statistically undetectable to ensure security and thwart unauthorized removal;
• robust to manipulation and signal processing operations on the host signal, e.g., noise, compression, cropping, resizing, D/A conversions, etc.; and
• readily extracted to completely characterize the copyright owner.
`In particular, the watermark may not be stored
`in a file header, a separate bit stream, or a separate
`file. Such copyright mechanisms are easily removed.
`The watermark must be inaudible within the host
`audio data to maintain audio quality. The water-
`mark must be statistically undetectable to thwart
`unauthorized removal by a ‘pirate’. A watermark
`which may be localized through averaging, correla-
`tion, spectral analysis, Kalman filtering, etc., may
`be readily removed or altered, thereby destroying
`the copyright information.
`The watermark must be robust to signal distor-
`tions, incidental and intentional, applied to the host
`data. For example, in most applications involving
`storage and transmission of audio, a lossy coding
`operation is performed on the audio to reduce
`bit-rates and increase efficiency. Operations which
`damage the host audio also damage the embedded
`watermark. The watermark is required to survive
`such distortions to identify the owner of the data.
`Furthermore, a resourceful pirate may use a variety
of signal processing operations to attack a digital watermark. A pirate may attempt to defeat
`a watermarking procedure in two ways: (1) damage
`the host audio to make the watermark undetectable,
`or (2) establish that the watermarking scheme is
`unreliable, i.e., it detects a watermark when none is
`present. The watermark should be impossible to
`defeat without destroying the host audio.
`
`Finally, the watermark should be readily extrac-
`ted given the watermarking procedure and the
`proper author signature. Without the correct signa-
`ture, the watermark cannot be removed. The ex-
`tracted watermark must correctly identify the owner
`and solve the deadlock issue (cf. Section 2) when
`multiple parties claim ownership.
`Watermarking digital media has received a great
`deal of attention recently in the literature and the
`research community. Most watermarking schemes
`focus on image and video copyright protection, e.g.,
[1–3,7,10,14,15,18,19,22,24]. A few audio water-
`marking techniques have been reported. Several
`techniques have been proposed in [1]. Using a phase
`coding approach, data is embedded by modifying
`the phase values of Fourier transform coefficients
of audio segments. Embedding data as spread-spectrum noise has also been proposed. A third tech-
`nique, echo coding, employs multiple decaying
`echoes to place a peak in the cepstrum at a known
`location. Another audio watermarking technique is
`proposed in [21], where Fourier transform coeffi-
`cients over the middle frequency bands are replaced
with spectral components from a signature. Some commercial products are also available. The ICE system from Central Research Laboratories inserts a pair of very short tone sequences into an audio track. An audio watermarking product, MusiCode, is available from ARIS Technologies.
`Most schemes utilize the fact that digital media
`contain perceptually insignificant components
`which may be replaced or modified to embed copy-
`right protection. However, the techniques do not
`directly exploit spatial/temporal and frequency
`masking. Thus, the watermark is not guaranteed
`inaudible. Furthermore, robustness is not maxi-
mized. The amount of modification made to each coefficient to embed the watermark is estimated and is not necessarily the maximum amount possible.
`In this paper, we introduce a novel watermarking
`scheme for audio which exploits the human auditory
`system (HAS) to guarantee that the embedded
watermark is imperceptible. As the perceptual characteristics of individual audio signals vary, the watermark adapts to and is highly dependent on
`the audio being watermarked. Our watermark is
`generated by filtering a pseudo-random sequence
`
`(author id) with a filter that approximates the
`frequency masking characteristics of the HAS. The
`resulting sequence is further shaped by the temporal
`masking properties of the audio. Based on pseudo-
`random sequences, the noise-like watermark is
`statistically undetectable. Furthermore, we will show
`in the sequel that the watermark is extremely robust
`to a large number of signal processing operations
`and is easily extracted to prove ownership.
`The work presented in this paper offers several
`major contributions to the field, including
`A perception-based watermarking procedure: The
`embedded watermark adapts to each individual
`host signal. In particular, the temporal and fre-
`quency distribution of the watermark are dictated
`by the temporal and frequency masking character-
`istics of the host audio signal. As a result, the
amplitude (strength) of the watermark increases and decreases with the host, e.g., lower amplitude in ‘quiet’ regions of the audio. This guarantees that
`the embedded watermark is inaudible while having
`the maximum possible energy. Maximizing the
`energy of the watermark adds robustness to attacks.
`An author representation which solves the deadlock
`problem: An author is represented with a pseudo-
`random sequence created by a pseudo-random
`generator [13] and two keys. One key is author
`dependent, while the second key is signal dependent.
`The representation is able to resolve rightful owner-
`ship in the face of multiple ownership claims.
A dual watermark: The watermarking scheme
`uses the original audio signal to detect the presence
`of a watermark. The procedure can handle virtually
`all types of distortions, including cropping, temporal
`rescaling, etc., using a generalized likelihood ratio
`test. As a result, the watermarking procedure is
`a powerful digital copyright protection tool. We
`integrate this procedure with a second watermark
`which does not require the original signal. The dual
`watermarks also address the deadlock problem.
`In the next section, we introduce our noise-like
`author representation and the dual watermarking
`scheme. Our frequency and temporal masking mod-
`els are reviewed in Section 3. Our watermarking
`design and detection algorithms are introduced in
`Sections 4 and 5. Finally, experimental results
`are presented in Section 6. Watermark statistics
`and fidelity results for four test audio signals are
`presented. The robustness of our watermarking
`procedure is illustrated for a wide assortment of
`signal processing operations and distortions. We
`present our conclusion in Section 7.
`
`2. Author representation, dual watermarking and
`the deadlock problem
`
`Data embedding algorithms may be used to
`establish ownership and distribution of data. In
`fact, this is the application of data embedding or
`watermarking that has received most attention in
`the literature. Unfortunately, most current water-
`marking schemes are unable to resolve rightful
`ownership of digital data when multiple ownership
`claims are made, i.e., when a deadlock problem
`arises. The inability of many data embedding algo-
`rithms to deal with deadlock, first described by
`Craver et al. [4], is independent of how the water-
`mark is inserted in the multimedia data or how
`robust it is to various types of modifications.
`Today, no scheme can unambiguously determine
`ownership of a given multimedia signal if it does
`not use an original or other copy in the detection
`process to at least construct the watermark to be
`detected. A pirate can simply add his or her water-
`mark to the watermarked data or counterfeit
`a watermark that correlates well or is detected in
`the contested signal. Current data embedding
`schemes used as copyright protection algorithms
`are unable to establish who watermarked the data
first. Furthermore, unless the watermark is constrained to depend partially and non-invertibly on the signal, none of the current data embedding schemes has been proven immune to counterfeit watermarks that correlate well with a given signal.
`If the detection scheme can make use of the
`original to construct the watermark, then it may be
`possible to establish unambiguous ownership of the
`data regardless of whether the detection scheme
`subtracts the original from the signal under consid-
`eration prior to watermark detection or not. Spe-
`cifically, [5] derives a set of sufficient conditions
`that watermarks and watermarking schemes must
`satisfy to provide unambiguous proof of ownership.
`For example, one can use watermarks derived from
`
`pseudo-random sequences that depend on the signal
`and the author. Ref. [5] establishes that this will
`work for all watermarking procedures regardless of
`whether they subtract the original from the signal
`under consideration prior to watermark detection
or not. Ref. [20] independently derived a similar result for a restricted class of watermarking techniques that rely on subtracting a signal derived from the original from the signal under consideration prior to watermark detection. The signal-dependent key also helps to thwart the ‘mix-and-match’ attack described in [5].
`An author can construct a watermark that de-
`pends on the audio signal and the author and
`provides unambiguous proof of ownership as fol-
lows. The author has two random keys x1 and x2 (i.e., seeds) from which a pseudo-random sequence y can be generated using a suitable pseudo-random sequence generator [16]. Popular
`generators include RSA, Rabin, Blum/Micali, and
`Blum/Blum/Shub [6]. With the two proper keys,
`the watermark may be extracted. Without the two
`keys, the data hidden in the signal is statistically
undetectable and impossible to recover. Note that classical maximal-length pseudo-noise sequences (i.e., m-sequences) generated by linear feedback shift registers are not used to generate a watermark.
`Sequences generated by shift registers are crypto-
`graphically insecure: one can solve for the feedback
`pattern (i.e., the keys) given a small number of
`output bits y.
`The noise-like sequence y may be used to derive
`the actual watermark hidden into the audio signal
`or control the operation of the watermarking algo-
`rithm, e.g., determine the location of samples that
may be modified. The key x1 is author dependent; the key x2 is signal dependent. The key x1 is the secret key assigned to (or chosen by) the author. Key x2 is computed from the audio signal which the author wishes to watermark. It is computed from
`the signal using a one-way hash function. For
`example, the tolerable error levels supplied by
masking models (see Section 3) are hashed in [20] to a key x2. Any one of a number of well-known secure one-way hash functions may be used to compute x2, including RSA, MD4 [17], and SHA [12]. For example, the Blum/Blum/Shub pseudo-random generator uses the one-way function
`
y = g_n(x) = x^2 mod n, where n = pq for primes p and q such that p ≡ q ≡ 3 (mod 4). It can be shown that generating x or y from partial knowledge of y is computationally infeasible for the Blum/Blum/Shub generator.
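As a concrete illustration, the squaring generator above can be sketched in a few lines. This is a toy sketch, not the paper's implementation: the primes below are far too small to be secure, and in the scheme described here the seed would in practice be derived from the author key x1 and the signal-dependent key x2.

```python
# Sketch of a Blum/Blum/Shub-style bit generator (toy parameters, NOT secure).

def bbs_bits(seed, p, q, nbits):
    """Generate nbits pseudo-random bits by iterating x -> x^2 mod n."""
    assert p % 4 == 3 and q % 4 == 3, "BBS needs p = q = 3 (mod 4)"
    n = p * q
    x = seed % n
    bits = []
    for _ in range(nbits):
        x = (x * x) % n      # the one-way squaring step y = x^2 mod n
        bits.append(x & 1)   # emit the least-significant bit
    return bits

# Toy usage (a real deployment would use primes hundreds of digits long):
bits = bbs_bits(seed=1234, p=499, q=547, nbits=16)
```

Because the squaring map is one-way, observing output bits does not reveal the seed, which is the property the author representation relies on.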
The signal-dependent key x2 makes counterfeiting very difficult. The pirate can only provide key x1 to the arbitrator. Key x2 is automatically computed
`by the watermarking algorithm from the original
`signal. As it is computationally infeasible to invert
`the one-way hash function, the pirate is unable to
`fabricate a counterfeit original which generates
`a desired or predetermined watermark.
`Deadlock may also be resolved using the dual
`watermarking scheme of [20]. That scheme employs
`a pair of watermarks. One watermarking procedure
`requires the original data set for watermark detec-
`tion. This paper provides a detailed description of
`that procedure and of its robustness. The second
`watermarking procedure does not require the orig-
`inal data set. A data embedding technique which
`satisfies the restrictions outlined in [5] can be used
`to insert the second watermark. The second water-
`mark need not be highly robust to editing of the
`audio segment since, as we shall see below, it is
`meant to protect the audio clip that a pirate claims
`to be his original. The robustness level of most of
`the recent watermarking techniques that do not
`require the original for watermark detection is quite
`adequate. The arbitrator would expect the original
`to be of a high enough quality. This limits the
`operations that a pirate can apply to an audio clip
`and still claim it to be his high-quality original
`sound. The watermark that requires the original
`audio sequence for its detection is very robust as we
`show in this paper.
`In case of deadlock, the arbitrator simply first
`checks for the watermark that requires the original
`for watermark detection. If the pirate is clever and
`has used the attack suggested in [4] and outlined
`above, the arbitrator would be unable to resolve
`the deadlock with this first test. The arbitrator
`simply then checks for the watermark that does not
`require the original audio sequence in the audio
`segments that each ownership contender claims to
`be his original. Since the original audio sequence of
`a pirate is derived from the watermarked copy
`produced by the rightful owner, it will contain the
`
`watermark of the rightful owner. On the other
`hand, the true original of the rightful owner will not
`contain the watermark of the pirate since the pirate
`has no access to that original and the watermark
`does not require subtraction of another data set for
`its detection.
`
`3. Audio masking
`
`Audio masking is the effect by which a faint but
`audible sound becomes inaudible in the presence of
`another louder audible sound, i.e., the masker [9].
`The masking effect depends on the spectral and
`temporal characteristics of both the masked signal
`and the masker. Our watermarking procedure
`directly exploits both frequency and temporal mask-
`ing characteristics to embed an inaudible and robust
`watermark.
`
`3.1. Frequency masking
`
`Frequency masking refers to masking between
`frequency components in the audio signal. If two
`signals, which occur simultaneously, are close to-
`gether in frequency, the stronger masking signal
`may make the weaker signal inaudible. The masking
`threshold of a masker depends on the frequency,
`sound pressure level (SPL), and tone-like or noise-
`like characteristics of both the masker and the
masked signal [13]. It is easier for broadband noise to mask a tonal signal than for a tonal signal to mask out broadband noise. Moreover, higher-
`frequency signals are more easily masked.
`The human ear acts as a frequency analyzer and
`can detect sounds with frequencies which vary from
`10 to 20 000 Hz. The HAS can be modeled by a set
`of 26 band-pass filters with bandwidths that increase
`with increasing frequency. The 26 bands are known
`as the critical bands. The critical bands are defined
`around a center frequency in which the noise band-
`width is increased until there is a just noticeable
`difference in the tone at the center frequency. Thus,
`if a faint tone lies in the critical band of a louder
`tone, the faint tone will not be perceptible.
`Frequency masking models are readily obtained
`from the current generation of high-quality audio
coders. In this work, we use the masking model
`defined in ISO-MPEG Audio Psychoacoustic
`Model 1, for Layer I [8]. We are currently updating
`our frequency masking model to the model specified
`by ISO-MPEG Audio Layer III. The Layer I mask-
`ing method is summarized as follows for a 32 kHz
`sampling rate [8,11]. The MPEG model also sup-
`ports sampling rates of 44.1 kHz and 48 kHz.
`
Step 1: Calculate the spectrum. Each 16 ms segment of the signal s(n), N = 512 samples, is weighted with a Hann window h(n):

h(n) = \sqrt{8/3}\,\tfrac{1}{2}\left[1 - \cos(2\pi n/N)\right].   (1)

The power spectrum of the signal s(n) is calculated as

S(k) = 10\log_{10}\left|\frac{1}{N}\sum_{n=0}^{N-1} s(n)\,h(n)\,e^{-j2\pi nk/N}\right|^{2}.   (2)

The maximum is normalized to a reference sound pressure level of 96 dB. The power spectrum of a 32 kHz test signal is shown in Fig. 1.
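A minimal sketch of Step 1 in Python/NumPy, under the stated parameters (N = 512, 96 dB reference); the small epsilon guarding the logarithm is an implementation detail of the sketch, not part of the model.

```python
import numpy as np

def block_power_spectrum(s, N=512):
    """Hann-window one block of N samples and return its power spectrum
    in dB (Eqs. (1)-(2)), normalized so the peak sits at 96 dB SPL."""
    n = np.arange(N)
    h = np.sqrt(8.0 / 3.0) * 0.5 * (1.0 - np.cos(2.0 * np.pi * n / N))  # Eq. (1)
    X = np.fft.fft(s[:N] * h)
    S = 10.0 * np.log10(np.abs(X / N) ** 2 + 1e-20)  # Eq. (2); eps avoids log(0)
    return S + (96.0 - S.max())                      # normalize peak to 96 dB

# A 1 kHz tone at a 32 kHz sampling rate lands exactly in bin 16:
fs = 32000
S = block_power_spectrum(np.sin(2 * np.pi * 1000 * np.arange(512) / fs))
```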
`
Step 2: Identify tonal components. Tonal (sinusoidal) and non-tonal (noisy) components are identified because their masking models are different.
A tonal component is a local maximum of the spectrum (S(k) > S(k+1) and S(k) \ge S(k-1)) satisfying

S(k) - S(k+j) \ge 7\ \text{dB}, \quad
\begin{cases}
j \in \{-2, +2\} & \text{if } 2 < k < 63,\\
j \in \{-3, -2, +2, +3\} & \text{if } 63 \le k < 127,\\
j \in \{-6, \ldots, -2, +2, \ldots, +6\} & \text{if } 127 \le k \le 250.
\end{cases}
`
We add to its intensity those of the previous and following components; other tonal components in the same frequency band are no longer considered.
`Non-tonal components are made of the sum of the
`intensities of the signal components remaining in
`each of the 24 critical bands between 0 and
`15 500 Hz. The auditory system behaves as a bank
`of bandpass filters, with continuously overlapping
`center frequencies. These ‘auditory filters’ can be
`approximated by rectangular filters with critical
`
`Fig. 1. Power spectrum of audio signal.
`
`bandwidth increasing with frequency. In this model,
`the audible band is therefore divided into 24 non-
`regular critical bands. Tonal and non-tonal compo-
`nents of the example audio signal are shown in
`Fig. 2.
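The Step 2 tonal test can be sketched as follows. This is a sketch only: the offsets |j| = 1 are omitted because the local-maximum test already covers the immediate neighbours.

```python
import numpy as np

def tonal_components(S):
    """Return the bins k of a 512-point dB spectrum S that pass the Step 2
    tonal test: a local maximum exceeding its neighbours at the listed
    offsets j by at least 7 dB. Only bins 2 < k <= 250 are examined."""
    tonal = []
    for k in range(3, 251):
        if not (S[k] > S[k + 1] and S[k] >= S[k - 1]):
            continue  # not a local maximum
        if k < 63:
            offsets = (-2, 2)
        elif k < 127:
            offsets = (-3, -2, 2, 3)
        else:
            offsets = (-6, -5, -4, -3, -2, 2, 3, 4, 5, 6)
        if all(S[k] - S[k + j] >= 7.0 for j in offsets):
            tonal.append(k)
    return tonal

# A single 30 dB spike over a flat floor is classified as tonal:
spike = np.zeros(512)
spike[20] = 30.0
```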
`Step 3: Remove masked components. Components
`below the absolute hearing threshold and tonal
`components separated by less than 0.5 Barks are
`removed. A plot of the removed components, along
`with the absolute hearing threshold is shown in
`Fig. 3.
`Step 4: Individual and global masking thresholds.
`In this step, we account for the frequency masking
`effects of the HAS. We need to discretize the fre-
`quency axis according to hearing sensitivity and
`express frequencies in Barks. Note that hearing
`sensitivity is higher at low frequencies. The resulting
`masking curves are almost linear and depend on
`a masking index different for tonal and non-tonal
`components. They are characterized by different
lower and upper slopes depending on the distance
`between the masked and the masking component.
We use f to denote the set of frequencies present in the test signal. The global masking threshold for each frequency f takes into account the absolute hearing threshold S_a(f) and the masking curves P_j(f) of the N_t tonal components and the N_n non-tonal components:

S_g(f) = 10\log_{10}\left(10^{S_a(f)/10} + \sum_{j=1}^{N_t} 10^{P_j(f)/10} + \sum_{j=1}^{N_n} 10^{P_j(f)/10}\right).   (3)
`
`The masking threshold is then the minimum of
`the local masking threshold and the absolute hear-
`ing threshold in each of the 32 equal width sub-
`bands of the spectrum. Any signal which falls below
`the masking threshold is inaudible. A plot of the
`original spectrum, along with the masking threshold,
`is shown in Fig. 4.
As a result, for each audio block of N = 512 samples, a masking value (i.e., threshold) for each frequency component is produced. Modifications to audio frequency components that remain below the masking threshold create no audible distortion in the audio piece.
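The intensity summation in Eq. (3) can be sketched directly: thresholds expressed in dB are converted to linear power, summed, and converted back. The function below is a sketch that takes precomputed masking curves as plain arrays.

```python
import numpy as np

def global_threshold(abs_thresh_db, masking_curves_db):
    """Combine the absolute hearing threshold with individual masking
    curves by adding intensities, as in Eq. (3)."""
    total = 10.0 ** (np.asarray(abs_thresh_db, dtype=float) / 10.0)
    for P in masking_curves_db:
        total = total + 10.0 ** (np.asarray(P, dtype=float) / 10.0)
    return 10.0 * np.log10(total)

# Two equal 60 dB maskers combine to about 63 dB (doubling power adds 3 dB):
g = global_threshold([-100.0], [[60.0], [60.0]])
```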
`
`Fig. 2. Identification of tonal components.
`
`Fig. 3. Removal of masked components.
`
3.2. Temporal masking
`
Temporal masking refers to both pre- and post-masking. Pre-masking effects render weaker signals inaudible before the stronger masker is turned on, and post-masking effects render weaker signals inaudible after the stronger masker is turned off. Pre-masking occurs from 5 to 20 ms before the masker is turned on, while post-masking occurs from 50 to 200 ms after the masker is turned off [13]. Note that temporal and frequency masking effects have dual localization properties. Specifically, frequency masking effects are localized in the frequency domain, while temporal masking effects are localized in the time domain.

We approximate temporal masking effects using the envelope of the host audio. The envelope is modeled as a decaying exponential. In particular, the estimated envelope t(i) of signal s(i) increases with the signal and decays as e^{-\alpha t}. An audio signal, along with its estimated envelope, is shown in Fig. 5.
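The envelope model can be sketched as a one-pole peak tracker: it follows |s(i)| instantly on attack and decays exponentially on release. The decay constant alpha below is an illustrative choice, not a value taken from the paper.

```python
import numpy as np

def temporal_envelope(s, alpha=0.001):
    """Estimate the temporal mask t(i): rise instantly with |s(i)|, decay
    as exp(-alpha) per sample once the signal drops (alpha is assumed)."""
    decay = np.exp(-alpha)
    t = np.empty(len(s))
    env = 0.0
    for i, x in enumerate(np.abs(s)):
        env = max(x, env * decay)  # instant attack, exponential release
        t[i] = env
    return t
```

In 'quiet' regions well after a loud event the envelope has decayed to nearly zero, which is what keeps the watermark from disturbing those regions.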
`
4. Watermark design

Each audio signal is watermarked with a unique noise-like sequence shaped by the masking phenomena. The watermark consists of (1) an author representation (cf. Section 2), and (2) spectral and temporal shaping using the masking effects of the HAS.

Our watermarking scheme is based on a repeated application of a basic watermarking operation on smaller segments of the audio signal. A diagram of our audio watermarking technique is shown in Fig. 6. The length-N audio signal is first segmented into blocks s_i(k) of length 512 samples, i = 0, 1, ..., ⌊N/512⌋ − 1, and k = 0, 1, ..., 511. The block size of 512 samples is dictated by the frequency masking model we employ. Block sizes of 1024 have also been used. The algorithm works as follows. For each audio segment s_i(k):
1. compute the power spectrum S_i(k) of the audio segment s_i(k) (Eq. (2));
2. compute the frequency mask M_i(k) of the power spectrum S_i(k) (cf. Section 3.1);
`
`Fig. 4. Original spectrum and masking threshold.
`
3. use the mask M_i(k) to weight the noise-like author representation for that audio block, creating the shaped author signature P_i(k) = Y_i(k) M_i(k);
4. compute the inverse FFT of the shaped noise, p_i(k) = IFFT(P_i(k));
5. compute the temporal mask t_i(k) of s_i(k) (cf. Section 3.2);
6. use the temporal mask t_i(k) to further shape the frequency-shaped noise, creating the watermark w_i(k) = t_i(k) p_i(k) of that audio segment;
7. create the watermarked block s'_i(k) = s_i(k) + w_i(k).
The overall watermark for a signal is simply the concatenation of the watermark segments w_i for all of the length-512 audio blocks. The author signature y_i for block i is computed in terms of the personal author key x1 and the signal-dependent key x2 computed from block s_i.
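The block loop above can be sketched as follows. The sketch assumes the per-block frequency masks, temporal masks, and spectral author signatures Y_i(k) have already been produced by the models of Sections 2 and 3; the helper names are ours, not the paper's.

```python
import numpy as np

def embed_block(s_blk, y_blk, freq_mask, temp_mask):
    """Apply steps 3-7 to one 512-sample block: shape the author signature
    spectrally, return to the time domain, shape temporally, and add."""
    P = y_blk * freq_mask        # step 3: P_i(k) = Y_i(k) M_i(k)
    p = np.real(np.fft.ifft(P))  # step 4: p_i(k) = IFFT(P_i(k))
    w = temp_mask * p            # step 6: w_i(k) = t_i(k) p_i(k)
    return s_blk + w, w          # step 7: watermarked block and watermark

def embed(signal, signatures, freq_masks, temp_masks, B=512):
    """Watermark a signal block by block; the overall watermark is the
    concatenation of the per-block watermarks (Section 4)."""
    out = signal.astype(float).copy()
    wm = np.zeros(len(signal))
    for i in range(len(signal) // B):
        sl = slice(i * B, (i + 1) * B)
        out[sl], wm[sl] = embed_block(signal[sl], signatures[i],
                                      freq_masks[i], temp_masks[i])
    return out, wm
```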
`The dual localization effects of the frequency and
`temporal masking control the watermark in both
domains. As noted earlier, frequency-domain shaping alone is not enough to guarantee that the
`watermark will be inaudible. Frequency-domain
`masking computations are based on a Fourier
`transform analysis. A fixed length Fourier transform
`
`does not provide good time localization for our
`application. In particular, a watermark computed
`using frequency-domain masking will spread in time
`over the entire analysis block. If the signal energy is
`concentrated in a time interval that is shorter than
`the analysis block length, the watermark is not
`masked outside of that subinterval. This leads to
`audible distortion, e.g., pre-echoes. The temporal
`mask guarantees that the ‘quiet’ regions are not
`disturbed by the watermark.
`
`5. Watermark detection
`
`The watermark should be extractable even if
`common signal processing operations are applied
`to the host audio. This is particularly true in the
`case of deliberate unauthorized attempts to remove
`it. For example, a pirate may attempt to add noise,
`filter, code, re-sample, etc., an audio piece in an
`attempt to destroy the watermark. As the embedded
`watermark is noise-like, a pirate has insufficient
`knowledge to directly remove the watermark.
`Therefore, any destruction attempts are done
`blindly.
`
`Fig. 5. Audio signal and estimated envelope.
`
`Fig. 6. Diagram of audio watermarking procedure.
`
Let r(i), 0 ≤ i ≤ N − 1, be N samples of a recovered audio piece which may or may not have a watermark. Assume first that we know the exact location of the received signal. Without loss of generality, we will assume that r(i) = s(i) + d(i), 0 ≤ i ≤ N − 1, where d(i) is a disturbance that consists of noise only, or noise and a watermark. The detection scheme relies on the fact that the author or arbitrator has access to, or can compute, the original signal and the two keys x1 and x2 required to generate the pseudo-random sequence y. Therefore, detection of the watermark is accomplished via hypothesis testing. Since s(i) is known, we specifically need to consider the hypothesis test

H_0: t(i) = r(i) - s(i) = n(i), \quad 0 \le i \le N-1 \quad (\text{no watermark}),
H_1: t(i) = r(i) - s(i) = w(i) + n(i), \quad 0 \le i \le N-1 \quad (\text{watermark}),   (4)
`
`
`where w(i) is the potentially modified watermark,
`and n(i) is noise. The correct hypothesis is estimated
`by measuring the similarity between the extracted
`signal t(i) and original watermark w(i):
\mathrm{Sim}(t, w) = \frac{\sum_{j=0}^{N-1} t(j)\,w(j)}{\sum_{j=0}^{N-1} w(j)\,w(j)},   (5)

and comparing it with a threshold T. Note that Eq. (5) implicitly assumes that the noise n(i) is white and Gaussian with zero mean, even though this assumption may not be true. It also assumes that w(i) has not been modified. These two assumptions do not hold in most situations. However, our experiments indicate that, in practice, the detection test given in Eq. (5) is very robust (see Section 6). Our experiments also indicate that a threshold T = 0.15 yields high detection performance.
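The test of Eqs. (4)-(5) reduces to a few lines once the original and the watermark are available; the sketch below uses the threshold T = 0.15 quoted above.

```python
import numpy as np

def detect(received, original, watermark, threshold=0.15):
    """Extract t = r - s and compare its normalized correlation with the
    known watermark against the threshold (Eq. (5))."""
    t = received - original
    sim = float(np.dot(t, watermark) / np.dot(watermark, watermark))
    return sim, sim > threshold
```

For an unmodified watermarked copy the similarity is exactly 1; for an unwatermarked, noise-free copy it is 0, so the threshold separates the two hypotheses with a wide margin.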
Suppose now that we do not know the location of the observed clip r(i). Specifically, suppose that r(i) = s(i + Δ) + d(i), 0 ≤ i ≤ N − 1, where, as before, d(i) is a disturbance that consists of noise only, or noise and a watermark, and Δ is the unknown delay corresponding to the clip. Note that Δ is not necessarily an integer. In this case, we need to perform a generalized likelihood ratio test [23] to determine whether the received signal has been watermarked or not. Once more, we assume that the noise n(i) is white and Gaussian with zero mean even though this may not be true. This leads us to compare the ratio
`
\frac{\displaystyle \max_{\Delta} \exp\!\left(-\sum_{i=0}^{N-1}\bigl(r(i) - (s(i+\Delta) + w(i+\Delta))\bigr)^{2}\right)}{\displaystyle \max_{\Delta} \exp\!\left(-\sum_{i=0}^{N-1}\bigl(r(i) - s(i+\Delta)\bigr)^{2}\right)}   (6)
`
with a threshold. If this ratio is higher than the threshold, we declare the watermark to be present. Note that since Δ is not necessarily an integer, computing the numerator and denominator of Eq. (6) requires that we perform interpolation or evaluate these expressions in the Fourier domain using Parseval’s theorem.
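For integer candidate delays, the log of the ratio in Eq. (6) is simply the difference of the two best residual energies. A sketch under that restriction (constants dropped; fractional delays would need the interpolation noted above):

```python
import numpy as np

def glrt_delay(r, s, w, delays):
    """Log of the Eq. (6) ratio over integer candidate delays: the best
    residual energy without the watermark minus the best residual energy
    with it. Positive values favour the watermark hypothesis."""
    N = len(r)
    e1 = min(np.sum((r - (s[d:d + N] + w[d:d + N])) ** 2) for d in delays)
    e0 = min(np.sum((r - s[d:d + N]) ** 2) for d in delays)
    return e0 - e1
```

Maximizing the exponentials over Δ is equivalent to minimizing the residual energies, which is why the sketch searches for the smallest sum of squares under each hypothesis.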
`A generalized likelihood ratio test is also needed
`if one suspects that the rec