throbber
A Highly Robust Audio Fingerprinting System
`
`A Highly Robust Audio Fingerprinting System
`Ton Kalker
`Jaap Haitsma
`Philips Research
`Philips Research
`Prof. Holstlaan 4, 5616 BA,
`Prof. Holstlaan 4, 5616 BA,
`Eindhoven, The Netherlands
`Eindhoven, The Netherlands
`++31-40-2742648
`++31-40-2743839
`Jaap.Haitsma@philips.com
`Ton.Kalker@ieee.org
`
`ABSTRACT
`Imagine the following situation. You’re in your car, listening to
`the radio and suddenly you hear a song that catches your attention.
`It’s the best new song you have heard for a long time, but you
`missed the announcement and don’t recognize the artist. Still, you
`would like to know more about this music. What should you do?
`You could call the radio station, but that’s too cumbersome.
`Wouldn’t it be nice if you could push a few buttons on your
`mobile phone and a few seconds later the phone would respond
`with the name of the artist and the title of the music you’re
`listening to? Perhaps even sending an email to your default email
`address with some supplemental information. In this paper we
`present an audio fingerprinting system, which makes the above
`scenario possible. By using the fingerprint of an unknown audio
`clip as a query on a fingerprint database, which contains the
`fingerprints of a large library of songs, the audio clip can be
`identified. At the core of the presented system are a highly robust
`fingerprint extraction method and a very efficient fingerprint
`search strategy, which enables searching a large fingerprint
`database with only limited computing resources.
`1. INTRODUCTION
`Fingerprint systems are over one hundred years old. In 1893 Sir
`Francis Galton was the first to “prove” that no two fingerprints of
`human beings were alike. Approximately 10 years later Scotland
`Yard accepted a system designed by Sir Edward Henry for
`identifying fingerprints of people. This system relies on the pattern
`of dermal ridges on the fingertips and still forms the basis of all
`“human” fingerprinting techniques of today. This type of forensic
`“human” fingerprinting system has however existed for longer
`than a century, as 2000 years ago Chinese emperors were already
`using thumbprints to sign important documents. The implication is
`that already those emperors (or at least their administrative
`servants) realized that every fingerprint was unique. Conceptually
`a fingerprint can be seen as a “human” summary or signature that
`is unique for every human being. It is important to note that a
`human fingerprint differs from a textual summary in that it does
`not allow the reconstruction of other aspects of the original. For
`example, a human fingerprint does not convey any information
`about the color of the person’s hair or eyes.
`Recent years have seen a growing scientific and industrial interest
`in computing fingerprints of multimedia objects [1][2][3][4]
`[5][6]. The growing industrial interest is shown among others by a
`large number of (startup) companies [7][8][9][10][11][12][13]
`and the recent request for information on audio fingerprinting
`technologies by the International Federation of the Phonographic
`Industry (IFPI) and the Recording Industry Association of
`America (RIAA) [14].
`
`Permission to make digital or hard copies of all or part of this
`work for personal or classroom use is granted without fee
`provided that copies are not made or distributed for profit or
`commercial advantage and that copies bear this notice and the
`full citation on the first page.
`© 2002 IRCAM – Centre Pompidou
`
`3.
`
`2.
`
`The prime objective of multimedia fingerprinting is an efficient
`mechanism to establish the perceptual equality of two multimedia
`objects: not by comparing the (typically large) objects themselves,
`but by comparing the associated fingerprints (small by design). In
`most systems using fingerprinting technology, the fingerprints of a
`large number of multimedia objects, along with their associated
`meta-data (e.g. name of artist, title and album) are stored in a
`database. The fingerprints serve as an index to the meta-data. The
`meta-data of unidentified multimedia content are then retrieved by
`computing a fingerprint and using this as a query in the
`fingerprint/meta-data database. The
`advantage of using
`fingerprints instead of the multimedia content itself is three-fold:
`1. Reduced memory/storage requirements as fingerprints
`are relatively small;
`Efficient comparison as perceptual irrelevancies have
`already been removed from fingerprints;
`Efficient searching as the dataset to be searched is
`smaller.
`As can be concluded from above, a fingerprint system generally
`consists of two components: a method to extract fingerprints and a
`method to efficiently search for matching fingerprints in a
`fingerprint database.
`This paper describes an audio fingerprinting system that is suitable
`for a large number of applications. After defining the concept of
`an audio fingerprint in Section 2 and elaborating on possible
`applications in Section 3, we focus on the technical aspects of the
`proposed audio fingerprinting system. Fingerprint extraction is
`described in Section 4 and fingerprint searching in Section 5.
`2. AUDIO FINGERPRINTING CONCEPTS
`2.1 Audio Fingerprint Definition
`Recall that an audio fingerprint can be seen as a short summary of
`an audio object. Therefore a fingerprint function F should map an
`audio object X, consisting of a large number of bits, to a
`fingerprint of only a limited number of bits.
`Here we can draw an analogy with so-called hash functions1,
`which are well known in cryptography. A cryptographic hash
`function H maps an (usually large) object X to a (usually small)
`hash value (a.k.a. message digest). A cryptographic hash function
`allows comparison of two large objects X and Y, by just
`comparing their respective hash values H(X) and H(Y). Strict
`mathematical equality of the latter pair implies equality of the
`former, with only a very low probability of error. For a properly
`designed cryptographic hash function this probability is 2-n, where
`n equals the number of bits of the hash value. Using cryptographic
`hash functions, an efficient method exists to check whether or not
`a particular data item X is contained in a given and large data set
`Y={Yi}. Instead of storing and comparing with all of the data in Y,
`
`1 In the literature fingerprinting is sometimes also referred to as
`robust or perceptual hashing[5].
`
`EX1029
`Roku V. Media Chain
`U.S. Patent No. 9,715,581
`
`0001
`
`

`

`A Highly Robust Audio Fingerprinting System
`
`it is sufficient to store the set of hash values {hi = H(Yi)}, and to
`compare H(X) with this set of hash values.
`At first one might think that cryptographic hash functions are a
`good candidate for fingerprint functions. However recall from the
`introduction that, instead of strict mathematical equality, we are
`interested in perceptual similarity. For example, an original CD
`quality version of ‘Rolling Stones – Angie’ and an MP3 version at
`128Kb/s sound the same to the human auditory system, but their
`waveforms can be quite different. Although the two versions are
`perceptually similar they are mathematically quite different.
`Therefore cryptographic hash functions cannot decide upon
`perceptual equality of
`these
`two versions. Even worse,
`cryptographic hash functions are typically bit-sensitive: a single
`bit of difference in the original object results in a completely
`different hash value.
`Another valid question the reader might ask is: “Is it not possible
`to design a fingerprint function that produces mathematically
`equal fingerprints for perceptually similar objects?” The question
`is valid, but the answer is that such a modeling of perceptual
`similarity is fundamentally not possible. To be more precise: it is a
`known fact that perceptual similarity is not transitive. Perceptual
`similarity of a pair of objects X and Y and of another pair of
`objects Y and Z does not necessarily imply the perceptual
`similarity of objects X and Z. However modeling perceptual
`similarity by mathematical equality of fingerprints would lead to
`such a relationship.
`Given the above arguments, we propose to construct a fingerprint
`function in such a way that perceptual similar audio objects result
`in similar fingerprints. Furthermore,
`in order
`to be able
`discriminate between different audio objects, there must be a very
`high probability that dissimilar audio objects result in dissimilar
`fingerprints. More mathematically, for a properly designed
`fingerprint function F, there should be a threshold T such that
`with very high probability ||F(X)-F(Y)||T if objects X and Y are
`similar and ||F(X)-F(Y)||>T when they are dissimilar.
`2.2 Audio Fingerprint System Parameters
`Having a proper definition of an audio fingerprint we now focus
`on the different parameters of an audio fingerprint system. The
`main parameters are:
` Robustness: can an audio clip still be identified after
`severe signal degradation? In order to achieve high
`robustness the fingerprint should be based on perceptual
`features that are invariant (at least to a certain degree)
`with respect to signal degradations. Preferably, severely
`degraded audio still leads to very similar fingerprints.
`The false negative rate is generally used to express the
`robustness. A
`false negative occurs when
`the
`fingerprints of perceptually similar audio clips are too
`different to lead to a positive match.
` Reliability: how often is a song incorrectly identified?
`E.g. “Rolling Stones – Angie” being identified as
`“Beatles – Yesterday”. The rate at which this occurs is
`usually referred to as the false positive rate.
`Fingerprint size: how much storage is needed for a
`fingerprint? To enable fast searching, fingerprints are
`usually stored
`in RAM memory. Therefore
`the
`fingerprint size, usually expressed in bits per second or
`bits per song, determines to a large degree the memory
`resources that are needed for a fingerprint database
`server.
` Granularity: how many seconds of audio is needed to
`identify an audio clip? Granularity is a parameter that
`
`
`
`
`
`can depend on the application. In some applications the
`whole song can be used for identification, in others one
`prefers to identify a song with only a short excerpt of
`audio.
`Search speed and scalability: how long does it take to
`find a fingerprint in a fingerprint database? What if the
`database contains thousands and thousands of songs?
`For the commercial deployment of audio fingerprint
`systems, search speed and scalability are a key
`parameter. Search speed should be in the order of
`milliseconds for a database containing over 100,000
`songs using only limited computing resources (e.g. a
`few high-end PC’s).
`These five basic parameters have a large impact on each other. For
`instance, if one wants a lower granularity, one needs to extract a
`larger fingerprint to obtain the same reliability. This is due to the
`fact that the false positive rate is inversely related to the
`fingerprint size. Another example: search speed generally
`increases when one designs a more robust fingerprint. This is due
`to the fact that a fingerprint search is a proximity search. I.e. a
`similar (or the most similar) fingerprint has to be found. If the
`features are more robust the proximity is smaller. Therefore the
`search speed can increase.
`3. APPLICATIONS
`In this section we elaborate on a number of applications for audio
`fingerprinting.
`3.1 Broadcast Monitoring
`Broadcast monitoring is probably the most well known application
`for audio fingerprinting[2][3][4][5][12][13]. It refers to the
`automatic playlist generation of radio,
`television or web
`broadcasts for, among others, purposes of royalty collection,
`program verification, advertisement verification and people
`metering. Currently broadcast monitoring is still a manual process:
`i.e. organizations interested in playlists, such as performance
`rights organizations, currently have “real” people listening to
`broadcasts and filling out scorecards.
`A large-scale broadcast monitoring system based on fingerprinting
`consists of several monitoring sites and a central site where the
`fingerprint server is located. At the monitoring sites fingerprints
`are extracted from all the (local) broadcast channels. The central
`site
`collects
`the
`fingerprints
`from
`the monitoring sites.
`Subsequently, the fingerprint server, containing a huge fingerprint
`database, produces the playlists of all the broadcast channels.
`3.2 Connected Audio
`Connected audio is a general term for consumer applications
`where music is somehow connected to additional and supporting
`information. The example given in the abstract, using a mobile
`phone to identify a song is one of these examples. This business is
`actually pursued by a number of companies [10][13]. The audio
`signal in this application is severely degraded due to processing
`applied by radio stations, FM/AM transmission, the acoustical
`path between the loudspeaker and the microphone of the mobile
`phone, speech coding and finally the transmission over the mobile
`network. Therefore, from a technical point of view, this is a very
`challenging application.
`Other examples of connected audio are (car) radios with an
`identification button or fingerprint applications “listening” to the
`audio streams leaving or entering a soundcard on a PC. By
`pushing an “info” button in the fingerprint application, the user
`could be directed to a page on the Internet containing information
`about the artist. Or by pushing a “buy” button the user would be
`
`0002
`
`

`

`A Highly Robust Audio Fingerprinting System
`
`able to buy the album on the Internet. In other words, audio
`fingerprinting can provide a universal linking system for audio
`content.
`3.3 Filtering Technology for File Sharing
`Filtering refers to active intervention in content distribution. The
`prime example for filtering technology for file sharing was
`Napster [15]. Starting in June 1999, users who downloaded the
`Napster client could share and download a large collection of
`music for free. Later, due to a court case by the music industry,
`Napster users were forbidden to download copyrighted songs.
`Therefore in March 2001 Napster installed an audio filter based
`on file names, to block downloads of copyrighted songs. The filter
`was not very effective, because users started to intentionally
`misspell filenames. In May 2001 Napster introduced an audio
`fingerprinting system by Relatable [8], which aimed at filtering
`out copyrighted material even if it was misspelled. Owing to
`Napster’s closure only two months later, the effectiveness of that
`specific fingerprint system is, to the best of the author’s
`knowledge, not publicly known.
`In a legal file sharing service one could apply a more refined
`scheme than just filtering out copyrighted material. One could
`think of a scheme with free music, different kinds of premium
`music (accessible to those with a proper subscription) and
`forbidden music.
`Although from a consumer standpoint, audio filtering could be
`viewed as a negative technology, there are also a number of
`potential benefits to the consumer. Firstly it can organize music
`song titles in search results in a consistent way by using the
`reliable meta-data of
`the
`fingerprint database. Secondly,
`fingerprinting can guarantee that what is downloaded is actually
`what it says it is.
`3.4 Automatic Music Library Organization
`Nowadays many PC users have a music library containing several
`hundred, sometimes even thousands, of songs. The music is
`generally stored in compressed format (usually MP3) on their
`hard-drives. When these songs are obtained from different
`sources, such as ripping from a CD or downloading from file
`sharing networks, these libraries are often not well organized.
`Meta-data is often inconsistent, incomplete and sometimes even
`incorrect. Assuming that the fingerprint database contains correct
`meta-data, audio fingerprinting can make the meta-data of the
`songs in the library consistent, allowing easy organization based
`on, for example, album or artist. For example, ID3Man [16], a
`tool powered by Auditude [7] fingerprinting technology is already
`available for tagging unlabeled or mislabeled MP3 files. A similar
`tool from Moodlogic [11] is available as a Winamp plug-in [17].
`4. AUDIO FINGERPRINT EXTRACTION
`4.1 Guiding Principles
`Audio fingerprints intend to capture the relevant perceptual
`features of audio. At the same time extracting and searching
`fingerprints should be fast and easy, preferably with a small
`granularity to allow usage in highly demanding applications (e.g.
`mobile phone recognition). A few fundamental questions have to
`be addressed before starting the design and implementation of
`such an audio fingerprinting scheme. The most prominent
`question to be addressed is: what kind of features are the most
`suitable. A scan of the existing literature shows that the set of
`relevant features can be broadly divided into two classes: the class
`of semantic features and the class of non-semantic features.
`Typical elements in the former class are genre, beats-per-minute,
`and mood. These types of features usually have a direct
`
`2.
`
`3.
`
`interpretation, and are actually used to classify music, generate
`play-lists and more. The latter class consists of features that have a
`more mathematical nature and are difficult for humans to ‘read’
`directly from music. A
`typical element
`in
`this class
`is
`AudioFlatness that is proposed in MPEG-7 as an audio descriptor
`tool [2]. For the work described in this paper we have for a
`number of reasons explicitly chosen to work with non-semantic
`features:
`1.
`
`Semantic features don’t always have a clear and
`unambiguous meaning. I.e. personal opinions differ over
`such classifications. Moreover, semantics may actually
`change over
`time. For example, music
`that was
`classified as hard rock 25 years ago may be viewed as
`soft listening today. This makes mathematical analysis
`difficult.
`Semantic features are in general more difficult to
`compute than non-semantic features.
`Semantic features are not universally applicable. For
`example, beats-per-minute does not typically apply to
`classical music.
`A second question to be addressed is the representation of
`fingerprints. One obvious candidate is the representation as a
`vector of real numbers, where each component expresses the
`weight of a certain basic perceptual feature. A second option is to
`stay closer in spirit to cryptographic hash functions and represent
`digital fingerprints as bit-strings. For reasons of reduced search
`complexity we have decided in this work for the latter option. The
`first option would imply a similarity measure involving real
`additions/subtractions and depending on the similarity measure
`maybe even real multiplications. Fingerprints that are based on bit
`representations can be compared by simply counting bits. Given
`the expected application scenarios, we do not expect a high
`robustness for each and every bit in such a binary fingerprint.
`Therefore, in contrast to cryptographic hashes that typically have a
`few hundred bits at the most, we will allow fingerprints that have
`a few thousand bits. Fingerprints containing a large number bits
`allow reliable identification even if the percentage of non-
`matching bits is relatively high.
`A final question involves the granularity of fingerprints. In the
`applications that we envisage there is no guarantee that the audio
`files that need to be identified are complete. For example, in
`broadcast monitoring, any interval of 5 seconds is a unit of music
`that has commercial value, and therefore may need to be identified
`and recognized. Also, in security applications such as file filtering
`on a peer-to-peer network, one would not wish that deletion of the
`first few seconds of an audio file would prevent identification. In
`this work we therefore adopt the policy of fingerprints streams by
`assigning sub-fingerprints to sufficiently small atomic intervals
`(referred to as frames). These sub-fingerprints might not be large
`enough to identify the frames themselves, but a longer interval,
`containing sufficiently many frames, will allow robust and reliable
`identification.
`4.2 Extraction Algorithm
`Most fingerprint extraction algorithms are based on the following
`approach. First the audio signal is segmented into frames. For
`every frame a set of features is computed. Preferably the features
`are chosen such that they are invariant (at least to a certain degree)
`to signal degradations. Features that have been proposed are well
`known audio features such as Fourier coefficients [4], Mel
`Frequency Cepstral Coefficients (MFFC) [18], spectral flatness
`[2], sharpness [2], Linear Predictive Coding (LPC) coefficients
`[2] and others. Also derived quantities such as derivatives, means
`and variances of audio features are used. Generally the extracted
`
`0003
`
`

`

`Original
`Original
`
`
`
`MP3@128KbpsMP3@128Kbps
`
`
`
`Bit ErrorsBit Errors
`
`
`
`B31..……… ….……… ……… .B0B31..……… ….……… ……… .B0
`
`
`
`B31..………….……………….B0B31..………….……………….B0
`
`
`
`B31..……… ….……… ……… .B0B31..……… ….……… ……… .B0
`
`
`
`00
`
`Time (Frames)
`Time (Frames)
`
`
`
`256256
`
`
`(a)(a)
`
`(b)(b)
`
`(c)(c)
`Figure 2. (a) Fingerprint block of original music clip,
`(b) fingerprint block of a compressed version, (c) the
`difference between a and b showing the bit errors in
`black (BER=0.078).
`Figure 1, where T is a delay element):
`
`
`
`
` ,( ) ,( mnEmnE
`
` ,( ) ,( mnEmnE
`
`
`
`
`
`)1
`
`)1
`
`
`
`
`(nE
`
`
`(nE
`
`
`
`
`,1
`,1
`
`
`
`) (nEm
`
`) (nEm
`
`
`
`
`
`
`
`,1
`,1
`
`m
`m
`
`
`)1
`
`)1
`
`
`
`
`
`
`
`0
`0
`
` (1)
`
`if1
`if0
`
`
`
`
`
`
`
` ,( )mnF
`
`
`
`Figure 2 shows an example of 256 subsequent 32-bit sub-
`fingerprints (i.e. a fingerprint block), extracted with the above
`scheme from a short excerpt of “O Fortuna” by Carl Orff. A ‘1’
`bit corresponds to a white pixel and a ‘0’ bit to a black pixel.
`Figure 2a and Figure 2b show a fingerprint block from an original
`CD and the MP3 compressed (32Kbps) version of the same
`excerpt, respectively. Ideally these two figures should be identical,
`but due to the compression some of the bits are retrieved
`incorrectly. These bit errors, which are used as the similarity
`measure for our fingerprint scheme, are shown in black in Figure
`2c.
`The computing resources needed for the proposed algorithm are
`limited. Since the algorithm only takes into account frequencies
`below 2kHz the received audio is first down sampled to a mono
`audio stream with a sampling rate of 5kHz. The sub-fingerprints
`are designed such that they are robust against signal degradations.
`Therefore very simple down sample filters can be used without
`introducing any performance degradation. Currently 16 tap FIR
`filters are used. The most computationally demanding operation is
`the Fourier transform of every audio frame. In the down sampled
`audio signal a frame has a length of 2048 samples. If the Fourier
`transform is implemented as a fixed point real-valued FFT the
`fingerprinting algorithm has been shown to run efficiently on
`portable devices such as a PDA or a mobile phone.
`4.3 False Positive Analysis
`Two 3-second audio signals are declared similar if the Hamming
`distance (i.e. the number of bit errors) between the two derived
`fingerprint blocks is below a certain threshold T. This threshold
`value T directly determines the false positive rate Pf, i.e. the rate at
`which audio signals are incorrectly declared equal: the smaller T,
`the smaller the probability Pf will be. On the other hand, a small
`value of T will negatively effect the false negative probability Pn,
`
`A Highly Robust Audio Fingerprinting System
`
`
`
`FramingFraming
`
`
`FourierFourier
`
`TransformTransform
`
`FF
`
`
`
`ABSABS
`
`
`BandBand
`
`DivisionDivision
`
`
`EnergyEnergy
`
`ComputationComputation
`
`
`
`++
`
`
`
`++
`
`
`
`++
`
`
`
`--
`
`
`
`--
`
`
`
`--
`
`
`x2x2
`
`x2x2
`
`
`x2x2
`
`x2x2
`
`
`Bit DerivationBit Derivation
`
`T -TT -
`
`++
`
`T -TT -
`
`++
`
`
`
`>0>0
`
`
`
`>0>0
`
`
`
`F(n,0)F(n,0)
`
`
`
`F(n,1)F(n,1)
`
`T -TT -
`
`
`++
`
`
`
`>0>0
`
`
`
`F(n,31)F(n,31)
`
`Figure 1. Overview fingerprint extraction scheme.
`
`features are mapped into a more compact representation by using
`classification algorithms, such as Hidden Markov Models [3], or
`quantization [5]. The compact representation of a single frame
`will be referred to as a sub-fingerprint. The global fingerprint
`procedure converts a stream of audio into a stream of sub-
`fingerprints. One sub-fingerprint usually does not contain
`sufficient data to identify an audio clip. The basic unit that
`contains sufficient data to identify an audio clip (and therefore
`determining the granularity) will be referred to as a fingerprint-
`block.
`The proposed fingerprint extraction scheme is based on this
`general streaming approach. It extracts 32-bit sub-fingerprints for
`every interval of 11.6 milliseconds. A fingerprint block consists of
`256 subsequent sub-fingerprints, corresponding to a granularity of
`only 3 seconds. An overview of the scheme is shown in Figure 1.
`The audio signal is first segmented into overlapping frames. The
`overlapping frames have a length of 0.37 seconds and are
`weighted by a Hanning window with an overlap factor of 31/32.
`This strategy results in the extraction of one sub-fingerprint for
`every 11.6 milliseconds. In the worst-case scenario the frame
`boundaries used during identification are 5.8 milliseconds off with
`respect to the boundaries used in the database of pre-computed
`fingerprints. The large overlap assures that even in this worst-case
`scenario the sub-fingerprints of the audio clip to be identified are
`still very similar to the sub-fingerprints of the same clip in the
`database. Due to the large overlap subsequent sub-fingerprints
`have a large similarity and are slowly varying in time. Figure 2a
`shows an example of an extracted fingerprint block and the slowly
`varying character along the time axis.
`The most important perceptual audio features live in the frequency
`domain. Therefore a spectral representation is computed by
`performing a Fourier transform on every frame. Due to the
`sensitivity of the phase of the Fourier transform to different frame
`boundaries and the fact that the Human Auditory System (HAS) is
`relatively insensitive to phase, only the absolute value of the
`spectrum, i.e. the power spectral density, is retained.
`In order to extract a 32-bit sub-fingerprint value for every frame,
`33 non-overlapping frequency bands are selected. These bands lie
`in the range from 300Hz to 2000Hz (the most relevant spectral
`range for the HAS) and have a logarithmic spacing. The
`logarithmic spacing is chosen, because it is known that the HAS
`operates on approximately logarithmic bands (the so-called Bark
`scale). Experimentally it was verified that the sign of energy
`differences (simultaneously along the time and frequency axes) is
`a property that is very robust to many kinds of processing. If we
`denote the energy of band m of frame n by E(n,m) and the m-th bit
`of the sub-fingerprint of frame n by F(n,m), the bits of the sub-
`fingerprint are formally defined as (see also the gray block in
`
`0004
`
`

`

`Translating the above back to the case of fingerprints bits, a
`correlation factor a between subsequent fingerprint bits implies an
`increase in standard deviation for the BER by a factor
`
`(6)
`
`.
`
`22
`aa
`
`11
`
`To determine the distribution of the BER with real fingerprint
`blocks a fingerprint database of 10,000 songs was generated.
`Thereafter the BER of 100,000 randomly selected pairs of
`fingerprint blocks were determined. The standard deviation of the
`resulting BER distribution was measured
`to be 0.0148,
`approximately 3 times higher than the 0.0055 one would expect
`from random i.i.d. bits.
`Figure 3 shows the log Probability Density Function (PDF) of the
`measured BER distribution and a normal distribution with mean
`of 0.5 and a standard deviation of 0.0148. The PDF of the BER is
`a close approximation to the normal distribution. For BERs below
`0.45 we observe some outliers, due to insufficient statistics. To
`incorporate the larger standard deviation of the BER distribution
`Formula (2) is modified by inclusion of a factor 3.

`)21(
`
`23
`The threshold for the BER used during experiments was  = 0.35.
`This means that out of 8192 bits there must be less than 2867 bits
`in error in order to decide that the fingerprint blocks originate
`from the same song. Using formula (7) we arrive at a very low
`false positive rate of erfc(6.4)/2= 3.6·10-20.
`4.4 Experimental Robustness Results
`In this subsection we show the experimental robustness of the
`proposed audio fingerprinting scheme. That is, we try to answer
`the question of whether or not the BER between the fingerprint
`block of an original and a degraded version of an audio clip
`remains under the threshold α
`We selected four short audio excerpts (Stereo, 44.1kHz, 16bps)
`from songs that belong to different musical genres: “O Fortuna”
`by Carl Orff, “Success has made a failure of our home” by Sinead
`o’Connor, “Say what you want” by Texas and “A whole lot of
`Rosie” by AC/DC. All of the excerpts were subjected to the
`following signal degradations:
` MP3 Encoding/Decoding at 128 Kbps and 32 Kbps.
` Real Media Encoding/Decoding at 20 Kbps.
` GSM Encoding at Full Rate with an error-free channel and
`a channel with a carrier to interference (C/I) ratio of 4dB
`(comparable to GSM reception in a tunnel).
` All-pass Filtering using the system function: H(z)=(0.81z2-
`1.64z+1)/ (z2-1.64z+0.81).
` Amplitude Compression with the following compression
`ratios: 8.94:1 for |A|  -28.6 dB; 1.73:1 for -46.4 dB  |A|  -
`28.6 dB; 1:1.61 for |A|  -46.4 dB.
` Equalization A typical10-band equalizer with the following
`settings:
`8k 16k
`4k
`2k
`1k
`62 125 250 500
`31
`Freq.(Hz)
`-3
`+3
`+3
`-3
`+3
`+3
`-3
`+3
`-3
`-3
`Gain(dB)
` Band-pass Filtering using a second order Butterworth filter
`with cut-off frequencies of 100Hz and 6000Hz.
` Time Scale Modification of +4% and -4% . Only the tempo
`changes, the pitch remains unaffected.
`
`
`
`n
`
`
`
`erfc
`
`21
`
`Pf
`
`
`

`
`
`
`(7)
`
`A Highly Robust Audio Fingerprinting System
`
`
`
`22
`
`
`
`11
`
`
`
`00
`
`
`
`-1-1
`
`
`
`-2-2
`
`
`
`-3-3
`
`log(pdf)
`log(pdf)
`
`
`
`0.440.44
`
`
`
`0.460.46
`
`
`
`0.480.48
`
`
`0.50.5
`
`Bit Error RateBit Error Rate
`Figure 3. Comparison of the probability density function of
`the BER plotted as ‘+’ and the normal distribution.
`
`
`
`0.520.52
`
`
`
`0.540.54
`
`
`
`0.560.56
`
`i.e. the probability that two signals are ‘equal’, but not identified
`as such.
`In order to analyze the choice of this threshold T, we assume that
`the
`fingerprint
`extraction process yields
`random
`i.i.d.
`(independent and identically distributed) bits. The number of bit
`errors will then have a binomial distribution (n,p), where n equals
`the number of bits extracted and p (= 0.5) is the probability that a
`‘0’ or ‘1’ bit is extracted. Since n (= 8192 = 32  256) is large in
`our application, the binomial distribution can be approximated by
`a normal distribution with a mean μ = np and standard deviation
`σ =√(np(1-p)). Given a fingerprint block F1, the probability that a
`randomly selected fingerprint block F2 has less than T = α n errors
`with respect to F1 is given by:
`
`
`


`
`

`21
`n
`
`where α denotes the Bit Error Rate (BER).
`However, in practice the sub-fingerprints have high correlation
`along the time axis. This correlation is due not only to the inherent
`time correlation in audio, but also by the large overlap of the
`frames used in fingerprint extraction. Higher correlation implies a
`larger standard deviation, as shown by the following argument.
`Assume a symmetric binary source with alphabet {-1,1} such that
`the probability that symbol xi and symbol xi+1 are the same is
`equals to q. Then one may easily show that
`E[
`]
`,
`xx
`a
`|k
`|
`
`i
`ki
`where a = 2·q-1. If the source Z is the exclusive-or of two such
`sequences X and Y, then Z is symmetric and
`E[
`]
`.
`zz
`a
`|2 k
`|
`
`i
`ki
`
`(2)
`
`
`
`n
`

`)21(
`
`2
`
`
`
`erfc
`
`21
`
`
`
`x
`
`22
`
`e
`
`dx
`
`
`
`21
`
`P
`f
`
`
`
`(3)
`
`(4)
`NZ over N
`For N large, the standard deviation of the average
`consecutive samples of Z can be approximately described by a
`normal distribution with mean 0 and standard deviation equal to
`1
`a
`
`1(
`N
`
`
`(5)
`
`.
`
`2
`
`)
`
`2 a
`
`0005
`
`

`

`
`
`1010
`
`0123456789
`0123456789
`
`Bit errors per sub-fingerprint ( )
`Bit errors per sub-fingerprint ( )
`
`
`5050
`
`100100
`
`150150
`
`200200
`
`Frame numberFrame number
`Figure 4. Bit errors per sub-fingerprint for the “MP3@
`128Kbps version” of excerpt of ‘O Fortuna’ by Carl Orff.
`
`
`
`250250
`
`Most reliable erroneous bit ( )
`Most reliable erroneous bit ( )
`
`
`
`3030
`
`
`
`2525
`
`
`
`2020
`
`
`
`1515
`
`
`
`1010
`
`
`
`55
`
`
`
`00
`
`
`
`5050
`
`
`
`100100
`
`
`150150
`
`Frame numberFrame number
`
`
`
`200200
`
`
`
`250250
`
`
`
`3030
`
`
`
`2525
`
`
`
`2020
`
`
`
`1515
`
`
`
`1010
`
`
`
`55
`
`
`
`00
`
`Bit errors per sub-fingerprint ( )
`Bit errors per sub-fingerprint ( )
`
`Figure 5. Bit errors per sub-fingerprint (gray line) and the
`reliability of the most reliable erroneous bit (black line) for
`the “MP

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket