`
` Exhibit 54
`
`
`
`Case 1:14-cv-02396-PGG-SN Document 241-29 Filed 11/12/20 Page 2 of 15
`
`Pattern Recognition 41 (2008) 3467 - 3480
`
`Waveprint: Efficient wavelet-based audio fingerprinting
`Shumeet Baluja*, Michele Covell
`
Google, Inc., Mountain View, CA, USA

ARTICLE INFO
`
`ABSTRACT
`
Article history:
Received 17 July 2007
Received in revised form 6 March 2008
Accepted 2 May 2008
`
Keywords:
Audio retrieval
Applications
Image/video retrieval
Pattern analysis
`
In this paper, we present Waveprint, a novel method for audio identification. Waveprint uses a combination of computer-vision techniques and large-scale data-stream processing algorithms to create compact fingerprints of audio data that can be efficiently matched. The resulting system has excellent identification capabilities for small snippets of audio that have been degraded in a variety of manners, including competing noise, poor recording quality and cell-phone playback. We explicitly measure the tradeoffs between performance, memory usage, and computation through extensive experimentation. The system is more efficient in terms of memory usage and computation, while being more accurate, when compared with previous state-of-the-art systems. The applications of Waveprint include song identification for end-consumer use, copyright protection for audio assets, copyright protection for television assets and synchronization of off-line audio sources, such as live television.

© 2008 Elsevier Ltd. All rights reserved.
`
`1.
`
`Introduction
`
Audio fingerprinting provides the ability to link short, unlabeled snippets of audio content to corresponding data about that content. There are an immense number of applications for audio fingerprinting. In the consumer space, it provides the ability for consumers to automatically identify unknown audio, such as songs. Songs can be tagged automatically with the performing artist's name, album or other metadata. Other applications include broadcast monitoring for not only songs, but also other non-musical content. One such application is to continuously monitor the number, time, and duration of broadcasts of pre-recorded programs and advertisements. An application that is rapidly growing in importance is the immediate identification of copyrighted material. As the ease and popularity of music and video sharing increases, the need to recognize copyrighted content grows. For this task, audio fingerprinting is proving to be useful for recognizing not only audio material, but also video content (through the use of the audio track). Similarly, automatically recognizing ambient audio signals from television broadcasts has also proven to be much easier than recognizing the video stream. This has enabled numerous applications ranging from enhanced television [5] to automatic advertisement detection and replacement.

There are a number of issues that make fingerprinting a challenging task. The simplest approaches, directly comparing the audio samples, will not work. The query and stored version of a song
`
`*
`
`Corresponding
`
`author.
`
`0031-3203/$30.00 © 2008 Elsevier Ltd. All
`doi:10.1016/j patcog.2008.05.006
`
`rights
`
`reserved.
`
`© 2008 Elsevier Ltd. All rights
`
`reserved.
`
`reference
`
may be aurally similar while having distinct sample values. Even direct comparisons of spectrograms are susceptible to changes in quality settings, compression schemes, equalization settings, codecs, etc. Further, if radio broadcasts are included in the probe set, tempo and sometimes pitch changes may be introduced, since radio stations often change the speed of a song to fit their programming requirements. Finally, there are numerous issues introduced by the many forms of playback available to the end-consumer. Music that is played through a cell-phone, computer speakers, or high-end audio equipment will have very different audio characteristics that must be taken into account.

In this paper, we describe a system which can handle signals that have been degraded by echoes, passed through a cell-phone codec, recorded in the presence of structured noise, and modified in timing, with respect to either/both pitch and tempo. The system is always tested where even coarse offsets into the matching song are unknown (for example, cues such as "the snippet occurs within the first 30 s" will not be used). This offset-invariant testing is required for broadcast monitoring and for edited-media identification. Most open-source music-track identification systems (such as Foosic libFooID and MusicBrainz Picard Tagger [9]) are not designed for use under this assumption. Instead, we compare ourselves to another state-of-the-art open-source system [10], allowing us to compare memory usage and computational load as well as accuracy.

For our explorations, there are practical system-design constraints. The most restrictive are the memory requirements. To minimize the number and size of disk reads, we will keep as much information in a computer's memory as possible. Therefore, the fingerprints need to be compact. Second, the system must be designed to be easily parallelizable to multiple machines. New audio content
`
`GOOG-NETWORK-00610574
`
`
`
`
may be continuously added to the system. Third, the system must be able to work with non-music information, such as the audio track of television programs. The least restrictive of our requirements is the computation speed; we need to recognize an audio-track as it is being played. Therefore, it must be at least real-time.

In Section 2, we review previous approaches to audio identification, with special attention to those systems that influenced our own approach. In Section 3, we present and explain the methods we have used for our system, termed Waveprint [11]. Section 4 explores the tradeoffs across computational complexity, memory usage and recognition accuracy, as a function of the parameter settings. Section 5 provides detailed performance results and compares the performance to the best previously published system [10]. In recognition of the importance of scaling properties, Section 6 analyzes, in practice, the accuracy and computation required as a function of known-audio database size. Finally, Section 7 concludes the paper and outlines several directions for further exploration.
`
2. Previous approaches
`
A good overview of the audio fingerprinting field can be found in Refs. [12,13]. Most track-based approaches, such as Foosic libFooID and MusicBrainz Picard Tagger, often make use of starting-time constraints in matching fingerprints. Some commercial products are targeted at broadcast or edited content.
One of the most widely used systems [18] uses overlapping windows of audio to extract interesting features. Thirty-three Bark-frequency cepstral coefficient (BFCC) bands, covering 300-2000 Hz, are used for the spectral representation, with spectral slices taken every 11.6 ms and with each slice based on 370 ms of audio. This large overlap ensures that the sub-fingerprints vary slowly over time, providing fine-grain shift invariance to the representation. A vector of 32 bits, called a sub-fingerprint, is extracted from each slice position according to the increasing/decreasing difference pairs across successive bands and successive spectral slices. These sub-fingerprints are largely insensitive to small changes in the audio signal since no actual difference values are kept; instead, only the sign bits compose the sub-fingerprint. Comparisons with these fingerprints are efficient; a simple Hamming distance can be used.
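The sign-only scheme of Ref. [18] can be sketched in a few lines. This is an illustrative reconstruction, not the authors' implementation: we assume the 33 band energies of each spectral slice are already available, and the bit ordering is our own choice.

```python
import numpy as np

def sub_fingerprint(prev_bands, cur_bands):
    """Sketch of a 32-bit sign-only sub-fingerprint: bit b is 1 when the
    energy difference between adjacent bands b and b+1 increases from the
    previous spectral slice to the current one. Only the sign is kept, so
    small changes in the audio leave most bits unchanged."""
    d_prev = np.diff(prev_bands)   # 32 band-to-band differences, prior slice
    d_cur = np.diff(cur_bands)     # 32 band-to-band differences, this slice
    bits = (d_cur - d_prev) > 0    # keep only the sign of the change
    return int("".join("1" if b else "0" for b in bits), 2)

def hamming(a, b):
    """Number of differing bits between two 32-bit sub-fingerprints."""
    return bin(a ^ b).count("1")
```

Matching then reduces to counting differing bits between query and stored sub-fingerprints, which is why retrieval in that system is so cheap.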
For a database of 10,000 songs of average length 5 min, approximately 250,000,000 sub-fingerprints were generated from this reference set. During retrieval, the entire database cannot be examined. Instead, the authors of Ref. [18] assume that, even with potential audio degradations, at least one sub-fingerprint from each query sound will have a (correct) exact match in the database. This allows them to use a hash table to find exact copies. Once an exact match is found, the temporal sequencing of the indexed song is used to analyze the surrounding sub-fingerprints. This allows a simple computation of the Hamming distance over the full snippet length. The only "search" done is over candidates that are retrieved by one or more exact matches. If extreme distortions are expected, such that even a single exact match is not guaranteed, their approach is modified to search for vectors that are small Hamming distances away from the original sub-fingerprints.
A recent extension to the above work was presented in Ref. [10]. Based on Ref. [18], Ke introduced a learning approach into the feature-selection process. An important insight provided by Ref. [10] is that the 1-D audio signal can be processed as an image when viewed in a 2-D time-frequency representation. Their learning system finds features that integrate the energy in selected frequencies over time via AdaBoost learning. The basis of feature selection is the discriminative power of the region in differentiating between two matching frames (within a distortion set) and two mismatched frames. Thirty-two "boxlet" features are selected, each yielding a binary value. These 32 bits are then used in an analogous procedure to the 32-bit features found by Ref. [18]. Temporal coherence is measured by a simple transition model.
An alternate approach is explored in Refs. [20,21]. Their work uses a perceptually weighted log spectrogram as the feature set. This log spectrogram is sampled only once every 186 ms and uses 372 ms of data to provide 2048 frequency samples between DC and 5.05 kHz. Burges et al. extract noise-tolerant fingerprints from this spectrogram using distortion discriminant analysis (DDA). The fingerprints are more complex than in the studies by Refs. [10,18], but also summarize longer segments of audio than in the other work. DDA is based on a variant of linear discriminant analysis (LDA) called oriented principal components analysis (OPCA). OPCA assumes that distorted versions of the training samples are available. OPCA selects a set of directions for modeling the subspace that maximizes the signal variance while minimizing the noise power. OPCA yields a set of potentially non-orthogonal vectors that account for noise statistics [21]. The final result of their system effectively maps 110K inputs into 64 outputs. These 64 outputs are the sub-fingerprints that are matched. The experiments conducted in Refs. [20,21] have found that the fingerprints are resistant to problems with alignment and types of noise not found in the training set.
`
`3. Description of the Waveprint system
`
Our system builds on the insight from Ref. [10] that computer-vision techniques can be a powerful method for analyzing audio data. However, instead of a learning approach, we examine the applicability of a wavelet-based approach developed by Ref. [22] for efficiently performing image queries in large databases. To make the algorithm scale, we employ hashing approaches from large-scale data-stream processing. The sub-fingerprints that we develop are more comprehensive than those used in Refs. [10,18], since they will represent a longer time period, in a manner closer to the work presented in Refs. [20,21].

We structure our discussion of the Waveprint system in three stages. The first is the creation of a compact representation of songs that will be inserted into our database for retrieval. The second stage is the creation and organization of that database. The third stage is an efficient procedure to look up a candidate match when a new query audio snippet is received.
`
3.1. Fingerprint creation

The overall architecture of the fingerprint creation procedure, described in this section, is shown in Fig. 1, which also shows a typical spectrogram and its decomposition into a sparse representation that will be converted into the signatures stored in our database.
`
3.1.1. Spectral-image creation
We begin by converting the audio input into a spectrogram [23]. The simplest spectrogram extracts overlapping segments of audio, tapers each audio slice to reduce the sensitivity to end effects, takes the Fourier transform of each audio slice to give a short-time frequency representation, and then discards the phase component, keeping only Fourier magnitudes on the positive frequency bands. We create the spectrograms using parameter settings that worked well in previous audio fingerprinting studies. We use slices that are 371 ms long, taken every 11.6 ms, reduced to 32 logarithmically spaced frequency bins between 318 Hz and 2 kHz. An important consequence of the slice length/spacing combination of parameters is that the spectrogram varies slowly in time, providing matching robustness to position uncertainty (in time). The use of logarithmic spacing in frequency was selected based on simplicity, since the detailed band-edge locations are unlikely to have a strong effect under such coarse sampling (only 32 samples across frequency).
`
`
`
`
We then extract spectral images, 11.6·w ms long, each sampling offset apart. The sampling offsets that we use are constant in the database-creation process (s sec separation) but are non-uniform in the probe sampling process. We discuss the parameter choices (s and w) further in Section 4. Extracting known-length spectral images from the spectrograms allows us to create sub-fingerprints that include some temporal structure without being unduly susceptible to gradual changes in timing. At this point in the processing, we treat the spectral images as if they were components in an image-query system. Rather than performing retrieval by directly comparing the "pixels" of the spectral image, we will use a representation based on wavelets.
`
3.1.2. Wavelets on spectral images
Wavelets are a mathematical tool for hierarchically decomposing functions. They allow a function to be described by its overall shape, plus successively increasing details. A good description of wavelets can be found in Ref. [24]. We use wavelets in this audio-retrieval task
`
due to their successful use in image retrieval [22]. In Ref. [22], rather than comparing images directly in the pixel space, they first decomposed the image through the use of multi-resolution Haar wavelets. Samples of the wavelet decomposition, for a typical image and for a spectrogram image, are shown in Fig. 2. The system of Ref. [22] retrieved target images from query images that were hand drawn or low quality sketches of the target image. The results were better than those achieved through simple histogram or pixel differences.

In our system, for each of the spectral images, Haar wavelets are computed. By itself, the wavelet image is not resistant to noise or audio degradations. To provide this resistance, instead of using the entire set of wavelets, we only keep the ones that most characterize the image, by selecting the top-t wavelets (by magnitude), where t ≪ image_pixels. When we look at the wavelets for successive images for two songs, we see easily identifiable patterns both in the wavelet space and even more clearly when the top-t wavelets are kept. This is shown in Fig. 3.

One of the important findings by Jacobs [22] was that only the sign bits (not the full coefficients) of the top wavelets were needed. This allows a bit-vector representation in which each wavelet is represented as only two bits, which is ideal for applications, such as this one, with stringent memory requirements. Each of the top-t magnitude wavelets is labeled as 10 (for positive values) or 01 (for negative values). The majority of wavelets, which are not in the top-t set, are labeled with 00. This representation makes the resulting bit vector extremely sparse and amenable to further dimensionality reduction. For the next step of dimensionality reduction, we use Min-Hash, which is described in the next section.
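The decomposition and the two-bit encoding can be sketched as follows. This is a reading of the steps above under stated assumptions: an unnormalized average/difference Haar pyramid applied one dimension at a time (the "standard" decomposition), with helper names of our own.

```python
import numpy as np

def haar_2d(img):
    """Unnormalized 2-D Haar decomposition in the "standard" manner: the
    full 1-D average/difference pyramid is applied along one dimension at
    a time. Sides must be powers of two in this sketch."""
    out = img.astype(float).copy()
    assert all((s & (s - 1)) == 0 for s in out.shape), "power-of-two sides"
    for axis in (0, 1):
        out = np.swapaxes(out, 0, axis)
        n = out.shape[0]
        while n > 1:
            half = n // 2
            evens, odds = out[0:n:2].copy(), out[1:n:2].copy()
            out[:half] = (evens + odds) / 2.0   # coarse averages
            out[half:n] = (evens - odds) / 2.0  # detail coefficients
            n = half
        out = np.swapaxes(out, 0, axis)
    return out

def top_t_bits(wavelets, t):
    """Keep only the t largest-magnitude coefficients, encoding each kept
    wavelet with two bits: 10 for positive, 01 for negative, 00 elsewhere.
    The result is an extremely sparse bit vector."""
    flat = wavelets.ravel()
    keep = np.argsort(np.abs(flat))[-t:]        # indices of the top-t
    bits = np.zeros(2 * flat.size, dtype=np.uint8)
    for i in keep:
        if flat[i] > 0:
            bits[2 * i] = 1       # "10"
        elif flat[i] < 0:
            bits[2 * i + 1] = 1   # "01"
    return bits
```

Discarding the magnitudes means two spectral images with the same pattern of dominant wavelets produce identical bit vectors even when their energies differ.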
`
`1.
`
`Given the spectrogram of a song, divide the audio into
`smaller spectral
`images.
`
`For each spectral
`
`image:
`
`2. Compute the wavelets on the spectral
`
`images.
`
`3 Extract the top-t wavelets, measured by magnitude.
`
`4. Create a binary representation of the top-wavelets.
`
`Fig. 1. Overall architecture for fingerprint creation.
`
3.2. Min-Hash-based sub-fingerprints
`
The final step of sub-fingerprint creation is to determine a compact but nearest-neighbor-indexable representation of the sparse wavelet-vector described in the previous section; for this we explore the use of Min-Hash [25]. To support efficient nearest-neighbor retrieval, we require that sub-fingerprint v1 and sub-fingerprint v2 are highly similar if and only if wavelet signature(v1) and wavelet signature(v2) are highly similar. Because we retain only the top-t wavelet coefficients, we determine similarity based on those top wavelets, without rewarding matches on the zeroed positions. For the purposes of this discussion, given two vectors v1 and v2, we refer to match types as being of four types a, b, c, d, as shown in Table 1, depending on the corresponding bits in the vectors. Given these types of matches/mismatches, we note that for sparse vectors, most of the
`
Fig. 2. Wavelet examples for a typical image (left) and a spectrogram image (right). These are computed in the "standard" manner: one dimension at a time.
`
`
`
`
`
Fig. 3. The representation for three songs (including The Dave Matthews Band), with five consecutive frames shown for each. For each song, the top row is the original spectrogram image, the second row is the wavelet magnitudes, and the third row shows the top-200 (t = 200) wavelets. Note that the top wavelets have a distinctive pattern for each of the three songs. (For each song, the top 2 rows in the figure have been extensively visually enhanced to be visible when printed on paper.)
`
Table 1
Types of match/mismatch between single bits of two binary vectors
`
bit positions will be of type d. To avoid computing similarity based on the uninformative zeroed rows, we will define the similarity of two vectors to be the relative number of rows that are of type a among the non-zero rows: i.e., Similarity(v1, v2) = a/(a + b + c).

The Min-Hash technique works with binary vectors as follows: Select a random, but known, reordering of all the vector positions. Reorder each vector by this permutation. With this new ordering, determine in which position the first "1" occurs. It is important to note that for two vectors, v1 and v2, the probability that first_1_occurrence(v1) = first_1_occurrence(v2) is the same as the probability of finding a row that has a 1 in both v1 and v2, from the set of rows that have a 1 in either v1 or v2. Therefore, for a given permutation, the hash values agree if the first position with a 1 is the same in both bit vectors, and they disagree if the first such position is a row where one, but not both, vectors contained a 1. Note that this is exactly what is required; it measures the similarity of the sparse bit vectors based on matching "on" positions.
We repeat the above procedure multiple times, each time with a new permutation of bit positions. If we repeat the process p times, with p unique permutations, we get p largely independent projections of the bit vector. These p values are the signature for the bit vector. We can compare the similarity of the bit vectors by looking at the exact matches in the signatures of length p; for a large enough p, it will be very close to the similarity of the original vectors. In our system, we do not keep the intermediate bit representation described in the previous section. Instead, we store the Min-Hash computed signature; this is the final sub-fingerprint of the spectral image. We experimented with a variety of values for p; these are presented in Section 4. An in-depth description of the Min-Hash process is given in Ref. [25]. Methods to make the matching process efficient given these signatures will be presented with the description of the retrieval process.
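The signature construction above (including the single-byte truncation described next) can be sketched as follows; the function names are ours, and the permutations stand in for the system's fixed, known reorderings.

```python
import numpy as np

def minhash_signature(bits, perms, max_pos=255):
    """Sketch of the Min-Hash signature: for each known permutation, record
    the position of the first '1' in the permuted sparse bit vector,
    truncated at 255 so each signature element fits in a single byte."""
    sig = []
    for perm in perms:
        ones = np.flatnonzero(bits[perm])       # positions of 1s after reorder
        first = ones[0] if ones.size else max_pos
        sig.append(min(int(first), max_pos))
    return np.array(sig, dtype=np.uint8)

def signature_similarity(sig1, sig2):
    """Fraction of exactly matching signature bytes; for large p this
    approximates the a/(a+b+c) similarity of the underlying bit vectors."""
    return float(np.mean(sig1 == sig2))
```

With p permutations, the signature is p bytes regardless of how long the original bit vector was, which is what makes the sub-fingerprints compact.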
On a pragmatic note, because memory efficiency is paramount for deployment, each signature element must be small. Instead of tracking the first position in which a "1" occurs, we truncate the computation at position 255. If the first 1 occurs after position 255, it is demarcated as if it occurred at position 255. This allows us to keep each component of the signature as a single byte. Fig. 4 shows a histogram of the position of the first 1, computed for the snippets in our database, for three values of t (the number of top wavelets kept). Note that the cumulative probability of occurring after 255 is very low when more than 50 top coefficients are kept; this indicates that the single-byte representation loses little accuracy.
In this study, Min-Hash reduces the size of the signatures from the intermediate wavelet representation described in the previous section to a compact representation of p bytes. There are numerous other techniques that are commonly used for dimensionality reduction, among them techniques such as principal components analysis (PCA). We chose Min-Hash due to a chain of reasons. We require discriminative power across our top-wavelet signatures, not descriptive power. This requirement means that PCA may not be the best representation; such needs have instead traditionally been handled by LDA-based methods. Since our top-wavelet signatures are already a sparse-vector representation, we employed techniques explicitly designed to handle probabilistic matching across sparse vectors, which did not require transformation to continuous values. Min-Hash is such a method and has been used extensively (and successfully) in data-stream processing for this class of problems.
`
`
`
`
`
[Fig. 4 here: a plot titled "Probability of First Non-Zero Element in Each Position"; x-axis: Position; y-axis: Probability; curves for t = 50, 200, and 800.]

Fig. 4. Probability of the first non-zero element occurring in each bit position, measured over all of the snippets in the database. The curves shown are for keeping the top 200 wavelet coefficients (as we will use in our system), keeping only 50, and keeping 800. Note that with 50 top coefficients, there is a large percentage after the truncation point of 255 (the spike at the right edge of the graph). When keeping 800 coefficients, notice that the probability of a "1" in the first position examined increases; this is because the sparseness of the bit vectors has decreased significantly.
`
3.3. Database-creation process

To this point, we have reduced the amount of information from each spectrogram through three steps. First, we kept only the top wavelets of the spectral images associated with the spectrogram. Second, we reduced each of the top wavelets to only two bits. Third, we used the Min-Hash procedure to reduce the resulting bit vector to p values, which became the sub-fingerprint that was stored for the audio segment.

After these steps, each spectral image is represented by a series of p 8-bit integers, the sub-fingerprint. Even with this compression, efficiently finding near-neighbors in a p-dimensional space is not a trivial task (when p > 50); naive comparisons are not practical. Instead, we use a technique termed locality-sensitive hashing (LSH). It is not only efficient in the number of comparisons that are required (a small fraction of the data set will be examined for a lookup), but it also provides noise-robustness properties.

In standard hashing, the query's sub-fingerprint is used, in its entirety, as a hash key that indexes into a bin that contains elements that have also been hashed into the same bin. The elements that are found in that bin are then considered candidate matches, and can be compared in detail (for example, by comparing the full signatures) to obtain the final matches. The implicit assumption in using the entire fingerprint is that the procedures employed to create the sub-fingerprint already fully accounted for the noise that the system would encounter. In contrast to the standard hashing procedure, LSH performs a series of hashes, each of which examines only
`
a portion of the sub-fingerprint. The goal is to partition the feature vectors into l subvectors and to hash each into l separate hash tables, each hash table using one of the subvectors as input to the hash function. In this study, because the Min-Hashes are signatures in which the constituents are determined randomly, we simply assign non-overlapping, consecutive elements for each hash. Candidate neighbors can then be retrieved by partitioning the probe feature vector in the same manner, and collecting the entries in the corresponding hash bins. The final list of potential neighbors can be created by vote counting, with each hash casting votes for the entries of its indexed bin, and retaining the candidates that receive a minimum number of votes, v. If v = 1, this takes the union of the candidate lists. If v = l, this takes the intersection.

Unlike standard hashing schemes, in which the entire fingerprint must be exactly the same for candidate retrieval, LSH relaxes this restriction, requiring as little as p/l bytes of the sub-fingerprints to match. It should be noted that the number of hash tables, l, can vary independently from the size of the sub-fingerprint, p. We experiment with a variety of settings for l while keeping the size of the Min-Hash signature constant.

When creating the audio fingerprint database, we extract a sub-fingerprint of length p, once each s seconds. We retain these sub-fingerprints in the observed time-order sequencing, with a pointer to a song identifier. The time ordering is encoded implicitly in our system by the sub-fingerprint index within a linear array. In addition to the time-ordered representation of the sub-fingerprints, each sub-fingerprint is inserted into the l hash tables of our LSH according to the hash-function mapping of the corresponding p/l bytes to a hash bin. This provides us with the complete song database from which we will retrieve matching songs; this is described in the next section.
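The chunked hashing and vote counting described above can be sketched as a small index class. This is a minimal illustration (class and method names are ours; the real system also tracks song identifiers and time ordering):

```python
from collections import defaultdict

class LshIndex:
    """Sketch of the LSH scheme above: a p-byte signature is cut into l
    non-overlapping consecutive chunks, each chunk keyed into its own hash
    table; lookups vote, keeping candidates with at least v votes."""

    def __init__(self, p=25, l=5):
        assert p % l == 0
        self.l, self.chunk = l, p // l
        self.tables = [defaultdict(list) for _ in range(l)]
        self.signatures = []                 # full signatures, for full compares

    def _chunks(self, sig):
        for i in range(self.l):
            yield i, tuple(sig[i * self.chunk:(i + 1) * self.chunk])

    def insert(self, sig):
        idx = len(self.signatures)
        self.signatures.append(tuple(sig))
        for i, key in self._chunks(sig):
            self.tables[i][key].append(idx)

    def lookup(self, sig, v=2):
        votes = defaultdict(int)
        for i, key in self._chunks(sig):
            for idx in self.tables[i].get(key, ()):
                votes[idx] += 1
        # v=1 is the union of the candidate lists; v=l is their intersection.
        return [idx for idx, n in votes.items() if n >= v]
```

A probe thus only needs p/l consecutive bytes to survive intact in one chunk to cast a vote for the stored sub-fingerprint, which is exactly the relaxation over standard hashing that the text describes.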
`
3.4. Retrieval process

In the previous two subsections, we described the technique to create sub-fingerprints and to store them in a database for fast retrieval. In this section, we describe the process to find similar snippets once given this database and a new query snippet. The complete procedure is shown graphically in Fig. 5. Note that the first few steps (steps 1-5) of the retrieval process follow the database sub-fingerprint creation process closely; this is to ensure that the query and stored data are processed in the same manner. After the candidate matches are found (steps 6-8), a measure of temporal coherence is used to verify the findings (not shown in the diagram).
`
`
`
`
`
[Fig. 5 here: the full input snippet flows through: 1. spectral images computed at variable strides; 2. compute wavelets on spectral images; 3. extract top-t wavelets; 4. binary representation; 5. generate Min-Hash signature (here shown as 25 bytes); 6. LSH procedure with l hashes over hash tables 1 through l; 7. count votes and retain those above the minimum, v = 2; 8. compare full signatures with the query signature.]

Fig. 5. Overall architecture of the retrieval process. Dynamic-time warping not shown.
`
The first difference in the retrieval process, in comparison to the fingerprint generation process, is that the song is divided into randomly overlapping segments rather than uniformly overlapping segments. Randomly selecting the stride amount avoids problems of unlucky alignments; if the sampling of the probe is kept constant, it may repeatedly find samples that have uniformly large offsets from the sampling used to create the database. After the audio snippet is created, its signature is computed in exactly the same manner as described in the database generation process (steps 2-5). The next steps, 6-8, are described in the remainder of this section.
`
`3.4.1.
`Lookup using LSH
`We described the basic characteristics of LSH in the previous sub-
`section. Each component hash table votes for the
`sub-fingerprints
`that were
`retrieved using its p/l-bytes. Then, only those sub-
`fingerprints with a minimum of v component votes are retained. For
`example, by setting v= 1, a candidate must only match in one of the
`subregions in order to be includedin the final candidate list. At the
`the candidate must match on all
`opposite extreme, by setting
`subregions and the LSH operates identically (although
`to a standard hash table.
`least v votes are then compared
`fingerprints that have at
`The
`with the query sub-fingerprint
`(we refer to this as a full compare).
`Because each byte of the sub-fingerprint
`is a Min-Hash signature, we
`simply look at the number of bytes (out of p) that match exactly. The
`
`inefficiently)
`
`sub-fingerprint with the maximum ofthis score is the best match
`for the query spectral
`image.
`
3.4.2. Dynamic programming for temporal ordering
To this point, we have discussed matching individual sub-fingerprints from the probe into the database. In this section, we describe methods to accumulate evidence across sub-fingerprints over the duration of the probe snippet.

The simplest way to combine evidence is voting without regard to temporal information. With this approach, we keep a similarity counter for each song. At the start of a new probe snippet, all of these counters are zero. Then, for each candidate sub-fingerprint match that passed the required hash-voting threshold, v, we increment the corresponding song counter by the Hamming similarity between the probe and candidate sub-fingerprints. The advantage of this approach is its simplicity and low memory requirements. The disadvantage is that evidence for a song accumulates without regard to temporal ordering of the sub-fingerprint matches. Pairs of sub-fingerprints p(t)/p(t + Δt) within the probe are rewarded equally for matching song sub-fingerprints s(t)/s(t − Δt) as they are for matching s(t)/s(t + Δt): there is no penalty for time reversal. Similarly, they are rewarded equally for matching s(t)/s(t + 10Δt): there is no penalty for changing tempo.
We also explored dynamic-time warping (DTW) to accumulate evidence across time. DTW is a form of dynamic programming for imposing "tempo" constraints in mapping one sequence onto
`
`
`
`
`
`
hashes (v). By increasing this number, we dramatically reduce the number of full comparisons that need to be conducted; the impact on accuracy will be carefully studied, as the likelihood of a missed match increases.
`
`4. Parameter exploration
`
`
One of the intrinsic difficulties in designing systems with multiple components is conducting a thorough exploration of the parameter settings. Because of the need to deploy this system in a large-scale production setting, we extensively measure the effects of parameter values. The results reported in this section represent multiple decades of computational processing; over 50,400 different parameter combinations were tried to ensure that we select the best settings and understand the tradeoffs with each parameter. The parameters that we explored for the fingerprint and database generation and retrieval are shown in Table 2. To make the results more understandable, we also present some of our derived intuition about how the parameters affect computational complexity, memory usage and retrieval accuracy.

To explore this large parameter space, we used a 10,000 song database, with an average song duration of 3.5 min. Since our goal at this point is to understand the general system performance, we did not use heuristics such as unequal weighting of songs to influence the results. In the future,