Case 1:14-cv-02396-PGG-SN Document 241-29 Filed 11/12/20

Exhibit 54
Pattern Recognition 41 (2008) 3467-3480

Waveprint: Efficient wavelet-based audio fingerprinting

Shumeet Baluja*, Michele Covell

Google, Inc., Mountain View, CA, USA

* Corresponding author.
ARTICLE INFO

Article history:
Received 17 July 2007
Received in revised form 6 March 2008
Accepted 2 May 2008

Keywords:
Audio retrieval
Image/video retrieval
Pattern analysis
Applications

ABSTRACT
In this paper, we present Waveprint, a novel method for audio identification. Waveprint uses a combination of computer-vision techniques and large-scale data-stream processing algorithms to create compact fingerprints of audio data that can be efficiently matched. The resulting system has excellent identification capabilities for small snippets of audio that have been degraded in a variety of manners, including competing noise, poor recording quality and cell-phone playback. We explicitly measure the tradeoffs between performance, memory usage, and computation through extensive experimentation. The system is more efficient in terms of memory usage and computation, while being more accurate, when compared with previous state-of-the-art systems. The applications of Waveprint include song identification for end-consumer use, copyright protection for audio assets, copyright protection for television assets, and synchronization of off-line audio sources, such as live television.

© 2008 Elsevier Ltd. All rights reserved.

0031-3203/$30.00 doi:10.1016/j.patcog.2008.05.006
1. Introduction
Audio fingerprinting provides the ability to link short, unlabeled snippets of audio content to corresponding data about that content. There are an immense number of applications for audio fingerprinting. In the consumer space, it provides the ability for consumers to automatically identify unknown audio, such as songs. Songs can be tagged automatically with the performing artist's name, album or other metadata. Other applications include broadcast monitoring for not only songs, but also other non-musical content. One such application is to continuously monitor the number, time, and duration of broadcasts of pre-recorded programs and advertisements. An application that is rapidly growing in importance is the immediate identification of copyrighted material. As the ease and popularity of music and video sharing increases, the need to recognize copyrighted content grows. For this task, audio fingerprinting is proving to be useful for recognizing not only audio material, but also video content (through the use of the audio track). Similarly, automatically recognizing ambient audio signals from television broadcasts has also proven to be much easier than recognizing the video stream. This has enabled numerous applications ranging from enhanced television [5] to automatic advertisement detection and replacement.

There are a number of issues that make fingerprinting a challenging task. The simplest approaches, directly comparing the audio samples, will not work. The query and stored version of a song may be aurally similar while having distinct sample values. Even direct comparisons of spectrograms are susceptible to changes in quality settings, compression schemes, equalization settings, codecs, etc. Further, if radio broadcasts are included in the probe set, tempo and sometimes pitch changes may be introduced, since radio stations often change the speed of a song to fit their programming requirements. Finally, there are numerous issues introduced by the many forms of playback available to the end-consumer. Music that is played through a cell-phone, computer speakers, or high-end audio equipment will have very different audio characteristics that must be taken into account.

In this paper, we describe a system which can handle signals that have been degraded by echoes, passed through a cell-phone codec, recorded in the presence of structured noise, and modified in their timing, with respect to either/both pitch and tempo. The system is always tested where even coarse offsets into the matching song are unknown (for example, cues such as "the snippet occurs within the first 30 s" will not be used). This offset-invariant testing is required for broadcast monitoring and for edited-media identification. Most open-source music-track identification systems (such as Foosic libFooID and MusicBrainz Picard Tagger [9]) are not designed for use under this assumption. Instead, we compare ourselves to another state-of-the-art open-source system [10], allowing us to compare memory usage and computational load as well as accuracy.

For our explorations, there are practical system-design constraints. The most restrictive are the memory requirements. To minimize the number and size of disk reads, we will keep as much information in a computer's memory as possible. Therefore, the fingerprints need to be compact. Second, the system must be designed to be easily parallelizable to multiple machines. New audio content
may be continuously added to the system. Third, the system must be able to work with non-music information, such as the audio track of television programs. The least restrictive of our requirements is the computation speed; we need to recognize an audio-track as it is being played. Therefore, it must be at least real-time.
In Section 2, we review previous approaches to audio identification, with special attention to those systems that influenced our own approach. In Section 3, we present and explain the methods we have used for our system, termed Waveprint [11]. Section 4 explores the tradeoffs across computational complexity, memory usage and recognition accuracy, as a function of the parameter settings. Section 5 provides detailed performance results and compares the performance to the best previously published system [10]. In recognition of the importance of scaling properties, Section 6 analyzes, in practice, the accuracy and computation required as a function of known-audio database size. Finally, Section 7 concludes the paper and outlines several directions for further exploration.
2. Previous approaches
A good overview of the audio fingerprinting field can be found in Refs. [12,13]. Most track-based approaches, such as Foosic libFooID and MusicBrainz Picard Tagger, often make use of starting-time constraints in matching fingerprints. Some commercial products are targeted at broadcast or edited content.
One of the most widely used systems [18] uses overlapping windows of audio to extract interesting features. Thirty-three Bark-frequency cepstral coefficient (BFCC) bands, covering 300-2000 Hz, are used for the spectral representation, with spectral slices taken every 11.6 ms and with each slice based on 370 ms of audio. This large overlap ensures that the sub-fingerprints slowly vary over time, providing fine-grain shift invariance to the representation. A vector of 32 bits, called a sub-fingerprint, is extracted from each slice position according to the increasing/decreasing difference pairs across successive bands and successive spectral slices. These sub-fingerprints are largely insensitive to small changes in the audio signal since no actual difference values are kept; instead, only the sign bits compose the sub-fingerprint. Comparisons with these fingerprints are efficient; a simple Hamming distance can be used.
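As a concrete illustration of this scheme, the following minimal sketch (our own, not code from Ref. [18]; the band-energy input and function names are hypothetical) builds sign-only sub-fingerprints from differences across successive bands and slices and compares them with a Hamming distance:

```python
import numpy as np

def sub_fingerprints(energies):
    """Sign-only sub-fingerprints in the style of Ref. [18].

    energies: array of shape (num_slices, 33) holding per-slice band
    energies (33 bands give 32 adjacent-band difference pairs). Each
    returned row is a 32-bit vector: the sign of the difference of band
    differences across successive slices; no magnitudes are kept.
    """
    band_diff = np.diff(energies, axis=1)     # (T, 32) across bands
    slice_diff = np.diff(band_diff, axis=0)   # (T-1, 32) across slices
    return (slice_diff > 0).astype(np.uint8)

def hamming(fp_a, fp_b):
    """Hamming distance between two 32-bit sub-fingerprints."""
    return int(np.count_nonzero(fp_a != fp_b))

# Hypothetical usage with random band energies:
rng = np.random.default_rng(0)
fps = sub_fingerprints(rng.random((5, 33)))
print(hamming(fps[0], fps[1]))
```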
For a database of 10,000 songs of average length 5 min, approximately 250,000,000 sub-fingerprints were generated. During retrieval from this reference set, the entire database cannot be examined. Instead, the authors of Ref. [18] assume that, even with potential audio degradations, at least one sub-fingerprint from each query sound will have a (correct) exact match in the database. This allows them to use a hash table to find exact copies. Once an exact match is found, the temporal sequencing of the indexed song is used to analyze the surrounding sub-fingerprints. This allows a simple computation of the Hamming distance over the full snippet length. The only "search" done is over candidates that are retrieved by one or more exact matches. If extreme distortions are expected, such that even a single exact match is not guaranteed, their approach is modified to search for vectors that are small Hamming distances away from the original sub-fingerprints.
A recent extension to the above work was presented in Ref. [10]. Based on Ref. [18], Ke introduced a learning approach into the feature selection process. An important insight provided by Ref. [10] is that the 1-D audio signal can be processed as an image when viewed in a 2-D time-frequency representation. Their learning system finds features that integrate the energy in selected frequencies over time via AdaBoost learning. The basis of feature selection is the discriminative power of the region in differentiating between two matching frames (within a distortion set) and two mismatched frames. Thirty-two "boxlet" features are selected, each yielding a binary value. These 32 bits are then used in an analogous procedure to the 32-bit features found by Ref. [18]. Temporal coherence is measured by a simple transition model.
An alternate approach is explored in Refs. [20,21]. Their work uses a perceptually weighted log spectrogram as the feature set. This log spectrogram is sampled only once every 186 ms and uses 372 ms of data to provide 2048 frequency samples between DC and 5.05 kHz. Burges et al. extract noise-tolerant fingerprints from this spectrogram using distortion discriminant analysis (DDA). The fingerprints are more complex than in the studies by Refs. [10,18], but they also summarize longer segments of audio than in the other work. DDA is based on a variant of linear discriminant analysis (LDA) called oriented principal components analysis (OPCA). OPCA assumes that distorted versions of the training samples are available. OPCA selects a set of directions for modeling the subspace that maximizes the signal variance while minimizing the noise power. OPCA yields a set of potentially non-orthogonal vectors that account for noise statistics [21]. The final result of their system effectively maps 110K inputs into 64 outputs. These 64 outputs are the sub-fingerprints that are matched. The experiments conducted in Refs. [20,21] have found that the fingerprints are resistant to problems with alignment and types of noise not found in the training set.
3. Description of the Waveprint system

Our system builds on the insight from Ref. [10] that computer-vision techniques can be a powerful method for analyzing audio data. However, instead of a learning approach, we examine the applicability of a wavelet-based approach developed by Ref. [22] for efficiently performing image queries in large databases. To make the algorithm scale, we employ hashing approaches from large-scale data-stream processing. The sub-fingerprints that we develop are more comprehensive than those used in Refs. [10,18], since they will represent a longer time period, in a manner closer to the work presented in Refs. [20,21].

We structure our discussion of the Waveprint system in three stages. The first is the creation of a compact representation of songs that will be inserted into our database for retrieval. The second stage is the creation and organization of that database. The third stage is an efficient procedure to look up a candidate match when a new query audio snippet is received.
3.1. Fingerprint creation

The overall architecture of the fingerprint creation procedure, described in this section, is shown in Fig. 1. Fig. 1 also shows a typical spectrogram and its decomposition into a sparse representation that will be converted into the signatures stored in our database.
3.1.1. Spectral-image creation

We begin by converting the audio input into a spectrogram [23]. The simplest spectrogram extracts overlapping segments of audio, tapers each audio slice to reduce the sensitivity to end effects, takes the Fourier transform of each audio slice to give a short-time frequency representation, and then discards the phase component, keeping only Fourier magnitudes on the positive frequency bands. We create the spectrograms using parameter settings that worked well in previous audio fingerprinting studies [10,18]. We use slices that are 371 ms long, taken every 11.6 ms, reduced to 32 logarithmically spaced frequency bins between 318 Hz and 2 kHz. An important consequence of the slice length/spacing combination of parameters is that the spectrogram varies slowly in time, providing matching robustness to position uncertainty (in time). The use of logarithmic spacing in frequency was selected based on simplicity, since the detailed band-edge locations are unlikely to have a strong effect under such coarse sampling (only 32 samples across frequency).
We then extract spectral images, 11.6·w ms long, each sampling offset apart. The sampling offsets that we use are constant in the database-creation process (a separation of s seconds) but are non-uniform in the probe sampling process. We discuss the parameter choices (s and w) further in Section 4. Extracting known-length spectral images from the spectrograms allows us to create sub-fingerprints that include some temporal structure without being unduly susceptible to gradual changes in timing. At this point in the processing, we treat the spectral images as if they were components in an image-query system. Rather than performing retrieval by directly comparing the "pixels" of the spectral image, we will use a representation based on wavelets.
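A minimal sketch of this front end, under the parameter settings above (the Hann taper and the averaging into log-spaced bins are our assumptions about details the text leaves open; log_spectrogram and spectral_images are hypothetical names):

```python
import numpy as np

def log_spectrogram(audio, sr, slice_ms=371.0, step_ms=11.6,
                    n_bins=32, f_lo=318.0, f_hi=2000.0):
    """Magnitude spectrogram with 371 ms Hann-tapered slices every
    11.6 ms, averaged into 32 log-spaced bins from 318 Hz to 2 kHz."""
    n = int(sr * slice_ms / 1000.0)
    step = int(sr * step_ms / 1000.0)
    window = np.hanning(n)
    freqs = np.fft.rfftfreq(n, d=1.0 / sr)
    edges = np.geomspace(f_lo, f_hi, n_bins + 1)   # log-spaced band edges
    slices = []
    for start in range(0, len(audio) - n + 1, step):
        mag = np.abs(np.fft.rfft(audio[start:start + n] * window))
        row = []
        for lo, hi in zip(edges[:-1], edges[1:]):
            band = mag[(freqs >= lo) & (freqs < hi)]
            row.append(band.mean() if band.size else 0.0)
        slices.append(row)
    return np.asarray(slices)                      # (num_slices, 32)

def spectral_images(spec, w, stride):
    """Cut the spectrogram into w-slice spectral images, one image
    starting every `stride` slices (constant stride for the database)."""
    return [spec[i:i + w] for i in range(0, len(spec) - w + 1, stride)]
```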
3.1.2. Wavelets on spectral images

Wavelets are a mathematical tool for hierarchically decomposing functions. They allow a function to be described by its overall shape, plus successively increasing details. A good description of wavelets can be found in Ref. [24]. We use wavelets in this audio-retrieval task due to their successful use in image retrieval [22]. In Ref. [22], rather than comparing images directly in the pixel space, they first decomposed the image through the use of multi-resolution Haar wavelets. Samples of the wavelet decomposition, for a typical image and for a spectrogram image, are shown in Fig. 2. Their system handled query images that were hand drawn or low-quality sketches of the target image. The results were better than those achieved through simple histogram or pixel differences.
In our system, for each of the spectral images, Haar wavelets are computed. By itself, the wavelet image is not resistant to noise or audio degradations. To provide this resistance, instead of using the entire set of wavelets, we only keep the ones that most characterize the image, by selecting the top-t wavelets (by magnitude), where t ≪ image_pixels. When we look at the wavelets for successive images for two songs, we see easily identifiable patterns both in the wavelet space and even more clearly when the top-t wavelets are kept. This is shown in Fig. 3.

One of the important findings by Jacobs [22] was that only sign bits (not the full coefficients) for the top wavelets were needed. This allows a bit-vector representation in which each wavelet is represented as only two bits, which is ideal for applications, such as this one, with stringent memory requirements. Each of the top-t magnitude wavelets is labeled as 10 (for positive values) or 01 (for negative values). The majority of wavelets, which are not in the top-t set, are labeled with 00. This representation makes the resulting bit vector extremely sparse and amenable to further dimensionality reduction. For the next step of dimensionality reduction, we use Min-Hash, which is described in the next section.
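The following sketch (ours; it assumes power-of-two image dimensions and a flat coefficient ordering) shows the "standard", one-dimension-at-a-time Haar decomposition together with the top-t two-bit encoding:

```python
import numpy as np

def haar_1d(x):
    """Full 1-D Haar decomposition; len(x) must be a power of two."""
    x = np.asarray(x, dtype=float).copy()
    n = x.size
    while n > 1:
        half = n // 2
        even, odd = x[0:n:2].copy(), x[1:n:2].copy()
        x[:half] = (even + odd) / np.sqrt(2.0)    # averages
        x[half:n] = (even - odd) / np.sqrt(2.0)   # details
        n = half
    return x

def haar_2d_standard(img):
    """'Standard' decomposition: transform every row, then every column."""
    rows = np.apply_along_axis(haar_1d, 1, img)
    return np.apply_along_axis(haar_1d, 0, rows)

def top_t_bit_vector(img, t=200):
    """Two-bit encoding of the top-t wavelets by magnitude: 10 for a
    positive kept coefficient, 01 for a negative one, 00 otherwise."""
    coeffs = haar_2d_standard(img).ravel()
    keep = np.argsort(np.abs(coeffs))[-t:]        # indices of top-t magnitudes
    bits = np.zeros(2 * coeffs.size, dtype=np.uint8)
    for i in keep:
        bits[2 * i] = 1 if coeffs[i] > 0 else 0
        bits[2 * i + 1] = 0 if coeffs[i] > 0 else 1
    return bits
```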
1. Given the spectrogram of a song, divide the audio into smaller spectral images.

For each spectral image:

2. Compute the wavelets on the spectral image.
3. Extract the top-t wavelets, measured by magnitude.
4. Create a binary representation of the top-t wavelets.

Fig. 1. Overall architecture for fingerprint creation.
3.2. Min-Hash-based sub-fingerprints

The final step of sub-fingerprint creation is to determine a compact but nearest-neighbor-indexable representation of the sparse wavelet-vector described in the previous section; for this, we explore the use of Min-Hash [25]. To support efficient nearest-neighbor retrieval, we require that sub-fingerprint v1 and sub-fingerprint v2 are highly similar if and only if wavelet signature(v1) and wavelet signature(v2) are highly similar. Because we retain only the top-t wavelet coefficients, we determine similarity based on those top wavelets, without rewarding matches on the zeroed positions. For the purposes of this discussion, given two vectors v1 and v2, we refer to match types as being of four types a, b, c, d, as shown in Table 1, depending on the corresponding bits in the vectors. Given these types of matches/mismatches, we note that for sparse vectors, most of the bit positions will be of type d. To avoid computing similarity based on the uninformative zeroed rows, we define the similarity of two vectors to be the relative number of rows that are of type a from the other non-zero rows: i.e., Similarity(v1, v2) = a/(a + b + c).

Table 1
Types of match/mismatch between single bits of two binary vectors

v1 bit    v2 bit    Match type
1         1         a
1         0         b
0         1         c
0         0         d
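Transcribed directly from this definition (a short sketch; the list-based vectors are just for illustration):

```python
def sparse_similarity(v1, v2):
    """Similarity(v1, v2) = a / (a + b + c); type-d rows (0, 0) are ignored."""
    a = sum(1 for x, y in zip(v1, v2) if (x, y) == (1, 1))
    b = sum(1 for x, y in zip(v1, v2) if (x, y) == (1, 0))
    c = sum(1 for x, y in zip(v1, v2) if (x, y) == (0, 1))
    return a / (a + b + c) if (a + b + c) else 1.0

# One type-a row and two mismatched rows (b and c): similarity = 1/3.
print(sparse_similarity([1, 1, 0, 0, 0], [1, 0, 1, 0, 0]))
```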
Fig. 2. Wavelet examples for a typical image (left) and a spectrogram image (right). These are computed in the "standard" manner: one dimension at a time.

[Fig. 3 image panels; visible labels include "The Dave Matthews Band".]

Fig. 3. The representation for three songs, with five consecutive frames shown for each. For each song, the top row is the original spectrogram image, the second row is the wavelet magnitudes, and the third row shows the top-200 (t = 200) wavelets. Note that the top wavelets have a distinctive pattern for each of the three songs. (For each song, the top 2 rows in the figure have been extensively visually enhanced to be visible when printed on paper.)
The Min-Hash technique works with binary vectors as follows: select a random, but known, reordering of all the vector positions. Reorder each vector by this permutation. With this new ordering, determine in which position the first "1" occurs. It is important to note that, for two vectors, v1 and v2, the probability that first_1_occurrence(v1) = first_1_occurrence(v2) is the same as the probability of finding a row that has a 1 in both v1 and v2, from the set of rows that have a 1 in either v1 or v2. Therefore, for a given permutation, the hash values agree if the first position with a 1 is the same in both bit vectors, and they disagree if the first such position is a row where one, but not both, vectors contained a 1. Note that this is exactly what is required; it measures the similarity of the sparse bit vectors based on matching "on" positions.

We repeat the above procedure multiple times, each time with a new permutation of bit positions. If we repeat the process p times, with p unique permutations, we get p largely independent projections of the bit vector. These p values are the signature for the bit vector. We can compare the similarity of the bit vectors by looking at the exact matches in the signatures of length p; for a large enough p, it will be very close to the similarity of the original vectors.
In our system, we do not keep the intermediate bit representation described in the previous section. Instead, we store the computed Min-Hash signature; this is the final sub-fingerprint of the spectral image. We experimented with a variety of values for p; these are presented in Section 4. An in-depth description of the Min-Hash process is given in Ref. [25]. Methods to make the matching process efficient given these signatures will be presented with the description of the retrieval process.
On a pragmatic note, because memory efficiency is paramount for deployment, each signature element must be small. Instead of tracking the first position in which a "1" occurs, we truncate the computation at position 255. If the first 1 occurs after position 255, it is demarcated as if it occurred at position 255. This allows us to keep each component of the signature as a single byte. Fig. 4 shows a histogram of the position of the first 1, computed for the snippets in our database, for three values of t (the number of top wavelets kept). Note that the cumulative probability of occurring after 255 is very low when more than 50 top coefficients are kept; this indicates that the single-byte representation loses little accuracy.
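A minimal sketch of the signature computation (ours; permutations are generated from a fixed seed so that database and probe vectors are permuted identically, which is one possible way to realize the "random, but known" reordering):

```python
import numpy as np

def minhash_signature(bits, p=100, max_pos=255, seed=0):
    """p-byte Min-Hash signature of a sparse binary vector.

    For each of p fixed random permutations, record the position of the
    first '1' in the permuted vector, truncated to 255 so that every
    signature element fits in a single byte."""
    rng = np.random.default_rng(seed)   # fixed seed: database and probe
                                        # vectors must share permutations
    bits = np.asarray(bits, dtype=np.uint8)
    sig = np.empty(p, dtype=np.uint8)
    for i in range(p):
        perm = rng.permutation(bits.size)
        ones = np.flatnonzero(bits[perm])
        sig[i] = min(int(ones[0]) if ones.size else max_pos, max_pos)
    return sig

def signature_similarity(sig1, sig2):
    """Fraction of exactly matching bytes; for large p this approaches
    the sparse-vector similarity a/(a + b + c)."""
    return float(np.mean(sig1 == sig2))
```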
In this study, Min-Hash reduces the size of the signatures from the intermediate wavelet representation described in the previous section to a compact representation of p bytes. There are numerous other techniques that are commonly used for dimensionality reduction, among them techniques such as principal components analysis (PCA). We chose Min-Hash due to a chain of reasons. We require discriminative power across our top-wavelet signatures, not descriptive power. This requirement means that PCA may not be the best representation; this need has traditionally been handled instead by LDA-based methods. Since our top wavelet signatures are already a sparse-vector representation, we employed techniques explicitly designed to handle probabilistic matching across sparse vectors, which did not require transformation to continuous values. Min-Hash is such a method and has been used extensively (and successfully) in data-stream processing for this class of problems.
[Fig. 4 plot: "Probability of First Non-Zero Element in Each Position"; probability versus bit position, with curves for keeping the top 50, 200, and 800 wavelet coefficients.]

Fig. 4. Probability of the first non-zero element occurring in each bit position, measured over all of the snippets in the database. The curves shown are for keeping the top 200 wavelet coefficients (as we will use in our system), keeping only 50, and keeping 800. Note that with 50 top coefficients, there is a large percentage after the truncation point of 255 (the spike at the right edge of the graph). When keeping 800 coefficients, notice that the probability of a "1" in the first position examined increases; this is because the sparseness of the bit vectors has decreased significantly.
3.3. Database-creation process

To this point, we have reduced the amount of information from each spectrogram through three steps. First, we kept only the top wavelets of the spectral images associated with the spectrogram. Second, we reduced each of the top wavelets to only two bits. Third, we used the Min-Hash procedure to reduce the resulting bit vector to p values, which became the sub-fingerprint that was stored for the audio segment.

After these steps, each spectral image is represented by a series of p 8-bit integers, the sub-fingerprint. Even with this compression, efficiently finding near-neighbors in a p-dimensional space is not a trivial task (when p > 50); naive comparisons are not practical. Instead, we use a technique termed locality-sensitive hashing (LSH). It is not only efficient in the number of comparisons that are required (a small fraction of the data set will be examined for a lookup), but it also provides noise-robustness properties.
In standard hashing, the query's sub-fingerprint is used, in its entirety, as a hash key that indexes into a bin that contains elements that have also been hashed into the same bin. The elements that are found in that bin are then considered candidate matches, and can be compared in detail (for example, by comparing the full signatures) to obtain the final matches. The implicit assumption in using the entire fingerprint is that the procedures employed to create the sub-fingerprint already fully accounted for the noise that the system would encounter. In contrast to the standard hashing procedure, LSH performs a series of hashes, each of which examines only a portion of the sub-fingerprint. The goal is to partition the feature vectors into l subvectors and to hash each into l separate hash tables, each hash table using one of the subvectors as input to the hash function. In this study, because the Min-Hashes are signatures in which the constituents are determined randomly, we simply assign non-overlapping, consecutive elements for each hash. Candidate neighbors can then be retrieved by partitioning the probe feature vector in the same manner, and collecting the entries in the corresponding hash bins. The final list of potential neighbors can be created by vote counting, with each hash casting votes for the entries of its indexed bin, and retaining the candidates that receive a minimum number of votes, v. If v = 1, this takes the union of the candidate lists. If v = l, this takes the intersection.
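A compact sketch of the banded tables and vote counting (ours; the dictionary-based tables and names are illustrative only):

```python
from collections import defaultdict

def build_lsh_tables(signatures, l):
    """signatures: dict mapping a fingerprint id to its p-byte Min-Hash
    signature. Each signature is split into l non-overlapping,
    consecutive bands of p/l bytes; band i keys hash table i."""
    p = len(next(iter(signatures.values())))
    band = p // l
    tables = [defaultdict(list) for _ in range(l)]
    for fp_id, sig in signatures.items():
        for i in range(l):
            tables[i][bytes(sig[i * band:(i + 1) * band])].append(fp_id)
    return tables, band

def lsh_candidates(tables, band, probe_sig, v):
    """Vote counting: each table votes for the ids in the bin selected
    by the probe's corresponding band; keep ids with at least v votes
    (v = 1 gives the union of the bins, v = l the intersection)."""
    votes = defaultdict(int)
    for i, table in enumerate(tables):
        key = bytes(probe_sig[i * band:(i + 1) * band])
        for fp_id in table.get(key, []):
            votes[fp_id] += 1
    return [fp_id for fp_id, n in votes.items() if n >= v]
```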
Unlike standard hashing schemes, in which the entire fingerprint must be exactly the same for candidate retrieval, LSH relaxes this restriction, requiring as little as p/l bytes of the sub-fingerprints to match. It should be noted that the number of hash tables, l, can vary independently from the size of the sub-fingerprint, p. We experiment with a variety of settings for l while keeping the size of the Min-Hash signature constant.

When creating the audio fingerprint database, we extract a sub-fingerprint of length p once each s seconds. We retain these sub-fingerprints in the observed time-order sequencing, with a pointer to a song identifier. The time ordering is encoded implicitly in our system by the sub-fingerprint index within a linear array. In addition to the time-ordered representation of the sub-fingerprints, each sub-fingerprint is inserted into the l hash tables of our LSH according to the hash-function mapping of the corresponding p/l bytes to a hash bin. This provides us with the complete song database from which we will retrieve matching songs; this is described in the next section.
3.4. Retrieval process

In the previous two subsections, we described the technique to create sub-fingerprints and to store them in a database for fast retrieval. In this section, we describe the process to find similar snippets once given this database and a new query snippet. The complete procedure is shown graphically in Fig. 5. Note that the first few steps (steps 1-5) of the retrieval process follow the database sub-fingerprint creation process closely; this is to ensure that the query and stored data are processed in the same manner. After the candidate matches are found (steps 6-8), a measure of temporal coherence is used to verify the findings (not shown in the diagram).
[Fig. 5 diagram, retrieval steps:
1. Input: full snippet; spectral images computed at variable strides.
2. Compute wavelets on spectral images.
3. Extract top-t wavelets.
4. Binary representation.
5. Generate Min-Hash signature (here shown as 25 bytes, A-Y).
6. LSH procedure with l hashes: bytes A-E index hash table 1, F-J hash table 2, ..., U-Y hash table l; each bin returns candidate fingerprint ids (e.g., 12, 50, 92, 302) with counts.
7. Count votes and retain those above the minimum, v = 2 (e.g., ids 92 and 102).
8. Compare the full signatures of the retained candidates with the query signature: Compare(Sig(92), Sig(q1)), Compare(Sig(102), Sig(q1)).]

Fig. 5. Overall architecture of the retrieval process. Dynamic-time warping not shown.
The first difference in the retrieval process, in comparison to the fingerprint generation process, is that the song is divided into randomly overlapping segments rather than uniformly overlapping segments. Randomly selecting the stride amount avoids problems of unlucky alignments; if the sampling of the probe is kept constant, it may repeatedly find samples that have uniformly large offsets from the sampling used to create the database. After the audio snippet is created, its signature is computed in exactly the same manner as described in the database generation process (steps 2-5). The next steps, 6-9, are described in the remainder of this section.
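A small sketch of the random-stride sampling (ours; the uniform stride distribution is an assumption, as the text does not specify one):

```python
import numpy as np

def probe_image_starts(num_slices, w, mean_stride, seed=None):
    """Randomly strided start positions for probe spectral images.

    Database images use a constant stride; probe images use random
    strides (drawn uniformly here, our assumption) so that an unlucky
    constant offset against the database grid cannot persist."""
    rng = np.random.default_rng(seed)
    starts, pos = [], 0
    while pos + w <= num_slices:
        starts.append(pos)
        pos += int(rng.integers(1, 2 * mean_stride))  # mean ~ mean_stride
    return starts
```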
3.4.1. Lookup using LSH

We described the basic characteristics of LSH in the previous subsection. Each component hash table votes for the sub-fingerprints that were retrieved using its p/l bytes. Then, only those sub-fingerprints with a minimum of v component votes are retained. For example, by setting v = 1, a candidate must only match in one of the subregions in order to be included in the final candidate list. At the opposite extreme, by setting v = l, the candidate must match on all subregions, and the LSH operates identically (although inefficiently) to a standard hash table.

The fingerprints that have at least v votes are then compared with the query sub-fingerprint (we refer to this as a full compare). Because each byte of the sub-fingerprint is a Min-Hash signature, we simply look at the number of bytes (out of p) that match exactly. The sub-fingerprint with the maximum of this score is the best match for the query spectral image.
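In code, the full compare is just a byte-match count (a sketch):

```python
import numpy as np

def full_compare(probe_sig, candidate_sig):
    """Full compare: the number of signature bytes (out of p) that
    match exactly; the candidate with the highest score wins."""
    return int(np.count_nonzero(np.asarray(probe_sig) ==
                                np.asarray(candidate_sig)))
```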
3.4.2. Dynamic programming for temporal ordering

To this point, we have discussed matching individual sub-fingerprints from the probe into the database. In this section, we describe methods to accumulate evidence across sub-fingerprints over the duration of the probe snippet.

The simplest way to combine evidence is voting without regard to temporal information. With this approach, we keep a similarity counter for each song. At the start of a new probe snippet, all of these counters are zero. Then, for each candidate sub-fingerprint match that passed the required hash-voting threshold, v, we increment the corresponding song counter by the Hamming similarity between the probe and candidate sub-fingerprints. The advantage of this approach is its simplicity and low memory requirements. The disadvantage is that evidence for a song accumulates without regard to temporal ordering of the sub-fingerprint matches. Pairs of sub-fingerprints p(t)/p(t + Δt) within the probe are rewarded equally for matching song sub-fingerprints s(t)/s(t − Δt) as they are for matching s(t)/s(t + Δt): there is no penalty for time reversal. Similarly, for matching s(t)/s(t + 10Δt): there is no penalty for changing tempo.
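A minimal sketch of this counter-based scheme (ours; the input pairing is assumed):

```python
from collections import defaultdict

def accumulate_song_votes(candidate_scores, song_of):
    """Temporally unordered evidence combination.

    candidate_scores: iterable of (fingerprint_id, hamming_similarity)
    pairs that passed the hash-vote threshold v; song_of maps a
    fingerprint id to its song id. Each song's counter accumulates the
    similarity scores with no regard to temporal ordering."""
    counters = defaultdict(float)
    for fp_id, score in candidate_scores:
        counters[song_of[fp_id]] += score
    return counters
```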
We also explored dynamic-time warping (DTW) to accumulate evidence across time. DTW is a form of dynamic programming for imposing "tempo" constraints in mapping one sequence onto another.

[Figure: dynamic-time-warping alignment of probe time against db-song time; the example illustrates a 40% faster probe.]

[...] By increasing the minimum number of votes required from the component hashes (v), we dramatically reduce the number of full comparisons that need to be conducted; the impact on accuracy will be carefully studied, as the likelihood of a missed match increases.
4. Parameter exploration
One of the intrinsic difficulties in designing systems with multiple components is conducting a thorough exploration of the parameter settings. Because of the need to deploy this system in a large-scale production setting, we extensively measure the effects of parameter values. The results reported in this section represent multiple decades of computational processing; over 50,400 different parameter combinations were tried to ensure that we select the best settings and understand the tradeoffs with each parameter. The parameters that we explored for the fingerprint and database generation and retrieval are shown in Table 2. To make the results more understandable, we also present some of our derived intuition about how the parameters affect computational complexity, memory usage and retrieval accuracy.

To explore this large parameter space, we used a 10,000 song database, with an average song duration of 3.5 min. Since our goal at this point is to understand the general system performance, we did not use heuristics such as unequal weighting of songs to influence the results. In the future, [...]
