An Industrial-Strength Audio Search Algorithm
Avery Li-Chun Wang
avery@shazamteam.com
Shazam Entertainment, Ltd.

USA: 2925 Ross Road, Palo Alto, CA 94303
United Kingdom: 375 Kensington High Street, 4th Floor Block F, London W14 8Q

We have developed and commercially deployed a flexible audio search engine. The algorithm is noise and distortion resistant, computationally efficient, and massively scalable, capable of quickly identifying a short segment of music captured through a cellphone microphone in the presence of foreground voices and other dominant noise, and through voice codec compression, out of a database of over a million tracks. The algorithm uses a combinatorially hashed time-frequency constellation analysis of the audio, yielding unusual properties such as transparency, in which multiple tracks mixed together may each be identified. Furthermore, for applications such as radio monitoring, search times on the order of a few milliseconds per query are attained, even on a massive music database.

1 Introduction

Shazam Entertainment, Ltd. was started in 2000 with the idea of providing a service that could connect people to music by using their mobile phones to recognize music playing in the environment. The algorithm had to be able to recognize a short audio sample of music that had been broadcast, mixed with heavy ambient noise, subjected to reverb and other processing, captured by a little cellphone microphone, subjected to voice codec compression and network dropouts, all before arriving at our servers. The algorithm also had to perform the recognition quickly over a large database of music with nearly 2M tracks, and furthermore have a low number of false positives while having a high recognition rate.

This was a hard problem, and at the time there were no algorithms known to us that could satisfy all these constraints. We eventually developed our own technique that met all the operational constraints [1].

We have deployed the algorithm to scale in our commercial music recognition service, with over 1.8M tracks in the database. The service is currently live in Germany, Finland, and the UK, with over a half million users, and will soon be available in additional countries in Europe, Asia, and the USA. The user experience is as follows: A user hears music playing in the environment. She calls up our service using her mobile phone and samples up to 15 seconds of audio. An identification is performed on the sample at our server, then the track title and artist are sent back to the user via SMS text messaging. The information is also made available on a web site, where the user may register and log in with her mobile phone number and password. At the web site, or on a smart phone, the user may view her tagged track list and buy the CD. The user may also download the ringtone corresponding to the tagged track, if it is available. The user may also send a 30-second clip of the song to a friend. Other services, such as purchasing an MP3 download, may become available soon.

A variety of similar consumer services has sprung up recently. Musiwave has deployed a similar mobile-phone music identification service on the Spanish mobile carrier Amena using Philips' robust hashing algorithm [2-4]. Using the algorithm from Relatable, Neuros has included a sampling feature on their MP3 player which allows a user to collect a 30-second sample from the built-in radio, then later plug into an online server to identify the music [5,6]. Audible Magic uses the Muscle Fish algorithm to offer the Clango service for identifying audio streaming from an internet radio station [7-9].

The Shazam algorithm can be used in many applications besides just music recognition over a mobile phone. Due to the ability to dig deep into noise, we can identify music hidden behind a loud voiceover, such as in a radio advert. On the other hand, the algorithm is also very fast and can be used for copyright monitoring at a search speed of over 1000 times realtime, thus enabling a modest server to monitor many media streams simultaneously. The algorithm is also suitable for content-based cueing and indexing for library and archival uses.

2 Basic principle of operation

Each audio file is "fingerprinted," a process in which reproducible hash tokens are extracted. Both "database" and "sample" audio files are subjected to the same analysis. The fingerprints from the unknown sample are matched against a large set of fingerprints derived from the music database. The candidate matches are subsequently evaluated for correctness of match. Some guiding principles for the attributes to use as fingerprints are that they should be temporally localized, translation-invariant, robust, and sufficiently entropic.

The temporal locality guideline suggests that each fingerprint hash is calculated using audio samples near a corresponding point in time, so that distant events do not affect the hash. The translation-invariant aspect means that fingerprint hashes derived from corresponding matching content are reproducible independent of position within an audio file, as long as the temporal locality containing the data from which the hash is computed is contained within the file. This makes sense, as an unknown sample could come from any portion of the original audio track. Robustness means that hashes generated from the original clean database track should be reproducible from a degraded copy of the audio. Furthermore, the fingerprint tokens should have sufficiently high entropy in order to minimize the probability of false token matches at non-corresponding locations between the unknown sample and tracks within the database. Insufficient entropy leads to excessive and spurious matches at non-corresponding locations, requiring more processing power to cull the results, and too much entropy usually leads to fragility and non-reproducibility of fingerprint tokens in the presence of noise and distortion.

There are 3 main components, presented in the next sections.

2.1 Robust Constellations

In order to address the problem of robust identification in the presence of highly significant noise and distortion, we experimented with a variety of candidate features that could survive GSM encoding in the presence of noise. We settled on spectrogram peaks, due to their robustness in the presence of noise and approximate linear superposability [1]. A time-frequency point is a candidate peak if it has a higher energy content than all its neighbors in a region centered around the point. Candidate peaks are chosen according to a density criterion in order to assure that the time-frequency strip for the audio file has reasonably uniform coverage. The peaks in each time-frequency locality are also chosen according to amplitude, with the justification that the highest-amplitude peaks are most likely to survive the distortions listed above.

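As a concrete illustration, a minimal peak-extraction sketch under these criteria follows. It is not Shazam's production code: the STFT parameters, the neighborhood size, and the percentile threshold standing in for the density criterion are all assumptions.

```python
# Sketch: reduce audio to a "constellation map" of spectrogram peaks.
# All parameter values are illustrative assumptions, not Shazam's settings.
import numpy as np
from scipy.ndimage import maximum_filter
from scipy.signal import stft

def constellation_map(samples, fs=8000, nperseg=1024, neighborhood=15):
    """Return a list of (time_index, frequency_index) spectrogram peaks."""
    _, _, Zxx = stft(samples, fs=fs, nperseg=nperseg)
    spec = np.abs(Zxx)
    # Candidate peak: higher energy than all neighbors in a region
    # centered around the point.
    is_peak = spec == maximum_filter(spec, size=neighborhood)
    # Crude density control: keep only the highest-amplitude candidates,
    # standing in for the paper's uniform-coverage criterion.
    floor = np.percentile(spec, 99.5)
    freq_idx, time_idx = np.nonzero(is_peak & (spec >= floor))
    return sorted(zip(time_idx, freq_idx))  # amplitude is discarded here
```
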
Thus, a complicated spectrogram, as illustrated in Figure 1A, may be reduced to a sparse set of coordinates, as illustrated in Figure 1B. Notice that at this point the amplitude component has been eliminated. This reduction has the advantage of being fairly insensitive to EQ, as generally a peak in the spectrum is still a peak with the same coordinates in a filtered spectrum (assuming that the derivative of the filter transfer function is reasonably small; peaks in the vicinity of a sharp transition in the transfer function are slightly frequency-shifted). We term the sparse coordinate lists "constellation maps" since the coordinate scatter plots often resemble a star field.

The pattern of dots should be the same for matching segments of audio. If you put the constellation map of a database song on a strip chart, and the constellation map of a short matching audio sample of a few seconds' length on a transparent piece of plastic, then slide the latter over the former, at some point a significant number of points will coincide when the proper time offset is located and the two constellation maps are aligned in register.

The number of matching points will be significant even in the presence of spurious peaks injected due to noise, as peak positions are relatively independent; further, the number of matches can also be significant even if many of the correct points have been deleted. Registration of constellation maps is thus a powerful way of matching in the presence of noise and/or deletion of features. This procedure reduces the search problem to a kind of "astronavigation," in which a small patch of time-frequency constellation points must be quickly located within a large universe of points in a strip-chart universe with dimensions of bandlimited frequency versus nearly a billion seconds in the database.

Yang also considered the use of spectrogram peaks, but employed them in a different way [10].

2.2 Fast Combinatorial Hashing

Finding the correct registration offset directly from constellation maps can be rather slow, due to raw constellation points having low entropy. For example, a 1024-bin frequency axis yields only at most 10 bits of frequency data per peak. We have developed a fast way of indexing constellation maps.

Fingerprint hashes are formed from the constellation map, in which pairs of time-frequency points are combinatorially associated. Anchor points are chosen, each anchor point having a target zone associated with it. Each anchor point is sequentially paired with points within its target zone, each pair yielding two frequency components plus the time difference between the points (Figures 1C and 1D). These hashes are quite reproducible, even in the presence of noise and voice codec compression. Furthermore, each hash can be packed into a 32-bit unsigned integer. Each hash is also associated with the time offset from the beginning of the respective file to its anchor point, though the absolute time is not a part of the hash itself.

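A minimal sketch of this pairing step follows. The 10-bit frequency fields plus a 12-bit time difference are one plausible packing of a 32-bit hash, and pairing each anchor with the next F peaks is a simplification of the paper's target zone; both are assumptions.

```python
# Sketch: combinatorial hash generation from a constellation map.
# Field widths and the target-zone rule are illustrative assumptions;
# the paper specifies only that a hash packs into 32 bits.
def generate_hashes(peaks, fan_out=10):
    """Yield (hash32, anchor_time) from (time, freq) peaks."""
    peaks = sorted(peaks)
    for i, (t1, f1) in enumerate(peaks):
        # Pair the anchor with up to fan_out later points (its target zone).
        for t2, f2 in peaks[i + 1 : i + 1 + fan_out]:
            dt = t2 - t1
            if 0 < dt < (1 << 12):
                # hash = f1 (10 bits) | f2 (10 bits) | delta-t (12 bits)
                yield (f1 & 0x3FF) << 22 | (f2 & 0x3FF) << 12 | dt, t1
```
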
To create a database index, the above operation is carried out on each track in a database to generate a corresponding list of hashes and their associated offset times. Track IDs may also be appended to the small data structs, yielding an aggregate 64-bit struct, 32 bits for the hash and 32 bits for the time offset and track ID. To facilitate fast processing, the 64-bit structs are sorted according to hash token value.

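A sketch of such an index follows, with the hash in the high 32 bits so that a plain integer sort groups equal hash tokens together. The bit split of the low word between time offset and track ID is an assumption; the paper says only that the two share 32 bits.

```python
# Sketch: build and query the sorted 64-bit index described above.
# The 20-bit time / 12-bit track ID split is an illustrative assumption.
import bisect

def build_index(tracks):
    """tracks: {track_id: peak_list}. Returns a sorted list of 64-bit ints."""
    index = []
    for track_id, peaks in tracks.items():
        for h, t in generate_hashes(peaks):
            index.append(h << 32 | (t & 0xFFFFF) << 12 | (track_id & 0xFFF))
    index.sort()  # sorted by hash token value
    return index

def lookup(index, h):
    """Return (db_time, track_id) for every entry whose hash equals h."""
    lo = bisect.bisect_left(index, h << 32)
    hi = bisect.bisect_left(index, (h + 1) << 32)
    return [((e >> 12) & 0xFFFFF, e & 0xFFF) for e in index[lo:hi]]
```
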
The number of hashes per second of audio recording being processed is approximately equal to the density of constellation points per second times the fan-out factor into the target zone. For example, if each constellation point is taken to be an anchor point, and if the target zone has a fan-out of size F=10, then the number of hashes is approximately equal to F=10 times the number of constellation points extracted from the file. By limiting the number of points chosen in each target zone, we seek to limit the combinatorial explosion of pairs. The fan-out factor leads directly to a cost factor in terms of storage space.

By forming pairs instead of searching for matches against individual constellation points we gain a tremendous acceleration in the search process. For example, if each frequency component is 10 bits, and the ∆t component is also 10 bits, then matching a pair of points yields 30 bits of information, versus only 10 for a single point. Then the specificity of the hash would be about a million times greater, due to the 20 extra bits, and thus the search speed for a single hash token is similarly accelerated. On the other hand, due to the combinatorial generation of hashes, assuming symmetric density and fan-out for both database and sample hash generation, there are F times as many token combinations in the unknown sample to search for, and F times as many tokens in the database, thus the total speedup is a factor of about 1000000/F^2, or about 10000, over token searches based on single constellation points.

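Restated as a single back-of-envelope expression (this only recasts the numbers already given in the text, using the example field widths):

```latex
% Net speedup of pair-hash search over single-point search (example values)
\[
\text{speedup} \;\approx\; \frac{2^{20}}{F \times F}
              \;\approx\; \frac{10^{6}}{F^{2}}
              \;=\; 10^{4} \qquad \text{for } F = 10.
\]
```
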
Note that the combinatorial hashing squares the probability of point survival, i.e. if p is the probability of a spectrogram peak surviving the journey from the original source material to the captured sample recording, then the probability of a hash from a pair of points surviving is approximately p^2. This reduction in hash survivability is a tradeoff against the tremendous amount of speedup provided. The reduced probability of individual hash survival is mitigated by the combinatorial generation of a greater number of hashes than original constellation points. For example, if F=10, then the probability of at least one hash surviving for a given anchor point would be the joint probability of the anchor point and at least one target point in its target zone surviving. If we simplistically assume IID probability p of survival for all points involved, then the probability of at least one hash surviving per anchor point is p*[1-(1-p)^F]. For reasonably large values of F, e.g. F>10, and reasonable values of p, e.g. p>0.1, we have approximately

    p*[1-(1-p)^F] ≈ p,

so we are actually not much worse off than before.

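A quick numeric check of this approximation, with p and F values chosen purely for illustration:

```python
# Probability that at least one pair hash survives per anchor point,
# under the paper's simplifying IID assumption. Values are illustrative.
def anchor_survival(p, F=10):
    return p * (1 - (1 - p) ** F)

for p in (0.1, 0.2, 0.5):
    print(f"p={p}: {anchor_survival(p):.3f}")
# Output: p=0.1: 0.065, p=0.2: 0.179, p=0.5: 0.500 -- approaching p itself
# as p or F grows, consistent with the approximation above.
```
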
We see that by using combinatorial hashing, we have traded off approximately 10 times the storage space for approximately 10000 times improvement in speed, and a small loss in probability of signal detection.

[Figure 4: Recognition rate -- Additive Noise. Recognition rate (%) versus Signal/Noise Ratio (dB), with curves for 15 sec, 10 sec, and 5 sec samples under linear noise addition.]

Different fan-out and density factors may be chosen for different signal conditions. For relatively clean audio, e.g. for radio monitoring applications, F may be chosen to be modestly small and the density can also be chosen to be low, versus the somewhat more challenging mobile phone consumer application. The difference in processing requirements can thus span many orders of magnitude.

2.3 Searching and Scoring

To perform a search, the above fingerprinting step is performed on a captured sample sound file to generate a set of hash:time offset records. Each hash from the sample is used to search in the database for matching hashes. For each matching hash found in the database, the corresponding offset times from the beginning of the sample and database files are associated into time pairs. The time pairs are distributed into bins according to the track ID associated with the matching database hash.

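Continuing the running sketch (constellation_map, generate_hashes, and lookup are names from the earlier illustrative snippets, not from the paper), the lookup-and-bin step might read:

```python
# Sketch: search phase. Each sample hash is looked up in the index; the
# resulting (database_time, sample_time) pairs are binned by track ID.
from collections import defaultdict

def match_sample(index, sample_audio):
    peaks = constellation_map(sample_audio)
    bins = defaultdict(list)  # track_id -> list of (t_db, t_sample) pairs
    for h, t_sample in generate_hashes(peaks):
        for t_db, track_id in lookup(index, h):
            bins[track_id].append((t_db, t_sample))
    return bins
```
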
After all sample hashes have been used to search in the database to form matching time pairs, the bins are scanned for matches. Within each bin the set of time pairs represents a scatterplot of association between the sample and database sound files. If the files match, matching features should occur at similar relative offsets from the beginning of the file, i.e. a sequence of hashes in one file should also occur in the matching file with the same relative time sequence. The problem of deciding whether a match has been found reduces to detecting a significant cluster of points forming a diagonal line within the scatterplot. Various techniques could be used to perform the detection, for example a Hough transform or other robust regression technique. Such techniques are overly general, computationally expensive, and susceptible to outliers.

Due to the rigid constraints of the problem, the following technique solves the problem in approximately N*log(N) time, where N is the number of points appearing on the scatterplot. For the purposes of this discussion, we may assume that the slope of the diagonal line is 1.0. Then corresponding times of matching features between matching files have the relationship

    t_k' = t_k + offset,

where t_k' is the time coordinate of the feature in the matching (clean) database soundfile and t_k is the time coordinate of the corresponding feature in the sample soundfile to be identified. For each (t_k', t_k) coordinate in the scatterplot, we calculate

    δt_k = t_k' - t_k.

Then we calculate a histogram of these δt_k values and scan for a peak. This may be done by sorting the set of δt_k values and quickly scanning for a cluster of values. The scatterplots are usually very sparse, due to the specificity of the hashes owing to the combinatorial method of generation as discussed above. Since the number of time pairs in each bin is small, the scanning process takes on the order of microseconds per bin, or less. The score of the match is the number of matching points in the histogram peak. The presence of a statistically significant cluster indicates a match.

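In sketch form, the sort-and-scan scoring of one bin might look as follows; the cluster tolerance is an assumption, since the text only says to scan the sorted δt_k values for a cluster:

```python
# Sketch: score one bin by finding the largest cluster of delta-t values.
# The tolerance models "a cluster of values"; its width is an assumption.
def score_bin(pairs, tolerance=2):
    """pairs: list of (t_db, t_sample). Score = size of best delta-t cluster."""
    deltas = sorted(t_db - t_sample for t_db, t_sample in pairs)
    best = j = 0
    for i in range(len(deltas)):
        while deltas[i] - deltas[j] > tolerance:
            j += 1                      # shrink window to within tolerance
        best = max(best, i - j + 1)     # points aligned at ~the same offset
    return best

def best_match(bins, threshold):
    """Return (track_id, score) of the best-scoring track, if significant."""
    tid, score = max(((t, score_bin(p)) for t, p in bins.items()),
                     key=lambda ts: ts[1], default=(None, 0))
    return (tid, score) if score >= threshold else (None, score)
```
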
[Figure 5: Recognition rate -- Additive noise + GSM compression. Recognition rate (%) versus Signal/Noise Ratio (dB), with curves for 15 sec, 10 sec, and 5 sec GSM-compressed samples.]

Figure 2A illustrates a scatterplot of database time versus sample time for a track that does not match the sample. There are a few chance associations, but no linear correspondence appears. Figure 3A shows a case where a significant number of matching time pairs appear on a diagonal line. Figures 2B and 3B show the histograms of the δt_k values corresponding to Figures 2A and 3A.

This bin scanning process is repeated for each track in the database until a significant match is found.

Note that the matching and scanning phases do not make any special assumption about the format of the hashes. In fact, the hashes only need to have the properties of having sufficient entropy to avoid too many spurious matches occurring, as well as being reproducible. In the scanning phase the main thing that matters is for the matching hashes to be temporally aligned.

2.3.1 Significance

As described above, the score is simply the number of matching and time-aligned hash tokens. The distribution of scores of incorrectly-matching tracks is of interest in determining the rate of false positives as well as the rate of correct recognitions. To summarize briefly, a histogram of the scores of incorrectly-matching tracks is collected. The number of tracks in the database is taken into account and a probability density function of the score of the highest-scoring incorrectly-matching track is generated. Then an acceptable false positive rate is chosen (for example 0.1% or 0.01%, depending on the application), and a threshold score is chosen that meets or exceeds the false-positive criterion.

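One way such a threshold could be derived is sketched below; the paper does not prescribe an estimator, and treating wrong-track scores as independent across the database is a simplification:

```python
# Sketch: choose a score threshold from an empirical null distribution of
# incorrectly-matching track scores. Assumes independence across the
# n_tracks database (a simplification).
import numpy as np

def choose_threshold(null_scores, n_tracks, max_fp_rate=0.001):
    """Smallest score s with P(best wrong track scores >= s) <= max_fp_rate."""
    null_scores = np.asarray(null_scores)
    for s in range(int(null_scores.max()) + 2):
        p_tail = np.mean(null_scores >= s)           # one wrong track
        p_fp = 1.0 - (1.0 - p_tail) ** n_tracks      # best of n_tracks
        if p_fp <= max_fp_rate:
            return s
    return None
```
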
3 Performance

3.1 Noise resistance

The algorithm performs well with significant levels of noise and even non-linear distortion. It can correctly identify music in the presence of voices, traffic noise, dropout, and even other music. To give an idea of the power of this technique, from a heavily corrupted 15 second sample, a statistically significant match can be determined with only about 1-2% of the generated hash tokens actually surviving and contributing to the offset cluster. A property of the scatterplot histogramming technique is that discontinuities are irrelevant, allowing immunity to dropouts and masking due to interference. One somewhat surprising result is that even with a large database, we can correctly identify each of several tracks mixed together, including multiple versions of the same piece, a property we call "transparency".

Figure 4 shows the result of performing 250 sample recognitions of varying length and noise levels against a test database of 10000 tracks consisting of popular music. A noise sample was recorded in a noisy pub to simulate "real-life" conditions. Audio excerpts of 15, 10, and 5 seconds in length were taken from the middle of each test track, each of which was taken from the test database. For each test excerpt, the relative power of the noise was normalized to the desired signal-to-noise ratio, then linearly added to the sample. We see that the recognition rate drops to 50% for 15, 10, and 5 second samples at approximately -9, -6, and -3 dB SNR, respectively. Figure 5 shows the same analysis, except that the resulting music+noise mixture was further subjected to GSM 6.10 compression, then reconverted to PCM audio. In this case, the 50% recognition rate level for 15, 10, and 5 second samples occurs at approximately -3, 0, and +4 dB SNR. Audio sampling and processing was carried out using 8 kHz, mono, 16-bit samples.

3.2 Speed

For a database of about 20 thousand tracks implemented on a PC, the search time is on the order of 5-500 milliseconds, depending on parameter settings and application. The service can find a matching track for a heavily corrupted audio sample within a few hundred milliseconds of core search time. With "radio quality" audio, we can find a match in less than 10 milliseconds, with a likely optimization goal reaching down to 1 millisecond per query.

3.3 Specificity and False Positives

The algorithm was designed specifically to target recognition of sound files that are already present in the database. It is not expected to generalize to live recordings. That said, we have anecdotally discovered several artists in concert who apparently either have extremely accurate and reproducible timing (with millisecond precision), or are more plausibly lip synching.

The algorithm is conversely very sensitive to which particular version of a track has been sampled. Given a multitude of different performances of the same song by an artist, the algorithm can pick the correct one even if they are virtually indistinguishable by the human ear.

We occasionally get reports of false positives. Oftentimes we find that the algorithm was not actually wrong, since it had picked up an example of "sampling," or plagiarism. As mentioned above, there is a tradeoff between true hits and false positives, and thus the maximum allowable percentage of false positives is a design parameter that is chosen to suit the application.

4 Acknowledgements

Special thanks to Julius O. Smith, III and Daniel P. W. Ellis for providing guidance. Thanks also to Chris, Philip, Dhiraj, Claus, Ajay, Jerry, Matt, Mike, Rahul, Beth and all the other wonderful folks at Shazam, and to my Katja.

5 References

[1] Avery Li-Chun Wang and Julius O. Smith, III, WIPO publication WO 02/11123A2, 7 February 2002 (priority 31 July 2000).
[2] Musiwave web site: http://www.musiwave.com
[3] Jaap Haitsma, Antonius Kalker, Constant Baggen, and Job Oostveen, WIPO publication WO 02/065782A1, 22 August 2002 (priority 12 February 2001).
[4] Jaap Haitsma and Antonius Kalker, "A Highly Robust Audio Fingerprinting System", International Symposium on Music Information Retrieval (ISMIR) 2002, pp. 107-115.
[5] Neuros Audio web site: http://www.neurosaudio.com/
[6] Sean Ward and Isaac Richards, WIPO publication WO 02/073520A1, 19 September 2002 (priority 13 March 2001).
[7] Audible Magic web site: http://www.audiblemagic.com/
[8] Erling Wold, Thom Blum, Douglas Keislar, and James Wheaton, "Content-Based Classification, Search, and Retrieval of Audio", IEEE Multimedia, Vol. 3, No. 3, Fall 1996, pp. 27-36.
[9] Clango web site: http://www.clango.com/
[10] Cheng Yang, "MACS: Music Audio Characteristic Sequence Indexing for Similarity Retrieval", IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2001.