An adaptive streaming system is described herein that provides a stateless
connection between the client and server for streaming media playback in which
the data is formatted in a manner that allows the client to make decisions and
react more quickly to changing network conditions. The client requests uniform
chunks of media from the server that include a portion of the media. The
adaptive streaming system requests portions of a media file or of a live
streaming event in small-sized chunks each having a distinguished URL. This
allows streaming media data to be cached by existing Internet cache
infrastructure. Each chunk contains metadata information that describes the
encoding of the chunk and media content for playback by the client. The server
may provide chunks in multiple encodings so that the client can switch quickly
to chunks of a different bit rate or playback speed.
[0001] Streaming media is multimedia that is constantly received by, and
normally presented to, an end-user (using a client) while it is being delivered
by a streaming provider (using a server). Several protocols exist for streaming
media, including the Real-time Streaming Protocol (RTSP), Real-time Transport
Protocol (RTP), and the Real-time Transport Control Protocol (RTCP), which
streaming applications often use together. RTSP, developed by the Internet
Engineering Task Force (IETF) and published in 1998 as Request For Comments
(RFC) 2326, is a protocol for use in streaming media systems that allows a
client to remotely control a streaming media server, issuing VCR-like commands
such as "play" and "pause", and allowing time-based access to files on a server.
[0002] The sending of streaming data itself is not part of the RTSP protocol.
Most RTSP servers use the standards-based RTP as the transport protocol for the
actual audio/video data, with RTSP acting somewhat as a metadata channel. RTP
defines a standardized packet format for delivering audio and video over the
Internet. RTP was developed by the Audio-Video Transport Working Group of the
IETF, first published in 1996 as RFC 1889, and superseded by RFC 3550 in 2003.
RTSP is similar in syntax and operation to the Hypertext Transfer Protocol
(HTTP), but RTSP adds new requests. While HTTP is stateless, RTSP is a stateful
protocol. RTSP uses a session ID to keep track of sessions when needed. RTSP
messages are generally sent from client to server, although some exceptions
exist where the server sends messages to the client.
[0003] Streaming applications usually use RTP in conjunction with RTCP. While
RTP carries the media streams (e.g., audio and video) or out-of-band signaling
(dual-tone multi-frequency (DTMF)), streaming applications use RTCP to monitor
transmission statistics and quality of service (QoS) information. RTP allows
only one type of message, one that carries data from the source to the
destination. In many cases, there is a need for other messages in a session.
These messages control the flow and quality of data and allow the recipient to
send feedback to the source or sources. RTCP is a protocol designed for this
purpose. RTCP has five types of messages: sender report, receiver report,
source description message,
`WO 2010/107625
bye message, and application-specific message. RTCP provides out-of-band
control information for an RTP flow and partners with RTP in the delivery and
packaging of multimedia data, but does not transport any data itself. Streaming
applications use RTCP to periodically transmit control packets to participants
in a streaming multimedia session. One function of RTCP is to provide feedback
on the quality of service RTP is providing. RTCP gathers statistics on a media
connection and information such as bytes sent, packets sent, lost packets,
jitter, feedback, and round trip delay. An application may use this information
to increase the quality of service, perhaps by limiting flow or using a
different codec or bit rate.
[0004] One problem with existing media streaming architectures is the tight
coupling between server and client. The stateful connection between client and
server creates additional server overhead, because the server tracks the
current state of each client. This also limits the scalability of the server.
In addition, the client cannot quickly react to changing conditions, such as
increased packet loss, reduced bandwidth, user requests for different content
or to modify the existing content (e.g., speed up or rewind), and so forth,
without first communicating with the server and waiting for the server to adapt
and respond. Often, when a client reports a lower available bandwidth (e.g.,
through RTCP), the server does not adapt quickly enough, and the user notices
breaks in the media because packets that exceed the available bandwidth are not
received and new lower bit rate packets are not sent from the server in time.
To avoid these problems, clients often buffer data, but buffering introduces
latency, which for live events may be unacceptable.
[0005] In addition, the Internet contains many types of downloadable media
content items, including audio, video, documents, and so forth. These content
items are often very large, such as video in the hundreds of megabytes. Users
often retrieve documents over the Internet using HTTP through a web browser.
The Internet has built up a large infrastructure of routers and proxies that
are effective at caching data for HTTP. Servers can provide cached data to
clients with less delay and by using fewer resources than re-requesting the
content from the original source. For example, a user in New York may download
a content item served from a host in Japan, and receive the content item
through a router in California. If a user in New Jersey requests the same file,
the router in California may be able to provide the content item without again
requesting the data from the host in Japan. This reduces the network traffic
over possibly strained routes, and allows the user in New Jersey to receive the
content item with less latency.
[0006] Unfortunately, live media often cannot be cached using existing
protocols, and each client requests the media from the same server or set of
servers. In addition, when streaming media can be cached, specialized cache
hardware is often involved, rather than existing and readily available
HTTP-based Internet caching infrastructure. The lack of caching limits the
number of concurrent viewers and requests that the servers can handle, and
limits the attendance of a live event. The world is increasingly using the
Internet to consume up-to-the-minute live information, as shown by the record
number of users that watched live events such as the opening of the 2008
Olympics via the Internet. The limitations of current technology are slowing
adoption of the Internet as a medium for consuming this type of media content.
[0007] An adaptive streaming system is described herein that provides a
stateless connection between the client and server for streaming media playback
in which the data is formatted in a manner that allows the client to make
decisions traditionally performed by the server and therefore react more
quickly to changing network conditions. The client requests uniform chunks of
media from the server that include a portion of the media. The adaptive
streaming system requests portions of a media file or of a live streaming event
in small-sized chunks each having a distinguished URL. This allows existing
Internet cache infrastructure to cache streaming media, thereby allowing more
clients to view the same content at about the same time. As the event
progresses, the client continues requesting chunks until the end of the event
or media. Each chunk contains metadata information that describes the encoding
of the chunk and media content for playback by the client. The server may
provide chunks in multiple encodings so that the client can switch quickly to
chunks of a different bit rate or playback speed. Thus, the adaptive streaming
system provides an improved experience to the user with fewer breaks in
streaming media playback, and an increased likelihood that the client will
receive the media with lower latency from a more local cache server.
[0008] This Summary is provided to introduce a selection of concepts in a
simplified form that are further described below in the Detailed Description.
This Summary is not intended to identify key features or essential features of
the claimed subject matter, nor is it intended to be used to limit the scope of
the claimed subject matter.
[0009] Figure 1 is a block diagram that illustrates components of the adaptive
streaming system, in one embodiment.
[0010] Figure 2 is a block diagram that illustrates an operating environment of
the smooth streaming system using Microsoft Windows and IIS, in one embodiment.
[0011] Figure 3 is a flow diagram that illustrates the processing of the
adaptive streaming system on a client to play back media, in one embodiment.
[0012] Figure 4 is a flow diagram that illustrates the processing of the
adaptive streaming system to handle a single media chunk, in one embodiment.
[0013] An adaptive streaming system is described herein that provides a
stateless connection between the client and server for streaming media playback
in which the data is formatted in a manner that allows the client to make
decisions often left to the server in past protocols, and therefore react more
quickly to changing network conditions. In addition, the adaptive streaming
system operates in a manner that allows existing Internet cache infrastructure
to cache streaming media data, thereby allowing more clients to view the same
content at about the same time. The adaptive streaming system requests portions
of a media file or of a live streaming event in small-sized chunks each having
a distinguished URL. Each chunk may be a media file in its own right or may be
a part of a whole media file. As the event progresses, the client continues
requesting chunks until the end of the event. Each chunk contains metadata
information that describes the encoding of the chunk and media content for
playback by the client. The server may provide chunks in multiple encodings so
that the client can, for example, switch quickly to chunks of a different bit
rate or playback speed. Because the chunks adhere to World Wide Web Consortium
(W3C) HTTP standards, are small enough to be cached, and are provided in the
same way to each client, they are naturally cached by existing Internet
infrastructure without modification. Thus, the adaptive streaming system
provides an improved experience to the user with fewer breaks in streaming
media playback, and an increased likelihood that the client will receive the
media with lower latency from a more local cache server. Because the connection
between the client and server is stateless, the same client and server need not
be connected for the duration of a long event. The stateless system described
herein has no server affinity, allowing clients to piece together manifests
from servers that may have begun at different times, and also allowing server
administrators to bring up or shut down origin servers as load dictates.
[0014] In some embodiments, the adaptive streaming system uses a new data
transmission format between the server and client. The client requests chunks
of media from a server that include a portion of the media. For example, for a
10-minute file, the client may request 2-second chunks. Note that unlike
typical streaming where the server pushes data to the client, in this case the
client pulls media chunks from the server. In the case of a live stream, the
server may be creating the media on the fly and producing chunks to respond to
client requests. Thus, the client may only be several chunks behind the server
in terms of how fast the server creates chunks and how fast the client requests
chunks.
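By way of illustration only, the pull model for the 10-minute file and 2-second
chunks mentioned above might be sketched as follows. The URL layout
(`bitrate-…/start-…`), the host name, and the function name `chunk_urls` are
assumptions made for this sketch; the description does not prescribe a
particular URL scheme, only that each chunk has a distinguished URL.

```python
from typing import Iterator


def chunk_urls(base_url: str, bit_rate: int, duration_s: int,
               chunk_s: int = 2) -> Iterator[str]:
    """Yield one distinguished URL per chunk, in playback order.

    The path layout is hypothetical; any scheme works as long as the
    same chunk always maps to the same URL for every client.
    """
    for start in range(0, duration_s, chunk_s):
        yield f"{base_url}/bitrate-{bit_rate}/start-{start}"


# A 10-minute (600-second) file requested as 2-second chunks.
urls = list(chunk_urls("http://example.com/media/clip", 300_000, 600))
print(len(urls))   # 300 chunks
print(urls[0])
print(urls[-1])
```

Because every client derives the same URL for the same chunk, an intermediate
HTTP cache can answer repeated requests without contacting the server.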
[0015] Each chunk contains metadata and media content. The metadata may
describe useful information about the media content, such as the bit rate of
the media content, where the media content fits into a larger media element
(e.g., this chunk represents offset 1:10 in a 10-minute video clip), the codec
used to encode the media content, and so forth. The client uses this
information to place the chunk into a storyboard of the larger media element
and to properly decode and play back the media content.
[0016] Figure 1 is a block diagram that illustrates components of the adaptive
streaming system, in one embodiment. The adaptive streaming system 100 includes
a chunk request component 110, a chunk parsing component 120, a manifest
assembly component 130, a media playback component 140, a QoS monitoring
component 150, and a clock synchronization component 160. Each of these
components is described in further detail herein. The adaptive streaming system
100 as described herein operates primarily at a client computer system.
However, those of ordinary skill in the art will recognize that various
components of the system may be placed at various locations within a content
network environment to provide particular positive results.
[0017] The chunk request component 110 makes requests from the client for
individual media chunks from the server. As shown in Figure 2, the client’s
request may pass first to an edge server (e.g., an Internet cache), then to an
origin server, and then to an ingest server. At each stage, if the requested
data is found, then the request does not go to the next level. For example, if
the edge server has the requested data, then the client receives the data from
the edge server and the origin server does not receive the request. Each chunk
may have a Uniform Resource Locator (URL) that individually identifies the
chunk. Internet cache servers are good at caching server responses to specific
URL requests (e.g., HTTP GET). Thus, when the first client calls through to the
server to get a chunk, the edge servers cache that chunk and subsequent clients
that request the same chunk may receive the chunk from the edge server (based
on the cache lifetime and server time to live (TTL) settings). The chunk
request component 110 receives the chunk and passes it to the chunk parsing
component 120 for interpretation.
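The tiered lookup described above (edge server, then origin server, then ingest
server) can be illustrated with a minimal in-memory sketch. The class
`CacheTier`, its `requests` counter, and the example URL are hypothetical names
introduced here; the sketch also ignores cache lifetime and TTL expiry, which a
real HTTP cache would honor.

```python
from typing import Callable, Dict, Optional


class CacheTier:
    """One level of the request path (edge, origin, or ingest)."""

    def __init__(self, name: str,
                 upstream: Optional["CacheTier"] = None,
                 producer: Optional[Callable[[str], bytes]] = None) -> None:
        self.name = name
        self.upstream = upstream      # next level to ask on a miss
        self.producer = producer      # creates data at the last level
        self.store: Dict[str, bytes] = {}
        self.requests = 0             # how many requests reached this tier

    def get(self, url: str) -> bytes:
        self.requests += 1
        if url in self.store:         # found here: do not go further
            return self.store[url]
        if self.upstream is not None:
            data = self.upstream.get(url)
        elif self.producer is not None:
            data = self.producer(url)
        else:
            raise KeyError(url)
        self.store[url] = data        # cache the response by URL
        return data


ingest = CacheTier("ingest", producer=lambda u: b"media:" + u.encode("ascii"))
origin = CacheTier("origin", upstream=ingest)
edge = CacheTier("edge", upstream=origin)

url = "http://example.com/media/clip/bitrate-300000/start-0"
first = edge.get(url)    # first client: request travels to the ingest tier
second = edge.get(url)   # later client: answered by the edge tier alone
print(edge.requests, origin.requests, ingest.requests)  # 2 1 1
```

In this sketch the second request never reaches the origin tier, which mirrors
the behavior described for subsequent clients requesting the same chunk.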
[0018] The chunk parsing component 120 interprets the format of a media chunk
received by the chunk request component 110 and separates the chunk into its
component parts. Typically, the chunk includes a header portion containing
metadata, and a data portion containing media content. The chunk parsing
component 120 provides the metadata to the manifest assembly component 130 and
the media content to the media playback component 140.
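As a non-limiting sketch, separating a chunk into its header portion and data
portion might look like the following. The wire layout used here (a 4-byte
big-endian header length, a JSON header, then raw media bytes) is purely an
assumption for illustration; the description does not define the chunk format,
and the helper `build_chunk` exists only to produce test input.

```python
import json
import struct
from typing import Tuple


def build_chunk(metadata: dict, media: bytes) -> bytes:
    """Pack a chunk in the hypothetical layout used by this sketch."""
    header = json.dumps(metadata).encode("utf-8")
    return struct.pack(">I", len(header)) + header + media


def parse_chunk(chunk: bytes) -> Tuple[dict, bytes]:
    """Split a chunk into (metadata, media content)."""
    if len(chunk) < 4:
        raise ValueError("chunk too short to contain a header length")
    (header_len,) = struct.unpack(">I", chunk[:4])
    header_end = 4 + header_len
    if header_end > len(chunk):
        raise ValueError("chunk truncated inside its header")
    metadata = json.loads(chunk[4:header_end].decode("utf-8"))
    return metadata, chunk[header_end:]


raw = build_chunk(
    {"bit_rate": 300000, "codec": "example-codec", "start_s": 70,
     "duration_s": 2},
    b"\x00\x01\x02\x03",
)
metadata, media = parse_chunk(raw)
print(metadata["start_s"], len(media))  # 70 4
```

The metadata would then go to manifest assembly and the media bytes to
playback, matching the division of labor described above.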
[0019] The manifest assembly component 130 builds a manifest that describes the
media element to which received media content belongs. Large media files that
clients download as a whole (i.e., not streamed) often include a manifest
describing the whole file, the codecs and bit rates used to encode various
portions of the file, markers about meaningful positions within the file, and
so forth. During streaming, particularly of live content, a server cannot
provide a complete manifest because the event is still ongoing. Thus, the
server provides as much of the manifest as it can through the metadata in the
media chunks. The server may also provide an application-programming interface
(API), such as a predefined URL, for the client to request the manifest up to
the current point in the media stream. This can be useful when the client joins
a live, streamed event after the event is already in progress. The manifest
allows the client to request previously streamed portions of the media element
(e.g., by rewinding), and the client continues to receive new portions of the
manifest through the metadata of the streamed media chunks.
[0020] The manifest assembly component 130 builds a manifest similar to that
available for a complete media file. Thus, as the event proceeds, if the user
wants to skip backwards in the media (e.g., rewind or jump to a particular
position) and then skip forward again, the user can do so, and the client uses
the assembled manifest to find the appropriate chunk or chunks to play back to
the user. When the user pauses, the system 100 may continue to receive media
chunks (or only the metadata portion of chunks based on a distinguished request
URL), so that the manifest assembly component 130 can continue to build the
manifest and be ready for any user requests (e.g., skip to the current live
position or play from the pause point) after the user is done pausing. The
client-side assembled manifest allows the client to play the media event back
as on-demand content as soon as the event is over, and to skip around within
the media event as it is going on.
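One possible way to assemble a manifest incrementally from chunk metadata, and
to use it for seeking, is sketched below. The names `ChunkInfo`, `Manifest`,
`find`, and `live_position`, and the choice of metadata fields, are assumptions
for this sketch rather than the structure used by the described system.

```python
import bisect
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ChunkInfo:
    start_s: float      # where the chunk fits in the media timeline
    duration_s: float
    bit_rate: int
    codec: str


class Manifest:
    """Client-side manifest built up from per-chunk metadata."""

    def __init__(self) -> None:
        self._starts: List[float] = []            # kept sorted
        self._by_start: Dict[float, ChunkInfo] = {}

    def add(self, info: ChunkInfo) -> None:
        if info.start_s not in self._by_start:
            bisect.insort(self._starts, info.start_s)
        self._by_start[info.start_s] = info

    def find(self, position_s: float) -> Optional[ChunkInfo]:
        """Return the known chunk covering position_s, if any."""
        i = bisect.bisect_right(self._starts, position_s) - 1
        if i < 0:
            return None
        info = self._by_start[self._starts[i]]
        if position_s < info.start_s + info.duration_s:
            return info
        return None

    def live_position(self) -> Optional[float]:
        """End of the newest chunk seen so far."""
        if not self._starts:
            return None
        last = self._by_start[self._starts[-1]]
        return last.start_s + last.duration_s


manifest = Manifest()
for start in (4.0, 0.0, 2.0):   # metadata may arrive in any order
    manifest.add(ChunkInfo(start, 2.0, 300_000, "example-codec"))

print(manifest.find(3.5).start_s)   # 2.0 (rewind target)
print(manifest.find(10.0))          # None (not yet streamed)
print(manifest.live_position())     # 6.0
```

A rewind or jump then reduces to a `find` call followed by a request for the
URL of the chunk it returns.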
[0021] The media playback component 140 plays back received media content using
the client hardware. The media playback component 140 may invoke one or more
codecs to interpret the container within which the media content is transported
and to decompress or otherwise decode the media content from a compressed
format to a raw format (e.g., YV12, RGBA, or PCM audio samples) ready for
playback. The media playback component 140 may then provide the raw format
media content to an operating system API (e.g., Microsoft DirectX) for playback
on local computer system sound and video hardware, such as a display and
speakers.
[0022] The QoS monitoring component 150 analyzes the success of receiving
packets from the server and adapts the client’s requests based on a set of
current network and other conditions. For example, if the client is routinely
receiving media chunks late, then the component 150 may determine that the
bandwidth between the client and the server is inadequate for the current bit
rate, and the client may begin requesting media chunks at a lower bit rate. QoS
monitoring may include measurement of other heuristics, such as render frame
rate, window size, buffer size, frequency of rebuffering, and so forth. Media
chunks for each bit rate may have a distinguished URL so that chunks for
various bit rates are cached by Internet cache infrastructure. Note that the
server does not track client state and does not know what bit rate any
particular client is currently playing. The server can simply provide the same
media element in a variety of bit rates to satisfy potential client requests
under a range of conditions. In addition, the initial manifest and/or metadata
that the client receives may include information about the bit rates and other
encoding properties available from the server, so that the client can choose
the encoding that will provide a good client experience.
[0023] Note that when switching bit rates, the client simply begins requesting
the new bit rate and playing back the new bit rate chunks as the client
receives the chunks. The client does not have to send control information to
the server and wait for the server to adapt the stream. The client’s request
may not even reach the server due to a cache in between the client and server
satisfying the request. Thus, the client is much quicker to react than clients
in traditional media streaming systems are, and the burden on the server of
having different clients connecting under various current conditions is reduced
dramatically. In addition, because current conditions tend to be localized, it
is likely that many clients in a particular geographic region or on a
particular Internet service provider (ISP) will experience similar conditions
and will request similar media encodings (e.g., bit rates). Because caches also
tend to be localized, it is likely that the clients in a particular situation
will find that the cache near them is “warm” with the data that they each
request, so that the latency experienced by each client will be low.
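The client-side choice of bit rate described above might be sketched as
follows. The throughput estimate, the 0.8 safety factor, and the function names
are assumptions made for illustration; the description leaves the QoS heuristic
open and mentions other inputs (frame rate, buffer size, and so forth) that
this sketch omits.

```python
from typing import Sequence


def measured_throughput_bps(chunk_bytes: int, download_s: float) -> float:
    """Estimate throughput from the most recent chunk download."""
    return chunk_bytes * 8 / download_s


def pick_bit_rate(available: Sequence[int], throughput_bps: float,
                  safety: float = 0.8) -> int:
    """Pick the highest advertised bit rate that fits the estimate.

    Falls back to the lowest bit rate when none fits. The safety
    factor is an arbitrary margin chosen for this sketch.
    """
    budget = throughput_bps * safety
    usable = [rate for rate in sorted(available) if rate <= budget]
    return usable[-1] if usable else min(available)


rates = [300_000, 700_000, 1_500_000]   # e.g., learned from the manifest
# A 250,000-byte chunk that took 2 seconds gives a 1 Mbit/s estimate.
estimate = measured_throughput_bps(250_000, 2.0)
print(pick_bit_rate(rates, estimate))    # 700000
print(pick_bit_rate(rates, 200_000))     # 300000
```

Switching then consists only of requesting the next chunk from the URL for the
chosen bit rate; no control message is sent to the server.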
[0024] The clock synchronization component 160 synchronizes the clocks of the
server and the client. Although absolute time is not generally relevant to the
client and server, being able to identify a particular chunk and knowing the
rate (i.e., cadence) at which to request chunks is relevant to the client. For
example, if the client requests data too quickly, the server will not yet have
the data and will respond with error responses (e.g., an HTTP 404 not found
error response), creating many spurious requests that unnecessarily consume
bandwidth. On the other hand, if the client requests data too slowly, then the
client may not have data in time for playback, creating noticeable breaks in
the media played back to the user. Thus, the client and server work well when
the client knows the rate at which the server is producing new chunks and knows
where the current chunk fits into the overall timeline. The clock
synchronization component 160 provides this information by allowing the server
and client to have a similar clock value at a particular time. The server may
also mark each media chunk with the time at which the server created the chunk.
[0025] Clock synchronization also gives the server a common reference across
each of the encoders. For example, the server may encode data in multiple bit
rates and using multiple codecs at the same time. Each encoder may reference
encoded data in a different way, but the timestamp can be set in common across
all encoders. In this way, if a client requests a particular chunk, the client
will get media representing the same period regardless of the encoding that the
client selects.
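A minimal sketch of how a client could use a synchronized clock to pace its
requests is given below. It assumes fixed-duration chunks, a stream start time
expressed on the server's clock, and a single offset measurement; these
assumptions, and all function names, are introduced for illustration and are
not taken from the description.

```python
def clock_offset(server_time_s: float, client_time_s: float) -> float:
    """Offset to add to the client clock to approximate the server clock."""
    return server_time_s - client_time_s


def newest_chunk_index(client_now_s: float, offset_s: float,
                       stream_start_s: float, chunk_s: float) -> int:
    """Index of the newest chunk the server should have finished.

    Returns -1 when no chunk is complete yet. Requesting a higher
    index would likely produce a not-found response.
    """
    elapsed = (client_now_s + offset_s) - stream_start_s
    return int(elapsed // chunk_s) - 1


def seconds_until_next_chunk(client_now_s: float, offset_s: float,
                             stream_start_s: float, chunk_s: float) -> float:
    """How long to wait before the next chunk should exist."""
    elapsed = (client_now_s + offset_s) - stream_start_s
    return chunk_s - (elapsed % chunk_s)


# At synchronization the server reported 1000.0 while the client read 995.0.
offset = clock_offset(1000.0, 995.0)
# Later the client clock reads 1000.0; the stream began at server time 900.0.
print(newest_chunk_index(1000.0, offset, 900.0, 2.0))        # 51
print(seconds_until_next_chunk(1000.0, offset, 900.0, 2.0))  # 1.0
```

Because the index is derived from a timestamp shared across encoders, the same
index would refer to the same period of media at every bit rate.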
[0026] The computing device on which the system is implemented may include a
central processing unit, memory, input devices (e.g., keyboard and pointing
devices), output devices (e.g., display devices), and storage devices (e.g.,
disk drives or other non-volatile storage media). The memory and storage
devices are computer-readable storage media that may be encoded with
computer-executable instructions (e.g., software) that implement or enable the
system. In addition, the data structures and message structures may be stored
or transmitted via a data transmission medium, such as a signal on a
communication link. Various communication links may be used, such as the
Internet, a local area network, a wide area network, a point-to-point dial-up
connection, a cell phone network, and so on.
[0027] Embodiments of the system may be implemented in various operating
environments that include personal computers, server computers, handheld or
laptop devices, multiprocessor systems, microprocessor-based systems,
programmable consumer electronics, digital cameras, network PCs, minicomputers,
mainframe computers, distributed computing environments that include any of the
above systems or devices, and so on. The computer systems may be cell phones,
personal digital assistants, smart phones, personal computers, programmable
consumer electronics, digital cameras, and so on.
[0028] The system may be described in the general context of
computer-executable instructions, such as program modules, executed by one or
more computers or other devices. Generally, program modules include routines,
programs, objects, components, data structures, and so on that perform
particular tasks or implement particular abstract data types. Typically, the
functionality of the program modules may be combined or distributed as desired
in various embodiments.
[0029] Figure 2 is a block diagram that illustrates an operating environment of
the smooth streaming system using Microsoft Windows and IIS, in one embodiment.
The environment typically includes a source client 210, a content delivery
network 240, and an external network 270. The source client is the source of
the media or live event. The source client includes a media source 220 and one
or more encoders 230. The media source 220 may include cameras each providing
multiple camera angles, microphones that capture audio, slide presentations,
text (such as from a closed captioning service), images, and other types of
media. The encoders 230 encode the data from the media source 220 in one or
more encoding formats in parallel. For example, the encoders 230 may produce
encoded media in a variety of bit rates.
[0030] The content delivery network 240 includes one or more ingest servers 250
and one or more origin servers 260. The ingest servers 250 receive encoded
media in each of the encoding formats from the encoders 230 and create a
manifest describing the encoded media. The ingest servers 250 may create and
store the media chunks described herein or may create the chunks on the fly as
they are requested. The ingest servers 250 can receive data pushed from the
encoders 230, such as via an HTTP POST, or can pull data by requesting it from
the encoders 230. The encoders 230 and ingest servers 250 may be connected in a
variety of redundant configurations. For example, each encoder may send encoded
media data to each of the ingest servers 250, or only to one ingest server
until a failure occurs. The origin servers 260 are the servers that respond to
client requests for media chunks. The origin servers 260 may also be configured
in a variety of redundant configurations.
[0031] The external network 270 includes edge servers 280 and other Internet
(or other network) infrastructure and clients 290. When a client makes a
request for a media chunk, the client addresses the request to the origin
servers 260. Because of the design of network caching, if one of the edge
servers 280 contains the data, then that edge server may respond to the client
without passing along the request. However, if the data is not available at the
edge server, then the edge server forwards the request to one of the origin
servers 260. Likewise, if one of the origin servers 260 receives a request for
data that is not available, the origin server may request the data from one of
the ingest servers 250.
[0032] Figure 3 is a flow diagram that illustrates the processing of the
adaptive streaming system on a client to play back media, in one embodiment.
Beginning in block 310, the system selects an initial encoding at which to
request encoded media from the server. For example, the system may initially
select the lowest available bit rate. The system may have previously sent a
request to the server to discover the available bit rates and other available
encodings. Continuing in block 320, the system requests and plays a particular
chunk of the media, as described further with reference to Figure 4. Continuing
in block 330, the system determines a quality of service metric based on the
requested chunk. For example, the chunk may include metadata for as many
additional chunks as the server is currently storing, which the client can use
to determine how fast the client is requesting chunks relative to how fast the
server is producing chunks. This process is described in further detail herein.
[0033] Continuing in decision block 340, if the system determines that the
current QoS metric is too low and the client connection to the server cannot
handle the current encoding, then the system continues at block 350, else the
system loops to block 320 to handle the next chunk. Continuing in block 350,
the system selects a different encoding of the media, wherein the system
selects a different encoding by requesting data from a different URL for
subsequent chunks from the server. For example, the system may select an
encoding that consumes half the bandwidth of the current encoding. Likewise,
the system may determine that the QoS metric indicates that the client can
handle a higher bit rate encoding, and the client may request a higher bit rate
for subsequent chunks. In this way, the client adjusts the bit rate up and down
based on current conditions.
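The loop of blocks 310 through 350 can be approximated by the following sketch,
which is offered as one hypothetical reading and not as the claimed method. The
QoS metric (chunk duration divided by download time), the thresholds 1.0 and
2.0, the one-step-at-a-time switching, and the simulated `fetch` function are
all assumptions; a real client would issue HTTP requests and play the media.

```python
from typing import Callable, List, Sequence


def play(bit_rates: Sequence[int], chunk_count: int,
         fetch: Callable[[int, int], float],
         chunk_s: float = 2.0) -> List[int]:
    """Return the bit rate used for each chunk.

    fetch(bit_rate, index) stands in for requesting and playing one
    chunk and returns the download time in seconds.
    """
    rates = sorted(bit_rates)
    level = 0                                   # block 310: lowest bit rate
    used: List[int] = []
    for index in range(chunk_count):
        download_s = fetch(rates[level], index)  # block 320
        used.append(rates[level])
        qos = chunk_s / download_s               # block 330: >1 beats real time
        if qos < 1.0 and level > 0:              # blocks 340/350: step down
            level -= 1
        elif qos > 2.0 and level < len(rates) - 1:
            level += 1                           # headroom: step up
    return used


def simulated_fetch(bit_rate: int, index: int) -> float:
    """Pretend the link drops from 2 Mbit/s to 0.5 Mbit/s at chunk 3."""
    bandwidth_bps = 2_000_000 if index < 3 else 500_000
    return bit_rate * 2.0 / bandwidth_bps


history = play([300_000, 700_000, 1_500_000], 6, simulated_fetch)
print(history)
# [300000, 700000, 1500000, 1500000, 700000, 300000]
```

In this run the client climbs while bandwidth is plentiful and steps back down
after the simulated drop, each time simply by requesting a different URL.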
[0034] Although Figure 3 illustrates the QoS determination as occurring after
each chunk, those of ordinary skill in the art will recognize that other QoS
implementations are common, such as waiting a fixed number of packets or chunks
(e.g., every 10th packet) to make a QoS determination. After block 350, the
system loops to block 320