Accurate, Scalable In-Network Identification of P2P Traffic
Using Application Signatures
`
Subhabrata Sen
AT&T Labs-Research
Florham Park, NJ 07932
sen@research.att.com

Oliver Spatscheck
AT&T Labs-Research
Florham Park, NJ 07932
spatsch@research.att.com

Dongmei Wang
AT&T Labs-Research
Florham Park, NJ 07932
mei@research.att.com
`
ABSTRACT
The ability to accurately identify the network traffic associated with
different P2P applications is important to a broad range of network
operations, including application-specific traffic engineering, capacity
planning, provisioning, and service differentiation. However, traditional
techniques for mapping traffic to higher-level applications, such as
disambiguation based on default server TCP or UDP network ports, are
highly inaccurate for some P2P applications.
In this paper, we provide an efficient approach for identifying P2P
application traffic through application-level signatures. We first
identify the application-level signatures by examining available
documentation and packet-level traces. We then use the identified
signatures to develop online filters that can efficiently and accurately
track P2P traffic even on high-speed network links.
We examine the performance of our application-level identification
approach using five popular P2P protocols. Our measurements show that our
technique achieves less than [...] false positive and false negative
ratios in most cases. We also show that our approach requires examining
only the very first few packets (less than [...] packets) to identify a
P2P connection, which makes our approach highly scalable. Our technique
can significantly improve P2P traffic volume estimates over what pure
network-port-based approaches provide. For instance, we were able to
identify [...] times as much traffic for the popular Kazaa P2P protocol,
compared to the traditional port-based approach.
`
Categories and Subject Descriptors
C.2.3 [Computer-Communication Networks]: Network operations - Network
management, Network monitoring; D.2.8 [Software Engineering]: Metrics -
Performance measures

General Terms
Measurement, Performance, Design

Keywords
Traffic Analysis, P2P, Application-level Signatures, Online Application
Classification
`
Copyright is held by the author/owner(s).
WWW2004, May 17-22, 2004, New York, New York, USA.
ACM 1-58113-844-X/04/0005.

1. INTRODUCTION
Peer-to-peer (P2P) file sharing applications have grown dramatically in
popularity over the past few years, and today constitute a significant
share of the total traffic in many networks. These applications have
proliferated in variety and have become increasingly sophisticated along
a number of dimensions, including increased scalability, more
functionality, better search capabilities, and shorter download times. In
particular, the newer generation of P2P applications incorporates various
strategies to avoid detection.
Access networks as well as enterprise networks require the ability to
accurately identify the different P2P applications and their associated
network traffic for a range of uses, including network operations and
management, application-specific traffic engineering, capacity planning,
provisioning, service differentiation, and cost reduction. For example,
enterprises would like to provide a degraded service (via rate-limiting,
service differentiation, or blocking) to P2P traffic to ensure good
performance for enterprise-critical applications, and/or to enforce
corporate rules governing the use of peer-to-peer applications. Broadband
ISPs would like to limit P2P traffic in order to limit the cost they are
charged by upstream ISPs. All of these require the capability to
accurately identify P2P network traffic.
Application identification inside IP networks can, in general, be
difficult. In an ideal situation, a network administrator would possess
precise information on the applications running inside the network, along
with unambiguous mappings between each application and its network
traffic (e.g., the port numbers used, the IP addresses sourcing and
receiving the particular application data, etc.). However, such
information is rarely available, up-to-date, or complete, and identifying
either the applications or their associated traffic is a challenging
proposition. In addition, traditional techniques like network port-based
classification of applications have now become problematic. Although the
earlier P2P systems mostly used default network ports for communication,
we have found that substantial P2P traffic nowadays is transmitted over a
large number of non-standard ports, making default port-based
classification less accurate.
In this paper, we report on our exploration of online, in-network P2P
application detection based on application signatures. The following are
some key requirements for such an application-level filter: it must be
accurate, have low overhead, and be robust to effects like packet losses
and asymmetric routing (details in Sections 2 and 3) that make it
difficult or impossible for a monitoring point to observe all the
application-level data in a connection flowing by.
We designed a real-time classification system that operates on individual
packets in the middle of the network, and developed application-level
signatures for a number of popular P2P applications. Our signatures can
be used directly to monitor and filter P2P traffic.
Evaluations using large packet traces at different Internet locations
show that the individual signature-based classification (i) has good
accuracy properties (low false positives and negatives), even in
situations where not all packets in a connection are observed by the
monitoring point, (ii) can scale to handle large traffic volumes on the
order of several Gbps (gigabits per second), and (iii) can significantly
improve P2P traffic volume estimates over what pure network-port-based
approaches provide. Our filter has been successfully deployed and is
currently running at multiple network monitoring locations.

Splunk Inc. Exhibit 1022 Page 1

A lot of existing research on P2P traffic characterization has only
considered traffic on default network ports (e.g., [11, 18, 17]). A
recent work [12] uses application signatures to characterize the workload
of Kazaa downloads, but it does not provide any evaluation of the
accuracy, scalability, or robustness of its signatures. Signature-based
traffic classification has mainly been performed in the context of
network security, such as intrusion and anomaly detection (e.g., [5, 4,
19, 14]), where one typically seeks to find a signature for an attack. In
contrast, our approach identifies P2P traffic for network planning and
research purposes. This work is therefore more closely related to [8],
which provides a set of heuristics and signatures to identify Internet
chat traffic. There is also a large body of literature on extracting
information from packet traces (e.g., [9]); however, none of these works
provides and evaluates application-layer P2P signatures.
The remainder of this paper is organized as follows. Section 2 highlights
the issues involved in identifying P2P traffic in real time inside the
network. Section 3 discusses some of the design choices we made in our
approach. Section 4 derives the actual signatures used for P2P detection,
and Section 5 describes our implementation of an online P2P application
classifier using these signatures. Section 6 presents the evaluation
setting, and Section 7 describes the evaluation results. Finally,
Section 8 concludes the paper.
`
2. PROBLEM STATEMENT

We first outline some key requirements of any mapping technique for
identifying traffic on high-speed links inside the network.

Accuracy: The technique should have low false positives (identifying
other traffic as peer-to-peer) and low false negatives (missing
peer-to-peer traffic).

Scalability: The technique must be able to process large traffic volumes
on the order of several hundred thousand to several million connections
at a time, with good accuracy, and yet not be computationally expensive.

Robustness: Traffic measurement in the middle of the network has to deal
with the effects of asymmetric routing (the two directions of a
connection follow different paths), packet losses, and reordering.
`
The above requirements indicate that there are tradeoffs in terms of the
level of accuracy, scalability, and robustness that can be achieved. At
one end of this spectrum is the current practice of TCP/UDP
port-number-based application identification, which uses known TCP/UDP
port numbers to identify traffic flows in the network. It is highly
scalable, since only the TCP/UDP port numbers have to be recorded to
identify an application. It is also highly robust, since a single packet
is sufficient to make an application identification.
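
To make the baseline concrete, port-number-based identification amounts to a table lookup on a single packet. The following is a minimal Python sketch under stated assumptions: the port table holds commonly cited default ports (of these, only 411 and 412 are mentioned in this paper), and `classify_by_port` is a hypothetical helper name, not part of any system described here.

```python
# Assumed default ports, for illustration only; real deployments vary.
DEFAULT_P2P_PORTS = {
    1214: "Kazaa",
    6346: "Gnutella",
    6347: "Gnutella",
    4662: "eDonkey",
    411: "DirectConnect",
    412: "DirectConnect",
    6881: "BitTorrent",
}


def classify_by_port(src_port, dst_port):
    """Return the application whose default port matches, else None.

    Only the two port numbers of one packet are needed, which is why
    this method is cheap and robust, but it misses any traffic that
    runs on a non-default port.
    """
    for port in (dst_port, src_port):
        app = DEFAULT_P2P_PORTS.get(port)
        if app is not None:
            return app
    return None
```

A connection that moves to an arbitrary port (for example 8080) returns `None` here, which is exactly the false-negative case that motivates payload signatures.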
Unfortunately, port-number-based application identification is becoming
increasingly inaccurate in identifying P2P traffic. For example, we
observed in our traffic traces that a large amount of Kazaa traffic does
not use the default Kazaa port numbers, most likely (we speculate) to
avoid detection.
To address this problem, we developed and evaluated a set of
application-layer signatures to improve the accuracy of P2P traffic
detection. In particular, this approach tries to determine common
signatures in the TCP/UDP payload of P2P applications.
A key challenge in realizing such signatures is the lack of openly
available, reliable, complete, up-to-date, and standard protocol
specifications. This is partly due to developmental history and partly a
result of whether the protocols are open or proprietary. First, the
protocols are mostly not standardized and they are evolving. For some
protocols (e.g., Gnutella), there exists some documentation, but it is
not complete or up-to-date. In addition, there are various
implementations of Gnutella clients that do not comply with the
specifications in the available documentation, raising potential
interoperability issues. For a user, this manifests itself in the form of
sometimes poor search performance. For an application classifier to be
accurate, it is important to identify signatures that span all the
variants, or at least the dominantly used ones. At the other end of the
spectrum is a protocol like Kazaa, which is developed by a single
organization and therefore exhibits a more homogeneous protocol
deployment, but is a proprietary protocol with no authoritative protocol
description openly available. Finally, access to the protocol
specification alone is not sufficient: we need signatures that conform to
the design choices described in Section 3.
Our approach to signature identification has involved combining
information from available documentation with information gleaned from
analysis of packet-level traces to develop potential signatures. Multiple
iterations were used to evaluate the signatures against network traffic
data to improve the accuracy and computation overhead.
`
3. DESIGN CHOICES

Our main goal is to derive application-layer signatures for P2P protocols
that achieve high accuracy and robustness while being applicable at least
at Gigabit Ethernet speeds in real time. As we will discuss in Section 7,
we achieved these goals by making the following high-level design
choices.

UDP versus TCP: P2P traffic can, in principle, flow over both UDP and
TCP. Since most P2P protocols currently transmit their data via TCP, we
focus on signatures found within TCP-based P2P traffic. Our signatures
could be extended to UDP if so desired.
`
Packets versus Streams: The P2P application-layer signatures can be
applied to individual TCP segments or to fully reassembled TCP connection
data streams. The advantage of applying them to TCP data streams is that
duplicate data has been removed and that signatures can match data which
is transmitted in multiple TCP segments. The drawback is that the TCP
segments have to be reassembled in real time on the monitoring device. In
our current design we chose to apply the signatures to individual TCP
segments, which allows us to achieve higher speeds. We therefore focus on
developing signatures that do not span multiple TCP packet boundaries. As
we will demonstrate, we still achieve high accuracy for the applications
with the signatures that we develop.

Location of Signature: Again to improve performance, we focus on finding
signatures that appear at the beginning of file downloads. This approach
allows us to focus our signature evaluation on the first few packets of a
TCP connection. We will study how many packets our signatures require in
Section 7.
`
Robustness to network effects: We also aim to develop signatures that can
independently identify each direction of an application-level
communication. This enhances the potential of identifying connections for
which the filter does not observe one direction of the traffic (due to
asymmetric network routing), or misses some signature-carrying packets in
one or both directions (caused by either router-based load splitting [16]
or other routing instabilities). Independent identification of each
direction also serves to decrease the potential of misclassification, by
either reinforcing the marking (if both directions identify the same
application) or flagging a potential discord (if the two directions are
identified with different applications). Note that for some usages, such
as accounting for total P2P traffic, where it is more important to
identify that some P2P communication is being used than to identify the
specific application, the last potential (of multiple classifications of
the directions) is not an issue.
`
Early Discard: For efficiency reasons, we consider both signatures that
identify an application and those that indicate that a connection does
not belong to an application. The latter category of signatures allows us
to quickly identify packets that are not likely application packets, and
thereby frees up resources for examining more promising candidates.

Signaling versus Transport: Since the bulk of P2P traffic is related to
file downloads and not to file searches (signaling), we chose to
concentrate our efforts on identifying signatures for file downloads
rather than the signaling part of P2P protocols.
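
The "Robustness to network effects" choice above implies a small reconciliation step once each direction of a connection has been labeled on its own. The following Python sketch shows one way this could look; the function name `reconcile_directions` and the status strings are hypothetical, introduced only for illustration, and are not taken from the described system.

```python
def reconcile_directions(forward_label, reverse_label):
    """Combine independent per-direction labels into (application, status).

    Each label is an application name, or None if that direction was
    not observed or matched no signature.
    """
    if forward_label is None and reverse_label is None:
        return (None, "unclassified")
    if forward_label is None or reverse_label is None:
        # Only one direction was identified, e.g. under asymmetric routing.
        return (forward_label or reverse_label, "one-direction")
    if forward_label == reverse_label:
        # Both directions agree, which reinforces the marking.
        return (forward_label, "reinforced")
    # The two directions disagree: flag a potential discord.
    return (None, "discord")


def is_p2p(forward_label, reverse_label):
    """For total P2P accounting, any identified direction is enough."""
    return forward_label is not None or reverse_label is not None
```

For total-volume accounting, `is_p2p` ignores the discord case entirely, matching the observation that multiple classifications are not an issue for that usage.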
`
4. P2P PROTOCOLS AND SIGNATURES

Historically, in the client/server model, content is stored on the server
and all clients download content from the server. One drawback of this
model is that if the server is overloaded, it becomes the bottleneck. The
P2P file sharing model addresses this problem by allowing peers to
exchange content directly. To perform these file sharing tasks, all
popular P2P protocols allow a random host to act as both a client and a
server to its peers, even though some P2P protocols do not treat all
hosts equally.
Typically the following two phases are involved if a requester desires to
download content:

Signaling: During the signaling phase a client searches for the content
and determines which peers are able and willing to provide the desired
content. In many protocols this does not involve any direct communication
with the peer which will eventually provide the content.

Download: In this phase the requester contacts one or multiple peers
directly to download the desired content.

In addition to the two phases described above, many P2P protocols also
exchange keep-alive messages or synchronize the server lists between
servers.
In the remainder of the paper we focus on the download phase of the five
most popular P2P protocols (Kazaa, Gnutella, eDonkey, DirectConnect, and
BitTorrent). We decided to track only the download phase since it allows
us to capture the majority of P2P traffic. We also classify only the
first download in a TCP connection. This simplification is reasonable
since it is highly unlikely that two different applications will share a
single TCP connection. In the remainder of this section we discuss the
signatures we discovered for these five protocols. Unless otherwise
specified, all the identified signatures are case insensitive.
4.1 Gnutella protocol
Gnutella is a completely distributed protocol. In a Gnutella network,
every client is a server and vice versa. Therefore the client and server
are implemented in a single system, called a servent. A servent connects
to the Gnutella network by establishing a TCP connection to another
servent on the network. Once a servent has connected successfully to the
network, it communicates with other servents using Gnutella protocol
descriptors for searching the network; this is the signaling phase of the
protocol. The actual file download is achieved using an HTTP-like
protocol between the requesting servent and a servent possessing the
requested file.
To develop the Gnutella signature we inspected multiple Gnutella
connections and observed that the request message for Gnutella TCP
connection creation has the following format:

    GNUTELLA CONNECT/<protocol version string>\n\n

The response message for Gnutella TCP connection creation has the
following format:

    GNUTELLA OK\n\n
`
We also observed that there is an initial request-response handshake
within each content download. In the download request the servent uses
the following HTTP request headers:

    GET /get/<File Index>/<File Name>/ HTTP/1.0\r\n
    Connection: Keep-Alive\r\n
    Range: bytes=0-\r\n
    User-Agent: <Name>\r\n
    \r\n

The reply message contains the following HTTP response headers:

    HTTP 200 OK\r\n
    Server: <Name>\r\n
    Content-type: \r\n
    Content-length: \r\n
    \r\n
`
Based on these observations and on performance considerations, we
recommend the following signatures for identifying Gnutella data
downloads:

- The first string following the TCP/IP header is ‘GNUTELLA’, ‘GET’, or
  ‘HTTP’.

- If the first string is ‘GET’ or ‘HTTP’, there must be a field with one
  of the following strings:

    User-Agent: <Name>
    UserAgent: <Name>
    Server: <Name>

  where <Name> is one of the following: LimeWire, BearShare, Gnucleus,
  MorpheusOS, XoloX, MorpheusPE, gtk-gnutella, Acquisition,
  Mutella-0.4.1, MyNapster, Mutella-0.4, Qtella, AquaLime, NapShare,
  Comeback, Go, PHEX, SwapNut, Mutella-0.4.0, Shareaza, Mutella-0.3.9b,
  Morpheus, FreeWire, Openext, Mutella-0.3.3, Phex.
`
Generally it is much cheaper to match a string at a fixed offset than a
string at varying locations. Hence we include ‘GET’ and ‘HTTP’ here to
help discard early those packets that neither start with ‘GNUTELLA’ nor
are HTTP packets. For robustness, we include signatures for both the
request and the response header. This way, we can identify Gnutella
traffic even if we only see one direction of the traffic.
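
The two-step Gnutella check above can be sketched in Python as follows. This is an illustrative sketch, not the implementation evaluated in this paper: `is_gnutella` is a hypothetical name, and `GNUTELLA_AGENTS` is an assumed subset of the servent names listed above (the short name "Go" is left out here because a plain prefix test on it would also match unrelated agents).

```python
# Assumed subset of the servent names from the list above, lowercased
# because the signatures are case insensitive.
GNUTELLA_AGENTS = (
    b"limewire", b"bearshare", b"gnucleus", b"morpheus", b"xolox",
    b"gtk-gnutella", b"acquisition", b"mutella", b"mynapster", b"qtella",
    b"aqualime", b"napshare", b"phex", b"swapnut", b"shareaza",
    b"freewire", b"openext",
)
AGENT_FIELDS = (b"user-agent:", b"useragent:", b"server:")


def is_gnutella(payload: bytes) -> bool:
    """Apply the fixed-offset test first, then the header-field test."""
    data = payload.lower()
    if data.startswith(b"gnutella"):
        return True
    if not data.startswith((b"get", b"http")):
        return False  # early discard: not an HTTP-like packet
    for line in data.split(b"\r\n"):
        for field in AGENT_FIELDS:
            if line.startswith(field):
                value = line[len(field):].strip()
                if value.startswith(GNUTELLA_AGENTS):
                    return True
    return False
```

Because request (`User-Agent`) and response (`Server`) headers are both checked, either direction of a download can be identified on its own.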
4.2 eDonkey protocol
An eDonkey network consists of clients and servers. Each client is
connected to one main server via TCP. During the signaling phase, a
client first sends the search request to its main server. (Optionally,
the client can send the search request directly to other servers via UDP;
this is referred to as extended search in eDonkey.) To subsequently
download a file from other clients, the client establishes connections to
the other clients directly via TCP, then asks each client for different
pieces of the file.
After examining eDonkey packets, we discovered that both signaling and
downloading TCP packets have the following common eDonkey header directly
following the TCP header:
`
 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8
+-+-+-+-+-+-+-+-+
|    Marker     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    packet Length (4 Bytes)                    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Message type  |
+-+-+-+-+-+-+-+-+
`
where the marker value is always 0xe3, the packet length is specified in
network byte order, and its value is the byte length of the content of
the eDonkey message, excluding the 1-byte marker and the 4-byte length
field.
Utilizing these discoveries, we recommend the following signatures for
identifying eDonkey packets. For TCP signaling or handshaking data
packets, we use two steps to identify eDonkey packets:
- The first byte after the IP+TCP header is the eDonkey marker.
- The number given by the next 4 bytes is equal to the size of the entire
  packet after excluding both the IP+TCP header bytes and 5 extra bytes.
`
Since the accuracy of identifying P2P connections increases with the
length of the signatures, we tend to include as many fields as we can, so
long as they do not increase the computational complexity significantly.
Here both the marker and the length fields have a fixed offset; therefore
the computational complexity is the same (O(1)) for matching one of them
or both, but the accuracy is improved by [...] times compared with
matching the marker field alone.
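
The two fixed-offset eDonkey checks can be sketched in a few lines of Python. This is a hedged illustration: `is_edonkey` and its `byte_order` parameter are hypothetical names, and the default follows the network (big-endian) byte order stated above; the parameter exists so that a different byte order can be tried against real traces.

```python
import struct

EDONKEY_MARKER = 0xE3


def is_edonkey(payload: bytes, byte_order: str = ">") -> bool:
    """Check marker byte and length field of an eDonkey TCP payload.

    byte_order is a struct prefix: ">" (network order, the default
    assumed here) or "<" (little endian).
    """
    if len(payload) < 5 or payload[0] != EDONKEY_MARKER:
        return False
    (length,) = struct.unpack(byte_order + "I", payload[1:5])
    # The length field counts everything after the 1-byte marker and
    # the 4-byte length field itself.
    return length == len(payload) - 5
```

Both tests read bytes at fixed positions, so the cost per packet is constant regardless of payload size.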
We have also identified the signatures for UDP handshaking messages.
However, since UDP is only used for extended searching, and is rare
compared with TCP communications, we do not report it in this study.
4.3 DirectConnect Protocol
The DirectConnect network is composed of hubs, clients, and a single
superhub with multiple servers. All of them listen on TCP port 411 to
connect and exchange commands such as search requests. Clients (peers)
store files and respond to search requests for those files. The single
superhub acts as a name service for all the hubs. All hubs register with
the superhub, and clients discover hubs by asking the superhub. Each of
the clients has a username (a.k.a. nick). Normally the clients listen on
port 412 for client connections. If port 412 is already in use, clients
will use ports 413, 414, and so on. DirectConnect uses TCP for client to
server and client to client communication, while UDP is used for
communication between servers. The TCP/UDP data is a series of commands
or a public chat message. In this study, we focus on the TCP commands.
The TCP commands have the following form:
`
    $command_type field1 field2 ...|

which starts with the character ‘$’ and ends with the character ‘|’. The
list of valid command types for TCP communications is: MyNick, Lock, Key,
Direction, GetListLen, ListLen, MaxedOut, Error, Send, Get, FileLength,
Canceled, HubName, ValidateNick, ValidateDenide, GetPass, Mypass,
BadPass, Version, Hello, Logedin, MyINFO, GetINFO, GetNickList, NickList,
OpList, To, ConnectToMe, MultiConnectToMe, RevConnectToMe, Search,
MultiSearch, SR, Kick, OpForceMove, ForceMove, Quit.
To improve the evaluation performance, we evaluate this signature in the
following two steps:

1. The first byte after the IP+TCP header is ‘$’, and the last byte of
   the packet is ‘|’.

2. Following the ‘$’, the string terminated by a space is one of the
   valid TCP commands listed above.

Although we are matching against a list of strings, which can be an
expensive operation, we only perform the string match on packets that
pass the first test.
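
A minimal Python sketch of this two-step DirectConnect test follows. The name `is_directconnect` is hypothetical; the command names are copied as listed above (including their original spellings), and the comparison is done case-insensitively in line with the convention stated in Section 4.

```python
# Command names exactly as listed in the text, compared case-insensitively.
DC_COMMANDS = frozenset(name.lower() for name in (
    "MyNick", "Lock", "Key", "Direction", "GetListLen", "ListLen",
    "MaxedOut", "Error", "Send", "Get", "FileLength", "Canceled",
    "HubName", "ValidateNick", "ValidateDenide", "GetPass", "Mypass",
    "BadPass", "Version", "Hello", "Logedin", "MyINFO", "GetINFO",
    "GetNickList", "NickList", "OpList", "To", "ConnectToMe",
    "MultiConnectToMe", "RevConnectToMe", "Search", "MultiSearch", "SR",
    "Kick", "OpForceMove", "ForceMove", "Quit",
))


def is_directconnect(payload: bytes) -> bool:
    # Step 1 (cheap, fixed offsets): first byte '$', last byte '|'.
    if len(payload) < 2 or payload[:1] != b"$" or payload[-1:] != b"|":
        return False
    # Step 2 (more expensive): the string after '$' up to the first
    # space must be a known command name.
    command = payload[1:].split(b" ", 1)[0]
    return command.decode("ascii", "ignore").lower() in DC_COMMANDS
```

Only packets that pass the two single-byte comparisons reach the set lookup, so most non-DirectConnect packets are rejected after two byte reads.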
4.4 BitTorrent Protocol
The BitTorrent network consists of clients and a centralized server.
Clients connect to each other directly to send and receive portions of a
single file. The central server (called a tracker) only coordinates the
actions of the clients and manages connections. Unlike in the protocols
discussed above, the BitTorrent server is not responsible for locating
the files that clients search for; instead, the BitTorrent client locates
a torrent file through the Web and initiates the download by clicking on
the hyperlink. Hence there is no signaling communication for searching in
the BitTorrent network. To identify BitTorrent traffic, we focus on the
downloading data packets between clients only, since the communication
between the client and server is negligible.
The communication between the clients starts with a handshake followed by
a never-ending stream of length-prefixed messages. We discovered that the
BitTorrent header of the handshake messages has the following format:

    <a character (1 byte)><a string (19 bytes)>

The first byte is a fixed character with value 19, and the string value
is ‘BitTorrent protocol’. Based on this common header, we use the
following signatures for identifying BitTorrent traffic:
- The first byte in the TCP payload is the character 19 (0x13).
- The next 19 bytes match the string ‘BitTorrent protocol’.
The signatures identified here are 20 bytes long with fixed locations;
therefore they are very accurate and cost-effective.
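
Since the whole BitTorrent signature sits at a fixed location, it reduces to one 20-byte comparison. The following Python sketch illustrates this; `is_bittorrent` is a hypothetical name, and the string comparison is made case insensitive to follow the convention stated at the start of Section 4.

```python
BT_LENGTH_BYTE = b"\x13"            # the character 19
BT_PROTOCOL = b"bittorrent protocol"  # 19 bytes, lowercased


def is_bittorrent(payload: bytes) -> bool:
    """Match the 20-byte handshake header at the start of the payload."""
    return (
        payload[:1] == BT_LENGTH_BYTE
        and payload[1:20].lower() == BT_PROTOCOL
    )
```

Payloads shorter than 20 bytes fail the second comparison automatically, so no explicit length check is needed.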
`
4.5 Kazaa protocol
The Kazaa network is a distributed, self-organized network. In a Kazaa
network, clients with powerful connections and fast computers are
automatically selected as Supernodes. Supernodes are local search hubs.
Normal clients connect to their neighboring Supernodes to upload
information about the files that they share, and to perform searches. In
turn, Supernodes query each other to fulfill the search.
The request message in a Kazaa download contains the following HTTP
request headers:

    GET /.files HTTP/1.1\r\n
    Host: IP address/port\r\n
    UserAgent: KazaaClient\r\n
    X-Kazaa-Username: \r\n
    X-Kazaa-Network: KaZaA\r\n
    X-Kazaa-IP: \r\n
    X-Kazaa-SupernodeIP: \r\n

The Kazaa response contains the following HTTP response headers:

    HTTP/1.1 200 OK\r\n
    Content-Length: \r\n
    Server: KazaaClient\r\n
    X-Kazaa-Username: \r\n
    X-Kazaa-Network: \r\n
    X-Kazaa-IP: \r\n
    X-Kazaa-SupernodeIP: \r\n
    Content-Type: \r\n

For newer Kazaa versions (v1.5 or higher), a peer may send a short
encrypted message before it sends back the above response. Note that both
messages include a field called X-Kazaa-SupernodeIP. This field specifies
the IP address of the supernode to which the peer is connected, including
the TCP/UDP supernode service port. This information could be used to
identify signaling using flow records of all communication.
Using the special HTTP headers found in the Kazaa data download, we
recommend the following two steps to identify Kazaa downloads:
1. The string following the TCP/IP header is one of the following: ‘GET’
   or ‘HTTP’.
2. There must be a field with the string: X-Kazaa.
Similar to our Gnutella signatures, we include ‘GET’ and ‘HTTP’ to
discard non-HTTP packets early, so that we can avoid searching through
the whole packet to match ‘X-Kazaa’ if the packet has a low probability
of containing HTTP request or response headers.
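
The two Kazaa steps can be sketched in Python as shown below. `is_kazaa` is a hypothetical name; interpreting "a field with the string X-Kazaa" as a header line that begins with that string is an assumption of this sketch.

```python
def is_kazaa(payload: bytes) -> bool:
    """Fixed-offset HTTP check first, then a search for an X-Kazaa field."""
    data = payload.lower()  # signatures are case insensitive
    # Step 1: early discard of packets that do not look like HTTP.
    if not data.startswith((b"get", b"http")):
        return False
    # Step 2: variable-offset search for a header line starting with
    # "X-Kazaa" (assumed to follow a CRLF line break).
    return b"\r\nx-kazaa" in data
```

The substring search in step 2 scans the whole payload, which is why it is only attempted after the cheap prefix test succeeds.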
`
5. SIGNATURE IMPLEMENTATION

As stated earlier, we concentrate on P2P application detection in TCP
traffic. In particular, we decomposed our P2P signatures into fixed
pattern matches at fixed offsets within a TCP payload and variable
pattern matches at variable offsets within a TCP payload. The fixed
offset operation can be implemented cheaply, whereas variable pattern
matches are substantially more expensive.
To be able to execute the decomposed signatures on real network traffic,
we implemented them in the context of the Gigascope [7] high-speed
traffic monitor. In this section we first discuss the issues involved in
evaluating fixed and variable offset signatures, and then discuss how we
implement them in the context of Gigascope.
`
5.1 Fixed Offset Match
Implementing a fixed pattern match at a fixed offset within a TCP payload
is straightforward. The worst-case complexity of this operation is the
size of the pattern matched. Despite this simplicity, it is useful to
provide multiple library functions that perform this operation using
slightly different parameters, to allow for the easy implementation of
diverse signatures. For example, in the context of P2P signatures the
offset could be specified from the beginning or the end of the TCP
payload, and the pattern matched could be a byte, a word in little-endian
byte order, a word in big-endian byte order, or a string. Therefore, we
implemented a library which provides the following functions:

byte_match_offset: returns true if a byte matches the byte in the TCP
payload at a given offset. If the offset is negative, it is calculated
from the end of the TCP payload.

word_match_offset: similar to byte_match_offset, except that a word is
compared. This function takes as an additional argument a flag indicating
the byte order of the data in the TCP payload.

string_match_offset: similar to byte_match_offset, except that a
fixed-length sequence of bytes (string) is compared.
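
The three library functions described above might look as follows in Python. This is a sketch under assumptions, not the library used with Gigascope: the argument order, the `big_endian` flag, the 4-byte default word size, and the helper `_start_index` are all hypothetical choices made for illustration.

```python
def _start_index(payload: bytes, offset: int, size: int):
    """Turn an offset (negative means from the end) into a start index.

    Returns None when the requested bytes fall outside the payload.
    """
    start = offset if offset >= 0 else len(payload) + offset
    if start < 0 or start + size > len(payload):
        return None
    return start


def byte_match_offset(payload: bytes, offset: int, value: int) -> bool:
    start = _start_index(payload, offset, 1)
    return start is not None and payload[start] == value


def word_match_offset(payload: bytes, offset: int, value: int,
                      big_endian: bool = True, size: int = 4) -> bool:
    # The word size is not stated in the text; 4 bytes is assumed here.
    start = _start_index(payload, offset, size)
    if start is None:
        return False
    order = "big" if big_endian else "little"
    return int.from_bytes(payload[start:start + size], order) == value


def string_match_offset(payload: bytes, offset: int, pattern: bytes) -> bool:
    start = _start_index(payload, offset, len(pattern))
    return (start is not None
            and payload[start:start + len(pattern)] == pattern)
```

With these helpers, a check such as "first byte is ‘$’ and last byte is ‘|’" becomes `byte_match_offset(p, 0, ord("$")) and byte_match_offset(p, -1, ord("|"))`.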
5.2 Variable Offset Match
There are multiple ways to implement matches at variable offsets in an
input stream that involve variable-length strings. As discussed in
Section 3, we decided to perform the matches on a per-packet basis,
gaining performance at the cost of not matching strings which span
multiple packets.
Using this approach, all variable matches we need to perform can be
expressed as a regular expression match over TCP payloads. For example,
the Gnutella data download signature can be expressed as:
`
    '^(Server:|User-Agent:)[ \t]*(LimeWire|
    BearShare|Gnucleus|Morpheus|XoloX|
    gtk-gnutella|Mutella|MyNapster|Qtella|
    AquaLime|NapShare|Comback|PHEX|SwapNut|
    FreeWire|Openext|Toadnode)'
`
Because it is expensive to perform full regular expression matches over
all TCP payloads, we exploit the fact that the required regular
expression matches are of a limited variety. In particular, all of the
signatures we need to evaluate can be expressed as
stringset1.*stringset2, where stringset1 and stringset2 each contain a
list of possible strings. This allows us to use the following algorithms
for our signatures:
- Standard regex (SR): This is the regular expression match function
  found in the standard C library on FreeBSD 4.7.
- AST regex (AR): Part of the AST library [10], this code is based on the
  Boyer-Moore string search algorithm [6], extended to handle alternation
  of fixed strings. To search for an m-character-long string in an
  n-character sequence, the Boyer-Moore algorithm has worst-case time
  complexity O(m + n), but often runs in O(n/m) time on natural-language
  text for small values of m.
- Karp-Rabin (KR): This is a probabilistic string matching technique [13]
  that compares the hash value of the pattern against the hash value of a
  substring of the given search text. The worst-case complexity of
  Karp-Rabin is O(mn), but in many situations it is O(m + n).
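
To illustrate the Karp-Rabin option and the stringset1.*stringset2 form, here is a small Python sketch. It is a textbook-style rolling-hash search written for this illustration, not the code evaluated in this paper; the function names, the hash base, and the modulus are assumptions, and ".*" is treated as "any bytes in between", including line breaks.

```python
def karp_rabin_find(text: bytes, pattern: bytes,
                    base: int = 256, modulus: int = 1_000_003) -> int:
    """Return the index of the first occurrence of pattern, or -1."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    if m > n:
        return -1
    high = pow(base, m - 1, modulus)  # weight of the leading byte
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + pattern[i]) % modulus
        t_hash = (t_hash * base + text[i]) % modulus
    for i in range(n - m + 1):
        # Compare bytes only when the hashes agree (rules out collisions).
        if p_hash == t_hash and text[i:i + m] == pattern:
            return i
        if i < n - m:
            # Roll the window: drop text[i], append text[i + m].
            t_hash = ((t_hash - text[i] * high) * base + text[i + m]) % modulus
    return -1


def match_two_stringsets(payload: bytes, first_set, second_set) -> bool:
    """True if a string from first_set is followed, anywhere later in the
    payload, by a string from second_set."""
    earliest_end = None
    for s in first_set:
        pos = karp_rabin_find(payload, s)
        if pos != -1:
            end = pos + len(s)
            if earliest_end is None or end < earliest_end:
                earliest_end = end
    if earliest_end is None:
        return False
    rest = payload[earliest_end:]
    return any(karp_rabin_find(rest, s) != -1 for s in second_set)
```

Using the earliest end position among all first-set matches leaves the largest possible remainder for the second set, so no valid stringset1.*stringset2 match is missed.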
`
5.3 Gigascope Based Implementation
Gigascope is a high-speed traffic monitor that can perform a variety of
traffic measurement tasks at speeds up to OC-48 (2x2.4 Gbps). To evaluate
our signature-based P2P classification, we included the libraries
described above in the Gigascope framework and wrote a set of Gigascope
configuration files based on our P2P signatures. In the Gigascope
framework these configuration files are translated into C code, which is
subsequently compiled. The resulting executable is used to perform the
network monitoring in real time. Gigascope automatically breaks complex
computation into multiple tasks, exploiting multiple processors if
available. In addition to the real-time P2P detection task, we also used
Gigascope to collect large datasets for our accuracy evaluation, as
discussed in Section 7.
When we configured our Gigascope instance, we utilized the fact that
fixed offset matches are substantially cheaper to execute than variable
offset matches. For example, to identify the DirectConnect protocol we
need to perform a regular expression match for:
`types|MyNick|Lock|Key|Direction|
`GetListLen|ListLen|MaxedOut|Error|
`Send|Get|FileLength|Canceled|HubName|
`ValidateNick|ValidateDenide|GetPass|
`MyP
