Accurate, Scalable In-Network Identification of P2P Traffic
Using Application Signatures

Subhabrata Sen
AT&T Labs-Research
Florham Park, NJ 07932

Oliver Spatscheck
AT&T Labs-Research
Florham Park, NJ 07932

Dongmei Wang
AT&T Labs-Research
Florham Park, NJ 07932
ABSTRACT
The ability to accurately identify the network traffic associated with different P2P applications is important to a broad range of network operations, including application-specific traffic engineering, capacity planning, provisioning, and service differentiation. However, traditional techniques for mapping traffic to higher-level applications, such as disambiguation based on default server TCP or UDP network ports, are highly inaccurate for some P2P applications.
In this paper, we provide an efficient approach for identifying P2P application traffic through application-level signatures. We first identify the application-level signatures by examining available documentation and packet-level traces. We then use the identified signatures to develop online filters that can efficiently and accurately track P2P traffic even on high-speed network links.
We examine the performance of our application-level identification approach using five popular P2P protocols. Our measurements show that our technique achieves less than [...] false positive and false negative ratios in most cases. We also show that our approach requires examining only the very first few packets (less than [...] packets) to identify a P2P connection, which makes it highly scalable. Our technique can significantly improve P2P traffic volume estimates over what pure network-port-based approaches provide. For instance, we were able to identify [...] as much traffic for the popular Kazaa P2P protocol, compared to the traditional port-based approach.
Categories and Subject Descriptors
C.2.3 [Computer-Communication Networks]: Network operations - Network management, Network monitoring; D.2.8 [Software Engineering]: Metrics - Performance measures
General Terms
Measurement, Performance, Design
Keywords
Traffic Analysis, P2P, Application-level Signatures, Online Application Classification
Copyright is held by the author/owner(s).
WWW2004, May 17-22, 2004, New York, New York, USA.
ACM 1-58113-844-X/04/0005.
Peer-to-peer (P2P) file sharing applications have grown dramatically in popularity over the past few years, and today constitute a significant share of the total traffic in many networks. These applications have proliferated in variety and have become increasingly sophisticated along a number of dimensions, including increased scalability, more functionality, better search capabilities and shorter download times. In particular, the newer generation of P2P applications is incorporating various strategies to avoid detection.
Access networks as well as enterprise networks require the ability to accurately identify the different P2P applications and their associated network traffic for a range of uses, including network operations and management, application-specific traffic engineering, capacity planning, provisioning, service differentiation and cost reduction. For example, enterprises would like to provide a degraded service (via rate-limiting, service differentiation, or blocking) to P2P traffic to ensure good performance for enterprise-critical applications, and/or to enforce corporate rules governing the use of peer-to-peer applications. Broadband ISPs would like to limit P2P traffic to reduce the cost they are charged by upstream ISPs. All of these require the capability to accurately identify P2P network traffic.
Application identification inside IP networks, in general, can be difficult. In an ideal situation, a network administrator would possess precise information on the applications running inside the network, along with unambiguous mappings between each application and its network traffic (e.g., by port numbers used, IP addresses sourcing and receiving the particular application data, etc.). However, such information is rarely available, up-to-date or complete, and identifying either the applications or their associated traffic is a challenging proposition. In addition, traditional techniques like network port-based classification of applications have now become problematic. Although the earlier P2P systems mostly used default network ports for communication, we have found that substantial P2P traffic nowadays is transmitted over a large number of non-standard ports, which undermines default port-based classification.
In this paper, we report on our exploration of online, in-network P2P application detection based on application signatures. The following are some key requirements for such an application-level filter. It must be accurate, have low overheads, and be robust to effects like packet losses and asymmetric routing (details in Sections 2 and 3) that make it difficult or impossible for a monitoring point to observe all the application-level data in a connection flowing by.
We designed a real-time classification system which operates on individual packets in the middle of the network, and developed application-level signatures for a number of popular P2P applications. Our signatures can be used directly to monitor and filter P2P traffic.
Evaluations using large packet traces at different Internet locations show that the individual signature-based classification (i) has good accuracy properties (low false positives and negatives), even in situations where not all packets in a connection are observed by the monitoring point, (ii) can scale to handle large traffic volumes on the order of several Gbps (gigabits per second), and (iii) can significantly improve P2P traffic volume estimates over what pure network-port-based approaches provide. Our filter has been successfully deployed and is currently running at multiple network monitoring locations.
Much of the existing research on P2P traffic characterization has only considered traffic on default network ports (e.g., [11, 18, 17]). A recent work [12] uses application signatures to characterize the workload of Kazaa downloads, but it does not evaluate the accuracy, scalability or robustness of its signature. Signature-based traffic classification has mainly been performed in the context of network security, such as intrusion and anomaly detection (e.g., [5, 4, 19, 14]), where one typically seeks to find a signature for an attack. In contrast, our approach identifies P2P traffic for network planning and research purposes. This work is therefore more closely related to [8], which provides a set of heuristics and signatures to identify Internet chat traffic. There is also a large body of literature on extracting information from packet traces (e.g., [9]); however, none of these works provides and evaluates application-layer P2P signatures.
The remainder of this paper is organized as follows. Section 2 highlights the issues involved in identifying P2P traffic in real time inside the network. Section 3 discusses some of the design choices we made in our approach. Section 4 derives the actual signatures used for P2P detection, and Section 5 describes our implementation of an online P2P application classifier using these signatures. Section 6 presents the evaluation setting, and Section 7 describes the evaluation results. Finally, Section 8 concludes the paper.
We first outline some key requirements of any mapping technique for identifying traffic on high-speed links inside the network.
Accuracy: The technique should have low false positives (identifying other traffic as peer-to-peer) and low false negatives (missing peer-to-peer traffic).
Scalability: The technique must be able to process large traffic volumes on the order of several hundred thousand to several million connections at a time, with good accuracy, and yet not be computationally expensive.
Robustness: Traffic measurement in the middle of the network has to deal with the effects of asymmetric routing (the two directions of a connection follow different paths), packet losses and re-
The above requirements indicate that there are tradeoffs in the level of accuracy, scalability and robustness that can be achieved. At one end of this spectrum is the current practice of TCP/UDP port-number-based application identification, which uses known TCP/UDP port numbers to identify traffic flows in the network. It is highly scalable, since only the UDP/TCP port numbers have to be recorded to identify an application. It is also highly robust, since a single packet is sufficient to make an application identification.
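As a point of reference, port-based identification amounts to a table lookup on a single packet's port numbers. The sketch below is illustrative only: the function name `classify_by_port` is hypothetical, ports 411 and 412 for DirectConnect are the ones mentioned later in the text, and the remaining entries are commonly cited default ports that are assumptions of this example rather than values taken from the paper.

```python
# Hypothetical default-port table. Only 411/412 (DirectConnect) appear in the
# text; the other entries are commonly cited defaults and are assumptions.
DEFAULT_PORTS = {
    1214: "Kazaa",
    6346: "Gnutella",
    6347: "Gnutella",
    4662: "eDonkey",
    411: "DirectConnect",
    412: "DirectConnect",
    6881: "BitTorrent",
}


def classify_by_port(src_port, dst_port):
    """Return an application name if either port is a known default, else None."""
    return DEFAULT_PORTS.get(dst_port) or DEFAULT_PORTS.get(src_port)
```

A single packet suffices for the lookup, which is why the approach is cheap and robust; any peer that moves to a non-default port simply falls through to `None`.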
Unfortunately, port-number-based application identification is becoming increasingly inaccurate in identifying P2P traffic. For example, we observed in our traffic traces that a large amount of Kazaa traffic does not use the default Kazaa port numbers, most likely (we speculate) to avoid detection.
To address this problem, we developed and evaluated a set of application-layer signatures to improve the accuracy of P2P traffic detection. In particular, this approach tries to determine common signatures in the TCP/UDP payload of P2P applications.
A key challenge in realizing such signatures is the lack of openly available, reliable, complete, up-to-date and standard protocol specifications. This is partly due to developmental history and partly a result of whether the protocols are open or proprietary. First, the protocols are mostly not standardized, and they are evolving. For some protocols (e.g., Gnutella), there exists some documentation, but it is not complete or up-to-date. In addition, there are various implementations of Gnutella clients which do not comply with the specifications in the available documentation, raising potential interoperability issues. For a user, this manifests itself as occasionally poor search performance. For an application classifier to be accurate, it is important to identify signatures that span all the variants, or at least the dominantly used ones. At the other end of the spectrum is a protocol like Kazaa, which is developed by a single organization and therefore exhibits a more homogeneous protocol deployment, but is a proprietary protocol with no authoritative protocol description openly available. Finally, access to the protocol specification alone is not sufficient: we need signatures that conform to the design decisions outlined above.
Our approach to signature identification has involved combining information from available documentation with information gleaned from analysis of packet-level traces to develop potential signatures. Multiple iterations were used to evaluate the signatures against network traffic data to improve the accuracy and computation overhead.
Our main goal is to derive application-layer signatures for P2P protocols which achieve high accuracy and robustness while being applicable in real time at Gigabit Ethernet speeds or higher. As we will discuss in Section 7, we achieved these goals by making the following high-level design choices.
UDP versus TCP: P2P traffic can in principle flow over UDP and TCP. Since most P2P protocols currently transmit their data via TCP, we focus on signatures found within TCP-based P2P traffic. Our signatures could be extended to UDP if so desired.
Packets versus Streams: The P2P application-layer signatures can be applied to individual TCP segments or to fully reassembled TCP connection data streams. The advantage of applying them to TCP data streams is that duplicate data has been removed and that signatures can match data which is transmitted in multiple TCP segments. The drawback is that the TCP segments have to be reassembled in real time on the monitoring device. In our current design we chose to apply the signatures to individual TCP segments, which allows us to achieve higher speeds. We therefore focus on developing signatures that do not span multiple TCP packet boundaries. As we will demonstrate, we still achieve high accuracy for the applications with the signatures that we develop.
Location of Signature: Again to improve performance, we focus on finding signatures which appear at the beginning of the file downloads. This allows us to focus our signature evaluation on the first few packets of a TCP connection. We will study how many packets our signatures require in Section 7.
Robustness to network effects: We also aim to develop signatures that can independently identify each direction of an application-level communication. This enhances the potential of identifying connections for which the filter does not observe one direction of the traffic (due to asymmetric network routing), or misses some signature-carrying packets in one or both directions (caused by either router-based load splitting [16] or other routing instabilities). Independent identification of each direction also serves to decrease the potential of misclassification, by either reinforcing the marking (if both directions identify the same application) or flagging a potential discord (if the two directions are identified with different applications). Note that for some uses, such as accounting for total P2P traffic, where it matters only that some P2P communication is taking place rather than which application is responsible, the latter case (the two directions receiving different classifications) is not an issue.
Early Discard: For efficiency reasons, we consider both signatures that identify an application and signatures that indicate that a connection does not belong to an application. The latter category of signatures allows us to quickly identify packets that are unlikely to be application packets, and thereby frees up resources for examining more promising candidates.
Signaling versus Transport: Since the bulk of P2P traffic is related to file downloads and not to file searches (signaling), we chose to concentrate our efforts on identifying signatures for file downloads rather than for signaling.
Historically, in the client/server model, content is stored on the server and all clients download content from the server. One drawback of this model is that an overloaded server becomes the bottleneck. The P2P file sharing model addresses this problem by allowing peers to exchange content directly. To perform these file sharing tasks, all popular P2P protocols allow an arbitrary host to act as both a client and a server to its peers, even though some P2P protocols do not treat all hosts equally.
Typically the following two phases are involved if a requester desires to download content:
Signaling: During the signaling phase a client searches for the content and determines which peers are able and willing to provide the desired content. In many protocols this does not involve any direct communication with the peer which will eventually provide the content.
Download: In this phase the requester contacts one or multiple peers directly to download the desired content.
In addition to the two phases described above, many P2P protocols also exchange keep-alive messages or synchronize the server lists between servers.
In the remainder of the paper we focus on the download phase of the five most popular P2P protocols (Kazaa, Gnutella, eDonkey, DirectConnect, and BitTorrent). We decided to track only the download phase since it allows us to capture the majority of P2P traffic. We will also classify only the first download in a TCP connection. This simplification is reasonable since it is highly unlikely that two different applications will share a single TCP connection. In the remainder of this section we discuss the signatures we discovered for these five protocols. Unless otherwise specified, all the identified signatures are case insensitive.
4.1 Gnutella protocol
Gnutella is a completely distributed protocol. In a Gnutella network, every client is a server and vice versa. The client and server are therefore implemented in a single system, called a servent. A servent connects to the Gnutella network by establishing a TCP connection to another servent on the network. Once a servent has connected successfully to the network, it communicates with other servents using Gnutella protocol descriptors for searching the network; this is the signaling phase of the protocol. The actual file download is achieved using an HTTP-like protocol between the requesting servent and a servent possessing the requested file.
To develop the Gnutella signature we inspected multiple Gnutella connections and observed that the request message for Gnutella TCP connection creation assumes the following format:
GNUTELLA CONNECT/<protocol version string>\n\n
And the response message for Gnutella TCP connection creation
We also observed that there is an initial request-response handshake within each content download. In the download request the servent uses the following HTTP request headers:
GET /get/<File Index>/<File Name>/ HTTP/1.0\r\n
Connection: Keep-Alive\r\n
Range: byte=0-\r\n
User-Agent: <Name>\r\n
The reply message contains the following HTTP response headers:
HTTP 200 OK\r\n
Server: <Name>\r\n
Content-type: \r\n
Content-length: \r\n
Based on these observations and performance considerations, we recommend the following signatures for identifying Gnutella data downloads:
- The first string following the TCP/IP header is ‘GNUTELLA’, ‘GET’, or ‘HTTP’.
- If the first string is ‘GET’ or ‘HTTP’, there must be a field with one of the following strings:
User-Agent: <Name>
UserAgent: <Name>
Server: <Name>
where <Name> is one of the following: LimeWire, BearShare, Gnucleus, MorpheusOS, XoloX, MorpheusPE, gtk-gnutella, Acquisition, Mutella-0.4.1, MyNapster, Mutella-0.4, Qtella, AquaLime, NapShare, Comeback, Go, PHEX, SwapNut, Mutella-0.4.0, Shareaza, Mutella-0.3.9b, Morpheus, FreeWire, Openext, Mutella-0.3.3,
Generally it is much cheaper to match a string at a fixed offset than a string at varying locations. Hence we include ‘GET’ and ‘HTTP’ here to help discard early those packets which neither start with ‘GNUTELLA’ nor are HTTP packets. For robustness, we included signatures for both the request and the response header. This way, we can identify Gnutella traffic even if we see only one direction of the traffic.
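The two-step Gnutella check can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `is_gnutella` is hypothetical, the payload is assumed to be the bytes of a single TCP segment, and `GNUTELLA_AGENTS` holds only a handful of the servent names listed above (the list in the text is itself truncated).

```python
import re

# Illustrative subset of the servent names listed in the text (assumption:
# a real deployment would carry the complete list).
GNUTELLA_AGENTS = [b"LimeWire", b"BearShare", b"Gnucleus", b"Shareaza", b"gtk-gnutella"]

# Variable-offset part: a User-Agent/UserAgent/Server field naming a servent.
_FIELD_RE = re.compile(
    rb"^(?:User-Agent|UserAgent|Server):[ \t]*(?:"
    + b"|".join(re.escape(a) for a in GNUTELLA_AGENTS)
    + rb")",
    re.IGNORECASE | re.MULTILINE,
)


def is_gnutella(payload: bytes) -> bool:
    """Apply the fixed-offset test first, then the costlier field search."""
    head = payload[:8].upper()  # signatures are case insensitive
    if head.startswith(b"GNUTELLA"):
        return True
    if not (head.startswith(b"GET") or head.startswith(b"HTTP")):
        return False  # early discard: not a connect message, not HTTP-like
    return _FIELD_RE.search(payload) is not None
```

Because the request and response headers are checked by the same function, either direction of a download can be classified on its own.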
4.2 eDonkey protocol
An eDonkey network consists of clients and servers. Each client is connected to one main server via TCP. During the signaling phase, it first sends the search request to its main server. (Optionally, the client can send the search request directly to other servers via UDP; this is referred to as extended search in eDonkey.) To subsequently download a file from other clients, the client establishes connections to the other clients directly via TCP, then asks each client for different pieces of the file.
After examining eDonkey packets, we discovered that both signaling and downloading TCP packets have the following common eDonkey header directly following the TCP header:
| Marker (1 byte) | packet Length (4 Bytes) | Message type |
where the marker value is always 0xe3 in hex, the packet length is specified in network byte order, and its value is the byte length of the content of the eDonkey message, excluding the 1-byte marker and the 4-byte length field.
Utilizing these discoveries, we recommend the following signatures for identifying eDonkey packets:
For TCP signaling or handshaking data packets, we use two steps to identify eDonkey packets.
- The first byte after the IP+TCP header is the eDonkey marker.
- The number given by the next 4 bytes is equal to the size of the entire packet after excluding both the IP+TCP header bytes and 5 extra bytes.
Since the accuracy of identifying P2P connections is proportional to the length of the signatures, we tend to include as many fields as we can, so long as they do not increase the computational complexity significantly. Here both the marker and length fields have a fixed offset, so the computational complexity is the same (O(1)) for matching one of them or both, but the accuracy is improved by [...] times compared with matching the marker field
We have also identified the signatures for UDP handshaking messages. However, since UDP is only used for extended searching, and is rare compared with TCP communications, we do not report it in this study.
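The two eDonkey steps (marker byte, then a length field that must agree with the observed payload size) can be sketched as below. The function name `is_edonkey` is hypothetical. The default byte order follows the text's statement that the length is in network byte order; it is exposed as a parameter so that this assumption is easy to change when testing against real traces.

```python
EDONKEY_MARKER = 0xE3  # marker value given in the text


def is_edonkey(payload: bytes, byteorder: str = "big") -> bool:
    """Check the marker byte and the 4-byte length field of a TCP payload.

    `payload` is assumed to be the segment with IP+TCP headers already removed.
    """
    if len(payload) < 5 or payload[0] != EDONKEY_MARKER:
        return False  # early discard on the single fixed-offset byte
    declared = int.from_bytes(payload[1:5], byteorder)
    # The declared length excludes the 1-byte marker and 4-byte length field.
    return declared == len(payload) - 5
```

Both comparisons are at fixed offsets, so adding the length check costs a constant amount of extra work per packet.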
4.3 DirectConnect Protocol
The DirectConnect network is composed of hubs, clients, and a single superhub with multiple servers. All of them listen on TCP port 411 to connect and exchange commands such as search requests. Clients (peers) store files and respond to search requests for those files. The single superhub acts as a name service for all the hubs. All hubs register with the superhub, and clients discover hubs by asking the superhub. Each client has a username (a.k.a. nick). Normally the clients listen on port 412 for client connections. If port 412 is already in use, clients will use ports 413, 414 and so on. DirectConnect uses TCP for client-to-server and client-to-client communication, while UDP is used for communication between servers. The TCP/UDP data is a series of commands or a public chat message. In this study, we focus on the TCP commands.
The TCP commands are identified by the following form:
$command_type field1 field2 ...|
which starts with the character ‘$’ and ends with the character ‘|’. The valid command types for TCP communications are: MyNick, Lock, Key, Direction, GetListLen, ListLen, MaxedOut, Error, Send, Get, FileLength, Canceled, HubName, ValidateNick, ValidateDenide, GetPass, Mypass, BadPass, Version, Hello, Logedin, MyINFO, GetINFO, GetNickList, NickList, OpList, To, ConnectToMe, MultiConnectToMe, RevConnectToMe, Search, MultiSearch, SR, Kick, OpForceMove, ForceMove, Quit.
To improve the evaluation performance we evaluate this signature in the following two steps:
1. The first byte after the IP+TCP header is ‘$’, and the last byte of the packet is ‘|’.
2. Following the ‘$’, the string terminated by a space is one of the valid TCP commands listed above.
Although matching against a list of strings can be an expensive operation, we perform the string match only on packets which pass the first test.
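The two DirectConnect steps can be sketched as follows. The function name `is_directconnect` is hypothetical and the payload is assumed to be one TCP segment without headers. The command set is the list given above, compared case-insensitively in line with the general rule stated for the signatures.

```python
# Valid TCP command types as listed in the text, stored lowercased as bytes.
DC_COMMANDS = frozenset(
    name.lower().encode("ascii")
    for name in (
        "MyNick Lock Key Direction GetListLen ListLen MaxedOut Error Send Get "
        "FileLength Canceled HubName ValidateNick ValidateDenide GetPass Mypass "
        "BadPass Version Hello Logedin MyINFO GetINFO GetNickList NickList OpList "
        "To ConnectToMe MultiConnectToMe RevConnectToMe Search MultiSearch SR "
        "Kick OpForceMove ForceMove Quit"
    ).split()
)


def is_directconnect(payload: bytes) -> bool:
    # Step 1: cheap fixed-offset checks on the first and last byte.
    if len(payload) < 3 or payload[:1] != b"$" or payload[-1:] != b"|":
        return False
    # Step 2: the string after '$', terminated by a space, must be a command.
    command = payload[1:].split(b" ", 1)[0].lower()
    return command in DC_COMMANDS
```

Only packets that pass the two single-byte comparisons reach the set lookup, which mirrors the ordering argued for in the text.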
4.4 BitTorrent Protocol
The BitTorrent network consists of clients and a centralized server. Clients connect to each other directly to send and receive portions of a single file. The central server (called a tracker) only coordinates the actions of the clients and manages connections. Unlike the protocols discussed above, the BitTorrent server is not responsible for locating the files that clients search for; instead, the BitTorrent client locates a torrent file through the Web and initiates the download by clicking on the hyperlink. Hence there is no signaling communication for searching in the BitTorrent network. To identify BitTorrent traffic, we focus only on the download data packets between clients, since the communication between the client and server is negligible.
The communication between the clients starts with a handshake followed by a never-ending stream of length-prefixed messages. We discovered that the BitTorrent header of the handshake messages assumes the following format:
<a character(1 byte)><a string(19 byte)>
The first byte is a fixed character with value 19 (0x13), and the string value is ‘BitTorrent protocol’. Based on this common header, we use the following signatures for identifying BitTorrent traffic:
- The first byte in the TCP payload is the character 19 (0x13).
- The next 19 bytes match the string ‘BitTorrent protocol’.
The signatures identified here are 20 bytes long with fixed locations; therefore they are very accurate and cost-effective.
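Since both parts of the BitTorrent signature sit at fixed offsets, the check reduces to one 20-byte prefix comparison. A minimal sketch, with the hypothetical name `is_bittorrent` and the assumption that the string is compared exactly as written:

```python
# Length byte 19 (0x13) followed by the 19-byte protocol string.
BT_HANDSHAKE = b"\x13BitTorrent protocol"


def is_bittorrent(payload: bytes) -> bool:
    """True if the TCP payload begins with the 20-byte handshake header."""
    return payload[:20] == BT_HANDSHAKE
```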
4.5 Kazaa protocol
The Kazaa network is a distributed, self-organized network. In a Kazaa network, clients with powerful connections and fast computers are automatically selected as Supernodes. Supernodes are local search hubs. Normal clients connect to their neighboring Supernodes to upload information about the files that they share, and to perform searches. In turn, Supernodes query each other to fulfill the search.
The request message in a Kazaa download contains the following HTTP request headers:
GET /.files HTTP/1.1\r\n
Host: IP address/port\r\n
UserAgent: KazaaClient\r\n
X-Kazaa-Username: \r\n
X-Kazaa-Network: KaZaA\r\n
X-Kazaa-IP: \r\n
X-Kazaa-SupernodeIP: \r\n
The Kazaa response contains the following HTTP response headers:
HTTP/1.1 200 OK\r\n
Content-Length: \r\n
Server: KazaaClient\r\n
X-Kazaa-Username: \r\n
X-Kazaa-Network: \r\n
X-Kazaa-IP: \r\n
X-Kazaa-SupernodeIP: \r\n
Content-Type: \r\n
For newer Kazaa versions (v1.5 or higher), a peer may send a short encrypted message before it sends back the above response. Note that both messages include a field called X-Kazaa-SupernodeIP. This field specifies the IP address of the supernode to which the peer is connected, including the TCP/UDP supernode service port. This information could be used to identify signaling using flow records of all communication.
Using the special HTTP headers found in the Kazaa data download, we recommend the following two steps to identify Kazaa downloads:
1. The string following the TCP/IP header is one of the following: ‘GET’ or ‘HTTP’.
2. There must be a field with the string: X-Kazaa.
As with our Gnutella signatures, we include ‘GET’ and ‘HTTP’ to discard non-HTTP packets early, so that we avoid searching through the whole packet for ‘X-Kazaa’ if the packet has a low probability of containing HTTP request or response headers.
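The two Kazaa steps follow the same pattern as the Gnutella check: a cheap prefix test, then a search at variable offset. A minimal sketch, with the hypothetical name `is_kazaa` and a plain substring search standing in for whatever matcher a real deployment would use:

```python
def is_kazaa(payload: bytes) -> bool:
    """Step 1: HTTP-like prefix at offset 0. Step 2: an X-Kazaa field anywhere."""
    head = payload[:4].upper()  # signatures are case insensitive
    if not (head.startswith(b"GET") or head.startswith(b"HTTP")):
        return False  # early discard of non-HTTP packets
    return b"x-kazaa" in payload.lower()
```

Note that the second step only sees one TCP segment, so a header split across segments would be missed; this is the per-packet tradeoff discussed in the design choices.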
As stated earlier, we concentrate on P2P application detection in TCP traffic. In particular, we decomposed our P2P signatures into fixed pattern matches at fixed offsets within a TCP payload and variable pattern matches at variable offsets within a TCP payload. The fixed-offset operation can be implemented cheaply, whereas variable pattern matches are substantially more expensive.
To be able to execute the decomposed signatures on real network traffic, we implemented them in the context of the Gigascope [7] high-speed traffic monitor. In this section we first discuss the issues involved in evaluating fixed- and variable-offset signatures and then discuss how we implement them in the context of Gigascope.
5.1 Fixed Offset Match
Implementing a fixed pattern match at a fixed offset within a TCP payload is rather trivial. The complexity of this operation in the worst case is the size of the pattern matched. Despite this simplicity, it is useful to provide multiple library functions which perform this operation using slightly different parameters, to allow for the easy implementation of diverse signatures. For example, in the context of P2P signatures the offset could be specified from the beginning or end of the TCP payload, and the pattern to match could be a byte, a word in little-endian byte order, a word in big-endian byte order, or a string. Therefore, we implemented a library which provides the following functions:
byte_match_offset: returns true if a byte matches the byte in the TCP payload at a given offset. If the offset is negative, it is calculated from the end of the TCP payload.
word_match_offset: similar to byte_match_offset, except that a word is compared. This function takes as an additional argument a flag indicating the byte order of the data in the TCP payload.
string_match_offset: similar to byte_match_offset, except that a fixed-length sequence of bytes (string) is compared.
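The three library functions could look roughly like the Python sketch below. The signatures, the helper `_resolve`, and the 4-byte default word width are assumptions made for illustration; the text does not give the actual interface or word size of the library.

```python
def _resolve(payload: bytes, offset: int, width: int):
    """Translate a possibly negative offset; return None if out of range."""
    start = offset if offset >= 0 else len(payload) + offset
    if start < 0 or start + width > len(payload):
        return None
    return start


def byte_match_offset(payload: bytes, offset: int, value: int) -> bool:
    start = _resolve(payload, offset, 1)
    return start is not None and payload[start] == value


def word_match_offset(payload: bytes, offset: int, value: int,
                      big_endian: bool = True, width: int = 4) -> bool:
    # `big_endian` is the byte-order flag; `width` (assumed 4) is the word size.
    start = _resolve(payload, offset, width)
    if start is None:
        return False
    order = "big" if big_endian else "little"
    return int.from_bytes(payload[start:start + width], order) == value


def string_match_offset(payload: bytes, offset: int, pattern: bytes) -> bool:
    start = _resolve(payload, offset, len(pattern))
    return start is not None and payload[start:start + len(pattern)] == pattern
```

For instance, the first DirectConnect step becomes `byte_match_offset(p, 0, ord("$")) and byte_match_offset(p, -1, ord("|"))`, and the BitTorrent signature becomes a single `string_match_offset` call at offset 0.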
5.2 Variable Offset Match
There are multiple ways to implement matches at variable offsets in an input stream that involve variable-length strings. As discussed in Section 3, we decided to perform the matches on a per-packet basis, giving up the ability to match strings which span multiple packets in exchange for higher performance.
Using this approach, all variable matches we need to perform can be expressed as a regular expression match over TCP payloads. For example, the Gnutella data download signature can be expressed as:
'^(Server:|User-Agent:)[ \t]*(LimeWire|
Because it is expensive to perform full regular expression matches over all TCP payloads, we exploit the fact that the required regular expression matches are of a limited variety. In particular, all of the signatures we need to evaluate can be expressed as stringset1.*stringset2, where stringset1 and stringset2 each contain a list of possible strings. This allows us to use the following algorithms for our signatures:
- Standard regex (SR): This is the regular expression match function found in the standard C library on FreeBSD 4.7.
- AST regex (AR): Part of the AST library [10], this code is based on the Boyer-Moore string search algorithm [6], extended to handle alternation of fixed strings. To search for a [...] character long string in a [...] character sequence, the Boyer-Moore algorithm has worst case time complexity [...], but often runs in [...]-time on natural-language text for small values of [...]
- Karp-Rabin (KR): This is a probabilistic string matching technique [13] that compares the hash value of the pattern against the hash value of each substring of a given search text. The worst case complexity of Karp-Rabin is [...], but for many situations is often [...]
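To make the Karp-Rabin idea and the restricted stringset1.*stringset2 form concrete, the sketch below implements a rolling-hash search and uses it to test whether some string from the first set is followed, anywhere later in the payload, by some string from the second set. The names `karp_rabin_find` and `match_stringsets`, the hash base and the modulus are all choices made for this illustration. Unlike a purely probabilistic variant, this version verifies each hash hit with a direct comparison, so it never reports a false match. Line-boundary semantics of `.` in a regular expression are ignored.

```python
def karp_rabin_find(text: bytes, pattern: bytes,
                    base: int = 256, mod: int = 1_000_003) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    if m > n:
        return -1
    high = pow(base, m - 1, mod)  # weight of the byte leaving the window
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + pattern[i]) % mod
        t_hash = (t_hash * base + text[i]) % mod
    for i in range(n - m + 1):
        # Compare bytes only when the hashes agree.
        if p_hash == t_hash and text[i:i + m] == pattern:
            return i
        if i < n - m:  # roll the window one byte to the right
            t_hash = ((t_hash - text[i] * high) * base + text[i + m]) % mod
    return -1


def match_stringsets(payload: bytes, set1, set2) -> bool:
    """True if a string from set1 occurs and a string from set2 occurs after it."""
    ends = [pos + len(s) for s in set1
            if (pos := karp_rabin_find(payload, s)) >= 0]
    if not ends:
        return False
    rest = payload[min(ends):]  # earliest point at which set2 may begin
    return any(karp_rabin_find(rest, s) >= 0 for s in set2)
```

Taking the earliest end position over all first-set matches leaves the largest possible remainder for the second set, so the function accepts exactly when some ordered pair exists.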
5.3 Gigascope Based Implementation
Gigascope is a high-speed traffic monitor which can perform a variety of traffic measurement tasks at speeds up to OC-[...] ([...] Gbps). To evaluate our signature-based P2P classification, we included the libraries described above in the Gigascope framework and wrote a set of Gigascope configuration files based on our P2P signatures. In the Gigascope framework these configuration files are translated into C code, which is subsequently compiled. The resulting executable is used to perform the network monitoring in real time. Gigascope automatically breaks complex computations into multiple tasks, exploiting multiple processors if available. In addition to the real-time P2P detection task, we also used Gigascope to collect large datasets for our accuracy evaluation, as discussed in Section 7.
When we configured our Gigascope instance, we utilized the fact that fixed-offset matches are substantially cheaper to execute than variable-offset matches. For example, to identify the DirectConnect protocol we need to perform a regular expression match for:

