`Using Application Signatures
`
`Subhabrata Sen
AT&T Labs-Research
`Florham Park, NJ 07932
`sen@research.att.com
`
`Oliver Spatscheck
AT&T Labs-Research
`Florham Park, NJ 07932
`spatsch@research.att.com
`
`Dongmei Wang
AT&T Labs-Research
`Florham Park, NJ 07932
`mei@research.att.com
`
`ABSTRACT
The ability to accurately identify the network traffic associated with
different P2P applications is important to a broad range of network
operations including application-specific traffic engineering,
capacity planning, provisioning, service differentiation, etc. However,
traditional techniques for mapping traffic to higher-level applications,
such as disambiguation based on default server TCP or UDP network ports,
are highly inaccurate for some P2P applications.
In this paper, we provide an efficient approach for identifying
P2P application traffic through application-level signatures. We
first identify the application-level signatures by examining
available documentation and packet-level traces. We then use
the identified signatures to develop online filters that can efficiently
and accurately track P2P traffic even on high-speed network links.
We examine the performance of our application-level identification
approach using five popular P2P protocols. Our measurements
show that our technique achieves less than
false positive and
false negative ratios in most cases. We also show that our approach
only requires the examination of the very first few packets (less
than
packets) to identify a P2P connection, which makes our
approach highly scalable. Our technique can significantly improve
the P2P traffic volume estimates over what pure network port based
approaches provide. For instance, we were able to identify
times
as much traffic for the popular Kazaa P2P protocol, compared to
the traditional port-based approach.
`
`Categories and Subject Descriptors
C.2.3 [Computer-Communication Networks]: Network operations - Network
management, Network monitoring; D.2.8 [Software Engineering]:
Metrics - Performance measures
`
`General Terms
`Measurement, Performance, Design
`
`Keywords
Traffic Analysis, P2P, Application-level Signatures, Online Application
Classification
`
1. INTRODUCTION
Copyright is held by the author/owner(s).
WWW2004, May 17-22, 2004, New York, New York, USA.
ACM 1-58113-844-X/04/0005.

Peer-to-peer (P2P) file sharing applications have dramatically
grown in popularity over the past few years, and today constitute a
significant share of the total traffic in many networks. These
applications have proliferated in variety and have become increasingly
sophisticated along a number of dimensions including increased
scalability, more functionality, better search capabilities and
download times, etc. In particular, the newer generation P2P applications
are incorporating various strategies to avoid detection.
Access networks as well as enterprise networks require the ability
to accurately identify the different P2P applications and their
associated network traffic, for a range of uses, including network
operations and management, application-specific traffic engineering,
capacity planning, provisioning, service differentiation and cost
reduction. For example, enterprises would like to provide a degraded
service (via rate-limiting, service differentiation, blocking) to P2P
traffic to ensure good performance for enterprise-critical applications,
and/or enforce corporate rules governing the use of peer-to-peer
applications. Broadband ISPs would like to limit P2P traffic to reduce
the cost they are charged by upstream ISPs. All these require the
capability to accurately identify P2P network traffic.
Application identification inside IP networks, in general, can be
difficult. In an ideal situation, a network administrator would
possess precise information on the applications running inside the
network, along with unambiguous mappings between each application
and its network traffic (e.g., by port numbers used, IP addresses
sourcing and receiving the particular application data, etc.). However,
in general, such information is rarely available, up-to-date or
complete, and identifying either the applications or their associated
traffic is a challenging proposition. In addition, traditional
techniques like network port-based classification of applications have
now become problematic. Although the earlier P2P systems mostly
used default network ports for communication, we have found that
substantial P2P traffic nowadays is transmitted over a large number
of non-standard ports, making default port-based classification less
accurate.
In this paper, we report on our exploration of online, in-network
P2P application detection based on application signatures. The
following are some key requirements for such an application-level
filter. It must be accurate, have low overheads, and must be robust
to effects like packet losses, asymmetric routing, etc. (details in
Sections 2 and 3) that make it difficult or impossible for a monitoring
point to observe all the application-level data in a connection
flowing by.
We designed a real-time classification system which operates on
individual packets in the middle of the network, and developed
application-level signatures for a number of popular P2P applications.
Our signatures can be used directly to monitor and filter P2P
traffic.
Evaluations using large packet traces at different Internet locations
show that the individual signature-based classification (i) has
good accuracy properties (low false positives and negatives), even
in situations where not all packets in a connection are observed by
the monitoring point, (ii) can scale to handle large traffic volumes
in the order of several Gbps (Gigabits per second), and (iii) can
significantly improve the P2P traffic volume estimates over what
pure network port based approaches provide. Our filter has been
successfully deployed and is currently running at multiple network
monitoring locations.

Cloudflare - Exhibit 1022, page 512

Much existing research on P2P traffic characterization has only
considered traffic on default network ports (e.g., [11, 18, 17]). A
recent work [12] uses application signatures to characterize the
workload of Kazaa downloads, but it does not provide any evaluation of
the accuracy, scalability or robustness of its signature.
Signature-based traffic classification has mainly been performed in the
context of network security, such as intrusion and anomaly detection
(e.g., [5, 4, 19, 14]), where one typically seeks to find a signature
for an attack. In contrast, our approach identifies P2P traffic for
network planning and research purposes. This work is therefore more
closely related to [8], which provides a set of heuristics and
signatures to identify Internet chat traffic. There is also a large body
of literature on extracting information from packet traces (e.g., [9]);
however, none of these works provides and evaluates application-layer
P2P signatures.
The remainder of this paper is organized as follows. Section 2
highlights the issues involved in identifying P2P traffic in real time
inside the network. Section 3 discusses some of the design choices
we made in our approach. Section 4 derives the actual signatures
used for P2P detection, and Section 5 describes our implementation
of an online P2P application classifier using these signatures.
Section 6 presents the evaluation setting, and Section 7 describes
the evaluation results. Finally, Section 8 concludes the paper.
`
`2. PROBLEM STATEMENT
`
We first outline some key requirements of any mapping technique
for identifying traffic on high speed links inside the network.

Accuracy: The technique should have low false positives (identifying
other traffic as peer-to-peer) and low false negatives
(missing peer-to-peer traffic).

Scalability: The technique must be able to process large traffic
volumes in the order of several hundred thousand to several
million connections at a time, with good accuracy, and yet
not be computationally expensive.

Robustness: Traffic measurement in the middle of the network has
to deal with the effects of asymmetric routing (the two directions
of a connection follow different paths), packet losses and
reordering.
`
The above requirements indicate there are tradeoffs in terms of
the level of accuracy, scalability and robustness that can be achieved.
On one end of this spectrum is the current practice of TCP/UDP
port number based application identification. Port number based
application identification uses known TCP/UDP port numbers to
identify traffic flows in the network. It is highly scalable since only
the UDP/TCP port numbers have to be recorded to identify an
application. It is also highly robust since a single packet is sufficient
to make an application identification.
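To make the baseline concrete, port number based identification can be sketched as a simple table lookup. This is an illustrative sketch only: the port table lists commonly cited default ports and is an assumption (apart from DirectConnect's 411 and 412, these values are not taken from this paper), and the names `DEFAULT_PORTS` and `classify_by_port` are hypothetical.

```python
from typing import Optional

# Commonly cited default ports; an illustrative assumption, not a table
# from the paper (only DirectConnect's 411/412 appear in the text).
DEFAULT_PORTS = {
    "Kazaa": {1214},
    "Gnutella": {6346, 6347},
    "eDonkey": {4661, 4662},
    "DirectConnect": {411, 412},
    "BitTorrent": set(range(6881, 6890)),
}


def classify_by_port(src_port: int, dst_port: int) -> Optional[str]:
    """Return the application whose default port matches either endpoint."""
    for app, ports in DEFAULT_PORTS.items():
        if src_port in ports or dst_port in ports:
            return app
    return None  # unknown port: the flow stays unclassified
```

A single packet header is enough to run this check, which is why the approach is cheap and robust; a peer that moves to a non-default port simply falls through to `None`.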
Unfortunately, port number based application identification is
becoming increasingly inaccurate in identifying P2P traffic. For
example, we observed in our traffic traces that a large amount of
Kazaa traffic is not using the default Kazaa port numbers, most
likely (we speculate) to avoid detection.
To address this problem we developed and evaluated a set of
application layer signatures to improve the accuracy of P2P traffic
detection. In particular, this approach tries to determine common
signatures in the TCP/UDP payload of P2P applications.
A key challenge in realizing such signatures is the lack of openly
available, reliable, complete, up-to-date and standard protocol
specifications. This is partly due to developmental history and partly a
result of whether the protocols are open or proprietary. First, the
protocols are mostly not standardized and they are evolving. For
some protocols (e.g., Gnutella), there exists some documentation,
but it is not complete or up-to-date. In addition, there are various
implementations of Gnutella clients which do not comply
with the specifications in the available documentation, raising
potential inter-operability issues. For a user, this will manifest itself
in the form of sometimes poor search performance. For an application
classifier to be accurate, it is important to identify signatures
that span all the variants or at least the dominantly used ones. At
the other end of the spectrum is a protocol like Kazaa, which is
developed by a single organization and therefore exhibits a more
homogeneous protocol deployment, but is a proprietary protocol
with no authoritative protocol description openly available. Finally,
access to the protocol specification alone is not sufficient: we need
signatures that conform to the design decisions outlined above.
Our approach to signature identification has involved combining
information from available documentation with information gleaned
from analysis of packet-level traces to develop potential signatures.
Multiple iterations were used to evaluate the signatures against
network traffic data to improve the accuracy and computation
overheads.
`
`3. DESIGN CHOICES
`
Our main goal is to derive application layer signatures for P2P
protocols that achieve high accuracy and robustness and can be
applied in real time at Gigabit Ethernet speeds or higher.
As we will discuss in Section 7, we achieved these goals by making
the following high-level design choices.
`
UDP versus TCP: P2P traffic in principle can flow over UDP and
TCP. Since most P2P protocols currently transmit their
data via TCP, we focus on signatures found within TCP-based
P2P traffic. Our signatures could be extended to
UDP if so desired.
`
Packets versus Streams: The P2P application layer signatures can
be applied to individual TCP segments or to fully reassembled
TCP connection data streams. The advantage of applying
them to TCP data streams is that duplicate data has been
removed and that signatures can match data which is transmitted
in multiple TCP segments. However, the drawback
of applying the signatures to TCP data streams is that the
TCP segments have to be reassembled in real time on the
monitoring device. In our current design we chose to apply
the signatures to individual TCP segments, which allows us
to achieve higher speeds. We therefore focus on developing
signatures that do not span multiple TCP packet boundaries.
As we will demonstrate, we still achieve high accuracy for the
applications with the signatures that we develop.
Location of Signature: Again to improve performance, we focus
on finding signatures that appear at the beginning of
file downloads. This approach allows us to focus our
signature evaluation on the first few packets of a TCP connection.
We will study how many packets our signatures require in Section 7.
`
Robustness to network effects: We also aim to develop signatures
that can independently identify each direction of an application-level
communication. This enhances the potential of identifying connections
for which the filter does not observe one direction of the traffic
(due to asymmetric network routing), or misses some signature-carrying
packets in one or both directions (caused by either router-based load
splitting [16] or other routing instabilities). Independent
identification of each direction also serves to decrease the potential
for misclassification, by either reinforcing the marking (if both
directions identify the same application) or flagging a potential
discord (if the two directions are identified with different
applications). Note that for some usages, such as accounting for total
P2P traffic, where it is more important to identify that some P2P
communication is being used than which one, the last case (different
classifications of the two directions) is not an issue.
`
Early Discard: For efficiency reasons, we shall consider both
signatures that identify an application as well as those that
indicate that a connection does not belong to an application.
The latter category of signatures allows us to quickly identify
packets that are not likely application packets, and thereby
frees up resources for examining more promising candidates.
`
Signaling versus Transport: Since the bulk of P2P traffic is related
to file downloads and not due to file searches (signaling),
we chose to concentrate our efforts on identifying signatures
for file downloads rather than the signaling part of P2P
protocols.
`
`4. P2P PROTOCOLS AND SIGNATURES
`
Historically in the client/server model content is stored on the
server and all clients download content from the server. One
drawback of this model is that if the server is overloaded, the server
becomes the bottleneck. The P2P file sharing model addresses this
problem by allowing peers to exchange content directly. To perform
these file sharing tasks, all popular P2P protocols allow a random
host to act as both a client and a server to its peers, even though
some P2P protocols do not treat all hosts equally.
`Typically the following two phases are involved if a requester
`desires to download content:
`
`Signaling: During the signaling phase a client searches for the
`content and determines which peers are able and willing to
`provide the desired content. In many protocols this does not
`involve any direct communication with the peer which will
`eventually provide the content.
`
`Download: In this phase the requester contacts one or multiple
`peers directly to download the desired content.
`
In addition to the two phases described above, many P2P protocols
also exchange keep-alive messages or synchronize the server
lists between servers.
In the remainder of the paper we focus on the download phase
of the five most popular P2P protocols (Kazaa, Gnutella, eDonkey,
DirectConnect, and BitTorrent). We decided to only track the
download phase since it allows us to capture the majority of P2P
traffic. We will also only classify the first download in a TCP
connection. This simplification is reasonable since it is highly unlikely
that two different applications will share a single TCP connection.
In the remainder of this section we will discuss the signatures we
discovered for these five protocols. Unless otherwise specified, all
the identified signatures are case insensitive.
`4.1 Gnutella protocol
Gnutella is a completely distributed protocol. In a Gnutella network,
every client is a server and vice versa. Therefore the client
and server are implemented in a single system, called a servent. A
servent connects to the Gnutella network by establishing a
TCP connection to another servent on the network. Once a servent
has connected successfully to the network, it communicates with
other servents using Gnutella protocol descriptors for searching the
network; this is the signaling phase of the protocol. The actual
file download is achieved using an HTTP-like protocol between the
requesting servent and a servent possessing the requested file.
To develop the Gnutella signature we inspected multiple Gnutella
connections and observed that the request message for Gnutella
TCP connection creation has the following format:
`
`GNUTELLA CONNECT/<protocol version string>\n\n
`
The response message for Gnutella TCP connection creation
has the following format:
`
`GNUTELLA OK\n\n
`
`We also observed that there is an initial request-response hand-
`shake within each content download. In the download request the
`servent uses the following HTTP request headers:
`
GET /get/<File Index>/<File Name>/ HTTP/1.0\r\n
Connection: Keep-Alive\r\n
Range: byte=0-\r\n
User-Agent: <Name>\r\n
\r\n
`
`The reply message contains the following HTTP response head-
`ers:
`
`HTTP 200 OK\r\n
`Server: <Name>\r\n
`Content-type: \r\n
`Content-length: \r\n
`\r\n
`
Based on these observations and performance considerations, we
recommend the following signatures for identifying Gnutella data
downloads:

- The first string following the TCP/IP header is ‘GNUTELLA’,
‘GET’, or ‘HTTP’.

- If the first string is ‘GET’ or ‘HTTP’, there must be a field
with one of the following strings:

User-Agent: <Name>
UserAgent: <Name>
Server: <Name>

where <Name> is one of the following: LimeWire, BearShare,
Gnucleus, MorpheusOS, XoloX, MorpheusPE, gtk-gnutella,
Acquisition, Mutella-0.4.1, MyNapster, Mutella-0.4, Qtella,
AquaLime, NapShare, Comeback, Go, PHEX, SwapNut, Mutella-0.4.0,
Shareaza, Mutella-0.3.9b, Morpheus, FreeWire, Openext,
Mutella-0.3.3, Phex.
`
Generally it is much cheaper to match a string at a fixed offset
than a string at varying locations. Hence we include ‘GET’
and ‘HTTP’ here to help discard early those packets that neither
start with ‘GNUTELLA’ nor are HTTP packets. For robustness,
we included the signatures for both the request and the response
header. This way, we can identify Gnutella traffic even if we only
see one direction of the traffic.
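As an illustrative sketch (not the Gigascope implementation), the two Gnutella checks above could be expressed in Python as follows. The function name `is_gnutella` is hypothetical, and treating `<Name>` as a case-insensitive prefix of the header value (so that a value such as `LimeWire/3.8` matches) is an assumption; the text only states that `<Name>` is one of the listed client names.

```python
# Client names from the list above, lower-cased because the signatures
# are case insensitive. Prefix matching is an assumption of this sketch.
GNUTELLA_NAMES = (
    b"limewire", b"bearshare", b"gnucleus", b"morpheus", b"xolox",
    b"gtk-gnutella", b"acquisition", b"mutella-", b"mynapster", b"qtella",
    b"aqualime", b"napshare", b"comeback", b"go", b"phex", b"swapnut",
    b"shareaza", b"freewire", b"openext",
)
HEADER_FIELDS = (b"user-agent:", b"useragent:", b"server:")


def is_gnutella(payload: bytes) -> bool:
    """Apply the fixed-offset test first, then the header-field search."""
    head = payload[:8].lower()
    if head.startswith(b"gnutella"):
        return True
    if not (head.startswith(b"get") or head.startswith(b"http")):
        return False  # early discard: not a Gnutella or HTTP-like packet
    for line in payload.lower().split(b"\r\n"):
        for field in HEADER_FIELDS:
            if line.startswith(field):
                value = line[len(field):].strip()
                if value.startswith(GNUTELLA_NAMES):
                    return True
    return False
```

The cheap prefix test rejects most packets before the more expensive line-by-line header search runs, which mirrors the early-discard reasoning in the text.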
`4.2 eDonkey protocol
An eDonkey network consists of clients and servers. Each client
is connected to one main server via TCP. During the signaling
phase, it first sends the search request to its main server. (Optionally,
the client can send the search request directly to other servers
via UDP; this is referred to as extended search in eDonkey.) To
download a file subsequently from other clients, the client establishes
connections to the other clients directly via TCP, then asks
each client for different pieces of the file.
After examining eDonkey packets, we discovered that both signaling
and downloading TCP packets have the following common
eDonkey header directly following the TCP header:
`
 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8
+-+-+-+-+-+-+-+-+
|    Marker     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    packet Length (4 Bytes)                    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Message type  |
+-+-+-+-+-+-+-+-+
`
where the marker value is always 0xe3 in hex, the packet length
is specified in network byte order and its value is the byte length
of the content of the eDonkey message, excluding the 1-byte marker
and the 4-byte length field.
Utilizing these discoveries, we recommend the following signatures
for identifying eDonkey packets:
For TCP signaling or handshaking data packets, we use two steps
to identify eDonkey packets.
- The first byte after the IP+TCP header is the eDonkey marker.
- The number given by the next 4 bytes is equal to the size
of the entire packet after excluding both the IP+TCP header
bytes and 5 extra bytes.
`
Since the accuracy for identifying the P2P connections is
proportional to the length of the signatures, we tend to include as
many fields as we can so long as they do not increase the
computational complexity significantly. Here both marker and length
fields have a fixed offset, therefore the computational complexity is
the same (O(1)) for matching one of them or both, but the accuracy
is improved by
times compared with matching the marker field alone.
We have also identified the signatures for UDP handshaking messages.
However, since UDP is only used for extended searching,
and is rare compared with TCP communications, we do not report
it in this study.
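The two-step eDonkey TCP check described above can be sketched as follows. This is a minimal illustration rather than the deployed filter: the function name `is_edonkey` and the `byteorder` parameter are hypothetical. The default of `"big"` follows the statement above that the length is in network byte order; the parameter is exposed only so the assumption is easy to change.

```python
EDONKEY_MARKER = 0xE3  # marker byte stated in the text


def is_edonkey(payload: bytes, byteorder: str = "big") -> bool:
    """Check the marker byte, then compare the 4-byte length field
    with the TCP payload size minus the 5 header bytes."""
    if len(payload) < 5 or payload[0] != EDONKEY_MARKER:
        return False  # fixed-offset marker test fails
    declared = int.from_bytes(payload[1:5], byteorder)
    return declared == len(payload) - 5


# Usage: a message with 3 content bytes (message type plus 2 bytes).
example = bytes([EDONKEY_MARKER]) + (3).to_bytes(4, "big") + b"\x01ab"
```

Both tests read bytes at fixed offsets, so the cost per packet is constant regardless of payload size.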
`4.3 DirectConnect Protocol
The DirectConnect network is composed of hubs, clients, and a
single superhub with multiple servers. All of them listen on TCP
port 411 to connect and exchange commands such as search requests.
Clients (peers) store files and respond to search requests for
those files. The single superhub acts as a name service for all the
hubs. All hubs register with the superhub and clients discover hubs
by asking the superhub. Each of the clients has a username (a.k.a.
nick). Normally the clients listen at port 412 for client connections.
If port 412 is already in use, clients will use ports 413, 414
and so on. DirectConnect uses TCP for client to server and client
to client communication, while UDP is used for communication
between servers. The TCP/UDP data is a series of commands or a
public chat message. In this study, we focus on the TCP commands.
The TCP commands have the following form:
`
`$command_type field1 field2 ...|
`
which starts with the character ‘$’ and ends with the character ‘|’.
The list of valid command types for TCP communications is: MyNick,
Lock, Key, Direction, GetListLen, ListLen, MaxedOut, Error,
Send, Get, FileLength, Canceled, HubName, ValidateNick,
ValidateDenide, GetPass, Mypass, BadPass, Version, Hello, Logedin,
MyINFO, GetINFO, GetNickList, NickList, OpList, To, ConnectToMe,
MultiConnectToMe, RevConnectToMe, Search, MultiSearch,
SR, Kick, OpForceMove, ForceMove, Quit.
To improve the evaluation performance we evaluate this signature
in the following two steps:

1. The first byte after the IP+TCP header is ‘$’, and the last byte
of the packet is ‘|’.

2. Following the ‘$’, the string terminated by a space is one of
the valid TCP commands listed above.

Although we are matching against a list of strings, which can be an
expensive operation, we only perform the string match on packets
that pass the first test.
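A minimal Python sketch of this two-step evaluation is shown below. The function name `is_directconnect` is hypothetical, and using a set lookup on the first space-terminated token (instead of a regular expression) is a simplification chosen for this illustration.

```python
# Command types from the list above, lower-cased because the
# signatures are case insensitive.
DC_COMMANDS = {
    name.lower() for name in (
        b"MyNick", b"Lock", b"Key", b"Direction", b"GetListLen", b"ListLen",
        b"MaxedOut", b"Error", b"Send", b"Get", b"FileLength", b"Canceled",
        b"HubName", b"ValidateNick", b"ValidateDenide", b"GetPass",
        b"MyPass", b"BadPass", b"Version", b"Hello", b"LogedIn", b"MyINFO",
        b"GetINFO", b"GetNickList", b"NickList", b"OpList", b"To",
        b"ConnectToMe", b"MultiConnectToMe", b"RevConnectToMe", b"Search",
        b"MultiSearch", b"SR", b"Kick", b"OpForceMove", b"ForceMove", b"Quit",
    )
}


def is_directconnect(payload: bytes) -> bool:
    """Step 1: cheap first/last byte test. Step 2: command lookup."""
    if len(payload) < 2 or payload[0] != 0x24 or payload[-1] != 0x7C:
        return False  # must start with '$' (36) and end with '|' (124)
    # The command is the string after '$' up to the first space.
    token = payload[1:].split(b" ", 1)[0]
    return token.lower() in DC_COMMANDS
```

Only packets that pass the two fixed-offset byte comparisons reach the command lookup, so the more expensive string matching is applied to a small fraction of the traffic.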
`4.4 BitTorrent Protocol
The BitTorrent network consists of clients and a centralized server.
Clients connect to each other directly to send and receive portions
of a single file. The central server (called a tracker) only coordinates
the action of the clients, and manages connections. Unlike the
protocols discussed above, the BitTorrent server is not responsible
for locating files for the clients; instead, the BitTorrent
network client locates a torrent file through the Web, and initiates
the download by clicking on the hyperlink. Hence there is no
signaling communication for searching in the BitTorrent network.
To identify BitTorrent traffic, we focus on the downloading data
packets between clients only, since the communication between the
client and server is negligible.
The communication between the clients starts with a handshake
followed by a never-ending stream of length-prefixed messages.
We discovered that the BitTorrent header of the handshake messages
has the following format:
`
`<a character(1 byte)><a string(19 byte)>
`
The first byte is a fixed character with value 19, and the string
value is ‘BitTorrent protocol’. Based on this common header, we
use the following signatures for identifying BitTorrent traffic:
- The first byte in the TCP payload is the character 19 (0x13).
- The next 19 bytes match the string ‘BitTorrent protocol’.
The signatures identified here are 20 bytes long with fixed locations,
therefore they are very accurate and cost-effective.
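Because both parts of the BitTorrent signature sit at fixed offsets, the check reduces to a length byte comparison and a 19-byte string comparison. The sketch below is illustrative; the names `BT_PROTOCOL_STRING` and `is_bittorrent` are hypothetical, and the case-insensitive comparison follows the paper's general convention for signatures rather than any protocol requirement.

```python
BT_PROTOCOL_STRING = b"BitTorrent protocol"  # 19 bytes


def is_bittorrent(payload: bytes) -> bool:
    """Match the 20-byte handshake prefix at the start of the TCP payload."""
    if len(payload) < 20 or payload[0] != 19:
        return False  # first byte must be the character 19 (0x13)
    return payload[1:20].lower() == BT_PROTOCOL_STRING.lower()
```

For example, a handshake segment beginning with `b"\x13BitTorrent protocol"` followed by any further bytes is accepted, while a shorter or differently prefixed payload is rejected after at most 20 bytes have been examined.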
`
`5.1 Fixed Offset Match
Implementing a fixed pattern match at a fixed offset within a TCP
payload is rather trivial. The complexity of this operation in the
worst case is the size of the pattern matched. Despite this simplicity
it is useful to provide multiple library functions which perform this
operation using slightly different parameters to allow for the easy
implementation of diverse signatures. For example, in the context
of P2P signatures the offset could be specified from the beginning
or end of the TCP payload and the pattern matched could be a byte,
a word in little endian byte order, a word in big endian byte order,
or a string. Therefore, we implemented a library which provides
the following functions:
`byte
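The parameter variations described above (offset counted from the beginning or the end of the payload; pattern given as a byte, a word in either byte order, or a string) can be illustrated with a small set of helpers. This is a sketch under stated assumptions: the names `match_byte`, `match_word` and `match_string` and their signatures are hypothetical and are not the interface of the library used in the paper.

```python
from typing import Optional


def _window(payload: bytes, offset: int, length: int,
            from_end: bool) -> Optional[bytes]:
    """Return `length` bytes at `offset`, counted from the start or,
    if `from_end` is set, backwards from the end of the payload."""
    start = len(payload) - offset - length if from_end else offset
    if offset < 0 or start < 0 or start + length > len(payload):
        return None  # window falls outside the payload
    return payload[start:start + length]


def match_byte(payload: bytes, offset: int, value: int,
               from_end: bool = False) -> bool:
    window = _window(payload, offset, 1, from_end)
    return window is not None and window[0] == value


def match_word(payload: bytes, offset: int, value: int, size: int = 4,
               byteorder: str = "big", from_end: bool = False) -> bool:
    window = _window(payload, offset, size, from_end)
    return window is not None and int.from_bytes(window, byteorder) == value


def match_string(payload: bytes, offset: int, pattern: bytes,
                 from_end: bool = False, ignore_case: bool = True) -> bool:
    window = _window(payload, offset, len(pattern), from_end)
    if window is None:
        return False
    if ignore_case:
        return window.lower() == pattern.lower()
    return window == pattern
```

Each helper touches at most as many bytes as the pattern is long, which matches the worst-case cost stated above.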
`
`4.5 Kazaa protocol
The Kazaa network is a distributed self-organized network. In
a Kazaa network, clients with powerful connections and fast
computers are automatically selected as Supernodes. Supernodes
are local search hubs. Normal clients connect to their neighboring
Supernodes to upload information about files that they share, and
to perform searches. In turn, Supernodes query each other to fulfill
the search.
`The request message in a Kazaa download contains the following
`HTTP request headers:
`
`GET /.files HTTP/1.1\r\n
`Host: IP address/port\r\n
`UserAgent: KazaaClient\r\n
`X-Kazaa-Username: \r\n
`X-Kazaa-Network: KaZaA\r\n
`X-Kazaa-IP: \r\n
`X-Kazaa-SupernodeIP: \r\n
`
`The Kazaa response contains the following HTTP response head-
`ers:
`
`HTTP/1.1 200 OK\r\n
`Content-Length: \r\n
`Server: KazaaClient\r\n
`X-Kazaa-Username: \r\n
`X-Kazaa-Network: \r\n
`X-Kazaa-IP: \r\n
`X-Kazaa-SupernodeIP: \r\n
`Content-Type: \r\n
`
For newer Kazaa versions (v1.5 or higher), a peer may send an
encrypted short message before it sends back the above response. Note
that both messages include a field called X-Kazaa-SupernodeIP.
This field specifies the IP address of the supernode to which the
peer is connected, including the TCP/UDP supernode service port.
This information could be used to identify signaling using flow
records of all communication.
Using the special HTTP headers found in the Kazaa data download,
we recommend the following two steps to identify Kazaa downloads:
1. The string following the TCP/IP header is one of the following:
‘GET’ or ‘HTTP’.
2. There must be a field with the string: X-Kazaa.
Similar to our Gnutella signatures, we include ‘GET’ and ‘HTTP’
to discard non-HTTP packets early, so that we can avoid searching
through the whole packet to match ‘X-Kazaa’ if the packet has a
low probability of containing HTTP request or response headers.
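The two Kazaa steps can be sketched in the same style. The function name `is_kazaa` is hypothetical, and interpreting "a field with the string X-Kazaa" as a header line that begins with `X-Kazaa` is an assumption of this sketch.

```python
def is_kazaa(payload: bytes) -> bool:
    """Step 1: fixed-offset HTTP test. Step 2: search for an X-Kazaa field."""
    head = payload[:4].lower()
    if not (head.startswith(b"get") or head.startswith(b"http")):
        return False  # early discard of non-HTTP packets
    # Variable-offset search, only run on packets that look like HTTP.
    return b"\r\nx-kazaa" in payload.lower()
```

As with Gnutella, the fixed-offset prefix test keeps the whole-packet search off the common path, since most packets fail the first step.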
`
`5. SIGNATURE IMPLEMENTATION
`
As stated earlier, we concentrate on P2P application detection in
TCP traffic. In particular, we decomposed our P2P signatures into
fixed pattern matches at fixed offsets within a TCP payload and
variable pattern matches with variable offset within a TCP payload.
The fixed offset operation can be implemented cheaply whereas
variable pattern matches are substantially more expensive.
To be able to execute the decomposed signatures on real network
traffic we implemented them in the context of the Gigascope [7]
high speed traffic monitor. In this section we will first discuss the
issues involved in evaluating fixed and variable offset signatures
and then discuss how we implement them in the context of Gigascope.
`
`
`
`6. EXPERIMENTAL SETUP
`
To demonstrate the feasibility of our goal of fast P2P detection
using application layer signatures, we evaluate our signature-based
classifier along the three dimensions introduced in Section 2:
accuracy, robustness and scalability.
`6.1 Data Sets
We analyzed two full packet traces from different network vantage
points using the Gigascope.
Internet Access Trace: The first trace was collected on an access
network to a major backbone and contains typical Internet
traffic. The trace covers a 24-hour period on a Tuesday in
November 2003 and an 18-hour period on a Sunday in November
2003. The total traffic volume was
GB of compressed data and corresponded to
million TCP connections.
VPN Trace: The VPN (Virtual Private Network) trace was collected
on a T3 (45 Mbps) link connecting a VPN containing
500 employees to the Internet. The router on this link
blocks P2P ports and corporate policy prohibits the use of
P2P applications within the VPN. Therefore, this link has a
low probability of carrying P2P traffic. This trace contains
6 days' worth of data, or 1.8 Terabytes of data in 2.8 billion
packets. The data was collected in November 2003.
`6.2 Accuracy Evaluation
There are two types of classification inaccuracies, both undesirable:
- The classifier erroneously identifies non-application traffic as
application traffic. One metric to measure this error is the
False Positive (FP).
- The classifier fails to identify application traffic as such. One
measure of this error is the False Negative (FN) metric.
Let
denote the total application traffic (total bytes, connections,
etc.) identified by the signature, and
, the total actual traffic
for that application, and
be the total amount of non-application
traffic identified as application traffic. Then the FP and FN ratios
are computed as
`
`5.3 Gigascope Based Implementation
Gigascope is a high speed traffic monitor which can perform a
variety of traffic measurement tasks at speeds up to OC-48 (2x2.4
Gbps). To evaluate our signature based P2P classification we
included the libraries described above into the Gigascope framework
and wrote a set of Gigascope configuration files based on our P2P
signatures. In the Gigascope framework these configuration files
are translated into C code which is subsequently compiled. The
resulting executable is used to perform the network monitoring in
real time. Gigascope automatically breaks complex computation
into multiple tasks, exploiting multiple processors if available. In
addition to the real-time P2P detection task we also used Gigascope
to collect large datasets for our accuracy evaluation as discussed in
Section 7.
When we configured our Gigascope instance we utilized the fact
that fixed offset matches are substantially cheaper to execute than
variable offset matches. For example, to identify the DirectConnect
protocol we need to perform a regular expression match for:
`types|MyNick|Lock|Key|Direction|
`GetListLen|ListLen|MaxedOut|Error|
`Send|Get|FileLength|Canceled|HubName|
`ValidateNick|ValidateDenide|GetPass|
`MyPass|BadPass|Version|Hello|LogedIn|
`MyINFO|GetINFO|GetNickList|NickList|
`OpList|To|ConnectToMe|MultiConnectToMe|
`RevConnectToMe|Search|MultiSearch|SR|
`Kick|OpForceMove|ForceMove|Quit
However, we also know that the first byte of the DirectConnect
TCP payload needs to be 36 and the last byte 124. We therefore
configured the Gigascope to only try the regular expression match
for DirectConnect if the fixed offset fields match.
`Note that we used a similar approach for Gnutella a