`
`Thomas Karagiannis
`UC Riverside
`
`Andre Broido
`CAIDA, SDSC
`
`Michalis Faloutsos
`UC Riverside
`
`Kc claffy
`CAIDA, SDSC
`
`ABSTRACT
`Since the emergence of peer-to-peer (P2P) networking in the
`late ’90s, P2P applications have multiplied, evolved and es-
`tablished themselves as the leading ‘growth app’ of Internet
`traffic workload.
`In contrast to first-generation P2P net-
`works which used well-defined port numbers, current P2P
`applications have the ability to disguise their existence through
`the use of arbitrary ports. As a result, reliable estimates of
`P2P traffic require examination of packet payload, a method-
`ological landmine from legal, privacy, technical, logistic, and
`fiscal perspectives. Indeed, access to user payload is often
`rendered impossible by one of these factors, inhibiting trust-
`worthy estimation of P2P traffic growth and dynamics. In
`this paper, we develop a systematic methodology to identify
`P2P flows at the transport layer, i.e., based on connection
`patterns of P2P networks, and without relying on packet
`payload. We believe our approach is the first method for
`characterizing P2P traffic using only knowledge of network
`dynamics rather than any user payload. To evaluate our
`methodology, we also develop a payload technique for P2P
`traffic identification, by reverse engineering and analyzing
`the nine most popular P2P protocols, and demonstrate its
`efficacy with the discovery of P2P protocols in our traces
`that were previously unknown to us. Finally, our results
`indicate that P2P traffic continues to grow unabatedly, con-
`trary to reports in the popular media.
`
`Categories and Subject Descriptors
`C.2.5 [Computer-Communication Networks]: Local and Wide-
`Area Networks
`
`General Terms
`Algorithms, Measurement
`
`Keywords
`Peer-to-peer, Measurements, Traffic classification
`
`Permission to make digital or hard copies of all or part of this work for
`personal or classroom use is granted without fee provided that copies are
`not made or distributed for profit or commercial advantage and that copies
`bear this notice and the full citation on the first page. To copy otherwise, to
`republish, to post on servers or to redistribute to lists, requires prior specific
`permission and/or a fee.
`IMC’04, October 25–27, 2004, Taormina, Sicily, Italy.
`Copyright 2004 ACM 1-58113-821-0/04/0010 ...$5.00.
`
1. INTRODUCTION
Over the last few years, peer-to-peer (P2P) file-sharing
has grown relentlessly to become a formidable component
of Internet traffic. P2P volume is sufficiently dominant on
some links to motivate increased local peering among
Internet Service Providers [25], with an observable yet
unquantified effect on the global Internet topology and
routing system, not to mention competitive market dynamics.
Despite this dramatic growth, reliable profiling of P2P
traffic remains elusive. We no longer enjoy the fleeting
benefit of first-generation P2P traffic, which was relatively
easy to classify due to its use of well-defined port numbers.
Current P2P networks tend to intentionally disguise their
generated traffic to circumvent both filtering firewalls and
legal issues, most emphatically articulated by the Recording
Industry Association of America (RIAA). Not only do most
P2P networks now operate on top of nonstandard,
custom-designed proprietary protocols, but current P2P
clients can also easily operate on any port number, even
HTTP’s port 80.
These circumstances portend a frustrating conclusion: robust
identification of P2P traffic is only possible by examining
user payload. Yet packet payload capture and analysis pose
a set of often insurmountable methodological landmines:
legal, privacy, technical, logistic, and financial obstacles
abound, and overcoming them still leaves the task of reverse
engineering a growing number of poorly documented P2P
protocols. Further obfuscating workload characterization
attempts is the increasing tendency of P2P protocols to
support payload encryption. Indeed, the frequency with
which P2P protocols are introduced and/or upgraded renders
packet payload analysis not only impractical but also
glaringly inefficient.
`In this paper we develop a systematic methodology to
`identify P2P flows at the transport layer, i.e., based on flow
`connection patterns of P2P traffic, and without relying on
`packet payload. The significance of our algorithm lies in its
`ability to identify P2P protocols without depending on their
`underlying format, which offers a distinct advantage over
`payload analysis: we can identify previously unknown P2P
`protocols. In fact during our analysis we detected traffic of
`three distinct P2P protocols previously unknown to us. To
`validate our methodology we also developed a payload-based
`technique for P2P traffic identification, by reverse engineer-
`ing and analyzing the nine most popular P2P protocols.
`Specifically, the highlights of our paper include:
• We develop a systematic methodology for P2P traffic
profiling by identifying flow patterns and characteristics
of P2P behavior, without examination of user payload.
• Our methodology effectively identifies 99% of P2P flows
and more than 95% of P2P bytes (compared to payload
analysis), while limiting false positives to under 10%.
• Our methodology is capable of identifying P2P flows
missed by payload analysis. Using our methodology
we identify approximately 10% additional P2P flows
over payload analysis.
• Using data collected at an OC-48 (2.5 Gbps) link of a
Tier 1 Internet Service Provider (ISP), we provide
realistic estimates and trends of P2P traffic in the
wide-area Internet over the last few years. We find that in
contrast to claims of a sharp decline, P2P traffic has
been constantly growing.

Cloudflare - Exhibit 1047, page 121

Table 1: Bulk sizes of OC-48 datasets
Set   Bb  Date        Day  Start  Dur    Dir      Src.IP  Dst.IP   Flows   Packets  Bytes   Aver.Util.  Ut.%
D09N  2   2003-05-07  Wed  10:00  2 h    Nbd (1)  904 K   2992 K   56.7 M  930.4 M  603 G   651 Mbps    26.2
D09S  2   2003-05-07  Wed  10:00  2 h    Sbd (0)  466 K   2527 K   47.3 M  624.2 M  340 G   376 Mbps    15.1
D10N  2   2004-01-22  Thu  14:00  60 m   Nbd (1)  812 K   2181 K   23.6 M  412.7 M  288 G   638.9 Mbps  25.7
D10S  2   2004-01-22  Thu  14:00  60 m   Sbd (0)  279 K   4177 K   18.6 M  252.7 M  117 G   260.4 Mbps  10.5
D11S  2   2004-02-25  Wed  10:00  2 h    Sbd (0)  410 K   7465 K   25.3 M  249.6 M  98.5 G  109.4 Mbps  4.4
D13N  2   2004-04-21  Wed  20:00  122 m  Nbd (1)  1971 K  6956 K   86.4 M  1263 M   852 G   930.6 Mbps  37.4
D13S  2   2004-04-21  Wed  20:00  122 m  Sbd (0)  306 K   10847 K  27.8 M  266.4 M  106 G   115.5 Mbps  4.6

`Our methodology can be expanded to support profiling of
`various types of traffic. Since mapping applications by port
`numbers is no longer substantially valid, a generalized ver-
`sion of our algorithm can support traffic characterization
`tasks beyond P2P workload. Indeed, to minimize false pos-
`itives in P2P traffic identification, we assess, and then filter
`by, connection features of numerous protocols and applica-
`tions (such as mail or DNS).
`The rest of this paper is structured as follows: Section 2
`describes our backbone traces, which span from May 2003
`to April 2004. Section 3 discusses previous work in P2P
`traffic estimation and analysis. Sections 4 and 5 describe in
`detail our payload and nonpayload methodologies for P2P
`traffic identification. Section 6 presents an evaluation of our
`algorithm by comparing the volume of P2P identified by
`our methods. In section 7 we challenge media claims that
`the pervasive litigation undertaken by the RIAA is causing
`an overall decline in P2P file-sharing activity. Section 8
`concludes our paper.
`
`2. DATA DESCRIPTION
Some of the traces analyzed in this paper are included in
CAIDA’s Backbone Data Kit (BDK) [1], which consists of packet
traces captured at an OC-48 link of a Tier 1 US ISP
connecting POPs from San Jose, California to Seattle,
Washington. Table 1 lists general workload dimensions of our
datasets: counts of distinct source and destination IP
addresses and the numbers of flows, packets, and bytes
observed. We processed traces with CAIDA’s CoralReef
suite [20].
We analyze traces taken on May 7, 2003 (D09), January
22, 2004 (D10), February 25, 2004 (D11), and April 21, 2004
(D13); the dates are those listed in Table 1. We captured the
traces with Dag 4 monitors [14] and packet capture software
from the University of Waikato and Endace [12] that supports
observation of one or both directions of the link.
For our older traces (D01-D10), our monitors captured
44 bytes of each packet, which includes IP and TCP/UDP
headers and an initial 4 bytes of payload for some packets.
However, approximately 60%-80% of the packets in these
traces are encapsulated with an extra 4-byte MPLS label,
which leaves no space for payload bytes.
Fortunately, we were able to capture the February and
April 2004 traces (D11 and D13) with 16 bytes of TCP/UDP
payload, which allows us to evaluate our nonpayload
methodology. To protect privacy, our monitoring system
anonymized the IP addresses in these traces using the
Cryptography-based Prefix-preserving Anonymization
algorithm (Crypto-PAn) [33].
`
`3. PREVIOUS WORK
`Most P2P traffic research has thus far emphasized detailed
`characterization of a small subset of P2P protocols and/or
`networks [19] [15], often motivated by the dominance of that
`protocol in a particular provider’s infrastructure or during
`a specific time period. Typical data sources range from aca-
`demic network connections [27], [21] to Tier 2 ISPs [22].
`Other P2P measurement studies have focused on topo-
`logical characteristics of P2P networks based on flow level
`analysis [29], or investigating properties such as bottleneck
`bandwidths [27], the possibility of caching [22], or the avail-
`ability and retrieval of content [3] [13].
`Recently, Sen et al. developed a signature-based payload
`methodology [28] to identify P2P traffic. The authors focus
`on TCP signatures that characterize file downloads in five
`P2P protocols based on the examination of user payload.
`The methodology in [28] is similar to our payload analysis
`and it is further discussed in section 4.
A number of Sprint studies [8] report on P2P traffic as
observed in a major Tier 1 provider backbone. However,
their volume estimates taxonomize applications based on
fixed port numbers from CoralReef’s database [23], an
approach that captures only a small and decreasing fraction
of P2P traffic.
`Our approach differs from previous work in three ways:
`• We analyze traffic sources of exceptionally high diver-
`sity, from major Tier 1 ISPs at the Internet core.
• We study all popular P2P applications available:
neither of our methodologies (payload and nonpayload)
is limited to a subset of P2P networks. On the contrary,
we study those P2P applications that currently
contribute the vast majority of P2P traffic.
`• We combine and cross-validate identification methods
`that use fixed ports, payload, and transport layer dy-
`namics.
`
4. PAYLOAD ANALYSIS OF P2P TRAFFIC AND LIMITATIONS
`Our payload analysis of P2P traffic is based on identify-
`ing characteristic bit strings in packet payload that poten-
`tially represent control traffic of P2P protocols. We mon-
`itor the nine most popular P2P protocols: eDonkey [10]
`(also includes the Overnet and eMule [11] networks), Fast-
`track which is supported by the Kazaa client, BitTorrent [4],
`OpenNap and WinMx [32], Gnutella, MP2P [24], Soulseek [30],
`Ares [2] and Direct Connect [7].
Each of these P2P networks operates on top of nonstandard,
usually custom-designed proprietary protocols. Hence,
payload identification of P2P traffic requires separate
analysis of the various P2P protocols to identify the specific
packet format used in each case. This section describes
limitations that inhibit accurate identification of P2P
traffic at the link level. In addition, we present our
methodology to identify P2P flows.
`4.1 Limitations
`We had to carefully consider several issues throughout our
`study. While some of these restrictions are data related, oth-
`ers originate from the nature of P2P protocols. Specifically,
`these limitations are the following:
Captured payload size: CAIDA monitors capture the
first 16 bytes of user payload¹ of each packet (see section 2)
for our February and April traces. While our payload
heuristics would be capable of effectively identifying all P2P
packets if the whole payload were available, this 16-byte
payload restriction limits the number of heuristics that can
reliably pinpoint P2P flows. Furthermore, our older traces
(May 2003, January 2004) contain only 4 bytes of payload for
a limited number of packets, since our monitors captured
only 44 bytes of each packet (e.g., TCP options push payload
bytes out of the captured segment). Limitations of our older
traces are described in detail in section 7.
`
HTTP requests: Several P2P protocols use HTTP requests
and responses to transfer files, and it can be impossible
to distinguish such P2P traffic from typical web traffic
given only 16 bytes of payload. For example, “HTTP/1.1 206
Partial Content” could represent either HTTP or P2P.
`
Encryption: An increasing number of P2P protocols rely
on encryption and SSL to transmit packets and files.
Payload string matching misses all encrypted P2P packets.
`
Other P2P protocols: The widespread use of file-sharing
and P2P applications yields a broad variety of P2P
protocols. Thus our analysis of the top nine P2P protocols
cannot guarantee identification of all P2P flows, especially
given the diversity of the OC-48 backbone link. However, our
experience with P2P applications and traffic analysis
convinces us that these nine protocols represent the vast
majority of current P2P traffic.
`
`Unidirectional traces: Some of our traces reflect only
`one direction of the monitored link. In these cases we cannot
`identify flows that carry the TCP acknowledgment stream
`of a P2P download, since there is no payload. Even if we
`monitored both directions of the link, asymmetric routing
`renders it unlikely to find both streams (data and acknowl-
`edgment) of a TCP flow on the same link.
`We can overcome these limitations with our nonpayload
`methodology described in section 5.
`
`4.2 Methodology
Our analysis is based on identifying specific bit strings
in the application-level user data. Since documentation for
P2P protocols is generally poor, we empirically derived a set
of distinctive bit strings for each case by monitoring both
TCP and UDP traffic using tcpdump [31] after installing
various P2P clients. Table 2 lists a subset of these strings
for some of the analyzed protocols for TCP and UDP. Table 2
also presents the well-known ports for these P2P protocols.
The complete list of bit strings we used is in [18].

Table 2: Strings at the beginning of the payload of P2P
protocols. The prefix “0x” denotes hexadecimal values.

P2P Protocol    String               Trans. prot.  Def. ports
eDonkey2000     0xe319010000         TCP/UDP       4661-4665
                0xc53f010000
Fasttrack       “Get /.hash”         TCP           1214
                0x270000002980       UDP
BitTorrent      “0x13Bit”            TCP           6881-6889
Gnutella        “GNUT”, “GIV”        TCP           6346-6347
                “GND”                UDP
MP2P            GO!!, MD5, SIZ0x20   TCP           41170 UDP
Direct Connect  “$MyN”, “$Dir”       TCP           411-412
                “$SR”                UDP
Ares            “GET hash:”          TCP           -
                “Get sha1:”

¹ Privacy issues and agreement with the ISP prohibit the
examination of more bytes of user payload.

We classify packets into flows, defined by the 5-tuple of
source IP, destination IP, protocol, source port and
destination port. We use the commonly accepted 64-second flow
timeout [6], i.e., if no packet arrives in a specific flow for
64 seconds, the flow expires. To address the limitations
described in the previous section, we apply three different
methods to estimate P2P traffic, listed in order of increasing
aggressiveness in classifying flows as P2P:
`
M1: If a source or destination port number of a flow
matches one of the well-known port numbers (Table 2), the
flow is flagged as P2P.
`
`M2: We compare the payload (if any) of each packet in a
`flow against our table of strings. In case of a match between
`the 16-byte payload of a packet and one of our bit strings,
`we flag the flow as P2P with the corresponding protocol,
`e.g., Fasttrack, eDonkey, etc. If none of the packets match,
`we classify the flow as non-P2P.
`
M3: If a flow is flagged as P2P, both source and destination
IP addresses of this flow are hashed into a table. All
flows that contain an IP address in this table are flagged
as “possible P2P” even if there is no payload match. To
avoid recursive misclassification of non-P2P flows as P2P,
we perform this type of IP tracking only for host IPs that
M2 identified as P2P.
`
In all P2P networks, P2P clients keep a large number of
connections open even if there are no active file transfers.
There is thus an increased probability that a host identified
as P2P by M2 will participate in other P2P flows. These
flows will be flagged as “possible P2P” in M3. On the other
hand, a P2P user may be browsing the web or sending email
while connected to a P2P network. Thus, to minimize false
positives, we exclude from M3 all flows whose source or
destination port implies web, mail, FTP, SSL, or DNS (i.e.,
ports 80, 8000, 8080, 25, 110, 21, 22, 443, 53) for TCP, and
online gaming or DNS (e.g., 27015-27050, 53) for UDP.²
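
The three estimation methods can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the tooling used in the study: the `Flow` record, the function names, the two-pass structure of `m3`, and the choice to apply the port exclusions only to flows flagged through IP tracking are all hypothetical. The signatures are the subset shown in Table 2 (MP2P is omitted), and “0x13Bit” is read here as the byte 0x13 followed by the ASCII text “Bit”.

```python
from collections import namedtuple

# Hypothetical flow record; payloads holds the captured bytes of each packet.
Flow = namedtuple("Flow", "src_ip dst_ip proto src_port dst_port payloads")

TCP, UDP = 6, 17  # IP protocol numbers

# Default ports from Table 2 (the transport protocol is ignored for simplicity).
WELL_KNOWN_PORTS = (set(range(4661, 4666)) | {1214} | set(range(6881, 6890))
                    | {6346, 6347, 41170, 411, 412})

# Subset of the payload strings in Table 2.
SIGNATURES = {
    "eDonkey": [bytes.fromhex("e319010000"), bytes.fromhex("c53f010000")],
    "Fasttrack": [b"Get /.hash", bytes.fromhex("270000002980")],
    "BitTorrent": [b"\x13Bit"],
    "Gnutella": [b"GNUT", b"GIV", b"GND"],
    "Direct Connect": [b"$MyN", b"$Dir", b"$SR"],
    "Ares": [b"GET hash:", b"Get sha1:"],
}

# Ports excluded from M3 (web, mail, FTP, SSL, DNS, online gaming).
M3_EXCLUDED_PORTS = {
    TCP: {80, 8000, 8080, 25, 110, 21, 22, 443, 53},
    UDP: set(range(27015, 27051)) | {53},
}


def m1(flow):
    """M1: flag the flow if either port is a well-known P2P port."""
    return flow.src_port in WELL_KNOWN_PORTS or flow.dst_port in WELL_KNOWN_PORTS


def m2(flow):
    """M2: return the protocol whose string starts a packet payload, else None."""
    for payload in flow.payloads:
        head = payload[:16]  # only 16 bytes of payload are captured
        for protocol, strings in SIGNATURES.items():
            if any(head.startswith(s) for s in strings):
                return protocol
    return None


def m3(flows):
    """M3: M2 labels plus 'possible P2P' for other flows of M2-identified hosts.

    flows must be a list; the two passes are an assumption of this sketch.
    """
    labels = [m2(f) for f in flows]
    p2p_ips = set()
    for f, label in zip(flows, labels):
        if label:
            p2p_ips.update((f.src_ip, f.dst_ip))
    result = []
    for f, label in zip(flows, labels):
        excluded = M3_EXCLUDED_PORTS.get(f.proto, set())
        if label:
            result.append(label)
        elif ((f.src_ip in p2p_ips or f.dst_ip in p2p_ips)
              and f.src_port not in excluded and f.dst_port not in excluded):
            result.append("possible P2P")
        else:
            result.append(None)
    return result
```

Only hosts with an M2 payload match seed the IP table, which mirrors the safeguard against recursive misclassification described for M3.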
In general, we believe that M3 will provide an estimate
closer to the real intensity of P2P traffic, especially with
limited 4-byte payload traces, while M2 provides a loose lower
bound on P2P volume. M3 takes advantage of our ability to
identify IPs participating in P2P flows as determined by M2,
facilitating identification of flows for which payload analysis
fails. M3 is used only in section 7, where we examine the
evolution of the volume of P2P traffic. In that section, we
use M3 to overcome the problem of the limited 4-byte payload
in our older traces. For all other analysis, payload P2P
estimates are strictly based on payload string matching,
namely M2.

² Since nothing prevents P2P clients from also using these
ports, excluding specific protocols by looking at port numbers
may result in underestimating P2P flows.

`Recently, Sen et al. developed a similar signature-based
`payload methodology [28]. The authors concentrate on TCP
`signatures that characterize file downloads in five P2P proto-
`cols and identify P2P traffic based on the examination of all
`user payload bytes. [28] describes a subset of the signatures
`included in our methodology, since we also use UDP-based
`as well as protocol signaling signatures for a larger number
`of P2P protocols/networks (e.g., the WinMx/OpenNap net-
`work is not analyzed in [28], although it corresponds to a
significant portion of P2P traffic [17]). On the other hand,
[28] has the advantage of examining all user payload
bytes. While examining all bytes of the payload should
increase the amount of identified P2P traffic, we expect only
a minimal difference in the number of identified P2P flows
between [28] and the methodology described in this section.
`First, characteristic signatures or bit strings of P2P packets
`appear at the beginning of user payload; thus, 16 bytes of
`payload should be sufficient to capture the majority of P2P
`flows. Second, we expect that missed flows due to the pay-
`load limitation will be identified by our M3 method and/or
`by TCP and UDP control traffic originating from the specific
`IPs.
`
5. NONPAYLOAD IDENTIFICATION OF P2P TRAFFIC
We now describe our nonpayload methodology for P2P
traffic profiling (PTP). Our method examines only the packet
header to detect P2P flows, and does not in any way examine
user payload. To our knowledge, this is the first attempt to
identify P2P flows on arbitrary ports without any inspection
of user payload.
`Our heuristics are based on observing connection patterns
`of source and destination IPs. While some of these patterns
`are not unique to P2P hosts, examining the flow history of
`IPs can help eliminate false positives and reveal distinctive
`features.
`We employ two main heuristics that examine the behavior
`of two different types of pairs of flow keys. The first exam-
`ines source-destination IP pairs that use both TCP and UDP
`to transfer data (TCP/UDP heuristic, section 5.1). The sec-
`ond is based on how P2P peers connect to each other by
`studying connection characteristics of {IP, port} pairs (sec-
`tion 5.2). A high level description of our algorithm is as
`follows:
• Data processing: We build the flow table as we observe
packets crossing the link, based on 5-tuples, similar to the
payload method. At the same time we collect information
on various characteristics of {IP, port} pairs,
including the sets of distinct IPs and ports that an
{IP, port} pair is connected to, packet sizes used and
transferred flow sizes.
• Identification of potential P2P pairs: We flag flows as
potentially P2P based on TCP/UDP usage and the {IP,
port} connection characteristics.
• False positives: We eliminate false positives by
comparing flagged P2P flows against our set of heuristics
that identify mail servers, DNS flows, malware, etc.

Table 3: Excluded ports for TCP/UDP IP pairs heuristic.
Ports                               Application
135, 137, 139, 445                  NETBIOS
53                                  DNS
123                                 NTP
500                                 ISAKMP
554, 7070, 1755, 6970, 5000, 5001   streaming
7000, 7514, 6667                    IRC
6112, 6868, 6899                    gaming
3531                                p2pnetworking.exe

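
The data-processing step can be sketched as follows. This is a minimal illustration, assuming packets arrive in timestamp order as plain tuples; the `FlowTable` class, its attribute names, and the lazy expiry (a flow is closed only when a later packet with the same 5-tuple arrives after the timeout) are hypothetical choices made for the example, not the authors' implementation.

```python
from collections import defaultdict

FLOW_TIMEOUT = 64.0  # seconds without a packet before a flow expires


class FlowTable:
    """Flow table keyed by 5-tuple, plus per-{IP, port} connection sets."""

    def __init__(self):
        self.active = {}    # 5-tuple -> flow statistics
        self.expired = []   # (5-tuple, statistics) of timed-out flows
        self.peer_ips = defaultdict(set)    # {IP, port} -> distinct connected IPs
        self.peer_ports = defaultdict(set)  # {IP, port} -> distinct connected ports

    def add_packet(self, ts, src_ip, dst_ip, proto, src_port, dst_port, size):
        key = (src_ip, dst_ip, proto, src_port, dst_port)
        flow = self.active.get(key)
        # Lazy expiry: an idle gap of at least 64 s starts a new flow.
        if flow is not None and ts - flow["last"] >= FLOW_TIMEOUT:
            self.expired.append((key, flow))
            flow = None
        if flow is None:
            flow = self.active[key] = {
                "first": ts, "last": ts, "packets": 0, "bytes": 0}
        flow["last"] = ts
        flow["packets"] += 1
        flow["bytes"] += size
        # Record the connection for both the source and destination pair.
        self.peer_ips[(src_ip, src_port)].add(dst_ip)
        self.peer_ports[(src_ip, src_port)].add(dst_port)
        self.peer_ips[(dst_ip, dst_port)].add(src_ip)
        self.peer_ports[(dst_ip, dst_port)].add(src_port)
```

The two dictionaries of sets hold exactly the quantities the {IP, port} heuristic compares: the number of distinct IPs and the number of distinct ports connected to each pair.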
`5.1 TCP/UDP IP pairs
`Our first heuristic identifies source-destination IP pairs
`that use both TCP and UDP transport protocols. Six out
`of nine analyzed P2P protocols use both TCP and UDP as
`layer-4 transport protocols. These protocols include eDon-
`key, Fasttrack, WinMx, Gnutella, MP2P and Direct Con-
`nect. Generally, control traffic, queries and query-replies
`use UDP, and actual data transfers use TCP. To identify
`P2P hosts we can thus look for pairs of source-destination
`hosts that use both transport protocols (TCP and UDP).
While concurrent usage of both TCP and UDP is definitely
typical for the aforementioned P2P protocols, it is also
typical of other application-layer protocols such as DNS or
streaming media. To determine non-P2P applications in our
traces that use both transport protocols, we examined all
source-destination host pairs for which both TCP and UDP
flows exist. We found that besides P2P protocols, only a few
applications use both TCP and UDP transport protocols:
DNS, NETBIOS, IRC, gaming and streaming, which
collectively use a small set of port numbers such as 135,
137, 139, 445, 53, 3531, etc. Table 3 lists all such
applications found, together with their well-known ports.
Port 445 is related to the Microsoft NETBIOS service. Port
3531 is used by an application called p2pnetworking.exe,
which is automatically installed by Kazaa. Although
p2pnetworking.exe is related to P2P traffic, we choose to
exclude it from our analysis since it is not under user
control³ and is specific to the Kazaa client. After excluding
flows that use the ports listed in Table 3, 98.5% of the
remaining source-destination IP pairs that use both TCP and
UDP in our traces are P2P, based on the payload analysis with
M2 described in Section 4. In summary, if a source-destination
IP pair concurrently uses both TCP and UDP as transport
protocols, we consider flows between this pair P2P so long as
the source or destination ports are not in the set in Table 3.
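
A minimal sketch of the TCP/UDP heuristic follows. The function name, the tuple layout of a flow, and the treatment of a source-destination pair as directed are assumptions for illustration; the port set is copied from Table 3, and flows on excluded ports are dropped before pairs are examined, as in the summary above.

```python
from collections import defaultdict

TCP, UDP = 6, 17  # IP protocol numbers

# Ports from Table 3.
EXCLUDED_PORTS = {
    135, 137, 139, 445,                   # NETBIOS
    53,                                   # DNS
    123,                                  # NTP
    500,                                  # ISAKMP
    554, 7070, 1755, 6970, 5000, 5001,    # streaming
    7000, 7514, 6667,                     # IRC
    6112, 6868, 6899,                     # gaming
    3531,                                 # p2pnetworking.exe
}


def tcp_udp_p2p_pairs(flows):
    """Return the (src_ip, dst_ip) pairs that use both TCP and UDP.

    flows: iterable of (src_ip, dst_ip, proto, src_port, dst_port) tuples
    observed during one time interval.
    """
    protocols = defaultdict(set)
    for src_ip, dst_ip, proto, src_port, dst_port in flows:
        if src_port in EXCLUDED_PORTS or dst_port in EXCLUDED_PORTS:
            continue  # skip flows of known non-P2P TCP/UDP applications
        if proto in (TCP, UDP):
            protocols[(src_ip, dst_ip)].add(proto)
    return {pair for pair, seen in protocols.items() if seen == {TCP, UDP}}
```

For example, a pair with one TCP and one UDP flow on arbitrary ports is returned, while a pair whose only TCP and UDP flows use port 53 is not.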
`
`5.2 {IP, port} pairs
`Our second heuristic is based on monitoring connection
`patterns of {IP, port} pairs.
Since the lawsuit against Napster, the prevalence of
centralized P2P networks has decreased dramatically, and
distributed or hybrid P2P networks have emerged. To connect
to these distributed networks, each P2P client maintains a
starting host cache. Depending on the network, the host
cache may contain the IP addresses of other peers, servers
or supernodes/superpeers.⁴ This pool of hosts facilitates
the initial connection of the new peer to the existing P2P
network.

³ The user cannot change the port number or control its
functionality, and all flows of p2pnetworking.exe use port
3531.

Figure 1: Initial connection from a new P2P host A to the P2P network. Host A connects to a superpeer picked from its
host cache. Peer A informs the superpeer of its IP address and the port on which it is willing to accept connections from other
peers. The superpeer propagates the {IP, port} pair to the rest of the P2P network. Peers willing to connect to host A use the
advertised {IP, port} pair. For the {IP, port} pair {A,1}, the number of distinct IPs (C,B) connected to it is equal to the number
of distinct ports (10,15) used to connect to it. Our {IP, port} pair heuristic is based on such equality between the number of
distinct ports and the number of distinct IPs affiliated with a pair in order to identify potential P2P pairs.

`As soon as a connection exists to one of the IPs in the host
`cache (we will henceforth refer to these IPs as superpeers),
`the new host A informs that superpeer of its IP address and
`port number at which it will accept connections from peers.
`Host A also provides other information specific to each P2P
`protocol but not relevant here. While in first-generation
`P2P networks the listening port was well-defined and spe-
`cific to each network, simplifying P2P traffic classification,
`newer versions of all P2P clients allow the user to config-
`ure a random port number (some clients even advise users
`to change the port number to disguise their traffic). The
`superpeer must propagate this information, mainly the {IP,
`port} pair of the new host A, to the rest of the network. This
`{IP, port} pair is essentially the new host’s ID, which other
`peers need to use to connect to it. In summary, when a P2P
`host initiates either a TCP or a UDP connection to peer A,
`the destination port will also be the advertised listening port
`of host A, and the source port will be an ephemeral random
`port chosen by the client.
Normally, peers maintain at most one TCP connection to
each other peer, but there may also be a UDP flow to the
same peer, as described previously. Keeping in mind that
multiple connections between peers are rare in our data sets,
we consider what happens when twenty peers all connect
to peer A. Each peer will select a temporary source port
and connect to the advertised listening port of peer A. The
advertised {IP, port} pair of host A would thus be affiliated
with 20 distinct IPs and 20 distinct ports.⁵ In other words,
for the advertised destination {IP, port} pair of host A, the
number of distinct IPs connected to it will be equal to the
number of distinct ports used to connect to it. Figure 1
illustrates the procedure whereby a new host connects to
the P2P network and advertises its {IP, port} pair.

⁴ Superpeers/supernodes are P2P hosts that handle
advanced functionality in the P2P network, such as routing
and query propagation.
⁵ The probability that two distinct hosts pick the same
random source port at the same time is extremely low.
`
On the other hand, consider what happens in the case of
web and HTTP. As in the P2P case, each host connects to
a pre-specified {IP, port} pair, e.g., the IP address of a web
server W and port 80. However, a host connecting to the
web server will usually initiate more than one concurrent
connection in order to download objects in parallel. In
summary, web traffic will have a higher ratio than P2P
traffic of the number of distinct ports to the number of
distinct IPs connected to the {IP, port} pair {W,80}.
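
The contrast between the two cases can be made concrete with a small worked example. The addresses, port numbers, and the figure of four parallel connections per web client are invented for illustration; `pair_profile` is a hypothetical helper that counts the distinct IPs and distinct ports connected to one {IP, port} pair.

```python
def pair_profile(connections):
    """Count distinct remote IPs and ports connected to one {IP, port} pair.

    connections: iterable of (remote_ip, remote_port) tuples.
    """
    connections = list(connections)
    ips = {ip for ip, _ in connections}
    ports = {port for _, port in connections}
    return len(ips), len(ports)


# Twenty peers, each opening one connection to peer A's listening port.
p2p_connections = [(f"10.0.0.{i}", 30000 + i) for i in range(20)]

# Twenty web clients, each opening four parallel connections to {W,80}
# (four is an assumed value for this example).
web_connections = [(f"10.0.1.{i}", 40000 + 4 * i + j)
                   for i in range(20) for j in range(4)]

print(pair_profile(p2p_connections))  # (20, 20): IPs equal ports
print(pair_profile(web_connections))  # (20, 80): many more ports than IPs
```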
`5.3 Methodology
Our nonpayload methodology builds on insights from
sections 5.1 and 5.2. Specifically, for a time interval
t we build the flow table for the link, based on the
five-tuple key and 64-second flow timeout as with the payload
methodology described in section 4. We then examine our
two primary heuristics:
• We look for source-destination IP pairs that
concurrently use both TCP and UDP during t. If such IP
pairs exist and they do not use any ports from Table 3,
we consider them P2P.
• We examine all source {srcIP, srcport} and destination
{dstIP, dstport} pairs during t (use of “pairs” will
henceforth imply both source and destination {IP, port}
pairs). We seek pairs for which the number of distinct
connected IPs is equal to the number of distinct
connected ports. All pairs for which this equality holds
are considered P2P. In contrast, if the difference
between the numbers of connected IPs and ports for a
certain pair is large (e.g., larger than 10), we regard this
pair as nonP2P.
`
These two simple heuristics efficiently classify most pairs
as P2P or nonP2P. In particular, the {IP, port} heuristic
can effectively identify P2P and nonP2P pairs given a
sufficiently large sample of connections for the specific pair.
For example, with a time interval t of 5 minutes there are no
false positives for pairs with more than 20 connections in
our February 2004 trace (D11 of Table 1). That is, for this
specific trace, if an {IP, port} pair has more than 20 IPs
connected to it, we can classify it with high confidence as
P2P or nonP2P.
`
Whether a flow is considered P2P depends on the
classification of its {IP, port} pairs. If one of the pairs in
the 5-tuple flow key has been classified as P2P, this flow is
deemed P2P. Similarly, if one of the pairs is classified as
nonP2P, so is the flow. Additionally, if one of the IPs in a
flow has been found to match the TCP/UDP heuristic, the flow
is also considered P2P.
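
The pair and flow decisions can be sketched as below. The function names and data layout are hypothetical. The text does not say what happens when one pair of a flow is P2P and the other is nonP2P, so giving P2P precedence here is an assumption of this sketch, as is returning None for pairs and flows that neither rule decides.

```python
def classify_pair(n_ips, n_ports, max_diff=10):
    """Classify one {IP, port} pair from its distinct-IP and distinct-port counts."""
    if n_ips == n_ports:
        return "P2P"
    if abs(n_ips - n_ports) > max_diff:
        return "nonP2P"
    return None  # undecided; left to further heuristics


def classify_flow(flow, pair_class, tcp_udp_ips):
    """Classify a flow from its two {IP, port} pairs and the TCP/UDP heuristic.

    flow: (src_ip, dst_ip, proto, src_port, dst_port)
    pair_class: dict mapping (ip, port) -> "P2P" / "nonP2P" / None
    tcp_udp_ips: set of IPs that matched the TCP/UDP heuristic
    """
    src_ip, dst_ip, _proto, src_port, dst_port = flow
    classes = {pair_class.get((src_ip, src_port)),
               pair_class.get((dst_ip, dst_port))}
    # Assumed precedence: any P2P evidence wins over a nonP2P pair.
    if "P2P" in classes or src_ip in tcp_udp_ips or dst_ip in tcp_udp_ips:
        return "P2P"
    if "nonP2P" in classes:
        return "nonP2P"
    return None
```

With these definitions, a pair connected to 20 IPs through 20 ports is classified as P2P, and a pair connected to 20 IPs through 80 ports as nonP2P.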
`5.4 False positives
We now describe heuristics developed to decrease the risk
of false positives. Considering the diversity of backbone
links that feature a vast number of IPs and flows, we
expect the previous methodology to yield false positives,
i.e., to classify nonP2P pairs as P2P. False positives are
most common in pairs with few connections, and are also more
frequent for specific applications/protocols whose connection
behavior matches the P2P profile of our heuristics (e.g., one
connection per {IP, port} pair), such as e-mail (SMTP, POP),
DNS and gaming.
`To decrease the rate of false positives we review the con-
`nection and flow history of all pairs where the probability
`of a misclassification is high, e.g., the source or destination
`port is equal to 25 and implies SMTP. Past flow history for
`these pairs enables accurate classification by investigating
`properties of specific IPs. In the following subsections, we
`describe heuristics that augment our basic methodology to
`limit the magnitude of false positives.
`
`5.4.1 Mail
`In our data sets, e-mail protocols such as Simple Mail
`Transfer Protocol (SMTP) or Post Office Protocol (POP)
`contribute most false positives. Mail false positives are not
`surprising since connection behavior resembles our {IP, port}
`heuristic. However, analysis of mail flows and connection
`patterns allows for identification of mail servers in our traces,
`forestalling misidentification of traffic to such IP addresses
`as P2P.
`We examine all flows where one of the port numbers is
`equal to 25 (SMTP), 110 (POP) or 113 (authentication ser-
`vice commonly used by mail servers). In fact we treat these
`three port numbers as one (we consider ports 110 and 113
`equal to 25), since for our purpose their behavior is the same.
We identify mail servers based on their port usage history
and whether they have different flows during the same time
interval t that use port 25 as both source and destination
port. The following observed flow pattern illustrates this
characteristic behavior of mail servers by examining the
usage of port 25 by IP 238.30.35.43:
`
src IP        dst IP           proto  srcport  dstport
238.30.35.43  115.78.57.213    6      25       3267
238.30.35.43  238.45.242.104   6      22092    25
238.30.35.43  0.32.132.109     6      25       50827
238.30.35.43  71.199.74.68     6      22175    25
238.30.35.43  4.87.3.29        6      21961    25
238.30.35.43  4.87.3.29        6      22016    25
238.30.35.43  4.170.125.67     6      25       3301
238.30.35.43  5.173.60.126     6      22066    25
238.30.35.43  5.173.60.126     6      22067    25
238.30.35.43  227.186.155.214  6      22265    25
238.30.35.43  227.186.155.214  6      22266    25
238.30.35.43  5.170.237.207    6      25       3872

This case shows flows for IP 238.30.35.43⁶ with port 25
as source port for some flows and destination port for other
flows. This behavior is characteristic of mail servers that
initiate connections to other mail servers to propagate
e-mail messages. To identify this pattern, we monitor the set
of destination port numbers for each IP for which there
exists a source pair {IP,25}. If this set of destination port
numbers also contains port 25, we consider this IP a mail
server and classify all its flows as nonP2P. We apply the
same test to the set of source ports of an IP for which there
exists a destination pair {IP,25}. In the above example, for
the source pair {238.30.35.43,25}, the set of destination
ports is [3267, 25, 50827, 3301, 3872]. Since port 25 appears
in this set, we infer that IP 238.30.35.43 is a mail server
and deem all of its flows nonP2P. We keep all IPs identified
as mail servers in a mail server list to avoid future
application of our heuristics to them.

⁶ Note that IP addresses are anonymized.
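
The mail server test can be sketched as follows. The function name and the flow tuple layout are hypothetical, and the handling of the "destination pair" case (looking at the source ports of flows sent to an IP that also receives flows on port 25) is one reading of the description above rather than a confirmed detail. Ports 110 and 113 are mapped to 25, as stated earlier in this section.

```python
from collections import defaultdict

MAIL_PORTS = {25, 110, 113}  # SMTP, POP, authentication service


def _normalize(port):
    """Treat ports 110 and 113 as port 25."""
    return 25 if port in MAIL_PORTS else port


def find_mail_servers(flows):
    """Return the set of IPs whose port usage matches the mail server pattern.

    flows: iterable of (src_ip, dst_ip, proto, src_port, dst_port) tuples
    observed during one time interval.
    """
    src_own, src_remote = defaultdict(set), defaultdict(set)
    dst_own, dst_remote = defaultdict(set), defaultdict(set)
    for src_ip, dst_ip, _proto, src_port, dst_port in flows:
        src_port, dst_port = _normalize(src_port), _normalize(dst_port)
        src_own[src_ip].add(src_port)     # ports the IP uses as a source
        src_remote[src_ip].add(dst_port)  # destination ports it contacts
        dst_own[dst_ip].add(dst_port)     # ports the IP uses as a destination
        dst_remote[dst_ip].add(src_port)  # source ports that contact it
    servers = set()
    for ip in src_own:
        # A source pair {IP,25} exists and port 25 is also a destination port.
        if 25 in src_own[ip] and 25 in src_remote[ip]:
            servers.add(ip)
    for ip in dst_own:
        # A destination pair {IP,25} exists and port 25 is also a source port.
        if 25 in dst_own[ip] and 25 in dst_remote[ip]:
            servers.add(ip)
    return servers
```

Applied to the flows listed above, the source pair {238.30.35.43,25} exists and the destination ports contacted by that IP include 25, so the function returns that IP as a mail server.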
`
`5.4.2 DNS
`The Domain Name Serv