Toward the Accurate Identification of Network Applications

Andrew W. Moore¹ and Konstantina Papagiannaki²

¹ University of Cambridge, andrew.moore@cl.cam.ac.uk *
² Intel Research, Cambridge, dina.papagiannaki@intel.com
`
Abstract. Well-known port numbers can no longer be used to reliably identify network applications. There is a variety of new Internet applications that either do not use well-known port numbers or use other protocols, such as HTTP, as wrappers in order to go through firewalls without being blocked. One consequence of this is that a simple inspection of the port numbers used by flows may lead to the inaccurate classification of network traffic. In this work, we look at these inaccuracies in detail. Using a full payload packet trace collected from an Internet site we attempt to identify the types of errors that may result from port-based classification and quantify them for the specific trace under study. To address this question we devise a classification methodology that relies on the full packet payload. We describe the building blocks of this methodology and elaborate on the complications that arise in that context. A classification technique approaching 100% accuracy proves to be a labor-intensive process that needs to test flow-characteristics against multiple classification criteria in order to gain sufficient confidence in the nature of the causal application. Nevertheless, the benefits gained from a content-based classification approach are evident. We are capable of accurately classifying what would be otherwise classified as unknown as well as identifying traffic flows that could otherwise be classified incorrectly. Our work opens up multiple research issues that we intend to address in future work.
`
1 Introduction
`
Network traffic monitoring has attracted a lot of interest in the recent past. One of the main operations performed within such a context has to do with the identification of the different applications utilising a network’s resources. Such information proves invaluable for network administrators and network designers. Only knowledge about the traffic mix carried by an IP network can allow efficient design and provisioning. Network operators can identify the requirements of different users from the underlying infrastructure and provision appropriately. In addition, they can track the growth of different user populations and design the network to accommodate the diverse needs. Lastly, accurate identification of network applications can shed light on the emerging applications as well as possible mis-use of network resources.

* Andrew Moore thanks the Intel Corporation for its generous support of his research fellowship.

Ex. 1009
Juniper Networks, Inc. / Page 1 of 14

The state of the art in the identification of network applications through traffic monitoring relies on the use of well known ports: an analysis of the headers of packets is used to identify traffic associated with a particular port and thus of a particular application [1–3]. It is well known that such a process is likely to lead to inaccurate estimates of the amount of traffic carried by different applications given that specific protocols, such as HTTP, are frequently used to relay other types of traffic, e.g., the NeoTeris VLAN over HTTP product. In addition, emerging services typically avoid the use of well known ports, e.g., some peer-to-peer applications. This paper describes a method to address the accurate identification of network applications in the presence of packet payload information³. We illustrate the benefits of our method by comparing a characterisation of the same period of network traffic using ports-alone and our content-based method.

This comparison allows us to highlight how differences between port and content-based classification may arise. Having established the benefits of the proposed methodology, we proceed to evaluate the requirements of our scheme in terms of complexity and amount of data that needs to be accessed. We demonstrate the trade-offs that need to be addressed between the complexity of the different classification mechanisms employed by our technique and the resulting classification accuracy. The presented methodology is not automated and may require human intervention. Consequently, in future work we intend to study its requirements in terms of a real-time implementation.

The remainder of the paper is structured as follows. In Section 2 we present the data used throughout this work. In Section 3 we describe our content-based classification technique. Its application is shown in Section 4. The obtained results are contrasted against the outcome of a port-based classification scheme. In Section 5 we describe our future work.
`
2 Collected Data

This work presents an application-level approach to characterising network traffic. We illustrate the benefits of our technique using data collected by the high-performance network monitor described in [5].

The site we examined hosts several Biology-related facilities, collectively referred to as a Genome Campus. There are three institutions on-site that employ about 1,000 researchers, administrators and technical staff. This campus is connected to the Internet via a full-duplex Gigabit Ethernet link. It was on this connection to the Internet that our monitor was placed. Traffic was monitored for a full 24 hour, week-day period and for both link directions.
`
³ Packet payload for the identification of network applications is also used in [4]. Nonetheless, no specific details are provided by [4] on the implementation of the system, thus making comparison infeasible. No further literature was found by the authors regarding that work.
`
                Total Packets    Total MBytes
Total             573,429,697         268,543
As percentage of Total
TCP                    94.819          98.596
ICMP                    3.588           0.710
UDP                     1.516           0.617
OTHER                   0.077           0.077
Table 1. Summary of traffic analysed
`
Brief statistics on the traffic data collected are given in Table 1. Other protocols were observed in the trace, namely IPv6-crypt, PIM, GRE, IGMP, NARP and private encryption, but the largest of them accounted for fewer than one million packets (less than 0.06%) over the 24 hour period and the total of all OTHER protocols was fewer than one and a half million packets. All percentage values given henceforth are from the total of UDP and TCP packets only.
`
3 Methodology

3.1 Overview of Content-based classification

Our content-based classification scheme can be viewed as an iterative procedure whose target is to gain sufficient confidence that a particular traffic stream is caused by a specific application. To achieve such a goal our classification method operates on traffic flows and not packets. Grouping packets into flows allows for more-efficient processing of the collected information as well as the acquisition of the necessary context for an appropriate identification of the network application responsible for a flow. Obviously, the first step we need to take is that of aggregating packets into flows according to their 5-tuple. In the case of TCP, additional semantics can also allow for the identification of the start and end time of the flow. The fact that we observe traffic in both directions allows classification of nearly all flows on the link. A traffic monitor on a unidirectional link can identify only those applications that use the monitored link for their datapath.

One outcome of this operation is the identification of unusual or peculiar flows, specifically simplex flows. These flows consist of packets exchanged between a particular port/protocol combination in only one direction between two hosts. A common cause of a simplex flow is that packets have been sent to an invalid or non-responsive destination host. The data of the simplex flows were not discarded; they were classified, and were commonly identified as carrying worm and virus attacks. The identification and removal of simplex flows (each flow consisting of between three and ten packets sent over a 24-hour period) allowed the number of unidentified flows that needed further processing to be significantly reduced.
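The aggregation step and the simplex test described above can be sketched as follows. This is only an illustration under stated assumptions: the `Packet` record, its field names and the helper names are hypothetical, and the text does not say how the authors' monitor implements this step. Packets are keyed by a direction-independent 5-tuple, and a flow is treated as simplex when packets were seen in only one direction.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    """Hypothetical minimal packet record; field names are illustrative."""
    src: str
    sport: int
    dst: str
    dport: int
    proto: str


def flow_key(pkt: Packet):
    """Return a direction-independent 5-tuple key and the packet's direction."""
    a = (pkt.src, pkt.sport)
    b = (pkt.dst, pkt.dport)
    # Order the two endpoints so that both directions map to the same key.
    if a <= b:
        return (a, b, pkt.proto), "fwd"
    return (b, a, pkt.proto), "rev"


def aggregate(packets):
    """Group packets into bidirectional flows, counting packets per direction."""
    flows = defaultdict(lambda: {"fwd": 0, "rev": 0})
    for pkt in packets:
        key, direction = flow_key(pkt)
        flows[key][direction] += 1
    return dict(flows)


def simplex_flows(flows):
    """Return the keys of flows that carried packets in only one direction."""
    return [k for k, c in flows.items() if c["fwd"] == 0 or c["rev"] == 0]


# Example: one answered connection and one unanswered probe.
packets = [
    Packet("10.0.0.1", 40000, "192.0.2.7", 80, "tcp"),
    Packet("192.0.2.7", 80, "10.0.0.1", 40000, "tcp"),
    Packet("10.0.0.9", 40001, "192.0.2.99", 445, "tcp"),
]
flows = aggregate(packets)
simplex = simplex_flows(flows)
```

In this toy input the first two packets form one bidirectional flow, while the third packet has no reply and is therefore reported as simplex.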
`
The second step of our method iteratively tests flow characteristics against different criteria until sufficient certainty has been gained as to the identity of the application. Such a process consists of nine different identification sub-methods. We describe these mechanisms in the next section. Each identification sub-method is followed by the evaluation of the acquired certainty in the candidate application. Currently this is a (labour-intensive) manual process.

3.2 Identification Methods

The nine distinct identification methods applied by our scheme are listed in Table 2. Alongside each method is an example application that we could identify using this method. Each one tests a particular property of the flow attempting to obtain evidence of the identity of the causal application.
`
      Identification Method               Example
I     Port-based classification (only)    -
II    Packet Header (including I)         simplex flows
III   Single packet signature             Many worm/virus
IV    Single packet protocol              IDENT
V     Signature on the first KByte        P2P
VI    first KByte Protocol                SMTP
VII   Selected flow(s) Protocol           FTP
VIII  (All) Flow Protocol                 VNC, CVS
IX    Host history                        Port-scanning
Table 2. Methods of flow identification.
`
Method I classifies flows according to their port numbers. This method represents the state of the art and requires access only to the part in the packet header that contains the port numbers. Method II relies on access to the entire packet header for both traffic directions. It is this method that is able to identify simplex flows and significantly limit the number of flows that need to go through the remainder of the classification process. Methods III to VIII examine whether a flow carries a well-known signature or follows well-known protocol semantics. Such operations are accompanied by higher complexity and may require access to more than a single packet’s payload. We have listed the different identification mechanisms in terms of their complexity and the amount of data they require in Figure 1. According to our experience, specific flows may be classified positively from their first packet alone. Nonetheless, other flows may need to be examined in more detail and a positive identification may be feasible once up to 1 KByte of their data has been observed⁴. Flows that have not been classified at this stage will require inspection of the entire flow payload and we separate such a process into two distinct steps. In the first step (Method VII) we perform full-flow analysis for a subset of the flows that perform a control-function. In our case FTP appeared to carry a significant amount of the overall traffic and Method VII was applied only to those flows that used the standard FTP control port. The control messages were parsed and further context was obtained that allowed us to classify more flows in the trace. Lastly, if there are still flows to be classified, we analyse them using specific protocol information attributing them to their causal application using Method VIII.

⁴ The value of 1 KByte has been experimentally found to be an upper bound for the amount of packet information that needs to be processed for the identification of several applications making use of signatures. In future work, we intend to address the exact question of what is the necessary amount of payload one needs to capture in order to identify different types of applications.

[Figure 1: identification methods I to VIII placed by complexity (Port, Signature, Protocol) against amount of data (Packet, 1st KByte, Flow (selected), Flow (all)).]
Fig. 1. Requirements of identification methods.

In our classification technique we will apply each identification method in turn and in such a way that the more-complex or more-data-demanding methods (as shown in Figure 1) are used only if no previous signature or protocol method has generated a match. The outcome of this process may be that (i) we have positively identified a flow to belong to a specific application, (ii) a flow appears to agree with more than one application profile, or (iii) no candidate application has been identified. In our current methodology all three cases will trigger manual intervention in order to validate the accuracy of the classification, resolve cases where multiple criteria have generated a match or inspect flows that have not matched any identification criteria. We describe our validation approach in more detail in Section 3.4.
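The ordering rule and the three possible outcomes described above can be sketched as a simple cascade. This is a hedged illustration rather than the authors' implementation: the flow record is assumed to be a plain dictionary, and the stage functions, field names and labels are hypothetical. Each stage returns a set of candidate labels, and a later (more expensive) stage runs only when every earlier stage returned nothing.

```python
from typing import Callable, List, Set, Tuple

# A stage pairs a descriptive name with a function returning candidate labels.
Stage = Tuple[str, Callable[[dict], Set[str]]]


def classify(flow: dict, stages: List[Stage]) -> Tuple[str, str, Set[str]]:
    """Run stages cheapest-first; stop at the first stage that yields a match."""
    for name, candidates_for in stages:
        labels = candidates_for(flow)
        if len(labels) == 1:
            return ("identified", name, labels)   # outcome (i)
        if len(labels) > 1:
            return ("ambiguous", name, labels)    # outcome (ii)
    return ("unmatched", "", set())               # outcome (iii)


# Toy stages over a hypothetical flow dict (assumed fields, not from the text).
def first_packet_stage(flow: dict) -> Set[str]:
    return {"ssh"} if flow["first_packet"].startswith(b"SSH-") else set()


def first_kbyte_stage(flow: dict) -> Set[str]:
    labels = set()
    if b"HTTP/1." in flow["first_kbyte"]:
        labels.add("www")
    if b"BitTorrent protocol" in flow["first_kbyte"]:
        labels.add("p2p")
    return labels


STAGES = [("first packet", first_packet_stage), ("first KByte", first_kbyte_stage)]
```

In the methodology as described, all three outcomes are still passed to manual validation; the returned outcome only records which kind of review is needed.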
`
The successful identification of specific flows caused by a particular network application reveals important information about the hosts active in our trace. Our technique utilises this information to build a knowledge base for particular host/port combinations that can be used to validate future classification by testing conformance with already-observed host roles (Method IX). One outcome of this operation is the identification of hosts performing port scanning where a particular destination host is contacted from the same source host on many sequential port numbers. These flows evidently do not belong to a particular application (unless port scanning is part of the applications looked into). For a different set of flows, this process validated the streaming audio from a pool of machines serving a local broadcaster.
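The port-scanning pattern described above (one source contacting one destination on many sequential ports) can be detected from the accumulated host history with a sketch like the one below. The input layout, the function name and the `min_run` threshold are assumptions made for illustration; the text does not give a threshold.

```python
from collections import defaultdict


def find_port_scans(flows, min_run=10):
    """Return (src, dst) pairs where src contacted dst on a run of at least
    min_run consecutive destination ports.

    flows is an iterable of (src, dst, dport) tuples; min_run is an
    illustrative threshold, not a value taken from the text.
    """
    ports = defaultdict(set)
    for src, dst, dport in flows:
        ports[(src, dst)].add(dport)

    scanners = []
    for pair, seen in ports.items():
        run = longest = 0
        prev = None
        for p in sorted(seen):
            # Extend the current run only when the port follows the previous one.
            run = run + 1 if prev is not None and p == prev + 1 else 1
            longest = max(longest, run)
            prev = p
        if longest >= min_run:
            scanners.append(pair)
    return scanners


# Example: host "a" probes ports 20..34 on "b"; host "c" makes two ordinary requests.
history = [("a", "b", p) for p in range(20, 35)] + [("c", "b", 80), ("c", "b", 443)]
suspects = find_port_scans(history)
```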
Method IX can be further enhanced to use information from the host name as recorded in the DNS. While we used this as a process-of-last-resort (DNS names can be notoriously un-representative), DNS names in our trace did reveal the presence of an HTTP proxy, a Mail exchange server and a VPN endpoint operating over a TCP/IP connection.
`
3.3 Classification Approach

An illustration of the flow through the different identification sub-methods, as employed by our approach, is shown in Figure 2. In the first step we attempt to reduce the number of flows to be further processed by using context obtained through previous iterations. Specific flows in our data can be seen as "child" connections arising from "parent" connections that precede them. One such example is a web browser that initiates multiple connections in order to retrieve parts of a single web page. Having parsed the "parent" connection allows us to immediately identify the "child" connections and classify them to the causal web application.
`
[Figure 2: flowchart of the classification procedure. From START, each flow is first tested for being the result of another application; if so, it goes to verification. Otherwise flows with known ports are tagged, and the flow is tested in turn for a "well known" signature in the 1st packet (III), a "well known" protocol in the 1st packet (IV), a "well known" signature in the 1st KByte (V), a "well known" protocol in the 1st KByte (VI), a known protocol in selected flows (VII) and a known protocol in all flows (VIII). A YES at any test leads to VERIFY (using, among other mechanisms, IX). A passed verification leads to STOP; a failed verification, or a NO at the last test, leads to Manual Intervention.]
Fig. 2. Classification procedure.
`
A second example, that has a predominant effect in our data, is passive FTP. Parsing the "parent" FTP session (Method VIII) allows the identification of the subsequent "child" connection that may be established toward a different host at a non-standard port. Testing whether a flow is the result of an already-classified flow at the beginning of the classification process allows for the fast characterisation of a network flow without the need to go through the remainder of the process.
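One way the context for such "child" connections could be obtained is by reading the server's reply to a PASV command on the control connection. The sketch below assumes the usual reply form from RFC 959, `227 ... (h1,h2,h3,h4,p1,p2)`, where the data port is p1*256 + p2. The function names are hypothetical, and the paper does not describe its own parser.

```python
import re

# Matches the six comma-separated numbers of a 227 reply (RFC 959 reply format).
_PASV_RE = re.compile(
    rb"^227 .*?(\d{1,3}),(\d{1,3}),(\d{1,3}),(\d{1,3}),(\d{1,3}),(\d{1,3})"
)


def parse_pasv_reply(line: bytes):
    """Return the (host, port) announced by a 227 reply, or None otherwise."""
    m = _PASV_RE.match(line)
    if m is None:
        return None
    nums = [int(g) for g in m.groups()]
    if any(n > 255 for n in nums):
        return None  # not a valid address/port encoding
    host = ".".join(str(n) for n in nums[:4])
    port = nums[4] * 256 + nums[5]
    return host, port


def expected_children(control_lines):
    """Collect the endpoints that later data flows of this session may use."""
    return {ep for ep in map(parse_pasv_reply, control_lines) if ep is not None}


# Example control-channel lines (invented for illustration).
session = [
    b"230 Login successful.\r\n",
    b"227 Entering Passive Mode (192,0,2,7,195,80)\r\n",
]
children = expected_children(session)
```

A later flow whose server-side endpoint appears in `children` can then be attributed to the FTP session without passing through the remaining tests.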
If the flow is not positively identified in the first stage then it goes through several additional classification criteria. The first mechanism examines whether a flow uses a well-known port number. While port-based classification is prone to error, the port number is still a useful input into the classification process because it may convey useful information about the identity of the flow. If no well-known port is used, the classification proceeds through the next stages. However, even in the case when a flow is found to operate on a well-known port, it is tagged as well-known but still forwarded through the remainder of the classification process.

In the next stage we test whether the flow contains a known signature in its first packet. At this point we will be able to identify flows that may be directed to well-known port numbers but carry non-legitimate traffic as in the case of virus or attack traffic. Signature-scanning is a process that sees common use within Intrusion Detection Systems such as snort [6]. It has the advantage that a suitable scanner is often optimised for string-matching while still allowing the expression of flexible matching criteria. By scanning for signatures, applications such as web-servers operating on non-standard ports may be identified.
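A minimal sketch of signature scanning by static prefix matching is given below. The signature table is illustrative only: the byte prefixes are well-known protocol openings (an SSH identification string, an HTTP status line or GET request, the BitTorrent handshake, a Gnutella connect line), not the signatures used by the authors, and a real scanner such as snort expresses far richer rules. The category labels follow Table 3; the function names and the 1024-byte default are assumptions.

```python
# Illustrative (prefix, category) pairs; not the authors' signature set.
PREFIX_SIGNATURES = [
    (b"SSH-", "INTERACTIVE"),
    (b"HTTP/1.", "WWW"),
    (b"GET ", "WWW"),
    (b"\x13BitTorrent protocol", "P2P"),
    (b"GNUTELLA CONNECT/", "P2P"),
]


def first_kbyte(payloads, limit=1024):
    """Concatenate the in-order payloads of one direction, cut to limit bytes."""
    buf = bytearray()
    for p in payloads:
        buf += p
        if len(buf) >= limit:
            break
    return bytes(buf[:limit])


def match_signatures(data: bytes):
    """Return every category whose prefix matches the start of the data."""
    return {label for prefix, label in PREFIX_SIGNATURES if data.startswith(prefix)}


# A web server answering on a non-standard port is still recognised,
# because the test looks at the payload and never at the port number.
reply = first_kbyte([b"HTTP/1.1 200", b" OK\r\n"])
labels = match_signatures(reply)
```

Passing only the first payload to `match_signatures` corresponds to a single-packet test, while `first_kbyte` covers the case where the signature spans several packets.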
If no known signature has been found in the first packet we check whether the first packet of the flow conveys semantics of a well-known protocol. An example to that effect is IDENT which is a single packet IP protocol. If this test fails we look for well-known signatures in the first KByte of the flow, which may require assembly of multiple individual packets. At this stage we will be able to identify peer-to-peer traffic if it uses well known signatures. Traffic due to SMTP will have been detected from the port-based classification but only the examination of the protocol semantics within the first KByte of the flow will allow for the confident characterisation of the flow. Network protocol analysis tools, such as ethereal [7], employ a number of such protocol decoders and may be used to make or validate a protocol identification.

Specific flows will still remain unclassified even at this stage and will require inspection of their entire payload. This operation may be manual or automated for particular protocols. From our experience, focusing on the protocol semantics of FTP led to the identification of a very significant fraction of the overall traffic, limiting the unknown traffic to less than 2%. At this point the classification procedure can end. However, if 100% accuracy is to be approached we envision that the last stage of the classification process may involve the manual inspection of all unidentified flows. This stage is rather important since it is likely to reveal new applications. While labour-intensive, the individual examination of the remaining, unidentified, flows caused the creation of a number of new signatures and protocol-templates that were then able to be used for identifying protocols such as PCAnywhere, the sdserver and CVS. This process also served to identify more task-specific systems. An example of this was a host offering protocol-specific database services.

On occasion flows may remain unclassified despite this process; this takes the form of small samples (e.g., 1–2 packets) of data that do not provide enough information to allow any classification process to proceed. These packets used unrecognised ports and rarely carried any payload. While such background noise was not zero, in the context of classification for accounting, Quality-of-Service, or resource planning these amounts could be considered insignificant. The actual amount of data in terms of either packets or bytes that remained unclassified represented less than 0.001% of the total.
`
3.4 Validation Process

Accurate classification is complicated by the unusual use to which some protocols are put. As noted earlier, the use of one protocol to carry another, such as the use of HTTP to carry peer-to-peer application traffic, will confuse a simple signature-based classification system. Additionally, the use of FTP to carry an HTTP transaction log will similarly confuse signature matching.

Due to these unusual cases, establishing the certainty of any classification appears to be a difficult task. Throughout the work presented in this paper validation was performed manually in order to approach 100% accuracy in our results. Our validation approach features several distinct methods.

Each flow is tested against multiple classification criteria. If this procedure leads to several criteria being satisfied simultaneously, manual intervention can allow for the identification of the true causal application. An example is the peer-to-peer situation. Identifying a flow as HTTP does not suggest anything more than that the flow contains HTTP signatures. After applying all classification methods we may conclude that the flow is HTTP alone, or additional signature-matching (e.g. identifying a peer-to-peer application) may indicate that the flow is the result of a peer-to-peer transfer.

If the flow classification results from a well-known protocol, then the validation approach tests the conformance of the flow to the actual protocol. An example of this procedure is the identification of FTP PASV flows. A PASV flow can be valid only if the FTP control-stream overlaps the duration of the PASV flow; such cursory, protocol-based, examination allows an invalid classification to be identified. Alongside this process, flows can be further validated against the perceived function of a host, e.g., an identified router would be valid to relay BGP whereas for a machine identified as (probably) a desktop Windows box behind a NAT, concluding it was transferring BGP is unlikely and this potentially invalid classification requires manual-intervention.
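The PASV check described above amounts to an interval-overlap test on flow start and end times. The sketch below is an assumption-laden illustration: the `Interval` type, the function names and the use of plain floating-point timestamps are hypothetical, and "overlaps" is read literally as the two time intervals sharing at least one instant.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """Start and end time of a flow, in seconds (hypothetical representation)."""
    start: float
    end: float


def overlaps(a: Interval, b: Interval) -> bool:
    """True when the two closed intervals share at least one instant."""
    return a.start <= b.end and b.start <= a.end


def pasv_flow_is_plausible(control: Interval, data: Interval) -> bool:
    """A candidate PASV data flow is kept only if the FTP control stream
    overlaps its duration; otherwise it is flagged for manual review."""
    return overlaps(control, data)


control = Interval(0.0, 100.0)
during = pasv_flow_is_plausible(control, Interval(10.0, 20.0))    # True
after = pasv_flow_is_plausible(control, Interval(150.0, 160.0))   # False
```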
`
4 Results

Given the large number of identified applications, and for ease of presentation, we group applications into types according to their potential requirements from the network infrastructure. Table 3 indicates ten such classes of traffic. Importantly, the characteristics of the traffic within each category are not necessarily unique. For example, the BULK category, which is made up of ftp traffic, consists of both the ftp control channel, with data in both directions, and the ftp data channel, which consists of a simplex flow of data for each object transferred.
`
Classification   Example Application
BULK             ftp
DATABASE         postgres, sqlnet, oracle, ingres
INTERACTIVE      ssh, klogin, rlogin, telnet
MAIL             imap, pop2/3, smtp
SERVICES         X11, dns, ident, ldap, ntp
WWW              www
P2P              KaZaA, BitTorrent, GnuTella
MALICIOUS        Internet worm and virus attacks
GAMES            Half-Life
MULTIMEDIA       Windows Media Player, Real
Table 3. Network traffic allocated to each category
`
In Table 4 we compare the results of simple port-based classification with content-based classification. The technique of port-analysis, against which we compare our approach, is common industry practice (e.g., Cisco NetFlow or [1, 2]). UNKNOWN refers to applications which for port-based analysis are not readily identifiable. Notice that under the content-based classification approach we had nearly no UNKNOWN traffic; instead we have 5 new traffic-classes detected. The traffic we were not able to classify corresponds to a small number of flows. A limited number of flows provides a minimal sample of the application behavior and thus cannot allow for the confident identification of the causal application.

Table 4 shows that under the simple port-based classification scheme based upon the IANA port assignments 30% of the carried bytes cannot be attributed to a particular application. Further observation reveals that the BULK traffic is underestimated by approximately 20% while we see a difference of 6% in the WWW traffic. However, the port-based approach does not only under-estimate traffic but for some classes, e.g., INTERACTIVE applications, it may over-estimate it. This means that traffic flows can also be misidentified under the port-based technique. Lastly, applications such as peer-to-peer and mal-ware appear to contribute zero traffic in the port-based case. This is due to the port through which such protocols travel not providing a standard identification. Such port-based estimation errors are believed to be significant.
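As a quick arithmetic check of the figures quoted above, the sketch below recomputes the differences from the Bytes columns of Table 4. Reading the quoted percentages as percentage-point differences between the two columns is an assumption made here, and the dictionary and function names are hypothetical.

```python
# Byte percentages copied from Table 4: (port-based, content-based).
BYTES_PCT = {
    "BULK": (45.00, 64.54),
    "WWW": (20.40, 27.30),
    "INTERACTIVE": (0.43, 0.39),
}


def difference(cls: str) -> float:
    """Content-based minus port-based share, in percentage points."""
    port, content = BYTES_PCT[cls]
    return round(content - port, 2)


bulk_gap = difference("BULK")                 # positive: port-based under-estimates
www_gap = difference("WWW")                   # positive: port-based under-estimates
interactive_gap = difference("INTERACTIVE")   # negative: port-based over-estimates
```

A positive value means the port-based scheme under-estimates the class, and a negative value means it over-estimates it, which is the case for INTERACTIVE.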
`
Classification            Port-Based       Content-Based
Type               Packets     Bytes   Packets     Bytes
                   As a percentage of total traffic
BULK                 46.97     45.00     65.06     64.54
DATABASE              0.03      0.03      0.84      0.76
GRID                  0.03      0.07      0.00      0.00
INTERACTIVE           1.19      0.43      0.75      0.39
MAIL                  3.37      3.62      3.37      3.62
SERVICES              0.07      0.02      0.29      0.28
WWW                  19.98     20.40     26.49     27.30
UNKNOWN              28.36     30.43     <0.01     <0.01

MALICIOUS                -         -      1.10      1.17
IRC/CHAT                 -         -      0.44      0.05
P2P                      -         -      1.27      1.50
GAMES                    -         -      0.17      0.18
MULTIMEDIA               -         -      0.22      0.21
Table 4. Contrasting port-based and Content-based classification.
`
4.1 Examining Under and Over-estimation

Of the results in Table 4 we will concentrate on only a few example situations. The first and most dominant difference is for BULK, the traffic created as a result of FTP. The reason is that port-based classification will not be able to correctly identify a large class of (FTP) traffic transported using the PASV mechanism. Content-based classification is able to identify the causal relationship between the FTP control flow and any resulting data-transport. This means that traffic that was formerly either of unknown origin or incorrectly classified may be ascribed to FTP which is a traffic source that will be consistently underestimated by port-based classification.

A comparison of values for MAIL, a category consisting of the SMTP, IMAP, MAPI and POP protocols, reveals that it is estimated with surprising accuracy in both cases. Both the number of packets and the number of bytes transferred are unchanged between the two classification techniques. We also did not find any other non-MAIL traffic present on MAIL ports. We would assert that the reason MAIL is found exclusively on the commonly defined ports, while no other MAIL transactions are found on other ports, is that MAIL must be exchanged with other sites and other hosts. MAIL relies on common, Internet-wide standards for port and protocol assignment. No single site could arbitrarily change the ports on which MAIL is exchanged without effectively cutting itself off from exchanges with other Internet sites. Therefore, MAIL is a traffic source that, for quantifying traffic exchanged with other sites at least, may be accurately estimated by port-based classification.

Despite the fact that such an effect was not pronounced in the analysed data set, port-based classification can also lead to over-estimation of the amount of traffic carried by a particular application. One reason is that mal-ware or attack traffic may use the well-known ports of a particular service, thus inflating the amount of traffic attributed to that application. In addition, if a particular application uses another application as a relay, then the traffic attributed to the latter will be inflated by the amount of traffic of the former. An example of such a case is peer-to-peer traffic using HTTP to avoid blocking by firewalls, an effect that was not present in our data. In fact, we notice that under the content-based approach we can attribute more traffic to WWW since our data included web servers operating on non-standard ports that could not be detected under the port-based approach.

Clearly this work leads to an obvious question of how we know that our content-based method is correct. We would emphasise that it was only through the labour-intensive examining of all data-flows along with numerous exchanges with system administrators and users of the examined site that we were able to arrive at a system of sufficient accuracy. We do not consider that such a laborious process would need to be repeated for the analysis of similar traffic profiles. However, the identification of new types of applications will require a more limited examination of a future, unclassifiable anomaly.
`
4.2 Overheads of content-based analysis

Alongside a presentation of the effectiveness of the content-based method we present the overheads this method incurs. For our study we were able to iterate through traffic multiple times, studying data for many months after its collection. Clearly, such a labour-intensive approach would not be suitable if it were to be used as part of real-time operator feedback.

We emphasise that while performing this work, we built a considerable body of knowledge applicable to future studies. The data collected for one monitor can be reapplied for future collections made at that location. Additionally, while specific host information may quickly become out-of-date, the techniques for identifying applications through signatures and protocol-fitting continue to be applicable. In this way historical data becomes an a-priori that can assist in the decision-making process of the characterisation for each analysis of the future.

Table 5 indicates the relationship between the complexity of analysis and the quantity of data we could positively identify; items are ordered in the table by increasing level of complexity. The Method column refers to methods listed in Table 2 in Section 3.

Currently our method employs packet-header analysis and host-profile construction for all levels of complexity. Signature matching is easier to implement and perform than protocol matching due to its application of static string matching. Analysis that is based upon a single packet (the first packet) is inherently less complex than analysis based upon (up to) the first KByte. The first KByte may require reassembly from the payload of multiple packets. Finally, any form of flow-analysis is complicated although this will clearly reduce the overheads of analysis if the number of flows that require parsing is limited.
`
Method                                    UNKNOWN Data (%)   Correctly Identified (%)
                                          Packets     Bytes      Packets     Bytes
I                                           28.36     30.44        71.03     69.27
I, II, IX                                   27.35     30.33        72.05     69.38
I, II, III, IX                              27.35     30.32        72.05     69.39
I, II, III, IV, IX                          27.12     30.09        72.29     69.62
I, II, III, IV, V, IX                       25.72     28.43        74.23     71.48
I, II, III, IV, V, VI, IX                   19.11     21.07        80.84     78.84
I, II, III, IV, V, VI, VII, IX               1.07      1.22        98.94     98.78
I, II, III, IV, V, VI, VII, VIII, IX        <0.01     <0.01       >99.99    >99.99
Table 5. Analysis method compared against percentage of UNKNOWN and correctly identified data.
Table 5 clearly illustrates the accuracy achieved by applying successively-more-compl
