Toward the Accurate Identification of Network Applications
Andrew W. Moore¹ and Konstantina Papagiannaki²
¹ University of Cambridge
² Intel Research, Cambridge
Abstract. Well-known port numbers can no longer be used to reliably identify
network applications. There is a variety of new Internet applications that
either do not use well-known port numbers or use other protocols, such as
HTTP, as wrappers in order to go through firewalls without being blocked. One
consequence of this is that a simple inspection of the port numbers used by
flows may lead to the inaccurate classification of network traffic. In this
work, we look at these inaccuracies in detail. Using a full payload packet
trace collected from an Internet site we attempt to identify the types of
errors that may result from port-based classification and quantify them for
the specific trace under study. To address this question we devise a
classification methodology that relies on the full packet payload. We describe
the building blocks of this methodology and elaborate on the complications
that arise in that context. A classification technique approaching 100%
accuracy proves to be a labour-intensive process that needs to test flow
characteristics against multiple classification criteria in order to gain
sufficient confidence in the nature of the causal application. Nevertheless,
the benefits gained from a content-based classification approach are evident.
We are capable of accurately classifying what would be otherwise classified as
unknown as well as identifying traffic flows that could otherwise be
classified incorrectly. Our work opens up multiple research issues that we
intend to address in future work.
1 Introduction
Network traffic monitoring has attracted a lot of interest in the recent past.
One of the main operations performed within such a context has to do with the
identification of the different applications utilising a network's resources.
Such information proves invaluable for network administrators and network
designers. Only knowledge about the traffic mix carried by an IP network can
allow efficient design and provisioning. Network operators can identify the
requirements of different users from the underlying infrastructure and
provision appropriately. In addition, they can track the growth of different
user populations and design the network to accommodate the diverse needs.
Lastly, accurate identification of network applications can shed light on the
emerging applications as well as possible mis-use of network resources.
* Andrew Moore thanks the Intel Corporation for its generous support of his research.
The state of the art in the identification of network applications through
traffic monitoring relies on the use of well-known ports: an analysis of the
headers of packets is used to identify traffic associated with a particular
port and thus with a particular application [1-3]. It is well known that such
a process is likely to lead to inaccurate estimates of the amount of traffic
carried by different applications given that specific protocols, such as HTTP,
are frequently used to relay other types of traffic, e.g., the NeoTeris VLAN
over HTTP product. In addition, emerging services typically avoid the use of
well-known ports, e.g., some peer-to-peer applications. This paper describes a
method to address the accurate identification of network applications in the
presence of packet payload information³. We illustrate the benefits of our
method by comparing a characterisation of the same period of network traffic
using ports alone and our content-based method.
This comparison allows us to highlight how differences between port and
content-based classification may arise. Having established the benefits of the
proposed methodology, we proceed to evaluate the requirements of our scheme in
terms of complexity and amount of data that needs to be accessed. We
demonstrate the trade-offs that need to be addressed between the complexity of
the different classification mechanisms employed by our technique and the
resulting classification accuracy. The presented methodology is not automated
and may require human intervention. Consequently, in future work we intend to
study its requirements in terms of a real-time implementation.
The remainder of the paper is structured as follows. In Section 2 we present
the data used throughout this work. In Section 3 we describe our content-based
classification technique. Its application is shown in Section 4. The obtained
results are contrasted against the outcome of a port-based classification
scheme. In Section 5 we describe our future work.
2 Collected Data
This work presents an application-level approach to characterising network
traffic. We illustrate the benefits of our technique using data collected by
the high-performance network monitor described in [5].
The site we examined hosts several Biology-related facilities, collectively
referred to as a Genome Campus. There are three institutions on-site that
employ about 1,000 researchers, administrators and technical staff. This
campus is connected to the Internet via a full-duplex Gigabit Ethernet link.
It was on this connection to the Internet that our monitor was placed. Traffic
was monitored for a full 24-hour, week-day period and for both link directions.
³ Packet payload for the identification of network applications is also used
in [4]. Nonetheless, no specific details are provided by [4] on the
implementation of the system, thus making comparison infeasible. No further
literature was found by the authors regarding that work.
Packets MBytes
Total 573,429,697
As percentage of Total
Table 1. Summary of traffic analysed
Brief statistics on the traffic data collected are given in Table 1. Other
protocols were observed in the trace, namely IPv6-crypt, PIM, GRE, IGMP, NARP
and private encryption, but the largest of them accounted for fewer than one
million packets (less than 0.06%) over the 24-hour period and the total of all
OTHER protocols was fewer than one and a half million packets. All percentage
values given henceforth are from the total of UDP and TCP packets only.
3 Methodology
3.1 Overview of Content-based classification
Our content-based classification scheme can be viewed as an iterative
procedure whose target is to gain sufficient confidence that a particular
traffic stream is caused by a specific application. To achieve such a goal our
classification method operates on traffic flows and not packets. Grouping
packets into flows allows for more efficient processing of the collected
information as well as the acquisition of the necessary context for an
appropriate identification of the network application responsible for a flow.
The first step we need to take is therefore that of aggregating packets into
flows according to their 5-tuple. In the case of TCP, additional semantics can
also allow for the identification of the start and end time of the flow. The
fact that we observe traffic in both directions allows classification of
nearly all flows on the link. A traffic monitor on a unidirectional link can
identify only those applications that use the monitored link for their
One outcome of this operation is the identification of unusual or peculiar
flows, specifically simplex flows. These flows consist of packets exchanged
between a particular port/protocol combination in only one direction between
two hosts. A common cause of a simplex flow is that packets have been sent to
an invalid or non-responsive destination host. The data of the simplex flows
were not discarded; they were classified, commonly as carrying worm and virus
attacks. The identification and removal of simplex flows (each flow consisting
of between three and ten packets sent over a 24-hour period) allowed the
number of unidentified flows that needed further processing to be
significantly reduced.
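The flow aggregation and simplex-flow detection described above can be sketched as follows. This is a minimal illustration in Python, not the monitor's implementation: the `Packet` record, the direction-independent key and the function names are all assumptions made for the example.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    # Hypothetical minimal packet record: only the 5-tuple fields.
    src: str
    sport: int
    dst: str
    dport: int
    proto: str  # e.g. "tcp" or "udp"


def flow_key(pkt):
    """Direction-independent 5-tuple, so both directions share one key."""
    a = (pkt.src, pkt.sport)
    b = (pkt.dst, pkt.dport)
    return (pkt.proto,) + (a + b if a <= b else b + a)


def aggregate(packets):
    """Group packets into bidirectional flows, counting packets per direction."""
    flows = defaultdict(lambda: defaultdict(int))
    for pkt in packets:
        flows[flow_key(pkt)][(pkt.src, pkt.sport)] += 1
    return flows


def simplex_flows(flows):
    """Keys of flows in which packets were seen in one direction only."""
    return [key for key, directions in flows.items() if len(directions) == 1]
```

Keying on a canonical ordering of the two endpoints is one simple way to exploit the fact that both link directions are observed; a flow with a single direction entry is then a simplex candidate.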
The second step of our method iteratively tests flow characteristics against
different criteria until sufficient certainty has been gained as to the
identity of the application. Such a process consists of nine different
identification sub-methods. We describe these mechanisms in the next section.
Each identification sub-method is followed by the evaluation of the acquired
certainty in the candidate application. Currently this is a (labour-intensive)
manual process.
3.2 Identification Methods
The nine distinct identification methods applied by our scheme are listed in
Table 2. Alongside each method is an example application that we could
identify using this method. Each one tests a particular property of the flow,
attempting to obtain evidence of the identity of the causal application.
      Identification Method               Example
I     Port-based classification (only)    -
II    Packet Header (including I)         simplex flows
III   Single packet signature             Many worm/virus
IV    Single packet protocol
V     Signature on the first KByte        P2P
VI    first KByte Protocol
VII   Selected flow(s) Protocol
VIII  (All) Flow Protocol
IX    Host history
Table 2. Methods of flow identification.
Method I classifies flows according to their port numbers. This method
represents the state of the art and requires access only to the part of the
packet header that contains the port numbers. Method II relies on access to
the entire packet header for both traffic directions. It is this method that
is able to identify simplex flows and significantly limit the number of flows
that need to go through the remainder of the classification process. Methods
III to VIII examine whether a flow carries a well-known signature or follows
well-known protocol semantics. Such operations are accompanied by higher
complexity and may require access to more than a single packet's payload. We
have listed the different identification mechanisms in terms of their
complexity and the amount of data they require in Figure 1. According to our
experience, specific flows may be classified positively from their first
packet alone. Nonetheless, other flows may need to be examined in more detail
and a positive identification may be feasible once up to 1 KByte of their data
has been observed⁴. Flows that have not been classified at this stage will
require inspection of the entire flow payload and we separate such a process
into two distinct steps. In the first step (Method VII) we perform full-flow
analysis for a subset of the flows that perform a control function. In our
case FTP appeared to carry a significant amount of the overall traffic and
Method VII was applied only to those flows that used the standard FTP control
port. The control messages were parsed and further context was obtained that
allowed us to classify more flows in the trace. Lastly, if there are still
flows to be classified, we analyse them using specific protocol information,
attributing them to their causal application using Method VIII.
⁴ The value of 1 KByte has been experimentally found to be an upper bound for
the amount of packet information that needs to be processed for the
identification of several applications making use of signatures. In future
work, we intend to address the exact question of what is the necessary amount
of payload one needs to capture in order to identify different types of
applications.
[Figure: identification methods placed by complexity against amount of data;
the data levels shown include 1st KByte, Flow (selected) and Flow (all).]
Fig. 1. Requirements of identification methods.
In our classification technique we will apply each identification method in
turn and in such a way that the more complex or more data-demanding methods
(as shown in Figure 1) are used only if no previous signature or protocol
method has generated a match. The outcome of this process may be that (i) we
have positively identified a flow to belong to a specific application, (ii) a
flow appears to agree with more than one application profile, or (iii) no
candidate application has been identified. In our current methodology all
three cases will trigger manual intervention in order to validate the accuracy
of the classification, resolve cases where multiple criteria have generated a
match or inspect flows that have not matched any identification criteria. We
describe our validation approach in more detail in Section 3.4.
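The ordering rule and the three possible outcomes can be sketched in Python as follows. This is an illustrative sketch only: the tier structure, the matcher functions `ssh_banner` and `http_request`, and the flow dictionary fields are assumptions introduced here, not the signatures or data structures used in the study.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

# A matcher inspects a flow record and returns an application label or None.
Matcher = Callable[[Dict], Optional[str]]


def classify(flow: Dict, tiers: Sequence[Sequence[Matcher]]) -> Tuple[str, List[str]]:
    """Apply tiers from cheapest to most expensive; stop at the first match.

    Returns (outcome, candidates), where outcome is one of
    "identified", "ambiguous" or "unknown".
    """
    for tier in tiers:
        labels = [matcher(flow) for matcher in tier]
        candidates = sorted({label for label in labels if label is not None})
        if len(candidates) == 1:
            return "identified", candidates
        if len(candidates) > 1:
            return "ambiguous", candidates
    return "unknown", []


def ssh_banner(flow: Dict) -> Optional[str]:
    # Hypothetical first-packet signature check.
    return "ssh" if flow["first_packet"].startswith(b"SSH-") else None


def http_request(flow: Dict) -> Optional[str]:
    # Hypothetical first-KByte check, deliberately simplified.
    return "http" if flow["first_kbyte"].startswith((b"GET ", b"POST ")) else None
```

In the methodology described here every outcome, including a unique match, is still passed to manual validation; the sketch only captures the rule that costlier tiers run when cheaper ones produce no match.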
The successful identification of specific flows caused by a particular network
application reveals important information about the hosts active in our trace.
Our technique utilises this information to build a knowledge base for
particular host/port combinations that can be used to validate future
classification by testing conformance with already-observed host roles (Method
IX). One outcome of this operation is the identification of hosts performing
port scanning, where a particular destination host is contacted from the same
source host on many sequential port numbers. These flows evidently do not
belong to a particular application (unless port scanning is part of the
applications looked into). For a different set of flows, this process
validated the streaming audio from a pool of machines serving a local
broadcaster.
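The port-scanning pattern described above (one source contacting one destination on many sequential ports) can be detected with a short sketch such as the following. The input format and the `min_run` threshold of 10 are assumptions chosen for illustration; the study does not state a threshold.

```python
from collections import defaultdict


def longest_run(ports):
    """Length of the longest run of consecutive port numbers in `ports`."""
    present = set(ports)
    best = 0
    for port in present:
        if port - 1 not in present:  # `port` starts a run
            length = 1
            while port + length in present:
                length += 1
            best = max(best, length)
    return best


def find_port_scans(flows, min_run=10):
    """flows: iterable of (src_host, dst_host, dst_port) tuples.

    Returns the sorted (src, dst) pairs whose contacted ports contain a
    run of at least `min_run` sequential port numbers.
    """
    ports_by_pair = defaultdict(set)
    for src, dst, dport in flows:
        ports_by_pair[(src, dst)].add(dport)
    return sorted(
        pair for pair, ports in ports_by_pair.items()
        if longest_run(ports) >= min_run
    )
```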
Method IX can be further enhanced to use information from the host name as
recorded in the DNS. While we used this as a process of last resort (DNS names
can be notoriously unrepresentative), DNS names in our trace did reveal the
presence of an HTTP proxy, a Mail exchange server and a VPN endpoint operating
over a TCP/IP connection.
3.3 Classification Approach
An illustration of the flow through the different identification sub-methods,
as employed by our approach, is shown in Figure 2. In the first step we
attempt to reduce the number of flows to be further processed by using context
obtained through previous iterations. Specific flows in our data can be seen
as "child" connections arising from "parent" connections that precede them.
One such example is a web browser that initiates multiple connections in order
to retrieve parts of a single web page. Having parsed the "parent" connection
allows us to immediately identify the "child" connections and classify them to
the causal web
[Figure: flowchart of the classification procedure. Stages shown, in order:
"Flow Result of Another Application?"; "Tag flows with known ports"; "1st pkt
'Well Known' Signature?"; "1st pkt 'Well Known' Protocol?"; "1st KB 'Well
Known' Signature?"; "1st KB 'Well Known' Protocol?"; "Flow Contains Known
Protocol?" (shown twice); with the annotation "(using among other
mechanisms)".]
Fig. 2. Classification procedure.
A second example, which has a predominant effect in our data, is passive FTP.
Parsing the "parent" FTP session (Method VIII) allows the identification of
the subsequent "child" connection that may be established toward a different
host at a non-standard port. Testing whether a flow is the result of an
already-classified flow at the beginning of the classification process allows
for the fast characterisation of a network flow without the need to go through
the remainder of the process.
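For passive FTP, the "child" endpoint is announced on the control connection in the server's 227 reply, whose six comma-separated numbers encode an IPv4 address and a port (port = p1 * 256 + p2, per RFC 959). The sketch below shows how such a reply line could be parsed; the function name and the regular expression are illustrative assumptions, not the parser used in the study.

```python
import re

# Matches "227 ... h1,h2,h3,h4,p1,p2" as sent in reply to a PASV command.
PASV_REPLY = re.compile(
    rb"^227 .*?(\d{1,3}),(\d{1,3}),(\d{1,3}),(\d{1,3}),(\d{1,3}),(\d{1,3})"
)


def pasv_endpoint(control_line: bytes):
    """Return (host, port) announced by a 227 reply, or None if not a 227 reply."""
    match = PASV_REPLY.match(control_line)
    if match is None:
        return None
    h1, h2, h3, h4, p1, p2 = (int(field) for field in match.groups())
    return f"{h1}.{h2}.{h3}.{h4}", p1 * 256 + p2
```

A classifier could record each returned endpoint and attribute a later flow toward that host and port to FTP before running any further tests.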
If the flow is not positively identified in the first stage then it goes
through several additional classification criteria. The first mechanism
examines whether a flow uses a well-known port number. While port-based
classification is prone to error, the port number is still a useful input into
the classification process because it may convey useful information about the
identity of the flow. If no well-known port is used, the classification
proceeds through the next stages. However, even in the case when a flow is
found to operate on a well-known port, it is tagged as well-known but still
forwarded through the remainder of the classification process.
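The port-tagging stage can be sketched as a simple lookup whose result is treated as a hint rather than a verdict. The table below lists a handful of standard IANA assignments for illustration only; the function name and the choice to check the destination port first are assumptions.

```python
# A few standard well-known port assignments (illustrative subset).
WELL_KNOWN_PORTS = {
    21: "ftp",
    22: "ssh",
    23: "telnet",
    25: "smtp",
    53: "dns",
    80: "http",
    110: "pop3",
    143: "imap",
}


def tag_by_port(sport: int, dport: int):
    """Return a tentative label if either port is well known, else None.

    The tag does not end classification: a tagged flow still proceeds
    through the signature and protocol stages.
    """
    return WELL_KNOWN_PORTS.get(dport) or WELL_KNOWN_PORTS.get(sport)
```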
In the next stage we test whether the flow contains a known signature in its
first packet. At this point we will be able to identify flows that may be
directed to well-known port numbers but carry non-legitimate traffic as in the
case of virus or attack traffic. Signature-scanning is a process that sees
common use within Intrusion Detection Systems such as snort [6]. It has the
advantage that a suitable scanner is often optimised for string-matching while
still allowing the expression of flexible matching criteria. By scanning for
signatures, applications such as web-servers operating on non-standard ports
may be identified.
If no known signature has been found in the first packet we check whether the
first packet of the flow conveys semantics of a well-known protocol. An
example to that effect is IDENT which is a single packet IP protocol. If this
test fails we look for well-known signatures in the first KByte of the flow,
which may require assembly of multiple individual packets. At this stage we
will be able to identify peer-to-peer traffic if it uses well-known
signatures. Traffic due to SMTP will have been detected from the port-based
classification but only the examination of the protocol semantics within the
first KByte of the flow will allow for the confident characterisation of the
flow. Network protocol analysis tools, such as ethereal [7], employ a number
of such protocol decoders and may be used to make or validate a protocol
identification.
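Signature matching over the first KByte can be sketched as follows. The three patterns are commonly documented protocol openings (an HTTP request line, the SSH version banner, and the BitTorrent handshake prefix) and are chosen here purely for illustration; they are not the signature set built during the study, and the function name is an assumption.

```python
import re

# Illustrative signatures, anchored at the start of the flow payload.
SIGNATURES = [
    ("http", re.compile(rb"^(GET|POST|HEAD) \S+ HTTP/1\.[01]\r\n")),
    ("ssh", re.compile(rb"^SSH-\d\.\d+-")),
    ("bittorrent", re.compile(rb"^\x13BitTorrent protocol")),
]


def match_signatures(payloads, limit=1024):
    """Join in-order packet payloads, keep the first `limit` bytes, and
    return the labels of all signatures that match."""
    prefix = b"".join(payloads)[:limit]
    return [name for name, pattern in SIGNATURES if pattern.search(prefix)]
```

Passing a single-element list corresponds to the first-packet test; passing several payloads corresponds to the first-KByte test, where a signature may straddle a packet boundary.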
Specific flows will still remain unclassified even at this stage and will
require inspection of their entire payload. This operation may be manual or
automated for particular protocols. From our experience, focusing on the
protocol semantics of FTP led to the identification of a very significant
fraction of the overall traffic, limiting the unknown traffic to less than 2%.
At this point the classification procedure can end. However, if 100% accuracy
is to be approached we envision that the last stage of the classification
process may involve the manual inspection of all unidentified flows. This
stage is rather important since it is likely to reveal new applications. While
labour-intensive, the individual examination of the remaining, unidentified,
flows caused the creation of a number of new signatures and protocol-templates
that were then able to be used for identifying protocols such as PCAnywhere,
the sdserver and CVS. This process also served to identify more task-specific
systems. An example of this was a host offering protocol-specific database
services.
On occasion flows may remain unclassified despite this process; this takes the
form of small samples (e.g., 1-2 packets) of data that do not provide enough
information to allow any classification process to proceed. These packets used
unrecognised ports and rarely carried any payload. While such background noise
was not zero, in the context of classification for accounting,
Quality-of-Service, or resource planning these amounts could be considered
insignificant. The actual amount of data in terms of either packets or bytes
that remained unclassified represented less than 0.001% of the total.
3.4 Validation Process
Accurate classification is complicated by the unusual use to which some
protocols are put. As noted earlier, the use of one protocol to carry another,
such as the use of HTTP to carry peer-to-peer application traffic, will
confuse a simple signature-based classification system. Additionally, the use
of FTP to carry an HTTP transaction log will similarly confuse signature
matching.
Due to these unusual cases, establishing the certainty of any classification
appears to be a difficult task. Throughout the work presented in this paper
validation was performed manually in order to approach 100% accuracy in our
results. Our validation approach features several distinct methods.
Each flow is tested against multiple classification criteria. If this
procedure leads to several criteria being satisfied simultaneously, manual
intervention can allow for the identification of the true causal application.
An example is the peer-to-peer situation. Identifying a flow as HTTP does not
suggest anything more than that the flow contains HTTP signatures. After
applying all classification methods we may conclude that the flow is HTTP
alone, or additional signature-matching (e.g. identifying a peer-to-peer
application) may indicate that the flow is the result of a peer-to-peer
transfer.
If the flow classification results from a well-known protocol, then the
validation approach tests the conformance of the flow to the actual protocol.
An example of this procedure is the identification of FTP PASV flows. A PASV
flow can be valid only if the FTP control-stream overlaps the duration of the
PASV flow; such cursory, protocol-based, examination allows an invalid
classification to be identified. Alongside this process, flows can be further
validated against the perceived function of a host. For example, an identified
router would be valid to relay BGP, whereas for a machine identified as
(probably) a desktop Windows box behind a NAT, concluding that it was
transferring BGP is unlikely and this potentially invalid classification
requires manual intervention.
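The PASV validity rule (a data flow is plausible only if an FTP control stream overlaps it in time) amounts to an interval-overlap test. The sketch below is a minimal illustration; the flow record fields `client`, `start` and `end`, and the decision to match control and data flows on the client address only, are assumptions made for the example.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if the closed intervals [a_start, a_end] and [b_start, b_end] intersect."""
    return a_start <= b_end and b_start <= a_end


def pasv_flow_is_valid(data_flow, control_flows):
    """Accept a candidate PASV data flow only if some FTP control flow from
    the same client overlaps it in time.

    Flows are dicts with hypothetical keys "client", "start" and "end"
    (timestamps in seconds).
    """
    return any(
        control["client"] == data_flow["client"]
        and overlaps(control["start"], control["end"],
                     data_flow["start"], data_flow["end"])
        for control in control_flows
    )
```

Only the client is compared because, as noted earlier, the data connection may be established toward a different host from the one serving the control connection.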
4 Results
Given the large number of identified applications, and for ease of
presentation, we group applications into types according to their potential
requirements from the network infrastructure. Table 3 indicates ten such
classes of traffic. Importantly, the characteristics of the traffic within
each category are not necessarily unique. For example, the BULK category,
which is made up of ftp traffic, consists of both the ftp control channel,
which carries data in both directions, and the ftp data channel, which
consists of a simplex flow of data for each object transferred.
Category       Example Application
               postgres, sqlnet, oracle, ingres
INTERACTIVE    ssh, klogin, rlogin, telnet
MAIL           imap, pop2/3, smtp
               X11, dns, ident, ldap, ntp
               KaZaA, BitTorrent, GnuTella
               Internet worm and virus attacks
MULTIMEDIA     Windows Media Player, Real
Table 3. Network traffic allocated to each category
In Table 4 we compare the results of simple port-based classification with
content-based classification. The technique of port-analysis, against which we
compare our approach, is common industry practice (e.g., Cisco NetFlow or [1,
2]). UNKNOWN refers to applications which for port-based analysis are not
readily identifiable. Notice that under the content-based classification
approach we had nearly no UNKNOWN traffic; instead we have 5 new
traffic-classes detected. The traffic we were not able to classify corresponds
to a small number of flows. A limited number of flows provides a minimal
sample of the application behavior and thus cannot allow for the confident
identification of the causal application.
Table 4 shows that under the simple port-based classification scheme based
upon the IANA port assignments 30% of the carried bytes cannot be attributed
to a particular application. Further observation reveals that the BULK traffic
is underestimated by approximately 20% while we see a difference of 6% in the
WWW traffic. However, the port-based approach does not only under-estimate
traffic but for some classes, e.g., INTERACTIVE applications, it may
over-estimate it. This means that traffic flows can also be misidentified
under the port-based technique. Lastly, applications such as peer-to-peer and
mal-ware appear to contribute zero traffic in the port-based case. This is due
to the port through which such protocols travel not providing a standard
identification. Such port-based estimation errors are believed to be
significant.
   Port-Based          Content-Based
Packets    Bytes     Packets    Bytes
As a percentage of total traffic
  46.97    45.00
   0.03     0.03
   0.03     0.07
   1.19     0.43
   3.37     3.62
   0.07     0.02
  19.98    20.40
  28.36    30.43       <0.01    <0.01
      -        -        1.10
      -        -        0.44
      -        -        1.27
      -        -        0.17
      -        -        0.22
Table 4. Contrasting port-based and content-based classification.
4.1 Examining Under and Over-estimation
Of the results in Table 4 we will concentrate on only a few example
situations. The first and most dominant difference is for BULK, the traffic
created as a result of FTP. The reason is that port-based classification will
not be able to correctly identify a large class of (FTP) traffic transported
using the PASV mechanism. Content-based classification is able to identify the
causal relationship between the FTP control flow and any resulting
data-transport. This means that traffic that was formerly either of unknown
origin or incorrectly classified may be ascribed to FTP, which is a traffic
source that will be consistently underestimated by port-based classification.
A comparison of values for MAIL, a category consisting of the SMTP, IMAP, MAPI
and POP protocols, reveals that it is estimated with surprising accuracy in
both cases. Both the number of packets and the number of bytes transferred are
unchanged between the two classification techniques. We also did not find any
other non-MAIL traffic present on MAIL ports. We would assert that the reason
MAIL is found exclusively on the commonly defined ports, while no other MAIL
transactions are found on other ports, is that MAIL must be exchanged with
other sites and other hosts. MAIL relies on common, Internet-wide standards
for port and protocol assignment. No single site could arbitrarily change the
ports on which MAIL is exchanged without effectively cutting itself off from
exchanges with other Internet sites. Therefore, MAIL is a traffic source that,
for quantifying traffic exchanged with other sites at least, may be accurately
estimated by port-based classification.
Despite the fact that such an effect was not pronounced in the analysed data
set, port-based classification can also lead to over-estimation of the amount
of traffic carried by a particular application. One reason is that mal-ware or
attack traffic may use the well-known ports of a particular service, thus
inflating the amount of traffic attributed to that application. In addition,
if a particular application uses another application as a relay, then the
traffic attributed to the latter will be inflated by the amount of traffic of
the former. An example of such a case is peer-to-peer traffic using HTTP to
avoid blocking by firewalls, an effect that was not present in our data. In
fact, we notice that under the content-based approach we can attribute more
traffic to WWW since our data included web servers operating on non-standard
ports that could not be detected under the port-based approach.
Clearly this work leads to an obvious question of how we know that our
content-based method is correct. We would emphasise that it was only through
the labour-intensive examining of all data-flows along with numerous exchanges
with system administrators and users of the examined site that we were able to
arrive at a system of sufficient accuracy. We do not consider that such a
laborious process would need to be repeated for the analysis of similar
traffic profiles. However, the identification of new types of applications
will require a more limited examination of a future, unclassifiable anomaly.
4.2 Overheads of content-based analysis
Alongside a presentation of the effectiveness of the content-based method we
present the overheads this method incurs. For our study we were able to
iterate through traffic multiple times, studying data for many months after
its collection. Clearly, such a labour-intensive approach would not be
suitable if it were to be used as part of real-time operator feedback.
We emphasise that while performing this work, we built a considerable body of
knowledge applicable to future studies. The data collected for one monitor can
be reapplied for future collections made at that location. Additionally, while
specific host information may quickly become out-of-date, the techniques for
identifying applications through signatures and protocol-fitting continue to
be applicable. In this way historical data becomes an a-priori that can assist
in the decision-making process of the characterisation for each analysis of
the future.
Table 5 indicates the relationship between the complexity of analysis and the
quantity of data we could positively identify; items are ordered in the table
by increasing level of complexity. The Method column refers to the methods
listed in Table 2 in Section 3.
Currently our method employs packet-header analysis and host-profile
construction for all levels of complexity. Signature matching is easier to
implement and perform than protocol matching due to its application of static
string matching. Analysis that is based upon a single packet (the first
packet) is inherently less complex than analysis based upon (up to) the first
KByte. The first KByte may require reassembly from the payload of multiple
packets. Finally, any form of flow analysis is complicated, although its
overheads are clearly reduced if the number of flows that require parsing is
limited.
[Table: each row marks with bullets which identification methods are applied
at a given level of complexity. Recoverable column headers: UNKNOWN Data %,
Correctly Identified (Bytes, Packets). Recoverable values: <0.01, <0.01,
>99.99.]
Table 5. Analysis method compared against percentage of UNKNOWN and correctly
identified data.
`Table 5 clearly illustrates the accuracy achieved by applying successively-

