throbber
Traffic Classification On The Fly
`
`Laurent Bernaille†, Renata Teixeira†, Ismael Akodkenou†
`Augustin Soule‡, Kave Salamatian†
`† LIP6, Universit ´e Pierre et Marie Curie, ‡ Thomson Paris Lab
`Paris, FRANCE
`
`ABSTRACT
`The early detection of applications associated with TCP flows is
`an essential step for network security and traffic engineering. The
`classic way to identify flows, i.e. looking at port numbers, is not
`effective anymore. On the other hand, state-of-the-art techniques
`cannot determine the application before the end of the TCP flow.
`In this editorial, we propose a technique that relies on the obser-
`vation of the first five packets of a TCP connection to identify the
`application. This result opens a range of new possibilities for online
`traffic classification.
`
`Categories and Subject Descriptors
`I.2.6 [Learning]: Unsupervised Learning; C.2.3 [Network Moni-
`toring]: Network Management
`
`General Terms
`Measurement, Algorithms, Management
`
`Keywords
`Traffic classification, Applications, Machine Learning
`
`1.
`
`INTRODUCTION
`Enterprise or campus networks usually impose a set of rules for
`users to access the network in order to protect network resources
`and enforce institutional policies (for instance, no sharing of mu-
`sic files or gaming). This leaves network administrators with the
`daunting task of (1) identifying the application associated with a
`traffic flow on-the fly and (2) controlling user’s traffic when needed.
`Therefore, accurate classification of traffic flows is an essential step
`for administrators to detect intrusion or malicious attacks, forbid-
`den applications, or simply new applications (which may impact
`the future provisioning of network resources).
`Previous works have proposed a number of methods to identify
`the application associated with a traffic flow. The simplest approach
`consists in examining TCP port numbers. Port-based methods are
`simple because many well-known applications have specific port
`numbers (for instance, HTTP traffic uses port 80 and FTP port 21).
`However, the research community now recognizes that port-based
`classification is inadequate [1, 2, 3, 4], mainly because many ap-
`plications use dynamic port-negotiation mechanisms to hide from
`firewalls and network security tools. An alternative approach is
`to inspect the payload of every packet. This technique can be ex-
`tremely accurate when the payload is not encrypted, but it is an
`unrealistic alternative. First, there are privacy concerns with exam-
`ining user data. Second, there is a high storage and computational
`
`cost to study every packet that traverses a link (in particular at very
`high-speed links).
`Detecting the application associated with a flow is of limited in-
`terest once the flow is finished (except for provisioning and maybe
`for pricing). To address these challenges, we propose a novel tech-
`nique that only uses the size of the first few data packets of each
`TCP flow to identify the application associated with a TCP flow.
`Our method runs at the edge of the network, i.e., where the net-
`work connects to the Internet, and hence is able to capture all pack-
`ets associated with a TCP flow in both directions (from sender to
`receiver and vice-versa). The accurate classification of flows based
`on information contained only in the first few packets opens new
`possibilities for online classification (i.e. security, SLAs, etc.). Our
`method is in sharp contrast with the current state of the art.
`
`2. STATE OF THE ART
`After port-based classification has been debunked and given the
`limitations of searching payloads for signatures, there has been a
`new trend to classify traffic based on summarized flow information
`such as duration, number of packets and mean inter-arrival time [2],
`[5], [6], [3]. BLINC [4] introduces a new approach for traffic clas-
`sification. It associates Internet hosts with applications. Instead of
`studying TCP (or UDP) flows individually, it looks at all the flows
`generated by specific hosts. BLINC is able to accurately associate
`hosts with the services they provide or use (application server, web
`client, etc.). However, it cannot classify a single TCP flow.
`All classification techniques that use flow statistics can only iden-
`tify the nature of a flow when the flow is finished. BLINC has to
`gather information from several flows for each host before it can
`decide on the role of a host. These requirements prevent the use
`of these methods online. In contrast, our method relies only on the
`first few packets of a TCP flow. This early classification is essen-
`tial to allow automatic blocking, filtering, or recording of specific
`applications. It also limits the amount of memory required to store
`information associated with each flow.
`
`3. KEY POINTS
`This section discusses our assumptions and the two main insights
`of our method: the analysis of only the first few packets of the flow,
`and the use of unsupervised clustering to detect a set of flows that
`share a common behavior.
`3.1 Assumptions
`Our goal is to build a classifier that runs online and accurately
`identifies the application associated with a TCP flow as early as
`possible. Our work relies on the following assumptions:
`
`ACM SIGCOMM Computer Communication Review
`
`23
`
`Volume 36, Number 2, April 2006
`
`Splunk Inc. Exhibit 1044 Page 1
`
`

`

`SizeofPacket#2
`
`~*%800
`
`-150 -100
`
`RMHON
`
`oO
`-5S0
`Size of Packet #1
`
`100
`
`XMMXKKKKKXK
`
`
`Figure 1: SMTP versus Edonkey flows.
`
`transfer from client to server: and(iii) upload flows are file transfers
`from server to client.
`
`SizeofPacket#2
`
`e Access to both directions of a TCP connection. Edge net-
`works usually connect to the Internet at few locations. The
`vast majority in one or two locations. We assumethatthe net-
`work administrator monitors both directions ofall the edge
`links.
`
`e Offline access to flows of a numberof applications. Our
`method only needs packet size information. However, for
`training purpose, we need to knowthe application associated
`with a sample of flows. There are twoalternative techniques
`for collecting this training trace:
`(i) directly from one of
`the monitored links using high-end packet monitoring cards
`(we can then analyze the payloads offline to identify the ap-
`plication corresponding to a given flow) ; or (ii) generated
`in a controlled fashion by explicitly launching instances of
`each of the desired applications and collecting all packets
`exchanged.
`
`e Online access to header of all packets. We only use the
`information on the packet header to do our onlineclassifica-
`tion, which is much smaller andoffixed size.
`
`3.2
`
`Intuition behind the method
`
`Our goal is to identify the application associated with a TCP flow
`as early as possible. We design a classifier that uses information
`available in the header ofthe first P packets ofthe flow to identify
`the application associated with a flow. Specifically, we only use
`the size of the data packets, not using TCP control packets (SYN,
`ACKs,etc.). The size of the first few packets is a good predictor of
`the application associated with a flow because it captures the appli-
`cation’s negotiation phase, which is usually a pre-defined sequence
`ofmessages and distinct amongapplications.
`We illustrate our approach with SMTP and Edonkey flows cap-
`tured at the edge of a large university network to shed light on
`the behaviors of applications. Figure 1 presents the size of the
`first packet versus the size of the second (negative values repre-
`sents packets sent from the TCP server to the client and positive
`from client to server) for every TCP flow ofthese two applications.
`Edonkey clients initialize request for files, hence the positive sign
`for the size of the first packet, whereas the SMTP server initiates
`negotiation. This simple example illustrates the expressiveness of
`this two-dimensionalrepresentation ofTCP flowsto distinguish the
`application associated with each flow. In fact, the traces we study
`consist of ten major applications and we found that 5 packets were
`enough to distinguish their behaviors.
`
`3.3. Unsupervised clustering
`To extract groups offlowsthat share a common communication
`behavior, we borrow techniques from machine learning. We use
`unsupervised clustering as it relies on unlabeled data samples (in
`our case, the size of the first few packets of a TCP flow) to find
`natural groups (or clusters) in a dataset, whereas supervised clus-
`tering uses a pre-labeled set of samples to construct a model for
`each cluster.
`
`Although thetraffic classification mechanism presented in [3]
`uses Naive Bayes Classifiers, an example of supervised clustering,
`unsupervised learning is more appropriatefor traffic classification
`because it doesnotrely on pre-defined classes. A single application
`can have multiple behaviors which should be modeled separately.
`For example, Figure 2 presents the size ofthe first packet versus the
`size ofthe second ofa set ofFTP flows. This figure shows that FTP
`has three very distinct behaviors: (i) command flows correspond
`to a control flow, in which a data connection is negotiated and in
`which clients ask for data; (ii) download flows corresponds to files
`
`Figure 2: Example of a multi-modal application: FTP.
`
`FTP is an example of a multi-modal application, and we found
`many other applications with several behaviors in our data. Unsu-
`pervised learning finds groups of flows that share the same behav-
`ior, which are easier to model.
`
`4. METHODOLOGY
`
`Basedon the observations from the previous section, we propose
`a traffic classification mechanism that works in two phases: an of-
`fline learning phase and an online traffic classification phase.First,
`the learning phase usesa set of training data to cluster TCP flows
`that share a commonbehavior. Then, thetraffic classification phase
`uses these classes to determine the application associated with each
`TCP flow. To verify that the behaviors of applications do not vary
`with time, we use one data set for the learning phase and another
`for the evaluation of the derived classification method. These two
`traces were collected several month apart on the samelink.
`
`ACM SIGCOMM Computer Communication Review
`
`24
`
`Volume 36, Number2, April 2006
`
`Splunk Inc.
`
`Exhibit1044
`
`Page2
`
`Splunk Inc. Exhibit 1044 Page 2
`
`

`

` Application label
`
`Figure 3: Design of online classifier.
`
`4.1 Learning phase
`The learning phase is performed offline and consists in detect-
`ing common behaviors in a set of flows. It takes as input a short
`packet trace with TCP flows from a mix ofapplications. Our study
`uses a one-hour trace collected at the edge of a university network.
`This trace also includes packet payloads. We apply a packet trace
`analyzer to this data set to extract TCP flows. First, to group flows
`into clusters, we need to evaluate the similarity between them. We
`associate each flow with a spatial representation based onthe sizes
`ofits first P packets. The representation we use is quite simple:
`flows are represented by points in a P-dimensional space where
`each packet is associated to a dimension. The coordinate on the
`dimensionpis the size of packet p in the flow. In order to differ-
`entiate packets sent by the TCP-server and the TCP-client, packets
`sent by the server have a negative coordinate. The similarity be-
`tween flowsis then evaluated with a simple metric: the Euclidean
`distance between their associated spatial representation.
`Tn order to extract common behaviors in such a space we rely on
`the well-known K-Means algorithm [7]. Once we have found nat-
`ural clusters in the training set, we model and analyze eachcluster.
`The modeling step consists in defining a heuristic to associate a
`new flow to a cluster. We use a simple heuristic: we compute the
`euclidean distance between the new flow andthe center ofeach pre-
`defined cluster, and then choose the cluster for which this distance
`is minimum. Theanalysis step consists in examining the composi-
`tion of each cluster. We processthe training flows with a payload
`analysis tool [8] that is a able to accurately determine the applica-
`tion associated with each flow. This step allows us to decide how
`to label a flow onceit has be assigned to a cluster.
`Westudied different numbers of packets and found that the best
`separation ofapplications amongclusters was observed with 5 pack-
`ets. We also tried different numbers of clusters for the K-Means
`
`5. EARLY EVALUATION RESULTS
`
`This section gives a proof of concept of our flow classification
`method. Further work is necessary to conclude in the applicability
`of the method. We build a prototype classifier using Matlab. First,
`we use a training data set for the learning phase. Based on the
`cluster descriptions and cluster compositions found in the learning
`phase, we emulate the online classification using a trace collected
`six month afterwards on the same link. For validation purposes, we
`label this trace using the payload analysis tool used in the learning
`phase [8].
`Table 1 presents the accuracy of our classifier when it consid-
`ers the first five packets of each flow. We measure accuracy by
`comparing the application labels given by our classifier to the ones
`obtained through payload analysis. We were ableto correctly iden-
`tify more than 80% offlows ofalmost all of the applications. The
`only exception is POP3. Our classifier labels 86.8% of POP3 flows
`as NNTP and 12.6% as SMTP, because POP3 flows always belong
`to clusters where POP3is not the dominant application. This error
`can be easily fixed if we consider extra information.
`In this ex-
`ample, ifwe had used the destination port to label POP3 flows, we
`would have achieved over 90% accuracy. We are currently working
`on more sophisticated labeling techniques.
`
`Application Accuracy
`edonkey
`84.2%
`fp
`87%
`http
`99%
`kazaa
`95.24%
`nntp
`99.6%
`pop3
`0%
`smtp
`84.4%
`ssh
`96.92%
`https
`81.8%
`pop3s
`89.8%
`
`Table 1: Accuracy offlow classification.
`
`6. LIMITATIONS AND CHALLENGES
`The initial results observed with our method on a small trace are
`encouraging. The method is promising as it allows early classifica-
`
`algorithm and found that, for 5 packets, 50 clusters was the best
`trade-offbetween behavior separation and complexity.
`The learning phase outputs two sets: one with the description of
`each cluster (or the center of the cluster) and the other with their
`composition (or the set of applications represented in the cluster).
`Weuse both these sets to classify flows online.
`
`4.2. Online classification
`
`Figure 3 presents the structure of the online classifier. This clas-
`sifier can run either at a managementhost that has online access
`to packet headers or in a network processorat the router. We as-
`sume that the network administrators deploy a monitoring card that
`is able to process the header of all packets traversing the link [9].
`This process is very light since it only involves retrieving informa-
`tion from the IP and TCP headers (the size of the packet and the
`flow id).
`Our classifier takes as input the series ofpacket headers for both
`directions of an edge link. A packet analyzer extracts the 5-tuple
`(protocol, source IP, destination IP, source port, destination port)
`and the packet size. The analyzer filters out control traffic (the
`three packets of the TCP handshake and ACKs without payload)
`andstores the size of every packet in both directions ofthe connec-
`tion. When it has the size for the first P packets ofthe connection, it
`sends this information to the flow conversion module, which maps
`the new flow to a spatial representation. Then, the cluster assign-
`ment module searches all the cluster descriptions to find the bestfit
`for the new flow and the application identification module selects
`which application is the most likely for the flow given the set of
`applications that compose the cluster. We chose a simple heuristic
`to associate an unknown flow with an application: it is labeled with
`the application that is the most commonin the cluster.
`
`ACM SIGCOMM Computer Communication Review
`
`25
`
`Volume 36, Number2, April 2006
`
`Splunk Inc.
`
`Exhibit1044
`
`Page3
`
`Splunk Inc. Exhibit 1044 Page 3
`
`

`

`7. REFERENCES
`[1] T. Karagiannis, A. Broido, N. Brownlee, K. Claffy, and
`M. Faloutsos, “Is P2P dying or just hiding?,” in IEEE
`Globecom, 2004.
`[2] M. Roughan, S. Sen, O. Spatscheck, and N. Duffield,
`“Class-of-service mapping for QoS: A statistical
`signature-based approach to IP traffic classification,” in
`Internet Measurement Conference, 2004.
`[3] A. Moore and D. Zuev, “Internet traffic classification using
`bayesian analysis,” in ACM SIGMETRICS, 2005.
`[4] T. Karagiannis, D. Papagiannaki, and M. Faloutsos,
`“BLINC: Multilevel traffic classification in the dark,” in
`ACM SIGCOMM, 2005.
`[5] A. McGregor, M. Hall, P. Lorier, and J. Brunskill, “Flow
`clustering using machine learning techniques,” in Passive
`and Active Measurement Workshop, 2004.
`[6] D. Zuev and A. Moore, “Traffic classification using a
`statistical approach,” in Passive and Active Measurement
`Workshop, 2005.
`[7] J. McQueen, “Some methods for classification and analysis
`of multivariate tions,” in Symposium on Mathematical
`Statistics and Probability, 1967.
`[8] Qosmos, “www.qosmos.com.”
`[9] Endace, “www.endace.com.”
`[10] N. Hohn and D. Veitch, “Inverting sampled traffic,” in
`Internet Measurement Conference, 2003.
`
`tion of applications and is quite simple. However, the method has
`some limitations that we discuss below. Most of these limitations
`are easy to overcome, while others are more fundamental and affect
`most classification methods to date.
`Multi-homed networks. Large networks often have multiple
`connections to the Internet.
`In this case, we can extend our ap-
`proach to monitor all access links and aggregate information on a
`machine where the classification will take place.
`Packet order. In IP networks, packets may be out of order or
`may appear more than once. Out-of-order packets will change the
`spatial representation of the flow, which will impact the quality of
`our classification. Fortunately, we only need the first five packets
`to arrive in order, which is very likely. On the network we studied,
`less than 4% of TCP flows had an out-of-order packet within the
`first five data packets of the flow.
`Sampling. On high speed links, monitors cannot collect all
`packets. The accuracy of our method will degrade fairly quickly
`under packet sampling. If instead the network adopts flow sam-
`pling [10], then our method will work unaltered.
`Applications with similar behaviors. If two distinct applica-
`tions start by exchanging five packets of approximately the same
`size, then we classify both with the same label. To solve this prob-
`lem, we are working on more sophisticated labeling heuristics that
`take into account other pieces of information available in the first
`few packets (such as port numbers and inter-arrival times).
`Applications with unknown behaviors. The learning phase
`gives a model for the traffic present in the training data. Using this
`model on other networks could be inefficient. Although a specific
`behavior for an application precisely described on a trace should
`not change when observed on another network or at a different
`time, a new network can (and most likely will) have different ap-
`plication behaviors. We can identify new behaviors using limited
`cluster radiuses. Using this limit, we return an accurate application
`label for previously-defined behaviors and classify all flows that are
`not represented in the training data as “unknown”.
`Short flows. It is well known that the traffic is made of a large
`majority of short flows and a small number of very large flows. We
`need to test our method on these very short flows. In this scenario, if
`it is successful, it will not work better than most other techniques.
`However, previous classification techniques have not specifically
`studied these short flows and they may exhibit some limitations
`that have not been observed yet.
`False matches. Most classifiers to date are based on heuristics
`and may misclassify a flow. This possibility of error, even if small,
`prevents the use of traffic classification techniques to automatically
`filter or block flows. Network administrators can use our method
`to identify suspicious flows early and store all their packets. They
`can later audit the stored data.
`Evasion. The main challenge to traffic classification techniques
`is evasion. For instance, an “attacker” could easily evade our method
`by padding packet payloads in order to modify sizes. However, all
`classification methods can be evaded: payload analysis tools can-
`not classify encrypted packets, port-based methods are deceived by
`a simple change of port, and approaches relying on summarized
`flow information are sensitive to simple alterations of packet sizes
`and inter-arrival times.
`To summarize, we now need to analyze our method on a much
`broader panel of traces. This is a difficult task as (i) we need the full
`trace with full payload for validation purposes (such traces are rare
`in the community); and (ii) we need more sophisticated techniques
`to label a flow after it has been assigned to a cluster. We are also
`working on circumventing the limitations and the weaknesses of
`our method.
`
`ACM SIGCOMM Computer Communication Review
`
`26
`
`Volume 36, Number 2, April 2006
`
`Splunk Inc. Exhibit 1044 Page 4
`
`

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket