IPR2022-00228, No. 1044 Exhibit - Exhibit 1044 (P.T.A.B. Nov. 24, 2021)

Trafﬁc Classiﬁcation On The Fly
`
`Laurent Bernaille†, Renata Teixeira†, Ismael Akodkenou†
`Augustin Soule‡, Kave Salamatian†
`† LIP6, Universit ´e Pierre et Marie Curie, ‡ Thomson Paris Lab
`Paris, FRANCE
`
`ABSTRACT
`The early detection of applications associated with TCP ﬂows is
`an essential step for network security and trafﬁc engineering. The
`classic way to identify ﬂows, i.e. looking at port numbers, is not
`effective anymore. On the other hand, state-of-the-art techniques
`cannot determine the application before the end of the TCP ﬂow.
`In this editorial, we propose a technique that relies on the obser-
`vation of the ﬁrst ﬁve packets of a TCP connection to identify the
`application. This result opens a range of new possibilities for online
`trafﬁc classiﬁcation.
`
`Categories and Subject Descriptors
`I.2.6 [Learning]: Unsupervised Learning; C.2.3 [Network Moni-
`toring]: Network Management
`
`General Terms
`Measurement, Algorithms, Management
`
`Keywords
`Trafﬁc classiﬁcation, Applications, Machine Learning
`
`1.
`
`INTRODUCTION
`Enterprise or campus networks usually impose a set of rules for
`users to access the network in order to protect network resources
`and enforce institutional policies (for instance, no sharing of mu-
`sic ﬁles or gaming). This leaves network administrators with the
`daunting task of (1) identifying the application associated with a
`trafﬁc ﬂow on-the ﬂy and (2) controlling user’s trafﬁc when needed.
`Therefore, accurate classiﬁcation of trafﬁc ﬂows is an essential step
`for administrators to detect intrusion or malicious attacks, forbid-
`den applications, or simply new applications (which may impact
`the future provisioning of network resources).
`Previous works have proposed a number of methods to identify
`the application associated with a trafﬁc ﬂow. The simplest approach
`consists in examining TCP port numbers. Port-based methods are
`simple because many well-known applications have speciﬁc port
`numbers (for instance, HTTP trafﬁc uses port 80 and FTP port 21).
`However, the research community now recognizes that port-based
`classiﬁcation is inadequate [1, 2, 3, 4], mainly because many ap-
`plications use dynamic port-negotiation mechanisms to hide from
`ﬁrewalls and network security tools. An alternative approach is
`to inspect the payload of every packet. This technique can be ex-
`tremely accurate when the payload is not encrypted, but it is an
`unrealistic alternative. First, there are privacy concerns with exam-
`ining user data. Second, there is a high storage and computational
`
`cost to study every packet that traverses a link (in particular at very
`high-speed links).
`Detecting the application associated with a ﬂow is of limited in-
`terest once the ﬂow is ﬁnished (except for provisioning and maybe
`for pricing). To address these challenges, we propose a novel tech-
`nique that only uses the size of the ﬁrst few data packets of each
`TCP ﬂow to identify the application associated with a TCP ﬂow.
`Our method runs at the edge of the network, i.e., where the net-
`work connects to the Internet, and hence is able to capture all pack-
`ets associated with a TCP ﬂow in both directions (from sender to
`receiver and vice-versa). The accurate classiﬁcation of ﬂows based
`on information contained only in the ﬁrst few packets opens new
`possibilities for online classiﬁcation (i.e. security, SLAs, etc.). Our
`method is in sharp contrast with the current state of the art.
`
`2. STATE OF THE ART
`After port-based classiﬁcation has been debunked and given the
`limitations of searching payloads for signatures, there has been a
`new trend to classify trafﬁc based on summarized ﬂow information
`such as duration, number of packets and mean inter-arrival time [2],
`[5], [6], [3]. BLINC [4] introduces a new approach for trafﬁc clas-
`siﬁcation. It associates Internet hosts with applications. Instead of
`studying TCP (or UDP) ﬂows individually, it looks at all the ﬂows
`generated by speciﬁc hosts. BLINC is able to accurately associate
`hosts with the services they provide or use (application server, web
`client, etc.). However, it cannot classify a single TCP ﬂow.
`All classiﬁcation techniques that use ﬂow statistics can only iden-
`tify the nature of a ﬂow when the ﬂow is ﬁnished. BLINC has to
`gather information from several ﬂows for each host before it can
`decide on the role of a host. These requirements prevent the use
`of these methods online. In contrast, our method relies only on the
`ﬁrst few packets of a TCP ﬂow. This early classiﬁcation is essen-
`tial to allow automatic blocking, ﬁltering, or recording of speciﬁc
`applications. It also limits the amount of memory required to store
`information associated with each ﬂow.
`
`3. KEY POINTS
`This section discusses our assumptions and the two main insights
`of our method: the analysis of only the ﬁrst few packets of the ﬂow,
`and the use of unsupervised clustering to detect a set of ﬂows that
`share a common behavior.
`3.1 Assumptions
`Our goal is to build a classiﬁer that runs online and accurately
`identiﬁes the application associated with a TCP ﬂow as early as
`possible. Our work relies on the following assumptions:
`
`ACM SIGCOMM Computer Communication Review
`
`23
`
`Volume 36, Number 2, April 2006
`
`Splunk Inc. Exhibit 1044 Page 1
`
`

`SizeofPacket#2
`
`~*%800
`
`-150 -100
`
`RMHON
`
`oO
`-5S0
`Size of Packet #1
`
`100
`
`XMMXKKKKKXK
`
`
`Figure 1: SMTP versus Edonkey flows.
`
`transfer from client to server: and(iii) upload flows are file transfers
`from server to client.
`
`SizeofPacket#2
`
`e Access to both directions of a TCP connection. Edge net-
`works usually connect to the Internet at few locations. The
`vast majority in one or two locations. We assumethatthe net-
`work administrator monitors both directions ofall the edge
`links.
`
`e Offline access to flows of a numberof applications. Our
`method only needs packet size information. However, for
`training purpose, we need to knowthe application associated
`with a sample of flows. There are twoalternative techniques
`for collecting this training trace:
`(i) directly from one of
`the monitored links using high-end packet monitoring cards
`(we can then analyze the payloads offline to identify the ap-
`plication corresponding to a given flow) ; or (ii) generated
`in a controlled fashion by explicitly launching instances of
`each of the desired applications and collecting all packets
`exchanged.
`
`e Online access to header of all packets. We only use the
`information on the packet header to do our onlineclassifica-
`tion, which is much smaller andoffixed size.
`
`3.2
`
`Intuition behind the method
`
`Our goal is to identify the application associated with a TCP flow
`as early as possible. We design a classifier that uses information
`available in the header ofthe first P packets ofthe flow to identify
`the application associated with a flow. Specifically, we only use
`the size of the data packets, not using TCP control packets (SYN,
`ACKs,etc.). The size of the first few packets is a good predictor of
`the application associated with a flow because it captures the appli-
`cation’s negotiation phase, which is usually a pre-defined sequence
`ofmessages and distinct amongapplications.
`We illustrate our approach with SMTP and Edonkey flows cap-
`tured at the edge of a large university network to shed light on
`the behaviors of applications. Figure 1 presents the size of the
`first packet versus the size of the second (negative values repre-
`sents packets sent from the TCP server to the client and positive
`from client to server) for every TCP flow ofthese two applications.
`Edonkey clients initialize request for files, hence the positive sign
`for the size of the first packet, whereas the SMTP server initiates
`negotiation. This simple example illustrates the expressiveness of
`this two-dimensionalrepresentation ofTCP flowsto distinguish the
`application associated with each flow. In fact, the traces we study
`consist of ten major applications and we found that 5 packets were
`enough to distinguish their behaviors.
`
`3.3. Unsupervised clustering
`To extract groups offlowsthat share a common communication
`behavior, we borrow techniques from machine learning. We use
`unsupervised clustering as it relies on unlabeled data samples (in
`our case, the size of the first few packets of a TCP flow) to find
`natural groups (or clusters) in a dataset, whereas supervised clus-
`tering uses a pre-labeled set of samples to construct a model for
`each cluster.
`
`Although thetraffic classification mechanism presented in [3]
`uses Naive Bayes Classifiers, an example of supervised clustering,
`unsupervised learning is more appropriatefor traffic classification
`because it doesnotrely on pre-defined classes. A single application
`can have multiple behaviors which should be modeled separately.
`For example, Figure 2 presents the size ofthe first packet versus the
`size ofthe second ofa set ofFTP flows. This figure shows that FTP
`has three very distinct behaviors: (i) command flows correspond
`to a control flow, in which a data connection is negotiated and in
`which clients ask for data; (ii) download flows corresponds to files
`
`Figure 2: Example of a multi-modal application: FTP.
`
`FTP is an example of a multi-modal application, and we found
`many other applications with several behaviors in our data. Unsu-
`pervised learning finds groups of flows that share the same behav-
`ior, which are easier to model.
`
`4. METHODOLOGY
`
`Basedon the observations from the previous section, we propose
`a traffic classification mechanism that works in two phases: an of-
`fline learning phase and an online traffic classification phase.First,
`the learning phase usesa set of training data to cluster TCP flows
`that share a commonbehavior. Then, thetraffic classification phase
`uses these classes to determine the application associated with each
`TCP flow. To verify that the behaviors of applications do not vary
`with time, we use one data set for the learning phase and another
`for the evaluation of the derived classification method. These two
`traces were collected several month apart on the samelink.
`
`ACM SIGCOMM Computer Communication Review
`
`24
`
`Volume 36, Number2, April 2006
`
`Splunk Inc.
`
`Exhibit1044
`
`Page2
`
`Splunk Inc. Exhibit 1044 Page 2
`
`

` Application label
`
`Figure 3: Design of online classifier.
`
`4.1 Learning phase
`The learning phase is performed offline and consists in detect-
`ing common behaviors in a set of flows. It takes as input a short
`packet trace with TCP flows from a mix ofapplications. Our study
`uses a one-hour trace collected at the edge of a university network.
`This trace also includes packet payloads. We apply a packet trace
`analyzer to this data set to extract TCP flows. First, to group flows
`into clusters, we need to evaluate the similarity between them. We
`associate each flow with a spatial representation based onthe sizes
`ofits first P packets. The representation we use is quite simple:
`flows are represented by points in a P-dimensional space where
`each packet is associated to a dimension. The coordinate on the
`dimensionpis the size of packet p in the flow. In order to differ-
`entiate packets sent by the TCP-server and the TCP-client, packets
`sent by the server have a negative coordinate. The similarity be-
`tween flowsis then evaluated with a simple metric: the Euclidean
`distance between their associated spatial representation.
`Tn order to extract common behaviors in such a space we rely on
`the well-known K-Means algorithm [7]. Once we have found nat-
`ural clusters in the training set, we model and analyze eachcluster.
`The modeling step consists in defining a heuristic to associate a
`new flow to a cluster. We use a simple heuristic: we compute the
`euclidean distance between the new flow andthe center ofeach pre-
`defined cluster, and then choose the cluster for which this distance
`is minimum. Theanalysis step consists in examining the composi-
`tion of each cluster. We processthe training flows with a payload
`analysis tool [8] that is a able to accurately determine the applica-
`tion associated with each flow. This step allows us to decide how
`to label a flow onceit has be assigned to a cluster.
`Westudied different numbers of packets and found that the best
`separation ofapplications amongclusters was observed with 5 pack-
`ets. We also tried different numbers of clusters for the K-Means
`
`5. EARLY EVALUATION RESULTS
`
`This section gives a proof of concept of our flow classification
`method. Further work is necessary to conclude in the applicability
`of the method. We build a prototype classifier using Matlab. First,
`we use a training data set for the learning phase. Based on the
`cluster descriptions and cluster compositions found in the learning
`phase, we emulate the online classification using a trace collected
`six month afterwards on the same link. For validation purposes, we
`label this trace using the payload analysis tool used in the learning
`phase [8].
`Table 1 presents the accuracy of our classifier when it consid-
`ers the first five packets of each flow. We measure accuracy by
`comparing the application labels given by our classifier to the ones
`obtained through payload analysis. We were ableto correctly iden-
`tify more than 80% offlows ofalmost all of the applications. The
`only exception is POP3. Our classifier labels 86.8% of POP3 flows
`as NNTP and 12.6% as SMTP, because POP3 flows always belong
`to clusters where POP3is not the dominant application. This error
`can be easily fixed if we consider extra information.
`In this ex-
`ample, ifwe had used the destination port to label POP3 flows, we
`would have achieved over 90% accuracy. We are currently working
`on more sophisticated labeling techniques.
`
`Application Accuracy
`edonkey
`84.2%
`fp
`87%
`http
`99%
`kazaa
`95.24%
`nntp
`99.6%
`pop3
`0%
`smtp
`84.4%
`ssh
`96.92%
`https
`81.8%
`pop3s
`89.8%
`
`Table 1: Accuracy offlow classification.
`
`6. LIMITATIONS AND CHALLENGES
`The initial results observed with our method on a small trace are
`encouraging. The method is promising as it allows early classifica-
`
`algorithm and found that, for 5 packets, 50 clusters was the best
`trade-offbetween behavior separation and complexity.
`The learning phase outputs two sets: one with the description of
`each cluster (or the center of the cluster) and the other with their
`composition (or the set of applications represented in the cluster).
`Weuse both these sets to classify flows online.
`
`4.2. Online classification
`
`Figure 3 presents the structure of the online classifier. This clas-
`sifier can run either at a managementhost that has online access
`to packet headers or in a network processorat the router. We as-
`sume that the network administrators deploy a monitoring card that
`is able to process the header of all packets traversing the link [9].
`This process is very light since it only involves retrieving informa-
`tion from the IP and TCP headers (the size of the packet and the
`flow id).
`Our classifier takes as input the series ofpacket headers for both
`directions of an edge link. A packet analyzer extracts the 5-tuple
`(protocol, source IP, destination IP, source port, destination port)
`and the packet size. The analyzer filters out control traffic (the
`three packets of the TCP handshake and ACKs without payload)
`andstores the size of every packet in both directions ofthe connec-
`tion. When it has the size for the first P packets ofthe connection, it
`sends this information to the flow conversion module, which maps
`the new flow to a spatial representation. Then, the cluster assign-
`ment module searches all the cluster descriptions to find the bestfit
`for the new flow and the application identification module selects
`which application is the most likely for the flow given the set of
`applications that compose the cluster. We chose a simple heuristic
`to associate an unknown flow with an application: it is labeled with
`the application that is the most commonin the cluster.
`
`ACM SIGCOMM Computer Communication Review
`
`25
`
`Volume 36, Number2, April 2006
`
`Splunk Inc.
`
`Exhibit1044
`
`Page3
`
`Splunk Inc. Exhibit 1044 Page 3
`
`

`7. REFERENCES
`[1] T. Karagiannis, A. Broido, N. Brownlee, K. Claffy, and
`M. Faloutsos, “Is P2P dying or just hiding?,” in IEEE
`Globecom, 2004.
`[2] M. Roughan, S. Sen, O. Spatscheck, and N. Dufﬁeld,
`“Class-of-service mapping for QoS: A statistical
`signature-based approach to IP trafﬁc classiﬁcation,” in
`Internet Measurement Conference, 2004.
`[3] A. Moore and D. Zuev, “Internet trafﬁc classiﬁcation using
`bayesian analysis,” in ACM SIGMETRICS, 2005.
`[4] T. Karagiannis, D. Papagiannaki, and M. Faloutsos,
`“BLINC: Multilevel trafﬁc classiﬁcation in the dark,” in
`ACM SIGCOMM, 2005.
`[5] A. McGregor, M. Hall, P. Lorier, and J. Brunskill, “Flow
`clustering using machine learning techniques,” in Passive
`and Active Measurement Workshop, 2004.
`[6] D. Zuev and A. Moore, “Trafﬁc classiﬁcation using a
`statistical approach,” in Passive and Active Measurement
`Workshop, 2005.
`[7] J. McQueen, “Some methods for classiﬁcation and analysis
`of multivariate tions,” in Symposium on Mathematical
`Statistics and Probability, 1967.
`[8] Qosmos, “www.qosmos.com.”
`[9] Endace, “www.endace.com.”
`[10] N. Hohn and D. Veitch, “Inverting sampled trafﬁc,” in
`Internet Measurement Conference, 2003.
`
`tion of applications and is quite simple. However, the method has
`some limitations that we discuss below. Most of these limitations
`are easy to overcome, while others are more fundamental and affect
`most classiﬁcation methods to date.
`Multi-homed networks. Large networks often have multiple
`connections to the Internet.
`In this case, we can extend our ap-
`proach to monitor all access links and aggregate information on a
`machine where the classiﬁcation will take place.
`Packet order. In IP networks, packets may be out of order or
`may appear more than once. Out-of-order packets will change the
`spatial representation of the ﬂow, which will impact the quality of
`our classiﬁcation. Fortunately, we only need the ﬁrst ﬁve packets
`to arrive in order, which is very likely. On the network we studied,
`less than 4% of TCP ﬂows had an out-of-order packet within the
`ﬁrst ﬁve data packets of the ﬂow.
`Sampling. On high speed links, monitors cannot collect all
`packets. The accuracy of our method will degrade fairly quickly
`under packet sampling. If instead the network adopts ﬂow sam-
`pling [10], then our method will work unaltered.
`Applications with similar behaviors. If two distinct applica-
`tions start by exchanging ﬁve packets of approximately the same
`size, then we classify both with the same label. To solve this prob-
`lem, we are working on more sophisticated labeling heuristics that
`take into account other pieces of information available in the ﬁrst
`few packets (such as port numbers and inter-arrival times).
`Applications with unknown behaviors. The learning phase
`gives a model for the trafﬁc present in the training data. Using this
`model on other networks could be inefﬁcient. Although a speciﬁc
`behavior for an application precisely described on a trace should
`not change when observed on another network or at a different
`time, a new network can (and most likely will) have different ap-
`plication behaviors. We can identify new behaviors using limited
`cluster radiuses. Using this limit, we return an accurate application
`label for previously-deﬁned behaviors and classify all ﬂows that are
`not represented in the training data as “unknown”.
`Short ﬂows. It is well known that the trafﬁc is made of a large
`majority of short ﬂows and a small number of very large ﬂows. We
`need to test our method on these very short ﬂows. In this scenario, if
`it is successful, it will not work better than most other techniques.
`However, previous classiﬁcation techniques have not speciﬁcally
`studied these short ﬂows and they may exhibit some limitations
`that have not been observed yet.
`False matches. Most classiﬁers to date are based on heuristics
`and may misclassify a ﬂow. This possibility of error, even if small,
`prevents the use of trafﬁc classiﬁcation techniques to automatically
`ﬁlter or block ﬂows. Network administrators can use our method
`to identify suspicious ﬂows early and store all their packets. They
`can later audit the stored data.
`Evasion. The main challenge to trafﬁc classiﬁcation techniques
`is evasion. For instance, an “attacker” could easily evade our method
`by padding packet payloads in order to modify sizes. However, all
`classiﬁcation methods can be evaded: payload analysis tools can-
`not classify encrypted packets, port-based methods are deceived by
`a simple change of port, and approaches relying on summarized
`ﬂow information are sensitive to simple alterations of packet sizes
`and inter-arrival times.
`To summarize, we now need to analyze our method on a much
`broader panel of traces. This is a difﬁcult task as (i) we need the full
`trace with full payload for validation purposes (such traces are rare
`in the community); and (ii) we need more sophisticated techniques
`to label a ﬂow after it has been assigned to a cluster. We are also
`working on circumventing the limitations and the weaknesses of
`our method.
`
`ACM SIGCOMM Computer Communication Review
`
`26
`
`Volume 36, Number 2, April 2006
`
`Splunk Inc. Exhibit 1044 Page 4
`
`

This document is available on Docket Alarm but you must sign up to view it.

Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

Up-to-date information for this case.
Email alerts whenever there is an update.
Full text search for other cases.
Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.

Access Government Site

We are redirecting you
to a mobile optimized page.

Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket

Supplemental Search

Search for PTAB Motions

PTAB Analytics

TTAB Analytics

Basic Search

Filters

Party Search

Advanced

Selected Courts

Recently Selected Courts

Find PTAB Decisions

PTAB Analytics

Special PTAB Alerts

Orange Book

Directly Search Federal Courts

Search Trademark ...

This document is available on Docket Alarm but you must sign up to view it.

Accessing this document will incur an additional charge of $.

Still Working On It

A few More Minutes ... Still Working

This document could not be displayed.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

One Moment Please

Your document is on its way!

Sealed Document

We are redirecting youto a mobile optimized page.

Document Unreadable or Corrupt

We are unable to display this document.

STEP 2 of 2

Choose your membership type

Flat-Fee

Pay-As-You-Go

Add your payment information

Login or Join

Enter your corporate Email

Thousands of your peers are saving time and gaining a competitive advantage with Docket Alarm.

Join Docket Alarm to perform smarter legal research.

Download this document and millions of others instantly with a Docket Alarm membership.

Join Docket Alarm and start performing smarter legal research.

Start tracking this docket instantly with a Docket Alarm membership.

Join thousands of your peers and start performing smarter legal research.

STEP 1 of 2

Millions of Documents | 15 Seconds to Signup

Hi !

Welcome to Docket Alarm

Welcome to Docket Alarm!

Explore Litigation Insights andManage Your Cases

Reset Password

What is PACER?

Why do I need it?

What will I be charged?

Do other courts have fees?

Basic Free Access

Welcome

Thank you

Check Firm Account

We are redirecting you
to a mobile optimized page.

Explore Litigation Insights and
Manage Your Cases