throbber

`
`Anagram: A Content Anomaly Detector Resistant to
`Mimicry Attack1
`Ke Wang Janak J. Parekh Salvatore J. Stolfo
`
`Computer Science Department, Columbia University
`500 West 120th Street, New York, NY, 10027
`{kewang, janak, sal}@cs.columbia.edu
`
`Abstract. In this paper, we present Anagram, a content anomaly detector that
`models a mixture of high-order n-grams (n > 1) designed to detect anomalous and
`“suspicious” network packet payloads. By using higher-order n-grams, Anagram
`can detect significant anomalous byte sequences and generate robust signatures of
`validated malicious packet content. The Anagram content models are imple-
`mented using highly efficient Bloom filters, reducing space requirements and ena-
`bling privacy-preserving cross-site correlation. The sensor models the distinct
`content flow of a network or host using a semi-supervised training regimen. Pre-
`viously known exploits, extracted from the signatures of an IDS, are likewise
`modeled in a Bloom filter and are used during training as well as detection time.
`We demonstrate that Anagram can identify anomalous traffic with high accuracy
`and low false positive rates. Anagram’s high-order n-gram analysis technique is
`also resilient against simple mimicry attacks that blend exploits with “normal” ap-
`pearing byte padding, such as the blended polymorphic attack recently demon-
`strated in [1]. We discuss randomized n-gram models, which further raises the bar
`and makes it more difficult for attackers to build precise packet structures to evade
`Anagram even if they know the distribution of the local site content flow. Finally,
`Anagram’s speed and high detection rate makes it valuable not only as a stand-
`alone sensor, but also as a network anomaly flow classifier in an instrumented
`fault-tolerant host-based environment; this enables significant cost amortization
`and the possibility of a “symbiotic” feedback loop that can improve accuracy and
`reduce false positive rates over time.
`1 Introduction
`The current generation of Network Intrusion Detection Systems (NIDS) are typically ill-
`suited for stealthy worms and targeted attacks. Misuse and anomaly detectors that ana-
`lyze packet headers and traffic flow statistics may be too slow to react to reliably detect
`worms that are designed to evade detection by shaping their behavior to look like le-
`gitimate traffic patterns [2]. Furthermore, signature scanners are vulnerable to zero-day
`exploits [3] and polymorphic worms/stealthy attacks with obfuscated exploit code [4].
`Consequently, there has been an increasing focus on payload analysis to detect the early
`onset of a worm or targeted attack. Ideally, one would hope to detect the very first pack-
`ets of an attack, rather than accumulating sufficient statistics about connection flows to
`detect a zero-day attack.
`A number of researchers (e.g., [5-8]) have focused on payload-based anomaly detec-
`tion. Approaches that have been studied include specification-based anomaly detection
`[7] as well as techniques that aim to detect “code-like” byte sequences in network pay-
`loads [6, 9]. In our work, we have focused on automated statistical learning approaches
`to efficiently train content models on a site’s “normal” traffic flow without requiring
`
`1 This work has been partially supported by a grant with the Army Research Office, No. DA
`W911NF-04-1-0442.
`
`
`
`Exhibit Page 1
`
`Columbia Ex. 2017
`Symantec v. Columbia
`IPR2015-00375
`
`

`

`
`significant semantic analysis. Ideally, we seek to design a sensor that automatically
`learns the characteristics of “normal” attack-free data for any application, service, net-
`work or host. Consequently, a model learned for “normal” attack-free data may be used
`to identify “abnormal” or suspicious traffic that would be subjected to further analysis
`to validate whether the data embodies a new attack.
` In our previous work we proposed PAYL (short for “PAYLoad anomaly detection”)
`that modeled the “normal” attack-free traffic of a network site as 1-gram, byte-value
`frequency distributions [10], and demonstrated an ability to effectively detect worm
`behavior via ingress/egress and cross-site correlation [11]. The sensor was designed to
`be language-independent, requiring no syntactic analysis of the byte stream. Further-
`more, PAYL was designed to be efficient and scalable for high-speed networks and
`applicable to any network service. Various experiments demonstrated that PAYL
`achieved a high detection rate and with low false positives for “typical” worms and
`exploits available at the time.
`However, most researchers2 correctly suspected that PAYL’s simplicity would be
`easily blinded by mimicry attacks. Kolesnikov, Dagon and Lee [1] demonstrated a new
`blended, polymorphic worm designed to evade detection by PAYL and other frequency
`distribution-based anomaly detectors. This demonstration represents a new class of
`“smart worms” that launch their attack by first sniffing traffic and shaping the datagram
`to the statistics specific to a given site to appear normal. The same principles may be
`applied to the propagation strategy as well as in, for example, parasitic worms. Since
`PAYL only models 1-gram distributions, it can be easily evaded with proper padding to
`avoid detection of anomalous byte sequences. As a countermeasure, we conjecture that
`higher-order n-gram modeling may likely detect these anomalous byte sequences. Un-
`fortunately, computing a full frequency distribution for higher order n-grams is compu-
`tationally and memory-wise infeasible, and would require a prohibitively long training
`period even for modest gram sizes.
`In this paper we present a new sensor, Anagram, which introduces the use of Bloom
`filters and a binary-based detection model. Anagram does not compute frequency distri-
`butions of normal content flows; instead, it trains its model by storing all of the distinct
`n-grams observed during training in a Bloom filter without counting the occurrences of
`these n-grams. Anagram also stores n-grams extracted from known malicious packets in
`a second bad content Bloom filter, acquired by extracting n-grams from openly avail-
`able worm detection rules, such as the latest Snort rulesets [12]. At detection time,
`packets are scored by the sensor on the basis of the number of unobserved n-grams the
`packet contains. The score is weighted by the number of malicious n-grams it contains
`as well. In this paper, we demonstrate that this semi-supervised strategy attains re-
`markably high detection and low false positive rates, in some cases 100% detection with
`less than 0.006% false positive rate (per packet).
`The use of Bloom filters makes Anagram memory and computationally efficient and
`allows for the modeling of a mixture of different sizes of n-grams extracted from packet
`payloads, i.e. an Anagram model need not contain samples of a fixed size gram. This
`strategy is demonstrated to exceed PAYL in both detection and false positives rates.
`Furthermore, Anagram’s modeling technique is easier to train, and allows for the esti-
`mation of when the sensor has been trained enough for deployment. The Bloom filter
`model representation also provides the added benefit of preserving the privacy of shared
`content models and alerts for cross-site correlation.
`
`
`2 Including ourselves; a proposal to study counter-evasion techniques led to the work reported
`herein.
`
`
`
`Exhibit Page 2
`
`Columbia Ex. 2017
`Symantec v. Columbia
`IPR2015-00375
`
`

`

`
`Of particular interest here is that Anagram is shown to be robust against existing
`mimicry attack approaches. We demonstrate Anagram’s ability to counter the simple
`mimicry attacks levied against PAYL. Furthermore, Anagram is designed to defeat
`training and mimicry attacks by using randomized n-gram modeling. The approach
`presented raises the bar against the enemy, making it far more difficult to design an n-
`gram attack against Anagram. By randomizing the portion of packets that Anagram
`extracts and models, mimicry attackers cannot easily guess how and where to pad mali-
`cious content to avoid detection. We also describe the use of a feedback loop between
`the Anagram sensor and host-based detectors, thereby updating Anagram models over
`time and improving its detection performance. Thus, the combination of model ran-
`domization and a feedback loop makes it more difficult to evade detection by training
`and mimicry attacks. The contributions of this paper include:
`• A new statistical, language-independent, efficient content-based anomaly de-
`tector based upon semi-supervised training of higher-order n-gram analysis
`that is shown to be resistant against existing mimicry attacks. The sensor does
`not rely upon a specification or semantic analysis of the target applications;
`• Robustness against future mimicry attacks by the use of a novel, low-overhead
`randomized testing strategy, making it difficult for the attacker to guess where
`or how to pad content;
`• Development of a run-time measure of the “stability” of a network’s content
`flow, providing a reasonable estimate of when the sensor has been well enough
`trained and is ready for deployment.
`• A robust means of representing content-based alerts for cross-site alert sharing
`and robust signature generation using a Bloom Filter representation of anoma-
`lous byte sequences3;
`• A new defensive strategy showing how a symbiotic relationship between host-
`based sensors and a content-based sensor can adapt over time to improve accu-
`racy of modeling a site’s content flow.
`The rest of the paper is organized as follows. Section 2 details the Anagram sensor
`and its relevant algorithms. Performance, detection rate, and false positive characteris-
`tics are presented testing Anagram against real network traffic traces infected with a
`collection of worms and viruses. Section 3 describes Anagram’s robustness against the
`cleverly crafted blended polymorphic worm reported in [1], previews the possibility of
`new customized mimicry attacks being crafted against Anagram, and describes ran-
`domization techniques for defeating such attacks. In section 4 we present the concept of
`coupling a “shadow server” with Anagram and discuss how the combination can effec-
`tively support the techniques presented in section 3, as well as support robust signature
`generation and patch generation. Section 5 discusses related work, while section 6 con-
`cludes the paper with a call for collaboration among researchers at distributed sites.
`2 Anagram – Modeling a Mixture of N-grams
`Anagram’s approach to network payload anomaly detection uses a mixture of higher
`order n-grams (n>1) to model and test network traffic content. N-gram analysis is a
`well-known technique has been used in a variety of tasks, such as system call monitor-
`ing [15-17]. In Anagram, the n-grams are generated by sliding windows of arbitrary
`lengths over a stream of bytes, which can be per network packet, per request session, or
`other type of data unit.
`
`3 The representation also permits patch generation systems to share anomalous data for local
`patch generation across an “application community” [13, 14].
`
`
`
`Exhibit Page 3
`
`Columbia Ex. 2017
`Symantec v. Columbia
`IPR2015-00375
`
`

`

`
`In our previous work on network payload anomaly detection, PAYL [10, 11], we
`modeled the length-conditioned 1-gram frequency distribution of packet payloads, and
`tested new packets by computing the Mahalanobis distance of the test packet against the
`model. This approach is effective at capturing attacks that display abnormal byte distri-
`butions, but it is likely to miss well-crafted attacks that focus on simple CPU instruc-
`tions and that are crafted to resemble normal byte distributions. For instance, although a
`standard CodeRed II’s buffer overflow exploit uses a large number of “N” or “X” char-
`acters and so appears as a peak in the frequency distribution, [18] shows that the buffer
`can instead be padded with nearly any random byte sequence without affecting the at-
`tack vector. Another example is the following simple phpBB forum attack:
`GET /modules/Forums/admin/admin_styles.php?phpbb_root_path=http
`://81.174.26.111/cmd.gif?&cmd=cd%20/tmp;wget%20216.15.209.4/cri
`man;chmod%20744%20criman;./criman;echo%20YYY;echo|..HTTP/1.1.Ho
`st:.128.59.16.26.User-Agent:.Mozilla/4.0.(compatible;.MSIE.6.0;
`.Windows.NT.5.1;)..
`In such situations, the abnormal byte distribution model is insufficient by itself to
`identify these attack vectors as abnormal data. However, invariants remain in the packet
`payloads: the exploit code, the sequence of commands, or the special URL that should
`not appear in the normal content flow to the target application. By modeling higher
`order n-grams, Anagram captures the order dependence of byte sequences in the net-
`work payload, enabling it to capture more subtle attacks. The core hypothesis is that any
`new, zero-day exploit will contain a portion of data that has never before been delivered
`to the application. These subsequences of new, distinct byte values will manifest as
`anomalous” n-grams that Anagram is designed to efficiently and rapidly detect.4
`In the following sections we will give a detailed description of Anagram, which out-
`performs PAYL in the following respects:
`• Accuracy in detecting anomalous payloads, even carefully crafted ‘mimicry at-
`tacks’ with a demonstrably lower false positive rate;
`• Computational efficiency in detection by the use of fast (and incremental, linear-
`time) hashing in its Bloom filter implementation;
`• Model space efficiency since PAYL's multiple-centroid modeling is no longer
`necessary, and Bloom filters are compact;
`• Fast correlation of multiple alerts while preserving privacy as collaborating sites
`exchange Bloom filter representations of common anomalous payloads;
`• The generation of robust signatures via cross-site correlation for early warning
`and detection of new zero day attacks.
`In the following sections, we will describe the mechanisms in detail and present ex-
`perimental results of testing Anagram against network traces sniffed from our local
`LAN.
`
`2.1 High Order N-gram Payload Model
`While higher order n-grams contain more information about payloads, the feature
`space grows exponentially as n increases. Comparing an n-gram frequency distribution
`against a model is infeasible since the training data is simply too sparse; the length of a
`packet is too small compared to the total feature space size of a higher-order n-gram.
`
`
`4 Note that some attacks may not include byte sequences that are “code-like”, and hence testing
`content for such code-like data subsequences is not guaranteed to cover all attack cases. The
`language independence of anomalous n-grams may be broadly applicable to these and other at-
`tacks.
`
`
`
`Exhibit Page 4
`
`Columbia Ex. 2017
`Symantec v. Columbia
`IPR2015-00375
`
`

`

`
`One TCP packet may contain only a thousand or so n-grams, while the feature space
`size is 256n. Clearly, with increasing n, generating sufficient frequency statistics to
`estimate the true empirical distribution accurately is simply not possible in a reasonable
`amount of time.
`In Anagram, we therefore do not model the frequency distribution of each n-gram.
`Rather, we observe each distinct n-gram seen in the training data and record each in a
`space efficient Bloom filter. Once the training phase terminates, each packet is scored
`by measuring the number of n-grams it did not observe in the training phase. Hence, a
`packet is scored by the following formula, where
`newN is the number of new n-grams
`not seen before and T is the total number of n-grams in the packet:
`N
`
`Score
`]1,0[∈
`new
`=
`T
`At first glance, the frequency-based n-gram distribution may contain more informa-
`tion about packet content; one might suspect it would model data more accurately and
`perform better at detecting anomalous data, but since the training data is sparse, this
`alternative “binary-based model” performs significantly better than the frequency-based
`approach given the same amount of training data.
`We analyzed the network traffic for the Columbia Computer Science website and, as
`expected, a small portion of the n-grams appear frequently while there is a long “tail” of
`n-grams that appear very infrequently. This can be seen in Table 1, which displays the
`percentage of the n-grams by their frequency counts for 90 hours of CS web traffic.
`Since a significant number of n-grams have a small frequency count, and the number of
`n-grams in a packet is very small relative to the whole feature space, the frequency-
`distribution model incurs relatively high false positives. Thus, the binary-based model
`(simply recording the distinct n-grams seen during training) provides a reasonable esti-
`mate of how “normal” a packet may be. This is a rather surprising observation; as we
`will demonstrate, it works very well in practice. The conjecture is that true attacks will
`be delivered in packets that contain many more n-grams not observed in training than
`“normal” packets used to train the model. After all, a true zero-day attack must deliver
`data to a server application that has never been processed by that application before.
`Hence, the data exercising the vulnerability is very likely to be an n-gram of some size
`never before observed. By modeling a mixture of n-grams, we increase the likelihood of
`observing these anomalous grams.
`To validate this conjecture, we compare the ROC curves of the frequency-based ap-
`proach and the binary-based approach for the same datasets (representing equivalent
`training times) as displayed in figure 1. We collected the web traffic of two CS depart-
`mental web servers, www and www1; the former serves the department webpage, while
`the latter serves personal web pages. Traffic was collected for two different time periods:
`a period of sniffed traffic from the year 2004 and another dataset sniffed in 2006. The
`2004 datasets5 (www-04 and www1-04) contain 160 hours of traffic; the 2006 datasets
`(www-06 and www1-06) contain about 560 hours. We tested for the detection of several
`real worms and viruses: CodeRed, CodeRed II, WebDAV, Mirela, a php forum attack, and a
`worm that exploits the IIS Windows media service, the nsiislog.dll buffer overflow vulner-
`ability (MS03-022). These worm samples were collected from real traffic as they appeared in
`the wild, from both our own dataset and from a third-party.
`
`
`5 The 2004 dataset was used to train and test PAYL as reported in [11], and is used here for
`comparative evaluation.
`
`
`
`
`
`Exhibit Page 5
`
`Columbia Ex. 2017
`Symantec v. Columbia
`IPR2015-00375
`
`

`

`
`For the first experiment, we used 90 hours of www1-06 for training and 72 hours for
`testing. (Similar experiments on the other datasets display similar results, and we skip
`them here for brevity.) To make it easier for the reader to see the plots, the curves are
`plotted for the cases where the false positive rate is less than 0.03%. Both the detection
`rate and false positive rate are calculated based on packets with payloads; non-payload
`(e.g., control) packets were ignored. Notice that the detection rate in figure 1 is based on
`per packet, instead of per attack. Some attacks have multiple packets; while fragmenta-
`tion can result in a few packets appearing normal, we can still guarantee reliable attack
`detection over the entire set of packets. For example, for the IIS5 WebDAV attack, 5-
`grams detect 24 out of 25 packets as being anomalous. The only missed packet is the
`first packet, which contains the buffer overflow string “SEARCH /AAA……AA”; this
`is not the key exploit part of the attack. For further comparison, we list minimum false
`positive rates when detecting all attack attempts (where an attack is detected if at least
`80% of the packets are classified as anomalous) for both binary-based and frequency-
`based models in table 2.
`
`
`Table 1. The percentage of the observed unique n-grams for different frequencies of occur-
`rence for 90 hours of training traffic
`frequency count
`3-grams
`5-grams
`>=5
`58.05%
`39.13%
`2 to 4
`22.17%
`28.22%
`1
`19.78%
`32.65%
`
`7-grams
`32.53%
`28.48%
`38.99%
`
`
`
`Freq-based, 5-gram
`Binary-based, 5-gram
`Freq-based, 3-gram
`Binary-based, 3-gram
`
`100
`
`95
`
`90
`
`85
`
`80
`
`75
`
`Detection Rate (%) of all packets
`
`70
`0
`
`
`Fig. 1. ROC curves comparing the frequency-based and binary-based n-gram approach
`
`0.02
`0.01
`False Positive Rate (%)
`
`0.03
`
`Table 2. The false positive rate (%) of the two approaches using different n-grams when
`achieving 100% detection rate, www1-06 train/test dataset
`8-gram
`3-gram
`4-gram
`5-gram
`6-gram
`7-gram
`
`Freq-based 30.40205
`0.18559
`0.08858
`0.13427
`0.30792 0.41477
`Binary-
`0.02109
`0.01406
`0.00634
`0.00703
`0.00914 0.00914
`based
`
`
`The binary-based approach yields significantly better results than the frequency-
`based approach. When a 100% detection rate is achieved for the packet traces analyzed,
`the false positive rate of the binary-based approach is at least one order of magnitude
`less than the frequency-based approach. The relatively high false positive rate of the
`
`
`
`Exhibit Page 6
`
`Columbia Ex. 2017
`Symantec v. Columbia
`IPR2015-00375
`
`

`

`
`frequency-based approach suggests much more training is needed to capture accurate
`statistical information to be competitive. In addition, the extremely high false positive
`rate of the 3-gram frequency-based approach is due to the fact that the 3-grams of the
`php attack all appear frequently enough to make it hard to distinguish them from normal
`content packets. On the other hand, the binary-based approach used in Anagram results
`in far better performance. The 0.01% false positives average to about 1 alert per hour
`for www1 and about 0.6 alerts per hour for www. The result also shows that 5-grams
`and 6-grams give the best result, and we’ve found this to be true for others as well.
`As previously stated, Anagram may easily model a mixture of different n-grams sim-
`ply by storing these in the same Bloom filter. However, for larger n-grams additional
`training may be required; as we shall describe shortly, our modeling approach allows us
`to estimate when the sensor has been sufficiently trained.
`
`2.2 Model Size and Bloom Filters
`As previously stated, one significant issue when modeling with higher order n-grams is
`memory overhead. By leveraging the binary-based approach, we can use more memory-
`efficient set-based data structures to represent the set of observed n-grams. In particular,
`the Bloom filter (BF) [19] is a convenient tool to represent the binary model. Instead of
`using n bytes to represent the n-gram, or even 4 bytes for a 32-bit hash of the n-gram,
`the Bloom filter can represent a set entry with just a few bits, reducing memory re-
`quirements by an order of magnitude or more.
`A Bloom filter is essentially a bit array of m bits, where any individual bit i is set if
`the hash of an input value, mod m, is i. As with a hash table, a Bloom filter acts as a
`convenient one-way data structure that can contain many items, but generally is orders-
`of-magnitude smaller. Operations on a Bloom filter are O(1), keeping computational
`overhead low. A Bloom filter contains no false negatives, but may contain false posi-
`tives if collisions occur; the false positive rate can be optimized by changing the size of
`the bit array and by using multiple hash functions (and requiring all of them to be set for
`an item to be verified as present in the Bloom filter; in the rare case where one hash
`function collides between two elements, it’s highly unlikely a second or a third would
`also simultaneously collide). By using universal hash functions [20], we can minimize
`the probability of multiple collisions for n-grams in one packet (assuming each n-gram
`is statistically independent); the Bloom filter is therefore safe to use and does not nega-
`tively affect detection accuracy.
`
`
`
`
`Exhibit Page 7
`
`Columbia Ex. 2017
`Symantec v. Columbia
`IPR2015-00375
`
`

`

`
`
`Fig. 2. Inserting n-grams (here, n = 5) into a Bloom filter.
`
`
`
`2.2.1 Memory overhead
`While Bloom filters are comparatively small even when inserting a large number of
`entries, choosing the optimal size of a Bloom filter is nontrivial, since Anagram is not
`aware of a site’s distribution (and the number of unique n-grams) before building its
`model. Additionally, a Bloom filter’s size cannot be dynamically resized, as the hash
`values cannot be recomputed without the original underlying training data.6
`A large Bloom filter will waste memory, but small Bloom filters saturate more
`quickly, yielding higher false positive rates. It is worth pointing out that “large” is rela-
`tive; a 24-bit Bloom filter is capable of holding 224/nh elements, where nh represents the
`number of hash functions used, in only 2MB of memory, e.g., each n-gram inserted uses
`about 2.7 bits when 3 hash functions are used. Additionally, we can use traditional com-
`pression methods (e.g., LZW) for storing a sparse Bloom filter, which significantly
`reduces storage and transport costs. As discussed later in this paper, our experiments
`anecdotally suggest this Bloom filter is large enough for at least 5-grams, assuming a
`mostly textual distribution.
`The presence of binary data does significantly increase Bloom filter requirements; if
`allocating extra initial memory is undesirable, a “layered” approach can be employed,
`where new Bloom filters are created on demand as previous Bloom filters saturate, with
`a small constant-time overhead. It should be evident that Bloom filters can be trivially
`merged via bitwise ORing and compared via bitwise ANDing.
`
`2.2.2 Computation overhead
`The overhead of inserting or verifying a single item into a Bloom filter is O(1) per
`item inserted in the Bloom filter. However, the constant-time overhead to process an n-
`gram can become significant when inserting or checking n-grams over a large popula-
`tion of packets. This effect is magnified when an Anagram model uses different-size n-
`grams, as the same data is repeatedly hashed into different positions on the Bloom filter
`based on the various n-gram windows being processed. Couple this with the fact that
`
`
`6 While the training data can be kept, we do not consider this practical for actual deployment,
`especially if techniques like incremental and run-time training are used.
`
`
`
`Exhibit Page 8
`
`Columbia Ex. 2017
`Symantec v. Columbia
`IPR2015-00375
`
`

`

`
`Bloom filters need a good source of hashes, and Anagram’s largest computation over-
`head in both training and testing rapidly becomes the hash operation.
`To reduce this computation overhead, we make use of a cumulative universal hash
`function.
`A
`cumulative
`hash
`function
`fulfills
`the
`requirement
`(
`
`)= d h a( ),h b( )( ), where h represents our hash function, a and b are data (n-
`
`that h c a,b( )
`
`grams or fragments of them), c is a (bitwise) concatenation function, and d is a compo-
`sition function. Given such a hash function, we can avoid re-computing hashes when
`sliding the n-gram window and when using different window sizes, e.g., if we’ve
`hashed a 5-gram and need to generate the hash of a 7-gram, we can just hash the incre-
`mental 2 grams and combine that with the 5-gram hash value. A class of universal hash
`functions known as H3 [21] uses XOR as the composition function, which is very fast
`and lends itself well to our application. Thus, Anagram’s modeling engine may perform
`fast enough to scale to very high bandwidth environments. Dharmapurikar, et. al. de-
`scribe a similar technique [22] for static signature scanning in hardware.
`
`2.3 Learning Issues
`The huge feature space of higher order n-grams makes estimating the frequency distri-
`bution infeasible. In the case of representing distinct n-grams, we pose the following
`questions: how long do we need to train before deploying the sensor in detection mode?
`Further, how well can Anagram handle noisy training data since the model is binary-
`based?
`
`2.3.1 When is the model well trained?
`Since there are many distinct n-grams, many of which may take days or weeks to see,
`when can we say the model is well trained and ready to use? We first check the likeli-
`hood of seeing a new n-gram with additional training. Figure 2 shows the percentage of
`the new distinct n-grams out of every 10,000 content packets when we train for up to
`500 hours of traffic data. The key observation is that during the initial training period,
`one should expect to see many distinct n-grams. Over time, however, fewer distinct
`“never before seen” n-grams should be observed. Hence, for a given value of n, a par-
`ticular site should exhibit some rate of observing “new” distinct n-grams within its
`“normal” content flow. By estimating this rate, we can estimate how well the Anagram
`model has been trained. When the likelihood of seeing new n-grams in normal traffic is
`stable and low, the model is stable; we can then use the rate to help detect the attacks, as
`they should contain a higher percentage of unseen n-grams. Figure 3 plots the false
`positive rates of different models, varying in n-gram size and length of training time,
`when tested on the 72 hours of traffic immediately following the training data.
`
`
`
`Exhibit Page 9
`
`Columbia Ex. 2017
`Symantec v. Columbia
`IPR2015-00375
`
`

`

`3-grams
`5-grams
`7-grams
`
`0.03
`
`0.025
`
`0.02
`
`0.015
`
`0.01
`
`0.005
`
`0
`0
`
`1
`
`5
`4
`3
`2
`Training Dataset Length (in days)
`
`
`Fig. 4. False positive rate (with 100% detec-
`tion rate) as training time increases.
`
`6
`
`7
`
`False Positive Rate (%) when 100% Detection Rate
`
`3-grams
`5-grams
`7-grams
`
`50
`
`100
`
`300
`250
`200
`150
`Per 10000 content packets
`
`350
`
`400
`
`450
`
`0.01
`
`0.009
`
`0.008
`
`0.007
`
`0.006
`
`0.005
`
`0.004
`
`0.003
`
`0.002
`
`0.001
`
`0
`0
`
`
`
`Likelihood of seeing new n-grams
`
`Fig. 3. The likelihood of seeing new n-grams
`as training time increases.
`
`From the above plots, we can see that as the training time increases, the false positive
`rate generally goes down as the model is more complete. After some point, e.g. 4 days’
`in figure 3, there is no further significant gain, and the FP rate is sufficiently low.
`Higher order n-grams need a longer training time to build a good model, so 7-grams
`display a worse result than 5-grams given the same amount of training. While the 3-
`gram model is likely more complete with the same amount of training, it scores signifi-
`cantly worse: 3-gram false positives do not stem from inadequate training, but rather
`because 3-grams are not long enough to distinguish malicious byte sequences from
`normal ones.7
`In theory, Anagram should always improve with further training – if we can guaran-
`tee a clean training dataset, which is crucial for the binary-based approach. However,
`obtaining clean training data is not an easy task in practice. During our experiments,
`increased training eventually crosses a threshold where the false positive rate starts
`increasing, even if the training traffic has been filtered for all known attacks. The bi-
`nary-based approach has significant advantages in speed and memory, but it’s not toler-
`ant of noisy training, and manual cleanup is infeasible for large amounts of training data.
`We therefore introduce semi-supervised training in the next section to help Anagram be
`more robust against noisy data.
`
`2.3.2 Semi-supervised Learning
`The binary-based approach is simple and memory efficient, but too sensitive to noisy
`training data. One form of supervised training is quite simple. We utilize the signature
`content of Snort rules obtained from [12] and (a collection of about 500) virus samples
`to precompute a known “bad content model”. We build a bad content Bloom filter con-
`taining the n-grams that appear in the two collections, using all possible n that we may
`eventually train on (e.g., n=2~9 in our experiments). This model can be incrementally
`updated when new signatures have been released.
`It’s important to note, however, that signatures and viruses often contain some nor-
`mal n-grams, e.g., the ‘GET’ keyword for HTTP exploits. To remove these normal n-
`grams, we can maintain a small, known-clean dataset used to exclude normal traffic
`

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket