US007487544B2

(12) United States Patent
Schultz et al.

(10) Patent No.: US 7,487,544 B2
(45) Date of Patent: Feb. 3, 2009

(54) SYSTEM AND METHODS FOR DETECTION OF NEW MALICIOUS EXECUTABLES

(75) Inventors: Matthew G. Schultz, Ithaca, NY (US); Eleazar Eskin, Santa Monica, CA (US); Erez Zadok, Middle Island, NY (US); Manasi Bhattacharyya, Flushing, NY (US); Salvatore J. Stolfo, Ridgewood, NJ (US)

(73) Assignee: The Trustees of Columbia University in the City of New York, NY, NY (US)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 1122 days.

(21) Appl. No.: 10/208,432

(22) Filed: Jul. 30, 2002

5,832,208 A *  11/1998  Chen et al. ................ 726/24
6,016,546 A *   1/2000  Kephart et al. ............ 726/24
6,161,130 A *  12/2000  Horvitz et al. ............ 709/206
6,275,850 B1 *  8/2001  Beyda et al. .............. 709/206

(Continued)

OTHER PUBLICATIONS

Jeffrey O. Kephart and William C. Arnold, "Automatic Extraction of Computer Virus Signatures," 4th Virus Bulletin International Conference, pp. 178-184, 1994.

(Continued)

Primary Examiner: Gilberto Barron, Jr.
Assistant Examiner: Abdulhakim Nobahar
(74) Attorney, Agent, or Firm: Baker Botts LLP
(65) Prior Publication Data

US 2003/0065926 A1    Apr. 3, 2003

Related U.S. Application Data

(60) Provisional application No. 60/308,622, filed on Jul. 30, 2001; provisional application No. 60/308,623, filed on Jul. 30, 2001.

(51) Int. Cl.
    G06F 21/00 (2006.01)

(52) U.S. Cl. ........................................ 726/24; 713/188

(58) Field of Classification Search .............. 726/24, 13; 713/188; 709/206, 207, 225
    See application file for complete search history.

(56) References Cited

U.S. PATENT DOCUMENTS

5,452,442 A    9/1995  Kephart et al.
5,485,575 A *  1/1996  Chess et al. ................ 714/38
5,675,711 A   10/1997  Kephart et al.
5,765,170 A *  6/1998  Morikawa .................. 707/200

(57) ABSTRACT
A system and methods for detecting malicious executable attachments at an email processing application of a computer system using data mining techniques. The email processing application may be located at the server or at the client or host. The executable attachments are filtered from said email, and byte sequence features are extracted from the executable attachment. The executable attachments are classified by comparing the byte sequence feature of the executable attachment to a classification rule set derived from byte sequence features of a data set of known executables having a predetermined class in a set of classes, e.g., malicious or benign. The system is also able to classify executable attachments as borderline when the difference between the probability that the executable is malicious and the probability that the executable is benign is within a predetermined threshold. The system can notify the user when the number of borderline attachments exceeds the threshold in order to refine the classification rule set.

43 Claims, 7 Drawing Sheets
[Representative drawing: Email traffic → Learning Algorithm Executor]
Columbia Ex. 2015-1
Symantec v. Columbia
IPR2015-00375
U.S. PATENT DOCUMENTS

6,598,076 B1 *   7/2003  Chang et al. .............. 709/206
6,778,995 B1     8/2004  Gallivan
6,820,081 B1    11/2004  Kawai et al.
6,826,609 B1 *  11/2004  Smith et al. .............. 709/225
6,888,548 B1     5/2005  Gallivan
6,978,274 B1    12/2005  Gallivan et al.
7,035,876 B2     4/2006  Kawai et al.
7,080,076 B1     7/2006  Williamson et al.
2002/0059383 A1 *  5/2002  Katsuda .............. 709/206
2002/0065892 A1 *  5/2002  Malik ................ 709/206
OTHER PUBLICATIONS

R. Kohavi, "A study of cross-validation and bootstrap for accuracy estimation and model selection," IJCAI, 1995.

Ronald L. Rivest, "The MD5 Message Digest Algorithm," published as Internet RFC 1321, Apr. 1992. http://www.freesoft.org/CIE/RFC/1321/.

Stephen R. van den Berg and Philip Guenther, "Procmail," online publication, 2001. http://www.procmail.org.

Steve R. White, Morton Swimmer, Edward J. Pring, William C. Arnold, David M. Chess, and John F. Morar, "Anatomy of a Commercial-Grade Immune System," IBM Research White Paper, 1999.

Eleazar Eskin et al., "System and Methods for Intrusion Detection with Dynamic Window Sizes," filed Jul. 30, 2000, U.S. Appl. No. 10/208,402.

U.S. Appl. No. 10/352,343, filed Jan. 27, 2003, claiming priority to U.S. Appl. No. 60/351,857, filed Jan. 25, 2001, entitled "Behavior Based Anomaly Detection For Host-Based Systems For Detection Of Intrusion In Computer Systems," of Frank Apap, Andrew Honig, Shlomo Hershkop, Eleazar Eskin and Salvatore J. Stolfo.

U.S. Appl. No. 10/352,342, filed Jan. 27, 2003, claiming priority to U.S. Appl. No. 60/351,913, filed Jan. 25, 2002, entitled "Data Warehouse Architecture For Adaptive Model Generation Capability In Systems For Detecting Intrusion In Computer Systems," of Salvatore J. Stolfo, Eleazar Eskin, Matthew Miller, JuXin Zhang and Zhi-Da Zhong.

U.S. Appl. No. 10/327,811, filed Dec. 19, 2002, claiming priority to U.S. Appl. No. 60/342,872, filed Dec. 20, 2001, entitled "System And Methods for Detecting A Denial-Of-Service Attack On A Computer System," of Salvatore J. Stolfo, Shlomo Hershkop, Rahul Bhan, Suhail Mohiuddin and Eleazar Eskin.

U.S. Appl. No. 10/320,259, filed Dec. 16, 2002, claiming priority to U.S. Appl. No. 60/328,682, filed Oct. 11, 2001 and U.S. Appl. No. 60/352,894, filed Jan. 29, 2002, entitled "Methods of Unsupervised Anomaly Detection Using A Geometric Framework," of Eleazar Eskin, Salvatore J. Stolfo and Leonid Portnoy.

U.S. Appl. No. 10/269,718, filed Oct. 11, 2002, claiming priority to U.S. Appl. No. 60/328,682, filed Oct. 11, 2001 and U.S. Appl. No. 60/340,198, filed Dec. 14, 2001, entitled "Methods For Cost-Sensitive Modeling For Intrusion Detection," of Salvatore J. Stolfo, Wenke Lee, Wei Fan and Matthew Miller.

U.S. Appl. No. 10/269,694, filed Oct. 11, 2002, claiming priority to U.S. Appl. No. 60/328,682, filed Oct. 11, 2001 and U.S. Appl. No. 60/339,952, filed Dec. 13, 2001, entitled "System And Methods For Anomaly Detection And Adaptive Learning," of Wei Fan, Salvatore J. Stolfo.

U.S. Appl. No. 10/222,632, filed Aug. 16, 2002, claiming priority to U.S. Appl. No. 60/312,703, filed Aug. 16, 2001 and U.S. Appl. No. 60/340,197, filed Dec. 14, 2001, entitled "System And Methods For Detecting Malicious Email Transmission," of Salvatore J. Stolfo, Eleazar Eskin, Manasi Bhattacharyya and Matthew G. Schultz.

* cited by examiner
FIG. 1 [flow chart of detection model generation process 10: Assemble Dataset (12) → Data Set (14) → Label Data by Class (16) → Divide Data into Test Data and Training Data (18) → Extract Features from Data (20) → Train classifier based on Training data → Test classifier based on Test data → Classification Rule Set (24)]
FIG. 2 [sample hexadecimal byte-sequence features extracted from an executable]:

6468 775f 2e73 0a0d 0024 0000 0000 0000
4548 3c0f 026c 0009 0000 0000 0302 0004
0400 2800 3924 0001 0000 0004 0004 0006
0000 0040 0060 0216 0238 0244 02f5 0000
0001 0004 0000 0802 0032 1304 0000 030a

FIG. 3 [conjunction of DLL-name features]:

¬advapi32 ∧ avicap32 ∧ … ∧ winmm ∧ ¬wsock32

FIG. 4 [conjunction of DLL function-call features]:

advapi32.AdjustTokenPrivileges() ∧ advapi32.GetFileSecurityA() ∧ … ∧ wsock32.recv() ∧ wsock32.send()

FIG. 5 [counts of function calls used within each DLL]:

advapi32 = 2 ∧ avicap32 = 10 ∧ … ∧ winmm = 8 ∧ wsock32 = 2
FIG. 6 [sample classification rules found by the RIPPER algorithm]:

malicious := ¬user32.EndDialog() ∧ kernel32.EnumCalendarInfoA()
malicious := ¬user32.LoadIconA() ∧ ¬kernel32.GetTempPathA() ∧ ¬advapi32. …
malicious := shell32.ExtractAssociatedIconA()
malicious := msvbvm. …
benign := otherwise

FIG. 7 [sample likelihoods computed by the Naive Bayes algorithm]:

P("windows" | benign) = 45/47
P("windows" | malicious) = 2/47
P("*.COM" | benign) = 1/12
P("*.COM" | malicious) = 11/12
FIG. 8 [flow chart of detection method 100 (steps 102-112): Receive Emails → Filter Attachments from Email → Extract Features from Attachment → Evaluate Features to Predict Classification → Identify Features as Borderline → Log Attachments with Classification]
FIG. 9 [architecture diagram: incoming Email traffic (300) feeds the detector 220 (Email interface 232 → Email filter 222 → Feature Extractor 224 → Rule Evaluator 230); the model generator 240 (Filter interface 242 ↔ Feature Extractor 246 → Learning Algorithm Executor 248) produces the classification rule set]
FIG. 10 [flow chart: Receive Emails → Filter Attachments from Email → Extract Features from Attachment → Evaluate Features to Predict Classification → Identify Features as Borderline → Increment Count of Borderline Attachments → Log Attachments with Classification → Provide Notification that Threshold has been Exceeded]
FIG. 11 [plot of Detection Rate (%) versus False Positive Rate (%) comparing the Data Mining Method and the Signature-Based Method; arrows mark the "More Secure" and "Fewer False Positives" directions]
SYSTEM AND METHODS FOR DETECTION OF NEW MALICIOUS EXECUTABLES

CLAIM FOR PRIORITY TO RELATED APPLICATIONS

This application claims the benefit of U.S. Provisional Patent Application Ser. Nos. 60/308,622, filed on Jul. 30, 2001, entitled "Data Mining Methods for Detection of New Malicious Executables," and 60/308,623, filed on Jul. 30, 2001, entitled "Malicious Email Filter," which are hereby incorporated by reference in their entirety herein.

STATEMENT OF GOVERNMENT RIGHT

The present invention was made in part with support from the United States Defense Advanced Research Projects Agency (DARPA) grant nos. PAS-526617 and SRTSC-CU019-7950-1. Accordingly, the United States Government may have certain rights to this invention.

COPYRIGHT NOTICE

A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by any one of the patent disclosure, as it appears in the Patent and Trademark Office patent files or records, but otherwise reserves all copyright rights whatsoever.

BACKGROUND OF THE INVENTION

1. Field of the Invention

This invention relates to systems and methods for detecting malicious executable programs, and more particularly to the use of data mining techniques to detect such malicious executables in email attachments.

2. Background

A malicious executable is a program that performs a malicious function, such as compromising a system's security, damaging a system, or obtaining sensitive information without the user's permission. One serious security risk is the propagation of these malicious executables through e-mail attachments. Malicious executables are used as attacks for many types of intrusions. For example, there have been some high profile incidents with malicious email attachments, such as the ILOVEYOU virus and its clones. These malicious attachments are capable of causing significant damage in a short time.

Current virus scanner technology has two parts: a signature-based detector and a heuristic classifier that detects new viruses. The classic signature-based detection algorithm relies on signatures (unique telltale strings) of known malicious executables to generate detection models. Signature-based methods create a unique tag for each malicious program so that future examples of it can be correctly classified with a small error rate. These methods do not generalize well to detect new malicious binaries because they are created to give a false positive rate as close to zero as possible. Whenever a detection method generalizes to new instances, the tradeoff is a higher false positive rate.
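The signature-based scan described above amounts to searching a file for known telltale byte strings. The following is a minimal sketch of that idea; the signature names and byte strings are hypothetical, for illustration only.

```python
# Minimal sketch of classic signature-based scanning: a file is flagged
# only if it contains a known telltale byte string from the database.
# The signature set below is hypothetical, not from any real scanner.

SIGNATURES = {
    "example-worm": b"\xde\xad\xbe\xef",
    "example-trojan": b"EVIL_PAYLOAD",
}

def scan(data: bytes) -> list[str]:
    """Return the names of all known signatures found in `data`."""
    return [name for name, sig in SIGNATURES.items() if sig in data]

# A new, previously unseen variant whose bytes match no stored signature
# is missed entirely, which is why signature methods do not generalize.
```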
Unfortunately, traditional signature-based methods may not detect a new malicious executable. In an attempt to solve this problem, the anti-virus industry generates heuristic classifiers by hand. This process can be even more costly than generating signatures, so finding an automatic method to generate classifiers has been the subject of research in the
anti-virus community. To solve this problem, different IBM researchers applied Artificial Neural Networks (ANNs) to the problem of detecting boot sector malicious binaries. (The method of detection is disclosed in G. Tesauro et al., "Neural Networks for Computer Virus Recognition," IEEE Expert, 11(4): 5-6, August 1996, which is incorporated by reference in its entirety herein.) An ANN is a classifier that models neural networks explored in human cognition. Because of the limitations of the implementation of their classifier, they were unable to analyze anything other than small boot sector viruses, which comprise about 5% of all malicious binaries.

Using an ANN classifier with all bytes from the boot sector malicious executables as input, IBM researchers were able to identify 80-85% of unknown boot sector malicious executables successfully with a low false positive rate (<1%). They were unable to find a way to apply ANNs to the other 95% of computer malicious binaries.

In similar work, Arnold and Tesauro applied the same techniques to Win32 binaries, but because of limitations of the ANN classifier they were unable to obtain comparable accuracy over new Win32 binaries. (This technique is described in Arnold et al., "Automatically Generated Win32 Heuristic Virus Detection," Proceedings of the 2000 International Virus Bulletin Conference, 2000, which is incorporated by reference in its entirety herein.)

The methods described above have the shortcoming that they are not applicable to the entire set of malicious executables, but rather only to boot-sector viruses or only to Win32 binaries.

The technique is similar to data mining techniques that have already been applied to intrusion detection systems by Lee et al. Their methods were applied to system calls and network data to learn how to detect new intrusions. They reported good detection rates as a result of applying data mining to the problem of IDS. A similar framework is applied to the problem of detecting new malicious executables. (The techniques are described in W. Lee et al., "Learning Patterns From UNIX Processes Execution Traces for Intrusion Detection," AAAI Workshop on AI Approaches to Fraud Detection and Risk Management, 1997, pages 50-56, and W. Lee et al., "A Data Mining Framework for Building Intrusion Detection Models," IEEE Symposium on Security and Privacy, 1999, both of which are incorporated by reference in their entirety herein.)

Procmail is a mail processing utility which runs under UNIX, which filters email, and which sorts incoming email according to sender, subject line, length of message, keywords in the message, etc. Procmail's pre-existent filter provides the capability of detecting active-content HTML tags to protect users who read their mail from a web browser or HTML-enabled mail client. Also, if the attachment is labeled as malicious, the system "mangles" the attachment name to prevent the mail client from automatically executing the attachment. It also has built-in security filters against, for example, long filenames in attachments and long MIME headers, which may crash or allow exploits of some clients.

However, this filter lacks the ability to automatically update its list of known malicious executables, which may leave the system vulnerable to attacks by new and unknown viruses. Furthermore, its evaluation of an attachment is based solely on the name of the executable and not the contents of the attachment itself.

Accordingly, there exists a need in the art for a technique which is not limited to particular types of files, such as boot sector viruses or only Win32 binaries, and which provides the ability to detect new, previously unseen files.
SUMMARY

An object of the present invention is to provide a technique for predicting a classification of an executable file as malicious or benign which is not dependent upon the format of the executable.

Another object of the present invention is to provide a data mining technique which examines the entire file, rather than a portion of the file, such as a header, to classify the executable as malicious or benign.

A further object of the present invention is to provide an email filter which can detect executables that are borderline, i.e., executables having features indicative of both malicious and benign executables, which may be detrimental to the model if misclassified.

These and other objects of the invention, which will become apparent with reference to the disclosure herein, are accomplished by a system and methods for classifying an executable attachment in an email received by an email processing application or program, which includes filtering the executable attachment from said email. The email processing application may be executed at an email server or a client or host email application. A byte sequence feature is subsequently extracted from the executable attachment. The executable attachment is classified by comparing said byte sequence feature of the executable attachment with a classification rule set derived from byte sequence features of a set of executables having a predetermined class in a set of classes.

According to a preferred embodiment, extracting the byte sequence feature from said executable attachment comprises extracting static properties of the executable attachment, which are properties that do not require the executable to be run in order to discern. Extracting the byte sequence feature from the executable attachment may comprise converting the executable attachment from binary format to hexadecimal format. According to another embodiment, extracting the byte sequence features from the executable attachment may comprise creating a byte string representative of resources referenced by said executable attachment.

Advantageously, classifying the executable attachment may comprise predicting the classification of the executable attachment as one class in a set of classes consisting of malicious and benign. The set of classes may also include a borderline class. Classifying the executable attachment may comprise determining a probability or likelihood that the executable attachment is a member of each class in said set of classes based on said byte sequence feature. In one embodiment, this probability is determined by use of a Naive Bayes algorithm. In another embodiment, the probability may be determined by use of a Multi-Naive Bayes algorithm. The determination of the probability may be divided into a plurality of processing steps, which may then be performed in parallel. The executable attachment is classified as malicious if the probability that the executable attachment is malicious is greater than the probability that the executable attachment is benign. The executable attachment is classified as benign if the probability that the executable attachment is benign is greater than the probability that the executable attachment is malicious. The executable attachment is classified as borderline if the difference between the probability that the executable attachment is benign and the probability that the executable attachment is malicious is within a predetermined threshold.
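The three-way decision above can be sketched with a minimal Naive Bayes classifier over boolean byte-sequence features. This is an illustrative sketch, not the patent's exact procedure: the add-one (Laplace) smoothing, the tiny training counts, and the borderline band on the log-likelihood difference are all assumptions introduced here.

```python
import math

# Naive Bayes scoring over boolean features, with a "borderline" verdict
# when the malicious and benign log-likelihoods are close together.

def train(examples):
    """examples: list of (feature_set, label), label in {'malicious', 'benign'}.
    Returns (model, vocab): per-class prior and smoothed per-feature
    conditional probabilities, using add-one (Laplace) smoothing."""
    vocab = set().union(*(feats for feats, _ in examples))
    model = {}
    for c in ("malicious", "benign"):
        docs = [feats for feats, lab in examples if lab == c]
        prior = len(docs) / len(examples)
        cond = {f: (sum(f in d for d in docs) + 1) / (len(docs) + 2)
                for f in vocab}
        model[c] = (prior, cond)
    return model, vocab

def classify(model, vocab, feats, borderline=1.0):
    """Classify a feature set; return 'borderline' when the two
    log-likelihoods differ by at most `borderline` (an assumed band)."""
    scores = {}
    for c, (prior, cond) in model.items():
        s = math.log(prior)
        for f in vocab:
            p = cond[f] if f in feats else 1.0 - cond[f]
            s += math.log(p)
        scores[c] = s
    diff = scores["malicious"] - scores["benign"]
    if abs(diff) <= borderline:
        return "borderline"
    return "malicious" if diff > 0 else "benign"
```

The smoothing keeps every conditional probability strictly between 0 and 1, so the logarithms are always defined even for features never seen with a given class.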
A further step in accordance with the method may include logging the class of the executable attachment. The step of logging the class of the executable attachment may further
include incrementing a count of the executable attachments classified as borderline. If the count of executable attachments classified as borderline exceeds a predetermined threshold, the system may provide a notification that the threshold has been exceeded.

In accordance with the invention, the objects as described above have been met, and the need in the art for a technique which can analyze previously unseen malicious executables, without regard to the type of file, has been satisfied.
BRIEF DESCRIPTION OF THE DRAWINGS

Further objects, features and advantages of the invention will become apparent from the following detailed description taken in conjunction with the accompanying figures showing illustrative embodiments of the invention, in which:

FIG. 1 is a flow chart illustrating an overview of a method of detection model generation in accordance with the present invention.

FIGS. 2-4 illustrate several approaches to binary profiling.

FIG. 5 illustrates sample classification rules determined from the features represented in FIG. 3.

FIG. 6 illustrates sample classification rules found by a RIPPER algorithm.

FIG. 7 illustrates sample classification rules found by a Naive Bayes algorithm.

FIG. 8 is a flow chart illustrating a method of detecting malicious executables in accordance with the present invention.

FIG. 9 is a simplified diagram illustrating the architecture of the malicious email detector and model generator in accordance with the present invention.

FIG. 10 is a flow chart, similar to FIG. 8, illustrating another method of detecting malicious executables in accordance with the present invention.

FIG. 11 is a plot illustrating the interactive effect of false positive rate and detection rate on the performance of the detection model or classifier in accordance with the present invention.

Throughout the figures, the same reference numerals and characters, unless otherwise stated, are used to denote like features, elements, components or portions of the illustrated embodiments. Moreover, while the subject invention will now be described in detail with reference to the figures, it is done so in connection with the illustrative embodiments. It is intended that changes and modifications can be made to the described embodiments without departing from the true scope and spirit of the subject invention as defined by the appended claims.
DETAILED DESCRIPTION OF EXEMPLARY EMBODIMENTS

This invention will be further understood in view of the following detailed description.

An exemplary system and methods for detecting malicious email attachments was implemented in UNIX with respect to Sendmail (a message transfer agent (MTA) which ensures messages get from source message servers to destination message servers for recipients to access their email, as produced by Sendmail, Inc. of Emeryville, Calif.), using Procmail (a publicly available program that processes e-mail messages received by the server, as further described in Stephen R. van den Berg and Philip Guenther, "Procmail," online publication as viewed on http://www.procmail.org, 2001). This system and methods uses data mining methods in order
to create the detection model. The data mining methods are used to create classifiers to detect the malicious executables. A classifier is a classification rule set, or detection model, generated by the data mining algorithm that was trained over, i.e., derived from, a given set of training data.

In accordance with the exemplary embodiment, a data mining-based filter integrates with Procmail's pre-existent security filter to detect malicious executable attachments. The filter uses a scoring system based on a data mining classifier to determine whether or not an attachment may be malicious. If an attachment's score is above a certain threshold, it is considered malicious. The data mining classifier provides the ability to detect both the set of known malicious executables and a set of previously unseen, but similar, malicious executables.
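The score-and-threshold decision can be sketched as follows. Both `score_attachment` and the threshold value are hypothetical stand-ins for the trained data mining classifier and its tuned cutoff, not the actual filter; the name-mangling step mirrors the Procmail behavior described in the Background.

```python
# Sketch of the filter's decision flow: score an attachment with a
# classifier stand-in and flag it malicious above a threshold.

MALICIOUS_THRESHOLD = 0.5  # illustrative assumption, not a tuned value

def score_attachment(data: bytes) -> float:
    """Placeholder classifier: fraction of a few suspicious byte features
    present in the attachment. A real system would apply the trained
    data mining model here."""
    features = [b"MZ", b"CreateFile", b"wsock32"]
    return sum(f in data for f in features) / len(features)

def filter_attachment(name: str, data: bytes):
    """Return (possibly mangled name, verdict). Mangling the filename of
    a malicious attachment stops mail clients from auto-executing it."""
    verdict = "malicious" if score_attachment(data) > MALICIOUS_THRESHOLD else "ok"
    if verdict == "malicious":
        name = name + ".DEFANGED"
    return name, verdict
```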
A flowchart illustrating the process 10 of creating the classification rule set is illustrated in FIG. 1. An early stage in the process is to assemble the dataset (step 12) which will be used for training, and optionally for testing, the detection model. In the exemplary embodiment, this step included gathering a large set of executables 14 from public sources. In addition, each example program in the data set is a Windows or MS-DOS format executable, although the framework is applicable to other formats. In the exemplary embodiment, the programs were gathered either from FTP sites or from personal computers in the Data Mining Lab at Columbia University. A total of 4,031 programs were used.

In a subsequent stage, each data item, or executable, is labeled by class (step 16). The learning problem in the exemplary embodiment is defined with two classes, e.g., malicious and benign. As discussed above, a malicious executable is defined to be a program that performs a malicious function, such as compromising a system's security, damaging a system, or obtaining sensitive information without the user's permission. A benign program does not perform such malicious functions. Thus, the data set was divided into two groups: (1) malicious and (2) benign executables. In order to train the classification rule set, the classes of the executables must be known in advance. Of the 4,031 programs used in the data set, 3,301 were malicious executables and 1,000 were benign executables. The malicious executables consisted of viruses, Trojans, and cracker/network tools. There were no duplicate programs in the data set. To standardize the data set, an updated McAfee virus scanner, produced by McAfee.com Corporation of Sunnyvale, Calif., was used to label the programs as either malicious or benign executables. All labels were assumed to be correct for purposes of the analysis.

Another step, which may be performed concurrently with or subsequent to the above step, is to divide the dataset into two subsets which include a training set and a test set (step 18). The data mining algorithms use the training set to generate the classification rule sets. After training, a test set may be used to test the accuracy of the classifiers on a set of unseen examples. It is understood that testing the detection model is an optional step to determine the accuracy of the detection model, and, as such, may be omitted from the process.
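The split-and-evaluate steps above can be sketched as follows. The split fraction, the seed, and the `classify_fn` stand-in are illustrative assumptions; any of the data mining algorithms discussed here could sit behind them.

```python
import random

# Sketch of step 18 and the optional test step: split labeled examples
# into a training set and a held-out test set, then measure accuracy of
# a classifier on the test set.

def split_dataset(examples, test_fraction=0.2, seed=0):
    """Shuffle (features, label) pairs and split into (train, test)."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(classify_fn, test_set):
    """Fraction of held-out examples whose prediction matches the label."""
    if not test_set:
        return 0.0
    correct = sum(classify_fn(feats) == label for feats, label in test_set)
    return correct / len(test_set)
```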
The next step of the method is to extract features from each executable (step 20). Features in a data mining framework are defined as properties extracted from each example program in the data set, e.g., byte sequences, that a classifier uses to generate detection models. (Signatures, as distinguished from features, typically refer to a specific feature value, while a feature is a property or attribute of data (such as a "byte sequence feature") which may take on a set of values. Signature-based methods are methods that inspect and test data to determine whether a specific feature value is present in that data, and then classify or alarm accordingly.) In the present
invention, the presence of specific feature values is used by the learning algorithms to calculate a probability or likelihood of classification of the data. The features which are extracted in the exemplary embodiment are static properties, which are properties that do not require executing the binary in order to be detected or extracted.

In the exemplary embodiment, hexdump was used in the feature extraction step. Hexdump, as is known in the art (Peter Miller, "Hexdump," online publication, 2000, http://gd.tuwien.ac.at/softeng/Aegis/hexdump.html, which is incorporated by reference in its entirety herein), is an open source tool that transforms binary files into hexadecimal files. The byte sequence feature is informative because it represents the machine code in an executable. After the "hexdumps" are created, features are produced in the form illustrated in FIG. 2, in which each line represents a short sequence of machine code instructions. In the analysis, a guiding assumption is made that similar instructions were present in malicious executables that differentiated them from benign programs, and that the class of benign programs had similar byte code that differentiated them from the malicious executables. Each byte sequence in the program is used as a feature.
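A minimal sketch of this transformation, inlining a hexdump-style rendering rather than invoking the tool itself, and treating each distinct two-byte hexadecimal word as a boolean feature (the two-byte grouping mirrors the FIG. 2 layout and is an assumption about granularity):

```python
# Sketch of byte-sequence feature extraction: render a binary as
# 4-hex-digit words (two bytes each) and use each distinct word as a
# boolean feature for the classifier.

def hex_words(data: bytes) -> list[str]:
    """Return the binary as a list of 4-hex-digit words (two bytes each)."""
    hexed = data.hex()
    if len(data) % 2:          # pad a trailing odd byte to a full word
        hexed += "00"
    return [hexed[i:i + 4] for i in range(0, len(hexed), 4)]

def byte_sequence_features(data: bytes) -> set[str]:
    """Distinct two-byte words, usable as boolean classifier features."""
    return set(hex_words(data))
```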
Many additional methods of feature extraction are also useful to carry out step 20, above, and are described herein. For example, octal encoding may be used rather than hexadecimal encoding. Another approach to feature extraction is to extract resource information from the binary that provides insight into its behavior, which is also referred to herein as "binary profiling." According to this approach, a subset of the data may be examined which is in Portable Executable (PE) format (which is described in "Portable Executable Format," online publication as viewed on http://support.microsoft.com/support/kb/articles/Q121/4/60.asp, 1999, which is incorporated by reference in its entirety herein). For instance, an executable in a standard Windows user interface may normally call the User Interfaces Dynamically Linked Library (USER32.DLL). This approach assumes that if an executable being evaluated does not call USER32.DLL, then the program does not have the standard Windows user interface. To extract resource information from Windows executables, GNU's Bin-Utils may be used (as described in "GNU Binutils Cygwin," online publication as viewed on http://sourceware.cygnus.com/cygwin, 1999, which is incorporated by reference in its entirety herein). GNU's Bin-Utils suite of tools can analyze PE binaries within Windows. In PE, or Common Object File Format (COFF), program headers are composed of a COFF header, an Optional header, an MS-DOS stub, and a file signature. All of the information about the binary is obtained from the program header without executing the unknown program, but by examining the static properties of the binary, using libBFD, which is a library within Bin-Utils, to extract information in object format. Object format for a PE binary gives the file size, the names of DLLs, the names of function calls within those DLLs, and Relocation Tables. From the object format, it is possible to extract a set of features to compose a feature vector, or string, for each binary representative of resources referenced by the binary.

Three types of features may be analyzed to determine how resources affect a binary's behavior:

1. The list of DLLs used by the binary
2. The list of DLL function calls made by the binary
3. The number of different function calls within each DLL
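The three feature types above can be sketched from an already-parsed import table. The `imports` mapping is a hypothetical stand-in for what libBFD/Bin-Utils would report; parsing the PE headers themselves is out of scope here.

```python
# Sketch of the three binary-profiling feature types, computed from a
# parsed import table: `imports` maps DLL name -> set of functions called.

def dlls_used(imports: dict[str, set[str]]) -> set[str]:
    """Feature type 1: the list of DLLs used by the binary."""
    return set(imports)

def dll_function_calls(imports: dict[str, set[str]]) -> set[str]:
    """Feature type 2: the DLL function calls made by the binary."""
    return {f"{dll}.{fn}" for dll, fns in imports.items() for fn in fns}

def calls_per_dll(imports: dict[str, set[str]]) -> dict[str, int]:
    """Feature type 3: number of different function calls within each DLL."""
    return {dll: len(fns) for dll, fns in imports.items()}
```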
A first approach to binary profiling used the DLLs loaded by the binary as features. Data can be modeled by extracting a feature or a set of features, and each set of features may be represented as a vector of feature values. The feature vector
comprised a number of boolean values, e.g., 30, representing whether or not a binary used a DLL. Typically, not every DLL was used in all of the binaries, but a majority of the binaries called the same resource. For example, almost every binary called GDI32.DLL, which is the Windows NT Graphics Device Interface a

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket