`
`
`5,440,723
`
`OTHER PUBLICATIONS
S. W. Shieh et al., "A Pattern-Oriented Intrusion-Detection Model and its Applications", Proceedings of the 1991 IEEE Computer Society Symposium on Research in Security and Privacy, pp. 327-342.
H. S. Javitz et al., "The SRI IDES Statistical Anomaly Detector", Proceedings of the 1991 IEEE Computer Society Symposium on Research in Security and Privacy, pp. 316-326.
W. Arnold et al., "System for Detecting Undesired Alteration of Software", IBM Technical Disclosure Bulletin, vol. 32, No. 11, Apr. 1990, pp. 48-50.
S. M. Katz, "Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer", IEEE Trans. ASSP-35, No. 3, Mar. 1987, pp. 400-401.
F. Cohen, A Short Course on Computer Viruses, ASP Press, Pittsburgh, 1990, pp. 9-15.
`
`
`
`
[Drawing Sheet 1 of 7 (U.S. Patent, Aug. 8, 1995, 5,440,723): FIG. 1A, block diagram of the data processing system 10, showing the bus 12, CPU 14, RAM 16, terminal control 18, display device 20, manual input 22, hard disk control 24, hard disk 26, floppy disk control 28, floppy disk drive 30 and diskette 30a, and network adapter (NA) 31; FIG. 5, flow chart of the exact-match and fuzzy-match calculations and the reporting of results.]
`
`000003
`
`000003
`
`
`
[Drawing Sheet 2 of 7: FIG. 1B, an exemplary network of data processors with links to other processors; FIG. 1C, the network of FIG. 1B with infected processors indicated and kill signals sent to neighboring processors.]
`
`000004
`
`000004
`
`
`
[Drawing Sheet 3 of 7: FIG. 2, flow chart of the method. (A) Monitor system for anomalous behavior. (B) If anomalous behavior is detected, scan for known viruses; if a known virus is found, alert the user and remove the virus. (C) If the virus is unknown, deploy decoy programs to capture the unknown virus. (D) Identify invariant code portions of the virus sample(s). (E) Extract a signature from the code and add it to the scanner's database. (F) Inform other network processors of the virus infection. (G) Generate a distress signal, if required.]
`
`000005
`
`000005
`
`
`
[Drawing Sheet 4 of 7: FIG. 3, a more detailed flow chart depicting the operation of the method of the invention.]
`
`
`
`
`
`
`
`
`
`
`
`
`
`
[Drawing Sheet 5 of 7: FIG. 4, Block A 'filter out portions of virus that are observed to vary from one instance to another' and Block B 'filter out portions of virus that are intrinsically the most likely to vary from one instance to another'. FIG. 6A, bytes 1 through n of 1st, 2nd, and 3rd instances of a virus, showing variant portions on either side of an 'invariant' portion. FIG. 6B, an exemplary 24-byte 'invariant' portion (bytes 1 through 24): BE DE 8C 06 01 00 8C C6 8E DE 31 C0 B1 DB 31 C9 31 D2 CB F9 F8 FF 5F F5.]
`
`
`
[Drawing Sheet 6 of 7: FIG. 7, flow chart of a method for selecting a probability threshold for accepting or rejecting a candidate signature: (A) segregate corpus into probe, training and test sets; (B) select byte-strings from probe set; (C) use training set to estimate probabilities; (D) count frequency in test set; (E) produce list of estimated probabilities vs. frequency; (F) determine false-positive and false-rejection probability; (G) select threshold.]
`
`
`
[Drawing Sheet 7 of 7: FIG. 8, block diagram of a system operable for executing the method of the invention.]
`
`
`
`
`1
`
`i44Q723
`
`AUTOMATIC IMMUNE SYSTEM FOR
`COMPUTERS AND COMPUTER NETWORKS
`
`CROSS-REFERENCE TO A RELATED PATENT
`APPLICATION
`
`This patent application is related to commonly as-
`signed U.S. patent application Ser. No. 08/004,871, filed
`Jan. 19, 1993, entitled “Methods and Apparatus for
`Evaluating and Extracting Signatures of Computer
`Viruses and Other Undesirable Software Entities”, by
Jeffrey O. Kephart.

`FIELD OF THE INVENTION
`
`This invention relates generally to digital data pro-
`cessors and, in particular, to methods and apparatus for
`providing computational integrity for digital data pro-
`cessors and networks thereof.
`
`BACKGROUND OF THE INVENTION
`
`A computer virus has been defined by Frederick B.
`Cohen as a program that can infect other programs by
`modifying them to include a, possibly evolved, version
`of itself (A Short Course on Computer Viruses, ASP
Press, Pittsburgh, 1990, page 11).
`As employed herein, a computer virus is considered
`to include an executable assemblage of computer in-
`structions or code that is capable of attaching itself to a
`computer program. The subsequent execution of the
`viral code may have detrimental effects upon the opera-
`tion of the computer that hosts the virus. Some viruses
`have an ability to modify their constituent code, thereby
`complicating the task of identifying and removing the
`virus.
`
`A worm is a program or collection of programs that
`can cause a possibly evolved version of itself to be exe-
`cuted on the same or possibly on a different computer.
`A Trojan Horse is a block of undesired code that is
`intentionally hidden within a block of desirable code.
Computer viruses, worms, and Trojan Horses are all
considered to be members of a class of undesirable
software entities, the presence of which within a data
processor, or network of data processors, is to be
avoided so as to maintain computational integrity.
`A widely-used method for the detection of computer
`viruses is known as a virus scanner. A virus scanner
`employs short strings of bytes to identify particular
`viruses in executable files, boot records, or memory.
`The byte strings (referred to as signatures) for a particu-
`lar virus must be chosen with care such that they always
`discover the virus, if it is present, but seldom give a
`“false alarm”, known as a false positive. That is, the
`signature must be chosen so that the byte string is one
`that is unlikely to be found in programs that are nor-
`mally executed on the computer. Typically, a human
`expert makes this choice by converting the binary ma-
`chine code of the virus to an assembler version, analyz-
`ing the assembler code, selecting sections of code that
appear to be unusual or virus-like, and identifying the
`corresponding bytes in the binary machine code so as to
`produce the signature. Wildcard bytes can be included
`within the signature to provide a match with any code
`byte appearing in a virus.
`Currently, a number of commercial computer virus
`scanners are successful in alerting users to the presence
of viruses that are known a priori. However, conven-
`tional virus scanners are typically unable to detect the
presence of computer viruses which they have not been
`programmed to detect explicitly. The problem of deal-
`ing with new viruses has typically been addressed by
`distributing updates of scanning programs and/or auxil-
`iary files containing the necessary information for iden-
`tifying the latest viruses. However, the increasing rate
`at which new viruses are being written is widening the
`gap between the number of viruses that exist and the
`number of viruses that can be detected by an apprecia-
`ble fraction of computer users. As a result, it is becom-
`ing increasingly likely that a new virus will become
`wide-spread before any remedy is generally available.
`At least one mechanism to inform other computers
`connected to a network of a presence of a viral infection
`has been previously advanced. For example, a “net
`hormones” concept has been proposed by David Sto-
`dolsky, wherein every computer on a network keeps
`archives of all potential virus-carrying contacts with
`every other computer. When a computer finds that it is
`infected with a virus, it sends a message to every com-
`puter that it has ever contacted or been contacted by,
`which in turn send messages to all of their contactees or
`contactors, etc. However, it is believed that this ap-
`proach will result in most or all computers quickly
`becoming overloaded by infection messages from most
`of the other computers on the network.
`As such, a need has arisen to develop methods for
`automatically recognizing and eradicating previously
`unknown or unanalyzed viruses on individual comput-
`ers and computer networks. An efficient method is also
`required for informing other computers on the network
`as to the existence of a computer virus within the net-
`work.
`
`It is an object of this invention to provide methods
`and apparatus to automatically detect and extract a
`signature from an undesirable software entity, such as a
`computer virus or worm.
It is a further object of this invention to provide meth-
`ods and apparatus for immunizing a computer system,
`and also a network of computer systems, against a sub-
`sequent infection by a previously unknown and undesir-
`able software entity.
`
`SUMMARY OF THE INVENTION
`
`The foregoing and other problems are overcome and
`the objects of the invention are realized by a method
`that provides computational integrity for a digital data
`processor and a network of digital data processors. The
`method includes the following component steps, or
`some functional subset of these steps:
`(A) periodic monitoring of a data processing system
`for anomalous behavior that may indicate the presence
`of an undesirable informational state, such as one arising
`from the presence of an undesirable software entity,
`such as a computer virus or worm;
`(B) automatic scanning for occurrences of known
`types of undesirable software entities;
`(C) deploying decoy programs to capture samples of
`unknown types of computer viruses;
`(D) automatically identifying portions of a captured
`virus sample which are likely to remain invariant from
`one instance of the virus to another;
`(E) extracting an identifying signature from the in-
`variant portion and adding the signature to a signature
`database;
`(F) informing neighboring data processing systems
`on a network of an occurrence of the undesirable soft-
`ware entity; and
`
`000010
`
`000010
`
`
`
`i44Q723
`
`4
`
`3
`(G) generating a distress signal, if appropriate, so as
`to call upon an expert to resolve difficult cases.
`A feature of this invention is the automatic execution
`of the foregoing steps in response to a detection of an
`undesired software entity, such as a virus, worm, or
`Trojan Horse, within a data processing system. Further-
`more, the automatic extraction of the identifying signa-
ture, and the addition of the signature to a signature
database, provides protection, or "immunity", to subsequent
`infections of the system, and also a network of systems,
`by the same or an altered form of the undesirable soft-
`ware entity.
`BRIEF DESCRIPTION OF THE DRAWING
`
`The above set forth and other features of the inven-
`tion are made more apparent in the ensuing Detailed
`Description of the Invention when read in conjunction
`with the attached Drawing, wherein:
`FIG. 1a is a block diagram of a computer system that
`is suitable for use in practicing the invention;
`FIG. 1b illustrates an exemplary computer network
`that includes the computer system of FIG. 1a;
`FIG. 1c illustrates the exemplary computer network
`of FIG. 1b and the use of a kill signal that is transmitted
`by infected computers to neighboring computers;
`FIG. 2 is a flow chart depicting the constituent steps
`of a method of the invention;
`FIG. 3 is a more detailed flow chart depicting the
`operation of the method of the invention;
`FIG. 4 is a flow chart depicting a method for use in
`segregating variant virus portions from invariant virus
`portions;
`FIG. 5 is a flow chart that shows the operation of a
`statistically based computer virus signature extraction
`method;
`FIG. 6a illustrates exemplary portions of three in-
`stances of a computer virus, the three portions each
`including an invariant portion and variant portions;
`FIG. 6b illustrates an exemplary 24-byte “invariant”
`portion of a computer virus;
`FIG. 7 is a flow chart that depicts a method of select-
`ing a probability threshold used for accepting or reject-
`ing a candidate signature; and
`FIG. 8 is a block diagram of a system operable for
`executing the method of the invention.
`
`DETAILED DESCRIPTION OF THE
`INVENTION
`
`FIG. 1a is a block diagram of a data processing sys-
`tem 10 that is suitable for practicing the teaching of the
`invention. A bus 12 is comprised of a plurality of signal
`lines for conveying addresses, data, and controls be-
`tween a Central Processing Unit 14 and a number of
`other system bus units. A RAM 16 is coupled to the
`system bus 12 and provides program instruction storage
and working memory for the CPU 14. A terminal con-
`trol subsystem 18 is coupled to the system bus 12 and
`provides outputs to a display device 20, typically a CRT
`monitor, and receives inputs from a manual input device
`22,
`typically a keyboard. Manual input may also be
`provided from a pointing device, such as a mouse. A
`hard disk control subsystem 24 bidirectionally couples a
`rotating fixed disk, or hard disk 26, to the system bus 12.
`The control 24 and hard disk 26 provide mass storage
`for CPU instructions and data. A floppy disk control
`subsystem 28 bidirectionally couples one or more
`floppy disk drives 30 to the system bus 12. The floppy
`disk drive 30 works in conjunction with a removable
`
`floppy diskette 30a. A network adapter (NA) 31 is pro-
`vided for coupling the system 10 to a network.
`The components illustrated in FIG. 1a may be em-
`bodied within a personal computer, a portable com-
`puter, a workstation, a minicomputer, a main frame
`computer, or a supercomputer. As such, the details of
`the physical embodiment of the data processing system
`10, such as the structure of the bus 12 or the number of
CPUs 14 that are coupled to the bus, are not crucial to the
operation of the invention, and are not described in fur-
`ther detail below.
`As employed herein, the informational state of the
`data processing system 10, at a given moment in time, is
`considered to be defined by the contents (bytes) of all
`memory or storage devices to which the system has
`access. These include, but are not limited to, RAM,
`registers, hard disks, floppy disks that are inserted into
`floppy disk drives, and magnetic tape that is inserted
`into a tape drive.
`The informational state history of the system 10 is
`considered to be the history of its informational state at
`all moments in time.
`For the case where the data processing system 10 (P1
`of FIG. 1b) is connected to other data processing sys-
tems (P2-P7) via a network, such as the exemplary
`network depicted in FIG. 1b, an additional data proces-
`sor (PD) may be added as a dedicated decoy program
`server, as described below.
`The method of this invention includes the following
`component steps, as illustrated in FIG. 2, or some func-
`tional subset of these steps.
`Step A: Periodic monitoring of the data processing
`system 10 for anomalous, potentially virus-like behav-
`ior.
`
`Step B: Automatic scanning for occurrences of
`known viruses, and removal of any which are found to
`be present.
`Step C: Deployment of decoy programs to capture
`virus samples.
`Step D: Segregation of a captured virus sample into
`portions that are likely to vary from one instance of the
`computer virus to another instance, and into portions
`that are likely to be invariant from one instance to an-
`other.
`
`Step E: Extraction of a viral signature from the in-
`variant portion(s) and the addition of the signature to a
`signature database.
`Step F: Informing neighboring data processing sys-
`tems on a network of an occurrence of a viral infection.
`Step G: Generation of a distress signal, if appropriate,
`so as to call upon human experts to resolve difficult
`cases.
`A feature of this invention is the automatic execution
`of the foregoing steps in response to a detection of an
`undesired software entity, such as a virus or a worm,
`within the data processing system 10.
`A discussion is now made of each of the above refer-
enced Steps A-G of FIG. 2.
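By way of illustration only, the following Python sketch outlines how Steps A through G chain together, following the branching shown in FIG. 2. Every step function here is a hypothetical stub standing in for the mechanisms described in the remainder of this description; only the ordering and the branching between known and unknown viruses reflect the method.

```python
# Control-flow sketch of Steps A-G of FIG. 2. All step functions are
# illustrative stubs, not the mechanisms described in the specification.
def monitor_for_anomaly():            # Step A: anomaly detection
    return True                       # pretend anomalous behavior was observed

def scan_for_known_viruses():         # Step B: scan for known viruses
    return None                       # None means no known virus matched

def remove_known_virus(name):         # Step B1: alert user, restore or disinfect
    print("removing known virus:", name)

def deploy_decoys_and_capture():      # Step C: decoy deployment
    return [b"\xbe\xde\x8c\x06\x01\x00"]   # captured virus sample(s)

def identify_invariant_portions(samples):   # Step D
    return samples

def extract_signature(invariant):     # Step E
    return invariant[0][:4]

def inform_neighbors(signature):      # Step F: kill signal / signature to neighbors
    print("signature sent to neighboring processors:", signature.hex())

def distress_signal():                # Step G: call on a human expert
    print("distress signal: expert assistance required")

def immune_response():
    if not monitor_for_anomaly():
        return
    known = scan_for_known_viruses()
    if known is not None:
        remove_known_virus(known)
        return
    samples = deploy_decoys_and_capture()
    if not samples:
        distress_signal()
        return
    signature = extract_signature(identify_invariant_portions(samples))
    inform_neighbors(signature)

if __name__ == "__main__":
    immune_response()
```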
`
`Step A: Anomaly Detection
`
`The first step in the process of this invention detects
`anomalous system behavior of a type that may indicate
`the presence of an undesirable informational state re-
`sulting from the presence of a virus or some other unde-
`sirable software entity, such as a worm or a Trojan
`Horse. The detection of anomalous behavior within a
`computer or computer network can be accomplished
through the use of known techniques, preferably a tech-
nique that detects anomalies that may indicate a virus.
One suitable anomaly detection technique, which uses
patterns of activity undertaken by computational pro-
cesses, is described in an article entitled "A Pattern-
Oriented Intrusion-Detection Model and its Applica-
tions", S. W. Shieh and V. D. Gligor, Proceedings of
the 1991 IEEE Computer Society Symposium on Re-
search in Security and Privacy. Another suitable tech-
nique detects unexpected changes in executable or re-
source files. As an example, F. Cohen, A Short Course on
Computer Viruses, ASP Press, Pittsburgh, 1990, discusses
the use of integrity shells employing modification detec-
tors. In this case, modification of files is detected by
cryptographic or non-cryptographic checksums, and
the user is notified if a modification is detected.
The method at Step A may employ any one or a
combination of these or similar techniques. The end
`result is the detection by the system 10 of an anomalous
`system behavior that may be indicative of the presence
`of a computer virus, or some other undesirable software
`entity.
`The anomaly detection process or processes may be
`run continually as a background task, or periodically as
`a foreground task, depending on the desire of the user
`and/or the nature of the underlying operating system.
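As an illustration of the checksum-based modification detection mentioned above, the following Python sketch records known-good checksums of monitored files and later reports any file whose contents have changed. The file selection, baseline format, and use of SHA-256 are assumptions made for the example and are not specified by the method.

```python
# Minimal sketch of checksum-based modification detection (Step A style).
# File list, baseline layout, and SHA-256 are illustrative assumptions.
import hashlib
import json
from pathlib import Path

def checksum(path: Path) -> str:
    """Return a hex digest of the file's contents."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def build_baseline(paths, baseline_file="baseline.json"):
    """Record known-good checksums for the monitored files."""
    baseline = {str(p): checksum(Path(p)) for p in paths}
    Path(baseline_file).write_text(json.dumps(baseline, indent=2))

def detect_modifications(baseline_file="baseline.json"):
    """Return the monitored files whose contents have changed or disappeared."""
    baseline = json.loads(Path(baseline_file).read_text())
    return [p for p, digest in baseline.items()
            if not Path(p).exists() or checksum(Path(p)) != digest]

if __name__ == "__main__":
    monitored = list(Path(".").glob("*.exe")) + list(Path(".").glob("*.com"))
    build_baseline(monitored)
    # ... later, as a background or periodic task ...
    changed = detect_modifications()
    if changed:
        print("Anomaly: modified executables detected:", changed)
```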
`
`Step B: Scan for Known Viruses
`
`If preliminary evidence of virus-like activity is de-
`tected, additional computational resources are devoted
`to obtaining more conclusive evidence of viral infec-
`tion. If a virus is present, there is a high probability that
`the virus is a well-known virus that can be readily de-
tected by a virus scanner. Thus, Step B scans an infor-
mational state history of the system, that is, all relevant
`media (e.g. files, boot records, memory), for known
`viruses.
`In one implementation, a pre-existing virus
`scanner is activated to search for a large set of patterns
`or signatures, each of which pertains to a different virus
`or virus family. It is highly desirable (but not essential)
`that the virus scanner be capable of detecting slight
`variations on known viruses. Many conventional scan-
`ners tend to possess this capability, to a limited extent,
`simply by virtue of the fact that they search based only
`on short contiguous strings of bytes found in the virus.
`One especially suitable virus scanner, known as VIR-
`SCAN, was available from the International Business
`Machines Corporation until recently, when it was in-
`corporated into the IBM AntiVirus program. This virus
`scanner is particularly useful for detecting variations in
`known viruses because it permits a certain number of
`mismatches between a string of bytes in a file being
`examined and the virus signature string. This capability
`reduces the chance of having to proceed to Step C
`(decoy deployment), and is even more useful
`in the
`event that the virus is completely unrelated to any
`known virus, as is described below with respect to Step
E. If the anomaly is found to be due to a known virus or
some slight alteration of it, the method proceeds to Step
B1 where the user is alerted, and the virus is removed
(killed) by traditional methods, such as restoration from
backup (either automatically or manually by the user)
or disinfection (removal of the virus from all of the
software it has infected). In general, disinfection is only
acceptable if the virus is found to be an exact copy of a
known virus. This implies that the system must verify the
identification made by the virus scanner.
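The following Python sketch illustrates the kind of mismatch-tolerant scanning described above: a buffer is searched for a signature while a bounded number of differing bytes is allowed. The scan routine, the mismatch limit, and the example signature (taken from the bytes shown in FIG. 6b) are assumptions for the example, not the operation of any particular commercial scanner.

```python
# Sketch of signature scanning with a mismatch tolerance.
# Signature, buffer contents, and mismatch limit are illustrative assumptions.
def scan_with_mismatches(data: bytes, signature: bytes, max_mismatches: int = 2):
    """Return offsets where 'signature' matches 'data' with at most
    'max_mismatches' differing bytes (naive byte-by-byte scan)."""
    hits = []
    n, m = len(data), len(signature)
    for i in range(n - m + 1):
        mismatches = 0
        for j in range(m):
            if data[i + j] != signature[j]:
                mismatches += 1
                if mismatches > max_mismatches:
                    break
        else:
            hits.append(i)
    return hits

if __name__ == "__main__":
    # An exact hit at offset 3 and a one-byte variant at offset 20.
    sig = bytes.fromhex("8CC68EDE31C0")
    buf = (b"\x00" * 3 + sig + b"\x90" * 11 +
           bytes.fromhex("8CC68EDE31C1") + b"\x00" * 5)
    print(scan_with_mismatches(buf, sig, max_mismatches=1))
```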
`
Step C: Decoy Deployment

If no known virus is detected, the anomaly may be
`due to some unknown virus. In this case, the method
`employs one or more decoy programs in an attempt to
`attract and obtain one or more samples of the unknown
`virus. The decoy programs could be commercially-
`available computer programs which are installed on
`(and are capable of being executed on) the computer
`system. Preferably, however, they would be special-
`purpose programs which exist solely for the purpose of
`capturing viruses.
`For the case of a DOS-based system, a variety of
`decoy programs having *.COM and *.EXE extensions
`(also referred to as executables) are created, executed,
`read, written to, copied, etc. The use of decoy programs
`for this purpose is described by W. C. Arnold and D. M.
`Chess in “System for Detecting Undesired Alteration of
`Software”, IBM Technical Disclosure Bulletin, Vol. 32,
`No. 11, pages 48-50, April 1990.
`In operation, original copies of the decoy programs
`are stored securely, by example in encrypted form or on
`a separate server, along with checksums of the originals.
`After a specified time interval, or after other specified
`conditions are met, checksums of some or all of the
`decoy programs are taken and compared directly to
`their originals. Alternatively, the decoy programs are
`compared directly to their originals. If none of the de-
`coys are found to have changed, the user is alerted to
`the fact that an anomaly of some sort was detected, but
`that the system was unable to find conclusive evidence
`of a virus. In this case, it is desirable to maintain the
`system in a moderate state of alert for some time period,
`exercising the decoy programs at a reduced priority or
`less frequent intervals.
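A minimal Python sketch of the decoy integrity check described above is given below. The directory names and the direct byte-for-byte comparison against protected reference copies are assumptions made for the example; the method equally permits comparison of stored checksums.

```python
# Sketch of a decoy integrity check. Directory layout is an illustrative
# assumption; originals are kept in a protected location.
from pathlib import Path

DECOY_DIR = Path("decoys")                # decoys that are run, copied, read, etc.
ORIGINAL_DIR = Path("secured_originals")  # protected reference copies

def changed_decoys():
    """Return decoys whose current contents differ from their protected originals."""
    changed = []
    for original in ORIGINAL_DIR.iterdir():
        live = DECOY_DIR / original.name
        if not live.exists() or live.read_bytes() != original.read_bytes():
            changed.append(live)
    return changed

if __name__ == "__main__":
    suspects = changed_decoys()
    if suspects:
        print("Possible infection; modified decoys:", [p.name for p in suspects])
    else:
        print("No decoy changed; remain at moderate alert and re-check later.")
```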
`It is desirable to have a large variety of decoy pro-
`grams that differ from one another in length, content,
`and name. Furthermore, the decoy programs should be
`placed in and run from various directories, particularly
`those directories in which there are modified executa-
`bles.
`Some DOS viruses are known to become resident in
`memory and are active until the computer is rebooted.
`In such a case, the virus is able to infect any executable
`at any time. Often, a resident virus will infect executa-
`bles as they are brought into memory to be executed.
`Thus, if several decoy programs are run, there is a rea-
`sonable chance that one or more decoy programs will
`be infected by such a virus. In addition to running the
`decoys, the system should perform a variety of other
`operations with the decoy programs, including copying
`them, opening them, creating them, etc. All of these
`operations increase the probability of infection by a
`virus that is resident in memory.
`However, some DOS viruses do not go resident, and
`can infect other programs only when the host program
`in which the virus is imbedded is executed. Such viruses
`typically use static criteria for selecting a program to
`infect, such as the program name or the directory in
`which the program is located. In this case, it is more
`difficult for the decoy programs to become infected. To
`increase the chance of infection, decoy programs are
`preferably placed in locations where they are more
`likely to be selected, such as the directory of the possi-
`bly-infected executable, or directories containing sys-
`tem programs (e.g. the root directory) and commonly-
used utility programs. In order to infect the decoy pro-
`grams, the potentially-infected executable should be run
`repeatedly.
`If one or more decoy programs is subsequently found
`to have changed from the original, protected version, it
`can be assumed that the changes are due to a virus. A
`comparison of each modified decoy program with its
`corresponding uninfected version enables a copy of the
`virus to be isolated from each decoy.
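By way of illustration, the following Python sketch isolates a sample by comparing an infected decoy with its uninfected original. Treating every differing or appended byte region as part of the sample is a simplification assumed for the example.

```python
# Sketch of isolating a virus sample by differencing an infected decoy
# against its uninfected original. The example data are illustrative.
def isolate_sample(original: bytes, infected: bytes):
    """Return (offset, bytes) pairs covering regions of the infected decoy
    that differ from, or extend beyond, the original."""
    regions = []
    i = 0
    common = min(len(original), len(infected))
    while i < common:
        if infected[i] != original[i]:
            start = i
            while i < common and infected[i] != original[i]:
                i += 1
            regions.append((start, infected[start:i]))
        else:
            i += 1
    if len(infected) > len(original):          # bytes appended by the virus
        regions.append((common, infected[common:]))
    return regions

if __name__ == "__main__":
    # A decoy whose header was patched and whose tail carries viral code.
    original = b"\x90" * 16
    infected = b"\xe9\x10\x00" + b"\x90" * 13 + b"VIRUS-BODY"
    for offset, chunk in isolate_sample(original, infected):
        print(hex(offset), chunk)
```
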
`
`8
`grams, which can include representations of numerical
`constants, character strings, screen images, work areas
`for computations, addresses, etc., tend to be more likely
`to vary from one instance of the program to another
`than “code” portions, which represent machine instruc-
`tions. It may be that most or all of the “data” bytes in a
`given virus are the same in all instances of that virus, but
`without a careful analysis this is difficult to establish
`with certainty. Thus, a conservative approach assumes
`that “data” bytes may change, and therefore it is desir-
`able to disqualify them from being chosen as portions of
`candidate signatures. Perfect classification of each byte
`as being “code” or “data” is not necessary. An errone-
`ous classification of “code” bytes as “data” bytes is
`more tolerable than an erroneous classification of
`“data” bytes as “code” bytes.
`By example, FIG. 6b illustrates a 24-byte “invariant”
`portion of a virus sample that was generated at Step C.
`The first 4 bytes represent machine instructions, bytes 5
`and 6 represent data, bytes 7-19 represent code, and
`bytes 20-24 represent data. Thus, if Block B of FIG. 4-
`were employed, only bytes 1-4 and 7-19 would be
`passed along to Step E for signature extraction.
`Suitable methods of code-data segregation include,
`but are not limited to, the following techniques.
`1. Conventional static disassembly techniques, such
`as flow analysis.
`2. Simulation of marked viral code in a virtual ma-
`chine, wherein the virtual machine includes a hardware
`and software model of the computer or class of comput-
`ers for which the virus signature is being obtained.
`3. Running the marked viral code through an inter-
`preter or debugger to determine which bytes are actu-
`ally executed. This method requires a “safe” environ-
`ment in which to run the virus, such as the dedicated
`server PD referred to above in the discussion of FIG. lb.
`4. Statistical classification techniques.
`The second method,
`i.e., simulation of the marked
`viral code in a virtual machine, is the safest and most
`reliable, but is often the most computationally intensive.
`The second and third methods each have the advan-
`tage that they are capable of determining whether a
`virus is self-modifying. Viruses that modify themselves
`often do so for the purpose of obscuring their machine
`code portions by making the code appear different from
`one instance of the virus to the next. Such viruses are
`often referred to as self-garblers. Self-garblers typically
`consist of a short decrypting “head” portion followed
`by code which is transformed in an easily-invertible
`manner (e.g. XOR’ed) using a randomly-chosen de-
`cryption key when the virus is replicated. Thus, the
`only portion of the code which remains invariant, from
`one instance of the virus to another, is the head. It is
`helpful to know that a virus is self-modifying (and thus
`potentially self-garbling), so that signatures will only be
`drawn from the decrypting head. This information can
`also be helpful in identifying cases in which what ap-
`peared to be several different viruses in Step C are
`actually only different instances of a single, self-garbling
`virus.
`At the conclusion of Step D, there exists one or more
`sequences of bytes which appear in every instance of
`the one or more viruses that were obtained or generated
`at Step C and which, according to any heuristics em-
`ployed in Block B, are deemed likely to appear in most
`or all possible instances of the virus. The set of candi-
`date viral signatures are all possible contiguous blocks
`of S bytes found in these “probably-invariant” or “sub-
`
`l0
`
`15
`
`20
`
`Step D: Identification of Invariant Code Portions
`In order to maximize the likelihood that a chosen
`signature will appear in all or nearly all progeny of the
`virus, it is preferred to filter out all portions of the virus
`which are observed to or are likely to vary from one
`instance of the virus to another. To achieve this, the
`method employs one, or preferably both, of the tech-
`niques represented as Blocks A and B in FIG. 4. The
`order in which the two techniques are used can be re-
`versed, and the following description is correspond-
`ingly modified to accommodate the opposite ordering
`of the two techniques.
`If several decoy programs are found to be infected, a
`most desirable case, it is important to compare the vari-
`ous versions of the isolated virus programs with one
another (Block A). If all versions are identical, or if
there is only one version, the entire virus is considered
to be “invariant”, and can be passed along to Block B.
`Otherwise, if the various isolated computer virus pro-
`grams differ, but have substantial areas in common, it
`can be assumed that the virus alters itself when it repli-
`cates. As a result, only the common areas are marked
`for further examination by the heuristics employed in
`Block B of FIG. 4. This procedure is illustrated in
`FIG. 6a, in which the variant and “invariant” por-
`tions of a virus are identified by comparing three in-
`stances of it. In FIG. 6a, each byte is shown expressed
`in a hexadecimal representation, and an invariant por-
`tion is identified. The length of the invariant sequence
`of bytes is exemplary. In practice, invariant sequences
`of bytes are often considerably longer than five bytes,
`and more than one such portion typically appears
`within the virus. If there are no substantially similar
`areas,
`this may indicate the presence of a relatively
`sophisticated self-altering virus, or the presence of more
`than one virus. In this case, each of what appear to be
`different viruses are provided separately to the extrac-
`tion method, which determines one or more signatures
`for each. The system is also placed in a moderate state
`of alert, and the user is informed.
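The following Python sketch illustrates the Block A comparison shown in FIG. 6a: byte positions at which every captured instance carries the same value are marked "invariant". Equal-length, aligned samples and the minimum run length are assumptions made for the example; the sample bytes reuse part of the sequence shown in FIG. 6b.

```python
# Sketch of the Block A comparison of FIG. 6a: positions identical in all
# captured instances are reported as "invariant" runs.
# Aligned, equal-length samples and min_length are illustrative assumptions.
def invariant_regions(instances, min_length=4):
    """Return (offset, bytes) runs that are identical in all instances."""
    assert instances and all(len(s) == len(instances[0]) for s in instances)
    reference = instances[0]
    regions, start = [], None
    for i in range(len(reference) + 1):
        same = i < len(reference) and all(s[i] == reference[i] for s in instances)
        if same and start is None:
            start = i
        elif not same and start is not None:
            if i - start >= min_length:
                regions.append((start, reference[start:i]))
            start = None
    return regions

if __name__ == "__main__":
    samples = [bytes.fromhex("11BEDE8C060100" "22"),
               bytes.fromhex("33BEDE8C060100" "44"),
               bytes.fromhex("55BEDE8C060100" "66")]
    print(invariant_regions(samples, min_length=4))
```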
`It is noted that a situation may arise in which the
`computer is attacked by a sophisticated, unknown virus,
`every instance of which looks completely different. In
`this case, at the completion of Block A in FIG. 4 the
virus detector may be unable to determine whether
`there are several viruses, or one virus that appears dif-
`ferent every time it is replicated. The use of a distress
`signal, described below in relation to Step G, may be
`appropriate for this situation.
`At this point, there are one or more sections of a virus
`program (or a set of different viruses) that are marked as
`being substantially “invariant”. However, even if a
`given section of the virus program is common to all
`copies which have been obtained in Step C, this does
not guarantee that the section will appear in its entirety
`in every instance of the virus. The purpose of Block B
`is to filter out portions of the “invariant” sections which
`by their nature are more likely to be variable. As a
general rule, non-executable “data” portions of pro-
grams, which can include representations of numerical
constants, character strings, screen images, work areas
for computations, addresses, etc., tend to be more likely
to vary from one instance of the program to another
than “code” portions, which represent machine instruc-
tions. It may be that most or all of the “data” bytes in a
given virus are the same in all instances of that virus, but
without a careful analysis this is difficult to establish
with certainty. Thus, a conservative approach assumes
that “data” bytes may change, and therefore it is desir-
able to disqualify them from being chosen as portions of
candidate signatures. Perfect classification of each byte
as being “code” or “data” is not necessary. An errone-
ous classification of “code” bytes as “data” bytes is
more tolerable than an erroneous classification of
“data” bytes as “code” bytes.
By example, FIG. 6b illustrates a 24-byte “invariant”
portion of a virus sample that was generated at Step C.
The first 4 bytes represent machine instructions, bytes 5
and 6 represent data, bytes 7-19 represent code, and
bytes 20-24 represent data. Thus, if Block B of FIG. 4
were employed, only bytes 1-4 and 7-19 would be
passed along to Step E for signature extraction.
Suitable methods of code-data segregation include,
but are not limited to, the following techniques.
1. Conventional static disassembly techniques, such
as flow analysis.
2. Simulation of marked viral code in a virtual ma-
chine, wherein the virtual machine includes a hardware
and software model of the computer or class of comput-
ers for which the virus signature is being obtained.
3. Running the marked viral code through an inter-
preter or debugger to determine which bytes are actu-
ally executed. This method requires a “safe” environ-
ment in which to run the virus, such as the dedicated
server PD referred to above in the discussion of FIG. 1b.
4. Statistical classification techniques.
The second method, i.e., simulation of the marked
viral code in a virtual machine, is the safest and most
reliable, but is often the most computationally intensive.
The second and third methods each have the advan-
tage that they are capable of determining whether a
virus is self-modifying. Viruses that modify themselves
often do so for the purpose of obscuring their machine
code portions by making the code appear different from
one instance of the virus to the next. Such viruses are
often referred to as self-garblers. Self-garblers typically
consist of a short decrypting “head” portion followed
by code which is transformed in an easily-invertible
manner (e.g. XOR’ed) using a randomly-chosen de-
cryption key when the virus is replicated. Thus, the
only portion of the code which remains invariant, from
one instance of the virus to another, is the head. It is
helpful to know that a virus is self-modifying (and thus
potentially self-garbling), so that signatures will only be
drawn from the decrypting head. This information can
also be helpful in identifying cases in which what ap-
peared to be several different viruses in Step C are
actually only different instances of a single, self-garbling
virus.
At the conclusion of Step D, there exists one or more
sequences of bytes which appear in every instance of
the one or more viruses that were obtained or generated
at Step C and which, according to any heuristics em-
ployed in Block B, are deemed likely to appear in most
or all possible instances of the virus. The set of candi-
date viral signatures are all possible contiguous blocks
of S bytes found in these “probably-invariant” or “sub-
`stantially invariant” byte sequences, where S is a signa-
`ture length specified by the user or determined by the
`system. Typically, S has a value in the range of approxi-
`mately 12 bytes to 36 bytes. The “probably-invariant”
byte sequence(s) may be provided as an input file (IF
`32, FIG. 1a) for subsequent processing by the signature
`extraction technique described below.
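The following Python sketch enumerates candidate signatures in the manner just described: every contiguous block of S bytes drawn from the probably-invariant sequences of each virus, recorded per virus as noted in the next paragraph. The data layout and the value S = 16 (within the typical 12 to 36 byte range) are assumptions for the example; the 24-byte sequence reuses the bytes of FIG. 6b.

```python
# Sketch of candidate-signature generation from "probably-invariant"
# byte sequences. Data layout and S = 16 are illustrative assumptions.
def candidate_signatures(invariant_sequences_by_virus, S=16):
    """Map each virus name to all length-S byte blocks from its sequences."""
    candidates = {}
    for virus, sequences in invariant_sequences_by_virus.items():
        blocks = []
        for seq in sequences:
            blocks.extend(seq[i:i + S] for i in range(len(seq) - S + 1))
        candidates[virus] = blocks
    return candidates

if __name__ == "__main__":
    head = bytes.fromhex("BEDE8C0601008CC68EDE31C0B1DB31C931D2CBF9F8FF5FF5")
    cands = candidate_signatures({"sample-virus": [head]}, S=16)
    print(len(cands["sample-virus"]), "candidates of 16 bytes each")
```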
`There may be several viruses (or virus signatures) of
`interest. As such, a record is maintained of which candi-
`date signatures belong to each virus.
`
`Step E: Signature Extraction
`
`In general, the signature extraction method searches
`through the “probably-invariant” sections of virus code
`for the n best possible signatures (where n is typically
`small, and perhaps 1). A “good” viral signature is one
`which is statistically unlikely to generate false positives
`when run on uninfected, legitimate programs, but will
`always identify the virus if it is present. Depending on
`the nature of the virus scanner which will employ the
`resulting signature, other criteria may be combined
`with this criterion to evaluate the desirability of a given
`candidate signature. Another desirable property is that
`the signature be flexible enough to allow for the detec-
`tion of slight alterations in the viral code. This is espe-
`cially useful because the decoys may not have captured
`the full range of variation of a particular virus. This is
`the reason that the virus scanner referred to in Step B
`should be capable of detecting slightly-altered