`
US 7,757,298 B2

U.S. PATENT DOCUMENTS

6,081,897 A *      6/2000   Bersson ................. 726/32
6,182,081 B1*      1/2001   Dietl et al. ............ 707/102
6,209,097 B1*      3/2001   Nakayama et al. ......... 713/193
6,236,768 B1*      5/2001   Rhodes et al. ........... 382/306
6,289,341 B1*      9/2001   Barney .................. 707/6
6,510,513 B1*      1/2003   Danieli ................. 713/156
6,530,022 B1*      3/2003   Blair et al. ............ 713/186
6,577,920 B1       6/2003   Hypponen et al.
6,643,696 B2      11/2003   Davis et al.
6,922,781 B1*      7/2005   Shuster ................. 382/100
7,120,274 B2*     10/2006   Kacker et al. ........... 713/201
2002/0087885 A1*   7/2002   Peled et al.
2005/0108248 A1*   5/2005   Natunen ................. 707/10

FOREIGN PATENT DOCUMENTS

WO    WO9842098    9/1998

* cited by examiner
`
`
[FIG. 1 (Sheet 1 of 6): block diagram of the network, showing the Internet 102, a user computer 120 with application 122, a Web host 110 with server 112, file identification application 114, and database 116, and a secondary Web host 130 with server 132 and database 134.]
`
[FIG. 2A (Sheet 2 of 6): flow chart of the directory scan — traverse directory entries, check for sequentially numbered files (204), identical file sizes whose total exceeds a threshold, suspect tags in file names (212), and files not referenced in any HTML file (214); report matches as suspect files, continuing to the end of the directory.]
`
`
`
[FIG. 2B (Sheet 3 of 6): flow chart of the file content review — retrieve a file from the directory (220), check for a copyright notice (222), check whether the file contents match the indicated file type (226), report suspect files, and truncate files containing data past the end-of-data marker.]
`
`
`
[FIG. 2C (Sheet 4 of 6): flow chart of checksumming the suspect files — retrieve a file from the suspect file list (240), read an initial portion of the file (242), generate a first checksum (244), compare it to the table (246, 248), and on a match read a larger portion (250), generate a second checksum (252), compare it to the table (254, 256), and add the file to the deletion list.]
`
`
`
[FIG. 3 (Sheet 5 of 6): flow chart of checksum generation — read a byte of the file (302), multiply the byte by the running checksum (304), reverse the result (306), truncate to a fixed size (308), and repeat until a predetermined number of bytes has been reached (310, 312).]
`
`
`
[FIG. 4 (Sheet 6 of 6): flow chart of checksum library generation — identify source files, generate checksums, and store the checksum, file name, and file length in the library (steps 402-410).]
`
`
`
`METHOD AND APPARATUS FOR
`IDENTIFYING AND CHARACTERIZING
`ERRANT ELECTRONIC FILES
`
`RELATED APPLICATIONS
`
`This application is a continuation of application Ser. No.
09/561,751 filed Apr. 29, 2000, now U.S. Pat. No. 6,922,781,
which claims priority pursuant to 35 U.S.C. §119(e) to U.S.
Provisional Application Nos. 60/132,093, filed Apr. 30, 1999;
`60/142,332, filed Jul. 3, 1999; and 60/157,195, filed Sep. 30,
`1999. All of the foregoing non-provisional and provisional
`applications are specifically incorporated by reference
`herein, in their entirety.
`
`COPYRIGHT NOTICE
`
`This patent document contains material subject to copy-
`right protection. The copyright owner, Ideaflood, Inc., has no
`objection to the reproduction of this patent document or any
`related materials, as they appear in the files of the Patent and
`Trademark Office of the United States or any other country,
`but otherwise reserves all rights whatsoever.
`
`BACKGROUND OF THE INVENTION
`
`1. Field of the Invention
`
`The present invention relates to electronic files stored on
`computers, and more particularly, to methods and apparatus
`for identifying and characterizing errant electronic files
`stored on computer storage devices.
`2. Description of Related Art
`The use of public and shared computing environments has
`proliferated due to the popularity of the Internet. Many Inter-
net service providers (ISPs) offer Web hosting services at low
or no cost in which registered users can place their own Web
sites on the ISP’s servers. These individual Web sites allow
users to store and access electronic files that are uploaded to
the servers. As a result of this proliferation, the administration
`of the large number of stored electronic files has become an
`important aspect of such Web hosting services. In view of the
`relative ease of public access to these electronic file storage
resources, there is also widespread abuse of Web server space
`in which users upload files that are offensive, illegal, unau-
`thorized, or otherwise undesirable and thus wasteful of stor-
age resources. These files are predominantly of four
`types: music, video, software and graphics. Many such files
`may contain pornography in violation of the terms of use of
`the Web hosting service. Moreover, the copying of these files
`to the Web server may be in violation of U.S. copyright laws.
`Consequently, the identification and removal of such files
`represents a significant administrative burden to the Web
`hosting services. In addition, the presence of certain files
`(such as depictions of child pornography or copyrighted
`music files) on user computers on corporate networks poses
`great legal risks to the corporation.
`Such files can be selected for review and characterized as
`
`acceptable or unacceptable to the system administrator using
`an automated or manual process. Unfortunately, many unde-
`sirable files are not easily recognizable and cannot be
detected and characterized. A manual review of the content of
`the files stored on the storage resource is usually not economi-
`cally feasible, and is also not entirely effective at identifying
`undesirable files. Illicit users of Web hosting services have
`devised numerous techniques for disguising improper files
`wherein even easily recognizable file types are disguised as
less recognizable file types. One such technique for disguis-
ing files is to split them into parts so that (i) they cannot be
`detected by simple searches for large files, and (ii) they can be
downloaded or uploaded in smaller chunks so that if a transfer
`is interrupted, the entire download or upload is not lost. The
`split files may also be renamed so as to hide their true file type.
`For example, a search for oversized music files (*.mp3)
`would not turn up a huge file named “song.txt” because it
`appears to the system as a text file.
`Another technique for hiding files is to append them to files
`that legitimately belong on a web server. By way of example,
`a Web site may be created called “Jane’s Dog’s Home Page.”
`Jane gets ten small pictures of her dog, converts them to a
`computer readable format (for example, jpeg) and saves them
`on her computer. She then splits stolen, copyrighted software
`into ten parts. She appends each part to the end of one of the
`jpeg files. She then uploads these to a web server. Upon a
`manual review of the web page, the administrator of the site
`would not notice that the otherwise innocuous dog pictures
`actually contain stolen software, because each of the files
`would in fact display a photo of a dog. Thus, even if the files
`were reported for manual review by software doing a simple
`search for oversized files, the files would be left on the server
`because they appear to be legitimate. While these files can
`sometimes be identified by name or size alone, these methods
`lead to unacceptable numbers of false positives and false
`negatives as file sizes and names are changed.
`Free and low cost web hosting services typically rely on
`advertising revenue to fund their operation. An additional
`abuse of these web hosting services is that they can be cir-
`cumvented such that the advertisements are not displayed.
`Typically, the advertising content is displayed on text or
`hypertext pages. If a user stores graphics or other non-text
`files on a free web hosting server, yet creates a web page
`elsewhere on a different service that references these graphics
`or non-text files, the free web hosting service pays the storage
`and bandwidth costs for these files without deriving the rev-
`enue from advertisement displays.
`A need exists, therefore, to provide a method and apparatus
`for identifying and characterizing errant electronic files
`stored on computer storage devices, that makes use of a
`variety of file attributes to reliably characterize files accord-
`ing to pre-set criteria, that is not easily circumvented, and that
`reduces the amount of manual review necessary to verify
`proper operation.
`
`SUMMARY OF THE INVENTION
`
`In accordance with the teachings of the present invention,
`a method and apparatus are provided for identifying and
`characterizing files electronically stored on a computer stor-
`age device. More particularly, an embodiment of the inven-
`tion further comprises a computer system that includes a
`server having a memory connected thereto. The server is
`adapted to be connected to a network to permit remote storage
`and retrieval of data files from the memory. A file identifica-
`tion application is operative with the server to identify errant
`files stored in the memory. The file identification application
`provides the functions of: (1) selecting a file stored in said
`memory; (2) generating a unique checksum corresponding to
the stored file; (3) comparing said unique checksum to each of
`a plurality of previously generated checksums, wherein the
`plurality of previously generated checksums correspond to
`known errant files; and (4) marking the file for deletion from
`the memory if the unique checksum matches one of the plu-
`rality of previously generated checksums.
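
By way of illustration only, the following Python sketch shows the four functions of the file identification application in sequence. It is not the exhibit source code referred to in the detailed description; the names (scan_memory, checksum_of, KNOWN_ERRANT_CHECKSUMS), the simple rolling checksum, and the 1,024-byte portion length are assumptions introduced here.

    # Illustrative sketch only; not the exhibit source code of the preferred embodiment.
    import os

    KNOWN_ERRANT_CHECKSUMS = set()   # assumed library of previously generated checksums

    def checksum_of(path, length=1024):
        # Placeholder checksum over an initial portion of the file; the checksum
        # routine of FIG. 3 differs from this simple rolling sum.
        with open(path, "rb") as f:
            data = f.read(length)
        value = 0
        for byte in data:
            value = (value * 31 + byte) & 0xFFFFFFFF
        return value

    def scan_memory(root):
        deletion_list = []
        for dirpath, _dirs, files in os.walk(root):      # (1) select files stored in memory
            for name in files:
                path = os.path.join(dirpath, name)
                csum = checksum_of(path)                 # (2) generate a checksum for the file
                if csum in KNOWN_ERRANT_CHECKSUMS:       # (3) compare to known errant checksums
                    deletion_list.append(path)           # (4) mark the file for deletion
        return deletion_list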
`A more complete understanding of the method and appa-
ratus will be afforded to those skilled in the art, as well as a
realization of additional advantages and objects thereof, by a
`consideration of the following detailed description of the
`preferred embodiment. Reference will be made to the
`appended sheets of drawings that will first be described
`briefly.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`FIG. 1 is a block diagram illustrating a wide area network
`in which a web host delivers information in the form of web
`pages to users;
`FIG. 2A is a flow chart illustrating a method of scanning a
`file directory to identify suspect files stored in a database in
`accordance with an embodiment of the invention;
`FIG. 2B is a flow chart illustrating a method of reviewing
`file contents to identify suspect files;
`FIG. 2C is a flow chart illustrating a method of checksum-
`ming the suspect files;
`FIG. 3 is a flow chart illustrating a method of generating
`checksum values; and
`FIG. 4 is a flow chart illustrating a method of generating a
`checksum library.
`
`DETAILED DESCRIPTION OF THE PREFERRED
`EMBODIMENT
`
`The present invention satisfies the need for a method and
`apparatus for identifying and characterizing errant electronic
`files stored on computer storage devices, that makes use of a
`variety of file attributes to reliably characterize files accord-
`ing to pre-set criteria, that is not easily circumvented, and that
`reduces the amount of manual review necessary to verify
`proper operation. In the detailed description that follows, like
element numerals are used to describe like elements illus-
trated in one or more of the figures.
`Referring first to FIG. 1, a block diagram is illustrated of a
`wide area network in which information is delivered to users
`
`in the form of web pages. It is anticipated that the present
`system operates with a plurality of computers that are coupled
`together on a communications network, such as the Internet
`or a wide area network. FIG. 1 depicts a network that includes
`a user computer 120 that communicates with a Web host 110
`though communication links that include the Internet 102.
`The user computer 120 may be any type of computing device
`that allows a user to interactively browse websites, such as a
`personal computer (PC) that includes a Web browser appli-
cation 122 executing thereon (e.g., Microsoft Internet
Explorer™ or Netscape Communicator™). The Web host
`110 includes a server 112 that can selectively deliver graphi-
`cal data files in the form of HyperText Markup Language
`(HTML) documents to the user computer 120 using the
`HyperText Transport Protocol (HTTP). Currently, HTML 2.0
`is the standard used for generating Web documents, though it
`should be appreciated that other coding conventions could
`also be used within the scope of the present invention. The
`server 112 accesses HTML documents stored within a data-
`
`base 116 that can be requested, retrieved and viewed at the
`user computer via operation of the Web browser 122. The
`database 116 may also contain many other types of files,
`including text, graphics, music, and software files. It should
`be appreciated that many different user computers may be
`communicating with the server 112 at the same time.
`As generally known in the art, a user identifies a Web page
`that is desired to be viewed at the user computer 120 by
`communicating an HTTP request from the browser applica-
`tion 122. The HTTP request includes the Uniform Resource
Locator (URL) of the desired Web page, which may corre-
spond to an HTML document stored on the database 116 of
the Web host 110. The HTTP request is routed to the server
112 via the Internet 102. The server 112 then retrieves the
HTML document identified by the URL, and communicates
the HTML document across the Internet 102 to the browser
application 122. The HTML document may be communi-
`cated in the form of plural message packets as defined by
`standard protocols, such as the Transport Control Protocol/
`Internet Protocol (TCP/IP). A user may also download any
`other type of file from the database 116 in the same manner.
`FIG. 1 further illustrates a secondary Web host 130 having
`a server 132 and database 134 similar to that of the primary
`Web host 110. The user computer 120 can communicate with
`the secondary Web host 130 in the same manner as described
`above. Moreover, the primary Web host 110 can communi-
`cate with the secondary Web host 130 in the same manner.
The pertinence of this communication path will become
clearer from the following description of the present method.
The Web host 110 further comprises a file identification appli-
`cation 114 that analyzes the data files stored on the database
`116 in order to identify errant files in accordance with the
`present invention. The file identification application 114 may
`comprise a program executing on the same computer as the
`server 112, or may be executing on a separate computer. The
`file identification application tests various attributes of the
`files stored on the database to determine whether they satisfy
`a particular profile that corresponds to an errant file. Source
`code for a preferred embodiment of a file identification appli-
`cation is attached hereto as an exhibit.
`
`A widely accepted characteristic of the Internet is that files
`are copied relentlessly and without permission. This is par-
ticularly true of illicit files, such as adult content, porno-
`graphic material or illegally copied software, music or graph-
`ics. Thus, a photograph showing up on a single Web site may
`be propagated to hundreds of other Web sites within days.
`Although the file name is often changed, and transmission
errors often result in premature truncation of the file (and thus
`a new file length), the initial portion of the file remains iden-
`tical as it is propagated throughout the Internet. Another
`characteristic of the Internet is that illicit files, such as music,
`video and software, all have one common attribute—they are
`very large once reassembled. It is therefore necessary to (i)
`identify oversized files that have been uploaded in parts, and
`(ii) identify “hidden” files that are appended to otherwise
`legitimate files. As will be further described below, an aspect
`of the present invention takes advantage of these characteris-
`tics of the Internet.
`
`10
`
`15
`
`20
`
`25
`
`30
`
`35
`
`40
`
`45
`
`Referring now to FIGS. 2A-2C, a method for identifying
`and characterizing files is illustrated in accordance with an
embodiment of the invention. The method would be executed
by the file identification application 114 described above with
`respect to FIG. 1. FIG. 2A illustrates an exemplary method of
`scanning a file directory to identify suspect files stored in a
`database. Suspect files are ones that are suspected of being
`improper, and are marked for further testing. The database
`116 includes a directory that identifies the files stored therein
`based on various attributes, including file name and file size.
`It will be appreciated from the following discussion that the
`method of FIGS. 2A-2C relates specifically to the identifica-
`tion of pornographic materials in view of the particular selec-
`tion criteria that is utilized; however, it will be understood to
`persons of ordinary skill in the art that the selection criteria
`can be modified to identify other types of illicit files. Starting
`at step 202, the application traverses the directory in order to
`analyze the numerous directory entries. The application may
construct a relational database of the directory entries in order
to sort on the various fields of the directory. This step may be
performed repeatedly as a continuing process throughout the
identification procedure, and would have to be repeated periodi-
cally to identify new files that are added to the database 116.
`At step 204, the application determines whether there are
`any sequentially numbered files within the directory. Sequen-
`tial files can be identified by analyzing and comparing the file
`names to each other. One attribute of pornographic materials
`is that they are often uploaded to a server as part of a series of
`photographs. Thus, the file names may include an embedded
numerical designation such as “xxx001.jpg” or “xxx002.jpg”.
The user may define at what level of folders the
`software will look for sequentially numbered, lettered, or
`otherwise identified files. For example, if a file server is
`divided into folders lettered from “AA” to “ZZ”, and each
folder contains Web sites with names in which the first two
letters correspond to the name of the file folder, the user could
`decide to treat all folders on the server as a single Web site, or
`to treat only Web sites within the same folder as a single Web
`site, or to treat each Web site individually. In the preferred
`embodiment, each Web site is considered on its own without
`reference to other Web sites, although the invention need not
`be limited in this manner.
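
The sequential-file check of step 204 can be pictured with a short sketch, given here only as a hypothetical example; the regular expression and the rule of flagging two or more consecutively numbered names are assumptions, not the claimed method.

    # Hypothetical sketch of the sequential-file check at step 204: group file names
    # by their non-numeric stem and flag groups that form a consecutive numeric series.
    import re
    from collections import defaultdict

    def find_sequential_files(file_names):
        groups = defaultdict(list)
        for name in file_names:
            m = re.match(r"^(.*?)(\d+)(\.[^.]*)?$", name)   # e.g. "xxx001" + ".jpg"
            if m:
                stem, number, ext = m.group(1), int(m.group(2)), m.group(3) or ""
                groups[(stem, ext)].append((number, name))
        suspects = []
        for members in groups.values():
            numbers = sorted(n for n, _ in members)
            if len(numbers) >= 2 and all(b - a == 1 for a, b in zip(numbers, numbers[1:])):
                suspects.extend(name for _, name in members)
        return suspects

    # Example: find_sequential_files(["xxx001.jpg", "xxx002.jpg", "index.html"])
    # reports the two numbered jpg files as suspect.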
`
`If any such sequential files are identified, they are reported
`as suspect files at step 206. Then, the application returns to
`step 202 and continues traversing through the directory
`entries. If no sequential files are identified at step 204, the
`application next determines at step 208 whether there are any
`files having identical file sizes. Another attribute of stolen
`intellectual property materials such as music files is that they
`are often broken up into several pieces in order to thwart their
`detection by simple searches for large files, and also to enable
`them to be downloaded or uploaded in smaller chunks to
`facilitate transfer. The presence of two or more files having
identical file size within the directory is an indicator that they
`may be pieces of a single, larger, illicit file. If there are plural
`files with identical file sizes, the application determines at
`step 210 whether the total size of the identical files summed
`together would exceed a predetermined threshold. As noted
`above, illicit files tend to be unusually large, so the predeter-
`mined threshold would be selected to correspond with the
`largest size of a typical non-illicit file. If the total size does
`exceed the predetermined threshold, then the identical files
`are reported as suspect files at step 206.
`More particularly, the application may manipulate the file
`names to determine whether they are in fact likely to be parts
of a single, larger file. An alternative way to determine
`whether files should be aggregated is to delete all numbers
`from the file names. Any files that are identically named after
the elimination of all numbers would be marked as potentially
`responsive and their names and aggregate size would be
`reported. Of course, this can be limited to numbers in con-
`junction with specified letters (such as r00, r41, etc., as the “r”
`denotation often indicates file compression and division via
`the RAR method). Similarly, this can be limited to specified
file types (whether identified by the file type suffix to the file
`name, or by examination of the actual contents of the file) or
files other than specified types (for example, legitimate
graphics files such as *.jpg are often sequentially numbered
`and may be a good candidate for exclusion). Next, using the
`original list of file names, any files are identified that differ
`only by a user-defined number of characters. Such files would
`be marked as potentially responsive and their names and
aggregate size would be reported. Both of the foregoing meth-
`ods can be set to either ignore the file suffix or file type
`information or to utilize it. Next, using the original list of file
`names and sizes, files that are of the same size (or within a
user-defined number of bytes of being of the same size) are
identified. Any such files are marked as potentially responsive
`and their names and aggregate size would be reported.
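
A hypothetical sketch of the number-stripping aggregation described above is given below; the 50 MB threshold is an assumed example value, and file-suffix handling and the user-defined character tolerance are omitted.

    # Hypothetical sketch: delete all numbers from the file names, group identically
    # named results, and report groups whose aggregate size exceeds a threshold.
    from collections import defaultdict

    SIZE_THRESHOLD = 50 * 1024 * 1024   # assumed example threshold for "unusually large"

    def find_split_files(entries):
        """entries: iterable of (file_name, file_size) directory entries."""
        groups = defaultdict(list)
        for name, size in entries:
            key = "".join(ch for ch in name if not ch.isdigit())  # eliminate all numbers
            groups[key].append((name, size))
        reports = []
        for key, members in groups.items():
            total = sum(size for _, size in members)
            if len(members) > 1 and total > SIZE_THRESHOLD:
                reports.append((key, [n for n, _ in members], total))  # names and aggregate size
        return reports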
`If no identical files are identified at step 208, or if the total
`size does not exceed the predetermined threshold at step 210,
`the application proceeds to step 212 where it is determined
`whether the file names contain any suspect tags. An example
`of a suspect tag is “xxx” which is often used in association
`with pornographic materials. Another example of a suspect
`tag is “crc”, which refers to a cyclical redundancy check
`(CRC), i.e., a known error checking technique used to ensure
the accuracy of transmitting digital data. When a large file has
been broken up into plural smaller files, it is common to
include a CRC file in order to verify the accurate reconstruction
of the large file. The presence of a file having a “crc” tag is an
`indicator that an illicit or illegal file has been uploaded to the
`server. A table of predetermined suspect tags may be gener-
`ated and periodically updated to reflect current usage within
`Internet newsgroups, Web sites and other facilities for traf-
`ficking in pornographic or illicit materials. If any file names
`containing suspect tags are identified, then the associated files
`are reported as suspect files at step 206.
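
The tag test of step 212 reduces to a simple membership check, sketched below; the two-entry tag table is only an assumed example of the periodically updated table described above.

    # Hypothetical sketch of the suspect-tag check at step 212.
    SUSPECT_TAGS = ("xxx", "crc")   # assumed example table of predetermined suspect tags

    def has_suspect_tag(file_name):
        lowered = file_name.lower()
        return any(tag in lowered for tag in SUSPECT_TAGS)

    # Example: has_suspect_tag("xxx001.jpg") and has_suspect_tag("part1.crc") return True.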
`If no suspect tags are identified at step 212, the application
`proceeds to step 214 where it is determined whether the file is
`referenced in any HTML file contained within the directory.
`Ideally, the files stored on the database would each be linked
`to HTML files contained within the directory. Where a file is
`not linked to a local HTML file, this is an indicator that a user
`is storing graphics or other non-text files that are linked to a
`Web page hosted elsewhere on a different service. As
`described above, this situation is undesirable since the free
`web hosting service pays the storage and bandwidth costs for
`these files without deriving the revenue from advertisement
`displays. Accordingly, any file names that are not referenced
`in an HTML file contained within the directory are reported as
`suspect files at step 206. Alternatively, every file bearing a file
`type capable of causing a web browser to generate hypertext
`links (i.e. *.htm, *.html, *.shtml, etc.) may also be reviewed.
`The hypertext links may be then compared against a list of
`illegal links (for example, links to adult-content Web sites).
`Any file that contains a hypertext link to such a site is reported
`as suspect. If all files on the directory are properly referenced
`in HTML files or contain no illegal links, the application
`determines whether the end of the directory has been reached
`at step 216. If the end of the directory is not yet reached, the
`application returns to step 202 to continue traversing the
`directory and identifying suspect files. Otherwise, this por-
`tion of the application ends at step 218.
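
The HTML-reference test of step 214 might be sketched as follows; this is an assumed simplification that collects src and href targets from the local HTML files and reports any stored file never referenced by them (the comparison against a list of illegal links is omitted).

    # Hypothetical sketch of the HTML-reference check at step 214.
    import os
    from html.parser import HTMLParser

    class LinkCollector(HTMLParser):
        def __init__(self):
            super().__init__()
            self.targets = set()
        def handle_starttag(self, tag, attrs):
            for key, value in attrs:
                if key in ("src", "href") and value:
                    self.targets.add(os.path.basename(value))   # keep only the file name

    HTML_TYPES = (".htm", ".html", ".shtml")

    def unreferenced_files(directory):
        names = os.listdir(directory)
        collector = LinkCollector()
        for name in names:
            if name.lower().endswith(HTML_TYPES):
                with open(os.path.join(directory, name), errors="ignore") as f:
                    collector.feed(f.read())
        return [n for n in names
                if not n.lower().endswith(HTML_TYPES) and n not in collector.targets]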
`Once a review of the directory entries is complete, the next
step is to review the content of the files listed on the directory
to see if additional files should be added to the suspect file list.
`This review may address every file listed on the directory not
`already listed on the suspect file list, or may be further nar-
`rowed using particular selection criteria specific to the type of
`illicit file, i.e., pornography, copyright infringement, etc. FIG.
`2B illustrates an exemplary method of reviewing file con-
`tents. At step 220, the application retrieves a file from the
`directory. At step 222, the retrieved file is examined to iden-
`tify whether the file contains a copyright notice or the symbol
`©. The presence of a copyright notice in the file is an indicator
`that the file has been uploaded to the server unlawfully, and
`likely contains graphics, text, software or other material that
`is protected by copyright. Any files containing the copyright
`notice would be reported as a suspect file and added to the
`suspect file list at step 224. This copyright notice check pro-
cedure can also be used to ensure compliance with appropri-
ate copyright laws. Alternatively, the file can be simply
marked for deletion. The application then returns to step 220
`and retrieves the next file.
`
`If the file does not contain a copyright notice, the applica-
`tion passes to step 226, in which the retrieved file is examined
`to determine whether the file structure is as expected for a file
`of the indicated type. For example, the file type “jpg” should
`contain a header structure with the values “255 216 255 224”.
`
`Alternatively, files can be checked to ensure that they actually
`contain the type of data described by the file type marker (i.e.,
a file named *.jpg should contain a jpg image). If the file does
`not match the indicated file type, the file can be reported as a
`suspect file and added to the suspect file list at step 224, or
`simply marked for deletion. Another alternative approach
`would be to replace files containing data of a type different
`than that indicated by their file type marker by a file stating
`that the original file was corrupted. Yet another approach
would be to retype the file (i.e. *.jpg can be retyped to *.zip if
`it contained a zipped file and not a jpg). Further, certain file
`types can be aggregated. For example, *.gif and *.jpg files
`may be aggregated as a single file type, and a file bearing a
*.jpg type is considered valid if it contains either a gif or a jpg
`image. This greatly reduces the problem of mistakenly delet-
ing a file that a consumer has innocently misnamed. The
application then returns to step 220 and retrieves the next file.
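
A hypothetical sketch of the file-type check at step 226 is given below. The JPEG header values 255 216 255 224 are taken from the description above; the "GIF8" signature is the standard GIF prefix and is included here only to illustrate the aggregated *.gif/*.jpg case.

    # Hypothetical sketch of the content-versus-type check at step 226.
    JPEG_HEADER = bytes([255, 216, 255, 224])   # header values given in the description
    GIF_HEADER = b"GIF8"                        # standard GIF signature, for the aggregated case

    def matches_indicated_type(path):
        with open(path, "rb") as f:
            head = f.read(4)
        if path.lower().endswith((".jpg", ".jpeg", ".gif")):
            # Aggregated type: a file bearing either suffix is valid if it
            # contains either a jpg or a gif image.
            return head.startswith(JPEG_HEADER) or head.startswith(GIF_HEADER)
        return True   # other file types are not checked in this sketch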
`If the file contents do match the indicated file type, the
`application determines at step 228 whether the file contains
`data extending past the end of data marker. If this marker
`appears before the true end of file, then it is likely that the
`additional data following the end of data marker constitutes a
`portion of an illicit file. At step 230, the file is truncated at the
`end of file marker. The application then returns to step 220
and retrieves the next file. If the file does not contain data past
`the end of data marker, the application proceeds to step 232 in
`which it is determined whether the end of the directory has
`been reached. If there are still additional files in the directory
`to review, the application returns to step 220 and retrieves the
next file. If there are no additional files, the file content review
`process ends at step 234.
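
For a JPEG file, the end-of-data test at steps 228 and 230 could look like the following sketch; the use of the standard JPEG end-of-image marker (bytes 0xFF 0xD9) as the "end of data marker" is an assumption for this example.

    # Hypothetical sketch of steps 228-230: detect data past the end-of-data
    # marker and truncate the file at that marker.
    EOI = b"\xff\xd9"   # assumed end-of-data marker (JPEG end-of-image)

    def truncate_after_end_of_data(path):
        with open(path, "rb") as f:
            data = f.read()
        pos = data.find(EOI)
        if pos != -1 and pos + len(EOI) < len(data):
            with open(path, "wb") as f:
                f.write(data[:pos + len(EOI)])   # truncate at the end-of-data marker
            return True    # file contained appended data and was truncated
        return False       # no data past the marker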
`After the files within the directory have been reviewed and
`a list of suspect files generated, the next step is to checksum
`the suspect files and compare the results against a library of
`checksum values corresponding to known illicit files. The
`generation of this list of known illicit files will be described
`below with respect to FIG. 4. FIG. 2C illustrates an exemplary
`method of checksumming the suspect files. A checksum is a
unique number based upon a range or ranges of bytes in a file.
`Unlike checksums as they are traditionally used in the com-
`puting field, the checksum described herein is not related to
`the total number of bytes used to generate the number, thus
`reducing a traditional problem with checksums, namely that
`similar file lengths are more likely to generate the same
`checksum than are dissimilar file lengths. In a preferred
`embodiment of the invention, two separate checksums are
`generated for a file corresponding to two different length
portions of the file. While it is possible that the first checksum
based on a shorter length portion of the file may falsely match
`the checksum of another file, it is highly unlikely that the
`second checksum would result in a false match. In addition,
`the use of an initial checksum based upon a small amount of
data reduces the burden on the network and file server. This
`reduction is a result of the ability to disqualify a file that does
not match the first checksum without the need to read the
larger amount of data necessary to generate the second check-
sum.
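
The checksum itself is generated as shown in FIG. 3 (read a byte, multiply it by the running checksum, reverse the result, truncate to a fixed size, and repeat for a predetermined number of bytes). The sketch below follows those labeled steps, but the digit-reversal interpretation, the offset applied to each byte, the 32-bit truncation, and the seed value are all assumptions, since the exact arithmetic is not spelled out in this text.

    # Illustrative sketch following the steps labeled in FIG. 3; the arithmetic
    # details are assumptions, not the patented routine.
    def generate_checksum(data, num_bytes=1024):
        running = 1                               # assumed non-zero seed
        for byte in data[:num_bytes]:             # stop at the predetermined number of bytes
            running = running * (byte + 1)        # multiply the (offset) byte by the running checksum
            running = int(str(running)[::-1])     # reverse the result (decimal digits, an assumption)
            running &= 0xFFFFFFFF                 # truncate to a fixed size (32 bits assumed)
        return running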
`
`65
`
`More particularly, at step 240, the application retrieves a
file from the database identified on the suspect file list. Then,
at step 242, the application reads a first portion of the suspect
`file. In an embodiment of the invention, the first portion
`comprises the first one-thousand (1,024) bytes of the file. A
`first checksum based on this first portion is generated at step
`244. The first checksum is then compared to a library of
`known checksum values at step 246, and at step 248 it is
determined whether there is a match between the first check-
sum and the library. This step provides an initial screen of a
`file. If there is no match, then the file likely does not corre-
`spond to a known illicit file. The file may nevertheless con-
`stitute improper or unlawful material, and it may therefore be
`advisable to manually review the file to evaluate its contents.
`If the file does contain improper or unlawful material, its
`checksum may be added to the library of known checksums
and the file marked for deletion from the database. Con-
versely, if the manual review does not reveal the file to be
`improper or unlawful, or based simply on the negative result
of the first checksum comparison, the file is removed from the
`suspect file list, and the application returns to step 240 to
`retrieve the next file from the suspect file list.
`If there is a match based on the initial screen of the file, the
`application proceeds to step 250 in which a second portion of
the file is read. In an embodiment of the invention, the second
portion comprises the first ten-thousand (10,240) bytes of the
`file. A second checksum based on this second portion is
`generated at step 252. The second checksum is then compared
to a library of known checksum values at step 254, and at step
256 it is determined whether there is a match between the
second checksum and the library. This step provides a more
`conclusive determination as to whether the file corresponds to
`a known improper or unlawful file. If there is a match, the file
`is marked for deletion (or other treatment) at step 258, and the
`application returns to step 240 to retrieve the next suspect file.
If there is not a match, the file is removed from the suspect file
`list, and the application again returns to step 240 to retrieve
`the next suspect file.
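
The two-stage screen of steps 240 through 258 might be sketched as follows; the checksum routine is passed in as a parameter (for example, the generate_checksum sketch given earlier), and the two library arguments stand for the table of known checksum values.

    # Hypothetical sketch of the two-stage checksum screen (steps 240-258).
    def screen_suspect_files(suspect_paths, checksum_fn, first_library, second_library,
                             first_len=1024, second_len=10240):
        deletion_list = []
        for path in suspect_paths:
            with open(path, "rb") as f:
                data = f.read(second_len)              # enough bytes for both portions
            if checksum_fn(data, first_len) not in first_library:
                continue                               # initial screen: no match, skip this file
            if checksum_fn(data, second_len) in second_library:
                deletion_list.append(path)             # conclusive match: mark for deletion
        return deletion_list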
`The files that are marked for deletion may be listed along
`with the pertinent information in a database (either via numer-
`ous individual files, an actual database such as SQL Server, or
`otherwise). This database may be manually reviewed and files
that should not be deleted removed from the database. A
simple file deletion program may then be run that deletes any
`file in the database.
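
A hypothetical sketch of this deletion workflow is given below, with SQLite standing in for "an actual database such as SQL Server"; rows cleared during the manual review would simply be removed from the table before the deletion pass is run.

    # Hypothetical sketch of the deletion-list database and the simple deletion program.
    import os
    import sqlite3

    def record_marked_files(db_path, marked_paths):
        con = sqlite3.connect(db_path)
        con.execute("CREATE TABLE IF NOT EXISTS marked (path TEXT PRIMARY KEY)")
        con.executemany("INSERT OR IGNORE INTO marked VALUES (?)",
                        [(p,) for p in marked_paths])
        con.commit()
        con.close()

    def run_deletions(db_path):
        con = sqlite3.connect(db_path)
        for (path,) in con.execute("SELECT path FROM marked"):
            if os.path.exists(path):
                os.remove(path)    # deletes any file listed in the database
        con.close()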
`
`As noted above, the first one-thousand bytes and the first
`ten-thousand bytes are used for the two checksums, respec-
`tively. For most applications, the use of the entire file or a
larger portion of the file is not necessary and indeed may slow
`the process; however, there is no reason why the entire file or
any other subset of the file could not be used. In an alternative
`embodiment, the first and last portions of the file are used for
`checksumming, although premature file truncation then
`becomes a way to defeat the screen. It is also possible to use
`ot