Symantec 1001
IPR of U.S. Pat. No. 7,757,298
`
US 7,757,298 B2
Page 2

U.S. PATENT DOCUMENTS

6,081,897 A *      6/2000   Bersson .................... 726/32
6,182,081 B1*      1/2001   Dietl et al. ............... 707/102
6,209,097 B1*      3/2001   Nakayama et al. ............ 713/193
6,236,768 B1*      5/2001   Rhodes et al. .............. 382/306
6,289,341 B1*      9/2001   Barney ..................... 707/6
6,510,513 B1*      1/2003   Danieli .................... 713/156
6,530,022 B1*      3/2003   Blair et al. ............... 713/186
6,577,920 B1       6/2003   Hypponen et al.
6,643,696 B2      11/2003   Davis et al.
6,922,781 B1       7/2005   Shuster .................... 382/100
7,120,274 B2*     10/2006   Kacker et al. .............. 713/201
2002/0087885 A1*   7/2002   Peled et al.
2005/0108248 A1*   5/2005   Natunen .................... 707/10

FOREIGN PATENT DOCUMENTS

WO   WO9842098   9/1998

* cited by examiner

U.S. Patent    Jul. 13, 2010    Sheet 1 of 6    US 7,757,298 B2

[FIG. 1: Block diagram of the wide area network. Legible labels: WEB 102, USER COMPUTER 120, APPLICATION, SECONDARY, FILE IDENTIFICATION SERVER; reference numerals 110, 112, 130, 134.]

U.S. Patent    Jul. 13, 2010    Sheet 2 of 6    US 7,757,298 B2

[FIG. 2A: Flow chart of the directory scan. Legible steps: DIRECTORY SCAN; TRAVERSE DIRECTORY ENTRIES (204); NAME CONTAIN SUSPECT TAGS? (212); TOTAL SIZE GREATER THAN THRESHOLD?; FILE REFERENCED IN ANY HTML FILE? (214); REPORT PRESENCE OF SUSPECT FILES; END OF DIRECTORY?]

U.S. Patent    Jul. 13, 2010    Sheet 3 of 6    US 7,757,298 B2

[FIG. 2B: Flow chart of the file content review. Legible steps: RETRIEVE FILE FROM DIRECTORY (220, 222); FILE CONTENTS MATCH INDICATED FILE TYPE? (226); TRUNCATE THE FILE.]

U.S. Patent    Jul. 13, 2010    Sheet 4 of 6    US 7,757,298 B2

[FIG. 2C: Flow chart of checksumming the suspect files. Steps: RETRIEVE FILE FROM SUSPECT FILE LIST (240); READ INITIAL PORTION OF FILE (242); GENERATE FIRST CHECKSUM (244); COMPARE FIRST CHECKSUM TO TABLE (246, 248); READ LARGER PORTION OF FILE (250); GENERATE SECOND CHECKSUM (252); COMPARE SECOND CHECKSUM TO TABLE (254, 256); ADD FILE TO DELETION LIST.]

U.S. Patent    Jul. 13, 2010    Sheet 5 of 6    US 7,757,298 B2

[FIG. 3: Flow chart of checksum generation. Steps: READ BYTE OF FILE (302); MULTIPLY BYTE BY RUNNING CHECKSUM (304); REVERSE THE RESULT (306); TRUNCATE TO FIXED SIZE (308); REACHED PREDETERMINED NUMBER OF BYTES? (310); YES (312).]

U.S. Patent    Jul. 13, 2010    Sheet 6 of 6    US 7,757,298 B2

[FIG. 4: Flow chart of checksum library generation. Steps (402-410): CHECKSUM LIBRARY; IDENTIFY SOURCE FILES; GENERATE CHECKSUMS; STORE CHECKSUM, FILE NAME, AND FILE LENGTH IN LIBRARY.]

US 7,757,298 B2

METHOD AND APPARATUS FOR IDENTIFYING AND CHARACTERIZING ERRANT ELECTRONIC FILES

RELATED APPLICATIONS

This application is a continuation of application Ser. No. 09/561,751, filed Apr. 29, 2000, now U.S. Pat. No. 6,922,781, which claims priority pursuant to 35 U.S.C. §119(e) to U.S. Provisional Application Nos. 60/132,093, filed Apr. 30, 1999; 60/142,332, filed Jul. 3, 1999; and 60/157,195, filed Sep. 30, 1999. All of the foregoing non-provisional and provisional applications are specifically incorporated by reference herein, in their entirety.

COPYRIGHT NOTICE

This patent document contains material subject to copyright protection. The copyright owner, Ideaflood, Inc., has no objection to the reproduction of this patent document or any related materials, as they appear in the files of the Patent and Trademark Office of the United States or any other country, but otherwise reserves all rights whatsoever.
BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to electronic files stored on computers, and more particularly, to methods and apparatus for identifying and characterizing errant electronic files stored on computer storage devices.

2. Description of Related Art

The use of public and shared computing environments has proliferated due to the popularity of the Internet. Many Internet service providers (ISPs) offer Web hosting services at low or no cost in which registered users can place their own Web sites on the ISP's servers. These individual Web sites allow users to store and access electronic files that are uploaded to the servers. As a result of this proliferation, the administration of the large number of stored electronic files has become an important aspect of such Web hosting services. In view of the relative ease of public access to these electronic file storage resources, there is also widespread abuse of Web server space in which users upload files that are offensive, illegal, unauthorized, or otherwise undesirable and thus wasteful of storage resources. These files are predominantly of four types: music, video, software and graphics. Many such files may contain pornography in violation of the terms of use of the Web hosting service. Moreover, the copying of these files to the Web server may be in violation of U.S. copyright laws. Consequently, the identification and removal of such files represents a significant administrative burden to the Web hosting services. In addition, the presence of certain files (such as depictions of child pornography or copyrighted music files) on user computers on corporate networks poses great legal risks to the corporation.

Such files can be selected for review and characterized as acceptable or unacceptable to the system administrator using an automated or manual process. Unfortunately, many undesirable files are not easily recognizable and cannot be detected and characterized. A manual review of the content of the files stored on the storage resource is usually not economically feasible, and is also not entirely effective at identifying undesirable files. Illicit users of Web hosting services have devised numerous techniques for disguising improper files wherein even easily recognizable file types are disguised as less recognizable file types.
`less recognizable file types. One such technique for disguis-
`
`10
`
`15
`
`20
`
`25
`
`30
`
`35
`
`40
`
`45
`
`50
`
`55
`
`60
`
`65
`
`2
`
`ing files is to split them into parts so that (i) they cannot be
`detected by simple searches for large files, and (ii) they can be
`downloaded or uploaded in smaller chunks so that ifa transfer
`is interrupted, the entire download or upload is not lost. The
`split files may also be renamed so as to hide their true file type.
`For example, a search for oversized music files (*.mp3)
`would not turn up a huge file named “song.txt” because it
`appears to the system as a text file.
`Another technique for hiding files is to append them to files
`that legitimately belong on a web server. By way of example,
`a Web site may be created called “Jane’s Dog’s Home Page.”
`Jane gets ten small pictures of her dog, converts them to a
`computer readable format (for example, jpeg) and saves them
`on her computer. She then splits stolen, copyrighted software
`into ten parts. She appends each part to the end of one of the
`jpeg files. She then uploads these to a web server. Upon a
`manual review of the web page, the administrator of the site
`would not notice that the otherwise innocuous dog pictures
`actually contain stolen software, because each of the files
`would in fact display a photo of a dog. Thus, even if the files
`were reported for manual review by software doing a simple
`search for oversized files, the files would be left on the server
`because they appear to be legitimate. While these files can
`sometimes be identified by name or size alone, these methods
`lead to unacceptable numbers of false positives and false
`negatives as file sizes and names are changed.
`Free and low cost web hosting services typically rely on
`advertising revenue to fund their operation. An additional
`abuse of these web hosting services is that they can be cir-
`cumvented such that the advertisements are not displayed.
`Typically, the advertising content is displayed on text or
`hypertext pages. If a user stores graphics or other non-text
`files on a free web hosting server, yet creates a web page
`elsewhere on a different service that references these graphics
`or non-text files, the free web hosting service pays the storage
`and bandwidth costs for these files without deriving the rev-
`enue from advertisement displays.
`A need exists, therefore, to provide a method and apparatus
`for identifying and characterizing errant electronic files
`stored on computer storage devices, that makes use of a
`variety of file attributes to reliably characterize files accord-
`ing to pre-set criteria, that is not easily circumvented, and that
`reduces the amount of manual review necessary to verify
`proper operation.
SUMMARY OF THE INVENTION

In accordance with the teachings of the present invention, a method and apparatus are provided for identifying and characterizing files electronically stored on a computer storage device. More particularly, an embodiment of the invention further comprises a computer system that includes a server having a memory connected thereto. The server is adapted to be connected to a network to permit remote storage and retrieval of data files from the memory. A file identification application is operative with the server to identify errant files stored in the memory. The file identification application provides the functions of: (1) selecting a file stored in said memory; (2) generating a unique checksum corresponding to the stored file; (3) comparing said unique checksum to each of a plurality of previously generated checksums, wherein the plurality of previously generated checksums correspond to known errant files; and (4) marking the file for deletion from the memory if the unique checksum matches one of the plurality of previously generated checksums.
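By way of illustration only, the four recited functions can be sketched as follows (Python; the in-memory library and the MD5 stand-in checksum are assumptions made for the sketch, not the patent's own routine, whose preferred checksum procedure is the one shown in FIG. 3):

```python
import hashlib

# Hypothetical library of checksums of known errant files. In practice
# this would be the persistent library whose generation FIG. 4 describes.
KNOWN_ERRANT_CHECKSUMS = set()

def generate_checksum(data: bytes) -> str:
    # Stand-in checksum for the sketch; the patent's preferred routine
    # is the multiply/reverse/truncate loop of FIG. 3, not a hash.
    return hashlib.md5(data).hexdigest()

def identify_errant_files(files: dict) -> list:
    """files maps file name -> contents (bytes).
    Returns the names marked for deletion (functions (1)-(4) above)."""
    marked = []
    for name, data in files.items():              # (1) select a file
        checksum = generate_checksum(data)        # (2) generate its checksum
        if checksum in KNOWN_ERRANT_CHECKSUMS:    # (3) compare to library
            marked.append(name)                   # (4) mark for deletion
    return marked
```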
A more complete understanding of the method and apparatus will be afforded to those skilled in the art, as well as a realization of additional advantages and objects thereof, by a consideration of the following detailed description of the preferred embodiment. Reference will be made to the appended sheets of drawings that will first be described briefly.
BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram illustrating a wide area network in which a web host delivers information in the form of web pages to users;

FIG. 2A is a flow chart illustrating a method of scanning a file directory to identify suspect files stored in a database in accordance with an embodiment of the invention;

FIG. 2B is a flow chart illustrating a method of reviewing file contents to identify suspect files;

FIG. 2C is a flow chart illustrating a method of checksumming the suspect files;

FIG. 3 is a flow chart illustrating a method of generating checksum values; and

FIG. 4 is a flow chart illustrating a method of generating a checksum library.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

The present invention satisfies the need for a method and apparatus for identifying and characterizing errant electronic files stored on computer storage devices, that makes use of a variety of file attributes to reliably characterize files according to pre-set criteria, that is not easily circumvented, and that reduces the amount of manual review necessary to verify proper operation. In the detailed description that follows, like element numerals are used to describe like elements illustrated in one or more of the figures.

Referring first to FIG. 1, a block diagram is illustrated of a wide area network in which information is delivered to users in the form of web pages. It is anticipated that the present system operates with a plurality of computers that are coupled together on a communications network, such as the Internet or a wide area network. FIG. 1 depicts a network that includes a user computer 120 that communicates with a Web host 110 through communication links that include the Internet 102. The user computer 120 may be any type of computing device that allows a user to interactively browse websites, such as a personal computer (PC) that includes a Web browser application 122 executing thereon (e.g., Microsoft Internet Explorer™ or Netscape Communicator™). The Web host 110 includes a server 112 that can selectively deliver graphical data files in the form of HyperText Markup Language (HTML) documents to the user computer 120 using the HyperText Transport Protocol (HTTP). Currently, HTML 2.0 is the standard used for generating Web documents, though it should be appreciated that other coding conventions could also be used within the scope of the present invention. The server 112 accesses HTML documents stored within a database 116 that can be requested, retrieved and viewed at the user computer via operation of the Web browser 122. The database 116 may also contain many other types of files, including text, graphics, music, and software files. It should be appreciated that many different user computers may be communicating with the server 112 at the same time.

As generally known in the art, a user identifies a Web page that is desired to be viewed at the user computer 120 by communicating an HTTP request from the browser application 122. The HTTP request includes the Uniform Resource Locator (URL) of the desired Web page, which may correspond to an HTML document stored on the database 116 of the Web host 110. The HTTP request is routed to the server 112 via the Internet 102. The server 112 then retrieves the HTML document identified by the URL, and communicates the HTML document across the Internet 102 to the browser application 122. The HTML document may be communicated in the form of plural message packets as defined by standard protocols, such as the Transmission Control Protocol/Internet Protocol (TCP/IP). A user may also download any other type of file from the database 116 in the same manner.
FIG. 1 further illustrates a secondary Web host 130 having a server 132 and database 134 similar to that of the primary Web host 110. The user computer 120 can communicate with the secondary Web host 130 in the same manner as described above. Moreover, the primary Web host 110 can communicate with the secondary Web host 130 in the same manner. The pertinence of this communication path will become more clear from the following description of the present method. The Web host 110 further comprises a file identification application 114 that analyzes the data files stored on the database 116 in order to identify errant files in accordance with the present invention. The file identification application 114 may comprise a program executing on the same computer as the server 112, or may be executing on a separate computer. The file identification application tests various attributes of the files stored on the database to determine whether they satisfy a particular profile that corresponds to an errant file. Source code for a preferred embodiment of a file identification application is attached hereto as an exhibit.

A widely accepted characteristic of the Internet is that files are copied relentlessly and without permission. This is particularly true of illicit files, such as adult content, pornographic material or illegally copied software, music or graphics. Thus, a photograph showing up on a single Web site may be propagated to hundreds of other Web sites within days. Although the file name is often changed, and transmission errors often result in premature truncation of the file (and thus a new file length), the initial portion of the file remains identical as it is propagated throughout the Internet. Another characteristic of the Internet is that illicit files, such as music, video and software, all have one common attribute: they are very large once reassembled. It is therefore necessary to (i) identify oversized files that have been uploaded in parts, and (ii) identify "hidden" files that are appended to otherwise legitimate files. As will be further described below, an aspect of the present invention takes advantage of these characteristics of the Internet.
Referring now to FIGS. 2A-2C, a method for identifying and characterizing files is illustrated in accordance with an embodiment of the invention. The method would be executed by the file identification application 114 described above with respect to FIG. 1. FIG. 2A illustrates an exemplary method of scanning a file directory to identify suspect files stored in a database. Suspect files are ones that are suspected of being improper, and are marked for further testing. The database 116 includes a directory that identifies the files stored therein based on various attributes, including file name and file size. It will be appreciated from the following discussion that the method of FIGS. 2A-2C relates specifically to the identification of pornographic materials in view of the particular selection criteria that is utilized; however, it will be understood by persons of ordinary skill in the art that the selection criteria can be modified to identify other types of illicit files. Starting at step 202, the application traverses the directory in order to analyze the numerous directory entries. The application may construct a relational database of the directory entries in order to sort on the various fields of the directory. This step may be performed repeatedly as a continuing process through this identifying process, and would have to be repeated periodically to identify new files that are added to the database 116.
At step 204, the application determines whether there are any sequentially numbered files within the directory. Sequential files can be identified by analyzing and comparing the file names to each other. One attribute of pornographic materials is that they are often uploaded to a server as part of a series of photographs. Thus, the file names may include an embedded numerical designation such as "xxx001.jpg" or "xxx002.jpg". The user may define at what level of folders the software will look for sequentially numbered, lettered, or otherwise identified files. For example, if a file server is divided into folders lettered from "AA" to "ZZ", and each folder contains Web sites with names in which the first two letters correspond to the name of the file folder, the user could decide to treat all folders on the server as a single Web site, or to treat only Web sites within the same folder as a single Web site, or to treat each Web site individually. In the preferred embodiment, each Web site is considered on its own without reference to other Web sites, although the invention need not be limited in this manner.
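For illustration, the sequential-numbering test of step 204 can be sketched as follows (Python; the digit-run heuristic, the "#" placeholder, and the two-file minimum are assumptions chosen for the sketch, not details from the patent):

```python
import re
from collections import defaultdict

def find_sequential_files(names):
    """Group file names whose only difference is an embedded number
    (e.g. xxx001.jpg, xxx002.jpg) and report groups that actually form
    a consecutive numeric series of two or more files."""
    groups = defaultdict(list)
    for name in names:
        m = re.search(r"(\d+)", name)
        if m:
            # Key on the name with its first digit run blanked out.
            key = name[:m.start()] + "#" + name[m.end():]
            groups[key].append(int(m.group(1)))
    suspects = []
    for key, numbers in groups.items():
        numbers.sort()
        # Two or more consecutive numbers count as a series.
        if len(numbers) >= 2 and all(b - a == 1 for a, b in zip(numbers, numbers[1:])):
            suspects.append(key)
    return suspects
```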
If any such sequential files are identified, they are reported as suspect files at step 206. Then, the application returns to step 202 and continues traversing through the directory entries. If no sequential files are identified at step 204, the application next determines at step 208 whether there are any files having identical file sizes. Another attribute of stolen intellectual property materials such as music files is that they are often broken up into several pieces in order to thwart their detection by simple searches for large files, and also to enable them to be downloaded or uploaded in smaller chunks to facilitate transfer. The presence of two or more files having identical file size within the directory is an indicator that they may be pieces of a single, larger, illicit file. If there are plural files with identical file sizes, the application determines at step 210 whether the total size of the identical files summed together would exceed a predetermined threshold. As noted above, illicit files tend to be unusually large, so the predetermined threshold would be selected to correspond with the largest size of a typical non-illicit file. If the total size does exceed the predetermined threshold, then the identical files are reported as suspect files at step 206.
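Steps 208 and 210 can be sketched as a grouping pass over the directory entries (Python; the entry tuples and the example threshold value are illustrative assumptions):

```python
from collections import defaultdict

def find_identical_size_suspects(entries, threshold):
    """entries: list of (name, size) directory entries.
    Flags groups of two or more same-size files (step 208) whose
    combined size exceeds the predetermined threshold (step 210)."""
    by_size = defaultdict(list)
    for name, size in entries:
        by_size[size].append(name)
    suspects = []
    for size, names in by_size.items():
        if len(names) >= 2 and size * len(names) > threshold:
            suspects.extend(names)
    return suspects
```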
More particularly, the application may manipulate the file names to determine whether they are in fact likely to be parts of a single, larger file. An alternative way to determine whether files should be aggregated is to delete all numbers from the file names. Any files that are identically named after the elimination of all numbers would be marked as potentially responsive and their names and aggregate size would be reported. Of course, this can be limited to numbers in conjunction with specified letters (such as r00, r01, etc., as the "r" denotation often indicates file compression and division via the RAR method). Similarly, this can be limited to specified file types (whether identified by the file type suffix to the file name, or by examination of the actual contents of the file) or files other than specified types (for example, legitimate graphics files such as *.jpg are often sequentially numbered and may be a good candidate for exclusion). Next, using the original list of file names, any files are identified that differ only by a user-defined number of characters. Such files would be marked as potentially responsive and their names and aggregate size would be reported. Both of the foregoing methods can be set to either ignore the file suffix or file type information or to utilize it. Next, using the original list of file names and sizes, files that are of the same size (or within a user-defined number of bytes of being of the same size) are identified. Any such files are marked as potentially responsive and their names and aggregate size would be reported.
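The delete-all-numbers aggregation described above can be sketched as follows (Python; the tuple shape of the report is an assumption for the sketch):

```python
import re
from collections import defaultdict

def aggregate_by_stripped_name(entries):
    """entries: list of (name, size). Deletes all digits from each
    name; any names identical after that elimination are reported
    together with their aggregate size."""
    groups = defaultdict(list)
    for name, size in entries:
        groups[re.sub(r"\d", "", name)].append((name, size))
    report = {}
    for stem, members in groups.items():
        if len(members) >= 2:
            report[stem] = ([n for n, _ in members],
                            sum(s for _, s in members))
    return report
```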
If no identical files are identified at step 208, or if the total size does not exceed the predetermined threshold at step 210, the application proceeds to step 212 where it is determined whether the file names contain any suspect tags. An example of a suspect tag is "xxx", which is often used in association with pornographic materials. Another example of a suspect tag is "crc", which refers to a cyclic redundancy check (CRC), i.e., a known error checking technique used to ensure the accuracy of transmitting digital data. When a large file has been broken up into plural smaller files, it is common to include a CRC file in order to verify the accurate reconstruction of the large file. The presence of a file having a "crc" tag is an indicator that an illicit or illegal file has been uploaded to the server. A table of predetermined suspect tags may be generated and periodically updated to reflect current usage within Internet newsgroups, Web sites and other facilities for trafficking in pornographic or illicit materials. If any file names containing suspect tags are identified, then the associated files are reported as suspect files at step 206.
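The tag check of step 212 reduces to a substring scan against the table of suspect tags (Python sketch; only "xxx" and "crc" come from the text, and a real table would be operator-maintained and periodically updated as described above):

```python
# Table of predetermined suspect tags. "xxx" and "crc" are the
# examples given in the text; a deployed table would be larger.
SUSPECT_TAGS = ("xxx", "crc")

def has_suspect_tag(name: str) -> bool:
    """Step 212: does the file name contain any suspect tag?"""
    lowered = name.lower()
    return any(tag in lowered for tag in SUSPECT_TAGS)
```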
If no suspect tags are identified at step 212, the application proceeds to step 214 where it is determined whether the file is referenced in any HTML file contained within the directory. Ideally, the files stored on the database would each be linked to HTML files contained within the directory. Where a file is not linked to a local HTML file, this is an indicator that a user is storing graphics or other non-text files that are linked to a Web page hosted elsewhere on a different service. As described above, this situation is undesirable since the free web hosting service pays the storage and bandwidth costs for these files without deriving the revenue from advertisement displays. Accordingly, any file names that are not referenced in an HTML file contained within the directory are reported as suspect files at step 206. Alternatively, every file bearing a file type capable of causing a web browser to generate hypertext links (i.e., *.htm, *.html, *.shtml, etc.) may also be reviewed. The hypertext links may then be compared against a list of illegal links (for example, links to adult-content Web sites). Any file that contains a hypertext link to such a site is reported as suspect. If all files on the directory are properly referenced in HTML files or contain no illegal links, the application determines whether the end of the directory has been reached at step 216. If the end of the directory is not yet reached, the application returns to step 202 to continue traversing the directory and identifying suspect files. Otherwise, this portion of the application ends at step 218.
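The local-reference test of step 214 can be sketched as follows (Python; the regular-expression attribute scan is a simplification assumed for the sketch — a real pass might use an HTML parser and resolve relative paths):

```python
import re

def unreferenced_files(file_names, html_contents):
    """file_names: the non-HTML files in the directory.
    html_contents: the HTML document strings from that directory.
    Returns the files not referenced by any src= or href= attribute,
    which step 214 reports as suspect."""
    referenced = set()
    for html in html_contents:
        referenced.update(re.findall(r'(?:src|href)="([^"]+)"', html))
    return [f for f in file_names if f not in referenced]
```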
Once a review of the directory entries is complete, the next step is to review the content of the files listed on the directory to see if additional files should be added to the suspect file list. This review may address every file listed on the directory not already listed on the suspect file list, or may be further narrowed using particular selection criteria specific to the type of illicit file, i.e., pornography, copyright infringement, etc. FIG. 2B illustrates an exemplary method of reviewing file contents. At step 220, the application retrieves a file from the directory. At step 222, the retrieved file is examined to identify whether the file contains a copyright notice or the symbol ©. The presence of a copyright notice in the file is an indicator that the file has been uploaded to the server unlawfully, and likely contains graphics, text, software or other material that is protected by copyright. Any files containing the copyright notice would be reported as a suspect file and added to the suspect file list at step 224. This copyright notice check procedure can also be used to ensure compliance with appropriate copyright laws. Alternatively, the file can be simply marked for deletion. The application then returns to step 220 and retrieves the next file.
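The notice scan of step 222 can be sketched as a byte search (Python; checking the textual spellings alongside the © code point is an assumption for the sketch):

```python
def contains_copyright_notice(data: bytes) -> bool:
    """Step 222: flag a file whose raw bytes contain a copyright
    notice or the © symbol. 0xA9 is the Latin-1/Unicode code point
    for the copyright sign; textual spellings are matched
    case-insensitively."""
    lowered = data.lower()
    return (b"copyright" in lowered
            or b"(c)" in lowered
            or b"\xa9" in data)
```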
If the file does not contain a copyright notice, the application passes to step 226, in which the retrieved file is examined to determine whether the file structure is as expected for a file of the indicated type. For example, a file of type "jpg" should contain a header structure with the values "255 216 255 224". Alternatively, files can be checked to ensure that they actually contain the type of data described by the file type marker (i.e., a file named *.jpg should contain a jpg image). If the file does not match the indicated file type, the file can be reported as a suspect file and added to the suspect file list at step 224, or simply marked for deletion. Another alternative approach would be to replace files containing data of a type different than that indicated by their file type marker by a file stating that the original file was corrupted. Yet another approach would be to retype the file (i.e., *.jpg can be retyped to *.zip if it contained a zipped file and not a jpg). Further, certain file types can be aggregated. For example, *.gif and *.jpg files may be aggregated as a single file type, and a file bearing a *.jpg type is considered valid if it contains either a gif or a jpg image. This greatly reduces the problem of mistakenly deleting a file that a consumer has innocently misnamed. The application then returns to step 220 and retrieves the next file.
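The header test of step 226 can be sketched with a magic-number table (Python; the jpg values 255 216 255 224 come from the text, while the gif and zip entries are common magic numbers added here purely for illustration):

```python
# Expected header bytes by file type. Only the "jpg" row is from the
# text; the others are illustrative additions.
EXPECTED_HEADERS = {
    "jpg": bytes([255, 216, 255, 224]),   # FF D8 FF E0
    "gif": b"GIF8",
    "zip": b"PK\x03\x04",
}

def matches_indicated_type(name: str, data: bytes) -> bool:
    """Step 226: does the file's content match the type its name
    claims? Extensions absent from the table are not flagged here."""
    ext = name.rsplit(".", 1)[-1].lower()
    expected = EXPECTED_HEADERS.get(ext)
    return expected is None or data.startswith(expected)
```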
If the file contents do match the indicated file type, the application determines at step 228 whether the file contains data extending past the end of data marker. If this marker appears before the true end of file, then it is likely that the additional data following the end of data marker constitutes a portion of an illicit file. At step 230, the file is truncated at the end of file marker. The application then returns to step 220 and retrieves the next file. If the file does not contain data past the end of data marker, the application proceeds to step 232 in which it is determined whether the end of the directory has been reached. If there are still additional files in the directory to review, the application returns to step 220 and retrieves the next file. If there are no additional files, the file content review process ends at step 234.
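Steps 228 and 230 can be sketched for the JPEG case (Python; restricting to JPEG, and treating the first FF D9 end-of-image marker as the end of data, are assumptions — the relevant marker differs by file type, and a real JPEG can legitimately contain FF D9 inside embedded thumbnails):

```python
def truncate_after_eoi(data: bytes) -> bytes:
    """Steps 228-230 for a JPEG: a JPEG image ends with the
    end-of-image marker FF D9, so any bytes after the first EOI are
    treated as appended foreign data and cut off."""
    eoi = data.find(b"\xff\xd9")
    if eoi != -1 and eoi + 2 < len(data):
        return data[:eoi + 2]
    return data
```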
After the files within the directory have been reviewed and a list of suspect files generated, the next step is to checksum the suspect files and compare the results against a library of checksum values corresponding to known illicit files. The generation of this list of known illicit files will be described below with respect to FIG. 4. FIG. 2C illustrates an exemplary method of checksumming the suspect files. A checksum is a unique number based upon a range or ranges of bytes in a file. Unlike checksums as they are traditionally used in the computing field, the checksum described herein is not related to the total number of bytes used to generate the number, thus reducing a traditional problem with checksums, namely that similar file lengths are more likely to generate the same checksum than are dissimilar file lengths. In a preferred embodiment of the invention, two separate checksums are generated for a file, corresponding to two different length portions of the file. While it is possible that the first checksum based on a shorter length portion of the file may falsely match the checksum of another file, it is highly unlikely that the second checksum would result in a false match. In addition, the use of an initial checksum based upon a small amount of data reduces the burden on the network and file server. This reduction is a result of the ability to disqualify a file that does not match the first checksum without the need to read the larger amount of data necessary to generate the second checksum.
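For illustration, one possible reading of the FIG. 3 loop (multiply the running checksum by the byte, reverse the result, truncate to a fixed size, stop at a predetermined number of bytes) is sketched below. The seed value of 1, the decimal-digit reversal, the eight-digit width, and the byte+1 adjustment (so that a zero byte cannot collapse the product to zero) are assumptions not specified by the figure:

```python
def fig3_checksum(data: bytes, num_bytes: int = 1024, width: int = 8) -> int:
    """One interpretation of the FIG. 3 checksum loop."""
    checksum = 1                               # assumed seed
    for b in data[:num_bytes]:                 # stop at predetermined count (310)
        checksum *= b + 1                      # multiply byte by running checksum (304)
        checksum = int(str(checksum)[::-1])    # reverse the result (306)
        checksum %= 10 ** width                # truncate to fixed size (308)
    return checksum
```

Note that, as the text above requires, the value depends on the bytes read rather than on the file's total length.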
More particularly, at step 240, the application retrieves a file from the database identified on the suspect file list. Then, at step 242, the application reads a first portion of the suspect file. In an embodiment of the invention, the first portion comprises the first one-thousand (1,024) bytes of the file. A first checksum based on this first portion is generated at step 244. The first checksum is then compared to a library of known checksum values at step 246, and at step 248 it is determined whether there is a match between the first checksum and the library. This step provides an initial screen of a file. If there is no match, then the file likely does not correspond to a known illicit file. The file may nevertheless constitute improper or unlawful material, and it may therefore be advisable to manually review the file to evaluate its contents. If the file does contain improper or unlawful material, its checksum may be added to the library of known checksums and the file marked for deletion from the database. Conversely, if the manual review does not reveal the file to be improper or unlawful, or based simply on the negative result of the first checksum comparison, the file is removed from the suspect file list, and the application returns to step 240 to retrieve the next file from the suspect file list.
If there is a match based on the initial screen of the file, the application proceeds to step 250 in which a second portion of the file is read. In an embodiment of the invention, the second portion comprises the first ten-thousand (10,240) bytes of the file. A second checksum based on this second portion is generated at step 252. The second checksum is then compared to a library of known checksum values at step 254, and at step 256 it is determined whether there is a match between the second checksum and the library. This step provides a more conclusive determination as to whether the file corresponds to a known improper or unlawful file. If there is a match, the file is marked for deletion (or other treatment) at step 258, and the application returns to step 240 to retrieve the next suspect file. If there is not a match, the file is removed from the suspect file list, and the application again returns to step 240 to retrieve the next suspect file.
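The full two-stage decision (steps 242 through 258) might be sketched as below. Again, SHA-256 and the function names are illustrative assumptions; the patent also does not say whether the two comparison stages share one library or use separate ones, so separate sets are assumed here.

```python
import hashlib

FIRST_PORTION = 1024    # bytes read for the initial checksum (step 242)
SECOND_PORTION = 10240  # bytes read for the confirming checksum (step 250)


def portion_checksum(path, n_bytes):
    """Checksum over the first n_bytes of the file (algorithm is an assumption)."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read(n_bytes)).hexdigest()


def screen_file(path, first_library, second_library):
    """Return 'delete' on a two-stage match (step 258), else 'clear'
    (remove the file from the suspect list and move on)."""
    if portion_checksum(path, FIRST_PORTION) not in first_library:
        return "clear"   # initial screen disqualifies the file cheaply
    if portion_checksum(path, SECOND_PORTION) not in second_library:
        return "clear"   # larger second checksum did not confirm the match
    return "delete"      # mark for deletion or other treatment
```

The expensive second read happens only for files that survive the initial screen, mirroring the load-reduction rationale given earlier in this section.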
The files that are marked for deletion may be listed along with the pertinent information in a database (either via numerous individual files, an actual database such as SQL Server, or otherwise). This database may be manually reviewed and files that should not be deleted removed from the database. A simple file deletion program may then be run that deletes any file in the database.
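The "simple file deletion program" mentioned above could look something like the following minimal sketch. The list-plus-review structure is taken from the passage; the function name and the idea of passing the manually cleared files as a separate set are assumptions for illustration.

```python
import os


def run_deletion_pass(deletion_list, keep):
    """Delete every file on the deletion list except those that manual
    review decided should be kept; return the paths actually deleted."""
    deleted = []
    for path in deletion_list:
        if path in keep:
            continue  # manual review removed this entry from the deletion set
        if os.path.exists(path):
            os.remove(path)
            deleted.append(path)
    return deleted
```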
`
As noted above, the first one-thousand bytes and the first ten-thousand bytes are used for the two checksums, respectively. For most applications, the use of the entire file or a larger portion of the file is not necessary and indeed may slow the process; however, there is no reason why the entire file or any other subset of the file could not be used. In an alternative embodiment, the first and last portions of the file are used for checksumming, although premature file truncation then becomes a way to defeat the screen. It is also possible to use ot
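The first-and-last-portion variant described in the alternative embodiment above might be sketched as follows; SHA-256, the function name, and the default portion size are illustrative assumptions. The sketch also shows why truncation defeats this variant: shortening the file changes which bytes form the tail, and hence the checksum.

```python
import hashlib
import os


def first_and_last_checksum(path, n_bytes=1024):
    """Checksum over the first and last n_bytes of a file. If the file is
    shorter than 2 * n_bytes the two portions simply overlap."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        head = f.read(n_bytes)
        f.seek(max(size - n_bytes, 0))
        tail = f.read(n_bytes)
    return hashlib.sha256(head + tail).hexdigest()
```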
