Extensions of the UNIX File Command and Magic File for File Type Identification
`
`William Underwood
`
`
`
`
`Technical Report ITTL/CSITD 09-02
`
`
`September 2009
`
`Computer Science and Information Technology Division
`
`Information Technology and Telecommunications Laboratory
`
`Georgia Tech Research Institute
`
`Georgia Institute of Technology
`
`
`
`
`
`
`
`
`The Army Research Laboratory (ARL) and the National Archives and Records Administration
`(NARA) sponsor this research under Army Research Office Cooperative Agreement W911NF-
`06-2-0050. The findings in this paper should not be construed as an official ARL or NARA
`position unless so indicated by other authorized documentation.
`
`Teradata, Exh. 1027, p. 1 of 25
`
`
`
`ABSTRACT
`
`
`File format identification is a core requirement for digital archives. The UNIX file command is
`among the most promising technologies for file type identification. This report describes
`extensions to the file command and magic file that enhance their utility for file format
`identification in archival systems.
`
A File Format Library (database) has been created to manage information about file formats.
`This information includes file format name, MIME type, PRONOM Universal Identifier and file
`signature tests. There is a one-to-one correspondence between file formats and file signature
`tests. Precedence relations between file signature tests are explicitly expressed in the database.
`Published specifications for file formats are also collected in the library and are used to
`determine file signatures for the formats. When specifications have not been published for a file
`format, samples for files in those formats have been collected and analyzed to determine possible
`file signatures. File signature tests have been created for more than 800 file formats. Sample files
`for more than 500 of the file formats in the library have been created or collected for testing of
the file signatures. These examples are included in the library.
`
`The Library includes links to file format software resources that are needed in archival
`processing of digital records. These include: file viewers/players, archive extractors, file format
`converters, password recovery software and repairers for damaged files.
`
`The File Format Library supports the creation of a magic file from the file signature tests in the
`Library. The GTRI File Type Identifier is a graphical user interface to the file command and the
`magic file created from the File Format Library. The file command and magic tests have been
`applied to examples of 500+ file formats from the File Format Library. These tests have led to
`refinement of the file signature tests and discovery of the precedence relationships among file
`signature tests.
`
`The National Archives (TNA) of the UK provides a public registry of file format information
`(PRONOM). This information includes file signature patterns expressed as regular expressions.
`TNA also provides a tool (DROID) that uses these file signature patterns for file format
`identification. This approach to file type identification is also promising and seems to be
`primarily limited by the small number of file signature patterns in the PRONOM registry. GTRI
`is collaborating with TNA to enhance the content of the registry and the performance of the
`DROID file format identifier.
`
`
`
`TABLE OF CONTENTS
`
`
1. INTRODUCTION
2. FILE FORMAT SIGNATURES AND EXTERNAL FILE IDENTIFIERS
2.1 BASIC CONCEPTS
2.2 EXTERNAL FORMAT IDENTIFIERS
3. THE FILE COMMAND AND MAGIC FILE
3.1 THE FILE COMMAND
3.2 MAGIC FILE
3.3 LIMITATIONS OF FILE AND MAGIC
4. A FILE FORMAT LIBRARY
5. GTRI FILE TYPE IDENTIFIER
6. RELATED RESEARCH
7. CONCLUSION
REFERENCES
`
`
`1. Introduction
`
`
`Automated file format identification is a necessary feature for the ingestion of digital objects into
an archive. Such identification is needed to ensure that the files received from a creator have the
expected file formats so that the archive is able to preserve the files and make them available to
the public. Knowledge of the file formats is necessary to ensure that viewers/players are available
`for the files, for conversion of legacy file formats into standard, current or persistent object file
`formats, for extraction of files from archive files, and for repair of damaged files.
`
The file command and magic file available in the Linux and BSD flavors of UNIX are probably
`the most widely used tool for file format identification. The tests for identifying file formats in
`the magic file have been and remain the largest repository of information on file signatures in the
`world. However, the file command and magic file lack some features for file format
`identification that are required by digital archives.
`
`The primary objective of the research described in this report is to identify the most promising
`technology for reliable file format identification and to advance this technology to meet the
`needs of the National Archives and Records Administration. The specific purpose of this report
`is to describe extensions made to the UNIX file command and magic file to improve the
`management of file format information and to increase the reliability of file type identification.
`These extensions include a File Format Library for managing file format information including
`file signature criteria that can be used to identify file formats. The library is also a repository for
`file format specifications and software for viewing/playing files, extracting files from archive
`files, recovering passwords, and repairing damaged files. It also contains sample files for the file
`formats in the library.
`
`Section 2 of this report discusses the concepts of file types and file type identifiers. Section 3
`briefly summarizes features of the file command and magic file and discusses some of their
`limitations. Section 4 describes a File Format Library that supports the management of file
`format information and that supports the creation of magic files used by the file command.
`Section 5 describes a graphical user interface for file type identification that is based on the file
`command and a magic file created from the File Format Library. Section 6 describes related
`research and development. Section 7 summarizes the results.
`
`
`2. File Format Signatures and External File Identifiers
`
`
`2.1 Basic Concepts
`
`In the context of data storage and transmission, a file is a sequence of bits in which a data
`representation or computer instructions internal to a computer program has been encoded
`according to a file format so that it can be stored on a storage medium or transmitted over a
`network communication link. When the resulting file is read by a computer operating system, it
`is either decoded and executed by the computer, or passed to a computer program and decoded
`according to the file format to create a copy of the original internal data representation.
`
`
`An executable file is a serialization of encoded computer instructions. A script file is a file that
`contains instructions for an interpreter or a virtual machine.
`
A data file is an external data representation in a sequence of bits of an internal data
`representation that can be stored on a storage medium or transmitted across a communication
`network. When the resulting file is reread according to the file format, it can be used to create a
`copy of the original internal data representation.
`
`In object-oriented programming, serialization is the process of encoding an object into an
`architecture independent serial format for storage or transmission across a communication
`network. When the resulting series of bits is decoded according to the serialization format, it can
`be used to create a semantically identical copy of the original object. Such methods of
`serialization result in persistent objects that because of their architecture independence are not
`subject to obsolescence of computer platforms (hardware and operating systems). Examples of
`such formats include the Hierarchical Data Format (HDF), Comma Separated Values (CSV), and
`JavaScript Object Notation (JSON).
`
A file type (or file format class) is a class of files with the same file format. A file format signature
is invariant data in a file format that can be used to identify the file type (or format) of a file. In
the UNIX operating system (including flavors such as BSD, Linux and Solaris), file format signatures
are referred to as magic numbers. In contrast to file format signatures, a file signature is a
checksum or hash code of a file that can be used to check the integrity of the file.
`
2.2 External Format Identifiers
`
A unique identifier is needed for file formats that is external to the file but can be linked to the
`file so that file signatures do not need to be checked every time a file is accessed. File name
`extensions and metadata stored in the operating system are two approaches that are used. MS-
`DOS and Windows file names use a file name extension to distinguish different file types.
`However, file extensions alone are often not enough to discriminate file types. For instance, file
`extensions such as DOC are ambiguous, since there are several applications that create files with
`that extension but have different file formats. Furthermore, there are WordPerfect document
`files that do not have the .DOC extension recommended by the WordPerfect manual. Instead, the
`document creator avails himself of the filename plus the filename extension to create a longer
`mnemonic filename. These extended names sometimes result in an extension used for another
`file type. For instance, SPEECH.COM is a user-created WordPerfect document file discovered
`in the Bush Presidential e-record collection that contains a speech to the Commonwealth Club.
However, the .COM extension is also used to represent an MS-DOS compressed executable file.
`
`An alternative way of identifying file formats was developed by Apple Computer for the
`Macintosh OS Hierarchical File System. Each program installed has an associated creator code.
`This is a 3-4 letter code that tells the MacOS Finder which program created a file in the file
`system. Each time an application writes a file to the file system, a creator code as well as a file
`type code are stored as part of the directory entry for the file. The file type code is also a 3-4
`letter code that tells the MacOS Finder the format of the file. The combination of creator and file
`type codes is referred to as an OSType. Both codes are needed because the same file type could
`be created by different applications.
`
A Uniform Type Identifier (UTI) is a string, introduced in the Mac OS X 10.4 operating system, for
uniquely identifying "typed" classes of items. For example, com.apple.pict is the UTI for the
Apple Quickdraw PICT format and com.adobe.pdf is the UTI for Adobe PDF. UTIs are used to
identify the types of files and folders, clipboard data, bundles, aliases, symbolic links and
streaming data. UTIs were developed by Apple to eliminate problems associated with inferring a
file's content from its file name extension, MIME type, or Mac OSType code [Apple 2007].
UTIs support multiple inheritance, allowing multimedia files to be identified not as a single type
(as in MIME), but as all the types a file contains.
`
Multipurpose Internet Mail Extensions (MIME) media types were created to configure mail
clients to view files that contain multiple formats. MIME was extended to configure browsers for
a similar purpose. Each MIME media type consists of a type and a subtype separated by a slash,
which uniquely identifies the application that created a file. When one person sends an email
message with an attachment, the MIME media type for the application that created the
attachment is included in the email message. The person receiving the email can read the
attachment if he or she has the MIME type associated with the application that created the file.
MIME media types are registered with the Internet Assigned Numbers Authority (IANA)
[RFC2048].
`
`The MIME standard initially defined seven media types [RFC2046] with Model being added in
`1997 [RFC2077].
`

- Text: Textual information that requires a graphical display device to display the text,
  e.g., plain text and html.
- Image: Graphical data that requires a graphical display or printing device to view the
  information, e.g., g3fax, gif or jpeg image formats.
- Audio: Audio data that requires an audio output device such as speakers, e.g., au and
  wav files.
- Video: Video data that requires a display device to display time-varying images possibly
  with color and coordinated sound, e.g., mpeg or mov files.
- Application: Data that does not fit into the other categories that is either (1) intended to
  be processed by an application program, e.g., PostScript, word processing or spreadsheet
  data, or (2) binary data that is not intended to be interpreted and displayed but just
  transferred. The subtype of application will often be the name or include part of the name
  of the application for which the data are intended. The octet-stream subtype is used to
  indicate that a body contains arbitrary binary data.
- Multipart: Data consisting of multiple entities of independent data types. Four subtypes
  were initially defined: the mixed subtype specifying a generic mixed set of parts,
  alternative for representing the same data in multiple formats, parallel for parts intended
  to be viewed simultaneously, and digest for multipart entities in which each part has a
  default type of message/rfc822.
- Message: Used for various types of messages.
- Model: A behavioral or physical representation within a given domain, e.g., mesh, iges
  and vrml.
`
`
There are three registration trees listed on the IANA application for a MIME media type:
Vendor, IETF and Personal. RFC 2048 states the following guidelines regarding the Vendor and
IETF trees.
`
`
`The vendor tree is used for media types associated with commercially available products.
`A registration may be placed in the vendor tree by anyone who has need to interchange
`files associated with the particular product. However, the registration formally belongs to
`the vendor or organization producing the software or file format.
`
`Registrations in the vendor tree will be distinguished by the leading facet vnd. That may
`be followed, at the discretion of the registration, by either a media type name from a well-
`known producer (e.g., vnd.microsoft) or by an IANA-approved designation of the
`producer's name which is then followed by a media type or product designation (e.g.,
`vnd.microsoft.excel).
`
`The IETF tree is intended for types of general interest to the Internet Community.
`Registration in the IETF tree requires approval by the IESG and publication of the media
`type registration as some form of RFC. Media types in the IETF tree are normally
`denoted by names that are not explicitly faceted, i.e., do not contain period characters.
`
`
The format of a MIME type is media type/subtype. Subtype names that are not registered with
IANA can be used experimentally by prefixing them with x- [RFC1521].

A set of parameters, specified in attribute=value notation, can follow the media type and
subtype names. The ordering of parameters is not significant.
`
A charset parameter is used to indicate the character set of the file for text subtypes. The
octet-stream subtype of type application is used to indicate that a body contains arbitrary binary
data. One of the optional parameters for this subtype is type, which is the general type or
category of binary data. This is intended as information for the human recipient rather than for
any automatic processing. A codecs parameter is used for audio and video media types to
indicate the coder-decoder for encoding analog signals to digital and decoding digital to analog
signals [RFC4281, RFC5334].
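
The type/subtype and attribute=value structure described above can be illustrated with a small parser. This is a minimal sketch, not part of the File Format Library: the function name parse_media_type is hypothetical, and the simple split on semicolons assumes that no parameter value contains a quoted semicolon.

```python
def parse_media_type(value):
    """Split 'type/subtype; attr=value; ...' into (type, subtype, parameters)."""
    head, *param_parts = [part.strip() for part in value.split(";")]
    media_type, _, subtype = head.partition("/")
    params = {}
    for part in param_parts:
        if not part:
            continue  # tolerate a trailing semicolon
        name, _, val = part.partition("=")
        # Parameter names are case-insensitive; drop optional surrounding quotes.
        params[name.strip().lower()] = val.strip().strip('"')
    return media_type.lower(), subtype.lower(), params


print(parse_media_type("text/plain; charset=UTF-8"))
print(parse_media_type("application/octet-stream; type=win32-exe"))
```

Because the parameters are collected in a dictionary, their order in the input does not affect the result, which matches the statement that parameter ordering is not significant.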
`
`The PRONOM Persistent Universal Identifier (PUID) is an extensible scheme of persistent,
`unique and unambiguous identifiers for file formats in the PRONOM registry [Brown 2006b].
`PRONOM, operated by The National Archives of the UK, was the first and remains, to date, the
`only operational public file format registry in the world. The PUID for file formats is of the form
`fmt/identifier where identifier is a sequence of digits or lowercase letters.
`
A PUID of the type x-fmt can be assigned to formats that have not yet been assigned an fmt
identifier. PUID types prefixed by x- are used to provide temporary, private or experimental
identifiers for that type. These may be used, for example, in the File Format Library as PUIDs
for file formats that have not yet been formally assigned an fmt identifier by PRONOM. However,
the PRONOM registry has 500 or so PUIDs of the type x-fmt. Those file formats are primarily
file formats for which there are no internal file signature tests.
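
The PUID forms described above can be checked mechanically. The following is a minimal sketch under the assumption that a PUID is exactly fmt/ or x-fmt/ followed by one or more digits or lowercase letters; the names PUID_PATTERN, is_puid and is_provisional_puid are hypothetical and are not taken from PRONOM.

```python
import re

# Assumed shape: optional "x-" prefix, "fmt/", then digits or lowercase letters.
PUID_PATTERN = re.compile(r"(x-)?fmt/[0-9a-z]+")


def is_puid(text):
    """Return True if text has the assumed PUID shape."""
    return PUID_PATTERN.fullmatch(text) is not None


def is_provisional_puid(text):
    """Return True for identifiers of the x-fmt type."""
    return is_puid(text) and text.startswith("x-")


print(is_puid("x-fmt/91"), is_provisional_puid("x-fmt/91"))
```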
`
MIME types and Uniform Type Identifiers indicate type-subtype relations and thus are more
descriptive than PUIDs. In cases where a human must assign or interpret a file format identifier, a
descriptive identifier such as application/postscript; version=1.0 is preferable to the PUID
x-fmt/91 for the same file type.
`
`
`3. The File Command and Magic File
`
`
`3.1 The File Command
`
`File is a UNIX program for determining the file type of a file [OpenBSD 2009]. The original
`version of the UNIX file command originated in Unix Research Version 4 in 1973. All checks
`for file type were internal to the file program. The current version of the file command derives
`from a version created by Ian Darwin circa 1986-87 and first introduced in Unix System V. The
`most significant change in that version was to specify the tests for file types in a file external to
`the file program. There have been many contributors to the evolution of the file command. Since
`1990, the primary developer and maintainer of the file command has been Christos Zoulas
`[OpenBSD 2009].
`
`In outline, the file command’s procedure for determining file types is:
`
`
1. Check for the empty file or file system files (socket, symbolic link, named pipe (FIFO)).
2. Use tests indicated in an external magic file to check for file types in particular formats
   that have invariant data at some location in the file.
3. Check if the file is a text file and if so, indicate the character set (ASCII, ISO-8859-x,
   ISO 8-bit ASCII, UTF-8, UTF-16, EBCDIC).
   a. Check for the language of a text file (e.g., troff(1), C-program)
   b. Check for tar(1) files
   c. Check for EMX application type
   d. Check for Compound Document Files (CDF)
   e. Check for elf file
4. If none of the checks above succeeds, the file type is indicated as data.
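
The outline above can be sketched as a cascade of checks. This is a simplified illustration and not the file command's actual implementation: the function classify and the (offset, signature, description) tuple format are hypothetical, only the first 4096 bytes are examined, the text check covers ASCII and UTF-8 only, and the language, tar, EMX, CDF and elf checks are omitted.

```python
import codecs
import os
import stat


def classify(path, magic_tests):
    """Return a coarse file type description for path.

    magic_tests is a hypothetical list of (offset, signature, description)
    tuples standing in for the tests of a real magic file.
    """
    info = os.lstat(path)
    # Step 1: file system objects and the empty file.
    if stat.S_ISLNK(info.st_mode):
        return "symbolic link"
    if stat.S_ISFIFO(info.st_mode):
        return "fifo (named pipe)"
    if stat.S_ISSOCK(info.st_mode):
        return "socket"
    if not stat.S_ISREG(info.st_mode):
        return "special file"
    if info.st_size == 0:
        return "empty"
    with open(path, "rb") as handle:
        head = handle.read(4096)
    # Step 2: signature tests, tried in the order given.
    for offset, signature, description in magic_tests:
        if head[offset:offset + len(signature)] == signature:
            return description
    # Step 3: simplified text check (no NUL bytes; ASCII, then UTF-8).
    if b"\x00" not in head:
        if head.isascii():
            return "ASCII text"
        try:
            # final=False tolerates a multi-byte character cut off at the read limit.
            codecs.getincrementaldecoder("utf-8")().decode(head, final=False)
            return "UTF-8 text"
        except UnicodeDecodeError:
            pass
    # Step 4: nothing matched.
    return "data"
```

A call such as classify("sample.zoo", [(20, b"\xdc\xa7\xc4\xfd", "Zoo Archive")]) would apply a single signature test before falling back to the text and data checks; the file name is illustrative only.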
`
`
`The file command is probably the most widely used tool for identifying file types. Versions of
`the file command have been ported to the Windows operating system.
`
`
`
`3.2 Magic File
`
`In the UNIX operating system, the system and various application programs distinguish among
`executable files by checking for a so-called magic number at the beginning of a file. Many other
file types, including file formats used in other operating systems, now have a magic number
`somewhere in the file format.
`
`
The file(1) command identifies the file type of a file using, among other tests, a test of whether a
`file matches certain patterns specified in a magic file. These patterns, referred to as magic
`numbers, are file signatures. The magic file for version 4.21 of the file command contained tests
`for approximately 2000 file types.
`
The magic file is an ASCII text file. The criteria for identifying file types are represented by a
series of one or more tests. A test is specified in a line with four fields [OpenBSD 2009b]:

1. The offset from the beginning of the file at which to begin the test. The initial location of
   a file is byte 0.
2. The type of data to match, e.g., byte, short, long, string, regex, search.
3. The value (or pattern) to be compared with the value from the file.
   a. If the type of data to match is regex, the pattern is in extended POSIX regular
      expression syntax and is tested against line n+1, where n is the given offset.
   b. If the type of data to match is search, the value is a string to search for beginning
      at the given offset.
4. The message to be printed if the comparison succeeds. If this field contains a printf(3)
   format specification, the value from the file is printed according to this specification. The
   fourth field may be empty, if additional tests are needed to identify the file type.
`
`
`Right angle brackets (>) are used to indicate additional tests that are necessary to identify a file
`type. A line beginning with a numerical constant is considered to be a level 0 test. The number
`of right angle brackets preceding a numerical constant indicates the level of the test. If a test at
`level n succeeds, all tests at level n+1 are performed, and the messages printed if the tests
`succeed, until a test with level n (or less) occurs.
`
`Blank lines are allowed in the magic file, but ignored. Lines beginning with the number sign (#)
`are comment lines.
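
The line structure described above (level markers, four fields, blank lines and comments) can be illustrated with a small parser. This is a sketch under stated assumptions, not the parser used by the file command: the names MagicTest and parse_magic_line are hypothetical, fields are assumed to be separated by whitespace that is not preceded by a backslash, and the fields are kept as uninterpreted strings.

```python
import re
from typing import NamedTuple

# Fields are separated by whitespace that is not escaped with a backslash.
_FIELD_SEPARATOR = re.compile(r"(?<!\\)\s+")


class MagicTest(NamedTuple):
    level: int      # number of leading '>' characters
    offset: str
    data_type: str
    value: str
    message: str    # empty when further tests are needed


def parse_magic_line(line):
    """Parse one magic line; return None for blank and comment lines."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return None
    level = len(stripped) - len(stripped.lstrip(">"))
    fields = _FIELD_SEPARATOR.split(stripped[level:].lstrip(), maxsplit=3)
    if len(fields) < 3:
        raise ValueError(f"expected at least three fields: {line!r}")
    message = fields[3] if len(fields) == 4 else ""
    return MagicTest(level, fields[0], fields[1], fields[2], message)


print(parse_magic_line("20 lelong 0xFDC4A7DC Zoo Archive"))
print(parse_magic_line(">>>8 byte !1"))
```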
`
Example: Identification with a single test

    20    lelong    0xFDC4A7DC    Zoo Archive
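
To make the semantics of this test concrete, the following sketch reads a little-endian 4-byte integer at offset 20 and compares it with the expected value. The function name is_zoo_archive is hypothetical, and the sample bytes are synthetic data constructed for the illustration, not a real archive.

```python
import struct

ZOO_OFFSET = 20          # offset field of the magic test
ZOO_MAGIC = 0xFDC4A7DC   # value field of the magic test


def is_zoo_archive(data: bytes) -> bool:
    """Apply the single lelong test to an in-memory copy of the file."""
    if len(data) < ZOO_OFFSET + 4:
        return False
    # "<L" reads an unsigned 32-bit integer in little-endian byte order.
    (value,) = struct.unpack_from("<L", data, ZOO_OFFSET)
    return value == ZOO_MAGIC


# Synthetic sample: 20 bytes of header text and padding, then the magic value.
sample = b"ZOO 2.10 Archive.\x1a\x00\x00" + struct.pack("<L", ZOO_MAGIC)
print(is_zoo_archive(sample))
```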
`
`
Example: Identification with a series of tests

    0          leshort    0xEA60
    >7         byte       <0x0A
    >>8        byte       <100
    >>>8       byte       !1
    >>>>8      byte       !2
    >>>>>10    byte       2         ARJ Archive
`
Example: Regular expression used for identifying XyWrite Document

    0     byte     0xAE
    >0    regex    (\xAE[A-Z0-9,.]+\xAF)+    XyWrite - Note Bene Document

Example: Identification using the search operator

    0     string        PK\003\004
    >4    search/256    META-INF/MANIFEST.MF    JAVA Archive
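
The search test can be illustrated as a bounded substring search. This is a minimal sketch, assuming that the range 256 limits the positions at which a match may begin; the exact range semantics should be checked against the magic(5) documentation for the version of file in use. The function name looks_like_jar is hypothetical.

```python
def looks_like_jar(data: bytes, search_range: int = 256) -> bool:
    """Apply the two-level test: Zip local header, then a bounded search."""
    # Level 0: string test at offset 0 (PK\003\004 in octal escapes).
    if not data.startswith(b"PK\x03\x04"):
        return False
    # Level 1: search from offset 4; a match may begin at any of the next
    # search_range positions (an assumption about the range semantics).
    needle = b"META-INF/MANIFEST.MF"
    start = 4
    end = start + search_range + len(needle) - 1
    return data.find(needle, start, end) != -1


sample = b"PK\x03\x04" + b"\x00" * 26 + b"META-INF/MANIFEST.MF"
print(looks_like_jar(sample))
```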
`
`
`Sometimes the location of invariant data that is needed to identify a file type is not at a constant
`offset, but is pointed to by a value at a constant location. To locate such information, the magic
`file provides indirect offsets. “If the first character following the last > is a ( then the string
`following the ( is considered an indirect offset. That means that the number after the parenthesis
`is used as an offset in the file. The value at that offset is read, and is used again as an offset in the
`file.” [OpenBSD 2009b]
`
Example: Identification using indirect offsets

    0               string    MZ
    >0x1E           string    PKLITE
    >>0x24E         string    PKSFX\ for\ Windows    Zip Self-extracting Archive
    >(4.s*512)      long      x
    >>&(2.s-517)    byte      x
    >>>&0           string    PK\3\4                 Zip Self-extracting Archive
`
`
`
`3.3 Limitations of File and Magic
`
Each new version of the file(1) man page contains a section titled BUGS. In version 5.03, there
are no bugs listed, but the first comment is:
`
`
`There must be a better way to automate the construction of the magic file from all the
`glop in Magdir. What is it?
`
`
`This comment refers to the fact that sections of the magic file are stored in 112 folders in the
`Magic directory that have folder names indicating a kind of file format, for example, animation,
apple, archive, audio, etc. The folders are ordered alphabetically. Each folder contains a single
`text file with tests for a number of file types. This supposedly facilitates locating tests that need
`to be modified or where new tests should be placed. Placing the file signature tests in a database
`table is a better approach to managing this information. The magic file could then be generated
`from the ordered tests in the table.
`
`The placing of tests in this file structure is also complicated by the fact that some file signature
`tests must precede other tests and these precedence relationships are only indicated by comments
`in the sections of the magic file. Creating a precedence relationship for the tests in a database
`table is one approach to solving this problem.
`
`
A sequence of magic tests is used to identify versions of a file format or even different file
formats. Editing of file format signatures would be easier if there were a one-to-one
correspondence of file signature tests to file formats.
`
`For many file types, the current magic file extracts and outputs more metadata than is necessary
`to identify the file type. It is an important feature to be able to extract this metadata, but the two
`functions would be better supported by distinct magic files for file type identification and for file
`type identification with metadata extraction.
`
`The tests for the language of a text file are keyword based, unreliable and embedded in C-
`program code that is not as easily modified as the magic tests. The tests and output descriptions
`for tar(1) files, EMX application types, CDF files and elf files are also embedded in C-programs
`and not easy to modify.
`
`Some file formats only have invariant data that could be used as a file signature at the end of a
`file. There needs to be a capability to test for invariant data at an offset from the end-of-file.
`
`Many software manufacturers consider the specifications for the file formats of their software
`applications to be proprietary and do not publish them. For example, IBM did not publish the file
`format specifications for IBM's DisplayWrite 4 documents. Consequently, many of the criteria
`for identifying file types that are in the magic file are created by individuals who have analyzed
`examples of the files of a particular file type. Many of these tests are inadequate to correctly
`identify the file type. This can be addressed by obtaining more file format specifications, analysis
`of additional examples, and extensive testing with samples of file types.
`
There are file formats that occur in digital archives for which there are no file signature tests in
`the magic file. This is largely because the file command was developed for use on UNIX
`platforms, but many of the file formats accessioned into government archives were created on
`DOS and Windows platforms.
`
`
`
`4. A File Format Library
`
`
`In this section, extensions of the file command and magic file are described that incorporate
`features that overcome the limitations discussed in the previous section. Primary among these is
`the development of a database management application for managing information about file
`formats including file signature criteria that can be used to identify file formats. This database
`application is referred to as the GTRI File Format Library. The library is also a repository for file
`format specifications, software for viewing/playing files, extracting files from archive files,
`recovering passwords, and repairing damaged files. It also contains sample files.
`
`
The File Format Library was created using Java and the MySQL RDBMS [Sun 2009]. Figure 1
`shows the user interface to the File Format Library.
`
`
`
`
`Figure 1. General Tab of the File Format Library.
`
`
`
`
`The File Format ID is an internal sequential integer that uniquely identifies a file format in the
`library. The Format Name is the name assigned to the format by the creator or the name by
`which it is commonly known. There are difficulties in naming a file format if the creator did not
`assign a name to the format or the format is commonly known by more than one name. There is
`no standard for naming file formats. If there are versions of the format, the format name should
`also include a version number.
`
`The value of MIME Type is the registered MIME Type for this file format (or the application
`creating the file format), if one is registered with IANA. If the file format does not have a
`registered MIME type, a MIME type is created for the format in the form media-type/x-
`application_name where application_name is the name of the product creating the file format.
`For example, if one were creating a MIME type for an application called Smart Calendar, it
`would be defined as application/x-smartcalendar.
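
The naming rule above can be sketched as follows. The function name provisional_mime_type is hypothetical, and the normalization (lowercasing and removing every character other than letters and digits) is an assumption chosen so that the Smart Calendar example is reproduced; the report does not state how application names are normalized.

```python
import re


def provisional_mime_type(application_name, media_type="application"):
    """Build an unregistered MIME type of the form media-type/x-application_name."""
    # Assumed normalization: lowercase, keep only letters and digits.
    slug = re.sub(r"[^a-z0-9]+", "", application_name.lower())
    if not slug:
        raise ValueError("application name has no usable characters")
    return f"{media_type}/x-{slug}"


print(provisional_mime_type("Smart Calendar"))  # application/x-smartcalendar
```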
`
`
`MIME types have parameters such as charset, type and codecs. Additional parameters have been
`added to the MIME types in the File Format Library to indicate the version of the file format (1,
1.1, iv) and encoding of files (encrypted, compressed, and rle). For Windows bitmaps
`(image/bmp), the parameter colors (16, 16bit, 24bit, 256, 32bit) has been added.
`
`For executable programs, for example a Windows 32-bit executable, the MIME type includes a
`type parameter, for example,
`
`
`application/octet-stream; type=win32-exe
`
`
`The value of extensions is a list of filename extensions commonly used for files in this file
format. The value of this attribute can be null.
`
`In the GTRI File Format Library, if there is a PRONOM fmt or x-fmt identifier for the file
`format, it has been entered in the PUID field. If there is no PRONOM PUID, the PUID attribute
`value is null. To have unique identifiers for all file formats, formats without PUIDs could be
assigned an identifier such as x-fmt/gtrinumeric_identifier.
`
`The value for Precedes is a list of File Format Ids. The interpretation is that the file signature
`(magic number) tests for the current File Format must precede the tests for the file formats whose
`ids are the values of Precedes. Precedence relationships are necessary because some tests for file
`formats must be performed before others, or the file format will be incorrectly identified. For
`example, the OpenDocument Text format must be recognized before the Zip file format, because
`the former is a special case of the latter.
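
One way to turn Precedes lists into an order for the tests in a generated magic file is a topological sort. The sketch below uses Python's standard graphlib module (Python 3.9 or later) and is not the Library's own implementation, which is written in Java; the function name signature_test_order and the format IDs are hypothetical.

```python
from graphlib import TopologicalSorter


def signature_test_order(precedes):
    """Order format IDs so that each format comes before those it must precede.

    precedes maps a File Format ID to the list of IDs named in its Precedes
    attribute.
    """
    sorter = TopologicalSorter()
    for format_id, later_ids in precedes.items():
        sorter.add(format_id)
        for later_id in later_ids:
            # format_id is a predecessor of later_id, so it is emitted first.
            sorter.add(later_id, format_id)
    return list(sorter.static_order())


# Hypothetical IDs: 100 = Zip, 101 = OpenDocument Text, 102 = JAVA Archive.
PRECEDES = {101: [100], 102: [100], 100: []}
print(signature_test_order(PRECEDES))
```

If the Precedes lists contain a cycle, static_order raises graphlib.CycleError, which would flag inconsistent precedence data before a magic file is generated.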
`
`In the future, the attributes of a file format might be extended to include:
`
`Platform (OS, hardware)
`Digital Object Class (e.g., 3D model, image, video, audio)
`Description (History, relationship to other formats, etc.)
`Release date/supported until date
`MacOS Creator/Type codes
`MacOS X Uniform Type Identifier (UTI)
`
`
`
`The magic file released with the file command often uses a sequence of magic statements to
`characterize multiple file formats and in addition extracts additional metadata for a file type. The
`file signature tests that have been created for the File Format Library contain criteria for
`individual file formats and do not extract additional metadata. In the future, an attribute could be
`added to file format whose values were tests for technical metadata as well as identification of
`file type.
`
`File signature tests have been created for more than 800 file formats. Figure 2 shows the magic
`tests characterizing the OpenDocument Text file format.
`
`
`
`Figure 2. Sig