`
`ACTICE 8 EXPERIENCE
`ME12, No.12
`DECEMBER 1982
`
`EDITORS
`
`DAVID BARRON
`
`CHARLES LANG
`
`DAVID HANSON
`
`§«2Ang Q
`
`18 1982
`
`Q
`"BusH\‘°
`
`JOHN WILEY & SONS
`Chichester - New York - Brisbane - Toronto - Singapore
`A Wile-v—lnterscience Publication
`SPEXBL12(12) 1085-1173
`4
`ISSN 0O38—O64
`
`EMCVMW 1017
`EMCVMW 1017
`
`
`
`SHORT COMMUNICATIONS
`
`1165
`
`‘FINGERPRINTING’~A
`TECHNIQUE FOR FILE
`IDENTIFICATION AND
`MAINTENANCE
`
`D. R. MCGREGOR AND
`
`A. MARIANI
`
`Department of Computer Science, University of
`Strathclyde, Glasgow, Scotland
`
`This short note describes a technique, which we
`have called ‘fingerprinting’,
`to produce a quasi-
`unique identifier for a file, derived from that file’s
`contents. This allows the identification of identical
`files with different names and the notification of
`any changes to a known file.
`KEY \V'ORDS
`Sum check
`File identification
`Information content Documentation
`Programme support environment
`Source code control
`
`One problem associated with current operating
`systems is the easy proliferation of multiple copies
`of programs and data. While this is not only
`wasteful of space and other resources, it can cause
`problems when associated documentation has been
`connected with a program by the program’s file
`name.
`
`The ‘fingerprinting’ technique described in this
`short note outlines a solution whereby information
`about a file is connected to that file by means of the
`file’s contents, as opposed to its name.
`The idea is to provide an identifying feature for
`every file, which is intrinsically distinctive, and
`analogous (hopefully) to a human’s fingerprint. In
`our current implementation, a fingerprint is a 32-
`bit number,
`formed simply by performing a
`double check-sum on the program file, and provid-
`ing a unique identifier for the file.
`A fingerprint is calculated as follows*given a
`file with N characters, the double checksum (s2) is
`formed by the algorithm-
`
`s1:=0;s2:=O;
`for 1':=1toNdo
`begin
`s1 := s1+c[1];
`s2 := $2 + 31
`end
`
`‘ideal’ value (1 in 232), but the simplicity of the
`algorithm makes for a simple and very fast fin-
`gerprinting program.
`as
`Hence, we are using the file’s content,
`opposed to the file’s name as in conventional
`systems, to identify the file. In a simple fingerprint
`(FP) system,
`for example, we can maintain a
`database of FPS against associated documentation.
`Therefore, when the user comes across an un-
`documented file he can try to match its FP against
`one in the database in an attempt to obtain any
`information regarding that file. To build up the FP
`database, users are free to add new files to the
`system, as long as they are willing to enter appro-
`priate information regarding that file.
`Systems can naturally range upwards in sophist-
`ication from the simple one described above. In the
`case of program files, to ensure correct identificat-
`ion, the FP database could also hold test data and
`matching output for the program (for example, a
`Pascal compiler could have a Pascal program as
`data and the corresponding compiled listing as
`output) which can be checked with the output of
`the unknown program (by fingerprints again). In
`the event of any difference, a more sophisticated
`comparison of the programs could be called into
`play.
`The fingerprinting technique, as outlined this
`far,
`is best used statically~i.e. with programs
`which are rarely expected to change. The FP
`database can be extended to handle different ver-
`sions of a program, while programs (or program
`versions) which are no longer supported can have
`the updated version’s FP overwrite the old FP.
`Static fingerprints can be used to detect any file
`corruptions. One use FP was put to was as an
`assurance of correct software transportation. Every
`file transported was fingerprinted at our installa-
`tion, and a list of FPS plus a copy of the finger-
`printer
`sent with the software. The finger-
`printer was then employed at the destination to
`confirm the
`safe
`transportation of
`the
`files
`involved.
`The main use of the FP technique has been in
`conjunction with the AutoProg system,’ which
`essentially (for the purposes of this paper) main-
`tains a library of user source modules accessible by
`all users but maintained by their owners within
`ther own file store. In this case, the files can be
`expected to change, and the fact that a file has been
`altered can be
`exposed by occasionally re-
`fingerprinting all the modules and comparing these
`dynamic FPS against the FPS as recorded in the
`AutoProg database.
`If modifications are detected, users of the af-
`fected modules can be notified and re-compilation
`
`where s1 and s2 are 16-bit integers (overflow is
`ignored) and c[f] is the ith character of the file.
`These 2 16-bit numbers are then stored as a 32-
`bit fingerprint. From an information theoretical
`standpoint this algorithm does not generate the
`best possible checksum. The chance of a clash in
`fingerprints is therefore somewhat higher than the
`Received 26 fuly 1982 Revised 21 September 1982
`
`
`
`1166
`
`BOOK REVIEWS
`
`of affected programs can be automatically gen-
`erated. Similarly, FPs can be extended to detect
`changes in the associated documentation files.
`In this short note we have presented a very
`simple technique for connecting information about
`a file with that file’s content. This method shares
`the drawback of associating documentation with a
`file name, as both a file’s name and contents can be
`transient. However, the FP technique can be use-
`fully applied in the diametrically opposing situa-
`
`tions where files are relatively static or where
`changes to files are required to be trapped, such as
`the AutoProg system or as a detection mechanism
`for file corruption.
`
`REFERENCES
`
`l. D. R. N-IcGregor and J. A. Mariani, ‘AutoProg—a
`software development and maintenance system’,
`IUCC Bulletin, Summer 1981.
`
`
`
`Book Reviews
`
`INTRODUCTORY ALGOL 68 PROGRAMMING, D. F.
`Brailsford and A. N. VValker, Ellis Horwood,
`Chichester. No. of pages: 281. Price:£l4-.00 (hard-
`back), £595 (paperback).
`
`Algol 68 is not an easy language to teach. This
`text on Algol 68 is
`intended both for under-
`graduates attending their
`first course in pro-
`gramming and for computer programmers and
`system analysts in industry, research and com-
`merce. I am afraid that the book’s approach is such
`that most beginners will soon be put off pro-
`gramming. Even for programmers conversant with
`Fortran, Basic or Cobol, a gentler introduction to
`some of the concepts would be helpful.
`For a first course in programming I believe it is
`necessary to have an approach led by a need to
`write programs in order to solve problems rather
`than a breadth-first approach through the elements
`of a programming language. One requires a gentle
`lead-in to the way that objects are represented in
`programs rather than know all about objects, then
`find out how programs are written and finally find
`out how objects are manipulated within programs.
`In another context, there is the phrase Algorithms
`+ Data Structures = Programs where the operator
`+ evaluates its operands concurrently.
`In this
`book the right hand operand of + is evaluated
`first, then the left hand operand and finally the +
`is done in order to yield some programs.
`Thus, chapter 1
`is about objects. It introduces
`the modes, real, int, bool and char, row modes
`and structured modes. The chapter also mentions
`heap generators and dope vectors and has a very
`heavy dose of refs including:
`ref ref int chain = heap ref int
`:=heap int :=123
`
`In my opinion this chapter contains too many new
`ideas.
`
`For example, consider the very first coverage of
`arrays. There is one paragraph explaining in gen-
`eral terms that it may be useful to regard a group of
`objects as forming a composite object and that
`Algol 68 has row modes and structured modes in
`order to do this. The next paragraph contains an
`example of an object of mode []real and intro-
`duces the notation [,]real and [, ,]real and the
`final paragraph explains how a [, ,]real is repre-
`sented inside a computer. I feel that most of this
`material should have been left until a later chapter.
`Chapter 2 is called ‘Program Structure’. It starts
`with sections on ‘primaries’, ‘expressions’, ‘unitary
`clauses’ and ‘serial clauses’. This is an important
`chapter becausevit contains some of the essentials
`such as assignments,
`expressions, conditional
`clauses and loops. However,
`there is too much
`theory: there is not enough demonstration of how
`these constructs are used and put together. For
`example, at
`the start of this chapter,
`the first
`Algol 68 program of the book is presented. It
`contains two heap declarations, two calls of read
`and a call ofprint. The second Algol 68 program of
`the book appears at the end of the section on ‘serial
`clauses’: it contains three heap declarations, a call
`of read, an obscure while loop and a call of print.
`In between the two programs, the fragments of
`Algol 68 that are given are not presented as fulfil-
`ling some need.
`Chapter 2 continues with a discussion of scopes
`and ranges. Until now heap generators have been
`used. However, after a discussion of how runtime
`storage is organized in terms of a stack and a stack
`pointer,
`the book points out
`that heap storage
`cannot be handled in this fashion and so it intro-
`duces
`loc. Abbreviated declarations are then
`covered.
`
`Chapter 3 takes some elementary problems and
`produces programs to solve them. The chapter
`starts by producing many programs to solve a