`
`G. D. Knott
`Division of Computer Research and Technology, National Institutes of Health,
Department of Health, Education, and Welfare, Bethesda, Maryland 20014, USA.
`
The object of this paper is to survey various hashing functions, to present a brief history of
hashing schemes and their development, and to give an exhaustive bibliography on hashing and
hash table storage and retrieval methods.
(Received January 1972)
`
1. Introduction
The storage and retrieval problem arises frequently in programming. Often we are given a collection of items presented in
arbitrary order, and we wish to store these items and upon
demand retrieve those items whose key-values match given
key-values. Moreover we may wish to delete previously-stored
items and include further new items from time to time.
One particular approach to the storage and retrieval problem
is the hash table method of storage and retrieval. The spirit of
such schemes is to use the key-value of an item to compute an
address for the storage or retrieval of that item. Key-values for
which the same address is computed are called synonyms. The
possibility of collisions due to the occurrence of synonymous
key-values is a major difficulty which can be overcome in
various ways.
Because the address calculation is generally a randomising
scrambling of the given key-value, the term ‘hashing’* has
become the name of this computation. The storage area used to
store items is known as a hash table and the various algorithms
for accessing hash tables and resolving collisions are known as
hash table storage and retrieval algorithms (or just hash
storage algorithms).
There are two major formulations of hash table storage and
retrieval algorithms, differing in the manner in which collisions
are resolved. One of these approaches is to establish a hash
table for the storage of items and to resolve collisions by somehow finding an unoccupied space for those items whose
natural home location (according to the hashing function being
used) is already full, in such a way that the item can be later
retrieved. Algorithms which use such schemes are called
open addressing algorithms. The other approach finesses the
problem of collisions by using indirect addressing (i.e. simple
list-processing techniques) to allow all items which collide to
maintain a claim to their home location. These methods of
handling collisions are commonly called chaining methods.
Often we think of an address as the address of a space which
can hold many items rather than just one. Such a storage space
is called a bucket. Even when the cells in the hash table can
hold only a single item, we shall allow all synonymous items to
maintain a logical claim to their common initially-addressed
bucket, and we shall say that such synonymous items reside
in the logical bucket associated with the initially-addressed cell.
In discussing any storage and retrieval method, many complicating factors exist, such as the possibility of variable length
items, retrieval on secondary keys, and the possibility of duplicate key-values in distinct items. We shall ignore these problems in most of the discussion below. It is often easy to see how
to accommodate special circumstances within the framework
of a particular basic algorithm. Finally, we shall generally
confuse an item with its key-value for notational clarity. Thus
we speak of an item x with key-value x.
This paper surveys various hashing functions. We then present
a brief history of hashing schemes and their development.
Finally, an exhaustive bibliography on hashing and hash table
storage and retrieval methods is given.
`
`
2. Hashing functions
In this section we shall discuss various hashing methods. A
hashing function is defined on a domain of values which
includes the possible key-values of the items to be processed.
The range of a hashing function is some given segment of
integers, say 0, 1, ..., n − 1.
For all hash table storage and retrieval schemes characterised
by the collision resolution methods mentioned before, a good
hashing function is one which minimises the expected number
of collisions.
`
2.1. Distribution-dependent hashing functions
Consider a collection, G, of k items with integer key-values
which are included in a universe of possible integers, U.
(Any set of key-values can be assumed to be integers under an
appropriate interpretation.) The key-values in G are distributed
according to some discrete empirical distribution function, FX,
where FX is the distribution of the random variable X defined
on G as X(x) = x; we wish to find a hashing function H with a
domain including U and range equal to {0, 1, ..., n − 1},
such that the random variable H(X) defined on G has a discrete
uniform distribution function on {0, ..., n − 1}.
In elementary terms we wish to find a hashing function H
which maps our k given items to the addresses 0, 1, ..., n − 1
such that any given address is not burdened with more than its
share of items; that is, so that the items are uniformly distributed over the address values. The reason for desiring a uniform
distribution of the hash-values is to minimise the number of
collisions which will occur.
`
Recall that FX(t) = P(X ≤ t), where P(C) denotes the
probability of C. We wish to find H such that P(H(X) = i) = 1/n
for 0 ≤ i ≤ n − 1. Consider the random variable FX(X). Note that
P(FX(X) ≤ r/k) = r/k for 0 < r ≤ k. (This is strictly true only
when there are no items in G with duplicate key-values. We
shall assume that this is the case in this discussion.) Thus
FX(X) has a discrete uniform distribution on

{1/k, 2/k, ..., (k − 1)/k, 1} ,

and so nFX(X) has a discrete uniform distribution on
{n/k, 2n/k, ..., n}. Therefore ⌈nFX(X)⌉ − 1 is approximately
uniform on {0, 1, ..., n − 1} (especially when n < k).
From the above we see that a good choice for H is

H(x) = ⌈nFX(x)⌉ − 1 .
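As a concrete sketch of this scheme (the function names and the clamping into {0, ..., n − 1} are illustrative choices, not from the paper), the empirical distribution function of a sample of key-values can stand in for FX:

```python
import math
from bisect import bisect_right

def make_cdf_hash(sample, n):
    """Return H(x) = ceil(n * F(x)) - 1, where F is the empirical
    distribution function of the sampled key-values."""
    keys = sorted(sample)
    k = len(keys)
    def h(x):
        f = bisect_right(keys, x) / k   # F(x): fraction of sampled keys <= x
        return min(max(math.ceil(n * f) - 1, 0), n - 1)
    return h

# With key-values uniform on 1..1000 this reduces to H(x) = ceil(nx/1000) - 1:
h = make_cdf_hash(range(1, 1001), 10)
assert [h(1), h(500), h(1000)] == [0, 4, 9]
```

Each address then receives about k/n of the sampled key-values, which is the uniformity the derivation above asks for.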
When FX is known and is easily computable, it can be used as
above to obtain a hashing function which is tailored to the
collection of items to be stored. When FX is itself a uniform
distribution on {1, 2, ..., h}, our hashing function becomes

H(x) = ⌈nx/h⌉ − 1 ,

where h is the largest key-value occurring in G.
`
`filo'siuurnoiprcho‘gnfruong;:duuwonpepeommoq
`
`
`
`Z102‘3zJeqmerNuoiscnf;,(q,
`
`*Hashing means to cut or chop; hash is a random jumble achieved by hashing; thus the term is appropriate.
`
`Volume 18 Number 3
`
`265
`
`
`
`
an appropriate function (possibly just a constant) to use as our
hash function. This approach is inspired by Marsaglia's
method of generating pseudo-random numbers which are
distributed in some given manner; see Knuth (1969). Methods
of this form which dynamically change the size of the hash
table as required are discussed in Knott (1971).
`
`
2.2. Cluster-separating hashing functions
A general consideration often mentioned is that a hashing
function for a collection of items, G, should destroy clusters in
G, meaning that if the key-values of two items are close to each
other in some sense, then their hashed values are not equal.
But no matter what hashing function we choose, there are
notions of what constitutes a cluster for which our hashing
function fails to achieve reasonable results. Some desirable
behaviour can be determined for a prospective hashing function
only if one knows something of the distribution of items in G.
For example, it is wise to ponder the nature of identifier
strings actually occurring in programs before choosing a
hashing function for a hash table storage and retrieval algorithm which is intended to store such strings.
Also, when handling character string keys one may pay
particular attention to the non-uniform nature of the binary
codes used for representing characters.
In any event, we see that there is no such thing as a general
purpose hashing function.
What we mean, therefore, when we speak of an acceptable
general hashing function is that we have a hashing function
which achieves a reasonably uniform distribution of hash-values, given certain general assumptions about the distribution
of the items to be processed.
One common assumption is that the key-values of our collection, G, of items is a random sample (without replacement)
from the set of all possible key-values, U. From this point of
view, a hashing function with range {0, 1, ..., n − 1} which
partitions the universe of possible key-values, U, as nearly as
possible into n equi-numerous sets of synonymous key-values
corresponding to the addresses 0, 1, ..., n − 1 is preferable to
one which partitions U into differently sized sets of synonymous
key-values. We shall call such a hashing function a balanced
hashing function. Ullman (1972) would call a hashing scheme
which uses a balanced hashing function 1-uniform.
A more local assumption is that key-values tend to occur in
arithmetic progression, i.e. if x is an occurring key-value, x + s
for some fixed s is relatively likely to occur also. Depending
upon the choice of s, this assumption may be subsumed by the
situation discussed below. A simple Markov chain model for
clustering which is similar to arithmetic progression clustering
is discussed by Toyoda, Tezuka, and Kashara (1966). Let us
suppose that a reasonable notion of the distance between two
key-values is the so-called Hamming metric (1950). Two key-values, x and y, considered to be t-tuples whose components
are symbols of some alphabet, are separated by a distance d
under the Hamming metric when x and y have different symbols in exactly d different component positions. Another
common clustering assumption is that if x is a key-value actually
occurring in some item, then key-values which are close to x
with respect to the Hamming metric are relatively likely to
actually occur in some item also.
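A minimal sketch of the Hamming metric for keys viewed as equal-length tuples of symbols:

```python
def hamming(x, y):
    """Hamming distance: the number of component positions in which
    two equal-length keys have different symbols."""
    assert len(x) == len(y)
    return sum(a != b for a, b in zip(x, y))

assert hamming("lexical", "lexicon") == 2   # the last two letters differ
```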
If we consider our key-values to be English words of t or
fewer letters for example, then the key-values are distributed in
a certain manner in the set of all t-component vectors of letters
(including blank). This distribution is such that our intuitive
notion of clusters is indeed approximated by using the Hamming
metric as a measure of closeness. If w is an English word, then
w changed in one or two positions is also likely to be an English
word, relative to the probability that w changed in one or two
positions is an English word given only that w is a random
string of letters.
`
`The Computer Journal
`
One may take h as the largest key-value in U if the largest
key-value occurring among the actual items is not known. This
particular address calculation function has often been used.
An extension to the case where FX can be approximated by a
piecewise linear function is discussed by Isaac and Singleton
(1956) and later by Kronmal and Tarter (1965). Polynomial
approximations have been tried by Simon and Guiho (1972).
Tarter and Kronmal have also considered how to make use of
the partial knowledge of FX which is gained by knowing just
some of the items being stored (Tarter and Kronmal, 1968).
Their scheme is very costly computationally, but it has interest
as an example, independent of its utility.
Consider a sample of key-values x_1, x_2, ..., x_m taken without
replacement from the file of items, G, being stored. Let all the
items in G have key-values in the range [a, b]. Then the
Tarter-Kronmal hashing function, H, which maps key-values
in [a, b] to {0, 1, ..., n − 1}, is given as

H(x) = ⌈nF_{m,t}(x)⌉ − 1
`
where

F_{m,t}(x) = (x − a)/(b − a) + ((b − a)/2π) Σ_{1≤i≤t} (1/i)[c_i sin(iπ(x − a)/(b − a)) − s_i cos(iπ(x − a)/(b − a))] ,

and where

c_i = (2/((b − a)m)) Σ_{1≤j≤m} cos[iπ(x_j − a)/(b − a)] ,

s_i = (2/((b − a)m)) Σ_{1≤j≤m} sin[iπ(x_j − a)/(b − a)] ,

and t = the least integer ≥ 1 such that

(b − a)²(c²_{t+1} + s²_{t+1}) < 2/(m + 1) .
`
F_{m,t} is an optimised t-term Fourier series approximation of the
empirical distribution function FX for the items in G. Let
F*(x) = P(X ≤ x | X ∈ {x_1, ..., x_m}). Thus F* is the empirical
distribution of the sample {x_1, ..., x_m} and as such is an
approximation to FX. Then F_{m,t}(x) = (x − a)/(b − a) + F*_t(x), where
F*_t is the t-term Fourier series approximation of

F*(x) − (x − a)/(b − a) .

The subtraction of the (x − a)/(b − a) term, which is subsequently
added back in, is a device to increase the rate of convergence of
the Fourier series for F*_t.
Tarter and Kronmal have also given a way to update the values
c_i, s_i, and t when more items are known, i.e. when m increases.
Thus F_{m,t} can be made to be a better and better approximation
to FX. The use of H as an adaptive function necessitates the use
of memory to store the previous states of H in so far as required for retrieval. In this sense, H is related to the methods
discussed in Knott (1971).
Another possible scheme to make use of partial knowledge of
FX is to use the key-value x to determine 'where' in the current
collection of already-stored items we are, and then to select
`
`
`
`
When the notion of cluster can be usefully defined in terms of
the Hamming metric, there are methods of constructing 'cluster-destroying' hashing functions which may be studied theoretically using the results of algebraic coding theory. This was
first suggested by Muroga (1961).
In particular let our key-values be considered to be vectors of
t symbols, padded if necessary, where the alphabet of usable
symbols is in 1-1 correspondence with elements of some finite
field, GF(q), of q elements. The common situations are to
consider the key-values as vectors of bits in correspondence
with GF(2), or to consider them to be vectors of binary-coded
characters (say b-bit characters) which are then in correspondence with GF(2^b).
The 1-1 correspondence of symbols to elements of a finite field
induces an addition and multiplication operation on the
appropriately extended alphabet of symbols in the obvious
way. Thus we shall speak of the sum or product of two symbols
`and understand that the symbol in correspondence with the
`sum or product of the corresponding field elements is intended.
`Henceforth, we shall speak of the symbol field in which the
`symbols are embedded rather than the symbol alphabet.
`Now that the set of symbols can be considered to be in a finite
`field, GF(q), for some q, we have that the set of all possible
`key-values may be considered as a t-dimensional vector space
`over GF(q).
Now let us further insist that our 'space of addresses' to be
computed be the set of all s-tuples over the symbol field. Such
an s-tuple is of course interpretable as an integer address in
base q notation, or directly as an integer using the string of bits
obtained by considering the concatenated binary representations of the various elements of GF(q).
Thus, for example, we can consider mapping 64-bit key-values
into 16-bit addresses by mapping 8-tuples over GF(2^8) into
2-tuples over GF(2^8). Such a mapping is of course a hashing
function.
`
But mappings of this form occur in coding theory. There, we
take a message of s symbols taken from GF(q) and encode that
message as a longer message of t symbols over GF(q), where
the extra symbols are used to provide redundancy or error
checking information such as parity checks. The t-symbol
coded message is then transmitted to some receiving station,
where the received message, now possibly in error, is decoded
to yield what is considered to be the original s-tuple. The
decoding process tries to use the redundancy of the received
message to detect and possibly correct any transmission errors
which may have occurred. Coding theory is concerned with
the design of coding and decoding schemes and the study of
their error detection and error correction properties.
Here we are considering mappings which arise in decoding
schemes, but we consider them only as compression functions
with certain cluster-destroying properties.
A hashing function, H, based on methods used in decoding
(actually in computing the so-called syndrome of a received
message) can be defined on our t-component keys considered
as column vectors by H(x) = Tx, where T is an s × t matrix
with entries in the symbol field.
`The following theorem applies:
`
Theorem:
(Sacks (1958)) Let T be an s × t matrix over a field F. Let x and
y be t-dimensional column vectors over F. Let D be the
Hamming metric. Then for 1 ≤ d ≤ s, 0 < D(x, y) ≤ d
implies Tx ≠ Ty, iff every set of d columns of T is linearly
independent over F.
`
Proof:
Suppose x ≠ y, and 0 < D(x, y) ≤ d, and yet that Tx = Ty.
Then T(x − y) = 0. Now x − y has at most d non-zero
components, and thus T(x − y) is a linear combination over F
of at most d columns of T, namely those which are selected
by the non-zero components of x − y. But if every d columns
of T are linearly independent, then in particular T(x − y) = 0
implies x − y = 0 and hence x = y. Therefore for x ≠ y we
have 0 < D(x, y) ≤ d implies Tx ≠ Ty, given that every d
columns of T are linearly independent.
Now, conversely, we can choose x and y such that D(x, y) = d
and such that T(x − y) is any desired linear combination of any
specified d columns of T. But then Tx ≠ Ty implies
T(x − y) ≠ 0, for any such choices of x and y, and hence no d
columns of T are linearly dependent. QED.
Let us call a hashing function, H, d-separating with respect to
the metric D if 0 < D(x, y) ≤ d implies H(x) ≠ H(y). We
then have that H(x) = Tx, where T is an s × t matrix as
defined above, is a d-separating hashing function with respect
to the Hamming metric if every d columns of T are linearly
independent.
There are constraints on the values d, t, and s. In particular we
have the so-called Varshamov-Gilbert bound (Peterson,
1961); an s × t matrix T can be found with elements in a
symbol field of q elements such that H(x) = Tx is d-separating whenever t, s, and d are such that:

q^s − 1 > Σ_{1≤i≤d−1} C(t − 1, i)(q − 1)^i ,

where C(t − 1, i) denotes the binomial coefficient.
`
Thus for q = 2^8 (8-bit symbols) and t = 8 and s = 2, we can
certainly achieve d = 2, whence no two 8-tuples (64-bit key-values) will map to the same 2-tuple (16-bit bucket address)
whenever they differ in fewer than 3 symbol positions.
Note also that for d = 1 and s ≥ 1, t can be arbitrarily large.
Thus two keys which differ in just one position will always be
mapped to different addresses. In particular, for q prime, we
can take H(x) to be the sum modulo q of the t symbols in the
key-value x, where each symbol is an integer in
{0, 1, ..., q − 1}. x can thus be considered as a sequence of t
consecutive ⌈lg q⌉-bit fields to be added together modulo q.
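As an illustrative sketch of this d = 1 case (the modulus and key below are arbitrary choices): changing a single symbol changes the sum by a non-zero amount smaller in magnitude than q, so the residue necessarily changes.

```python
def sum_mod_q(symbols, q):
    """H(x) = sum of the key's t symbols modulo q, each symbol being
    an integer in {0, ..., q - 1}."""
    return sum(symbols) % q

q = 31                      # a prime modulus, chosen arbitrarily here
x = [3, 14, 15, 9, 2, 6]
y = list(x)
y[2] = 27                   # y differs from x in exactly one position
assert sum_mod_q(x, q) != sum_mod_q(y, q)   # 1-separating behaviour
```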
To prove the above bound, we consider constructing a matrix,
T, of s-component columns with d fixed. Then by the previous
theorem we must choose the columns of T such that every
subset of d or fewer columns is independent. Suppose we have
chosen t − 1 columns such that no d of these columns are
linearly dependent. Now in the worst case, every non-zero
linear combination of no more than d − 1 columns already
chosen is a distinct vector which, of course, may not be chosen
to be column t. Counting the number of different non-zero
linear combinations of at most d − 1 existing columns, we
obtain the sum given above on the right-hand side. Now we will
certainly be able to choose another column if there are any
vectors among the q^s − 1 non-zero vectors which are not a
linear combination of at most d − 1 existing columns; and this
will surely be the case when the inequality above holds. Therefore when the inequality above holds, there is surely an s × t
matrix T which has every subset of d columns linearly independent and this proves the above statement.
There is a better way to define d-separating hashing functions
which map t-tuples to s-tuples than actually computing
H(x) = Tx for some appropriate matrix T. The basic idea is to
use an equivalent formulation which involves polynomial
division as opposed to matrix multiplication.
Let g(z) be a polynomial of degree s with coefficients in the
symbol field. Also, consider t-tuples x = (x_0, x_1, ..., x_{t−1}) to
be in correspondence with polynomials x(z) of degree t − 1
or less, where x(z) = x_0 + x_1 z + ... + x_{t−1} z^{t−1}. We shall
write x(z)* to denote the associated t-tuple x.
Now we can consider the hashing function

H(x) = (x(z) mod g(z))* ,
`
`
where the result is the s-tuple corresponding to the polynomial
x(z) mod g(z). Of course all arithmetic among coefficients is
done in the symbol field. Knuth (1969) gives a standard polynomial division algorithm which may be used to compute
H(x).
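A sketch of this polynomial formulation over GF(2), where coefficient arithmetic reduces to exclusive-or; the polynomial g(z) = 1 + z + z^3 and the key bits in the demonstration are arbitrary illustrative choices:

```python
def poly_mod_hash(x_bits, g_bits, s):
    """H(x) = (x(z) mod g(z))* over GF(2).  x_bits lists the key
    coefficients x_0..x_{t-1} (low degree first); g_bits lists the
    s+1 coefficients of g(z), whose degree-s coefficient must be 1."""
    r = list(x_bits)                          # remainder work area
    for i in range(len(r) - 1, s - 1, -1):    # cancel terms of degree >= s
        if r[i]:
            for j, gb in enumerate(g_bits):   # subtract z^(i-s) * g(z)
                r[i - s + j] ^= gb
    return tuple(r[:s])                       # the s-tuple remainder

# An 8-bit key reduced modulo g(z) = 1 + z + z^3 to a 3-bit address:
assert poly_mod_hash([1, 0, 1, 1, 0, 0, 1, 1], [1, 1, 0, 1], 3) == (0, 1, 0)
```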
The mapping x(z)* → (x(z) mod g(z))* is a linear transformation on the space of t-tuples over the symbol field into the
space of s-tuples over the symbol field. This follows since for
a and b in the symbol field, we have:

((ax(z) + by(z)) mod g(z))* = a(x(z) mod g(z))* + b(y(z) mod g(z))* .

But then there exists an s × t matrix T_g such that
T_g x = (x(z) mod g(z))* for every t-tuple x. Thus for some
matrices, T_g, T_g x can be computed as (x(z) mod g(z))*, and for
every polynomial g(z) of degree s there is an associated linear
transformation, T_g, which can be used to specify a hashing
function. It is particularly convenient to compute T_g x by
computing (x(z) mod g(z))*.
Now we may ask what properties of g(z) result in T_g having
every d columns independent for d as large as possible.
Let K(T_g) be the kernel of T_g, that is, the set of t-tuples, x,
such that T_g x = 0. Now if whenever x, y ∈ K(T_g) and x ≠ y we have
D(x, y) > d, then H(x) = (x(z) mod g(z))* = T_g x
is d-separating. To see this suppose that x, y ∈ K(T_g) and x ≠ y
implies D(x, y) > d. Now let u and v be t-tuples such that
u ≠ v and H(u) = H(v). Then D(u, v) = D(u − (u(z) mod
g(z))*, v − (v(z) mod g(z))*) > d since (u − (u(z) mod g(z))*) ∈
K(T_g) and (v − (v(z) mod g(z))*) ∈ K(T_g); and hence H is
d-separating as required.
Thus if we choose g(z) such that K(T_g) contains no vectors
x, y, x ≠ y, where D(x, y) ≤ d, then H(x) = (x(z) mod g(z))*
is d-separating. In coding theory terms, K(T_g) is called the
cyclic code generated by g(z) with minimum distance d + 1.
It is not hard to find appropriate polynomials g(z) with which
we may construct d-separating hashing functions, but some
knowledge of the theory of finite fields is required and we shall
forgo these constructions here (the appendix contains further
details).
One may also find polynomials which lead to hashing functions
which separate key-values which differ in bursts, that is, in
component positions which are within a certain distance of
each other. Of course there are various restrictions on the
values of t, s, and d, also related to circumstances which involve
finite field theory.
Schay and Raver (1963) may be consulted to learn how to
choose particular polynomials, g(z). Klinkhamer has also
provided an introduction to this subject.
`
2.3. Distribution-independent hashing functions
We now consider some classes of hashing functions which may
be applied without specific knowledge of the initial distribution
of items. From such a position of ignorance, one may turn to
'scrambling' methods inspired by methods of computing
pseudo-random numbers. Some of these methods in fact turn
out to be cluster-separating schemes also. We may consider a
key-value as a bit-string. Then some common elementary
hashing functions are listed below.
`
1. (extraction). H(x) = a k-bit value composed by extracting k
bits from x and concatenating them in some order. The range
is {0, ..., 2^k − 1}.

2. (multiplication). H(x) = a k-bit value extracted from the
value x^2 or cx for some constant c.

3. (exclusive-oring). H(x) = a k-bit value extracted from the
result of exclusive-oring various segments of x together.

4. (addition). H(x) = a k-bit value extracted from the result
of adding various segments of x together.

5. (division). H(x) = x mod n. The range is {0, ..., n − 1}.

6. (radix conversion). H(x) = a k-bit value extracted from the
result of treating x as a binary-coded sequence of base p
digits and converting this coded value into a similarly-coded base q value. The range is nominally {0, ..., 2^k − 1}.
`
From such operations as those just given and other simple
logical and arithmetic operations one can construct reasonable
hashing functions. For example, for character-string key-values,
a possible hashing function is to take the residue modulo n of
the concatenation of the first character, the last character and
the binary value representing the length of the given character
string.
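That character-string hashing function might be sketched as follows (the field widths used in the concatenation are assumptions made for illustration):

```python
def string_hash(key, n):
    """Residue modulo n of the concatenation of the first character,
    the last character, and the length of the key.  The 8-bit field
    widths are illustrative assumptions."""
    v = (ord(key[0]) << 16) | (ord(key[-1]) << 8) | (len(key) & 0xFF)
    return v % n

assert 0 <= string_hash("hashing", 101) < 101
```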
In general, one can convert long key-values to a size which is
more reasonable for computation by the extraction or exclusive-oring of subfields. When a suitably-sized number is obtained by
such compression it may be further transformed by multiplication, radix-conversion or division operations.
The commutativity of certain operations may be undesirable.
For example, suppose we choose to exclusive-or two component
fields, a and b, of the key-value; if there is a high probability
that when there is an item with a key-value such that a = u, b = v,
then there will be an item with a key-value such that a = v,
b = u, then the exclusive-oring of the component fields a and b
will lead to a relatively high incidence of collisions. This may
be the case when we are storing matrix elements, using their
indices as key-values. A solution is to use a non-symmetric
compression operation. For example, the first component
shifted circularly some number of places may be exclusive-ored
with the unshifted second component.
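A sketch of such a non-symmetric compression (the 16-bit word width and the shift of five places are arbitrary choices):

```python
def rotl(v, r, width=16):
    """Circularly shift a width-bit value v left by r places."""
    r %= width
    mask = (1 << width) - 1
    return ((v << r) | ((v & mask) >> (width - r))) & mask

def compress(a, b):
    """Exclusive-or of the rotated first component with the second:
    a non-symmetric compression, so key-values (u, v) and (v, u) do
    not collide systematically."""
    return rotl(a, 5) ^ b

assert (3 ^ 8) == (8 ^ 3)            # plain exclusive-or is commutative
assert compress(3, 8) != compress(8, 3)
```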
The multiplication method is sensitive to the distribution of
key-values, and/or the choice of a constant multiplier. The
occurrence of many 0 digits in the key-value or constant
multiplier results in a simple linear combination of shifted
forms of the key-value and/or constant multiplier. The digits
of the key-value may then be carried over to the hash-value in
a non-random manner, thus biasing the resulting hash-values.
This is not serious if we choose a good constant c. Floyd (1970)
has discovered an interesting multiplicative scheme which is
discussed below. The use of a middle segment of x^2 has a long
history, inspired no doubt by von Neumann (see Knuth, 1969),
but other methods may be safer.
The exclusive-or operation has the advantageous feature that
if the arguments have 0 or 1 bits occurring equi-probably and
independently in each position, then the resulting bit string
also has this property.
Addition schemes which extract other than the low-order
k bits of the sum are generally inferior to the use of shifting and
exclusive-oring methods in that the number of 0's and 1's
resulting do not occur equi-probably and independently even
though the input had this property; however, sums of k-bit
fields taken modulo 2^k are as good as exclusive-oring. In any
event, if the sub-fields involved are not independent, the
results of addition (or exclusive-oring) will be biased.
Lin (1963) suggests that radix conversion methods with the
initial and final radix values being relatively prime is a good
general purpose hashing function.
An attractive particular radix conversion hashing function
might be to treat a key-value, x, as a binary value and convert
it to a binary-coded base n value, thereupon choosing the
low-order digit (⌈lg n⌉ bits) as H(x); but this is just the same as
computing x mod n. More complex radix conversion is probably not worthwhile.
Simple division hashing functions seem to have acceptable
properties. With a suitable choice of modulus, clusters under
the Hamming metric are destroyed, and such hashing functions
are usually balanced. Toyoda (1966) has considered the
behaviour of a simple division hashing function for a set of items
whose keys are clustered according to a simple Markov chain
`
`
`
`
generating scheme. His model formalises the notion of arithmetic progression clusters, but his results are not directly useful
without further computation. Knuth (1973) has noted that n
should not have 2 or 3 as factors: if n is even, x mod n is even
whenever x is even, for example. The choice of a prime modulus is often
recommended, for then runs of key-values in arithmetic progression hash to different address values as much as possible
unless the step-size of the progression is a multiple of the
modulus. Buchholz (1963), Heising (1963), and Maurer (1968)
have discussed the division method. Buchholz points out that
primes of the form r^k + 1, where r is the radix of the key-values (generally a power of 2), may not be the best choice, for
then the hash-values are complex shifted sums of subfields of
the key-values, and if these subfields are not independent then
the comment concerning addition hashing functions applies.
The modulus need not necessarily be chosen to be prime, however, and this flexibility is one of the assets of the division
method.
Another is that the quotient as well as the remainder is generally available, and this may be put to good use.
When simple extraction alone, or some other 1-1 function on a
subfield of the key-value, can be used, one may obtain the
advantage that that subfield of the key-value of an item need
not be stored, since it can be computed from the result of the
hash function. Such fields can be dropped only when the collision resolution scheme being used permits it by maintaining
the distinction between buckets. This space-saving can be very
important in some circumstances, even at the cost of retrieval
time.
`
2.4. A multiplicative hashing function
Floyd's construction of a particular multiplicative hashing
function (1970) is given in this section.
Let x be as usual a non-negative integral key-value. Let c
be such that 0 ≤ c < 1. Now we may compute cx mod 1, i.e.
the fraction part of cx, to obtain a value in the unit interval
[0, 1). Now given such a value 0 ≤ y < 1 we may scale y to
obtain an address in {0, 1, ..., n − 1} by computing ⌊ny⌋.
Thus we will consider a hashing function H(x) of the form

H(x) = ⌊n(cx mod 1)⌋ .
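A sketch of a hashing function of this form; the constant below is the reciprocal of the golden ratio, a common choice for multiplicative hashing, and not the particular c that the derivation below arrives at:

```python
import math

def mult_hash(x, n, c=(math.sqrt(5) - 1) / 2):
    """H(x) = floor(n * (c*x mod 1)).  The reciprocal golden ratio is
    a placeholder constant, not Floyd's specific choice of c."""
    return int(n * ((c * x) % 1.0))

# Keys in arithmetic progression land on distinct, well-spread addresses:
addrs = [mult_hash(x, 10) for x in range(100, 105)]
assert len(set(addrs)) == 5 and all(0 <= a < 10 for a in addrs)
```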
The question to be resolved here is how to choose c.
Let us assume that key-values have a tendency to occur in
arithmetic progression, so that we may have key-values x,
x + s, x + 2s, .... In this case it is desirable that H(x),
H(x + s), H(x + 2s), ... be a sequence of distinct addresses
(in so far as this is possible). But this goal is achieved if the set
of values {cx mod 1, c(x + s) mod 1, ...} are 'well-spread' in
the unit interval, for then the addresses achieved by scaling
will be well-spread among {0, 1, ..., n − 1}.
Now let us further suppose that c = k/h where k and h are
integers. It is convenient to choose h such that s ≡ 1 mod h,
where the integer s is the arithmetic progression step-size. Then
for integral a and i we have:

c(x + as^i) mod 1 = (cx + (1/h)(kas^i mod h)) mod 1
                  = (cx + ac) mod 1 .

We have used the following facts:

(f/g) mod 1 = (1/g)(f mod g) ,

and

(f + g) mod 1 = (f + (g mod 1)) mod 1 ;

as well as the condition s^i ≡ 1 mod h. Thus for h and s such
that s ≡ 1 mod h we have, regardless of the values of i_1, i_2,
...,
`
`
the sequence of key-values x, x + s^{i_1}, x + s^{i_1} + s^{i_2}, ...
corresponds to the values cx mod 1, (cx + c) mod 1,
(cx + 2c) mod 1, ....
It is no loss of generality to assume x is such that cx mod 1 = 0,
since we here consider the unit interval to be circular (0 = 1)
in any event. Hence it suffices that for s ≡ 1 mod h we further
choose h and k, and thus determine c, such that the points 0,
c mod 1, 2c mod 1, 3c mod 1, ... are well-spread in the unit
interval.
We shall consider a set of points to be well-spread in the unit
interval when the ratio of the minimum of the lengths of the
sub-intervals defined by the given points placed in the unit
interval to the maximum sub-interval length is large. We denote
this ratio by D_c.
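The ratio D_c can be computed directly; a sketch, treating the unit interval as circular as in the text:

```python
def spread_ratio(points):
    """D_c: ratio of the minimum to the maximum sub-interval length
    determined by the points, with the unit interval taken circularly."""
    ps = sorted(points)
    gaps = [b - a for a, b in zip(ps, ps[1:])]
    gaps.append(ps[0] + 1.0 - ps[-1])   # wrap-around gap (0 = 1)
    return min(gaps) / max(gaps)

# For c the reciprocal golden ratio the ratio stays near 0.618:
c = 0.6180339887
pts = [(i * c) % 1.0 for i in range(5)]
assert abs(spread_ratio(pts) - 0.618) < 0.001
```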
The distribution of 0, c mod 1, 2c mod 1, ... has been studied
asymptotically by Zaremba (1966). We are interested in the
distribution in its early stages as well. Halton (1965) has results
of use here.
Now consider the equations:

1 = m_1 c_1 + c_2
c_1 = m_2 c_2 + c_3
c_2 = m_3 c_3 + c_4
. . .
c_{i−1} = m_i c_i + c_{i+1}          (0)

where each m_i is a non-negative integer and c_1 = c and
0 < c_i < 1 for i > 1.
To avoid complications let us suppose that c is irrational; then
the sequence c, c_2, c_3, ... is not eventually periodic and the
following analysis holds. Of course the actual value of c which
we will use will be a rational approximation, k/h, to some desired
irrational value, such that s ≡ 1 mod h.
By considering the process of successively placing the points
c mod 1, 2c mod 1, ... in the unit interval, and keeping in mind
the