Hashing functions
G. D. Knott
Division of Computer Research and Technology, National Institutes of Health,
Department of Health, Education, and Welfare, Bethesda, Maryland 20014, USA

The object of this paper is to survey various hashing functions, to present a brief history of hashing schemes and their development, and to give an exhaustive bibliography on hashing and hash table storage and retrieval methods.

(Received January 1972)
1. Introduction
The storage and retrieval problem arises frequently in programming. Often we are given a collection of items presented in arbitrary order, and we wish to store these items and upon demand retrieve those items whose key-values match given key-values. Moreover we may wish to delete previously-stored items and include further new items from time to time.
One particular approach to the storage and retrieval problem is the hash table method of storage and retrieval. The spirit of such schemes is to use the key-value of an item to compute an address for the storage or retrieval of that item. Key-values for which the same address is computed are called synonyms. The possibility of collisions due to the occurrence of synonymous key-values is a major difficulty which can be overcome in various ways.
Because the address calculation is generally a randomising scrambling of the given key-value, the term 'hashing'* has become the name of this computation. The storage area used to store items is known as a hash table and the various algorithms for accessing hash tables and resolving collisions are known as hash table storage and retrieval algorithms (or just hash storage algorithms).
There are two major formulations of hash table storage and retrieval algorithms, differing in the manner in which collisions are resolved. One of these approaches is to establish a hash table for the storage of items and to resolve collisions by somehow finding an unoccupied space for those items whose natural home location (according to the hashing function being used) is already full, in such a way that the item can be later retrieved. Algorithms which use such schemes are called open addressing algorithms. The other approach finesses the problem of collisions by using indirect addressing (i.e. simple list-processing techniques) to allow all items which collide to maintain a claim to their home location. These methods of handling collisions are commonly called chaining methods.
Often we think of an address as the address of a space which can hold many items rather than just one. Such a storage space is called a bucket. Even when the cells in the hash table can hold only a single item, we shall allow all synonymous items to maintain a logical claim to their common initially-addressed bucket, and we shall say that such synonymous items reside in the logical bucket associated with the initially-addressed cell.
In discussing any storage and retrieval method, many complicating factors exist such as the possibility of variable length items, retrieval on secondary keys, and the possibility of duplicate key-values in distinct items. We shall ignore these problems in most of the discussion below. It is often easy to see how to accommodate special circumstances within the framework of a particular basic algorithm. Finally we shall generally confuse an item with its key-value for notational clarity. Thus we speak of an item x with key-value x.
This paper surveys various hashing functions. We then present a brief history of hashing schemes and their development. Finally, an exhaustive bibliography on hashing and hash table storage and retrieval methods is given.
2. Hashing functions
In this section we shall discuss various hashing methods. A hashing function is defined on a domain of values which includes the possible key-values of the items to be processed. The range of a hashing function is some given segment of integers, say 0, 1, ..., n − 1.
For all hash table storage and retrieval schemes characterized by the collision resolution methods mentioned before, a good hashing function is one which minimises the expected number of collisions.
2.1. Distribution-dependent hashing functions
Consider a collection, G, of k items with integer key-values which are included in a universe of possible integers, U. (Any set of key-values can be assumed to be integers under an appropriate interpretation.) The key-values in G are distributed according to some discrete empirical distribution function, Fx, where Fx is the distribution of the random variable X defined on G as X(x) = x; we wish to find a hashing function H with a domain including U and range equal to {0, 1, ..., n − 1}, such that the random variable H(X) defined on G has a discrete uniform distribution function on {0, ..., n − 1}.
In elementary terms we wish to find a hashing function H which maps our k given items to the addresses 0, 1, ..., n − 1 such that any given address is not burdened with more than its share of items; that is, so that the items are uniformly distributed over the address values. The reason for desiring a uniform distribution of the hash-values is to minimise the number of collisions which will occur.
Recall that Fx(t) = P(X ≤ t), where P(C) denotes the probability of C. We wish to find H such that P(H(X) = i) = 1/n for 0 ≤ i < n. Consider the random variable Fx(X). Note that P(Fx(X) ≤ r/k) = r/k for 0 < r ≤ k. (This is strictly true only when there are no items in G with duplicate key-values. We shall assume that this is the case in this discussion.) Thus Fx(X) has a discrete uniform distribution on {1/k, 2/k, ..., k/k}, and so nFx(X) has a discrete uniform distribution on {n/k, 2n/k, ..., n}. Therefore ⌈nFx(X)⌉ − 1 is approximately uniform on {0, 1, ..., n − 1} (especially when n < k).
From the above we see that a good choice for H is

H(x) = ⌈nFx(x)⌉ − 1 .
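The derivation above can be sketched directly in modern terms; the following Python fragment (the sample of keys and the table size are illustrative, not taken from the paper) builds H from the empirical distribution function of a sample:

```python
import bisect
import math

def make_cdf_hash(sample_keys, n):
    # Empirical distribution function Fx built from a sample of keys;
    # the returned H computes ceil(n * Fx(x)) - 1 as in the text.
    keys = sorted(sample_keys)
    k = len(keys)
    def H(x):
        Fx = bisect.bisect_right(keys, x) / k    # Fx(x) = P(X <= x)
        return max(0, math.ceil(n * Fx) - 1)     # clamp keys below the sample
    return H

# Distinct keys spread very non-uniformly; hash values come out uniform.
keys = [2**i for i in range(1, 13)]              # 2, 4, 8, ..., 4096
H = make_cdf_hash(keys, 4)
counts = [0] * 4
for x in keys:
    counts[H(x)] += 1
print(counts)   # each of the 4 addresses receives 3 of the 12 keys
```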
When Fx is known and is easily computable, it can be used as above to obtain a hashing function which is tailored to the collection of items to be stored. When Fx is itself a uniform distribution on {1, 2, ..., h}, our hashing function becomes H(x) = ⌈nx/h⌉ − 1, where h is the largest key-value occurring in G.
*Hashing means to cut or chop; hash is a random jumble achieved by hashing; thus the term is appropriate.
Volume 18 Number 3  265
One may take h as the largest key-value in U if the largest key-value occurring among the actual items is not known. This particular address calculation function has often been used.
An extension to the case where Fx can be approximated by a piece-wise linear function is discussed by Isaac and Singleton (1956) and later by Kronmal and Tarter (1965). Polynomial approximations have been tried by Simon and Guiho (1972). Tarter and Kronmal have also considered how to make use of the partial knowledge of Fx which is gained by knowing just some of the items being stored (Tarter and Kronmal, 1968). Their scheme is very costly computationally, but it has interest as an example, independent of its utility.
Consider a sample of key-values x1, x2, ..., xm taken without replacement from the file of items, G, being stored. Let all the items in G have key-values in the range [a, b]. Then the Tarter-Kronmal hashing function, H, which maps key-values in [a, b] to {0, 1, ..., n − 1}, is given as

H(x) = ⌈n·F_{m,t}(x)⌉ − 1 ,

where

F_{m,t}(x) = (x − a)/(b − a) + Σ_{1≤i≤t} [c_i·sin(iπ(x − a)/(b − a)) − s_i·cos(iπ(x − a)/(b − a))] ,

and where

c_i = (2/((b − a)iπ)) Σ_{1≤j≤m} cos[iπ(x_j − a)/(b − a)] ,
s_i = (2/((b − a)iπ)) Σ_{1≤j≤m} sin[iπ(x_j − a)/(b − a)] ,

and t = the least integer ≥ 1 such that

(b − a)²(c_{t+1}² + s_{t+1}²) < 2/(m + 1) .

F_{m,t} is an optimised t-term Fourier series approximation of the empirical distribution function Fx for the items in G. Let F*(x) = P(X ≤ x | X ∈ {x1, ..., xm}). Thus F* is the empirical distribution of the sample {x1, ..., xm} and as such is an approximation to Fx. Then F_{m,t}(x) = (x − a)/(b − a) + F*_t(x), where F*_t is the t-term Fourier series approximation of F*(x) − (x − a)/(b − a). The subtraction of the (x − a)/(b − a) term which is subsequently added back in is a device to increase the rate of convergence of the Fourier series for F*_t.
Tarter and Kronmal have also given a way to update the values c_i, s_i, and t when more items are known, i.e. when m increases. Thus F_{m,t} can be made to be a better and better approximation to Fx. The use of H as an adaptive function necessitates the use of memory to store the previous states of H in so far as required for retrieval. In this sense, H is related to the methods discussed in Knott (1971).
Another possible scheme to make use of partial knowledge of Fx is to use the key-value x to determine 'where' in the current collection of already-stored items we are, and then to select an appropriate function (possibly just a constant) to use as our hash function. This approach is inspired by Marsaglia's method of generating pseudo-random numbers which are distributed in some given manner; see Knuth (1969). Methods of this form which dynamically change the size of the hash table as required are discussed in Knott (1971).

2.2. Cluster-separating hashing functions
A general consideration often mentioned is that a hashing function for a collection of items, G, should destroy clusters in G, meaning that if the key-values of two items are close to each other in some sense, then their hashed-values are not equal. But no matter what hashing function we choose, there are notions of what constitutes a cluster for which our hashing function fails to achieve reasonable results. Some desirable behaviour can be determined for a prospective hashing function only if one knows something of the distribution of items in G. For example, it is wise to ponder the nature of identifier strings actually occurring in programs before choosing a hashing function for a hash table storage and retrieval algorithm which is intended to store such strings. Also, when handling character string keys one may pay particular attention to the non-uniform nature of the binary codes used for representing characters.
In any event, we see that there is no such thing as a general purpose hashing function. What we mean, therefore, when we speak of an acceptable general hashing function is that we have a hashing function which achieves a reasonably uniform distribution of hash-values, given certain general assumptions about the distribution of the items to be processed.
One common assumption is that the key-values of our collection, G, of items is a random sample (without replacement) from the set of all possible key-values, U. From this point of view, a hashing function with range {0, 1, ..., n − 1} which partitions the universe of possible key-values, U, as nearly as possible into n equi-numerous sets of synonymous key-values corresponding to the addresses 0, 1, ..., n − 1 is preferable to one which partitions U into differently sized sets of synonymous key-values. We shall call such a hashing function a balanced hashing function. Ullman (1972) would call a hashing scheme which uses a balanced hashing function 1-uniform.
A more local assumption is that key-values tend to occur in arithmetic progression, i.e. if x is an occurring key-value, x + s for some fixed s is relatively likely to occur also. Depending upon the choice of s, this assumption may be subsumed by the situation discussed below. A simple Markov chain model for clustering which is similar to arithmetic progression clustering is discussed by Toyoda, Tezuka, and Kashara (1966).
Let us suppose that a reasonable notion of the distance between two key-values is the so-called Hamming metric (1950). Two key-values, x and y, considered to be t-tuples whose components are symbols of some alphabet, are separated by a distance d under the Hamming metric when x and y have different symbols in exactly d different component positions. Another common clustering assumption is that if x is a key-value actually occurring in some item, then key-values which are close to x with respect to the Hamming metric are relatively likely to actually occur in some item also.
If we consider our key-values to be English words of t or fewer letters for example, then the key-values are distributed in a certain manner in the set of all t-component vectors of letters (including blank). This distribution is such that our intuitive notion of clusters is indeed approximated by using the Hamming metric as a measure of closeness. If w is an English word, then w changed in one or two positions is also likely to be an English word, relative to the probability that w changed in one or two positions is an English word given only that w is a random string of letters.

The Computer Journal
When the notion of cluster can be usefully defined in terms of the Hamming metric, there are methods of constructing 'cluster-destroying' hashing functions which may be studied theoretically using the results of algebraic coding theory. This was first suggested by Muroga (1961).
In particular let our key-values be considered to be vectors of t symbols, padded if necessary, where the alphabet of usable symbols is in 1-1 correspondence with elements of some finite field, GF(q), of q elements. The common situations are to consider the key-values as vectors of bits in correspondence with GF(2), or to consider them to be vectors of binary-coded characters (say b-bit characters) which are then in correspondence with GF(2^b).
The 1-1 correspondence of symbols to elements of a finite field induces an addition and multiplication operation on the appropriately extended alphabet of symbols in the obvious way. Thus we shall speak of the sum or product of two symbols and understand that the symbol in correspondence with the sum or product of the corresponding field elements is intended. Henceforth, we shall speak of the symbol field in which the symbols are embedded rather than the symbol alphabet.
Now that the set of symbols can be considered to be in a finite field, GF(q), for some q, we have that the set of all possible key-values may be considered as a t-dimensional vector space over GF(q).
Now let us further insist that our 'space of addresses' to be computed be the set of all s-tuples over the symbol field. Such an s-tuple is of course interpretable as an integer address in base q notation, or directly as an integer using the string of bits obtained by considering the concatenated binary representations of the various elements of GF(q).
Thus, for example, we can consider mapping 64-bit key-values into 16-bit addresses by mapping 8-tuples over GF(2^8) into 2-tuples over GF(2^8). Such a mapping is of course a hashing function.
But mappings of this form occur in coding theory. There, we take a message of s symbols taken from GF(q) and encode that message as a longer message of t symbols over GF(q), where the extra symbols are used to provide redundancy or error checking information such as parity checks. The t-symbol coded message is then transmitted to some receiving station, where the received message, now possibly in error, is decoded to yield what is considered to be the original s-tuple. The decoding process tries to use the redundancy of the received message to detect and possibly correct any transmission errors which may have occurred. Coding theory is concerned with the design of coding and decoding schemes and the study of their error detection and error correction properties.
Here we are considering mappings which arise in decoding schemes but we consider them only as compression functions with certain cluster-destroying properties.
A hashing function, H, based on methods used in decoding (actually in computing the so-called syndrome of a received message) can be defined on our t-component keys considered as column vectors by H(x) = Tx, where T is an s × t matrix with entries in the symbol field.
The following theorem applies:
Theorem:
(Sacks (1958)) Let T be an s × t matrix over a field F. Let x and y be t-dimensional column vectors over F. Let D be the Hamming metric. Then for 1 ≤ d ≤ s, 0 < D(x, y) ≤ d implies Tx ≠ Ty, iff every set of d columns of T is linearly independent over F.
Proof:
Suppose x ≠ y, and 0 < D(x, y) ≤ d, and yet that Tx = Ty. Then T(x − y) = 0. Now x − y has at most d non-zero components, and thus T(x − y) is a linear combination over F of at most d columns of T, namely those which are selected by the non-zero components of x − y. But if every d columns of T are linearly independent, then in particular T(x − y) = 0 implies x − y = 0 and hence x = y. Therefore for x ≠ y we have 0 < D(x, y) ≤ d implies Tx ≠ Ty, given that every d columns of T are linearly independent.
Now, conversely, we can choose x and y such that D(x, y) = d and such that T(x − y) is any desired linear combination of any specified d columns of T. But then Tx ≠ Ty implies T(x − y) ≠ 0, for any such choices of x and y, and hence no d columns of T are linearly dependent. QED.
Let us call a hashing function, H, d-separating with respect to the metric D if 0 < D(x, y) ≤ d implies H(x) ≠ H(y). We then have that H(x) = Tx, where T is an s × t matrix as defined above, is a d-separating hashing function with respect to the Hamming metric if every d columns of T are linearly independent.
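A small sketch of such a matrix scheme over GF(2) (the parameters are illustrative; for d = 2 it suffices that the columns of T be distinct and non-zero, which is precisely a Hamming-code parity-check matrix):

```python
from itertools import combinations

# H(x) = T x over GF(2), with T an s x t bit matrix.  Every d columns of T
# linearly independent => H is d-separating (Sacks' theorem above).
s, t = 4, 15
T_cols = list(range(1, 2**s))        # the 15 distinct non-zero 4-bit columns

def H(x):
    # GF(2) matrix-vector product: xor together the columns selected
    # by the 1-bits of the t-bit key x (bit j = component j).
    out = 0
    for j in range(t):
        if (x >> j) & 1:
            out ^= T_cols[j]
    return out

# Any two keys at Hamming distance 1 or 2 hash to different addresses.
key = 0b101010101010101
for i, j in combinations(range(t), 2):
    assert H(key) != H(key ^ (1 << i))
    assert H(key) != H(key ^ (1 << i) ^ (1 << j))
print("no collisions within Hamming distance 2")
```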
There are constraints on the values d, t, and s. In particular we have the so-called Varsharmov-Gilbert bound (Peterson, 1961); an s × t matrix T can be found with elements in a symbol field of q elements such that H(x) = Tx is d-separating whenever t, s, and d are such that:

q^s − 1 > Σ_{1≤i≤d−1} C(t − 1, i)·(q − 1)^i ,

where C(t − 1, i) is the binomial coefficient.
Thus for q = 2^8 (8-bit symbols) and t = 8 and s = 2, we can certainly achieve d = 2, whence no two 8-tuples (64-bit key-values) will map to the same 2-tuple (16-bit bucket address) whenever they differ in less than 3 symbol positions.
Note also that for d = 1 and s ≥ 1, t can be arbitrarily large. Thus two keys which differ in just one position will always be mapped to different addresses. In particular, for q prime, we can take H(x) to be the sum modulo q of the t symbols in the key-value x, where each symbol is an integer in {0, 1, ..., q − 1}. x can thus be considered as a sequence of t consecutive ⌈lg q⌉-bit fields to be added together modulo q.
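Both the bound and the d = 1 special case are easy to check numerically; a sketch (the parameters and the prime q = 251 are illustrative choices, and the inequality is the form given above as reconstructed from the scan):

```python
from math import comb

# Varsharmov-Gilbert style check: an s x t matrix giving a d-separating H
# exists when  q**s - 1 > sum over 1 <= i <= d-1 of C(t-1, i)*(q-1)**i.
def bound_holds(q, t, s, d):
    return q**s - 1 > sum(comb(t - 1, i) * (q - 1)**i for i in range(1, d))

print(bound_holds(256, 8, 2, 2))     # the 64-bit -> 16-bit example above

# The d = 1 special case: summing the t symbols modulo a prime q always
# separates keys that differ in exactly one symbol position.
q = 251                               # an illustrative prime

def H(symbols):
    return sum(symbols) % q

a = [3, 17, 200, 45]
b = list(a); b[2] = 201               # differs from a in one position
assert H(a) != H(b)
```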
To prove the above bound, we consider constructing a matrix, T, of s-component columns with d fixed. Then by the previous theorem we must choose the columns of T such that every subset of d or fewer columns is independent. Suppose we have chosen t − 1 columns such that no d of these columns are linearly dependent. Now in the worst case, every non-zero linear combination of no more than d − 1 columns already chosen is a distinct vector which, of course, may not be chosen to be column t. Counting the number of different non-zero linear combinations of at most d − 1 existing columns, we obtain the sum given above on the right-hand side. Now we will certainly be able to choose another column if there are any vectors among the q^s − 1 non-zero vectors which are not a linear combination of at most d − 1 existing columns; and this will surely be the case when the inequality above holds. Therefore when the inequality above holds, there is surely an s × t matrix T which has every subset of d columns linearly independent, and this proves the above statement.
There is a better way to define d-separating hashing functions which map t-tuples to s-tuples than actually computing H(x) = Tx for some appropriate matrix T. The basic idea is to use an equivalent formulation which involves polynomial division as opposed to matrix multiplication.
Let g(z) be a polynomial of degree s with coefficients in the symbol field. Also, consider t-tuples x = (x0, x1, ..., x_{t−1}) to be in correspondence with polynomials x(z) of degree t − 1 or less, where x(z) = x0 + x1·z + ... + x_{t−1}·z^{t−1}. We shall write x(z)* to denote the associated t-tuple x.
Now we can consider the hashing function

H(x) = (x(z) mod g(z))* ,
where the result is the s-tuple corresponding to the polynomial x(z) mod g(z). Of course all arithmetic among coefficients is done in the symbol field. Knuth (1969) gives a standard polynomial division algorithm which may be used to compute H(x).
The mapping x(z)* → (x(z) mod g(z))* is a linear transformation on the space of t-tuples over the symbol field into the space of s-tuples over the symbol field. This follows since for a and b in the symbol field, we have:

((a·x(z) + b·y(z)) mod g(z))* = a·(x(z) mod g(z))* + b·(y(z) mod g(z))* .

But then there exists an s × t matrix T_g such that T_g·x = (x(z) mod g(z))* for every t-tuple x. Thus for some matrices, T_g, T_g·x can be computed as (x(z) mod g(z))*, and for every polynomial g(z) of degree s there is an associated linear transformation, T_g, which can be used to specify a hashing function. It is particularly convenient to compute T_g·x by computing (x(z) mod g(z))*.
Now we may ask what properties of g(z) result in T_g having every d columns independent for d as large as possible.
Let K(T_g) be the kernel of T_g, that is the set of t-tuples, x, such that T_g·x = 0. Now if whenever x, y ∈ K(T_g) and x ≠ y we have D(x, y) > d, then H(x) = (x(z) mod g(z))* = T_g·x is d-separating. To see this, suppose that x, y ∈ K(T_g) and x ≠ y implies D(x, y) > d. Now let u and v be t-tuples such that u ≠ v and H(u) = H(v). Then D(u, v) = D(u − (u(z) mod g(z))*, v − (v(z) mod g(z))*) > d, since (u − (u(z) mod g(z))*) ∈ K(T_g) and (v − (v(z) mod g(z))*) ∈ K(T_g); and hence H is d-separating as required.
Thus if we choose g(z) such that K(T_g) contains no vectors x, y, x ≠ y, where D(x, y) ≤ d, then H(x) = (x(z) mod g(z))* is d-separating. In coding theory terms, K(T_g) is called the cyclic code generated by g(z), with minimum distance d + 1.
It is not hard to find appropriate polynomials g(z) with which we may construct d-separating hashing functions, but some knowledge of the theory of finite fields is required and we shall forgo these constructions here (the appendix contains further details).
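A minimal sketch of the polynomial-division computation over GF(2) (the degree-8 g(z) below is a standard CRC-8 polynomial, z^8 + z^2 + z + 1, chosen purely for illustration; the paper does not prescribe a particular g):

```python
# Polynomial-division hashing over GF(2): H(x) = (x(z) mod g(z))*.
# Bit j of the integer key is the coefficient x_j of z**j, so the key is a
# t-tuple of bits and the hash is the s-bit remainder.
G = 0x107          # coefficients of g(z) = z^8 + z^2 + z + 1, degree s = 8
S = 8

def poly_mod_hash(x, t):
    # Long division of x(z) by g(z); in GF(2) subtraction is exclusive-or.
    for j in reversed(range(S, t)):
        if (x >> j) & 1:
            x ^= G << (j - S)      # cancel the z**j term with a multiple of g
    return x                        # remainder has degree < s, i.e. s bits

h = poly_mod_hash(0xDEADBEEF, 32)   # a 32-bit key compressed to 8 bits
print(hex(h))
```

Because taking the remainder is linear over GF(2), the hash of the exclusive-or of two keys is the exclusive-or of their hashes, exactly the T_g linearity noted above.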
One may also find polynomials which lead to hashing functions which separate key-values which differ in bursts, that is, in component positions which are within a certain distance of each other. Of course there are various restrictions on the values of t, s, and d, also related to circumstances which involve finite field theory.
Schay and Raver (1963) may be consulted to learn how to choose particular polynomials, g(z). Klinkhamer has also provided an introduction to this subject.
2.3. Distribution-independent hashing functions
We now consider some classes of hashing functions which may be applied without specific knowledge of the initial distribution of items. From such a position of ignorance, one may turn to 'scrambling' methods inspired by methods of computing pseudo-random numbers. Some of these methods in fact turn out to be cluster-separating schemes also. We may consider a key-value as a bit-string. Then some common elementary hashing functions are listed below.
1. (extraction). H(x) = a k-bit value composed by extracting k bits from x and concatenating them in some order. The range is {0, ..., 2^k − 1}.

2. (multiplication). H(x) = a k-bit value extracted from the value x² or cx for some constant c.

3. (exclusive-oring). H(x) = a k-bit value extracted from the result of exclusive-oring various segments of x together.

4. (addition). H(x) = a k-bit value extracted from the result of adding various segments of x together.

5. (division). H(x) = x mod n. The range is {0, ..., n − 1}.

6. (radix conversion). H(x) = a k-bit value extracted from the result of treating x as a binary-coded sequence of base p digits and converting this coded value into a similarly-coded base q value. The range is nominally {0, ..., 2^k − 1}.
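Toy Python versions of operations 1-5 for a 32-bit key may make the list concrete (the constants, shift amounts, and the choices k = 8 and n = 251 are illustrative, not prescriptions from the text):

```python
M32 = 0xFFFFFFFF

def h_extract(x, k=8):                  # 1. extraction: some k bits of x
    return (x >> 12) & ((1 << k) - 1)

def h_multiply(x, k=8, c=2654435769):   # 2. k bits taken from c*x
    return ((c * x) & M32) >> (32 - k)

def h_xor_fold(x, k=8):                 # 3. exclusive-or of segments of x
    return (x ^ (x >> 8) ^ (x >> 16) ^ (x >> 24)) & ((1 << k) - 1)

def h_add_fold(x, k=8):                 # 4. sum of 8-bit segments mod 2**k
    return sum((x >> i) & 0xFF for i in (0, 8, 16, 24)) % (1 << k)

def h_divide(x, n=251):                 # 5. division: x mod n
    return x % n

x = 0xCAFEBABE
print([f(x) for f in (h_extract, h_multiply, h_xor_fold, h_add_fold, h_divide)])
```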
From such operations as those just given and other simple logical and arithmetic operations one can construct reasonable hashing functions. For example, for character-string key-values, a possible hashing function is to take the residue modulo n of the concatenation of the first character, the last character and the binary value representing the length of the given character string.
In general, one can convert long key-values to a size which is more reasonable for computation by the extraction or exclusive-oring of subfields. When a suitably-sized number is obtained by such compression it may be further transformed by multiplication, radix-conversion or division operations.
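The character-string example just given might be sketched as follows (the packing layout and the table size n = 1009 are illustrative assumptions, not details fixed by the text):

```python
# Concatenate first character, last character, and the string length into
# one integer, then take the residue modulo the table size n.
def string_hash(s, n=1009):
    first, last = ord(s[0]), ord(s[-1])
    packed = (first << 16) | (last << 8) | (len(s) & 0xFF)
    return packed % n

print(string_hash("counter"), string_hash("index"))
```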
The commutativity of certain operations may be undesirable. For example, if we choose to exclusive-or two component fields, a and b, of the key-value, and if there is a high probability that if there is an item with a key-value such that a = u, b = v, then there will be an item with a key-value such that a = v, b = u, then the exclusive-oring of the component fields a and b will lead to a relatively high incidence of collisions. This may be the case when we are storing matrix elements, using their indices as key-values. A solution is to use a non-symmetric compression operation. For example, the first component shifted circularly some number of places may be exclusive-ored with the unshifted second component.
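The point about commutativity can be demonstrated with matrix-index keys (i, j); the 16-bit field width and the rotation amount are illustrative:

```python
def rotl16(v, r):
    # circular left shift of a 16-bit value
    return ((v << r) | (v >> (16 - r))) & 0xFFFF

def h_sym(i, j):
    return i ^ j                  # symmetric: (i, j) and (j, i) collide

def h_rot(i, j):
    return rotl16(i, 5) ^ j       # non-symmetric compression

assert h_sym(3, 7) == h_sym(7, 3)     # guaranteed collision
print(h_rot(3, 7) == h_rot(7, 3))     # → False
```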
The multiplication method is sensitive to the distribution of key-values, and/or the choice of a constant multiplier. The occurrence of many 0 digits in the key-value or constant multiplier results in a simple linear combination of shifted forms of the key-value and/or constant multiplier. The digits of the key-value may then be carried over to the hash-value in a non-random manner, thus biasing the resulting hash-values. This is not serious if we choose a good constant c. Floyd (1970) has discovered an interesting multiplicative scheme which is discussed below. The use of a middle segment of x² has a long history, inspired no doubt by von Neumann (see Knuth, 1969), but other methods may be safer.
The exclusive-or operation has the advantageous feature that if the arguments have 0 or 1 bits occurring equi-probably and independently in each position, then the resulting bit string also has this property.
Addition schemes which extract other than the low-order k bits of the sum are generally inferior to the use of shifting and exclusive-oring methods in that the number of 0's and 1's resulting do not occur equi-probably and independently even though the input had this property; however, sums of k-bit fields taken modulo 2^k are as good as exclusive-oring. In any event, if the sub-fields involved are not independent, the results of addition (or exclusive-oring) will be biased.
Lin (1963) suggests that radix conversion methods with the initial and final radix values being relatively prime give a good general purpose hashing function.
An attractive particular radix conversion hashing function might be to treat a key-value, x, as a binary value and convert it to a binary-coded base n value, thereupon choosing the low-order digit (⌈lg n⌉ bits) as H(x); but this is just the same as computing x mod n. More complex radix conversion is probably not worthwhile.
Simple division hashing functions seem to have acceptable properties. With a suitable choice of modulus, clusters under the Hamming metric are destroyed, and such hashing functions are usually balanced. Toyoda (1966) has considered the behaviour of a simple division hashing function for a set of items whose keys are clustered according to a simple Markov chain generating scheme. His model formalises the notion of arithmetic progression clusters, but his results are not directly useful without further computation. Knuth (1973) has noted that n should not have 2 or 3 as factors: if n is even, x mod n is even whenever x is even, for example. The choice of a prime modulus is often recommended, for then runs of key-values in arithmetic progression hash to different address values as much as possible unless the step-size of the progression is a multiple of the modulus. Buchholz (1963), Heising (1963), and Maurer (1968) have discussed the division method. Buchholz points out that primes of the form k·r^i + 1, where r is the radix of the key-values (generally a power of 2), may not be the best choice, for then the hash-values are complex shifted sums of subfields of the key-values, and if these subfields are not independent then the comment concerning addition hashing functions applies. The modulus need not necessarily be chosen to be prime, however, and this flexibility is one of the assets of the division method.
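Knuth's caveat about factors of the modulus is easy to exhibit (the moduli 100 and 101 below are arbitrary example values):

```python
# With an even modulus the parity of the key leaks straight into the hash
# value, so a file of even keys fills only the even slots; a prime modulus
# of comparable size reaches every slot.
even_keys = range(0, 1000, 2)
slots_even_n = {x % 100 for x in even_keys}      # n = 100, even
slots_prime_n = {x % 101 for x in even_keys}     # n = 101, prime

print(all(s % 2 == 0 for s in slots_even_n))     # True: half the table unused
print(len(slots_prime_n))                         # every one of the 101 slots
```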
Another asset is that the quotient as well as the remainder is generally available, and this may be put to good use.
When simple extraction alone, or some other 1-1 function on a subfield of the key-value, can be used, one may obtain the advantage that that subfield of the key-value of an item need not be stored, since it can be computed from the result of the hash function. Such fields can be dropped only when the collision resolution scheme being used permits it by maintaining the distinction between buckets. This space-saving can be very important in some circumstances, even at the cost of retrieval time.
2.4. A multiplicative hashing function
Floyd's construction of a particular multiplicative hashing function (1970) is given in this section.
Let x be as usual a non-negative integral key-value. Let c be such that 0 ≤ c < 1. Now we may compute cx mod 1, i.e. the fraction part of cx, to obtain a value in the unit interval [0, 1). Now given such a value 0 ≤ y < 1, we may scale y to obtain an address in {0, 1, ..., n − 1} by computing ⌊ny⌋. Thus we will consider a hashing function H(x) of the form

H(x) = ⌊n(cx mod 1)⌋ .

The question to be resolved here is how to choose c.
Let us assume that key-values have a tendency to occur in arithmetic progression, so that we may have key-values x, x + s, x + 2s, .... In this case it is desirable that H(x), H(x + s), H(x + 2s), ... be a sequence of distinct addresses (in so far as this is possible). But this goal is achieved if the set of values {cx mod 1, c(x + s) mod 1, ...} are 'well-spread' in the unit interval, for then the addresses achieved by scaling will be well-spread among {0, 1, ..., n − 1}.
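In integer arithmetic the scheme is commonly realised by taking c = K/2^32 for a 32-bit constant K; a sketch (the golden-ratio constant is a conventional illustrative choice, not the constant Floyd's construction derives):

```python
# H(x) = floor(n * (c*x mod 1)) with c = K / 2**32: the 32-bit wrapping
# multiplication computes c*x mod 1 scaled by 2**32, and the final shift
# performs the floor(n*y) scaling.
K = 2654435769                          # round(2**32 / golden ratio)

def mult_hash(x, n):
    frac = (K * x) & 0xFFFFFFFF         # fraction part of c*x, times 2**32
    return (n * frac) >> 32             # scale into {0, ..., n-1}

n = 8
addrs = [mult_hash(x, n) for x in range(0, 160, 16)]   # arithmetic progression
print(addrs)                            # the progression spreads over the table
```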
Now let us further suppose that c = k/h where k and h are integers. It is convenient to choose h such that s ≡ 1 mod h, where the integer s is the arithmetic progression step-size. Then for integral a and i we have:

c(x + as^i) mod 1 = (cx + (1/h)(kas^i mod h)) mod 1 = (cx + ac) mod 1 .

We have used the following facts:

(f/h) mod 1 = (1/h)(f mod h) ,

and

(f + g) mod 1 = (f + (g mod 1)) mod 1 ;

as well as the condition s ≡ 1 mod h. Thus for h and s such that s ≡ 1 mod h we have, regardless of the values of i1, i2, ...,
the sequence of key-values x, x + s^{i1}, x + 2s^{i2}, ... corresponds to the values cx mod 1, (cx + c) mod 1, (cx + 2c) mod 1, ....
It is no loss of generality to assume x is such that cx mod 1 = 0, since we here consider the unit interval to be circular (0 = 1) in any event. Hence it suffices that for s ≡ 1 mod h we further choose h and k, and thus determine c, such that the points 0, c mod 1, 2c mod 1, 3c mod 1, ... are well-spread in the unit interval.
We shall consider a set of points to be well-spread in the unit interval when the ratio of the minimum of the lengths of the sub-intervals defined by the given points placed in the unit interval to the maximum sub-interval length is large. We denote this ratio by Dc.
The distribution of 0, c mod 1, 2c mod 1, ... has been studied asymptotically by Zaremba (1966). We are interested in the distribution in its early stages as well. Halton (1965) has results of use here.
Now consider the equations:

1 = m1·c1 + c2
c1 = m2·c2 + c3
c2 = m3·c3 + c4
  ...
c_{i−1} = m_i·c_i + c_{i+1} ,          (0)

where each m_i is a non-negative integer, c1 = c, and 0 < c_i < 1 for i > 1.
To avoid complications let us suppose that c is irrational; then the sequence c, c2, c3, ... is not eventually periodic and the following analysis holds. Of course the actual value of c which we will use will be a rational approximation, k/h, to some desired irrational value, such that s ≡ 1 mod h.
By considering the process of successively placing the points c mod 1, 2c mod 1, ... in the unit interval, and keeping in mind the
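The equations (0) above are the continued-fraction expansion of c (the m_i are its partial quotients), and the spread ratio Dc can be computed directly; a Python sketch for rational c (the function names and the example c = 13/21, a convergent of the reciprocal golden ratio, are illustrative):

```python
from fractions import Fraction

def partial_quotients(c, count=5):
    # The m_i of equations (0): prev = m*rem + nxt at each step,
    # starting from 1 = m1*c1 + c2 with c1 = c.
    ms, prev, rem = [], Fraction(1), c
    for _ in range(count):
        if rem == 0:
            break
        m = prev // rem
        ms.append(int(m))
        prev, rem = rem, prev - m * rem
    return ms

def spread_ratio(c, npts):
    # Dc: (minimum gap) / (maximum gap) of the points {i*c mod 1},
    # taking the unit interval to be circular.
    pts = sorted(Fraction(i) * c % 1 for i in range(npts + 1))
    gaps = [b - a for a, b in zip(pts, pts[1:])] + [1 - pts[-1] + pts[0]]
    return min(gaps) / max(gaps)

c = Fraction(13, 21)
print(partial_quotients(c), float(spread_ratio(c, 8)))
```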
