…chance? The surprising answer here is that they will never collide: two strings of the same length that differ in only one character cannot produce the same function value. The proof begins with this succession of observations:

(1) h xor c = h' xor c if and only if h = h'.
(2) h xor c = h xor c' if and only if c = c'.
(3) T[x] = T[x'] if and only if x = x'.
(4) If c ≠ c', then T[h xor c] ≠ T[h xor c'].
(5) If h ≠ h', then T[h xor c] ≠ T[h' xor c].

If h[i] is the succession of values encountered during the hashing of some string C, and h'[i] that for string C', the fourth observation above guarantees that the sequences h and h' begin to differ at the point where C and C' first differ. The last observation guarantees that they must continue to differ until a second pair of differing input characters is encountered.

On a more practical level, if h is applied to a large number of real English words, are the 256 possible outcomes reasonably equally represented? Yes. h was applied to a spelling-checker dictionary with 26,662 words (many differing from others only in an appended "-s" or "-ed"), and the number of occurrences of each of the 256 output values was tallied. (All dictionary entries were in lowercase and appeared in alphabetical order.) Naturally, the resulting 256 counts were not all equal, but they are not alarmingly uneven. The traditional chi-square goodness-of-fit test asks how often a truly random function would be expected to produce a distribution at least as uneven as this, and the answer is, "About half the time" (χ² = 255.64, 255 degrees of freedom, p = 0.477). To test for correlation between the hash values of adjacent words, a second test was run in which successive hash values were exclusive-ORed, and the resulting 26,661 values were tallied. The distribution of the 256 resulting tallies was notably uniform (χ² = 212.47, 255 d.f., p = 0.976).

COMPARISON WITH A SIMPLE ALTERNATIVE

Although it seems to receive little attention in the literature, the following hashing algorithm is suspected of being widely used because of its speed and ease of implementation on small processors. (This is a special case of the addition option mentioned in [2, p. 268].)

   h[0] := 0;
   for i in 1 .. n loop
      h[i] := ( h[i-1] + C[i] ) mod table_size;
   end loop;
   return h[n];

As before, the subscripts on h would be omitted in any practical implementation. If table_size is a power of two, the remainder can be extracted without a division operation. Also, if the processor can accommodate integers larger than any plausible sum of characters, the remainder can be taken just once, after all of the additions, instead of after each step.

Addition being commutative, this function will not separate anagrams, perhaps an inauspicious sign. The distribution of values returned by this function for the 26,662 dictionary entries mentioned above demonstrated a significant deviation from uniformity (χ² = 468.9, 255 d.f., p < 0.001).

On the other hand, in a test involving a smaller number of words, this additive hashing function performed about as well as h. In this test, 128 words were randomly selected from the 26,662-word dictionary, and collisions were counted among the resulting hash values. Over ten such trials, the mean number of collisions for the additive hashing function (28.2) was only slightly higher than for h (26.8). So the nonuniform distribution of the additive hashing function does not necessarily confer a large performance penalty.
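For readers who want something concrete to experiment with, here is a minimal C sketch of the two functions compared above, written from the descriptions in this article. The names pearson_hash and additive_hash are invented here, and the 256-byte table T (for example, the values of Table I) is assumed to be available as an array.

   #include <stddef.h>
   #include <stdint.h>

   /* The 256-byte permutation table, e.g., the values of Table I. */
   extern const uint8_t T[256];

   /* The hashing function h: start at zero and fold in each
      character with h = T[h xor c]. */
   uint8_t pearson_hash(const char *s, size_t n)
   {
       uint8_t h = 0;
       for (size_t i = 0; i < n; i++)
           h = T[h ^ (uint8_t)s[i]];
       return h;
   }

   /* The simple additive alternative shown above.  table_size is
      assumed to be a power of two, so the remainder reduces to a
      bit mask and can be applied once, at the end. */
   unsigned additive_hash(const char *s, size_t n, unsigned table_size)
   {
       unsigned sum = 0;
       for (size_t i = 0; i < n; i++)
           sum += (uint8_t)s[i];
       return sum & (table_size - 1);
   }

Both loops touch each character exactly once; the only difference is whether the running value is passed through the table T or simply accumulated.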
VARIANTS

The following variations and extensions of this hashing scheme have been explored.

TABLE I. Pseudorandom Permutation of the Integers 0 through 255

1 87 49 12 176 178 102 166 121 193 6 84 249 230 44 163
14 197 213 181 161 85 218 80 64 239 24 226 236 142 38 200
110 177 104 103 141 253 255 50 77 101 81 18 45 96 31 222
25 107 190 70 86 237 240 34 72 242 20 214 244 227 149 235
97 234 57 22 60 250 82 175 208 5 127 199 111 62 135 248
174 169 211 58 66 154 106 195 245 171 17 187 182 179 0 243
132 56 148 75 128 133 158 100 130 126 91 13 153 246 216 219
119 68 223 78 83 88 201 99 122 11 92 32 136 114 52 10
138 30 48 183 156 35 61 26 143 74 251 94 129 162 63 152
170 7 115 167 241 206 3 150 55 59 151 220 90 53 23 131
125 173 15 238 79 95 89 16 105 137 225 224 217 160 37 123
118 73 2 157 46 116 9 145 134 228 207 212 202 215 69 229
27 188 67 124 168 252 42 4 29 108 21 247 19 205 39 203
233 40 186 147 198 192 155 33 164 191 98 204 165 180 117 76
140 36 210 172 41 54 159 8 185 232 113 196 231 47 146 120
51 65 28 144 254 221 93 189 194 139 112 43 71 109 184 209

This permutation gave good hashing behavior.
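The article does not describe how the permutation in Table I was generated. One common way to produce such a table, offered here only as an illustrative sketch and not as the author's procedure, is a Fisher-Yates shuffle of the integers 0 through 255:

   #include <stdint.h>
   #include <stdlib.h>

   /* Fill t[] with a pseudorandom permutation of 0..255.
      rand() is used only for illustration; any pseudorandom
      source will do. */
   void make_table(uint8_t t[256])
   {
       for (int i = 0; i < 256; i++)
           t[i] = (uint8_t)i;
       for (int i = 255; i > 0; i--) {      /* Fisher-Yates shuffle */
           int j = rand() % (i + 1);
           uint8_t tmp = t[i]; t[i] = t[j]; t[j] = tmp;
       }
   }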
Smaller Character Range

If the range of the input characters can be limited, a smaller auxiliary table T may be used, and the range of h can be limited accordingly. For example, if the input string can be limited to digits and uppercase letters, then each character can be mapped into the range [0, 63] as it is processed; T can be a table of 64 values in the range [0, 63], and h will then return a value in that range.

For example, the 26,544 spelling-checker entries consisting entirely of letters and digits were hashed with a function that mapped the digits onto [0, 9] and both the uppercase and lowercase alphabets onto [10, 35]. A 64-element T was built by eliminating all entries exceeding 63 from the table presented earlier. Distribution over the 64 output values was as even as would be expected from a random function (χ² = 59.17, 63 degrees of freedom, p = 0.614), while the exclusive-ORs of successive values were insignificantly less uniform (χ² = 81.69, 63 d.f., p = 0.057).

Larger Range of Hash Values

In some applications, a range of hash indices larger than 256 is needed. Here is a simple way to get 16 bits of hash index from the function h:

(1) Apply h to the string, calling the result H1.
(2) Add 1 (modulo 256) to the first character of the string.
(3) Apply h to the modified string to get H2.
(4) Concatenate H1 and H2 to get a 16-bit index.

When this algorithm was applied to the 26,662-word spelling-checker dictionary, 4,721 collisions occurred. Since perfectly random hashing would produce, on the average, 4,757 collisions, we conclude that this 16-bit extension of h performs essentially as well as random hashing.

In a second test, the 65,536 possible 16-bit index values were grouped and tallied in 533 bins. (The number of bins was chosen so that the average bin would catch about 50 of the 26,662 tallies.) The resulting distribution of tallies was consistent with the hypothesis of a uniform distribution (χ² = 558.6, 532 d.f., p = 0.205).

Permuted Index Space

Some users of hashing functions who are concerned with collision handling prefer to think of the hashing function as producing a permutation of the index space, thereby specifying not just a single hash index, but a succession of hash indices to be tried in case of collisions [6]. The function h is well suited to this sort of application. By repeatedly incrementing the first character of the input string, modulo 256, one causes the hash index returned by h to pass through all 256 possible index values in a very irregular manner. This is derived from the assertion that strings of equal length differing in only one character cannot produce the same hashing function value.
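The Smaller Character Range variant above amounts to remapping each character before it is folded into the hash. The sketch below is illustrative only: the names map36, T64, and pearson_hash64 are invented here, and T64 is assumed to hold, in their original order, the 64 entries of Table I that are less than 64.

   #include <stddef.h>
   #include <stdint.h>

   /* The 64 values of Table I below 64, kept in their original order. */
   extern const uint8_t T64[64];

   /* Map digits onto [0, 9] and either alphabet onto [10, 35]. */
   static int map36(char c)
   {
       if (c >= '0' && c <= '9') return c - '0';
       if (c >= 'a' && c <= 'z') return c - 'a' + 10;
       if (c >= 'A' && c <= 'Z') return c - 'A' + 10;
       return -1;               /* caller must reject other characters */
   }

   /* Reduced-range hash: both h and the mapped character stay below
      64, so the xor never leaves the 64-entry table. */
   uint8_t pearson_hash64(const char *s, size_t n)
   {
       uint8_t h = 0;
       for (size_t i = 0; i < n; i++)
           h = T64[h ^ map36(s[i])];
       return h;
   }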
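The 16-bit extension and the permuted-index-space idea both rest on the same trick of rehashing the string with its first character incremented modulo 256. Here is a minimal sketch under the same assumptions as before; pearson_hash is the sketch given earlier, and pearson_hash16 and probe_sequence are invented names.

   #include <stdint.h>
   #include <stdlib.h>
   #include <string.h>

   uint8_t pearson_hash(const char *s, size_t n);   /* sketched earlier */

   /* 16-bit extension: hash the string, hash it again with its first
      character incremented (mod 256), and concatenate the results. */
   uint16_t pearson_hash16(const char *s, size_t n)
   {
       if (n == 0)
           return 0;                            /* degenerate case */
       uint8_t h1 = pearson_hash(s, n);

       char *tmp = malloc(n);                   /* modified copy */
       memcpy(tmp, s, n);
       tmp[0] = (char)((uint8_t)tmp[0] + 1);    /* add 1 mod 256 */
       uint8_t h2 = pearson_hash(tmp, n);
       free(tmp);

       return (uint16_t)((h1 << 8) | h2);       /* concatenate H1, H2 */
   }

   /* Permuted index space: incrementing the first character 256 times
      makes h visit every index 0..255 exactly once, giving an
      irregular probe sequence for collision handling. */
   void probe_sequence(const char *s, size_t n, uint8_t probes[256])
   {
       if (n == 0)
           return;                              /* empty string not handled */
       char *tmp = malloc(n);
       memcpy(tmp, s, n);
       for (int k = 0; k < 256; k++) {
           probes[k] = pearson_hash(tmp, n);
           tmp[0] = (char)((uint8_t)tmp[0] + 1);
       }
       free(tmp);
   }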
Perfect Hashing

A hashing function is perfect, with respect to some list of words, if it maps the words in the list onto distinct values, that is, with no collisions. A perfect hashing function is minimal if the integers onto which that particular list of words is mapped form a contiguous set, that is, a set with no holes. (See, for example, [1], [5], and [3, pp. 506-507].)

TABLE II. A Permutation Demonstrating Perfect Hashing

39 159 180 252 71 6 13 164 232 35 226 155 98 120 154 69
157 24 137 29 147 78 121 85 112 8 248 130 55 117 190 160
176 131 228 64 211 106 38 27 140 30 88 210 227 104 84 77
75 107 169 138 195 184 70 90 61 166 7 244 165 108 219 51
9 139 209 40 31 202 58 179 116 33 207 146 76 60 242 124
254 197 80 167 153 145 129 233 132 48 246 86 156 177 36 187
45 1 96 18 19 62 185 234 99 16 218 95 128 224 123 253
42 109 4 247 72 5 151 136 0 152 148 127 204 133 17 14
182 217 54 199 119 174 82 57 215 41 114 208 206 110 239 23
189 15 3 22 188 79 113 172 28 2 222 21 251 225 237 105
102 32 56 181 126 83 230 53 158 52 59 213 118 100 67 142
220 170 144 115 205 26 125 168 249 66 175 97 255 92 229 91
214 236 178 243 46 44 201 250 135 186 150 221 163 216 162 43
11 101 34 37 194 25 50 12 87 198 173 240 193 171 143 231
111 141 191 103 74 245 223 20 161 235 122 63 89 149 73 238
134 68 93 183 241 81 196 49 192 65 212 94 203 10 200 47

 1 a      9 for    17 in     25 the
 2 and   10 from   18 is     26 this
 3 are   11 had    19 it     27 to
 4 as    12 have   20 not    28 was
 5 at    13 he     21 of     29 which
 6 be    14 her    22 on     30 with
 7 but   15 his    23 or     31 you
 8 by    16 i      24 that

With this table, a minimal, perfect hashing function is constructed that produces the values shown for 31 common English words.
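As a quick check of the table (computed here from Table II read row by row, left to right, together with the lowercase ASCII codes; the article itself lists only the final word values), the rule h = T[h xor c] gives:

   h("a")   = T[0 xor 97]                        = T[97]  = 1
   h("i")   = T[0 xor 105]                       = T[105] = 16
   h("the") = T[ T[ T[0 xor 116] xor 104 ] xor 101 ]
            = T[ T[72 xor 104] xor 101 ]
            = T[ T[32] xor 101 ]
            = T[176 xor 101]
            = T[213] = 25

These agree with the values 1, 16, and 25 assigned to "a", "i", and "the" in the word list above.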
Minimal perfect hashing functions are useful in applications where a predetermined set of high-frequency words is expected and the hash value is to be used to index an array relating to those words. If the hashing function is minimal, no elements in the array are wasted (unused).

The table T at the heart of this new hashing function can sometimes be modified to produce a minimal, perfect hashing function over a modest list of words. In fact, one can usually choose the exact value of the function for a particular word. For example, Knuth [3] illustrates perfect hashing with an algorithm that maps a list of 31 common English words onto unique integers between -10 and 30. The table T presented in Table II maps these same 31 words onto the integers from 1 to 31, in alphabetical order.

Although the procedure for constructing the table in Table II is too involved to be detailed here, the following highlights will enable the interested reader to repeat the process.

(1) A table T was constructed by pseudorandom permutation of the integers (0 ... 255).
(2) One by one, the desired values were assigned to the words in the list. Each assignment was effected by exchanging two elements in the table.
(3) For each word, the first candidate considered for exchange was T[h[n-1] xor C[n]], the last table element referenced in the computation of the hash function for that word.
(4) A table element could not be exchanged if it was referenced during the hashing of a previously assigned word or if it was referenced earlier in the hashing of the same word.
(5) If the necessary exchange was forbidden by Rule 4, attention was shifted to the previously referenced table element, T[h[n-2] xor C[n-1]].

This procedure is not always successful. For example, using the ASCII character codes, if the word "a" hashes to 0 and the word "i" hashes to 15, it turns out that the word "in" must hash to 0 (since 15 xor 'n' = 15 xor 110 = 97 = 'a', the final table reference for "in" is the same one that was fixed for "a"). Initial attempts to map Knuth's 31 words onto the integers (0 ... 30) failed for exactly this reason. The shift to the range (1 ... 31) was an ad hoc tactic to circumvent this problem.

Does this tampering with T damage the statistical behavior of the hashing function? Not seriously. When the 26,662 dictionary entries are hashed into 256 bins, the resulting distribution is still not significantly different from uniform (χ² = 266.03, 255 d.f., p = 0.30). Hashing the 128 randomly selected dictionary words resulted in an average of 27.5 collisions versus 26.8 with the unmodified T. When this function is extended as described above to produce 16-bit hash indices, the same test produces a substantially greater number of collisions (4,870 versus 4,721 with the unmodified T), although the distribution still is not significantly different from uniform (χ² = 565.2, 532 d.f., p = 0.154).
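The following C sketch illustrates the heart of steps 2 through 4 for a single word. It is a simplified reading of the highlights above, not the author's program: the fallback of step 5 and all failure recovery are omitted, and the names assign and locked, as well as the 64-character word limit, are invented here.

   #include <stdbool.h>
   #include <stddef.h>
   #include <stdint.h>
   #include <string.h>

   static uint8_t T[256];     /* starts as a pseudorandom permutation */
   static bool locked[256];   /* table slots already relied upon */

   /* Try to force h(word) == value by exchanging two entries of T
      (steps 2-4).  Returns false where step 5's fallback or a
      restart would be needed. */
   static bool assign(const char *word, uint8_t value)
   {
       size_t n = strlen(word);
       int idx[64];            /* indices referenced while hashing word */
       uint8_t h = 0;

       if (n == 0 || n > 64)
           return false;
       for (size_t i = 0; i < n; i++) {
           idx[i] = h ^ (uint8_t)word[i];
           h = T[idx[i]];
       }
       int last = idx[n - 1];  /* first candidate for exchange (step 3) */

       if (T[last] != value) {
           int j = 0;
           while (T[j] != value)       /* slot currently holding value */
               j++;

           /* Step 4: neither slot may have been referenced by an
              earlier word, nor earlier in the hashing of this word. */
           bool forbidden = locked[last] || locked[j];
           for (size_t i = 0; i + 1 < n; i++)
               if (idx[i] == last || idx[i] == j)
                   forbidden = true;
           if (forbidden)
               return false;

           uint8_t tmp = T[last];      /* the exchange itself */
           T[last] = T[j];
           T[j] = tmp;
       }

       for (size_t i = 0; i < n; i++)  /* freeze this word's slots */
           locked[idx[i]] = true;
       return true;
   }

In practice this would be called for each word of the list in the chosen assignment order, falling back to the previous table reference (step 5) or to a fresh pseudorandom permutation whenever it reports failure.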
CONCLUSION

The main advantages of the hashing function presented here are:

(1) No restriction is placed on the length of the text string.
(2) The length of the text string does not need to be known beforehand.
(3) Very little arithmetic is performed on each character being hashed.
(4) Similar strings are not likely to collide.
(5) Minimal, perfect hashing functions can be built in this form.

Its principal disadvantages are:

(1) Output value ranges that are not powers of 2 are somewhat more complicated to provide.
(2) More auxiliary memory (the 256-byte table T) is required by this hashing function than by many traditional functions.

This hashing function is expected to be particularly useful in situations where good separation of similar words is needed, very limited instruction sets are available, or perfect hashing is desired.

REFERENCES

1. Cichelli, R.J. Minimal perfect hash functions made simple. Commun. ACM 23, 1 (Jan. 1980), 17.
2. Knott, G.D. Hashing functions. Comput. J. 18, 3 (1974), 265-278.
3. Knuth, D.E. The Art of Computer Programming, Vol. 3: Sorting and Searching. Addison-Wesley, Reading, Mass., 1973.
4. Meyer, C., and Matyas, S. Cryptography. John Wiley & Sons, New York, 1982.
5. Sprugnoli, R. Perfect hashing functions: A single probe retrieving method for static sets. Commun. ACM 20, 11 (Nov. 1977), 841.
6. Ullman, J.D. A note on the efficiency of hashing functions. J. ACM 19, 3 (July 1972), 569-575.

CR Categories and Subject Descriptors: E.2 [Data]: Data Storage Representations - hash-table representations; H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing - indexing methods; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - search process.
General Terms: Algorithms
Additional Key Words and Phrases: Hashing, scatter storage

ABOUT THE AUTHOR:

PETER PEARSON is a computer scientist at Lawrence Livermore National Laboratory, where his work tends to emphasize microcomputers, statistics, cryptology, and physics. Author's Present Address: 5624 Victoria Lane, Livermore, CA 94550.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.