A number of recent papers have considered the quadratic hash method when the table size is a prime number. This paper shows that, contrary to what is normally assumed, the method can be used for tables whose size is a power of 2 without the usual drawback that the period of search is significantly less than the table size.
To access or enter a key K into a hash table of length M, a mapping function I(K) is defined where 1 ≤ I(K) ≤ M for all keys K in the population. If the position I(K) in the table is empty or contains the key K then the table search is concluded. However, if this entry contains some other key, then a systematic method of trying additional positions in the table must be defined (Morris, 1968; Maurer, 1968). The simplest is the linear search method where the positions immediately following the initial position are tried in turn. This can be expressed as follows:
In the simple linear search method, the ith position tried after the initial one is k + i (in future it will be assumed that this is modulo the table size). An obvious extension is to define the ith entry as k + ai where a is coprime with the length of the table, M. The major fault with the linear search is that if two sequences originating from k1 and k2 come together so that
k1 + ai = k2 + aj
then all subsequent entries in one sequence also appear in the other. Once the table starts filling up therefore, sequences tend to join together producing clusters. These clusters themselves join with other clusters so that, when the table is nearly full, the search length is much higher than would be expected if the entries had been added randomly.
The quadratic hash method avoids this clustering by defining the ith position in a sequence as
k + ai + bi2
The next position in a sequence now depends on the length of the sequence so that random collisions of two sequences do not cause them to combine. The computation required for the quadratic search method can be reduced by defining R = a + b and Q = 2b so that the algorithm is:
Frequently the method is used with Q = 1. This means that on most computers the updating of j is a single machine order.
The number of entries that appear in a sequence from a particular initial position before an entry is encountered twice is called the period of search. The period of search should be as large as possible and preferably the same as the table size. Otherwise the table can appear to be full when there is still space available. The condition that a and M are co-prime ensures that the period of search for the linear hash method is M, the table size. If M is a prime number then the period of search for the quadratic hash method is M/2. Although the table would most likely be nearly full (at say 90%) before the maximum search length became as high as M/2, it is possible that exceptional cases could arise.
Radke (1970) and Day (1970) have shown that it is possible to combine two quadratic sequences with disjoint members so that, at the expense of having a more complex function for defining the next in a sequence, it is possible to produce a method having M as the period of search.
Maurer (1968) states that, in general, when M is a power of 2 the period of a quadratic search is usually too small for effective use. However, in the special case Q = 1, the period of search is M - R + 1 (see Appendix). Consequently with R = 1, the complete table is searched. The period of search in this case is considerably better than the case where M is a prime number. Also, as M is a power of 2, additions modulo the table size can be achieved in most computers by masking off the desired number of bits. The algorithm therefore takes the simple form:
An unusual property of the method is that the period of search is reached when the ith and i + 1 th entries are the same. That is the sequence repeats when j = M. For example, the entries tried for a table of length 8 starting at k = 1, R = 1 are
1, 2, 4, 7, 3, 8, 6, 5, 5 . . .
This is often a more convenient test for the table being full than keeping a count of the entries in the table.
A measure of the efficiency of a table search is the average length of search assuming that each entry in the table is accessed as frequently as any other. If p is the fraction that the table is full then the average length of search for the linear hash method is approximately (2 - p)/(2 - 2p) assuming that the keys are chosen randomly (Hopgood, 1969). If we assume that the quadratic method eliminates clustering then the average length of search is approximately -(1/p) log (1 - p) (Morris, 1968). The accuracy of these formulae are shown in Table 1 where the results achieved for a table of length 2048 using random data are compared with the values that should be obtained according to the formulae.
Table 1: Comparison of formulae for linear and quadratic hash
with results achieved with random data. Table size=2048.
|p||(2-p)/(2-2p)||LINEAR a=1||-1/p log(1-p)||QUADRATIC R=7, Q=1|
These results are not usually achieved in practice due to keys normally not being random. These tests were therefore repeated with non-random data. As a frequent use of hash tables is the Symbol Table of a compiler, the identifiers declared in the Algorithms published in the Communications of the Association for Computing Machinery were used. These were hashed into a table of length 2048. The hash function used depended on the first three characters of the identifier. The results obtained are shown in Table 2. The value a = 173 in the linear case was the best result obtained in a set of 10 runs. It can be seen that the quadratic method is considerably better than the linear hash method even though the extra amount of work involved in incrementing j is usually only a single computer instruction per entry examined.
|Table 2: Results obtained with a table of length 2048 and non-random data.|
|p||LINEAR a=1||LINEAR a=173||QUADRATIC R=1, G=1||QUADRATIC R=7,Q=1|
The value R = 7 in the quadratic case was a typical result from a set of R values chosen between 1 and 23. The best result for the average length of search at 0.8 full was 3.538. The case R = 1 gave the worst result in this set.
Consider the sequence
ai = iR + ½i(i - 1) for i = 0, 1,2
If bi = ai (mod M) where M = 2t for some positive t then we require to show that the bi's for i = 0 to M - R are distinct. Suppose bi+ j = bi for some i, j and i + j ≤ M - R then
0 = bi+j - bi = ai+j - ai (mod 2t)
ai+ j - ai = jR + ½j(j + 2i - 1)
jR + ½j(j + 2i - 1) = 0(mod 2t)
If j is even, = 2Tr, where r is odd and we know T < t then dividing by ½j gives:
2R + j + 2i - 1 = 0 (mod2t-T+1)
2(R + i) + (j - 1) = 0 (mod 2t-T+l)
This implies j - 1 is even which is a contradiction.
If j is odd then dividing by j gives:
R + i + ½(j - 1) = 0 (mod 2t)
We can assume that i = 0 and j = 1 are not both true as then R = 0 or λ2t.
0 < i + j ≤ 2t - R
0 < i + ½(j - 1) < 2t - R
i + ½(j - 1) = λ2t - R for some λ
and again there is a contradiction.
DAY, J. C. (1970). Full table quadratic searching for scatter storage, CACM, Vol. 13, No.8, pp. 481-482.
HOPGOOD, F. R. A. (1969). Compiling Techniques, London: Macdonald.
MAURER, W. P. (1968). An improved hash code for scatter-storage, CACM, Vol. 11, No.1, pp. 35-38.
MORRIS, R. (1968). Scatter Storage Techniques, CACM, Vol. 11, No.1, pp. 38-44.
RADKE, C. E. (1970). The use of quadratic residue research, CACM, Vol. 13, No.2, pp. 103-107.
Despite the many papers on this subject in both the Computer Journal (the latest one appeared in Vol. 15, No.4, written by Messrs. Hopgood and Davenport) and other journals -particularly the Communications of the ACM I have come to the conclusion that the solution which should generally give the shortest average search length is the common-sense one, viz:
Initial entry position. P0 = function1(key). Then if further probing is necessary, calculate Increment I, = function2 (key), such that I < M and then subsequent positions, Pi, are given by
Pi = (Pi-1 + I) modulo (M)
As for (1) above, only to ensure I and M are co-prime, calculate:
Increment, I, = 2 × function2 (key) + 1
This is the method given by F. Luccio (1972).
It seems to me that both these methods:
Should anyone know of any analysis or practical work which might discredit these methods, I should be most grateful to hear from them/
LUCClO, F. (1972). Weighted increment linear search tables, CACM, Vol. 15, No. 12, pp. 1045-1047.
The major cause of the average length of search increasing when a table is becoming full is primary clustering, that is, search paths from a number of initial entry positions come together and stay together, thus creating long search paths. A second order effect is the problem of all entries from a particular initial entry position having the same search path. Mr. Pawson suggests a very sensible cure for the second-order effect which also reduces the primary clustering by having more possible search paths from each initial entry point.
The results obtained by Luccio suggest that the weighted increment linear search is comparable to the quadratic method, but no better. This implies that the amount of primary clustering remaining is about equivalent to the second-order effect. If the user feels that the second-order terms are worth removing, it is a simple matter to define a weighted increment quadratic search. Taking the average of a set of runs using random data reduces the average length of search at 90% full from 2.84 to 2.76 for the quadratic method (Luccio's method gives 2.79). As these values are never achieved in practice due to poor hash functions and non-random data, differences in the third figure are academic. In practice, I have found that the straightforward quadratic hash method is both simple to use and gives good results in a number of applications.