Chilton::ACL::Hash Overflow

A Solution to the table overflow problem for Hash Tables

F R A Hopgood

March 1968

Computer Bulletin

One of the main problems arising in the use of Hash Tables is what to do when the Table becomes full. This paper suggests a solution and shows that it can be applied at a time that keeps the cost to a minimum.

The Hash Table (sometimes called Computed Entry, Scatter, Randomized or Key Transformation) has been widely used in compilers for the past ten years. The method first seems to have appeared in some unpublished papers of H. P. Luhn and A. D. Lin in 1953 and was used in an assembly program for the IBM 701 by A. L. Samuel, G. M. Amdahl and E. Boehm in 1954. The features of the method that make it so desirable are that for a fixed table length it gives very quick access compared with more straightforward methods while retaining flexibility and simplicity which tend to be lost in the more complex methods involving reordering. However, it suffers from one drawback which is that when the table becomes full it is not obvious what action should be taken for additional entries. This paper suggests that the solution is to remove all the entries from the original table and re-enter them in a larger table. This suggestion is not new in itself; however, the paper gives some analysis to support this view and also indicates at what time the new table should be adopted.

Definition of Hash Table

This paper will restrict itself to considering only the Open Hash Table [Iverson, 1962. Page 153]. Let us suppose that we have a set of elements each of which has a name or key which uniquely defines it. Let us define the set of all possible keys in the population as K and individual elements as k₁, k₂, k₃ . . . etc. It is assumed that although the number of possible keys is very large only a small subset will be required at any time. We wish to enter these elements into a table such that at a later time they can be accessed by presenting the key. An example would be the set of all six-character FORTRAN identifiers. For a particular FORTRAN subroutine it is necessary to store in the Symbol Table for the routine the names of the identifiers appearing in the routine together with information such as type and location. The table is examined each time an identifier appears in the routine to see if it is already there. If it is there, then the relevant information is read from the table entry; if not, then a new entry is made in the table and a location and default type assigned to the identifier. Although a large number of different FORTRAN identifiers is possible (about 1,500,000,000) only a small number of these will appear in any particular FORTRAN program. Given a Hash Table D of length M with positions D₁, D₂ . . . D_M we must define a mapping function I which maps the set of keys K onto integers in the range 1 to M. In general this will be a many-one mapping. The set of keys which map onto the integer i is defined as C(i) and the number of elements in the set C(i) is defined as L(i). For best results the L(i) for different i should be nearly equal.

The algorithm for entering or accessing the key k in the Open Hash method can be described as follows:

Calculate i = I(k)
If the position D_i is empty or has the key field equal to k then this is the required entry.
If not, set i = i + 1 (Mod M) and go back to step (b).

For example, suppose M = 8 and our identifiers are only allowed to start with one of the letters A to H then the mapping function I could be defined as the position in the alphabet of the initial letter. Entering the identifiers CAT, DOG, COD, DAY, HOT and HAY into the table in that order is shown in Figure 1. The dotted paths indicate additional entries tried after the initial entry before the correct entry is found. It is important to notice that the two classes C(3) and C(4) are producing a cluster of intermingled entries and that to access or enter the identifier DAY it is necessary to examine entries in the classes C(3) and C(4). (The adjective open derives from this property that an access may involve a path through entries of classes other than the one accessed.) This clustering cannot be avoided, but the Mapping Function should be chosen so that for particular sets of keys required there is no correlation between neighbouring entries.

Fig 1: Entering identifiers into hash table with mapping function equal to alphabetic position of initial letter.

To avoid clustering it may be necessary to choose an unreasonably complicated Mapping Function. Often this can be avoided by a redefinition of neighbour. This is done by modifying step (c) of the algorithm to:

..
..
If not set i = i + P (mod M) and go back to step (b).

The neighbours of position i are now i + P (mod M), i + 2P (mod M) etc. All elements of the table will still be searched as long as M and P are co-prime. This useful device does not alter the characteristics of the table but can be thought of as equivalent to changing the Mapping Function. The analysis described below is therefore independent of P.

Efficiency of Hash Tables

Let us define S(k) as the number of entries that have to be examined before the key k is accessed in the table. This will be called the length of scan. (In our example S(CAT), S(DOG), S(HOT) = 1; S(HAY) = 2; S(COD), S(DAY) = 3.) If the number of times that each key will be accessed is known, then the average length of scan time for the table can be obtained. In general this will not be known and an estimate of the efficiency of the table accessing can be obtained by assuming that each entry is accessed the same number of times. approximate average length of scan is then just the sum of the length of scans divided by the number of entries. Assuming that there are N entries in a table of length M then we define A(s) as the approximate average length of scan where:

A(s) = where s = N/M

Later we shall also use A(N,M) to mean the same as A(s).

(In our example A(0.75) = 11/6.)

Peterson (1957) did some simulation runs to obtain values for A(s). These are given in Table 1. The figures given are the average of a set of nine runs with M = 500 using keys mapped randomly over the range 1 to M. Later Schay and Spruth (1962) obtained an approximate formulae for A(s):

A(s) = (2-s)/(2-2s)

This is also given in Table 1 and shows the remarkable agreement with the Peterson simulation results. Schay and Spruth obtained their result for a particular Hash Table which can be shown to be equivalent to the Open Hash. The important points to note are:

A(s) is independent of table size.
The average length of scan is still not much greater than 3 when the table is 80 per cent. full.

Simulation runs with real rather than random data to produce figures slightly worse than those obtained by Peterson. However, with care in choosing the Mapping Function, results fairly close to those of Peterson can be obtained in practice. [See Peterson (1957) and Buchholz (1963)].

TABLE 1 Average length of Search in Open Hash

s(N/M)	Peterson's Simulation Runs	Schay and Spruth Formula
0.1	1.053	1.056
0.2	1.137	1.125
0.3	1.230	1.214
0.4	1.366	1.333
0.5	1.541	1.500
0.6	1.823	1.750
0.7	2.260	2.167
0.8	3.223	3.000
0.9	5.526	5.500
1.0	16.914	∞

Handling Table Overflow

When a Hash Table becomes full then the method suggested here is to define a new table of length P greater than M and then to serially re-hash into the new table the elements already defined in the original table. Subsequent entries and accessing will be made on the new table. The two decisions to be made are:

At what value of s to do the re-hash.
The size of the new table P.

In any situation where time is important it will be unwise to wait until the table is completely full before re-hashing as the average scan length will already have become unreasonably large.

Let us suppose that the decision is made to re-hash the original table of length M into a table of length P when N entries have been inserted in M. Further, let us assume that each of the entries in the original table will be accessed a further b times. Our aim is to choose a time of re-hash such that the cost of the re-hash will be made up for by the b accesses to these original entries. Any new entries will, of course, on average have a smaller length of scan in the new table than the old. This ensures that the cost, as far as time is concerned, will be less by re-hashing rather than continuing with the original table.

If the re-hash had not taken place the total scan length for accessing the N entries a further b times would be bN x A(N,M). It is not possible to give a precise time required for the re-hash. However, let us assume that the time required to pick up serially an entry from the original table is approximately equal to the time required to scan an element in a table. The first will involve a load and loop-test while the second will involve a comparison between the key and the table entry. The assumption is that these are approximately equal. This is not too important as the large part of the time required for the re-hash is entering these N keys into the new table. Therefore, the total time for the re-hash is N × (A(N,P) + 1). The total scan length for doing the re-hash and accessing the N entries a further f3 times in the new table would be approximately

N × (A(N,P) + 1) + bN × A(N,P)

For the re-hash to cost us nothing as far as time is concerned we must, therefore, choose N and P such that

bN × A(N,M) > N × (A(N,P) + 1) + bN × A(N,P)

If we assume that a particular Hash Table closely approximates the formula of Schay and Spruth (1962) then we may substitute for A(N,M) and A(N,P) to give

P/M > s(b + 3s -3) / ((b + 4)s - 4) where s = N/M

From the denominator we can get the condition that it is only possible to do the re-hash at zero cost if

s > 4/(4 + b)

From this formula we have, given b, a lower limit to how full the table must be before the re-hash at zero cost is possible. If, for example, the assumed value of b is 1 (each entry will be accessed once more) then the table must be at least 80 per cent. full before the re-hash. If b= 6, the table need only be 40 per cent. full.

Figure 2 shows how the ratio P to M varies for different s and b. Near the lowest possible value of s for a particular b it is necessary to increase the table size by a very large amount while waiting until the value of s is much larger means that only a small increase in table size is necessary. Of course a larger increase in table size than these minimum positions defined in Figure 2 will also cost nothing for the re-hash.

In fact, choosing a larger table size than strictly necessary will mean that the time cost will be less than continuing with the original table.

Fig 2: Plot to show minimum increase in table size required for b=1, 2, and 6.

Doubling the Table Size

In many cases the most convenient change is to double the size of the original table. This will normally require only a small alteration to the original Mapping Function. (Each of the sets C(i) have to be divided into two approximately equal sets by the new Mapping Function.) Setting P/M = 2 we obtain the following relation for zero cost

-3s² + (11 + b)s - 8 > 0

and the minimum value of a for which the table size can be doubled for zero cost is

s = (11 + b - sqrt(b² + 22b +25)) / 6

Figure 3 shows how this minimum value of a changes with b.

Fig 3: Plot minimum s required for doubling table size at zero cost in time.

So far we have only concerned ourselves with the time required to access the keys in the table. From Figure 3 if the expected value of b was 3 then it is only necessary to fill the table two-thirds full before re-hashing. However, in many cases space is also a rare commodity and it would not be economical to re-hash until it is fairly certain either that the present table will overflow or that it is becoming inefficient due to excessive lengths of scan. As can be seen from Table 1 the average length of scan starts increasing rapidly once the table is about 80 per cent. full. From Figure 3 if b is greater than 1.4 then the cost of accessing these elements will be less Fig. 3. Plot of minimum a required for doubling table size at zero cost in time. if we re-hash into a table twice the size once the original table becomes 80 per cent. full. This suggests that the table should be loaded at least 80 per cent. fuIl before the re-hash takes place. By loading this fuIl we have not increased the average scan length too enormously and yet have waited until the probability of the table overflowing has become more likely.

If we waited until the table was 90 per cent full then the average length of scan would have increased to 5.5. However, the probability of overflow is now much higher and also for b greater than 0.6 the cost of accessing these elements will be less. The precise figure chosen for s is obviously dependent to a large extent on the characteristics of a particular Hash Table and its keys. This analysis would suggest that for a large number of applications doubling the table size once the table becomes between 80 per cent. and 90 per cent full is desirable.

Using Statistics Obtained from the Actual Data

So far all the analysis has been based on figures obtained from the random data. In any particular case, however, the values of A(s) may depart from the random values and it would be more accurate to make decisions using average lengths of scan obtained from our actual data. Considering the case of doubling the table size we have, for zero cost in time, that

b × A(N,M) > (b + 1) × A(N,2M) + 1

The values of A(N,M) can be calculated during the entering of keys into the table. However, the values of A(N,2M) are unknown. In many cases where the Mapping Function for the new table closely resembles the Mapping Function for the original table it might well be more reasonable to use A(N/2,M) as an estimate of A(N,2M). If this is done then,given b, the cost of re-hash is zero if

b × A(N,M) > (b + 1) × A(N/2,M) + 1

This could either be used to determine when to re-hash or a check to safeguard against correlated data.

Space-time Cost Considerations

In the above analysis there has been an underlying assumption that an initial table size has been decided upon and that we are concerned with the relatively infrequent problem of the table becoming full. This has been the standard way that Hash Tables have been used in the past. Space was usually allocated to cover most cases and time was the rare commodity. In modern time-sharing system where both time and space are expensive and a cost function relating the two is defined it may well be that a different approach will be adopted. If, for example, space was very expensive, then it would be reasonable to start with a very small table and change its size several times during the run, only taking more space as it was required. A more likely situation is somewhere in between these two extremes. For example, it might be reasonable to choose a table size initially so that 50 per cent. of the applications do not overflow while the remainder require a re-hash. Any detailed analysis of this would, of course, require knowledge of the cost function which would change from system to system.

ACKNOWLEDGEMENTS

The author is indebted to Dr A. Bond, of Carnegie-Melli University, for several helpful discussions during the preparation of this paper.

REFERENCES

1. Buchholz W., File organization and addressing, IBM Systems Journal, June 1963.

2. Iverson K. E., A Programming Language, New York, John Wiley and Sons Inc., 1962.

3. Peterson W. W., Addressing for random access storage, IBM Journal of Research and Development, 1, 1957.

4. Schay G. and Spruth W. G., Analysis of a file addressing method, Comm of the Assoc Comp Mach, 5, Aug 1962, page 459.

ABOUT THE AUTHOR

Mr Hopgood graduated from Cambridge in 1959 and since then has worked at A.E.R.E., Harwell, A.W.R.E., Aldermaston, and the Atlas Computer Laboratory, Chilton. He has recently returned from a sabbatical year's leave which was spent as a member of the Faculty at the Carnegie Institute of Technology, Pittsburgh.