ACL :: Literature :: Books :: Compiling Techniques

Chapter 4: Tables

The most usual method of storing a table is in a vector, although lists and plexes are more suitable for certain special types of table, such as decision tables. As stated previously, a table is a storage device for items, each of which has a unique name or key associated with it. The key is given and, on a table access, a pointer to the information associated with the item is returned by the table accessing mechanism. If this information is small in size, then the information itself may be returned. When adding an entry to the table, given the key, the table entry mechanism should return a pointer to where the information is to be stored. For example, in an English-French dictionary, the English word for which the French equivalent is required is the key to the table entry, and the corresponding French word is the information accessed by the key.

In many of the implementations of a table, the insertion and retrieval of information are not based on the key itself but on information dependent on the key. Given a key, k, then a mapping function M is defined such that M(k) is the derived information. The function M is used to produce a simpler and more efficient derived key for accessing entries. In the example of the English-French dictionary, the book may have a thumb index, and it is then more efficient to use the thumb index to access the section equivalent to the initial letter of the English word and then to find the entry in the section by a search over the remaining letters. In this case the accessing has been divided into two parts with M1(k) being the initial letter and M2(k) the remainder of the English word. In most compiler applications the mapping function will produce a numeric (usually integer) result in some range. Such a mapping function will be denoted by I(k).

4.1 Direct access tables

In the English-French dictionary suppose that there are only a limited number of English words which need to be looked up and it so happens that no two English words start with the same letter. In this case, if a vector of 26 elements was available and each element was capable of holding the equivalent French word, then all that would be needed is a mapping function I(k) such that, for a word k, I(k) is the position in the alphabet of the initial letter of the word k. The I(k)th position in the vector would then be set to contain the equivalent French word. Such a table is called direct access. Some properties of direct access tables are:

  1. If the number of English words to be looked up is less than 26 then some elements of the vector are not used and will be wasted.
  2. It is not necessary to store the key k itself as the mapping function is unique.
  3. If, later, an English word with the same initial letter as one already entered is to be added then the method breaks down and a different mapping function must be defined.

This may be defined more precisely. Given a vector A having elements A[1] to A[n] and a mapping function I for a set of keys k1 ... km (where m ≤ n) such that I(k) takes a value between 1 and n for all keys and I(ki) ≠ I(kj) for any i and j, then the accessing is equivalent to an array access and the process can be thought of as mapping the table on to an array.
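As a concrete illustration, the direct access scheme can be sketched as follows (a minimal sketch in Python, not the book's notation; the function and variable names are illustrative):

```python
# Direct access table for the English-French example: I(k) is the
# position in the alphabet of the initial letter of the word k, and
# no two stored words may share an initial letter.
def I(key):
    return ord(key[0]) - ord('A') + 1   # 'A' -> 1, ..., 'Z' -> 26

table = [None] * 27                     # positions 1..26 used; position 0 unused

def enter(english, french):
    table[I(english)] = french          # the key itself need not be stored

def lookup(english):
    return table[I(english)]            # a single array access: S = 1

enter("CAT", "CHAT")
enter("DOG", "CHIEN")
print(lookup("CAT"))                    # -> CHAT
```

Note that the key is not stored (property 2 above), and that entering COLD after CAT would silently overwrite it (property 3).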

4.2 Table look-up

The direct access method is only applicable if a mapping function can be found such that the proportion of unused entries is small. If, in the dictionary example, only five English words were required to be looked up, then 21 of the 26 storage elements of the vector would be wasted. Table look-up is a method which economises on storage at the expense of efficiency. The five words could be stored in consecutive elements of a vector and, assuming that each vector element can contain the French word in another field, then to find the French equivalent of an English word it is necessary to search through the elements in some well defined order, comparing the English word that is to be looked up with the English word in each element. When a match is found, that element contains the equivalent French word. It is important to notice that adding entries to the dictionary is simple: they are added at the first empty position in the vector. The vector can either be defined to be as long as the maximum number of elements ever likely to be added, or a more sophisticated allocation scheme may be arranged. To aid in adding new entries it is usual to reserve the first two positions in the vector for the length of the table and the position of the first empty entry.
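A table look-up of this kind might be sketched as follows (illustrative Python; the reserved length and first-empty positions of the vector are replaced here by Python's own list bookkeeping):

```python
# Table look-up: entries are stored in arrival order and found by a
# linear scan, comparing the key field of each element in turn.
entries = []                            # each element holds (key, information)

def add(english, french):
    entries.append((english, french))   # new entries go in the first empty position

def lookup(english):
    for eng, fr in entries:             # search in a well defined order
        if eng == english:
            return fr                   # match: S = position in the vector
    return None

add("CAT", "CHAT")
add("DOG", "CHIEN")
print(lookup("DOG"))                    # -> CHIEN
```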

4.3 Search length

In comparing the efficiency of the various methods of table access it is necessary to introduce the quantity S, defined as the length of search required to find a particular entry in the dictionary. By this is meant the number of dictionary entries that had to be examined before the correct entry was found. In the case of the direct access table, S=1 for all entries. For table look-up, the Mth entry has S=M.

To get a measure of the efficiency of the accessing method, a second quantity T is introduced, which is the average length of search required to find an entry in the table. In general this will of course depend on the frequency with which entries are accessed. For example, if the first entry in a look-up table is required much more frequently than the remaining entries, then the value of T will be approximately 1. If the access frequency is unknown, each entry will be assumed to be accessed the same number of times; the average length of search under this assumption will be denoted by A.

  A = (S1 + S2 + ... + Sn) / n

for n entries where the search length for the ith entry is Si. Obviously A=1 for the direct access table and for table look-up with M entries A=½(M + 1).

In this case the trade-off for a more efficient storage allocation is a longer average search time. In the case of a table with 39 entries the average length of search for table look-up is A=20. Therefore, unless T is much less than A, this kind of search length is too expensive. More sophisticated methods have, therefore, been adopted to access tables.

4.4 Binary search

Returning to the example of the English-French dictionary, great help is given to the user in finding a particular entry by having the entries in alphabetical order. For any entry required, if the dictionary is opened at any word then it is known whether the required word is before or after this position in the dictionary. A simple algorithm for finding the required entry would, therefore, be to open the dictionary in the middle and see if the word looked at was the required word. If not, then it is known whether the required word is in the first or second half of the dictionary. Suppose the word required was in the first half, then the next step would be to look at the middle entry of the first half. If this was not the required entry then once more it would be known in which half of the first half of the dictionary the required word is contained. The procedure is repeated until the required word is matched. This technique is called binary search.
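The halving procedure can be sketched as follows (Python, assuming the entries are held sorted by key; the word lists are illustrative):

```python
# Binary search over entries sorted by key: each probe halves the
# part of the table that can still contain the required key.
def binary_search(entries, key):
    lo, hi = 0, len(entries) - 1
    while lo <= hi:
        mid = (lo + hi) // 2            # "open the dictionary in the middle"
        k, info = entries[mid]
        if k == key:
            return info
        elif key < k:
            hi = mid - 1                # required word is in the first half
        else:
            lo = mid + 1                # required word is in the second half
    return None

words = sorted([("CAT", "CHAT"), ("DOG", "CHIEN"),
                ("COLD", "FROID"), ("DAY", "JOUR")])
print(binary_search(words, "DAY"))      # -> JOUR
```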

For a table or dictionary of N entries, the length of search S for most entries will be approximately log2N and the average length of search A will not be very much less than log2N as the large number of entries found after approximately log2N entries will predominate. More precisely, this method can be defined as finding a mapping function I which produces a unique set of integers, and then the table entries are ordered depending on the values I(k).

A modification of the standard binary search is to attempt to make a better guess at the position to try at each step rather than try the middle of the table. For example, if the English word in the example was CAT, then it would seem reasonable to try approximately 2/26 of the way from the front initially. How good a guess is made depends on how much is known about the dictionary. For example, knowing how many words in the English language start with each letter would enable a better initial guess to be made. In a compiler it will quite often be the case that the actual entries will be a non-random subset of entries taken from the complete set of entries about which some general information is known. In this case Peterson (1957) has shown that the average search length A is approximately ½log2N. For a table with 1000 entries the average length of search would be 5 in place of 10 with the straightforward binary search.

The big drawback with this method, which makes it less useful in a compiler, is that adding further entries is not usually a simple process and will require re-ordering of the entries. It is usually used, therefore, for fairly fixed tables where entries are seldom added.

4.5 Hash tables

The most frequently used of the more sophisticated methods of table accessing is the hash table. This method first seems to have appeared in some reports by H. P. Luhn and A. D. Lin in 1953 and was used by A. L. Samuel, G. M. Amdahl and E. Boehm in an assembler for the IBM 701 in 1954. The name hash table is usual in the U.S.A. although it frequently goes under the name computed entry table in Great Britain. Alternative names for the method that have appeared in the literature are scatter, randomized or key transformation table.

For a fixed length table, the method usually gives a value for A comparable with the more sophisticated methods while being more flexible than the methods, such as binary search, which require re-ordering. In the methods described so far, the mapping function I has been defined to produce a unique set of integers such that, for any two keys k1 and k2, I(k1) ≠ I(k2). In general, the span of the possible values of I(k) is usually too large to use a direct access table. However, given a table consisting of a vector of length M, it may well be possible to produce a mapping function such that the span of the values of I(k) is 1 to M if the uniqueness condition is dropped. That is, it is possible to have two keys k1 and k2 such that I(k1) = I(k2).

This table, called a hash table, can be loaded and accessed as though it were a direct access table until a key k2 is entered for which I(k2) = I(k1), where k1 is a key already entered in position I(k1). The problem is where to place the k2 entry. This situation is called overflow, and the different sub-methods of the hash class depend on the method used to deal with overflow.

4.5.1 Open hash

The open hash technique is the method most generally used in compilers. Given a key k, the algorithm for entering or accessing the key in a table of length M is:

  1. Calculate i = I(k).
  2. If the position i in the table is empty or contains the key k in the name field of the element then exit.
  3. If not, set i=i+1 (mod M) and go back to step 2.

In this and all methods of the hash type, the key for the entry is stored in a field of the table entry called the name field. The search for an entry will always terminate as long as the table is not completely full because of the enforced circularity of step 3.
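The three steps can be sketched directly (illustrative Python using 0-based positions, and assuming the table never becomes completely full):

```python
# Open hash: probe position I(k), then successive neighbours mod M.
M = 8
name = [None] * M                       # name field (holds the key)
info = [None] * M                       # information field

def I(key):
    return ord(key[0]) - ord('A')       # initial letter A..H -> 0..7

def enter(key, value):
    i = I(key) % M                      # step 1
    while name[i] is not None and name[i] != key:
        i = (i + 1) % M                 # step 3: try the next neighbour
    name[i], info[i] = key, value       # step 2: empty or matching position

def access(key):
    i = I(key) % M
    while name[i] is not None:
        if name[i] == key:
            return info[i]
        i = (i + 1) % M
    return None                         # an empty position ends the search

for w, f in [("CAT", "CHAT"), ("DOG", "CHIEN"), ("COLD", "FROID"),
             ("DAY", "JOUR"), ("HOT", "CHAUD"), ("HAY", "FOIN")]:
    enter(w, f)
print(name)   # HAY has wrapped round into position 0, as in Fig. 4.1
```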

For example, suppose that in the English-French dictionary it is known that not more than eight words will be required in the dictionary and only words with initial letters A to H will be required. Initially, the table will be assumed empty, and as each word required appears, it is added to the table. The key, that is the English word, is added to the table, together with the French equivalent which is entered in another field of the element. One of the simplest forms that the mapping function I could take would be for I(k) to be the position in the alphabet of the initial letter of the word k. If now the actual words required, in order, were CAT, DOG, COLD, DAY, HOT, HAY, then the table would be filled as in Fig. 4.1. The first and second words CAT and DOG are mapped into the third and fourth positions, respectively, and, as they are empty, are inserted there. COLD is mapped into the third entry also, but, as it is not empty, the fourth entry is tried. This is also non-empty, so the fifth entry is tried. This is empty and COLD is inserted there. DAY is mapped into the fourth position. This and position 5 are tried, but both are non-empty, so it is inserted in position 6. HOT is mapped into position 8 and entered as it is empty. HAY is also mapped into position 8. This is now occupied, so the next entry is tried which, by the circularity, is entry 1. This is empty, and HAY is therefore inserted there.

In the diagram it is assumed that the hash table is organized as a vector with each element having two fields. It is assumed that both the English and French words may be stored in the element of the vector. Quite often this will not be the case. For example, both the English and French words can be of variable length and it would be uneconomical to allow space for the longest possible words. Therefore, it may be necessary to have either or both the key and information fields indirectly addressed. The hash table would then contain pointers to the information which could be stored tightly packed in a separate vector. If either or both the key and information fields vary greatly then this may well be the most efficient storage method.

The hash table after the six insertions (name field, information field), with words inserted in the order CAT, DOG, COLD, DAY, HOT, HAY:

  1  HAY   FOIN
  2  (empty)
  3  CAT   CHAT
  4  DOG   CHIEN
  5  COLD  FROID
  6  DAY   JOUR
  7  (empty)
  8  HOT   CHAUD

Fig. 4.1

In the example, the two elements mapped into the third position and the two elements mapped into the fourth position have intermingled. In order to access DAY the entries looked at include entries mapped into the third position (in this case COLD). The adjective open derives from this property that the entries searched from a particular entry position are open to elements mapped into other entry positions.

An important result which should always be kept in mind when designing a mapping function is that for random entries the average length of scan is minimised if the mapping function maps equal numbers of keys on to the vector positions 1 to M. In general, this may not be possible because of the size of the vector and the size of the set of possible keys. In this case the mapping should be as close to this desired form as possible.

In the example, a cluster of entries has started to appear following the third and fourth positions. This is unavoidable with the open hash technique. However, if the mapping function is badly designed it may be that there is a high correlation between the probabilities of keys appearing that are mapped into neighbouring entries. Care should be taken that this does not occur, for, if it does, much larger clusters than are necessary will be produced, causing the average length of search to be increased. It may be difficult to alter the mapping function I to stop neighbouring entries being correlated without making I very complex. Another approach is to redefine the term neighbour by changing rule (3) of the open hash definition:

  1. Calculate i = I(k).
  2. If the position i in the table is empty or contains the key k in the name field of the element then exit.
  3. If not, set i = i + P (mod M) and go back to step 2.

The constant 1 has been replaced by P. As long as M and P are coprime all positions of the vector will still be searched. However, the nearest neighbours of position i are now i + P, i + 2P, etc. This change is equivalent to redefining a more complex mapping function, and all results obtained for P = 1 apply for a general P.

In order to simplify the circularity of the table it is usual to have M a power of 2, say 2^R, in which case a simple mask of R bits is sufficient to take the modulus. This also usually helps the definition of the mapping function if this is being done at machine-code level. One inconvenience of a hash table is that, during the debugging stage of writing a compiler, entries will appear fairly randomly throughout the table, which is not very convenient for print-out. It is quite often useful, therefore, to parameterise the routines for accessing the table so that it is easy to change both I and P. In this case, during the debug phase, P could be set to 1, and a mapping function I with I(k) = 1 for all k could be used. The table is then transformed into a simple table look-up.

It is not easy to determine the average length of search A for an open hash table because of the effect of clustering described earlier. Even assuming that the mapping function maps equally on to all positions and assuming that the keys are taken randomly from the set of all possible keys, it is not possible to assume that the entries in the table are randomly distributed, as clustering appears as soon as a significant number of entries have been made. Let the average search length be defined as A(s) where s = N/M, the fraction that the table is filled (sometimes this will be written A(N,M)); N is the number of entries so far inserted in the table of length M. Peterson (1957) did a set of nine simulation runs with keys mapped randomly into the range 1 to 500. The results for these are given in Table 4.1. One point worthy of notice is that the values of A(s) are independent of the order in which the N entries were inserted: any permutation of the order can be broken down into a sequence of interchanges each involving only two entries, and reversing the order of insertion of just two entries can easily be shown to give the same length of search for the two elements. Schay and Spruth (1962) assumed that, in a cluster, all entries mapped into the ith position appeared before those mapped into the (i + 1)th. They were then able to obtain the approximate formula:

    A(s) = (2-s) / (2-2s)

This is also tabulated in Table 4.1, and it can be seen to agree very well with the simulation results of Peterson. An example of a run using non-random data (identifiers taken from the Algorithms published in the Communications of the A.C.M.) is also given in Table 4.1. It will be noticed that because of correlation effects between neighbouring entries, non-uniform mapping and non-random data, the figures of Peterson and of Schay and Spruth were not obtained. However, the results shown were fairly typical of a large number of runs where not too much attention was paid to producing the best mapping function. Peterson, using IBM Data Files, produced similar results. It is important to notice that figures obtained for average length of search are independent of table size, and depend only on how full the table is. The surprising result is that with the table 80 per cent full the average length of search is still around 3.
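The approximation is easy to evaluate directly (a quick check in Python; note that the formula diverges as s approaches 1, so the completely full table must be treated separately):

```python
# Schay and Spruth approximation for the average search length of an
# open hash table that is a fraction s full (0 <= s < 1).
def A(s):
    return (2 - s) / (2 - 2 * s)

for s in (0.2, 0.5, 0.8, 0.9):
    print(f"A({s}) = {A(s):.3f}")   # 1.125, 1.500, 3.000, 5.500
```

Even at 80 per cent full the predicted average search length is only 3, which matches the tabulated simulation figures closely.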

So far all that has been examined is the value of A(s). The value of T(s), the genuine average length of search noting the frequency with which each entry is accessed, may be considerably less than A(s) if the entries required most frequently are entered first in the table.

A(s)

  s      Peterson   Schay and Spruth   ACM Algorithms
  0.1     1.053          1.056
  0.2     1.137          1.125               1.43
  0.3     1.230          1.214
  0.4     1.366          1.333               2.35
  0.5     1.541          1.500
  0.6     1.823          1.750               3.24
  0.7     2.260          2.167
  0.8     3.223          3.000               5.22
  0.9     5.526          5.500
  1.0    16.914                             14.67

Table 4.1

An example of this can be seen in Batson (1965) where a symbol table for Algol identifiers together with the Algol reserved words is defined as an open hash table with the reserved words entered first.

One problem with a hash table of this kind is what to do when the table is full. Various suggestions have appeared which take special action for the additional entries, such as having an additional table which is accessed after the hash table. Another approach is to alter the mapping function so that the table can be enlarged while leaving the keys already entered in the same positions. Alterations of this kind usually produce a non-uniform mapping function, which causes the average search length to go up. The simplest and, as it turns out, most efficient method is to increase the size of the table and re-hash all the entries into the new table. For example, suppose the original table is of length M, the new table is 2M in length, and it is decided to increase the size of the table when the original becomes 80 per cent full. A measure of how much extra work is done in re-hashing can be obtained by assuming that after the re-hash each of the entries is accessed once more. Comparing the time for accessing all the entries had the re-hashing not taken place with the time taken after the re-hash, together with the time for the re-hash itself, gives a measure of how much work is involved.

If the table is 80 per cent full, then the time for accessing each entry once, assuming no re-hashing, will be N × 3.223 using Peterson's simulation figures. The time taken to re-hash is not very well expressed in terms of search-length. The entries in the old table can be removed serially, so that the search length for picking up a key can be taken as 1. The average search length for inserting an entry during the re-hash will be 1.366 (using the 40 per cent full figure for the new table). Accessing each entry once more will be 1.366N.

Therefore, continuing to use the old table would have cost 3.223N, compared with 3.732N for re-hashing (N for removing the entries serially, 1.366N for re-inserting them and 1.366N for the extra access to each entry). These two times are so similar that the cost of the re-hash is obviously soon absorbed by the decrease in the search length for new entries and subsequent accesses to the original entries. These will now have an average search length of 1.366 instead of 3.223.

The question of when to re-hash has been glossed over in the above discussion. A fuller discussion can be found in Hopgood (1968).

4.5.2 Overflow hash

In the open hash method, overflow entries are inserted in the first empty neighbour. An alternative would be to have a completely separate table into which the overflow entries are inserted; a simple look-up table could be used for this purpose. The method is only practical if the number of overflow entries is small.

4.5.3 Overflow with chaining

A variation of the overflow method described above is to have a separate overflow table for each position of the hash table. This subsidiary table can be thought of as a simple table look-up, although it would normally be organised as a single area of storage with pointers passing down the chain of entries associated with any position in the hash table. This is shown in the diagram Fig. 4.2. The zeros in the overflow table are used to represent the end of the chain from a particular entry. In this case the average scan length, assuming randomly distributed data, is

   A(N,M) = 1 + (N-1)/2M
[Fig. 4.2: the hash table holds CAT, DOG and HOT in their key positions, each pointing into a separate overflow table; the chains are CAT → COLD, DOG → DAY → DOOR and HOT → HAY, with zeros marking the ends of the chains. Insertion order: CAT, DOG, COLD, DAY, HOT, HAY, DOOR.]

Fig. 4.2
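With Python lists standing in for the pointer fields, overflow with chaining might be sketched as follows (illustrative names; the mapping function is the initial-letter one used throughout this chapter):

```python
# Overflow with chaining: every hash position heads its own chain of
# entries, and a search scans only the chain for position I(k).
M = 8
chains = [[] for _ in range(M)]         # one (possibly empty) chain per position

def I(key):
    return ord(key[0]) - ord('A')       # initial letter A..H -> 0..7

def enter(key, value):
    chains[I(key) % M].append((key, value))

def access(key):
    for k, v in chains[I(key) % M]:     # only keys mapped into this position
        if k == key:
            return v
    return None

for w, f in [("CAT", "CHAT"), ("DOG", "CHIEN"), ("COLD", "FROID"),
             ("DAY", "JOUR"), ("HOT", "CHAUD"), ("HAY", "FOIN"),
             ("DOOR", "PORTE")]:
    enter(w, f)
print(access("DOOR"))                   # -> PORTE
```

Unlike the open hash, entries mapped into different positions can never intermingle, which is why the average scan length stays so low.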

4.5.4 Overflow with internal chaining

The two methods described above require a subsidiary overflow table. In general, if N ≤ M then there is in fact space in the original hash table to store the overflow entries. A variation of the overflow with chaining technique is called overflow with internal chaining where the overflow entries are chained into the empty positions of the hash table itself. Chaining in the overflow entries cannot of course be done until the hash table has been filled with all of its entries as, until this moment, the unused entries are not known. Production of the table is, therefore, in two parts. Initially only non-overflow entries are stored in the hash table (in our example CAT, DOG and HOT). Once it is known that no new entries will be added, the overflow entries, which have been temporarily stored in an overflow table, are chained into the empty positions. This could be done serially from the top as in Fig. 4.3. The average length of search is obviously identical to the overflow with chaining, that is:

   A(N,M) = 1 + (N-1)/2M

The surprising result is that the table can be completely filled and the average length of search will still be less than 1.5. Also, if the non-overflow positions are filled with the most frequently required entries, then the average length of search T(N,M) can be reduced even further. The drawback is, of course, that, as the complete structure of the table cannot be determined until all entries are known, it is not very useful for tables where new entries are constantly being added (for example, identifier symbol tables). Its main use is, therefore, for pre-loaded tables which are only going to be accessed. This method is particularly useful in, for example, the mnemonic-machine code conversion table of an assembler or a table for looking up reserved words. Some efforts have been made to use this kind of hash table for a conventional table where entries are constantly being added. Overflow entries are filled in at empty positions chosen at random and chained to their key position. If a later key requires this position as its key position, then the overflow entry is moved to another empty position. The overhead for moving elements around means that it is difficult to give an estimate of the average search length for this method.

[Fig. 4.3: the same entries with internal chaining. CAT, DOG and HOT occupy their key positions (3, 4 and 8); the overflow entries COLD, DAY, HAY and DOOR have been chained serially from the top into the empty positions (1, 2, 5 and 6), giving the chains CAT → COLD, DOG → DAY → DOOR and HOT → HAY, with zeros marking the ends of the chains.]

Fig. 4.3

4.5.5 Deletion of entries

In some tables it may be necessary to delete entries. In the hash methods, especially the open hash, deletion modifies the algorithm. In our example, if it was decided that the entry COLD was no longer required, then the action required is not simply to change the entry COLD back to empty: if this was done, the entry DAY could no longer be accessed. Therefore, a subsidiary field of one bit is required, the deleted entry marker, which is set when the entry is deleted. On accessing the key k, position I(k) and its neighbours will be checked until either a match is found or an empty entry is encountered. During the scan of I(k) and its neighbours, the first entry with the deleted marker set is remembered. If this is a new entry then, instead of being added at the empty position, it will overwrite the first deleted entry position if there is one. With deletions, the average length of search A(N,M) will, of course, go up, as at any time there will be deleted entries on some of the access paths.
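The deletion marker scheme can be sketched by extending the open hash routines (illustrative Python; EMPTY and DELETED are sentinel values standing in for the empty state and the one-bit deleted entry marker):

```python
# Open hash with deletion: a deleted position does not stop the scan,
# but may be reused for a later insertion.
M = 8
EMPTY, DELETED = object(), object()
name = [EMPTY] * M
info = [None] * M

def I(key):
    return ord(key[0]) - ord('A')       # initial letter A..H -> 0..7

def enter(key, value):
    i = I(key) % M
    first_deleted = None
    while name[i] is not EMPTY and name[i] != key:
        if name[i] is DELETED and first_deleted is None:
            first_deleted = i           # remember the first deleted position seen
        i = (i + 1) % M
    if name[i] is EMPTY and first_deleted is not None:
        i = first_deleted               # overwrite a deleted entry in preference
    name[i], info[i] = key, value

def access(key):
    i = I(key) % M
    while name[i] is not EMPTY:         # deleted entries do NOT end the search
        if name[i] == key:
            return info[i]
        i = (i + 1) % M
    return None

def delete(key):
    i = I(key) % M
    while name[i] is not EMPTY:
        if name[i] == key:
            name[i] = DELETED           # set the deleted entry marker
            return
        i = (i + 1) % M

enter("CAT", "CHAT"); enter("COLD", "FROID"); enter("DAY", "JOUR")
delete("COLD")
print(access("DAY"))    # DAY is still reachable past the deleted position
```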

4.5.6 Mapping functions

So far, little has been said about how to obtain a mapping function I(k) for a key k. This depends to a large extent on the peculiarities of the keys concerned, so that it is difficult to talk in generalities. It is important to remember that the calculation of I(k) is an overhead on top of the length of search: if the time required to calculate I(k) is equivalent to a search of length 5, then 5 must be added to all our figures for accessing hash tables. If a slightly worse mapping function could be obtained in a time equivalent to a search of length 1, it is unlikely that this would increase the average length of scan by anything like 4. Consequently it is important to compare the work done in calculating the mapping function with the work done in scanning an entry. In general, a slightly worse mapping function which can be obtained much more quickly is usually desirable. The exception is where the hash table is on some secondary storage device and the access time is very large; this is usually the case when hash tables are being used for file manipulation, and much care should then be taken over the mapping function (see Buchholz, 1963). In compilers the table will normally reside in primary storage, and there is then a good case for producing the mapping function very quickly.

Consider the example of designing a mapping function for a hash table of size 128. The table is to be used as a symbol table for storing and accessing the identifiers defined in a Fortran subroutine. The identifiers must start with a letter, which can be followed by up to five more letters or digits. This gives more than 26 × 36^5 possible identifiers to be mapped into 128 entries numbered 0 to 127. The identifiers are available in the computer as a set of octal digits, two digits per character.

    A to Z are represented by 41 to 72 octal. 
    0 to 9 are represented by 20 to 31 octal.
    Space is represented by 01 octal.

The identifier CAT2 would be represented by:

   434164220101

As the table is of 128 entries a mapping function is required which will produce as a result seven binary digits, and the aim is for the possible entries to be spread uniformly over the table positions.

The simplest method that could be adopted would be to truncate the binary number to the top seven bits, giving in the above example 1000111 as the result of the mapping function. Looking at the definition of an identifier, however, it will be noticed that the first character has to be a letter and all letters have the topmost bit set to 1. Consequently, choosing this mapping function would mean that only entries 64 to 127 would ever be mapped on to. This is obviously not uniform, and a better choice would be to extract the seven bits after the initial bit, that is 0001110. In this case, an identifier starting with one of the first fifteen letters of the alphabet would be mapped into the first half of the table and one starting with any of the remaining eleven into the second half. Another possibility would be to remove the top five bits and extract the next seven; in this case the split would be equal between the first and second halves of the table. However, as single letter identifiers appear fairly frequently, it may be that quite an amount of useful information has been thrown away by this method. As can be seen, this changing and adjusting of the mapping function can be carried on until a good balance is obtained, but it may well be that the cost becomes exorbitant in the meantime.

Another technique, called folding, is used to allow the lower characters of an identifier to play a part. Here the second half of the identifier would be added to the first, giving 654265 as the new key to extract from.

A third possibility is division. For example, the octal number 434164220101 could be divided by the decimal number 127 and the remainder taken. This will give a fairly even distribution over the range 0 to 126 at the price of a division. If the time to do the division is small then this may well be the most desirable method to choose.
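The three techniques can be compared on the CAT2 key itself (a sketch in Python; the shift amounts assume the 36-bit, two-octal-digits-per-character representation given above):

```python
# Mapping functions applied to the 36-bit key of CAT2 (octal 434164220101).
key = 0o434164220101

# Truncation: the top seven bits give 1000111; the seven bits after
# the initial bit give 0001110, a more uniform choice.
top_seven   = key >> 29
after_first = (key >> 28) & 0x7F

# Folding: add the second half of the identifier (six octal digits)
# to the first, giving octal 654265 as the new key.
folded = (key >> 18) + (key & ((1 << 18) - 1))

# Division: the remainder on dividing by 127 spreads fairly evenly
# over 0 to 126 at the price of a division.
remainder = key % 127

print(bin(top_seven), bin(after_first), oct(folded), remainder)
```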

© Chilton Computing and UKRI Science and Technology Facilities Council webmaster@chilton-computing.org.uk