Compiling Techniques
Chapter 9: Storage Allocation

9.1 Introduction

The allocation of storage for programmer defined variables in a higher level language is usually straightforward. The languages in use today have been designed so that storage allocation is to a large extent under the control of the user. Consequently, efficient utilisation of storage is the programmer's problem and not the compiler writer's. In Fortran, for example, there are basically only two types of variables. The COMMON variables are global to all routines and must be allocated a length of storage specified by the user. The user defines how this storage is structured, and there is little to be done by the compiler. The LOCAL variables give a little more scope in deciding how storage should be allocated for them. Unless an EQUIVALENCE declaration has been used, there is no assumed relationship between these variables, and storage can be allocated in an order and position that is defined by the compiler. Usually the only important decision is where these variables are stored. Allocation is often done in an area of storage immediately following the routine in question. However, with modern computers and operating systems requiring fixed and variable parts of a program to be separated, allocation in a separate area is becoming more necessary.

In Algol, recursion makes storage allocation slightly more difficult. As each variable defined in a recursive procedure must have storage allocated for it at each level of recursion, it is necessary to define a stack in which storage is allocated to variables as each block is entered and storage is released as each block is left. A good description of such a system is given by Irons and Feurzeig (1961).

The major problems of storage allocation arise when considering how to allocate storage for temporary variables required for partial results during a compilation. These variables are not defined by the language but by the compiler. How many are required and how storage is allocated is completely dependent upon the compiler. Consequently an algorithm is required which will make the allocation most efficient. Efficiency in this context will usually mean attempting to minimise the amount of storage used.

Storage allocation also becomes a problem as soon as the computer does not have a single-level of storage. Problems concerning efficient usage of main and backing storage are out of the scope of this book and are usually dealt with by operating systems surrounding a compiler. However, many computers have a limited set of fast registers with either a much quicker access time than main storage or alternatively more desirable properties. Using such registers efficiently is up to the compiler and algorithms must be defined which allocate these fast registers to variables in the most advantageous way. Many computers have a set of index registers which fall into this category of fast registers. Similarly a multi-accumulator computer would be considered as having the additional accumulators as fast registers.

9.2 Allocating storage for temporary variables

The algorithm to be discussed here is basically for allocating storage to temporary variables required for short periods during a calculation. In Chapter 8 it was necessary to use temporary working spaces for storing parts of an expression that had already been calculated while another part was being calculated. The code generation algorithms did not concern themselves with how storage for these temporary variables should be allocated. They can be thought of as taking a new variable name from an infinite set of unused names each time a temporary variable is required by the algorithm. The storage allocation algorithm has to allocate storage for these variables in such a way that the minimum number of storage locations is required. Obviously, variables having completely disjoint ranges can be allocated the same storage. The range of a variable is defined as the time interval between the initial definition of a variable and its last use.

Consider a set of variables V1, V2, V3, ..., Vn. For each variable Vi, an instruction is defined:

   ST    Vi

which initially sets a value to the variable and indicates the start of the range for Vi. An instruction:

   U     Vi

is defined which indicates a subsequent use of this variable. The last instruction of this type using Vi indicates the end of the range for Vi. This instruction will be used to represent instructions such as ADD, SUB etc. defined in Chapter 8.

A sequence of code produced by the code generator then consists of a set of instructions independent of the Vi together with orders of the two types defined above which use the variables Vi for i = 1 to n. The problem is to allocate storage to the Vi so that the minimum number of locations is required. The assumption is that a variable is no longer required after the last appearance of it in the sequence. In this sense it is assumed that the sequence is complete. Once a variable is no longer required, its storage location may be re-allocated to another variable not yet defined. It is assumed that this sequence of instructions does not contain any entry points other than at the top and that control passes straight through the sequence leaving at the bottom.

For example the tree algorithm in Chapter 8 would produce code for the expression given in Table 8.2 as follows:

   L     h 
   ADD   k 
   ST    V1 
   L     d 
   ADD   e 
   DIV   V1 
   ST    V2 
   L     f 
   MPY   g 
   SUB   V2 
   ST    V3 
   L     b
   MPY   c 
   ADD   a 
   DIV   V3

The names V1, V2 and V3 would be taken from the set of unused names. The storage allocation algorithm to be defined would show that in fact these three variables can all use the same storage location T1. As far as the storage allocation algorithm is concerned the significant parts of the sequence can be written:

   ST    V1 
   U     V1 
   ST    V2 
   U     V2 
   ST    V3 
   U     V3

As the last use of each variable appears before the next is defined, it is easily seen that one storage location is sufficient.

Suppose that a stack of unused storage locations written [T1, T2, T3, ...] is available, and suppose that each time a location is required one is taken from the top of the stack and that a location is returned to the stack as soon as the last use of the variable to which it is allocated appears. The simplest description of the algorithm is then as follows:

Scan the sequence of instructions from the end backwards. For each instruction of the type U Vi, if no storage location has been allocated to Vi, take the top free storage location from the stack, assign it to Vi and replace Vi in the instruction by the address of the storage location. This replacement is also done in the case where the storage allocation had been done previously. For each instruction ST Vi, if no storage location has been allocated to Vi then either there is an error or this order is redundant, as this implies that there are no subsequent uses of the variable. If storage has been allocated to Vi then replace Vi in the instruction by the address of the storage location and, as this is the first use of the variable Vi, the location may now be returned to the free store stack as it is no longer required. Dantzig and Reynolds (1966) have shown that this algorithm is optimal although not uniquely so.

Consider for example:

  (1)  ST   V1 
  (2)  ST   V2 
  (3)  U    V2 
  (4)  U    V1 
  (5)  ST   V3 
  (6)  U    V1 
  (7)  U    V3 
  (8)  ST   V4 
  (9)  U    V3 
  (10) U    V4

and assume the free storage stack is [T1, T2, T3]. Scanning from the end, (10) will cause V4 to be allocated storage T1 and (9) will allocate T2 to V3, leaving the stack as [T3]. Instruction (8) will release T1 so that the stack is then [T1, T3] and (6) will cause V1 to be allocated to T1. Similarly, (5) releases T2 which is then allocated to V2 by instruction (3). The code produced is therefore:

  (1)  ST   T1 
  (2)  ST   T2 
  (3)  U    T2 
  (4)  U    T1 
  (5)  ST   T2 
  (6)  U    T1 
  (7)  U    T2 
  (8)  ST   T1 
  (9)  U    T2 
  (10) U    T1
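The backward scan can be sketched as follows; this is a minimal illustration rather than the book's own code, assuming instructions are encoded as (op, variable) pairs and the free locations are named T1, T2, T3:

```python
# Illustrative sketch of the backward-scan allocator; the encoding of
# instructions as (op, variable) pairs and the location names are assumed.

def allocate_backward(seq, free=("T1", "T2", "T3")):
    stack = list(free)      # stack of unused storage locations
    assigned = {}           # variable name -> location currently holding it
    out = [None] * len(seq)
    # Scan the sequence from the end backwards.
    for pos in range(len(seq) - 1, -1, -1):
        op, var = seq[pos]
        if op == "U":
            if var not in assigned:            # last use of var: allocate
                assigned[var] = stack.pop(0)   # take top of the free stack
            out[pos] = (op, assigned[var])
        else:  # "ST" marks the start of the range
            if var not in assigned:
                out[pos] = (op, None)          # redundant ST (or an error)
            else:
                loc = assigned.pop(var)
                out[pos] = (op, loc)
                stack.insert(0, loc)           # location is free again
    return out
```

Run on the ten-instruction sequence above, this reproduces the allocation shown, with V1 and V4 sharing T1, and V2 and V3 sharing T2.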

The algorithm finds the range of each variable by using the first appearance of the variable in the backward scan to define the end of the range and the corresponding ST instruction to define the start of the range. This is all that is required, and the algorithm can be modified to use a forward scan to produce this information. In the forward scan, the ST instruction defines the start of the range as before. If a count of the number of uses of each variable has been kept then the last use of a variable can also be found, so that the end of the range is known. In the tree algorithm defined in Section 8.7, the occurrence number contains a count of the number of uses of a variable and could therefore be used for this purpose.

Another alternative is to locate the end of the range by the method described in the backward technique although in fact scanning forwards. This is achieved by chaining together all the uses of a particular variable as they are scanned. Associated with each variable Vi are three locations Fi, Li and Ri. Fi is the position of the ST instruction for Vi and Li is the position of the last use of Vi. The range of Vi is therefore (Fi,Li). Ri contains the location assigned to Vi and this is set initially to zero.

As the sequence of instructions is scanned, a ST Vi instruction will cause Fi and Li to be set to contain the position of this instruction in the sequence. An instruction U Vi will cause the contents of Li to be set in place of Vi in this instruction and Li changed to contain the position of this instruction. In the example above the final values of Fi and Li are as follows:

      F1 = 1        L1 = 6
      F2 = 2        L2 = 3 
      F3 = 5        L3 = 9
      F4 = 8        L4 = 10

and the instructions will be:

  (1)  ST   0 
  (2)  ST   0 
  (3)  U    2 
  (4)  U    1 
  (5)  ST   0 
  (6)  U    4 
  (7)  U    5 
  (8)  ST   0 
  (9)  U    7 
  (10) U    8

At the end of the scan, Li points to a chain through the uses of the variable Vi. Each instruction contains the position of the next instruction in the chain in its address part. Once a location has been assigned to Vi, then it is a simple matter to move down the chain replacing the address parts of the instructions by the location assigned to Vi. The range for each variable is now defined by the pair (Fi,Li). These pairs can now be scanned to allocate storage to each variable. If Fi = Li, then the ST instruction is redundant as before.

The ranges are scanned in ascending order of Fi and, in the example, the variables are already defined in this order. Usually the code generation algorithm can be linked to the storage allocation algorithm so that this is always the case. It just requires that the order in which variable names appear for the first time in the sequence of instructions is the same order as positions given in the vectors for Fi and Li. In the example V1 must be defined before V2 and so on. Scanning in ascending order of Fi, the first variable is allocated T1. For each subsequent range (Fj,Lj), the active ranges, already with storage allocated, are examined to see if a range exists, call it (Fi,Li), such that Li < Fj. If a range exists, then Vj is allocated the same storage location as Vi and the range (Fi,Li) is marked as inactive. At each stage therefore the active ranges define the current variables to which the locations are allocated, and when these locations will become free. If no range exists with Li < Fj, a new location must be taken from the stack and allocated to Vj.

In the example, V1 would be allocated T1. As L1 > F2, a new location is required for V2 and so this would be allocated T2. However, L2 < F3 so V3 can use the same location as V2 (that is T2). The range for V2 is now marked as inactive. As L1 < F4, V4 is also allocated T1.
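The forward version can be sketched in the same style (again illustrative: the names and the representation of instructions are assumptions, not the text's). A first pass records Fi and Li; a second pass scans the ranges in ascending order of Fi, reusing the location of any active range that has already ended:

```python
def allocate_forward(seq, free=("T1", "T2", "T3")):
    # First pass: record Fi (position of ST) and Li (position of last use).
    F, L = {}, {}
    for pos, (op, var) in enumerate(seq, start=1):
        if op == "ST":
            F[var] = L[var] = pos
        else:
            L[var] = pos
    # Second pass: scan the ranges in ascending order of Fi.
    loc_of, active, stack = {}, [], list(free)
    for var in sorted(F, key=F.get):
        if F[var] == L[var]:
            continue                        # redundant ST: no later use
        for other in active:
            if L[other] < F[var]:           # that range is over: reuse it
                loc_of[var] = loc_of[other]
                active.remove(other)        # mark the old range inactive
                break
        else:
            loc_of[var] = stack.pop(0)      # no ended range: new location
        active.append(var)
    return loc_of
```

On the example, V1 and V4 are given T1 and V2 and V3 share T2, as in the text.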

9.3 The Belady algorithm

Some computers have a small number of fast registers which are similar to main storage except that the time required to access information stored in the fast registers is very much less than the access time for main storage. They may, in fact, be index registers or additional accumulators which can use their contents in arithmetic or logical operations, but such properties will be ignored at present.

These fast registers can be used to contain copies of variables which have been allocated main storage and, in the case of a temporary variable, it may be that it can reside in a fast register for its entire existence so that no main storage need be allocated. As long as there are more fast registers available than variables required, there is no problem and the fast registers can be allocated by an algorithm like those described in Section 9.2. The situation will, however, arise when the number of fast registers available is inadequate, and a decision has to be made as to which of the variables should continue to reside in the fast registers. This will depend on the frequency of use of the different variables and the algorithm must aim to maximise the utilisation of the fast registers.

The Belady algorithm is one which produces an optimal result for a rather restricted situation and often gives good results in more general cases. It is assumed that the fast registers are to be used to contain copies of the variables in main storage and that it is only possible to operate on these variables by first loading them into fast registers. An example of a computer of this type is the CDC 6600. At the moment it will be assumed that these variables will not be altered, so that if a variable in a fast register is to be replaced, it is not necessary to store the current contents of the fast register as a copy exists in main storage already.

The problem is analogous to the replacement of pages in a two-level storage system. For example, on a computer with main storage and a drum for backing storage, it is often necessary to keep parts of a program and its data in the main store and the remainder on the drum. The program and its data can only be accessed from the main store but not from the drum. When a part of the program is required which is currently on the drum, it is necessary to remove a section or page from the main store and replace it by the relevant page of drum storage. Various algorithms are used to decide which page of main store to remove but all tend to be inefficient as future accesses to the various pages in main store are unknown when the decision has to be made. An optimal algorithm was defined by Belady (1966) in the case where the complete pattern of future accesses is known. This has been used in the two-level storage system to give a measure of the efficiency of the many heuristics adopted. The allocation of fast registers differs from the two-level storage problem as allocation need not be done until a complete sequence of code is available. Consequently, the Belady algorithm itself can be used to give optimal results. The algorithm has been used for several years in compilers and it is only recently through its use in the two-level storage system that the name Belady has been associated with it.

Fig. 9.1: The ranges of the variables V1 to V5 plotted against instruction positions 1 to 13

The ranges in which the main variables Vi are used can be shown by a diagram like Fig. 9.1. Each horizontal line represents the range of a variable with the dots giving positions in the instruction sequence where the variable is used. It will be assumed that only one variable is used in each instruction so that two or more dots cannot appear on a vertical line. The left hand end defines the instruction where the variable was first used and the right hand end defines the last use. The number of lines intersected by a vertical line indicates the number of storage locations that are in use at that point. If a vertical line intersects more lines than the number of fast registers available then one or more of the variables cannot reside in the fast registers.

Each time a variable is accessed from a fast register it will be assumed that this takes a negligible amount of time. If a variable is not available in a fast register then the contents of one of the fast registers must be replaced by the required variable. If a function S(i,t) is defined where:

  S(i,t) = 0 if Vi is not in a fast register at time t.
  S(i,t) = 1 if Vi is available in a fast register at time t.

then for each point t in the sequence:

  S(1,t) + S(2,t) +  ...  ≤ N

where N is the number of fast registers available (N is assumed independent of t). If m(t) is defined as the number of the variable used in instruction t then a utilisation function U can be defined as:

  U = S(m(1),1) + S(m(2),2) + ...

The maximum value of U (subject to the constraints given above) will ensure that most uses of variables will take place from the fast registers. The only other constraint is that:

   S(m(t-1),t) = 1

for all t. This ensures that each variable is brought to a fast register when used. The Belady algorithm for producing an optimal solution to the problem is as follows:

The sequence is examined starting at the first instruction defined at t = 1 in Fig. 9.1. For each t the variable Vi where i = m(t) is examined and action taken as follows:

  1. The variable Vi has a fast register allocated to it and therefore this register is used.
  2. No fast register is allocated to Vi but an unused fast register is available and so this is allocated to Vi.
  3. The variable Vi has no fast register allocated to it and all fast registers are allocated. The variables currently having a copy in the fast registers are examined and if one is no longer required then this fast register is reallocated to Vi. If all fast registers are still in use then the variable whose next use is furthest away from this point has its fast register reallocated to Vi.

In the example given in Fig. 9.1, consider the case where three fast registers are available and call them R1, R2 and R3. Starting from t = 1 the first three instructions will cause R1, R2 and R3 to be allocated to V1, V2 and V3. At t = 5 all three fast registers are still in use but a register is required for V4. V1 is not used again until t = 10 whereas V2 is used at t = 6 and V3 at t = 8. Therefore R1 is reallocated to V4. At t = 7 the problem arises again. V4 is not required until t = 11 whereas V3 is used at t = 8 and V2 at t = 9. Therefore R1 is now reallocated to V5. At t = 10, V1 is used once more. The first part of condition (3) above now applies as V2 is no longer required. R2 is therefore allocated to V1. Similarly, at t = 11, R2 is reallocated to V4 as V1 is now no longer required. R1 and R3 continue to be allocated to V5 and V3 until the end of the sequence. This allocation of fast registers is shown diagrammatically in Fig. 9.2.
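The three cases of the Belady algorithm can be sketched as follows (an illustrative Python rendering; the function name and the list-of-dicts trace are not from the text):

```python
def belady(seq, regs):
    # seq: variable names in order of use; regs: the fast register names.
    held = dict.fromkeys(regs)      # register -> variable (None = unused)
    trace = []
    for t, v in enumerate(seq, start=1):
        if v not in held.values():          # case 1 needs no action
            free = [r for r, u in held.items() if u is None]
            if free:
                held[free[0]] = v           # case 2: an unused register
            else:
                # Case 3: prefer a variable that is no longer required,
                # otherwise the one whose next use is furthest away.
                def next_use(u):
                    return next((i for i in range(t, len(seq))
                                 if seq[i] == u), float("inf"))
                victim = max(held, key=lambda r: next_use(held[r]))
                held[victim] = v
        trace.append(dict(held))
    return trace
```

With three registers and the sequence of Fig. 9.1, the final configuration is R1 = V5, R2 = V4, R3 = V3, matching the walk-through above.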

Fig. 9.2: The allocation of the fast registers R1, R2 and R3 to the variables V1 to V5 over instruction positions 1 to 13

9.4 Index register allocation

The Belady algorithm is only optimal for a very limited set of problems although it is usually an effective algorithm in many other cases. This is intuitively reasonable, for each time a register has to be re-allocated, the aim is to remove the variable which will keep the problem from reappearing for the longest possible time. The Belady algorithm is not optimal as soon as the cost for removing each variable is different.

The example given in Section 9.3 of a set of fast registers being allocated to variables in main store could be extended so that these variables can be altered as well as used. A sequence of variables used will be written:

   (Vi, Vj, Vk, Vl, ...)

This notation implies that variable Vi is used, followed by the use of variable Vj and so on. The example in Fig. 9.1 can then be written as (V1, V2, V3, V2, V4, V2, V5, V3, V2, V1, V4, V5, V3). A variable which is altered as well as used will be denoted by * following the name of the variable. For example (V1, V2*, V1) will denote a sequence of instructions which uses V1, uses and resets V2, then uses V1 again. If the copy of a variable that has been loaded into an index register is altered and later the register is re-allocated to another variable, then the contents of the index register must overwrite the old value of the variable in main store and the new variable loaded into the index register. This requires two store accesses and so the cost of such an operation should be 2 units compared with 1 when the old variable has not been altered.

Consider, for example, a sequence (V1, V2, V3*, V2) and two index registers containing initially V2 and V3*. The Belady algorithm would cause the contents of the index registers to be altered as follows:

  Sequence    Index Registers    Cost
                  V2 V3* 
    V1            V2 V1            2
    V2            V2 V1            0
    V3*           V2 V3*           1
    V2            V2 V3*           0

however:

  Sequence    Index Registers    Cost
                  V2 V3* 
    V1            V1 V3*           1
    V2            V2 V3*           1
    V3*           V2 V3*           0
    V2            V2 V3*           0

This has a total cost of 2 units. The Belady algorithm breaks down because of the 2 unit cost of removing V3 when it is going to be modified later on, so that this modified value in main store is never used.

Starting at the head of the sequence, there is an initial configuration of the index registers and a finite number of changes which can be made to their contents so that the variable required in the first position of the sequence is available in an index register. Each of the configurations arrived at after the first variable in the sequence has been allocated an index register is now examined in turn. Each of these generates a finite set of configurations which load the second variable into an index register and so on. A complete graph can be built up of all the possible different states that the index registers may pass through as the sequence is evaluated. This is shown in Fig. 9.3 for the example above. Associated with each transition from one state to another is a certain cost and this is marked on the line joining the two states. With each state is marked in brackets the minimum cost required to arrive at this state.

For a complete sequence, the path with minimum cost to the final step is the optimal one required. General algorithms for finding shortest paths have been in use for many years (see Dantzig (1960)) and algorithms of this type would solve the problem. However, the type of graph that appears in index register allocation is a subset of the general class and special algorithms can be used to prune the graph.
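The minimum cost path through such a graph can be found by a dynamic programming scan that keeps, for every reachable register configuration at each step, only the minimum cost of arriving there, generating only single-replacement transitions. This is an illustrative sketch under those assumptions, with a configuration encoded as a frozenset of (variable, modified) pairs and a cost of 1 for replacing an unaltered variable and 2 for an altered one:

```python
def min_cost_allocation(seq, initial):
    # seq: (variable, modified) pairs, e.g. [("V1", False), ("V3", True)].
    # initial: frozenset of (variable, dirty) pairs held in the registers.
    states = {initial: 0}                  # configuration -> minimum cost
    for var, star in seq:
        nxt = {}
        for state, cost in states.items():
            entry = next((e for e in state if e[0] == var), None)
            if entry is not None:
                # Variable already resident; a modifying use marks it *.
                new = (state - {entry}) | {(var, entry[1] or star)}
                nxt[new] = min(nxt.get(new, float("inf")), cost)
            else:
                for victim in state:       # replace exactly one variable
                    step = 2 if victim[1] else 1   # altered: store + load
                    new = (state - {victim}) | {(var, star)}
                    nxt[new] = min(nxt.get(new, float("inf")), cost + step)
        states = nxt
    return min(states.values())
```

For the sequence (V1, V2, V3*, V2) starting with V2 and V3* in the index registers, this yields the minimum total cost of 2 units found above.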

Fig. 9.3: The graph of possible index register states for the sequence (V1, V2, V3*, V2), starting from the configuration V2 V3*, with the cost of each transition marked on its branch and the minimum cost of reaching each state given in brackets

Such an algorithm is described by Horwitz, Karp, Miller, and Winograd (1966). This paper first defines a set of possible paths from each node which are minimal change branches. It proves that a shortest path does exist through the graph consisting of only minimal change branches. As only one such path is required it is necessary to generate only minimal change branches.

Possible minimal change branches are defined as follows:

  1. If the variable used at the next step in the sequence is already in an index register then this is used rather than loading a second copy of the variable into another index register. If the variable is modified by the step then the variable in the final configuration will be marked with * even if the original was not.
  2. If the variable used at the next step is not in an index register, then only one variable is removed from an index register and replaced by the required variable. No other changes are made.

Both of these rules are intuitively reasonable, and in producing Fig. 9.3 the second had already been applied. Therefore Fig. 9.3 is strictly only a subgraph of the complete graph. Removing paths which are not minimal change branches because they violate (1) would cause the node V2 V2 to be removed from Fig. 9.3. The graph can be further pruned as it is being produced by a set of rules which can be shown to leave at least one shortest path. These rules are as follows:

  1. If several branches arrive at a node then only one branch need be kept and this must be one of the paths which give minimum cost at this position.
  2. A node having no branches leaving it can be eliminated together with all the branches pointing at that node.
  3. If at some level it is possible to show that one of the existing nodes (call it A) can be changed into another node (call it B) on the same level (same point in the sequence of instructions) and the total cost to arrive at node B by this path is less than or equal to the minimum cost for node B then node B can be eliminated.

In the example in Fig. 9.3, on the lowest level V2 V3* can be changed to V1 V2 at a cost of 2 units. Arriving here by this path would therefore cost 4 whereas the minimum cost for V1 V2 is 5 units. V1 V2 can therefore be eliminated by rule (3). The path from V1 V3* to V2 V3* does not produce a minimum cost at V2 V3* and so this path can be eliminated by (1). V1 V3* now has no branches leaving it and so is removed by (2). Repetition of the use of rules (1) and (2) will eventually reduce the graph to Fig. 9.4 where only the minimum path is left.

Fig. 9.4: The pruned graph, reduced to the single minimum cost path V2 V3* → V1 V3* (1) → V2 V3* (2) → V2 V3* (2) → V2 V3* (2)

In general, the rules described above will not reduce the graph to a single path and if a large number of variables and index registers is available then the size of the graph could still be too large to enumerate completely. Two additional rules which involve look ahead are therefore introduced to prune the graph still further.

The first rule is merely an extension of the rule used in the Belady algorithm. As was shown earlier, the Belady algorithm is no longer applicable and the reason is that the cost of storing an altered variable is now 2 and this makes it more desirable to keep it in the index register if possible.

(4) If a variable in an index register is no longer required then this should be replaced when an index register is required. If all variables in index registers are still required and the next step uses a variable not available in an index register then each pair of variables Vi and Vj currently resident in index registers are considered. Vi should not be removed if:

  1. Vi and Vj are in index registers and the next appearance of either Vi or Vi* is before the next appearance of Vj or Vj*.
  2. Vi* and either Vj or Vj* are in index registers and the next appearance of Vi* is before the next appearance of Vj or Vj*.

(5) This rule considers two nodes on the same level which differ only in the contents of one index register, and this will be defined as Vi in one node and Vj in the other. The node involving Vj can be eliminated if:

  1. Vi is in the index register and either Vj or Vj* is in the index register at the other node. If either Vi or Vi* appears before Vj or Vj* in the following sequence of instructions and the minimum cost of arriving at the node involving Vi is less than or equal to the node involving Vj then the node involving Vj can be eliminated.
  2. Vi* is in the index register at the first node and Vj at the second. If Vi* appears before Vj or Vj* in the following sequence and the minimum cost of getting to the node involving Vi* is one or more less than the minimum cost of getting to the node involving Vj then the node involving Vj can be eliminated.
  3. Vi* is in the index register at the first node and Vj* at the second. If Vi* appears before Vj or Vj* in the sequence and the minimum cost of getting to the node involving Vi* is less than the node involving Vj* then the node involving Vj* can be eliminated.

These rules are usually sufficient to reduce most graphs to a size where the minimum path can be found by enumeration.

In some cases the allocation can be divided into a set of subproblems by choosing points where the original sequence is to be broken. Each sub-sequence then has its own initial configuration of the index registers and the break-point is chosen so that the final contents of the index registers at the completion of one sub-sequence are what is required to start the next sub-sequence. For example suppose that the sequence of variables is (. .. V1* V2* V1* . . .) and only two index registers are available. After the second appearance of V1* the contents of the index registers in the minimum cost path must be V1* and V2* by rule (4) above. Therefore the sequence could be broken at this point and the remainder of the sequence treated separately with the initial configuration having the index registers set to V1* and V2*. Details of such an algorithm are given by Luccio (1967) and Belady (1966).

© Chilton Computing and UKRI Science and Technology Facilities Council webmaster@chilton-computing.org.uk