Chilton::ACL::Compiling Techniques

Chapter 3: Data Structure Mappings

This chapter is intended to give some examples of the ways in which the abstract data structures, defined in Chapter 2, are mapped on to internal data structures.

3.1 Stacks and queues

The conventional method of representing a stack in a computer is by a vector. The base of the stack is usually the first position in the vector. A pointer, P, is set to point to the top-most element of the stack. The length of the vector is usually set to the maximum length of the stack that is expected. Insertion of an item consists of incrementing the pointer P and inserting in the address pointed to by P.

A queue, however, is not easily represented as a vector, so it is usually represented by a list. The reason being that, if it was represented as a vector, each time an item was removed from the queue, it would either be necessary to move the elements down the vector to fill up the gap made at the base or alternatively the set of elements in the queue would gradually move down the vector. When represented as a circular list with the pointer to the list pointing at the last item added, both adding and removing items from the queue is simple. It can easily be seen that these operations can be achieved without doing a complete search of the list which would be necessary for adding items if the pointer was to the first item of the list.

3.2 Strings

Strings are most usually stored as a vector. Whether or not more than one item of the string is stored in a single computer word depends to a large extent on the operations available in the computer's order code and the operations required upon the string. If it is necessary to do operations on the string, such as breaking the string into two parts or matching two strings, then it is often more reasonable to store the string as a list. If the string can have variable length and matching is required, then it is often useful to keep a count of the number of elements in the string in the initial element.

3.3 Trees and directed graphs

These, in general, require an arbitrary number of pointers from each node and, although it is possible to represent them by list structures, it is more usual to use a plex of some kind. In the case of arithmetic expressions, for example, most operators are binary and the structure shown in Fig. 2.5 would be a reasonable one to use.

In some cases it is necessary to find parts of the tree or graph with the same pattern. In an arithmetic expression, for example, it might be desirable to find common sub-expressions. In this case the nodes of the tree could be stored in a table. The information at the node and the nodes pointed to by this node could define the key. The information part of the table could be used to carry any additional pointers or information required.

3.4 Arrays

Arrays are almost invariably mapped into vectors in the computer store. In the case of arrays which are sparsely filled, that is have only a few non-empty entries, it is sometimes useful to store arrays as a table of the non-empty entries. However, this occurs infrequently and only vector storage will be discussed here.

Representing the vector by array A[1:N] then an array defined as:

    array B[i:j]

would normally be mapped such that:

    B[k] is mapped into A[k-i 1]

Similarly, an n-dimensional rectangular array will be mapped into a vector. This is usually done in one of two ways. Either the first subscript is allowed to vary most quickly or alternatively the last. This is best illustrated with an example. The array defined as:

     array B[1:2, 3:5]

would have its items stored in the vector in the order

     B[1,3], B[2,3], B[1,4], B[2,4], B[1,5], B[2,5] or 
     B[1,3], B[1,4], B[1,5], B[2,3], B[2,4], B[2,5]

Given the declaration:

    array B[i₁:k₁, i₂:k₂, ..., i_n:k_n]

then the item B[j₁,j₂, . . ., j_n] could be mapped into the vector in position:

  A[1 + (j₁-i₁) D₁ + (j₂-i₂) D₂  +..+ (j_n-i_n) D_n  ] 
where 
  D_i=1 and D_m = (k_m-1 - i_m-1 + 1)D_m-1

3.4.1 Array accessing using dope vectors

So far a description of how the array is stored has been given. It is now necessary to describe the way in which an individual item of the array is accessed. The simplest approach adopted is to allow just sufficient space for the array item in the vector and, whenever an item of the array is required, to access it by calculating

(j₁-i₁) D₁ + (j₂-i₂) D₂  +..+ (j_n-i_n) D_n

However, both D_m and i_m depend only on the array definition and not on the individual item accessed. For efficiency, therefore, the quantity

i₁D₁ + i₂D₁ + .. + i_nD_n

and the individual D_m values could be stored to avoid recalculation. They could be stored with the array itself, although it is more usual to use a separate vector called the dope vector. Two arrays with the same parameters could then, to a large extent, use the same dope vector.

It is possible that a request for an array item is made with either too many or too few subscripts. Also, individual subscripts must be within the declared range. To aid in checking such illegal requests it is usual to store the dimensionality and upper and lower bounds of each subscript in the dope vector.

For example, the declaration for B given as an example might produce two pointers associated with B. The first would point to the vector containing the array items; the second to a dope vector containing:

Number of dimensions
Lower and upper bounds for each subscript
Number of items in array
The D_m for m = 1 to n
i₁D₁ + i₂D₂ + .. i_nD_n

Two different arrays sharing the same storage could be defined by having the first pointer pointing to the same vector. Two arrays having the same shape and size could use the same dope vector by having the second pointers the same.

3.4.2 Array accessing by Iliffe vectors

A second method of accessing, due to Iliffe, is frequently used. This requires more space but will give faster access for individual items. Stored with each array is a set of pointers. For example, the array defined as:

   B[4:6, -2:1, 0:1]

would be stored by a structure shown in Fig. 3.1. Using the notation (B+5) to mean the contents of the address B+5 then the item B[i,j,k] is accessed by:

    (((C) + i) + j) + k

Fig. 3.1

The contents of location C are taken, i is added to this, and the contents of this address give a pointer to the next level down, and so on.

It is important to note that accessing an array item does not now require any multiplications. However, in the example one 3-vector and three 4-vectors are required as well as the array vector itself; that is, instead of 24 elements, 39 elements are required. The method is most economical when the ranges of the subscripts are increasing so that the largest range is the last subscript. It is most inefficient when the largest range is the first subscript.

In the example given no check has been made that the subscript given is a legal one. This could be introduced by storing with each element of the Iliffe vector the lower and upper bound of the subscript. For the case of the rectangular array this could be uneconomical. The vectors E, F and G in the example would all have each element with the same lower and upper bounds. It is important to note, however, that these elements could have different subscript bounds, and a much more complex array structure can be accessed by the use of Iliffe vectors.

For example, Fig. 3.2 shows a set of Iliffe vectors for accessing a three dimensional array having only the following items stored in order:

B[4,-1,-1], B[4,-1,0], B[4,-1,1], B[4,-1,2], B[4,0,0], 
B[4,0,1],   B[4,0,2],  B[4,0,3],  B[4,0,4],  B[4,0,5], 
B[4,0,6],   B[5,1,2],  B[5,2,2],  B[5,2,3],  B[5,3,2], 
B[5,3,3],   B[5,3,4],  B[5,3,5],  B[6,0,0],  B[6,0,1], 
B[6,0,2],   B[6,0,3],  B[6,0,4],  B[6,0,5]