Simply Partitioned Data Structures: The Compiler-Compiler Re-Examined

R A Brooker and J S Rohl

Machine Intelligence 1 Workshop, September 1965

Manchester University

1. INTRODUCTION

Languages such as Fortran and Algol are concerned with fairly simple types of data, mainly integers and reals, and with rectangular arrays of these quantities. In this paper we consider some of the problems which arise with more complex types of data and data sets. Interest in this aspect of computer languages arose partly from a reappraisal of the Compiler-Compiler (Brooker, MacCallum, Morris & Rohl 1963, Rosen 1964) and partly from the need to introduce further numerical and non-numerical types of data into our algebraic compiler. Some features of the Compiler-Compiler, particularly the facilities for processing the tree produced by the syntax analysis, are useful enough to be included in a general purpose programming language. It is also relevant to the implementation of some of the facilities of Iverson's Programming Language (Iverson 1962).

The discussion relates to a simple language model with an Algol-like nested block structure. We assume that the reader understands what is meant by the scope of a declaration in this context. Ignoring input-output there are four main types of statement in the language:

The distinction between them is not always clear cut. For example a conditional expression can be regarded both as a control statement and an arithmetic assignment. We propose to ignore these niceties, however, and concentrate on the first two classes of statement.

2. DATA DECLARATIONS

These can take three forms

  1. <type> : <name list>
  2. <type> name <name list>
  3. <type> type <name list>

(1) The first of these

<type> : <name list>

is the most familiar. It allocates storage space for a set of objects of the given <type>, these objects to be known by the names in the list. (The : can be omitted if no ambiguity results.) For example,

 
real matrix (m, n) u, v, w

allocates space to three m × n matrices with real elements. The matrices are to be referred to as u, v, w. In general the <type> has both a static and a dynamic component. In this case, if m, n are computed quantities, storage allocation cannot take place until the statement is encountered at run time. The usual solution is to reserve space in the stack, or section of stack, associated with the current block. The dimensions thus constitute the dynamic component of the <type>; the static component, i.e., the part known at compile time, being the fact that it is a real matrix. If the language restricts one to assigning constant dimensions, for example,

 
real matrix (10, 10) u, v, w

then everything is known at compile time, although it may still be convenient to allocate storage space dynamically. (Note: the Algol declaration

 
real array a, b, c [1 : n], u, v, w [1 : m]

is simply a multiple declaration in which a common, static, component has been factored out.)

Another way to specify <type> is by reference to a previously declared name. For example,

 
u : a, b, c

allocates storage space for three objects a, b, c of the same <type> as u, the dynamic characteristics, if any, being those which u possesses when the declaration is encountered at run-time. In this case, of course, u must refer to an actual object, but if the <type> is purely static then u can be a name defined as in (3) below.

(2) The second form of declaration is

 
<type> name <name list>

and will be referred to as a name list.

For example,

 
t name p, q, r

means that the identifiers p, q, r will be used to name objects with the same static characteristics as t. Thus if t has previously been declared a real matrix, then p, q, r can be dynamically assigned (see later) the addresses of similar objects (though not necessarily with the same dimensions, if these are a dynamic characteristic). Similarly real name p, q, r would mean that p, q, r are to be used for the names of reals.

(3) The third form of declaration

 
<type> type <name list>

attributes the given static <type> to the names in the list. It cannot be used for a <type> possessing dynamic characteristics. For example, real type real does not allocate any storage space but allows us to write real: a, b, c as an alternative to real a, b, c provided it is within the scope of real. By defining static types in this way one may save a good deal of underlining. The main purpose, however, in introducing this form of declaration is a theoretical one - to allow names to be given the characteristics of an object without at the same time allocating space for the object itself.

3. ASSIGNMENTS

(a) Data evaluation:

Examples of statements for evaluation and assigning data are:

 
x:=y:=z:=1+cos(2×pi×x/n)     Algol 
z:=1+Cos[2Pi x/n]	     CPL 
2πx/n cos+1 => x => y => z   Anon

We use the last of these styles for our illustrations. The assignments will be called value assignments to distinguish them from those of the next section.

(b) Name assignment:

We shall sometimes need to manipulate data names. For example, given the declarations

 
real x; complex z; x name y

the statement

 
z im => x

transfers the imaginary part of z to x but

 
z im ≡> y

transfers its name to y. The latter operation allows y to be used as a name for the imaginary part of z, so that, for example,

 
y => x 

is then equivalent to

 
z im => x

Similarly x ≡> y allows y to be used as an alternative name for x. One may only transfer names in this way if the source and destination have the same static type characteristics.
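The distinction between value assignment (=>) and name assignment (≡>) can be sketched in modern Python. This is a hedged illustration, not the original implementation: Cell, Name, value_assign and name_assign are invented names, and a "name" is modelled as a cell holding a reference to an object.

```python
class Cell:
    """A storage location holding one real number."""
    def __init__(self, value=0.0):
        self.value = value

class Name:
    """A name cell: holds the address (here, a reference) of a Cell."""
    def __init__(self, target=None):
        self.target = target

def value_assign(source, name):
    # source => name : copy the data into the object the name refers to
    name.target.value = source.value

def name_assign(source, name):
    # source ≡> name : copy the reference itself into the name cell
    name.target = source

x = Cell(0.0)
z_im = Cell(3.5)        # stands for "z im", the imaginary part of z
y = Name(x)             # y initially names x

name_assign(z_im, y)    # z im ≡> y : y now names z's imaginary part
y.target.value = 7.0    # writing through y updates z im, not x
```

After the name assignment, `y => x` in the paper's notation would copy 7.0 into x, exactly as `z im => x` would.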

Why should we wish to make name assignments? One use of the facility is simply to change a name, possibly to introduce an abbreviation, for example:

 
real temperature; temperature name t
temperature ≡> t

Related to this is the use of name lists and value lists to define the formal parameters of a routine, for example,

 
routine <name> integer name i; real a, b; 
begin................ end
         

The formal parameters are those between the routine <name> and the first begin. A call for this routine might be

 
<name>(n, u+3, y)

The calling mechanism consists of making name or value assignments between the actual parameters and the corresponding formal parameters according to whether the latter are names or actual locations. In this case they amount to

 
n ≡> i; u+3 => a; y => b
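The calling mechanism just described can be sketched as follows (a simplified model with invented names; Python's argument passing stands in for the name assignment, and explicit copies stand in for the value assignments):

```python
class Cell:
    """A storage location holding one value."""
    def __init__(self, value=0):
        self.value = value

def routine(i, a_value, b_value):
    # 'integer name i' : i is bound by name assignment (n ≡> i),
    #                    so writes through i are seen by the caller.
    # 'real a, b'      : a, b are actual local locations filled by
    #                    value assignment (u+3 => a; y => b).
    a = Cell(a_value)
    b = Cell(b_value)
    i.value += 1        # changes the caller's n
    a.value += 1        # changes only the local copy
    return a.value, b.value

n = Cell(5)
u = Cell(2)
y = Cell(9)
# the call <name>(n, u+3, y)
result = routine(n, u.value + 3, y.value)
```

Here n is changed through its name (to 6), while u and y are untouched; the expression u+3 could never be passed by name, since it denotes a value rather than a location.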

Another and important use of name assignments is to record as intermediate quantities names which would otherwise need to be recomputed, possibly at some considerable cost, for example, array elements a(i, j, k). Such a need is likely to arise when analysing deeply structured data such as trees.

4. PHYSICAL REALISATION

The names in a name list are represented by cells which hold the addresses of the objects they refer to, for example,

[Diagram: a name cell, A, holding the address of the OBJECT it refers to]

It is convenient to assume, for the sake of uniformity, that all names including those given to objects when they are first declared refer to the associated objects in this way. (It is not, of course, conducive to efficiency, since it introduces an extra store access operation.)

The object shown above is pictured as occupying a consecutive set of storage locations. This need not always be the case provided the structure contains within itself the information necessary for the compiler (with the aid of the static type information) to construct appropriate processing instructions.

Two examples illustrate the physical connections established as a result of name assignment:

 
(i) real array (2 : 6) a; real name last a; a(6) ≡> last a
[Diagram: the name cell a points to the first of the elements a2, a3, a4, a5, a6; the name cell last a points to a6]
 
(ii) complex z; real name x; z im ≡>  x
[Diagram: the name cell z points to the complex object; the name cell x points to its imaginary part]

The data structures used in these examples can easily be partitioned into their basic units (real numbers). In designing more complex structures, however, it is necessary to bear in mind how we shall want to partition them. Consider, for example, the case of a matrix and a vector

 
matrix A(10, 10); vector v(10)

If the matrix is stored by rows we have

[Diagram: the name cell A points to the 1st row; the 1st, 2nd, ..., 10th rows are stored consecutively]

while the simplest representation of a vector is

[Diagram: the name cell v points directly to the elements of the vector]

(It is assumed in both cases that dimensions are constant and form part of the static type characteristics.) The result of the name assignment given by

 
v name u; A(2, *) ≡> u

is to copy the address of the 1st element of the 2nd row of A into the name cell u, thus

[Diagram: the name cell u points to the 1st element of the 2nd row of A]

Note that we cannot write A(2, 0) ≡> u because A(2, 0) is the name of an element not a vector. For this reason a language involving operations of this kind (e.g., Iverson's Programming Language (Iverson 1962)) requires a distinguishing notation such as that used here. The representations just described do not permit a matrix to be partitioned (in our sense) into column vectors. We can overcome this by representing a vector as follows:

[Diagram: the name cell points to a cell holding the spacing k; v1 stands k cells beyond it, v2 stands k cells beyond v1, and v3 k cells beyond v2]

The name cell points first to a cell containing an integer k which gives the spacing of the elements, i.e., k is added to the address of each element to get that of the next. Similarly the first element, v1, stands k cells beyond that of k itself. If k=1, the elements stand in consecutive cells.

If now a matrix is bordered by a row and a column, thus

 
   0      n+1      n+1   ...   n+1 
   1      a11	   a12   ...   a1n
   1      a21	   a22   ...   a2n
   .      ...      ...   ...   ...
   1      am1	   am2   ...   amn

and then stored by rows in the usual way, both the rows and columns appear as particular cases of the vectors just described, with k=1 and n+1 respectively.

Consider, for example, the case of a 3 × 3 matrix.

 
matrix A(3, 3); vector a(3); a name u, v

The assignments

 
A(2, *) ≡> u 
A(*, 2) ≡> v

name the 2nd row and column of A as u and v respectively. The physical connections involved are as follows:

[Diagram: the bordered 3 × 3 matrix stored by rows, with spacing cells 4 along the top border and 1 down the left border; the name cell u points to the spacing cell of the 2nd row, A(2, *), and v to that of the 2nd column, A(*, 2)]
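The bordered-matrix scheme can be sketched in modern Python (an assumed layout for illustration; the store is modelled as a flat list and addresses as list indices):

```python
# A 3x3 matrix stored by rows inside a 4x4 border: the top border cells
# hold the column spacing n+1 = 4, the left border cells hold the row
# spacing 1, so rows and columns alike appear as strided vectors.
n = 3
store = [0] * ((n + 1) * (n + 1))
for j in range(1, n + 1):
    store[j] = n + 1              # spacing cell for column vector j
for i in range(1, n + 1):
    store[i * (n + 1)] = 1        # spacing cell for row vector i
for i in range(1, n + 1):         # fill A(i, j) = 10*i + j
    for j in range(1, n + 1):
        store[i * (n + 1) + j] = 10 * i + j

def vector(store, addr):
    """Read out the strided vector whose name cell holds `addr`:
    that cell contains the spacing k, and the first element stands
    k cells beyond it."""
    k = store[addr]
    return [store[addr + k * (e + 1)] for e in range(n)]

u = vector(store, 2 * (n + 1))    # A(2, *) ≡> u : the 2nd row (k = 1)
v = vector(store, 2)              # A(*, 2) ≡> v : the 2nd column (k = 4)
```

The same access routine serves both partitions: only the spacing cell differs.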

5. TREES

Another example of a data structure that can be simply partitioned is a tree. Consider, for example, the structure

[Tree diagram: a node 1 with S words pointing to 21 and to a node 1, whose S words point to 22 and to a node 1, whose S words point to 23 and to 2]

which results from analysing

 
u, v, w

with reference to the phrase <variable list> defined as follows:

 
phrase variable list = variable . rest of variable list
phrase rest of variable list = ',' . variable . rest of variable list, nil
phrase variable = 'a', 'b', 'c', . . ., 'x', 'y', 'z'

The second of these is a recursive definition.

Here . denotes concatenation and , is used to separate alternative formats. These are listed in order of preference, i.e., it is the order in which the recognition routine tries to match them with the source string. If a match is found the remaining alternatives are abandoned. For further details of this algorithm see Rosen (1964). These phrase definitions might be better called tree definitions and in general they could be used to declare trees occurring in any application, and not just trees obtained in a compiler by syntax analysis.
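The ordered-alternatives recognition algorithm can be sketched in modern Python (a much-simplified illustration; the actual algorithm, described by Rosen (1964), operates on coded source text). Each alternative is tried in order of preference, and the first that matches yields a node recording its format number:

```python
# Phrase definitions: each phrase maps to a list of alternative formats,
# each format a list of phrase names and literal characters.
PHRASES = {
    "variable list": [["variable", "rest of variable list"]],
    "rest of variable list": [[",", "variable", "rest of variable list"],
                              []],                    # nil
    "variable": [[c] for c in "abcdefghijklmnopqrstuvwxyz"],
}

def recognise(phrase, s, pos):
    """Try each alternative format of `phrase` at s[pos:]; return
    ((format number, subtrees), new position) for the first match,
    abandoning the remaining alternatives, or None if none matches."""
    for fmt_no, fmt in enumerate(PHRASES[phrase], start=1):
        subtrees, p, ok = [], pos, True
        for item in fmt:
            if item in PHRASES:               # a subphrase: recurse
                r = recognise(item, s, p)
                if r is None:
                    ok = False
                    break
                sub, p = r
                subtrees.append(sub)
            elif p < len(s) and s[p] == item: # a literal character
                p += 1
            else:
                ok = False
                break
        if ok:
            return (fmt_no, subtrees), p
    return None

tree, end = recognise("variable list", "u,v,w", 0)
```

Analysing "u,v,w" consumes all five characters and produces category 21 ('u' being the 21st alternative of <variable>) at the first node, matching the numbers in the analysis diagram above.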

The format eventually chosen is indicated by the numbers in the analysis diagram, 1 denoting the first alternative, 2 the second, and so on. If there is only one alternative the number is 1. The Ss are address words which point to subtrees. Each node of the tree thus consists of an integer word followed, if necessary (i.e., if the structure goes any deeper), by S words.

These words at each node are contiguous in the store but the nodes themselves can be located anywhere in the store. However a tree is usually stored in a compact form when it is first produced.

A single address is all that is necessary to locate the subtree associated with any particular phrase. With the help of the associated phrase definitions the subtree yields to the compiler all necessary information about the phrase. We shall now describe some operations for analysing the structure of a given tree. Consider the routine:

 
routine read variable list; variable list name vl; begin
variable name v; rest of variable list name rovl
resolve vl into v.rovl
1: read variable (v)
-> 1 if rovl resolves into ','. v.rovl
end
         

There are three local name locations associated with this routine. First the parameter vl which on entry points to a tree with the structure of <variable list>, i.e., it points to the 1 at the head of the tree. The other name locations will be associated with <variable> and <rest of variable list>.

The operation

 
resolve vl into v.rovl

isolates the S words which point to the two principal subtrees, <variable> and <rest of variable list>, and records them in the name locations v and rovl respectively. The operation is performed with the help of a template constructed by the compiler from the original phrase definitions, which are indicated in the name declarations. As a result v and rovl now point to the principal subtrees. The process can be repeated to resolve these subtrees into further subtrees and so on. Before we do this, however, consider the instruction

 
read variable (v)

This is a call for a routine

 
routine read variable; variable name v; begin
BODY   OF   ROUTINE
end
         

(Note: the use of v in both the call sequence and the routine heading is purely a coincidence.)

The body of the routine processes a <variable> tree in some way. We have in mind a language for which we are writing a compiler, and that read (<variable list>) and read (<variable>) are input statements in this language. The task of these routines is to compile instructions to read the data in question. The former calls on the latter to compile the equivalent of read (<variable k>) for each of the variables in the list. How the latter routine works is irrelevant for the present discussion. To return to the main routine, the operation

 
-> 1 if rovl resolves into ','. v.rovl

is a conditional version of the first instruction. If rovl has the first of the two alternative formats in the definition of <rest of variable list>, the instruction locates the subtrees corresponding to <variable> and <rest of variable list> and records their addresses in v and rovl. It then transfers control to label 1. (The repeated use made of v and rovl is the usual device of using a single location for several purposes.) In this way we unravel the recursive structure of the original vl, exposing a new <variable> v at each stage, which is then handed on to a subroutine, read variable, to process. The tree for a <variable> in this case is utterly trivial, consisting as it does of a single integer word only. In actual practice it would be much deeper. Finally we reach a stage with this conditional operation where rovl takes the second alternative form, namely <nil>. In this case we do not jump to 1 but pass on to the end (which also serves as a return exit). The foregoing routine would appear in Compiler-Compiler language as:

 
ROUTINE [SS] ≡ read ([VARIABLE LIST]) [EOL]
LET [VARIABLE LIST] ≡ [VARIABLE] [REST OF VARIABLE LIST]
1) read ([VARIABLE])
->1 IF [REST OF VARIABLE LIST] ≡ [,] [VARIABLE] [REST OF VARIABLE LIST]
END

It will be seen that the name location vl is implied by the heading, while v and rovl are implied by the identifiers appearing in the subsequent statements. If more than one subphrase of the same type has to be identified, they would take the form, for example,

 
[VARIABLE/1], [VARIABLE/2], . . ., 

corresponding say to

 
variable name v1, v2
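The resolve loop of the read variable list routine can be sketched in modern Python (a hedged illustration, not the Compiler-Compiler mechanism itself), with a tree node represented as a pair of its category number and its list of subtrees:

```python
def read_variable_list(vl, read_variable):
    """Unravel a <variable list> tree, handing each <variable>
    subtree to the given routine."""
    v, rovl = vl[1]            # resolve vl into v.rovl
    while True:
        read_variable(v)       # 1: read variable (v)
        if rovl[0] != 1:       # second alternative: nil, so fall
            return             # through to the end (return exit)
        v, rovl = rovl[1]      # -> 1 if rovl resolves into ','.v.rovl

# The tree for "u,v,w": variable categories 21, 22, 23 stand for
# the letters u, v, w; category 2 is the nil alternative.
tree = (1, [(21, []), (1, [(22, []), (1, [(23, []), (2, [])])])])
seen = []
read_variable_list(tree, lambda v: seen.append(v[0]))
```

The repeated reuse of the two locals v and rovl appears here as ordinary rebinding inside the loop.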

The tree manipulations just described are essentially name assignments. We have said nothing so far about the actual generation of trees from other data, and their assignment to tree locations in the store. A phrase definition, for example,

 
phrase index = 'i','j','k','l','m','n'

is essentially a type declaration, it does not imply the existence of a location for a tree named index. Likewise only type names (or, as in the above case, basic characters) can appear on the right-hand side of a phrase definition. If we wish to actually store trees of a given type then we write a declaration of the form, for example,

 
index: i1, i2

The precise mechanism by which local storage is allocated for a tree does not concern us here provided the tree takes the general form described earlier. The difficulty is that it is impossible to predict at compile time what the size of the tree will be. One can determine the maximum extent of each node and hence, provided it is not recursive, the maximum size of the tree assuming the nodes are packed contiguously. If it is recursive, however, it is not possible to say anything until the tree is actually generated, and in this respect it is worse than dealing with an array declaration where we can at least allocate storage on entering the block in question. (It is somewhat similar to the problem of storing multilength integers in which all digits are retained.) One way to deal with objects of this type is to use the sliding store principle (see Rosen 1964). In this system storage is only allocated when the objects are generated, and when it is necessary to replace an old value by a new one, the former is deleted by closing the ranks and adding the new value at the end.
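The sliding store principle can be sketched as follows (an assumed, much-simplified model: the Rosen (1964) system works on raw storage, whereas here the store is a Python list of named trees):

```python
class SlidingStore:
    """Storage is allocated only when objects are generated; replacing
    an old value deletes it by closing the ranks and appends the new
    value at the end."""
    def __init__(self):
        self.items = []                     # (name, tree), in storage order

    def assign(self, name, tree):
        # delete any old value by closing the ranks ...
        self.items = [(n, t) for (n, t) in self.items if n != name]
        # ... and add the new value at the end
        self.items.append((name, tree))

    def lookup(self, name):
        for n, t in self.items:
            if n == name:
                return t

s = SlidingStore()
s.assign("i1", (1, []))
s.assign("i2", (2, []))
s.assign("i1", (3, []))    # old i1 deleted; new i1 now stands at the end
```

The cost of closing the ranks is what makes repetitive tree generation without space recovery troublesome, as noted below.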

There are two methods by which trees can be generated. The first is by analysis of a character string (stored say as coded integers in an integer array) with respect to some phrase definition. This can be done by an instruction of the form

 
analyse A(i) wrt p

where A(i) indicates the character at which scanning is to start, and p the destination of the resulting tree. (p will be given in a value list together with the associated phrase definition.) The analysis algorithm is described by Rosen (1964).

The second method is by an instruction of the form, for example,

 
let ss = 'r'.'e'.'a'.'d'.'('.v.')'

Here the right-hand side is analysed (at compile time) with respect to the phrase definition associated with ss and a provisional tree is constructed into which a subtree of the type (associated with) v can be plugged when the instruction is eventually executed (and the subtree computed). More precisely the provisional tree, which forms part of the instruction itself, is copied to ss, and at the same time the connection with v is made. The latter is in effect a name assignment: no physical copy of v is made. (In the Compiler-Compiler trees constructed in this way were kept in the stack and simply allowed to advance the working area.) However, one can run into trouble with this approach if tree generation is allowed to become repetitive, without some attempt being made to recover the space occupied by trees which are no longer required.

Another very useful tree function is

 
p category
         

which yields the value of the principal category word as an integer. This is more or less essential for dealing with phrases in which there are many possible formats, for example, the class of source statements in Fortran. The category number can be used (via a switch) to transfer control directly to the relevant processing sequence or routine. The only alternative would be a sequence of conditional resolve instructions which would be very slow indeed.
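In a modern notation the switch might be sketched as a table of processing routines indexed by category number (the routine names here are invented for illustration):

```python
def process_assignment(tree): return "assignment"
def process_goto(tree):       return "goto"
def process_conditional(tree): return "conditional"

# category number -> processing routine, in format order
SWITCH = {1: process_assignment, 2: process_goto, 3: process_conditional}

def process(tree):
    category = tree[0]            # the principal category word
    return SWITCH[category](tree)
```

A single table lookup replaces what would otherwise be a chain of conditional resolve instructions tried one after another.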

To sum up, the new features proposed for tree processing are the introduction of phrase definitions, the instructions

 
resolve . . . into . . . 
if ... resolves into . . . 
analyse . . . wrt. . . 
let . . . =

and the function category.

6. CONCLUSION

Discussion has been confined to data which can be named by a single address word which points to the first of the locations occupied by the item. The information found in these locations is interpreted according to rules built into the compiler, although certain decisions, involving the dynamic characteristics of the type, may be postponed until run-time. This type of data reference is sometimes called a simple name (a term introduced by Strachey and Wilkes in connection with the call-by-name facility of Algol procedures) to distinguish it from the more general (but far less efficient) device of using an algorithm to define either the precise locations involved (the left-hand-value) or the information itself (the right-hand-value).

In some compilers a machine address is synonymous with an integer. (If not, one can declare a sufficiently large vector, for example, array store (1 : 10,000), and use subscripts as addresses.) In this case data systems of the type discussed here can be simulated by imputing an association between certain integers, the names, and certain storage locations (either absolute or array elements), the data. Some Compiler-Compiler facilities recently introduced into Atlas Autocode (Brooker, Rohl & Clark 1966) are partly of this character and partly built-in.

The trouble with such interpretive systems is that we throw away the possibility of checking the legitimacy of the operations we attempt to perform. This is one of the objections to the use of array fn s in Atlas Autocode (at least as implemented on the Atlas). The declaration array fn A(s, p, q) implies that a subsequent operand A(i,j) refers to the number standing in location s+p×i+q×j. It is usually employed to remap an existing matrix, B say, s being associated in some way with B. There is, however, no provision for checking this, or even that the final location is sensible. We must try to ensure that in providing more facilities we do not at the same time give the user more opportunities to make a fool of himself.
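The hazard can be illustrated with a toy model (hypothetical addresses and layout; the store is a flat list indexed by integer address):

```python
store = list(range(100))         # the whole store, addressed by integer

def array_fn(s, p, q):
    """array fn A(s, p, q): A(i, j) refers to location s + p*i + q*j,
    with no check that the result lies inside the intended matrix."""
    return lambda i, j: store[s + p * i + q * j]

# remap a 4 x 4 matrix B assumed to start at address 20, stored by rows
A = array_fn(20, 4, 1)
inside = A(1, 2)                 # location 26: within B
outside = A(9, 9)                # location 65: silently outside B
```

Nothing distinguishes the second access from the first; the mapping is applied blindly, which is precisely the objection raised above.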

7. ACKNOWLEDGEMENTS

Mr R. B. E. Napper read the first draft of this paper and made some very helpful comments.

8. APPENDIX: PHRASE DEFINITIONS AND FILE STRUCTURE

There is a close analogy between phrase definitions and the file structure of data processing records. The latter could be defined in the same way as phrases (although at most levels there is usually only one possible format) but physical separability of the data, particularly at the lowest levels, is not consistent with efficiency. The latter dictates that information must be packed, at least down to character units, and possibly to bits, in order to minimise scanning of magnetic tape on which thousands of records may be stored. The use of machines with byte storage would help in this matter.

In any case phrase definitions corresponding to basic units of information, e.g., integer, real, are implemented as subroutines, which recognise the appropriate data formats in the source string, and convert the information to the appropriate internal form. For example, the phrase for real would simply call in a subroutine to read a decimal number from the input source string, convert it to standard floating binary form and terminate the tree with it in place of the usual S words which would otherwise lead to something further.

9. REFERENCES

Brooker, R. A., MacCallum, I. R., Morris, D., & Rohl, J. S. (1963). The Compiler-Compiler. A. Rev. autom. Progrmg, 3, 229-275.

Brooker, R. A., Rohl, J. S., & Clark, S. R. (1966). The main features of Atlas Autocode. Comp. J., 8, 303-310.

Iverson, K. E. (1962). A Programming Language. New York: John Wiley.

Rosen, Saul (1964). A Compiler building system developed by Brooker and Morris. Communs Ass. comput. Mach., 7, 403-414.
