Compiling Techniques
Chapter 10: Compiler-Compilers

10.1 Introduction

Producing a compiler in an assembly language can be quite tedious and time-consuming. For example, the original Fortran I compiler for the IBM 704 took eighteen man-years to produce, and about another ten man-years of effort were needed to extend this to the Fortran II version for the IBM 709 [Backus et al. (1957)]. With the development and understanding of compiling techniques, the manpower required has been considerably reduced but, even now, a reasonable compiler for a language like Algol is unlikely to take less than four man-years to write if it is hand-coded. This amount of man-power may not be available and, even if it is, it may be uneconomic to use skilled programmers in this way. Also, if the compiler is required urgently, it may well take too long to produce. There is an optimum number of people who can work together on a project of this kind (probably about four) so that, although four people could produce the compiler in a year, it would be unlikely that eight people could produce it in six months and almost certain that sixteen people could not produce it in three months.

To try to improve the speed of production of compilers and reduce the man-power required, special purpose higher level languages called Compiler-Compilers (abbreviated CC) have been developed. Just as the Fortran language contains statements which most scientific users require, so CC languages contain statements particularly relevant to the problem of compilation. Primitive CC languages started appearing around 1958-9 and since then a large number of languages and dialects (possibly around 75) have been developed. They range from sub-languages which aid in only one particular part of the compilation process to large systems which will generate a whole compiler once the language and target computer have been defined.

Very few direct comparisons of the efficiency of hand-coded and CC generated compilers exist [but see Brooker, Morris, and Rohl (1967)] although there has been much speculation. The object code produced by a CC generated compiler should be better than that produced by a hand-written one. Optimising code generation is, to a large extent, a formulation of special cases that can be handled either more or less efficiently on a particular computer. As it is easier to do this enumeration in a higher level language, it is also likely to be more exhaustive. The language will also tend to focus attention on the areas of interest and so avoid overlooking certain features. On the other hand, the algorithms used in the CC generated compiler tend to be more generalised than the hand-coded ones. This is to allow compilers for a large range of languages to be defined without requiring too much alteration of the underlying structure. Consequently it is usual for hand-coded compilers to be smaller and compile faster than CC generated ones. However, just as higher level languages have taken over from assembly languages for coding scientific problems so CCs will take over from assembly languages for writing compilers.

In several cases the CCs used in this Chapter have had to be simplified to avoid description of features unnecessary to the example. It is important to realise that most of these CCs provide many more facilities than the ones described so that this Chapter does not do justice to any of the more powerful CCs.

10.2 What is a compiler-compiler?

An idealised form of a CC is shown producing an Algol compiler in Fig. 10.1. The syntactic and semantic definitions of Algol are input to the CC and the output consists of an Algol compiler written in some language. This would probably be similar to machine code although it could be either an assembly or higher level language. Fig. 10.1 implies that the code produced is completely dependent on the Algol definitions so that a specialised compiler for Algol is the result. Most CCs in existence tend not to approach this idealised form but more closely resemble Fig. 10.2. The CC language is often divided into a set of sub-languages each producing a specific part of the Algol compiler. For example, a special language might be defined in which to express the syntactic definition of the language; a second language might define the characteristics of the computer for which the compiler is to produce code; a third language might be used to express the semantics of the language. In many cases the outputs from these sub-languages are tables which drive a general purpose compiler written only once in machine code. For example, a standard top-down analysis routine might be provided in the fixed part of the compiler and this would be driven by a table produced by a special CC language in which the syntax was defined.

Fig. 10.1: An idealised CC. The definition of Algol is input to the CC, which outputs an Algol compiler; an Algol program and its data are then input to that compiler, which outputs a machine-code program equivalent to the Algol program.
Fig. 10.2: A more typical CC. A set of sub-compilers CC1, CC2, etc. each produce a specific part of the Algol compiler.

The CC produced is, of course, capable of defining and producing the sub-compilers CC1, CC2, etc. Consequently it is usual, once the pilot version has been written, to produce new versions of the compilers for the sub-languages defined in terms of the complete CC. This can lead to duplication of the fixed part of the compiler unless special precautions are taken so that it is capable of driving several different sets of tables at the same time. A particular sub-class of CCs, which are easily defined in terms of themselves, is the set of CCs which are also extendible compilers. A good example of an extendible CC is the Brooker-Morris Compiler-Compiler (in future this will be abbreviated to BMCC); references are Brooker, Morris, and Rohl (1962) and Brooker et al. (1963).

The form of an extendible CC is shown in Fig. 10.3. The extendible CC (abbreviated ECC) is able to enlarge itself by defining new statements to be added to the statement repertoire of the ECC language. It can, for example, add all the possible Algol statements to the existing language. The ECC language produced could then be thought of as a language with Algol as a subset. For an ECC to be a CC, it must be possible to divide the statements in the ECC language into classes so that, once the Algol statements have been defined, it is possible to state that only the class of Algol statements is expected in the program to be compiled. Once the ECC has been extended so that it can translate Algol statements as well, it is no longer necessary to keep the definitional parts of the ECC if only an Algol compiler is required. Consequently, it is usual to remove a large part of the original ECC leaving a compiler which will now only translate Algol.

Fig. 10.3: An extendible CC. The definition of Algol is input to the CC, which enlarges itself; the definition facilities are then removed, leaving an Algol compiler.

10.3 Data structures

Most CCs provide the user with a certain set of data structures to aid him in the definition of the compiler. The sophistication of the CC will determine to some extent what facilities are provided. There is not, however, any direct correlation between the sophistication of the CC and the quality of the data structures provided. In the simplest kind of CC, very little is provided at all. As the CC becomes more powerful, it may provide more and more data structures together with instructions for manipulating them or alternatively it may take over complete responsibility for large sections of the compilation task but still not provide any data structures for the compiler writer. In the second case it will, of course, be using complex data structures behind the scenes but these need not become available to the user. Two examples of CCs providing data structural definitions and operations upon them are BMCC and FSL [Feldman (1964), (1966)]. The CGS system [Shapiro and Zand (1963), Zand (1964)] is an example of a sophisticated CC which does not provide data structures for the user and is arranged so that he does not need them.

BMCC basically provides the user with a list of free storage and an instruction set which maps stacks, tables, and queues on to basic list structures. Storage can be removed and returned to the free storage list. A stack (called a NEST in BMCC) can be initialised by an order:

   B1 = NEST(1,2,4)

This would set up a stack containing three levels with the constant 4 on the top level. The name of the stack is B1. To withdraw the top element or pop the stack and put the contents in A1 would require:

   WITHDRAW A1 FROM NEST B1

while the constant 3 could be added to the stack by:

   ADD 3 TO NEST B1

Tables are arranged by having a circular list with two fields per entry; the first gives the key to the table entry while the second contains a pointer to the information part of the table entry. The entry is accessed by a statement of the form:

   A1 = LIST B3 (k, ?)

where B3 is the name of the table and k is the key to the table entry. The result is to put the pointer to the information part in the variable A1.

FSL also provides tables and stacks but these are stored as vectors rather than lists. The stack orders are similar to BMCC but slightly more general. For example:

   stack STR;
   pop [STR,A1];
   push [STR,3];
   STR ← 4;

The first line defines a stack called STR; the second removes the top element of the stack and puts it in A1; the third adds the constant 3 to the stack. The fourth statement is not available in BMCC and overwrites the top element of the stack with the constant 4. The table orders are:

   table SYM [200,4];
   enter [SYM, FRST, 6, 8, 10 ];
   A1 ← SYM [FRST, , $, ];
   A2 ← SYM [0, , , $];

The first defines a table of maximum length 200 with a key and 3 additional information fields; the second defines an entry in the table whose name is stored in FRST and has the information fields set to 6, 8 and 10 respectively; the third denotes a table look-up for the entry FRST which sets A1 equal to the information field indicated by the symbol $, that is 8; the fourth has a zero in the key field which indicates that the same table entry is to be used as on the previous access and, in this case, the last information field is required. FSL also provides declarations for defining stacks and tables in the object program and instructions for producing code to use these in the object program.
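The stack and table operations described above can be imitated in a few lines. The following Python sketch mirrors the BMCC NEST orders and the FSL stack and table accesses; the class and method names are illustrative, not the actual CC syntax.

```python
# Sketch of BMCC/FSL-style data structures in plain Python.
# Names (Nest, Table, withdraw, enter, ...) are illustrative only.

class Nest:
    """Stack with BMCC-style operations: initialise, ADD, WITHDRAW."""
    def __init__(self, *items):          # B1 = NEST(1,2,4)
        self.items = list(items)         # last element is the top level
    def add(self, value):                # ADD 3 TO NEST B1
        self.items.append(value)
    def withdraw(self):                  # WITHDRAW A1 FROM NEST B1
        return self.items.pop()
    def replace_top(self, value):        # FSL's  STR <- 4  (overwrite top)
        self.items[-1] = value

class Table:
    """Keyed table: each entry holds a tuple of information fields."""
    def __init__(self):
        self.entries = {}
        self.last_key = None             # supports FSL's "same entry" access
    def enter(self, key, *fields):       # enter [SYM, FRST, 6, 8, 10]
        self.entries[key] = fields
    def lookup(self, key, field):        # A1 <- SYM [FRST, , $, ]
        if key == 0:                     # 0: reuse the previous table entry
            key = self.last_key
        self.last_key = key
        return self.entries[key][field]
```

For example, `Table().enter("FRST", 6, 8, 10)` followed by `lookup("FRST", 1)` yields 8, matching the `$` in the second information field of the FSL access above.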

More complex data structures can be found in languages such as BCL [Hendry (1967)], PSL, and the AED system [Ross (1967)].

10.4 Lexical analysis

The lexical analysis phase is the section of the compiler where most CCs provide little general help. The lexical analysers, or subscanners as they are often called, tend to be inflexible in the form of input they will accept. Most CCs only allow input on one peripheral device and often put restrictions on how this may be used. Consequently, it is very difficult to produce compilers with several input forms using CCs. A typical example is BMCC, which has a specific subscan routine with no parameters; the only way it can be altered is for the user to replace the routine with a version of his own. This lack of flexibility in the lexical scan means that it is usual for reserved words to be recognised at the syntax analysis phase. This is inefficient, and later additions to the BMCC system include a trapdoor from the syntax analyser back to user-defined lexical analysis routines.

Most subscanners in CCs tend to use the recognition of terminal symbols as an opportunity to group these into classes so that the syntax analyser can refer to the attribute defining the class rather than the individual symbols where appropriate. For example, the declaration of terminal symbols in TGS [Dean (1964), and Plaskow and Schuman (1966)] would be:

   BEGIN. TERMINALS 
   + ,ADOP.
   - ,ADOP. 
   LS,RLOP. 
   EQ,RLOP.
   BEGIN.
   END. TERMINALS

Here + and - have been defined as belonging to the class of addition-type operators and given the attribute ADOP. Similarly, the relational operators LS and EQ have been assigned the attribute RLOP. The terminal symbol BEGIN is not assigned to any class.
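The classification can be thought of as a simple mapping from terminal symbols to class attributes. The following Python fragment is a hypothetical rendering of the TGS declaration above, not TGS itself; a symbol assigned to no class stands for itself.

```python
# Sketch: terminal symbols mapped to their class attributes, as declared
# in the TGS example. The dictionary and function names are illustrative.

TERMINALS = {
    "+": "ADOP", "-": "ADOP",      # addition-type operators
    "LS": "RLOP", "EQ": "RLOP",    # relational operators
    "BEGIN": None,                 # assigned to no class
}

def attribute(symbol):
    """Return what the syntax analyser should refer to: the class
    attribute where one exists, otherwise the symbol itself."""
    return TERMINALS.get(symbol) or symbol
```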

In Chapter 6 a general method of lexical analysis was described using direct access tables for converting simple terminal symbols and hash tables for the reserved words. Most systems provided in CCs are not as complex as this and often simply consist of a table look-up. For example, the FSL-like CC produced at the University of Illinois [see Northcote and Steiner (1964)] has a single look-up table for storing all terminal symbols. These are precisely ordered so that the length of the reserved word can be used to point to a specific part of the table and therefore shorten the scan required to locate a terminal symbol. The kind of directive controlling the subscanner is:

   *PUNCS  + , -, *,/ etc.
   *RESER  LS,EQ,BEGIN,PROCEDURE etc.
   *EQUIV  PROCEDURE, SUBROUTINE

The first two statements define the terminal symbols while the third allows alternative representations of reserved words to be declared.
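The length-indexed look-up can be sketched by grouping reserved words by length, so a scan examines only words of the right length. This Python fragment is an illustration of the idea, not the Illinois implementation; the word lists are taken from the directives above.

```python
# Sketch of a length-indexed reserved-word table: lookup only scans
# the group of words whose length matches the symbol being classified.

from collections import defaultdict

class SymbolTable:
    def __init__(self, puncs, reserved):
        self.puncs = set(puncs)
        self.by_length = defaultdict(list)
        for word in reserved:
            self.by_length[len(word)].append(word)

    def scan(self, symbol):
        """Classify a terminal symbol; the length limits the search."""
        if symbol in self.puncs:
            return "punctuation"
        if symbol in self.by_length[len(symbol)]:
            return "reserved"
        return "identifier"

table = SymbolTable("+-*/", ["LS", "EQ", "BEGIN", "PROCEDURE"])
```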

The subscanner is also provided with a set of small routines for analysing special input forms. For example, a routine is provided to convert all numbers appearing on the input (see Section 6.4). This is provided with parameters to allow exponents in decimal or octal, different exponentiation signs and also minor changes in the syntactic definition of a number.
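A parameterised number routine of this kind might be sketched as follows; the parameter names are illustrative, and only the exponentiation sign and the exponent base are varied here.

```python
# Sketch of a parameterised number-conversion routine: the exponent sign
# and base are parameters, so minor variations in the syntactic form of
# a number are handled by one routine. Parameter names are illustrative.

def scan_number(text, exp_sign="E", exp_base=10):
    """Convert a number such as '25' or '25E1' to its value."""
    mantissa, exponent = text, 0
    cut = text.find(exp_sign)
    if cut >= 0:                              # an exponent part is present
        mantissa, exponent = text[:cut], int(text[cut + 1:])
    return float(mantissa) * exp_base ** exponent
```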

10.5 Syntax analysis

It is in the area of syntax analysis that the CC has been most useful to the compiler writer. Virtually every CC provides the user with a set of statements in which to define the syntax recogniser and, in the more powerful systems, these definitions differ very little from the BNF notation often used to define the grammatical constructs in the language. Top-down analysers (see Section 7.6) and Floyd production recognisers (see Section 7.5) are most frequently used in CCs. Precedence methods tend not to be used as the range of languages that can be handled is not large. Also, it is not easy and often impossible to generate a precedence table from a syntax defined in BNF. Consequently a CC using a precedence analyser would only be able to automate the loading of the precedence table and not the more complex problem of deciding which entries to load.

In order to give brief examples of different CCs, consider the following language defined in the notation introduced in Chapter 5:

   P  →  L .
   L  →  L ; S
   L  →  S
   S  →  I = E
   E  →  E + I
   E  →  I
   I  →  a
   I  →  b
   I  →  c
   I  →  d

The language consists of a set of assignment statements separated by ; and terminating with . . Each assignment statement evaluates an expression, E, and assigns its value to one of the variables a, b, c or d. The expression consists of any set of the variables added together. An example program is:

   a = b + c + d;
   c = d + d.

10.5.1 Top-down analysers

One of the earliest CCs using a top-down analyser is the BMCC.

This uses a simple non-selective top-down analyser which does not handle left recursion. Consequently the recogniser for a language like the one above, which has left recursive rules, has to be modified so that these rules do not appear. It is usual to change these to a right recursive form. The BMCC analyser for the language above could be written:

  PHRASE [TERMINATOR] = ;, .
  PHRASE [I]          = a, b, c, d
  PHRASE [E]          = [I] [EXE], [I]
  PHRASE [EXE]        = + [I] [EXE], + [I]
  FORMAT [SS]         = [I] = [E] [TERMINATOR]

The PHRASE statement corresponds closely with the form of grammatical rule used in the language definition given above. Rules involving the same non-terminal on the left-hand side can be amalgamated; the different right-hand sides are separated by the symbol ,. The non-terminal symbols are differentiated from terminal symbols by enclosing them in square brackets. The last four statements in the example language's definition therefore correspond to the second PHRASE statement given above. The left recursive definition of E has been replaced by the definitions for [E] and [EXE] in a right recursive form.
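The right-recursive rules translate directly into a recursive-descent recogniser. The following Python sketch mirrors the PHRASE and FORMAT definitions above; it assumes single-character tokens and is an illustration of the method, not the BMCC analyser itself.

```python
# Sketch of a recursive-descent recogniser for the example language,
# with the left-recursive rules for L and E in right-recursive /
# iterative form, as the PHRASE statements require.

class Parser:
    def __init__(self, text):
        self.toks = [c for c in text if not c.isspace()]
        self.pos = 0

    def peek(self):
        return self.toks[self.pos] if self.pos < len(self.toks) else None

    def expect(self, t):
        if self.peek() != t:
            raise SyntaxError(f"expected {t!r} at position {self.pos}")
        self.pos += 1

    def identifier(self):                 # [I] = a, b, c, d
        if self.peek() in ("a", "b", "c", "d"):
            self.pos += 1
        else:
            raise SyntaxError("identifier expected")

    def expression(self):                 # [E] = [I] [EXE], [I]
        self.identifier()
        while self.peek() == "+":         # [EXE] = + [I] [EXE], + [I]
            self.expect("+")
            self.identifier()

    def statement(self):                  # [SS] = [I] = [E]
        self.identifier()
        self.expect("=")
        self.expression()

    def program(self):                    # statements separated by ; ending .
        self.statement()
        while self.peek() == ";":
            self.expect(";")
            self.statement()
        self.expect(".")
        return True
```

`Parser("a = b + c + d; c = d + d.").program()` accepts the example program, while an ill-formed input raises a `SyntaxError`.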

The PHRASE statement is all that is needed to define the complete syntax recogniser for the language. However, it will be necessary at some stage to take the part of the program analysed so far, find out what it means and produce code accordingly. Therefore it is necessary to mark certain constructs defined by PHRASE statements so that, when this construct is recognised, a call of a semantic routine associated with this construct can be made. In BMCC a second type of PHRASE statement called the FORMAT statement is provided. The FORMAT statement resembles even more closely the language rules given in the example as only one alternative can be given per statement. With each FORMAT statement is associated a semantic routine and this is called as soon as that alternative of the nonterminal class is recognised. In the example the non-terminal class [SS] (Source Statement) has just one alternative defined which is the assignment statement. Each time an assignment statement has been analysed, the semantic routine is called and code generated before returning to analyse the next statement.

The semantic routine associated with a particular statement will, of course, require to know all the relevant information about the specific form of that statement before it can know what code to generate. In the BMCC system, the analysis routine builds up a complete parse tree (called analysis record) of the sentence recognised and passes this to the semantic routine. By examining this structure, the semantic routine can determine the form and meaning of the statement. The assignment statement:

   a = b + c + d

would, when analysed, produce an analysis record as shown in Fig. 10.4. The tree structure produced consists of a set of nodes where each node represents a non-terminal symbol that has been recognised. The number associated with each node is the alternative in the PHRASE statement that has been recognised. The nodes pointed at from a node denote the non-terminal symbols that appear in the alternative recognised. The terminal symbols recognised are not stored in the tree structure as these can be reconstructed from the PHRASE definitions. Apart from some redundancy that has been generated, the form of the analysis record resembles the binary tree shown in Fig. 10.5.

Fig. 10.4: The analysis record for a = b + c + d. The node SS:1 points to the nodes I:1, E:1 and TERMINATOR:1; E:1 points to I:2 and EXE:1; EXE:1 points to I:3 and EXE:2; EXE:2 points to I:4.
Fig. 10.5: The equivalent binary tree: the node = has the left sub-node a and the right sub-node +, whose sub-nodes are b and a further + with sub-nodes c and d.

A more powerful system than BMCC is CGS which uses a top-down analyser that does handle left recursion. The analyser is normally a fast-back but individual rules can be marked as slow-back where required. The notation is, of course, different but the recogniser in CGS can be written down in a form very similar to the syntactic definition of the example. The syntax of the language written in CGS is:

    P = L $.               //      /
    L = L $; S / S         //      /
    S = I $= E             //GENRAT/ 
    E = E $+ I / I         //      /
    I = $a/ $b/ $c/ $d     //FIXLST/

In CGS terminal symbols must be preceded by the symbol $ and alternatives with the same left-hand side are separated by /. Instead of having a separate statement for constructs which call the semantic language, such a statement is tagged with a call such as GENRAT. Several of these calls exist but GENRAT closely resembles the call associated with a FORMAT statement in BMCC. The analyser in CGS is similar to BMCC in that it generates a tree structure to be passed to the semantic routines of the CC. However, the form of the tree is considerably different, and the sub-tree produced before the call of GENRAT is shown in Fig. 10.6 for the statement:

   a = b + c + d
Fig. 10.6: The CGS tree for the statement. The node S:3 has the three sub-trees I:1 (a), the terminal $=, and E:5; E:5 has the five sub-trees I:2 (b), $+, I:3 (c), $+ and I:4 (d), all on one level.

The number associated with each node is the number of sub-trees if the node is non-terminal. Terminal elements are stored in the tree unless the particular statement is tagged with FIXLST, in which case this information is not stored and the number associated with the non-terminal node then contains the alternative recognised just as in BMCC. The other important difference is that the string of additions, instead of producing a binary tree, produces a bushy tree with all the identifiers involved in the additions on the same level.

10.5.2 Floyd productions

A good example of a CC using Floyd productions for the syntax analysis is the language FSL. The analyser is similar to the one described in Section 7.5 and shown in Table 7.3. It is less general than this as only symbols to the left of the pointer can be matched against a rule. Consequently, the pointer must first be moved over any input string of interest before checking can take place. As described in Section 7.5, the symbols to the left of the pointer are loaded into a stack so that all testing takes place on the top elements of the stack. In FSL there is a restriction that not more than the top five elements in the stack can be tested. The language has a subscan which automatically packs up identifiers so that the identifiers are converted to the class I as they are loaded into the stack. The set of productions for the example is:

    LABEL    INITIAL      FINAL       NEXT
    P1       I      |           |          *P2
    P2       I=     |           |          *P3
    P3       I=I    |  →  I=E   |  EXEC3   *E2
    E1       E+I    |  →  E     |  EXEC2   *E2
    E2       E+     |           |          *E1
             I=E;   |  →        |  EXEC1   *P1
             I=E.   |  →        |  EXEC1

Each rule consists of a label followed by the initial configuration of the stack expected for this rule to apply. This is terminated by | and, if an arrow follows, then the symbols in the column headed FINAL are to replace the stack elements recognised when a match occurs. The NEXT column denotes the next rule to try if this rule has been matched successfully. If this label is preceded by * then another character is read from the input into the stack before the next production is tried.
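The production table can be interpreted mechanically. The following Python sketch encodes the rules above and, given an input whose identifiers have already been converted to the class I by the subscan, returns the sequence of EXEC calls made; the encoding is an illustration, not FSL's own representation.

```python
# Sketch of a Floyd-productions interpreter for the example table.
# Each rule: (label, initial pattern, final replacement or None,
# semantic action or None, next label; a leading * means "read one
# more input symbol before trying the next rule"; None halts).

def floyd_parse(tokens):
    rules = [
        ("P1", ["I"],                None,            None,    "*P2"),
        ("P2", ["I", "="],           None,            None,    "*P3"),
        ("P3", ["I", "=", "I"],      ["I", "=", "E"], "EXEC3", "*E2"),
        ("E1", ["E", "+", "I"],      ["E"],           "EXEC2", "*E2"),
        ("E2", ["E", "+"],           None,            None,    "*E1"),
        (None, ["I", "=", "E", ";"], [],              "EXEC1", "*P1"),
        (None, ["I", "=", "E", "."], [],              "EXEC1", None),
    ]
    index = {label: i for i, (label, *_) in enumerate(rules) if label}
    it = iter(tokens)
    stack = [next(it)]                      # shift a symbol, then try P1
    actions = []
    i = index["P1"]
    while True:
        _, pattern, final, action, nxt = rules[i]
        if stack[-len(pattern):] == pattern:
            if final is not None:
                stack[-len(pattern):] = final   # rewrite matched symbols
            if action:
                actions.append(action)          # call the semantic routine
            if nxt is None:
                return actions                  # halt: program accepted
            if nxt.startswith("*"):
                stack.append(next(it))          # * : shift one more symbol
            i = index[nxt.lstrip("*")]
        else:
            i += 1                              # fall through to next rule
```

For the example program (with each variable replaced by I), `floyd_parse(list("I=I+I+I;I=I+I."))` fires EXEC3 once and EXEC2 for each addition in every statement, with EXEC1 at each terminator.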

Unlike the top-down analysers described (but typical of this type of CC), no parse tree is produced automatically during the recognition process. Instead it is possible to call the semantic part of the system each time a rule is recognised. This has data structures with which the user can construct a parse tree if he desires or alternatively code may be generated each time a semantic routine is entered. In FSL the semantic routines are entered by inserting EXECi in the production. This calls the ith semantic routine each time this production is recognised.

This type of analyser tends to be more efficient in syntax recognition than the top-down analysers although it is not as readable as the BNF-like notation used in these. However, if code is generated as syntactic productions are recognised, the code produced is not good. Recently [Earley (1965)] an algorithm has been given for converting syntax defined in BNF into an equivalent set of Floyd productions. Although the algorithm does not produce the most efficient set of Floyd productions, it could become quite important as it would allow syntax to be defined in a more readable form than Floyd productions while still retaining the speed of syntax recognition typical of this method.

10.6 Semantic analysis

In Section 10.5 it was shown that syntax analysers tended to fall into two classes. The first calls a semantic routine each time a syntactic construct has been recognised while the second produces a parse tree for a large section of the program before passing the complete structure to a semantic routine for analysis. Most CCs using Floyd productions fall into the first class (e.g. FSL and TGS) while the top-down analysers are in both classes. Top-down analysers designed to produce efficient object code (CGS and BMCC) tend to be in the second class while simpler systems [Meta II: Schorre (1964)] are in the first class. If efficient object code is not required, the first class is usually adequate and can produce code directly (see Section 8.3). CCs in the first class which intend to produce good object code normally just pass on individual nodes of the parse tree as they are recognised, and no semantic analysis is done until the complete parse tree for a section is available. Consequently there is little difference between the methods used by semantic analysers of CCs producing good object code. Each attempts to do this by examining the form of the parse tree for a section of the program. How efficiently this can be done may well depend on the particular form of the parse tree produced. The parse tree should be kept as small as possible as this will usually save on both storage required for the tree and time required to analyse it. Usually the Floyd production recogniser is more suited to this. Not every construct recognised need add to the parse tree, and information about the parse tree can be kept in the syntactic stack until a suitable point for generation. The stack will, in any case, be used to keep pointers to the unfinished parts of the parse tree already generated. TGS is an example of a system generating a parse structure of this form. The top-down analysers tend to produce too large a parse tree unless care is taken. 
Syntactic classes are often defined for purely grammatical reasons causing tree nodes of little value to be introduced unless special facilities are provided to remove them. In CGS, for example, it is possible to remove some redundant nodes and also insert pointers down the most important paths through the tree.

Once the parse tree has been produced the semantic analyser must define the order in which the various parts of the tree should be examined and also recognise particular patterns in the tree which represent special cases requiring individual actions. The usual method is to enter the semantic analyser with a pointer to the top node of the tree. The form of this node and its descendants define which node or nodes should be examined next. The pointer moves about the tree in a tree-walk collecting information and producing code until the complete tree has been analysed. A simplified form of the semantic analyser in CGS for the example given in Section 10.5 would be:

   IF S, $SON3 $OUTPUT(STORE, SON1).
   IF E, $SON1.
   IF I AND LFTSIB = $+, $OUTPUT(ADD, SELF) $RTSIB*RTSIB.
   IF I, $OUTPUT(LOAD, SELF) $RTSIB*RTSIB.

This ordered set of rules is examined each time the pointer is moved to a new node to find the first applicable rule. This rule is then executed. The first part of the rule up to the ',' defines the type of tree node that the rule applies to. The second part of the rule defines the actions to be taken. For example, $SON1 moves the pointer from the current node to its left most sub-node.

Consider the example given in Fig. 10.6. Initially the pointer would be set at the top node so that the first rule was applicable. The action required is first to move the pointer to the third sub-node which is the one of type E. The CGS system automatically stacks the old rule number and the position in the rule so that, when the actions for the sub-node E have been completed, the original rule can be completed. As the node is of type E, the second rule is applicable and the required action is to move the pointer to the left most sub-node of E. This is the I node representing the variable b. The third rule is not applicable as the left sibling (that is the node immediately to the left of the one being considered and both pointed at from the same node) does not exist. The fourth rule is applicable and

   LOAD b

is output followed by the pointer being moved to the right sibling of the right sibling of this node. That is the node I representing the variable c. The third rule is now applicable as the node to the left of this one is a + character. The operation

   ADD c

is output followed by moving the pointer to the last I node. This causes

   ADD d

to be output using the third rule. The rule now requires the pointer to move two nodes to the right again. However, no such node exists as this is the rightmost node. The system therefore assumes that the action required for this rule is complete. Retracing through the nodes, the only one not completed is the initial node S and this outputs

   STORE a
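The effect of the four CGS rules on a tree of this form can be sketched in a few lines of Python; the tuple representation of the tree is illustrative, not CGS's own.

```python
# Sketch of the CGS tree-walk over the bushy tree of Fig. 10.6:
# LOAD the first operand, ADD each operand that follows a "+", and
# finally STORE into the left-hand side (the first CGS rule resumes
# after its sub-node E has been dealt with).

def walk_statement(tree):
    """Walk an S node ("S", [lhs, "=", E]) and emit the orders."""
    lhs, _, expr = tree[1]               # sub-trees: I, "=", E
    operands = expr[1]                   # all additions on one level
    code = [f"LOAD {operands[0]}"]       # fourth rule: first I, no $+ before
    for prev, node in zip(operands, operands[1:]):
        if prev == "+":                  # third rule: an I whose LFTSIB is $+
            code.append(f"ADD {node}")
    code.append(f"STORE {lhs}")          # first rule completes: STORE, SON1
    return code

tree = ("S", ["a", "=", ("E", ["b", "+", "c", "+", "d"])])
```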

The BMCC tree-walk is rather different from CGS in that any number of pointers to the parse tree are allowed. This means that the control of the tree-walk cannot be defined by the position of the pointer as in CGS. Instead the order in which nodes are examined is controlled by a routine associated with a particular type of parse tree. Although Fig. 10.4 shows each node having its type associated with it, this is not strictly correct in BMCC. Only a number, denoting the alternative in the PHRASE statement that has been recognised, is stored in the parse tree. This is sufficient to specify the type of each node as the types can be obtained from the PHRASE statements by starting at the top of the parse tree and working downwards. In BMCC each pointer has associated with it the type of the node that it points at. A BMCC routine for analysing the parse tree for the assignment statement given as an example would be:

   ROUTINE [SS] ≡ [I/1] = [E/1] [TERMINATOR/1]
          JUMP 1 IF [E/1] ≡ [I/2] [EXE/1]
          LET [E/1] ≡ [I/2]
          OUTPUT(LOAD [I/2])
          JUMP 3
   (1)    OUTPUT(LOAD [I/2])
   (2)    JUMP 4 IF [EXE/1] ≡ + [I/3] [EXE/1]
          LET [EXE/1] ≡ + [I/3]
          OUTPUT(ADD [I/3])
   (3)    OUTPUT(STORE [I/1])
          END
   (4)    OUTPUT(ADD [I/3])
          JUMP 2

The variables in square brackets denote pointers to the parse tree. The first part of the name up to the symbol / defines the type of node pointed at while the remainder denotes which pointer variable of this type is being used. The symbol ≡ appearing in several of the statements implies that the pointer variable to the left of the symbol may point to a node whose parse tree matches that for the expression to the right of the symbol. The expression may contain pointer variables which define the type of sub-trees expected at these points. If the expression does match then these pointer variables are set to point to the corresponding sub-trees of the parse tree pointed at by the variable to the left of the symbol ≡. For example, the initial statement of the ROUTINE given above attempts to match the parse tree with the expression:

   [I/1] = [E/1] [TERMINATOR/1]

The example in Fig. 10.4 is an assignment statement and would therefore match this expression. Consequently the pointer variables [I/1], [E/1], and [TERMINATOR/1] would be set to point at the three sub-nodes of the node [SS]. The next statement of the routine:

   JUMP 1 IF [E/1] ≡ [I/2] [EXE/1]

attempts to match the sub-tree pointed at by [E/1] with the expression [I/2] [EXE/1]. This will be successful unless the right-hand side of the assignment statement consists of a single variable. In the example, the match would be successful so [I/2] would point at the node for b and [EXE/1] would be set to point to the upper node of type [EXE]. As the match was successful, control is then transferred to label 1; the order:

   LOAD b  

is output and so on. It is not necessary to restrict the matching to the nodes immediately below the node under consideration and a statement:

   JUMP 5 IF [E/1] ≡ b + c + [I/4]

is allowed. The routine shown is not completely correct as it does not define the different actions required for the two possible terminators.

© Chilton Computing and UKRI Science and Technology Facilities Council