Chilton::ACL::The Use of Syntax Analysis in Compilers

The Use of Syntax Analysis in Compilers

D Morris

July 1962

Proceedings of a Symposium held at the London School of Economics, Editor Peter Wegner

Introduction

Most readers will be familiar with the meta-linguistic variables and formulae used to describe syntax in the ALGOL 60 report (Backus et al., 1960). The purpose of this paper is to discuss how formulae of this kind are used in compilers. Of the several compiling systems (Brooker et al., 1962a, b; Irons, 1961; Gibbons, 1961; Glennie, 1960), which use the notion of meta-linguistic variables and formulae as a means of identifying the various kinds of statements (instructions or commands) which may occur in the programs to be compiled, we shall be mainly concerned with the Atlas Compiler Compiler System (Brooker et al., 1962a). The meta-linguistic formulae in question describe the language implicitly by means of a set of substitutions which generate all the legal statements of the language. They are used by compilers in order to discover the sequence of substitutions which leads to each statement, and this reveals its meaning. The process is referred to as syntactic analysis.

Usually compilers based on this principle can be applied to several different languages by substituting the appropriate sets of meta-linguistic formulae. In the Atlas Compiler Compiler system these formulae are legal statements of the compiler writing language, but in some of the other systems they have to be hand coded.

Meta-Linguistic Formulae

In what follows meta-linguistic will be abbreviated to meta.

The meta-linguistic formulae we are concerned with are simply a means of assigning names to semantically similar strings (i.e. concatenations) of symbols. For example, the name INTEGER might be assigned to the class of integer constants. The names are regarded as meta-variables and may appear in other strings to indicate that any of the strings with which they are associated is a permissible substitution at that point. It is therefore necessary to distinguish symbol strings representing meta-variables from symbol strings in which each symbol represents itself. The obvious solution is to enclose the meta-variable in some form of parentheses, but this implies that the opening bracket is a meta-symbol which marks the start of a meta-variable, and it cannot then be used to represent itself. In Brooker et al., 1962a meta-variables are written thus [name] and opening brackets not used in this context are represented by the reserved name [[]. Again in this system commas are used to separate the alternative strings in a definition and [,] is reserved to represent a comma not used in this meta-sense. Following this notation we may now consider the definition of the class of integer constants (to which we assign the name [INTEGER]).

It is convenient first to define a meta-variable representing the class of decimal digits, thus:

[DECIMAL]   = 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
[INTEGER]   = [DECIMAL] [INTEGER] , [DECIMAL]

This recursive definition states that an integer is either a decimal digit followed by a further integer or just a decimal digit. Thus 107 is the decimal digit followed by the integer 07 which is the decimal digit 0 followed by the integer 7, which is the second kind of integer. For further examples the reader is referred to Backus et al., 1960 and Brooker et al., 1962a, but before continuing to consider how these formulae are used in syntactic analysis it is appropriate to mention two rules which they must obey if they are to be used by algorithms such as the one described below.

It will be seen that the definitions are scanned in a left to right direction and the first alternative string which is found to match the string of symbols at the head of the input stream is accepted. Hence if one alternative is a stem of another it must appear after the other. Secondly in recursive definitions some non-recursive meta-variables or symbols must appear to the left of the recursive element otherwise an endless loop would result. The following therefore are not acceptable.

[INTEGER] = [DECIMAL] , [DECIMAL] [INTEGER] 
[INTEGER] = [INTEGER] [DECIMAL] , [DECIMAL]

The formulae used by Backus et al., 1960 do not obey these rules, but they are formulae for synthesis and not analysis.

Implementation of Syntactic Analysis

Some schemes, for example Glennie (1960) do not employ meta-formulae directly. Instead each meta-variable is defined by means of a routine, which is coded to search for each of the symbol strings associated with that metavariable in turn. Where the strings involve symbols the routines make the relevant comparisons with the symbols at the head of the input stream, but where they involve meta-variables the corresponding routines are called as sub-routines. Obviously the routines must have a potential for recursive use. If they are successful in the search which they carry out they will advance the pointer in the input stream and report the success to the calling routine. If they are unsuccessful this also will be reported and the calling routine will then explore the next possibility or report failure itself if the last possible string is being explored.

In contrast, other schemes (Brooker et al., 1962a; Irons, 1961; Gibbons, 1961) store the meta-formulae and a single routine is used to carry out the syntactic analysis. This routine references the formulae in order to identify the symbol strings associated with the meta-variables which they define. The input parameters to this routine are effectively, the address of the formula defining the meta-variable which it is to attempt to recognize, and a pointer marking the current position in the input stream. Its output parameter is a binary variable which states whether or not the routine has recognized a string corresponding to the given meta-variable, and if it has the pointer in the input stream will have been advanced. The routines used by Brooker et al., Irons and Gibbons appear to be logically similar, and only one used by Brooker et al., 1962a will be described.

First we must consider how the meta-formulae are recorded in this scheme. Internally, both symbols and meta-variables are represented by integers (24 bit words on Atlas). The integers representing symbols are the actual symbol code plus a shift digit and are less than 128. Meta-variables are represented by integers greater than 128, and these integers are the addresses of registers in an index which in turn contain the addresses of the corresponding definitions. Thus the definition of a meta-variable consists of one or more strings of integers which are either the values of symbols or indirectly the addresses of the definitions of other meta-variables. For reasons which will become apparent below, the strings are enumerated in natural order and this category number is appended to each string. Further, if adjacent strings in a definition have a common stem the stem will only be recorded once. Thus, denoting the integer representing [INTEGER] by I and [DECIMAL] by D, and writing the category numbers explicitly, the definition given earlier for [INTEGER] would become

In order to record such definitions in a consecutive sequence of store registers a special kind of integer is used to mark branch points. It is distinguished by using as a flag a digit not otherwise required. The value of such integers is the address of the last word in the branch which follows them, and the next branch follows immediately after this. In order that we may know where the last string in a definition ends, an extra branch word is placed in front of each definition pointing to the last word in the definition. Denoting branch words by & and indicating the word whose address they contain by arrows, the above definition is coded as:

Returning to the syntactic analysis routine, this routine is entered with two parameters, namely the integer representing the meta-variable for which it is to search, and the address of the next character in the input stream. First it looks up the address of the definition of the meta-variable and sets a pointer on the first word. It also sets up a local push down list to be used when & words are encountered. Now the main loop commences. The current integer in the definition is examined and if it is an & word it goes into the push down list along with the current value of the pointer in the input stream. The pointer in the definition is then advanced and the process is repeated, If the integer is not an & word the routine next tests if it is the last word (i.e. category word) of the string. This is done by comparing its address to the last & word entered in the push down list. If these agree then the current integer is the category word and the routine ends by returning control to the place from which it was last called and reporting its success. However, if the current integer is not the last in a string, a comparison has to be made. If it represents a symbol (i.e. is <128) the comparison is made directly with the symbol at the head of the input stream. If they correspond both the pointer in the definition and the pointer in the input stream are advanced and the loop is repeated. Otherwise both pointers are reset from the last entry in the push down list and the loop is then repeated to test for the next alternative. When the push down list is empty (or contains only the first &) there are no further possibilities to explore, therefore the routine ends and reports failure. The comparison process in the case where the current integer is a meta-variable is effected by the routine calling itself recursively. In due course control will return and the routine then continues exactly as it does after the comparison of symbols.

For the purpose of extracting the meaning from a statement it is necessary to have some output from the syntactic analysis routine. It is at this point where the various schemes referred to begin to show a marked difference.

Synthesis of Target Program

It is not within the scope of this paper to discuss fully the problem of transforming source programs into machine coding, but it is interesting to compare briefly the schemes previously referred to. Glennie (1960) and Irons (1961) appear to have produced somewhat similar solutions although the latter is interpretive. In these schemes the analysis and synthesis phases merge into one, and an output stream representing the target program is generated as the input stream is being digested.

Each meta-variable has a meaning which is to be copied into the output stream each time the meta-variable occurs in the input stream. Its meaning may be expressed as a sequence of output language symbols, as a function of the recognized input symbols, or as a transformation to be carried out on symbols already in output stream.

In Brooker et al., 1962a - Gibbons (1961) is similar - the output from the analysis phase is a tree record of the statement showing which alternative form of each meta-variable in the statement on hand has occurred (i.e. its category number). Each node in this tree corresponds to a meta-variable and it contains the category number of the recognized alternative form of this meta-variable followed by the addresses of the nodes defining the meta-variables which appear on this alternative. When an alternative consists entirely of basic symbols, the category number only is recorded and this is an end node. Thus the tree record for the integer 107 analysed with respect to [INTEGER] is

Its one dimensional layout would be

but the details of this need not concern us here, and the mechanics are described in more detail in Brooker et al. (1962b).

For each kind of statement in the language the user has to provide a routine for the purpose of converting the tree record for the statement into target language.

The overall structure of these routines is very similar to that of routines in other languages, e.g. FORTRAN, in so far as they consist of a sequence of instructions with a labelling system for effecting transfers of control. Whereas in conventional routines, however, the types of quantity being processed are usually integers and floating point numbers, in our case the two principal types are integers and expressions, the latter being represented internally by trees. Integer variables are either local to the routine in which they appear and denoted by α₁ α₂ α₃ . . ., or they are global and are denoted by β₁ β₂ β₃ . . . The local ones occupy space in an area of store which is used as a stack and are thus protected during recursive use of the routines. The global ones represent the index registers of Atlas, of which only the first 40 are generally available. Both kinds of integer are used for much the same purposes as integers in other languages, e.g. counting, remembering addresses, and so on; and the usual arithmetic and logical operations are available for manipulating them, or the integers in the store registers whose addresses they contain. There are, however, no formal arrays and for systematic working space (e.g. lists of labels and associated control numbers, type declarations, array dimensions) we use chained lists. In these each item occupies two consecutive store registers; the item itself being in the first word and the address of the next item being in the second word. The addresses of these lists are recorded as α's or β's, and examples of the instructions for manipulating lists are:

add α₄ + 10 to list β₂
list α₅ = list α₁₀ + list α₁₁

If the lists become too numerous to keep track of in the α and β registers their addresses may be stored in further lists, and in this way complicated list structures are built up.

The expressions with which routines are concerned are the constituent parts of the statement being translated. They are represented in the store by the tree record produced by the syntactic analysis routine, and these tree records are referred to by using the meta-variables as identifiers. The operations to be performed on expressions consist of: resolving them into their constituent sub-expressions, testing for a given alternative form of an expression, or perhaps constructing new expressions using the existing expressions as sub-expressions. Examples of the kinds of instruction available whose meanings are fairly obvious are:

→ 1 UNLESS [INTEGER] = [DECIMAL] [INTEGER] 
α₁ = CATEGORY OF [DECIMAL] 
1) LET [INTEGER] ≡ [DECIMAL]

Internally the operations of this kind amount to matching trees in a topological sense, locating the top nodes of particular sub-trees, and linking incomplete tree structures to particular sub-trees.

To give a clearer picture of how this system works in practice we must consider briefly the form in which the user presents his compiler. It will consist mainly of three kinds of statements called PHRASE, FORMATS ROUTINE statements. For every legal statement of the source language a FORMAT statement must be supplied to define its syntax (this definition is simply a string of symbols and meta-variables), and a ROUTINE statement must be supplied to transform its tree record into target language. The Compiler Compiler collects together all the FORMAT statements and assembles them as the alternative strings in the definition of a meta-variable representing the class of source statements (denoted by [SS]). The category number assigned to each alternative in this definition is the index number (i.e. indirect address) of the associated ROUTINE. When the Compiler Compiler is translating source program the master routine calls in the syntactic analysis routine to analyse the input stream with respect to [SS]. When this routine has identified the source statement at the head of the input stream control will return to the master routine. This routine then examines the top node of the tree record to discover the category number of the recognized statement, and from this locates the routine which is to be used to translate the statement. The master routine now calls in this routine supplying it with the tree record as a parameter and the statement is thus converted to target coding. When control is returned to the master routine the process is repeated.

Conclusions

The principal virtue which the compiling systems mentioned above have in common is that they greatly simplify the detailed coding of a compiler. Also the structure of the resulting compiler is such that to modify it to accept new or extended statements is a less hazardous undertaking than with a conventional hand coded compiler. They do nothing, however, to assist in choosing strategies for mapping the target program and its working space into the available store, or for optimizing the target program.

REFERENCES

1. J. W. BACKUS, F. L. BAUER, J. GREEN, C. KATZ, J. MCCARTHY, P. NAUER, A. J. PERLIS, M. RUTISHAUSER, K. SAMELSON, B. VAUQUIS, J. H. WEGSTEIN, A. van WIJNGAARDEN and M. WOODGER (1960). Report on the Algorithmic Language ALGOL 60. Numerische Mathematik 2.

2. R. A. BROOKER, I. R. MacCALLUM, D. MORRIS and J. S. ROHL (1962a). The Compiler Compiler. Annu. Rev. Autom. Progr. 3.

3. R. A. BROOKER, D. MORRIS and J. S. ROHL (1962b). Trees and Routines. Computer J. 5, No. 1.

4. A. GIBBONS (1961). Running Pegasus Autocode Programs on Mercury. Computer J. 3, No. 4.

5. A. E. GLENNIE (1960). On the Syntax Machine and the Construction of a Universal Compiler. Technical Report No. 2, Computation Center, Carnegie Institute of Technology.

6. E. T. IRONS (1961). A Syntax Directed Compiler for ALGOL 60. Comm. A.C.M. 4, No. 1.