Chilton::ACL::Compiler Compiler

A Compiler-Building System Developed by Brooker and Morris

S Rosen

CACM Vol 7, No 7, July 1964

Purdue University

1. Introduction and Summary
2. Phrases and Classes
3. Analysis Record
4. Format Classes
5. Expression Recognition Routine
6. Format Routines
7. System Routines
8. Comments and Conclusions
9. References

AN ACM EXPOSITORY PAPER

Including a Comprehensive Discussion of the Major Features of the System

This paper was written while the author was employed as a consultant by System Development Corporation, Santa Monica, California. It was produced in connection with a research project sponsored by SDC's independent research program.

In a number of articles published during the past two years, R. A. Brooker and D. Morris (joined by J. S. Rohl in their most recent paper) have presented a very interesting programming system that they have developed for the Ferranti Atlas computer. The present paper describes some of the major features of their system. It expands on some points that the original authors cover briefly, and treats only very lightly some topics to which they devote considerable space.

The purpose of this paper is purely expository. Except in some very small details, and in some comments, it does not intentionally depart from or add to the material published in the listed references.

In the opinion of the writer, systems of this kind are well worth implementing and will provide useful research tools in the development of languages and techniques. This opinion is true even when such systems turn out to be of limited usefulness in producing production compilers, where compiling speed and object code optimization may be considered more important than language flexibility and elegance or generality of system organization.

1. Introduction and Summary

It is convenient to break the description of a compiler into two parts. One is the description of the source language - the formats in which the programmer may write source statements, and the classes of elements or phrases that may appear in such statements.

Formal descriptions of source languages have been published; the best known is the ALGOL report [6]. It has been noted by a number of writers [1, 2, 7, 8, 9] that it is relatively easy to produce standard service routines that will convert the formal definition of the syntax of a language like ALGOL into a set of tables in a computer memory. It is then also relatively easy to write a routine which, when given a source statement, will search the definition tables and determine the format of the source statement, and will provide an analysis of just how the symbols that make up the source statement satisfy the definitions and subdefinitions that define the format.

An analysis routine of this kind can be written so as to be independent of the actual definitions. It depends only on their form. It is therefore independent of the source language, being dependent only on the metalanguage.

The second part of the description of a compiler is a description of the actual processing routines or generators that take whatever action is indicated when a source language statement is recognized. This action may produce object language directly, or may add to or otherwise modify lists and indicators that guide future compiler action. It is assumed that the object language is implicitly defined in these routines.

The system discussed in this paper associates a macro-instruction (or generator) called a format routine with each format. A source statement is analyzed and found to be in a particular format. It is interpreted as a call on the format routine associated with that format. The analysis routine may be looked upon as a preliminary translator that converts the source statements into generator calls in a standard form.

The system provides a language in which the format routines can be conveniently written. This language is described in the same formal terms as the source language and is converted into tables by the same service routines. The language provides formats for the instructions that make up the format routines. A format routine consists of a list of such instructions which are themselves statements in system formats and are therefore calls on format routines.

The system contains a basic structure consisting of a number of basic instruction formats with built-in format routines, a general-purpose analysis routine and a number of service routines which handle general system sequencing, additions to and deletions from lists.

A compiler is built up on this structure by adding definitions of classes of phrases, definitions of source statement formats and definitions of statement formats that define the instructions that may be used in the format routines. Corresponding to each format, a format routine is added. The format routine itself is a list of statements in formats that are already in the system. (Their format routines need not be in the system until they are actually used.) In this way enough source statement formats are added to define a useful language, and at this point the system will act as a compiler, with this language as its source language. Additions to, changes to and deletions from the language can be made, and completely new languages can be introduced with relative ease.

2. Phrases and Classes

There is a set of elementary symbols that are recognized by the system. However, if it is convenient to make use of more symbols than are available in the character set that is recognized by the computer hardware, an input routine can recognize certain combinations of external symbols as the equivalent of single internal elementary symbols.

A class identifier is a string of elementary symbols enclosed in square brackets [ ]. The elementary symbols used in a class identifier may not include the symbols /, (, ], *, ?, since these symbols are used to terminate or modify class identifiers.

A phrase is a string of components which may be either elementary symbols or class identifiers. A phrase class is defined by writing a class identifier followed by the symbol = which is in turn followed by a list of phrases separated by commas. For example,

Cl   [ES] = 1, 2, ..., 9, 0, a, b, ..., z, +, -, ... , *
C2   [L] = a, b, c, d, e, ... , u, v, w, x, y, z
C3   [D] = 0, 1, 2, 3, 4, 5, 6, 7, 8, 9

The class [ES] is the class of all of the elementary symbols defined in the system. Subclasses of this class are the class [L] of letters and the class [D] of digits.

The , separates the alternatives or categories of a class. Definition C2 states that a member of class [L] is an a or b or c, etc. There are 26 categories in the class [L] and the category number of c in this class is 3. The category number of 3 in the class [D] is 4.

Similarly [D] can be said to be of category 3 in the class [PI] of defined phrase class identifiers.

Classes Cl, C2, C3 are defined in terms of elementary symbols alone. Their class identifiers can be used to define new classes of phrases. The definition

C4   [L, D] = [L], [D]

defines a class whose members are letters or digits. The class identifier [L] appearing in the definition means any phrase of the class [L]. Any letter is of category 1 with respect to the class [L, D]. Any number is of category 2 with respect to this class. The class

C5   [L,D] = [L] [D]

defines a class whose members are phrases of the form a9, b7, z4. This is an example of a class that has only one category. Category 1 in this class has the two components [L] and [D].

The special symbol * is used to define a string that may consist of any number of phrases of a class. In the definition

C6   [N] = [D*]

[D*] is interpreted as meaning a string of one or more [D]'s. [N] is the class of all integers.

The special symbol ? is used to indicate the optional presence of a member of a class as in the definition of the class [V] of variables,

C7   [V] = [L][L,D*?]

The ? indicates that a member of class [V] may consist of a member of class [L] alone or it may be a member of class [L] followed by a member of class [L,D*]. This means, of course, that a variable may be a letter or any combination of letters and digits beginning with a letter. In order to describe arithmetic expressions in an ALGOL-like language, the following definitions are introduced:

C6A    [N] = 1, 2, 3, 4, ... , 0
C8     [K] = [N], [N]., .[N], [N].[N]
C9     [P] = [K], [V], ([SAE])
CIO    [↑P] = ↑[P]
Cll    [F] = [P][↑P*?]
C12    [×/F] = ×[F], /[F]
CIS    [T] - [F][×/F*?]
C14    [±] = +,-
C15    [±T] = [±][T]
C16    [SAE] = [±?][T][±T*?]

The class [K] is the class of floating-point numbers. It would be more consistent with ALGOL to omit the second alternative [N]. These definitions define the class [P] of primaries, the class [F] of factors, the class [T] of terms and finally the class [SAE] of simple arithmetic expressions.

The definition of [N] in C6A replaces the definition in C6. In most cases it is convenient to treat [N] as a built-in class in which an integer is represented by the actual value of that integer. As pointed out in [5], the two definitions are not equivalent, since the definition in C6A will not distinguish between the strings 3, 03, 003, and so forth.

3. Analysis Record

To demonstrate that a string S of elementary symbols is a member of a particular class C, it is necessary to show that it satisfies the definition of one of the alternatives or categories of C. In general the categories of C will be defined in terms of class identifiers which may themselves be defined in terms of class identifiers. The question as to whether S is or is not a member of C can be answered only if a trace through a sufficient number of levels of class definitions leads to definitions in terms of elementary symbols against which it is possible to test the elementary symbols of S.

An analysis record is a diagram of a successful trace of this kind. The record takes the form of a multilevel branched structure which is a tree. There is a branch of the tree corresponding to each class identifier that appears in the trace. The branch points to a node which contains the selected category number within that class. There is a branch from this node corresponding to every component of the definition of this category which it itself a class identifier. If the definition is in terms of elementary symbols only, the node degenerates into a terminal point.

In constructing an analysis record, an optional class is always treated as a class with two alternatives, as if for each class identifier [CI] there existed a class definition [CI?] = [CI], NIL where NIL is the null class.

Figure 1 is the analysis record of the phrase y28 with respect to the class [V].

Fig 1 Analysis record of y28 with respect to [V]

The class [V] has 1 category which has the two components [L] and [L,D*?]. Category 25 of class [L] is a y which matches the first symbol in the string y28. The class [L,D*?] is the optional form of [L,D*] and has two alternatives: 1 = present and 2 = not present. The alternative 1 leads to the class [L,D*] which also has two alternatives: 1 = more than 1 present and 2 = only 1 present. The first alternative of [L,D*] has the two components [L,D] and [L,D*]. The component [L,D] leads to the terminal category 3 in the class [D] which corresponds to the elementary symbol 2. The class [L, D*] leads to the elementary symbol 8.

The &-symbol at the head of each branch symbolizes an address word when the tree is stored as a linear array in memory. The tree in figure 1 is stored as

Each category number and each address word (&) occupies a position in memory. An address word contains the address of the category word to which it points.

4. Format Classes

A format class is like a phrase class in that it is defined in terms of a list of categories which are phrases, i.e., strings of elementary symbols and class identifiers. Each category in a format class is called a format.

The basic distinction between format classes and phrase classes is that there is a format routine associated with each format (category) in a format class.

Format classes also differ from phrase classes in that format classes are open-ended. Their categories are submitted to the system individually, and the system itself can grow by the addition of new formats along with their associated format routines.

An example of a format class is the class [SS] of source statements. Examples of source statement formats as they would be defined to the system are:

SS1     FORMAT [SS] = [V] = [SAE] 
SS2     FORMAT [SS] = GO TO [V]

The phrase (statement)

y2 = a × u ↑ 2 + b × u

is a source statement which satisfies the definition of the format [V] = [SAE].

The source statement formats taken together define a source language. The format routines associated with the source statement formats are the generators or macro-instructions which perform translation and other compiling operations that convert a source language program into an output that is a program in an object language.

In addition to the source language the system provides a language in which the format routines themselves are written. This language is defined in terms of two additional format classes: the class [BS] of built-in statements and the class [AS] of auxiliary statements.

The format routines corresponding to built-in statement formats are built-in routines. They are programmed and put into machine code outside the system, and may be considered part of the initial state of the system. The built-in statement formats, which are discussed in detail in Section 6, define the basic instructions of the system from which all format routines are ultimately constructed.

Auxiliary statement formats are introduced to provide instruction formats that are convenient for use in writing format routines.

The format routines for auxiliary statement formats and for source statement formats are lists of instructions which may be basic statements, auxiliary statements or source statements. Figure 3 in Section 6.5 shows such a format routine as it is written by the programmer and as it would be punched on cards or paper tape to be submitted to the computer.

Auxiliary statements are placed in a separate class from source statements in order to save time when the expression recognition routine (see Section 5) is analyzing a source language program.

It is consistent with the system philosophy to have only a very few basic statement formats, since these form a rigid base for the system and cannot be changed within the system. However, the introduction of additional basic statement formats may have considerable effect on reducing compiling time, since it is usually much faster to execute a built-in routine than to execute an equivalent format routine written out in terms of system instructions. Brooker and Morris [4] discuss format classes other than [SS], [AS] and [BS] which might be introduced. The other possible format classes will not be treated here.

5. Expression Recognition Routine

When a source statement is submitted to the compiler, it is analyzed by a routine which is called the expression recognition routine, abbreviated as ERR.

The ERR attempts to match the source statement with one of the formats of the class [SS]. If it is not successful, the source statement is an illegal statement. The class [SS] is considered to be an ordered class, and it is assumed that if the source statement agrees with more than one format in [SS], it is the first one that is to be used.

In the case of a valid source statement the ERR will find a unique format, and thus a unique category number for the source statement in the class [SS]. In the process of determining that the statement actually satisfies the format, the ERR constructs the analysis record of the source statement with respect to the class [SS]. The analysis record is a tree whose trunk or initial node consists of the category number of the format in the class [SS], and address words leading to branches corresponding to each of the class identifiers that appear in the definition of the format.

Figure 2 shows the analysis record of the statement

y2 = a × u ↑ 2 + b × u

which corresponds to the format [V] = [SAE]. The figure shows the tree diagram and the equivalent linear array as it would be stored in the computer memory. The significance of the address words and category words in the analysis record has already been discussed in Section 2.

Fig 2 Analysis record of y2 = a × u ↑ 2 + b × u

The expression recognition routine uses a trial and error procedure to construct the tree which is the analysis record for a source statement. A node in the tree corresponds to a category in the class whose identifier is determined by the branch leading into the node.

The routine tentatively assumes a category number at the node, starting with number 1. The components of the category are then examined in turn. A component is either an elementary symbol, which can be checked against the next elementary symbol in the source statement, or it is a class identifier. If it is a class identifier an address word is inserted at the node and a branch leads to a node at the next lower level. The routine tentatively assumes a category number at this node and the process continues.

Whenever a category consists only of elementary symbols, these can be compared with the symbols in the statement being analyzed. If the comparison fails the routine backs up and tries the next category. If there are no more categories to try the routine backs up to the next higher level and tries the next category there. If the comparison is successful, the routine backs up and examines the next component at the next higher level. When all components at that level have led to successful comparisons, the routine backs up to the next component at the next higher level, and the process continues. A routine of this type can be described very simply hi terms of a recursive subroutine that calls itself in order to advance from one level to the next. An exit from the routine at any level represents backing up to the previous level. A routine of this type can be very time-consuming, even on a very fast computer. An important factor in determining the efficiency of the analysis procedure is the organization of the storage of class definitions in the computer memory.

Stored along with the definition of each class is a set of information words containing a bit corresponding to each of the elementary symbols. A 1 at a bit position states that members of that class exist which start with that symbol. A 0 states that no such members exist. The ERR can examine these information words to avoid a useless search through all the categories of the class.

The class definitions are organized so that categories that have a number of initial components in common can be recognized as such by the ERR. If the ERR has made successful comparison on k-1 components of one category, and then fails on the kth component, it will be able to proceed immediately to a test of the kth component of the next category provided that the first k-1 components of the two categories are identical.

The ordering of the source statement formats can affect the analysis time. If the most used formats are put first the average amount of search time will be improved.

6. Format Routines

6.1 THE BASIC LANGUAGE

In Section 4 it was pointed out that there is a format routine associated with each format. When a source statement is recognized as definitely belonging to a particular category or format, the format routine associated with that format is executed. This can happen only after the ERR has constructed the complete analysis record of the source statement in the computer memory. The category number of the format is an indirect address that locates the format routine in question. The rest of the analysis record is the data input - the set of parameter values for the format routine.

This section describes the basic language in which format routines are written.

The programming system as a whole contains a set of system registers referred to as Bl, B2, B3, .... These B-registers may correspond to hardware index registers on a computer that has enough such index registers. In most cases these registers are conveniently thought of as pointers or address registers. One of them, BI, will point to the current position in the input string. Another, BO, will point to the next position to be filled in the output string. BS will point to the first available location in the main working area and BF will point to the beginning of the routine that is currently being executed. Others will be defined as needed.

A format routine has access to the B-registers and to the locations to which they point. The notation (B3) is used to designate the contents of the location in main memory whose address is in B3.

A format routine may require temporary storage for its own internal purposes. It may refer to any number of A-registers, Al, A2, A3, ... . The system will allocate temporary storage registers equal in number to the number of A's used whenever the format routine is called.

The basic formats provide for the manipulation of numbers, A-registers, B-registers and the contents of locations specified by A-registers and B-registers. The following class and format definitions specify a useful set of basic statements. Additional basic statement formats will be introduced later.

[A] = A1, A2, A3,
[B] = B1, B2, B3,
[AB] = [A], [B]
[ABN] = [A], [B], [N]
[ADDR] = [AB] + [ABN], [AB] - [ABN], [AB]
[WORD] = [ADDR], ([ADDR]), - [N], [N], [OW]
[OW] = * Followed by an octal word
[OPERATOR] = +, -, ×, /, &, ∨
[COMPARATOR] = =, ≠, >, <, ≥, ≤
[IU] = IF, UNLESS
[LABEL] = [ABN]
BS1   [BS] = [AB] = [WORD]
BS2   [BS] = [AB] = [WORD][OPERATOR][WORD]
BS3   [BS] = ([ADDR]) = [WORD]
BS4   [BS] = ([ADDR]) = [WORD] [OPERATOR][WORD]
BS5   [BS] = JUMP [LABEL] 
BS6   [BS] = JUMP [LABEL][IU][WORD][COMPARATOR][WORD] 
BS7   [BS] = CALL R [ABN] 
BS8   [BS] = END

Instructions illustrating the first four of these formats respectively are:

A4 = 10 
B4 = B4 + B2 
(B4+1) = B6 
(B3) = (B2) + B1

The first of these will store the number 10, perhaps a tally, in the temporary location A4. The second will add the numbers in B4 and B2 and put the result in B4. The third will put the number in register B6 into the computer memory location whose address is found by adding 1 to the number in B4. The fourth will add the contents of register Bl to the number in memory at the location whose address is in B2, and put the result in memory at the location whose address is in B3.

A format routine is a list of instructions that are executed in sequence unless the sequence is broken by a conditional or unconditional transfer of control. To permit such transfers of control an instruction may be preceded by an integer label followed by a right bracket ]. For example:

2]   (B3) = (B2) + Bl 
3]   A5 = B2 + 5

The jump instruction formats then provide instructions like:

  JUMP 3
  JUMP 2 UNLESS A4 > 5

It is possible to set up and use switches. Thus one might write:

   JUMP A3
   JUMP A3 IF B3 = B4

These instructions must be preceded in the logical flow of the routine by instructions that set the value of A3 equal to an integer which appears as a label in the routine.

Each of the sample basic instructions listed above is a member of the class [BS] of basic statements, i.e., it corresponds to one of the formats (categories) of the class [BS]. The expression recognition routine can therefore form the analysis record of these instructions with respect to the class [BS].

As an example the instruction (B4+l) = B6 corresponds to the format ([ADDR]) = [WORD]. Its analysis record would be:

The symbol BS3 represents the category number of the format ([ADDR]) = [WORD] in the class [BS].

The analysis record with respect to a format class is also referred to as the analysis record with respect to the format within the format class.

6.2 PARAMETERS

A parameter is a class identifier that appears in an instruction in a format routine. An example is the class identifier [AB] in the instruction [AB] = [AB] + 1. This instruction is a member of the class [BS] corresponding to the format [AB] = [WORD]. An analysis record of [AB] = [AB] + 1 with respect to [BS] cannot be completed until a value is assigned to the class identifier [AB].

Instructions with parameters are stored as incomplete analysis records. A parameter word PI is used in place of an address to indicate the presence of a parameter in the statement being analyzed. A parameter word contains a parameter number and a tag that identifies it as a parameter word.

Before a typical instruction can be executed, values must be substituted for all of its parameters. A parameter substitution is effected by replacing a Pi-word corresponding to the parameter by an address word which is the address of an analysis record with respect to the class identifier which defines the parameter. As an example consider a very trivial routine:

ROUTINE [AS] = ADD [WORD] to SIMPLE LIST [AB] 
([AB]) = [WORD] 
[AB] = [AB] + 1

This routine would correspond to a format

AS1   FORMAT [AS] =  ADD [WORD]  to SIMPLE LIST [AB]

in the class [AS] of auxiliary statements. The routine has two parameters [WORD] and [AB] which are referred to as P1 and P2, respectively. It assumes that a member of class [AB], an A- or a B-register, is pointing to the next location to be filled in a simple list. It stores a [WORD] in that location and increments the pointer.

The instruction ([AB]) = [WORD] corresponds to the format ([ADDR]) = [WORD] and its analysis record with respect to this format is:

The analysis record of [AB] = [AB] + 1 which is of the format [AB] = [WORD] is:

The two instruction routine will be stored in the format routine library in the form:

An instruction like ADD (B4) to SIMPLE LIST B6 might appear as an instruction in another format routine. This instruction is a particular case of the format ADD [WORD] to SIMPLE LIST [AB] and has the analysis record:

which would appear in memory in the form:

This analysis record of an instruction that satisfies the format ADD [WORD] to SIMPLE LIST [AB] provides values for the input parameters of the corresponding routine. The first address word is the address of an analysis record with respect to [WORD] which can replace P1. Similarly, the second address word can replace the parameter word P2.

The class identifier that represents a parameter may be modified by a label or by a index or by both.

A label is a string in the form /[N]. It is used where it is necessary to distinguish between different parameters which may have the same class identifier.

As an example consider a simple list that is specified by two registers, one of which contains the starting address, and the second containing the number of entries already in the list. A format for adding a word to such a list might be

AS2   FORMAT [AS] = ADD [WORD] TO SIMPLE LIST [AB/l][AB/2]

and an associated format routine might be

ROUTINE [AS] = ADD [WORD] TO SIMPLE LIST [AB/l][AB/2]
([AB/1] + [AB/2]) = [WORD] 
[AB/2] = [AB/2] + 1

An index is a string in the form ([ABN]). It can be used only in connection with a class that appears in a definition with a * which indicates a string consisting of any number of members of that class, and it is used to refer to a particular member in the string. For example, one may refer to [±T(2)] and [±T(4)] to refer to the second and fourth [±T] appearing in a simple arithmetic expression [SAE] which satisfies the definition

[SAE] = [±?][T][±T*?]

Clearly [±T(2)] and [±T(4)] refer, respectively, to the third and fifth terms in the definition of the expression that is being analyzed.

It is possible to have a parameter in the form [P/5(A3)]. The value of A3 at the time the instruction containing this parameter is transplanted (see 6.3) is the value that is used.

All combinations of class identifiers and labels are treated as distinct parameters. The parameters [AB/1] and [AB/2] in the example given above would have the associated parameter words P2 and P3.

Parameters which have the same class identifier and label but different indices are treated as cases of the same parameter. They have the same P-number, but a Pi-word referring to an indexed parameter will contain the index as well as the number i. When the Pi-word is to be replaced by an address, the routine which does this replacement searches through that part of the tree corresponding to the starred class definition/label and determines the address word corresponding to the occurrence number determined by the index.

6.3 STORAGE OF FORMAT ROUTINES

It is now possible to describe in some detail the processing of a source statement at compile time. A source statement may be a statement that is to be translated into machine code, or it may contain declarations or other information relative to the compiling process. It can always be interpreted as a request that the compiler perform a task-translation in the first case, and making entries or modifications in its internal tables and lists in the second.

The expression recognition routine determines the format of the source language statement, and in the course of determining the format it constructs the analysis record of the statement with respect to the class [SS].

A main working area has been set aside for the processing of source statements. When an analysis record with respect to a format class is constructed in, or moved into, the main working area it is interpreted as a call on the associated format routine. The first word of the analysis record, the category number of the format in its format class, is the indirect address of the format routine. The address words in the analysis record provide the values that are to be substituted for the parameters in the instructions that make up the format routine.

A format routine is a list of instructions which may be source statements, auxiliary statements or built-in statements. Those instructions which are source statements or auxiliary statements are themselves interpreted as calls on format routines, which are lists of instructions which may themselves call on format routines, etc. The built-in statements cause built-in format routines to be executed, and the process terminates since all format routines are ultimately defined in terms of built-in statements.

It should perhaps be emphasized that a source statement that appears in a format routine is itself a call on a format routine. Its effect is to cause object language code to be generated, not executed.

The instructions of a format routine are stored as analysis records with respect to their format classes. Instructions that have parameters are stored as incomplete analysis records. An alternative form for storing instructions that contain no parameters is discussed in 6.7.

Within an instruction the address words are relative to the first word in the instruction. The instruction is thus stored in a relocatable form. An instruction is always moved into the next available space in the main working area when it is to be executed. The process by which an instruction in the form of an analysis record is moved into the main working area and modified for execution is called transplanting. Parameter substitution takes place during this transplanting operation, as does the replacement of relative addresses by absolute addresses.

6.4 EXECUTION OF FORMAT ROUTINES

Assume that the analysis record of the source statement with respect to the class [SSJ has been constructed in the main working area. This is a call on the format routine that is associated with the format of the statement, and the following sequence of operations is performed.

The pointer BS points to the next available location in the main working area. Information is now placed in locations BS to BS+3 that will serve as a link back to the next step in the calling routine after the execution of the format routine has been completed. BS is advanced by 4 and the pointers BE , BF and BP are set as follows:

BE

is set to the value BS and points to the beginning of the area allocated to the execution of the format routine. It also remembers the location of the return link.

BF

points to the beginning of the format routine in the format routine library.

BP

points to the instruction (initially the first) in the format routine that is currently being executed.
A number N has been stored with each format routine. N is the number of storage locations needed during execution of the format routine for absolute parameter addresses and for temporary storage. It is the number of different parameters Pi plus the number of different A registers Aj that occur in the format routine. The number of Pi is the number of different class identifier/label combinations that occur in the format routine. BS is now increased by N+l to allocate locations BE+1 to BE+N for this purpose. The location BE itself will be used to store the value of BS if and when the current routine calls in another format routine. The reason for storing this forward link is explained in 6.6.

A pointer BA is now set equal to the value of BS . BA is the value that BS must be set back to after each instruction in the format routine has been executed.
The structure of an analysis record is such that the first set of address words (up to the first category word) are the addresses of the subanalysis records that are the input parameters to the format routine. These addresses are now inserted into locations BE+1, BE+2,...
The analysis record which is the first instruction (in the general case, the next instruction) of the format routine is now transplanted into the main working area starting with location BA . BS is of course increased by an amount equal to the length of the analysis record. Some changes are made in the analysis record as it is moved. The address words which are relative to the beginning of the analysis record when it is stored as an instruction in the format routine in the library are now made into absolute addresses. Parameter addresses in the location BE+i are substituted for the parameter word Pi. In addition, special parameter words and temporary storage words are modified by the address BE as described in 6.6.
The transplanted analysis record corresponds to a format in one of the format classes and is now interpreted as a call on the associated format routine. If the format class is the class [BS] the format routine is a built-in routine which is now executed. If the format class is [SS] or [AS], BS is stored in (BE) as a forward link and the values of BE , BF, BP and BA are now stored as a return link in BS to BS+3 and the entire process described in steps 1-5 is repeated.

Whenever a built-in routine that is not a jump instruction is executed, BS is set back to the value BA , thus making available again the space taken up by the instruction (the analysis record) that called in the built-in routine. BP is advanced to the next instruction in the format routine pointed to by BF, and the process continues with the transplanting of the next instruction. The execution of jump instructions is discussed in 6.6.

If the next instruction is the instruction END, the execution of the current format routine has been completed. Locations BE-1 to BE-4 contain the return link, the values that BE, BP , BF and BA had before the current format routine was called. These B values are now restored. The format routine just completed was called by an instruction in the format routine pointed to by the previous value BF. That format routine now proceeds to its next instruction, and the process continues until it finally reaches the END instruction in the format routine associated with the original source statement. The link at this level is a return to the master sequencing routine which now proceeds to the next source statement.

During the execution of a format routine, control may go up and down through many levels of format routines calling on other format routines which call on still other format routines. It is possible for the same format routine to appear at more than one level, and it is clear that the procedure described above will handle such recursive calls on format routines.

6.5 PARAMETER OPERATIONS

The class identifiers that appear in a format definition are the input parameters of the associated format routine.

One of the most important characteristics of this system is the fact that instructions in format routines may involve parameters that are not explicitly defined as input parameters. Instructions may contain as parameters not only the class identifiers in the format definition, but also any class identifiers that are used in defining the class identifiers that appear in the format definition. The analysis process carried out by the expression recognition routine makes available as a parameter any class identifier that corresponds to a branch in an analysis record with respect to the format.

The substitution of values for input parameters is done automatically when the format routine is called. The substitution of actual values (addresses) for other parameters is handled by a number of basic instructions which are among the instructions referred to as parameter operations.

An example of a nontrivial format routine will help explain how some of these operations are used.

ROUTINE [SS] = [V] = [SAE]	01
LET [SAE] = [±?][T][±T*?]	02
Ac - [±?][T]	                03
JUMP 1 UNLESS [±T*?J - {±T*]	04
Al = NUMBER OF [±T*]	        05
A2 = 1	                        06
2] LET [±T*(A2)J = [±][T]	07
Ac = Ac [±][T]	                08
A2 = A2 + 1	                09
JUMP 2 IF A1 ≥ A2	        10
1]   [V] - Ac	                11

FIG. 3. An example of a format routine

Figure 3 is a format routine associated with FORMAT [SS] = [V] = [SAE]. The routine is presented as it would be written and punched. During the compiler building phase of system operation, a service routine which makes use of the expression recognition routine converts this input form into the standard form in which it is stored in the format routine library.

Figure 2 is an analysis record with respect to this format, and therefore represents a call on the format routine in Figure 3. Assume that the analysis record of Figure 2 is in memory at locations 1000 to 1098. A return link is set up in locations 1099 to 1102 and BE is set to 1103 indicating the beginning of the memory area to be used by the format routine.

This format routine has 7 parameters (Brooker and Morris [5] would count the format class identifier [SS] as a parameter and would count eight psarameters in this routine.) which are labeled P1 to P7 and it makes use of two A-registers. The value of N for this routine is 9, and the locations 1104 to 1112 are set aside as follows:

P1 = [V]          1104
P2 = [SAE]        1105
P3 = [±?]         1106
P4 = [T]          1107
P5 = [±T*?]       1108
P6 = [±T*]        1109
P7 = [±]          1110
Al                1111
A2                1112

The input parameters are the first pair of address words in the analysis record in Figure 2. These addresses, 1003 and 1014, are now inserted into 1104 and 1105.

Line 02 contains the first instruction of the format routine. It is an instruction corresponding to the new basic format

BS9   FORMAT [BS] = LET [PI] = [RESOLVED-PJ

The members of the class [PI] are all the class identifiers that have been defined. In an instruction in this format, [PI] will be replaced by a particular class identifier, and [RESOLVED-P] will be replaced by a phrase which represents a member of that class. This phrase will contain class identifiers which are parameters in the format routine in which the LET [PI] = [RESOLVED-P] instruction is contained. The purpose of the instruction is to substitute actual values for the parameters that appear in [RESOLVED-P]. The actual values are addresses taken from the analysis record that called the format routine. The parameters for which values have been substituted can be used as parameters in subsequent instructions of the format routine.

An instruction in the format

LET [PI] = [RESOLVED-P]

is stored as an analysis record with respect to that format. An analysis record with respect to [RESOLVED-P] is formed as if for every class identifier CI there existed a definition [RESOLVED-CI] = [CI].

Brooker and Morris use an information item (see 6.6) rather than a simple parameter word like PI in an analysis record with respect to [PI]. The simple parameter word can be used here because the instruction will be executed by a built-in routine.

The analysis record of instruction 02 is:

The primed P-words are specially marked to indicate that the values of these parameters are not known and therefore values cannot be substituted for these P-words when this instruction is transplanted for execution. The execution of the instruction will result in the substitution of addresses for the P'-words.

This instruction is transplanted into locations 1113 to 1121. During the transplanting the address 1014, which is the value of P2, replaces the parameter word P2.

In a practical implementation of the system it will probably be necessary to mark the end of an instruction, either by a special word or by a tag in the last word. It is assumed here that there is a tag of this kind in order to recognize the end of an instruction.

The format routine for LET [PI] = [RESOLVED-P] is a built-in routine, so that the four-word link is not established. During execution, the analysis record starting from P2 which now points to 1014 is compared with the analysis record with respect to [RESOLVED-P] starting at the branch corresponding to [SAE]. Unless there is an error of some kind, these will be identical except in those cases where a terminal P'-word in the latter corresponds to an address word in the former. This address word is the value of the parameter and is substituted into the position in the parameter list determined by adding BE to the parameter number. In this case, it is found that the parameters P3, P4 and P5 have the values 1018, 1019 and 1060, and these values are substituted into locations 1106, 1107 and 1108, respectively.

During the execution of this instruction BA was equal to 1113. BS is now set back to 1113 and instruction 03 can now be executed. Instruction 03 is an auxiliary statement corresponding to the format

ASS   FORMAT [AS] = Ac = [±?][T]

The associated format routine will not be given here. The symbol Ac is a special symbol designating an accumulator, and the format routine constructs and places in the output string the coding that puts the value of [±?][T] into the accumulator. Instruction 03 is stored as an analysis record with respect to this format in the form: AS3 P3 P4.

Instruction 03 is transplanted into 1113 to 1115 and the parameter values that are now known, 1018 and 1019, are substituted for P3 and P4 in locations 1114 and 1115, respectively.

The transplanted instruction 03 is now a call on the format routine associated with AS3. The values BE, BF, BP, BA are placed in locations 1116 to 1119 as a return link, and BE is set to 1120. The parameters for this format routine will start in location 1121, and the first two will be input parameters whose values come from the first set of address words in the call. These are the values 1018 and 1019 in locations 1114 and 1115. The placing of these values in 1121 and 1122 passes these parameter values down to the format routine for Ac = [±?][T], and illustrates the process by which parameter values are transmitted to subroutines.

The format routine for Ac = [±?][T] is now executed, instruction by instruction, in the area starting from 1120. It may call other format routines which extend farther into memory, which may in turn call other format routines themselves. Eventually all the instructions of the Ac = [±?][T] routine are executed and the routine exits by resetting the B values stored in the link at 1116 to 1119. The value of BA is restored to 1113 and BS is reset to this value.

The substitution of a parameter value may also be done by an instruction in the format

BS10   LET [PI] = [GENERATED-P]

Just as in the case of [RESOLVED-P], the expression recognition routine implicitly assumes that for every class identifier [CI] there exists a class definition of the form [GENERATED-CI] = [CI].

In the case of LET [PI] = [RESOLVED-P], the value of the parameter corresponding to [PI] is known and the execution of the instruction inserts values for the parameters that appear in [RESOLVED-P]. In the case of LET [PI] = [GENERATED-P], values must already have been substituted for the parameters that appear in [GENERATED-P], and the execution of the instruction substitutes a value for the parameter corresponding to [PI].

An example given in [5] is

LET [WORD/1] = (B8 - [ABN/2])

Assume that [WORD/1] corresponds to P4 and [ABN/2] corresponds to P5 in the format routine in which the instruction occurs. P5 is either an input parameter or it has appeared on the right-hand side of a previous LET [PI] = [RESOLVED-P] instruction. The analysis record for the present instruction is

When this analysis record is transplanted the known value of P4 in the parameter list will be substituted for P4. P5' is a marked parameter word and no substitution takes place. Instead the routine which executes LET [PI] = [GENERATED-P] will place in the P5 position in the parameter list the address marked P5 in the above analysis record. The transplanted analysis record will remain in working storage as an extension of the analysis record that called the format routine. This is accomplished by increasing PA so that it points beyond the end of the transplanted analysis record.

Instruction 05 in Figure 3 corresponds to the format

BS11    FORMAT [BS]  =  [AB] =  NUMBER OF [PI]

The class identifier that replaces [PI] in an instruction of this format must be in the form [CI*]. If the string corresponding to [CI*] contains only one member of [CI], the category number with respect to the first [CI*] will be 2. If there are more than one, the category number will be 1 and the second address word will point to the next [CI*]. The routine that interprets this instruction starts from the beginning of the analysis record with respect to [CI*] and counts the number of appearances of category 1 before category 2 appears. Similarly, the routine that interprets instruction 07 will examine the current value of A2. Then starting from the beginning of the analysis record with respect to [±T*], it will trace through the analysis record until it finds the analysis record corresponding to the A2-nd member of [±T] in the string [±T*].

Other formats similar to BS11 are

BS12   [AB] = CATEGORY OF [PI] 
BS13   [AB] = CLASS OF [PI]

BS12 places in [AB] the first category number that appears in the analysis record with respect to [PI]. BS13 finds the category number of the particular [PI] with respect to class [PI].

6.6 LABELS AND INFORMATION ITEMS

Instructions 04 and 10 in the format routine in Figure 3 contain jumps to the labels 1 and 2, respectively. In the most general case jumps to labels (switches) like A5 and B15 may be included. In order to handle this case, a label directory is stored along with the format routine. This directory has one entry for each labeled statement. The position within the directory corresponds to the label number. The entry in the directory is the address of the labeled statement. It is the value that must be placed in BP to execute a jump to that label.

Instruction 04 is an instruction in the format

BS14   JUMP[LABEL] [IU] [PI] = [RESOLVED-P].

Its analysis record could be stored as:

It would not actually be stored this way for a number of reasons.

For the handling of labels it is convenient to introduce a special [ABN] word which contains an indicator to distinguish the three categories [A], [B], or [N]. This special [ABN] word can replace that part of the analysis record corresponding to [ABN]. In the case of a format routine in which there are no switches, as determined by the absence of any jumps to an [AB], even the special [ABN] word need not be used. It can be replaced in the analysis record by the absolute address that would normally be carried in the label directory of the format routine, and in this case the format routine need not contain an explicit label directory.

The analysis record that was constructed for instruction 04 is adequate only because of the fact that instruction 04 is to be executed by a built-in format routine. The class identifiers [LABEL], [PI], [RESOLVED-PJ and [A] require special treatment when they occur in formats that are not in the format class [BS]. They will normally not occur in the format class [SS], but they can occur in formats of class [AS].

Consider first the class [LABEL]. A label like 1 or 2 is defined only in the format routine within which it occurs. The label value is a reference to the label directory in that format routine. It cannot be passed down as a parameter value to routines on a lower level without additional information that will permit the lower level routine to locate that particular format routine.

So far, analysis records have consisted of address words, category words and parameter words. An additional kind of word, the block word, is now introduced to permit the analysis record to carry additional or special information where necessary. In addition to an identifying tag, a block word Bn contains an integer n and indicates that it is followed by an n-word information item.

When the identifiers [LABEL] or [PI] occur in a format, the analysis record of an instruction in that format will contain a corresponding Bn-word and a two- or three-word information item. One of the words in the information item is a blank which is filled in with the value of BE when the instruction is transplanted for execution. The address word pointing to the information item is the value of the parameter that is transmitted from level to level. Regardless of where it occurs, it always permits the recovery of the value that BE had in the format routine in which the parameter was defined. The word in BE is a forward link. It points to the location of the link addresses which are the BE, BF, BP and BA for the original format routine. In the case of a jump to a [LABEL], these values permit the jump to be executed in the format routine in which it occurs.

For a label, the second information word is the [ABN] word described above. Corresponding to a [PI], there is a second information word which is an ordinary parameter word and a third information word which is the category number of the particular member of [PI] in the class [PI].

For the same reasons as discussed above, the Pi-words in an analysis record with respect to [RESOLVED-P] are modified when transplanted by the addition of BE. This makes them absolute memory references, which can be transmitted as such. Similarly references to an Aj are converted into absolute references by adding BE+( number of parameters) to the value of j.

The right-hand part of instruction 04 is handled just like an instruction in the format LET [PI] = [RESOLVED-P], except that the actual substitution of values for parameters is performed only if the equality condition is satisfied.

6.7 PRIMARY ASSEMBLY ROUTINES

During compilation the system acts as an interpretive system. The format routines are written and submitted to the computer as sequences of instructions in system formats. These instructions are translated into a standard format by the ERR but they are not translated into machine language. They are stored as encoded calls on format routines which are the subroutines, generators or macro-instructions of the system. The actual calls are made during the compilation, which is run time from the point of view of the format routines.

The reason for this interpretive treatment of instructions in format routines is to permit the very general parameter substitution scheme that was described in 6.4. There is no advantage gained in handling instructions without parameters in this way, and an alternative procedure is provided in this case.

When a format routine is added to the system, each instruction is submitted to the ERR which determines its format and sets up the corresponding analysis record. If the instruction has parameters, the resulting incomplete analysis record is stored as the instruction.

If an instruction has no parameters, a routine (a primary assembly routine) associated with the format will be called in to translate the instruction into a sequence of machine-code instructions which are stored in the format routine in place of the instruction.

It is not necessary to have such a primary assembly routine for every format, since instructions without parameters can be stored as complete analysis records and executed in the standard way. The only reason for the use of primary assembly routines is the saving of time during compilation, a saving that can be quite considerable.

Format routines will therefore be stored in the library as analysis records and machine-code sequences. The execution of a format routine proceeds as described in 6.4 except that whenever the next instruction to be executed is stored as a machine code sequence, that sequence is executed at that time.

7. System Routines

Routines which are considered to be service routines of the system may themselves be written just like format routines, as lists of instructions in system formats. They may be executed by the same mechanism which executes the format routines. The instruction format BS7 FORMAT [BS] = CALL R [ABN] may be used within such routines, and also within format routines to call system routines as subroutines. The routines are numbered, and an instruction like CALL R25 will cause subroutine number 25 to be executed.

The system may be modified and extended by the addition of new system routines written in terms of system instruction formats.

8. Comments and Conclusions

In handling declarations in a language like ALGOL, the format routine which is called in by the declarations that appear in a block head would call on the table building routines of the system to enter new identifiers into appropriate classes - real, integer, Boolean, array, etc. The format routine called at the end of a block would call routines which make the necessary deletions.

It is probably desirable, in handling ALGOL-like languages, to have a special built-in routine for recognizing variables, and perhaps another for searching tables of variables.

The compiling phase of the system has been described as a one-pass compiler. There seems to be no reason why the same techniques could not be used in more elaborate multipass compiling systems.

The system as described here can be implemented without great difficulty on almost any large computer. The chief problems are those of time and space. It was originally proposed for the Ferranti Atlas which has a unique organization of storage that permits the programmer to refer directly to a very large storage area consisting of cores and drums. The executive routine, aided by the hardware design, permits very rapid access to this extended storage. In more conventional systems the time taken for references to intermediate storage may be so great as to make the size of core storage the practical limit on the size of the system. Ingenuity in the packing of information can save core storage. Brooker and Morris in [3] propose an elaborate packing scheme for analysis records, but in [5] they indicate that they will not use it because of the time consumed in packing and unpacking.

The very elegance and generality of the system is bound to make it relatively slow. As pointed out earlier there are ways in which speed can be improved at the expense of generality. In an extreme case one can go back to a rigid but very fast compiler.

More experience with systems of this type will show just how much time is sacrificed and whether the compensations are adequate. Probably even on the Atlas, and certainly on other computers, compilers produced by more conventional techniques will exist alongside of compilers produced by systems of this type, and it will be possible to make some very interesting comparisons.

9. References

1. BROOKER, R. A., AND MORRIS, D. An assembly program for a phrase structure language. Comput. J. S (I960), 168.

2. BROOKER, R. A., AND MORRIS, D. Some proposals for the realization of a certain assembly program. Comput. J. S (1961), 220.

3. BROOKER, R. A., AND MORRIS, D. A description of mercury autocode in terms of a phrase structure language. In Second Annual Review of Automatic Programming, Pergamon, New York, 1961.

4. BROOKER, R. A., AND MORRIS, D. A general translation program for phrase structure languages. J. ACM 9 (Jan. 1962), 1.

5. BROOKER, R. A., MORRIS, D. AND ROHL, J. S. Trees and routines. Comput. J. 5 (1962), 33.

6. NAUR, P. (Ed.). Report on the algorithmic language ALGOL 60. Comm. ACM S (May I960), 299.

7. IRONS, E. T. A syntax directed compiler for ALGOL 60. Comm. ACM 4 (Jan. 1961), 51.

8. IRONS, E. T. PSYCO, the Princeton syntax compiler. (A manual) Institute for Defense Analysis, Princeton, N. J.

9. WARSHALL, S. A syntax-directed generator. Proc. Eastern Joint Comput. Conf., Washington, D.C., Dec. 1961.