A Parameterized Compiler based on Mechanical Linguistics

Howard H Metcalfe

1964

Planning Research Corporation, Los Angeles, California, Washington D.C.

SUMMARY

A multiplicity of problem-oriented and machine-oriented languages for computer programming exists today. Many compilers have been constructed to translate between these languages. This report presents a technique for reducing the number of compilers necessary. A translation algorithm is presented, capable of being conveniently parameterized for various source language-target language pairs. Concepts are drawn from modern linguistic theory, and practical considerations in implementation and application are discussed.

FOREWORD

THE author acknowledges the support provided by Planning Research Corporation and the Naval Command Systems Support Activity in the development of this report.

In June of 1962, the Los Angeles chapter of the Association for Computing Machinery established a special interest group on programming languages. A workshop on syntax-directed compilers was formed within the special interest group in September of 1962. Working as individuals with close co-ordination, the workshop produced four experimental compilers based on syntactical notation. This report introduces the concepts and techniques applied in one of these compilers, as developed by the author.

Particular acknowledgment is due to A. E. Glennie of Carnegie Institute of Technology for many of the basic ideas contained in this report. Many helpful suggestions and discussions were contributed by members of the workshop: L. A. Beghtol (Autonetics), J. W. Bryner (Ramo Wooldridge), E. Manderfield (Computer Sciences Corporation), L. Schmidt (Beckman Instruments), C. J. Shaw (System Development Corporation), R. Unger and L. Yarbrough (North American Aviation), and Val Schorre (UCLA). R. F. Mclntosh of PRC made significant contributions to the author's general orientation. Finally, the RAND Corporation provided the author with random access to their computing facility, permitting the actual construction and operation of the compiler.

I. INTRODUCTION

The proliferation of types of computers and varieties of programming languages is a continuing trend in the computer sciences today. Over eighty general-purpose, stored-program digital computers were recently listed as commercially available in the United States. Many of these are of widely divergent characteristics. There are at least twenty major problem-oriented programming languages in use today, excluding different versions of the same language for different computers. There may be 200 or more compilers extant to do the translation job from problem-oriented languages (POL's) to machine-oriented languages (MOL's). (We include here both descriptive and procedure-oriented languages.)

A general trend towards fewer, somewhat standardized POL's has become apparent in the past few years. ALGOL and FORTRAN in the scientific field, COBOL in the commercial field, JOVIAL in the military field, have made progress towards becoming acceptable common (if not standard) languages. However far this trend continues, the need for special-purpose languages (symbol-manipulation, information storage and retrieval, concept-processing, tool-manipulation, etc.), the need for object program efficiency (i.e. fitting the language to the machine), and the desire for extensions of present-day languages in order to adapt to the problems of the future, will become limiting factors on complete standardization.

It is not conceivable in a competitive economy that a single machine or class of machines will ever become the sole instrument for computation. Aside from the proprietary considerations, the same factors mentioned above which contribute to POL proliferation also apply to MOL (hence machine) proliferation.

Compiler writing in the past has generally proved to be an expensive and time-consuming occupation. Exceptions to this rule have begun to appear recently. Compilers such as NELIAC, which were written by a few people in a relatively short period of time and compile fairly efficient programs quickly, have been successfully implemented. However, there is some limit to how cheaply, how quickly, how well any compiler can be written. And, in general, one entire compiler must be written for each POL-MOL combination.

Given the many-language, many-machine problem, the SHARE organization in 1958 proposed a conceptually simple solution. They proposed that a universal computer-oriented language (UNCOL) be specified, such that any POL could be translated to UNCOL, and UNCOL could be translated to any MOL. This would have reduced the M-language, N-computer problem from M × N full compilers to M + N partial compilers. At the time of writing this report, no specification of an UNCOL has been successfully implemented and accepted.

A variant on the UNCOL theme has been applied to the System Development Corporation's JOVIAL compilers. For the single POL, JOVIAL, SDC has developed an intermediate language (IL). All versions of JOVIAL can be translated to IL. IL can be translated to a variety of machines in the large-scale binary class, such as the IBM 7090, CDC 1604, Philco 2000, etc. Hence, for each new machine, only part of a compiler needs to be written. Other advantages accrue from the fact that the JOVIAL-IL translator is capable of bootstrapping itself onto new machines, since it is written in JOVIAL.

In 1960, another approach to the proliferation problem became apparent. With the increasing awareness of the relation of programming languages to natural languages, and the development of constituent <phrase structure> grammars in linguistic theory, the rigorous specification of programming languages was made possible. With the ability to specify the structure of a language, the possibility of processing such a specification for purposes of translation arises. In terms of this report, this means the construction of a single compiler which will accept as parameters the specifications of the languages between which it is to translate. In other words, we shall build a single compiler which can translate any one of several source languages (which may be POL's) to any one of several target languages (which may be MOL's), by the simple insertion of a set of parameters which by themselves rigorously define the two languages and their relationships.

This report deals with one method for the construction of such a compiler. We may note at this point that the source language is not necessarily restricted to POL's, nor the target language to MOL's, but in general may apply to any two languages definable by our set of parameters.

Throughout this discussion we have assumed that the compiler itself runs on the machine for which it is compiling. We might note that this is certainly not necessary, and indeed in practice one often finds that a manufacturer will provide a copy of a compiler, for a new but as yet unavailable machine, which operates on a different but currently available machine for purposes of early programming and checkout.

Let us now summarize the solutions to the proliferation problem that were discussed or implied above:

  1. Standard POL: requires N compilers, one for each machine.
  2. Standard MOL: requires M compilers, one for each language.
  3. Standard POL and MOL: requires 1 compiler.
  4. Standard UNCOL: requires M + N partial compilers.
  5. No standards: requires M × N compilers.
  6. Parameterized compiler: requires 1 compiler plus M × N sets of parameters.

As mentioned above, we shall develop one technique for accomplishing alternative (6).

Two sets of parameters must be provided to our compiler in order to carry out a specific translation: a specification of the source language and a specification of the target language. Our method will employ an expanded form of constituent <phrase-structure> grammar notation to describe the source language. We shall construct our target language description in some form consistent with and embedded in the source language description. We shall refer to the total specification of the source and target languages as a grammar, and to the notation in which it is (preliminarily) written as a metalanguage. Thus, we are limiting our scope of translation to those language pairs completely definable by phrase-structure analysis. Such analysis can be referred to as parsing. Indeed, translation by our compiler will depend upon the capability of the compiler to parse the source language, i.e. to discover the specific structure of the source language in terms of the given grammar.

II. AUTOMATIC PARSING

A. Diagramming

As we may recall from elementary school, an English sentence can be parsed by the technique of diagramming the sentence. For example, the sentence The boy sees a tree may be diagrammed as follows:

sentence
   subject
      noun phrase
         article   THE
         noun      BOY
   predicate
      verb         SEES
      object
         noun phrase
            article   A
            noun      TREE

Such a diagram displays the syntax (grammatical structure) of a sentence in a tree fashion. Each node represents a constituent (or phrase) of the syntax. The upper-case words are the ultimate constituents (or basic words) and the lower-case words are the syntactical constituents of the sentence.

We may describe the general syntax of this type of sentence in a more elegant fashion, in order to display all possible well-formed sentences employing this structure and vocabulary:

  1. Sentence → subject + predicate
  2. Subject → noun phrase
  3. Noun phrase → article + noun
  4. Predicate → verb + object
  5. Object → noun phrase
  6. Article→THE, A
  7. Noun → BOY, TREE
  8. Verb → SEES

The arrow separates a syntactical constituent name from its definition in terms of other constituents. The plus sign connects constituents which are defined to appear contiguously. The comma separates alternative definitions of a constituent. The set of definitions given above comprises the grammar of this type of sentence, and the notation used to describe the grammar is our metalanguage.

We might note that the syntax of our example is completely defined by our grammar, and that every sentence of this type is so defined. Thus, given such a sentence and grammar, it is certainly possible to parse the sentence, i.e. discover its syntactical structure.

In order automatically to parse, we shall define two basic techniques in terms of our example:

  1. Matching: or starting with the ultimate constituents (basic words) and working up the grammar tree.
  2. Searching: or starting with the final constituent (sentence in our example) and working down the grammar tree.

B. Matching

Matching examines the ultimate constituents and attempts to transform them into phrases by matching them against the definitions in the grammar. The resulting constituents are matched again against the definitions and transformed into higher order constituents, and so forth until the root constituent is reached.

For example, using our sample sentence (where the arrow indicates a successful match and the corresponding transformation) (Subscripts refer to position of constituents among similar constituents in the sentence.):

Scan 1
THE → article1    
BOY → noun1    
SEES → verb
A → article2
TREE → noun2    
Scan 2
article1 + noun1 → noun phrase1
article2 + noun2 → noun phrase2
Scan 3
noun phrase1 → subject
noun phrase2 → object
Scan 4 and 5
verb + object → predicate
subject + predicate → sentence

It can be shown that the matching can be achieved in one left-to-right scan, but for clarity we show the simpler procedure above. We also note that a trace of the matching process is essentially equivalent to the diagram of the sample sentence.

However, in examining such a trace as that given, we see that in order to match noun phrase, we must have knowledge of a constituent other than those immediately involved in the matching. We must know whether the constituent preceding the article and noun is a verb, in order to decide whether the proper transformation is to subject or object. This reflects an inherent problem with the matching technique. In some grammars, in fact, it may be impossible to match many constituents without extensive knowledge of all the other constituents, or without expanding the grammar to an unwieldy size.

Matching also requires proper ordering of definitions in (or of sequencing logic through) the grammar, in order to minimize the total number of matches. Unless this is done, the procedure will be intolerably slow.
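
In modern terms, the matching technique can be sketched as a repeated reduction of the constituent list. The following Python sketch (ours, not part of any of the workshop compilers) encodes the sample grammar as rewrite rules; as noted above, the subject/object choice needs context, which we encode as an explicit check on the preceding constituent:

```python
# Bottom-up matching for the toy grammar, as a sketch.
GRAMMAR = [
    (("THE",), "article"), (("A",), "article"),
    (("BOY",), "noun"), (("TREE",), "noun"),
    (("SEES",), "verb"),
    (("article", "noun"), "noun phrase"),
    (("verb", "object"), "predicate"),
    (("subject", "predicate"), "sentence"),
]

def reduce_once(constituents):
    """Apply one transformation; return the new list, or None if no rule applies."""
    for i in range(len(constituents)):
        # Context-dependent rule: a noun phrase becomes an object after a
        # verb, and a subject otherwise.
        if constituents[i] == "noun phrase":
            role = "object" if i > 0 and constituents[i - 1] == "verb" else "subject"
            return constituents[:i] + [role] + constituents[i + 1:]
        for rhs, lhs in GRAMMAR:
            if tuple(constituents[i:i + len(rhs)]) == rhs:
                return constituents[:i] + [lhs] + constituents[i + len(rhs):]
    return None

def match(words):
    """Reduce until no rule applies; well-formed iff only 'sentence' remains."""
    constituents = list(words)
    while (step := reduce_once(constituents)) is not None:
        constituents = step
    return constituents == ["sentence"]
```

A trace of the successive values of `constituents` reproduces the scans shown above, one transformation at a time.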

C. Searching

Searching is essentially a heuristic procedure, i.e. a procedure which establishes goals and subgoals and attempts to relate its environment to these goals by manipulating it in a trial-and-error fashion. As applied to our case, we want to establish sentence as our final goal. We then search down through the grammar tree, establishing each constituent as a subgoal, until we can match the ultimate constituents against the given sentence. We examine all possibilities (i.e. all branches of the tree), until either we have successfully matched (in order) the given sentence with ultimate constituents, or we have failed to do so. The former result defines a well-formed sentence, the latter an ill-formed sentence.

For this purpose, let us rewrite our grammar in the form of a search procedure, or program, in which each definition is used as a closed subroutine. Flag is a variable which may be set to true or false.

In this parsing program we have several steps, some of which are labelled. We use various typical computer-like operations (Call, Return, If, Match, Display, Stop) in these steps, which may have as operands either step labels (Call), ultimate constituents (Match), or the flag variable (If, Display).

              Call    sentence.
              Display    flag.      
              Stop.    
Sentence:     Call       subject
              If         flag = false, return.
              Call       predicate.
              Return.    
Subject:      Call       noun phrase.
              Return.    
Noun phrase:  Call       article.                  
              If         flag = false, return.
              Call       noun.
              Return.    
Predicate:    Call       verb.
              If         flag = false, return.
              Call       object.
              Return.    
Object:       Call       noun phrase.
              Return.    
Article:      Match      THE.
              If         flag = true, return.
              Match      A.
              Return.      
Noun:         Match      BOY.
              If         flag = true, return.
              Match      TREE.
              Return.    
Verb:         Match      SEES.
              Return.    

Match sets flag to true if the word given as its operand matches the current sentence word, or to false if not. If a match is made, it also advances to the next word in the sentence.

A trace of the execution of our parsing program, when presented with our sample sentence The boy sees a tree, might appear as follows:

Call        sentence, 
Call        subject, 
Call        noun phrase, 
Call        article. 
Match       THE.          Set flag to true and advance a word. 
Return.                   From article.
Call        noun. 
Match       BOY.          Set flag to true and advance a word. 
Return.                   From noun.
Return.                   From noun phrase.
Return.                   From subject.
Call        predicate.
Call        verb
Match       SEES          Set flag to true and advance a word. 
Return.                   From verb.
Call        object.
Call        noun phrase.
Call        article.
Match       THE.          Set flag to false. 
Match       A.            Set flag to true and advance a word.
Return.                   From article.
Call        noun. 
Match       BOY.          Set flag to false.
Match       TREE.         Set flag to true and advance a word. 
Return.                   From noun.
Return.                   From noun phrase.
Return.                   From object.
Return.                   From predicate.
Return.                   From sentence.
Display Flag.             Which is still set to true.
Stop.

We have made a single left-to-right scan of the simple sentence. The subroutine calls acted as a memory of the goals and subgoals that were examined. The trace of the execution of the parsing program is essentially equivalent to the diagram of the sample sentence. No knowledge of constituents other than those immediately involved is required, since such information (for the entire sentence) is embedded in the sequence of subroutine calls. No ordering of the definitions is required, since each definition is treated as a closed subroutine. Compared to the matching technique described previously, only the minimal amount of grammar scanning is done, and the entire parsing of a sentence is accomplished with far greater speed.
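
In modern terms, the search procedure is a recursive-descent recognizer: each definition becomes a subroutine, and the host language's call stack serves as the memory of goals and subgoals. The following is a Python sketch (ours) of the parsing program above; the final check that all words were consumed is our addition, in place of simply displaying the flag:

```python
# Recursive-descent sketch of the search procedure for the toy grammar.
class Parser:
    def __init__(self, words):
        self.words = words
        self.pos = 0
        self.flag = False

    def match(self, word):
        # Match: set flag, and advance a word on success.
        self.flag = self.pos < len(self.words) and self.words[self.pos] == word
        if self.flag:
            self.pos += 1

    def sentence(self):
        self.subject()
        if not self.flag:
            return
        self.predicate()

    def subject(self):
        self.noun_phrase()

    def noun_phrase(self):
        self.article()
        if not self.flag:
            return
        self.noun()

    def predicate(self):
        self.verb()
        if not self.flag:
            return
        self.object_()

    def object_(self):
        self.noun_phrase()

    def article(self):
        self.match("THE")
        if self.flag:
            return
        self.match("A")

    def noun(self):
        self.match("BOY")
        if self.flag:
            return
        self.match("TREE")

    def verb(self):
        self.match("SEES")

def parse(text):
    p = Parser(text.upper().split())
    p.sentence()
    return p.flag and p.pos == len(p.words)
```

Tracing the calls made by `parse("The boy sees a tree")` reproduces the trace shown above.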

The parameterized compiler which we will describe employs the searching technique. We will not discuss further algorithms based on matching techniques. In the following section, we shall introduce the notion of a syntax machine which is based on this searching technique.

III. THE SYNTAX MACHINE

A. The metalanguage

In order unambiguously to state a set of syntactic definitions, we require a rigorous metalanguage. For this purpose, we shall adopt an expanded version of the Backus normal form of constituent <phrase-structure> grammar notation.

We shall denote syntactic variables (e.g. subject) by enclosing the name of the variable in the metabrackets < >. We shall denote syntactic constants (e.g. BOY) by themselves. We shall separate the defined constituent from its definition by the symbol ::=. We shall denote a sequence of constituents defined to appear contiguously by juxtaposition of those constituents. We shall separate alternative definitions of a constituent by the symbol |. Other notations we shall introduce as required.

We may now rewrite our sample grammar as follows:

<sentence>    ::= <subject><predicate>
<subject>     ::= <noun phrase>
<noun phrase> ::= <article><noun>
<predicate>   ::= <verb><object>
<object>      ::= <noun phrase>
<article>     ::= THE | A
<noun>        ::= BOY | TREE
<verb>        ::= SEES

B. The syntax machine

We now proceed to define an abstract machine (We are referring here to a class of automata similar to those defined as Turing Machines.) for which we may write the kind of syntax program that we discussed earlier. Our syntax machine requires the following elements: an input tape, divided into cells each holding a single symbol, with a read head which may be advanced and repositioned; a push-down stack memory; a flag which may be set to true or false; and a memory holding the program itself.

We construct our operations to be of the one-address type. The basic instruction word (in symbolic form) consists of three parts: an optional label, a function, and a data-part.

The functions we initially include are as follows:

CALL
Places the location of the input tape cell currently under the read head in the stack along with the location of the following instruction (a push-down operation). Transfers control to the location specified in the data-part.
MATCH
Matches the symbol in the data-part against the symbol in the cell of the input tape currently under the read head. If a match occurs, the tape is advanced one cell and the flag is set to true. If no match occurs, the flag is set to false.
TRUE
If the flag is set to true, control is transferred to the location specified in the data-part.
FALSE
If the flag is set to false, control is transferred to the location specified in the data-part.
RETURN
If the flag is set to false, the input tape is repositioned to the cell whose location is given in the top entry of the stack memory (i.e. the cell that was under the read head when this subroutine was called). Control is transferred to the instruction whose location is also given in the top entry of the stack, and the top stack entry is deleted (a pop-up operation).
STOP
The syntax machine is stopped and the setting of the flag displayed.

Other operations will be introduced as required.

C. Programming the syntax machine

Now let us write a syntax program for the sample grammar given in Section III A above. We shall assume that the first instruction to be executed is the first instruction in the program. We shall use somewhat more concise names for the syntactic variables, and we shall assume that each basic word in our grammar is a single symbol, which may be contained in a single cell of our input tape.

Location  Label  Function  Data
    0            CALL      SEN
    1            STOP
    2     SEN    CALL      SUB
    3            FALSE     L1
    4            CALL      PRE
    5     L1     RETURN
    6     SUB    CALL      NP
    7            RETURN
    8     NP     CALL      ART
    9            FALSE     L2
   10            CALL      NN
   11     L2     RETURN
   12     ART    MATCH     THE
   13            TRUE      L3
   14            MATCH     A
   15     L3     RETURN
   16     NN     MATCH     BOY
   17            TRUE      L4
   18            MATCH     TREE
   19     L4     RETURN
   20     PRE    CALL      VB
   21            FALSE     L5
   22            CALL      OBJ
   23     L5     RETURN
   24     VB     MATCH     SEES
   25            RETURN
   26     OBJ    CALL      NP
   27            RETURN
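
The syntax machine itself is easily simulated. The following Python interpreter is a sketch (ours) which encodes the program above as one (label, function, data) triple per location and follows the operation semantics of Section III B, including the repositioning of the tape by RETURN on failure; cell numbers are zero-based here, where the traces in the text are one-based:

```python
# The syntax-machine program, one (label, function, data) triple per location.
PROGRAM = [
    (None, "CALL", "SEN"), (None, "STOP", None),
    ("SEN", "CALL", "SUB"), (None, "FALSE", "L1"),
    (None, "CALL", "PRE"), ("L1", "RETURN", None),
    ("SUB", "CALL", "NP"), (None, "RETURN", None),
    ("NP", "CALL", "ART"), (None, "FALSE", "L2"),
    (None, "CALL", "NN"), ("L2", "RETURN", None),
    ("ART", "MATCH", "THE"), (None, "TRUE", "L3"),
    (None, "MATCH", "A"), ("L3", "RETURN", None),
    ("NN", "MATCH", "BOY"), (None, "TRUE", "L4"),
    (None, "MATCH", "TREE"), ("L4", "RETURN", None),
    ("PRE", "CALL", "VB"), (None, "FALSE", "L5"),
    (None, "CALL", "OBJ"), ("L5", "RETURN", None),
    ("VB", "MATCH", "SEES"), (None, "RETURN", None),
    ("OBJ", "CALL", "NP"), (None, "RETURN", None),
]

LABELS = {label: loc for loc, (label, _, _) in enumerate(PROGRAM) if label}

def run(tape):
    """Execute the syntax program on a list of symbols; return the flag."""
    loc, cell, flag, stack = 0, 0, False, []  # stack holds (return loc, tape cell)
    while True:
        _, fn, data = PROGRAM[loc]
        if fn == "CALL":
            stack.append((loc + 1, cell))     # push-down operation
            loc = LABELS[data]
        elif fn == "MATCH":
            flag = cell < len(tape) and tape[cell] == data
            if flag:
                cell += 1                     # advance the tape one cell
            loc += 1
        elif fn == "TRUE":
            loc = LABELS[data] if flag else loc + 1
        elif fn == "FALSE":
            loc = LABELS[data] if not flag else loc + 1
        elif fn == "RETURN":
            ret, saved = stack.pop()          # pop-up operation
            if not flag:
                cell = saved                  # reposition the tape on failure
            loc = ret
        elif fn == "STOP":
            return flag
```

Printing `loc`, `flag`, the two stack columns, and `cell` before each step reproduces the traces that follow.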

When presented with our sample sentence, The boy sees a tree, a trace of the execution of this program would appear as follows, where the trace of each step is printed before execution.

Location  Flag setting  Return stack  Tape stack  Tape cell  Symbolic instruction
    0         ....      ...           ...             1      CALL SEN
    2         ....      1             1               1      CALL SUB
    6         ....      1,3           1,1             1      CALL NP
    8         ....      1,3,7         1,1,1           1      CALL ART
   12         ....      1,3,7,9       1,1,1,1         1      MATCH THE
   13         True      1,3,7,9       1,1,1,1         2      TRUE L3
   15         True      1,3,7,9       1,1,1,1         2      RETURN
    9         True      1,3,7         1,1,1           2      FALSE L2
   10         True      1,3,7         1,1,1           2      CALL NN
   16         True      1,3,7,11      1,1,1,2         2      MATCH BOY
   17         True      1,3,7,11      1,1,1,2         3      TRUE L4
   19         True      1,3,7,11      1,1,1,2         3      RETURN
   11         True      1,3,7         1,1,1           3      RETURN
    7         True      1,3           1,1             3      RETURN
    3         True      1             1               3      FALSE L1
    4         True      1             1               3      CALL PRE
   20         True      1,5           1,3             3      CALL VB
   24         True      1,5,21        1,3,3           3      MATCH SEES
   25         True      1,5,21        1,3,3           4      RETURN
   21         True      1,5           1,3             4      FALSE L5
   22         True      1,5           1,3             4      CALL OBJ
   26         True      1,5,23        1,3,4           4      CALL NP
    8         True      1,5,23,27     1,3,4,4         4      CALL ART
   12         True      1,5,23,27,9   1,3,4,4,4       4      MATCH THE
   13         False     1,5,23,27,9   1,3,4,4,4       4      TRUE L3
   14         False     1,5,23,27,9   1,3,4,4,4       4      MATCH A
   15         True      1,5,23,27,9   1,3,4,4,4       5      RETURN
    9         True      1,5,23,27     1,3,4,4         5      FALSE L2
   10         True      1,5,23,27     1,3,4,4         5      CALL NN
   16         True      1,5,23,27,11  1,3,4,4,5       5      MATCH BOY
   17         False     1,5,23,27,11  1,3,4,4,5       5      TRUE L4
   18         False     1,5,23,27,11  1,3,4,4,5       5      MATCH TREE
   19         True      1,5,23,27,11  1,3,4,4,5       6      RETURN
   11         True      1,5,23,27     1,3,4,4         6      RETURN
   27         True      1,5,23        1,3,4           6      RETURN
   23         True      1,5           1,3             6      RETURN
    5         True      1             1               6      RETURN
    1         True      ...           ...             6      STOP

For this example, the input tape appeared as follows:

Cell:     1    2    3     4    5    6
Symbol:   THE  BOY  SEES  A    TREE

It might be instructive to examine a trace of the ill-formed sentence The boy a tree sees. Following is such a trace.

Location  Flag setting  Return stack  Tape stack  Tape cell  Symbolic instruction
    0         ...       ...           ...            1       CALL SEN
    2         ...       1             1              1       CALL SUB
    6         ...       1,3           1,1            1       CALL NP
    8         ...       1,3,7         1,1,1          1       CALL ART
   12         ...       1,3,7,9       1,1,1,1        1       MATCH THE
   13         True      1,3,7,9       1,1,1,1        2       TRUE L3
   15         True      1,3,7,9       1,1,1,1        2       RETURN
    9         True      1,3,7         1,1,1          2       FALSE L2
   10         True      1,3,7         1,1,1          2       CALL NN
   16         True      1,3,7,11      1,1,1,2        2       MATCH BOY
   17         True      1,3,7,11      1,1,1,2        3       TRUE L4
   19         True      1,3,7,11      1,1,1,2        3       RETURN
   11         True      1,3,7         1,1,1          3       RETURN
    7         True      1,3           1,1            3       RETURN
    3         True      1             1              3       FALSE L1
    4         True      1             1              3       CALL PRE
   20         True      1,5           1,3            3       CALL VB
   24         True      1,5,21        1,3,3          3       MATCH SEES
   25         False     1,5,21        1,3,3          3       RETURN
   21         False     1,5           1,3            3       FALSE L5
   23         False     1,5           1,3            3       RETURN
    5         False     1             1              3       RETURN
    1         False     ...           ...            1       STOP

For this example, the input tape appeared as follows:

Cell:     1    2    3    4     5     6
Symbol:   THE  BOY  A    TREE  SEES

In this case our flag was correctly set to false when we stopped.

So far we have programmed our machine for a simple grammar and traced its execution for both well-formed and ill-formed input strings. We now examine recursive definitions.

D. Recursive calls

It is quite possible to define a constituent partially in terms of itself. For this purpose we use recursive subroutine calls, i.e. subroutines which call themselves. For example, we might define:

<noun phrase> ::= <adjective><noun phrase> | <noun> 
<adjective>   ::=   BIG | BAD 
<noun>        ::=   BEAR

That is to say, a <noun phrase> is a <noun> preceded by zero or more <adjective>'s. We can code this subroutine as follows:

Location  Label  Function  Data
    0     NP     CALL      ADJ
    1            FALSE     L1
    2            CALL      NP
    3            RETURN
    4     L1     CALL      NN
    5            RETURN
    6     ADJ    MATCH     BIG
    7            TRUE      L2
    8            MATCH     BAD
    9     L2     RETURN
   10     NN     MATCH     BEAR
   11            RETURN

Our execution trace with an input tape such as:

Cell:     1    2    3     4    5    6
Symbol:   BIG  BAD  BEAR

would appear as follows:

Location  Flag setting  Return stack  Tape stack  Tape cell  Symbolic instruction
    0         ...       ...           ...            1       CALL ADJ
    6         ...       1             1              1       MATCH BIG
    7         True      1             1              2       TRUE L2
    9         True      1             1              2       RETURN
    1         True      ...           ...            2       FALSE L1
    2         True      ...           ...            2       CALL NP
    0         True      3             2              2       CALL ADJ
    6         True      3,1           2,2            2       MATCH BIG
    7         False     3,1           2,2            2       TRUE L2
    8         False     3,1           2,2            2       MATCH BAD
    9         True      3,1           2,2            3       RETURN
    1         True      3             2              3       FALSE L1
    2         True      3             2              3       CALL NP
    0         True      3,3           2,3            3       CALL ADJ
    6         True      3,3,1         2,3,3          3       MATCH BIG
    7         False     3,3,1         2,3,3          3       TRUE L2
    8         False     3,3,1         2,3,3          3       MATCH BAD
    9         False     3,3,1         2,3,3          3       RETURN
    1         False     3,3           2,3            3       FALSE L1
    4         False     3,3           2,3            3       CALL NN
   10         False     3,3,5         2,3,3          3       MATCH BEAR
   11         True      3,3,5         2,3,3          4       RETURN
    5         True      3,3           2,3            4       RETURN
    3         True      3             2              4       RETURN
    3         True      ...           ...            4       RETURN

Our last return takes us back to the routine which called this subroutine, with a proper true indication. Here we have seen the recursive use of our subroutines.
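
In a language with recursive subroutines, the same definition can be written directly. The following Python sketch (ours) lets the host language's call stack supply the recursion; for simplicity, failure here merely reports back the position reached, rather than repositioning a tape:

```python
# Each subroutine returns (success, new position).
def adjective(words, pos):
    if pos < len(words) and words[pos] in ("BIG", "BAD"):
        return True, pos + 1
    return False, pos

def noun(words, pos):
    if pos < len(words) and words[pos] == "BEAR":
        return True, pos + 1
    return False, pos

def noun_phrase(words, pos=0):
    # <noun phrase> ::= <adjective><noun phrase> | <noun>
    ok, nxt = adjective(words, pos)
    if ok:
        return noun_phrase(words, nxt)   # recursive call, as in the table
    return noun(words, pos)
```

Calling `noun_phrase` on BIG BAD BEAR succeeds after two recursive calls, exactly as in the trace above.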

E. Difficulties with definitions

1. Misordered alternatives. A difficulty may arise from the ordering of alternatives in some definitions. Suppose we had written in III D above:

<noun phrase> ::= <adj. list><noun> | <noun> 
<adj. list>   ::= <adjective> | <adjective><adj. list> 
<adjective>   ::= BIG | BAD
<noun>        ::= BEAR

In the example big bad bear, big would be recognized as an <adj. list> by itself, and the program would fail attempting to recognize bad as a <noun>. In such cases, the more complex alternative must be tested first in order for the program to check for the more complete definition. Thus we rewrite:

<adj. list> ::= <adjective><adj. list> | <adjective>

This condition can arise indirectly:

<adj. list> ::= <type 1> | <type 2>
<type 1>    ::= <adjective> 
<type 2>    ::= <adjective><adj. list>

This is clearly the same situation, and we can rewrite:

<adj. list> ::= <type 2> | <type 1>

2. Circular definitions. A different problem would arise had we written:

<adj. list> ::= <adj. list><adjective> | <adjective>

This definition is circular. The program will try to define <adj. list> as initially an <adj. list>, and will loop indefinitely on the first constituent of the first alternative. Our check against this condition is to make sure the first constituent of each alternative in every definition is not circular, i.e. does not directly or indirectly through other first constituents loop back on itself.
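
The standard repair for such a circular (left-recursive) first constituent is to replace the recursion with iteration, which is what the ∫ operator of Section IV provides in the metalanguage itself. As a Python sketch (ours), <adj. list>, i.e. one or more adjectives, becomes a loop that always consumes a symbol before continuing, so no infinite regress is possible:

```python
def adj_list(words, pos=0):
    # One or more adjectives: <adjective> followed by zero or more <adjective>'s.
    if pos >= len(words) or words[pos] not in ("BIG", "BAD"):
        return False, pos                     # not even one adjective
    while pos < len(words) and words[pos] in ("BIG", "BAD"):
        pos += 1                              # each iteration consumes a symbol
    return True, pos
```

The left-recursive form, by contrast, would call itself before consuming any input, and so would loop indefinitely.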

3. Common constituents. Suppose we had the following definition:

<verb phrase> ::= <verb> <adverb> | <verb> <adjective>

This says a <verb phrase> is a verb followed by either an adverb or an adjective. Given a <verb phrase> such as saw red, we would initially recognize saw as a <verb> but fail on red as an <adverb>. After this failure the tape is still positioned on red. We would then try to recognize red as a <verb> in the second alternative and of course fail. In order to force the tape to reposition back to saw we should write:

<verb phrase> ::= <type 1> | <type 2>
<type 1>      ::= <verb> <adverb>
<type 2>      ::= <verb> <adjective>

Upon returning from the <type 1> subroutine, the tape is repositioned as usual to the symbol at which it was positioned at entry to the subroutine, i.e. to saw, and we may proceed successfully. A faster and more elegant solution is to write:

<verb phrase> ::= <verb> <modifier>
<modifier>    ::= <adverb> | <adjective>
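
As a Python sketch (ours) of the factored form, with a small hypothetical vocabulary (the text supplies only saw red; QUICKLY as an adverb is our assumption), the common <verb> is recognized exactly once and no repositioning is needed:

```python
def modifier(words, pos):
    # <modifier> ::= <adverb> | <adjective>
    if pos < len(words) and words[pos] in ("QUICKLY",):   # adverbs (assumed)
        return True, pos + 1
    if pos < len(words) and words[pos] in ("RED",):       # adjectives (assumed)
        return True, pos + 1
    return False, pos

def verb_phrase(words, pos=0):
    # <verb phrase> ::= <verb> <modifier>
    if pos < len(words) and words[pos] == "SAW":
        return modifier(words, pos + 1)
    return False, pos
```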

These three problem areas represent the major errors that can be made in constructing a grammar for our syntax machine. We now examine the application of our machine to the translation of programming languages.

IV. PROGRAMMING LANGUAGES

Up to now we have drawn examples for our development from natural language. In this section we will investigate the application of our syntax machine to programming languages. We will construct a simple programming language and a syntax program to parse it, beginning with its simplest elements. As difficulties arise, we will introduce notions and techniques of overcoming them. We will delay any discussion of target language definition to later sections.

A. <letter> and <digit>

<letter> ::= A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z
<digit>  ::= 0|1|2|3|4|5|6|7|8|9

The corresponding code is:

LETTER     MATCH A 
           TRUE L1 
           MATCH B 
           TRUE L1
           ...
           ...
           MATCH Z 
    L1     RETURN 
 DIGIT     MATCH 0 
           TRUE L2 
           MATCH 1 
           TRUE L2
           ...
           ...
           MATCH 9
    L2     RETURN

No difficulties arise with these subroutines.

B. <identifier>

<identifier> ::= <letter> | <identifier> <letter> | <identifier> <digit>

Here we have problems. All three of those mentioned in Section III E above, in fact. The common constituent problem is solved by writing:

<identifier> ::= <letter> | <identifier> <ld>
<ld>         ::= <letter> | <digit>

The misordered alternative problem is solved by writing:

<identifier> ::= <identifier> <ld> | <letter>

The circular definition problem is not so easy. We cannot rewrite:

<identifier> ::= <ld> <identifier> | <letter>

since this says an identifier is a letter preceded (instead of followed) by a string of zero or more letters and digits. Our solution is simple. We introduce a new syntactic operator into our metalanguage:

∫ ... an integral sign

which indicates that the following constituent may occur zero or more times. Now we can rewrite:

<identifier> ::= <letter> ∫ <ld>

We now have succinctly said what we want to say and have also eliminated some time-consuming recursion. How can we code this definition? As follows:

IDENTIFIER   CALL LETTER
             FALSE L3 
        L4   CALL LD 
             TRUE L4       
        L3   RETURN 
        LD   CALL LETTER 
             TRUE L5 
             CALL DIGIT 
        L5   RETURN

This allows us to recognize all letters and digits in the string. We recognize the end of the string by the occurrence of a symbol other than a letter or digit. But note, when we return in this fashion through L3 for a legal identifier, the flag is set to false! We avoid this difficulty by defining a new syntax machine operation:

  FLAG ... Set the flag to true. 
  

Now we can code the <identifier> subroutine properly:

IDENTIFIER   CALL LETTER
             FALSE L3 
        L4   CALL LD 
             TRUE L4             
             FLAG
        L3   RETURN

We may note that our code can express some things that our metalanguage cannot, or can express other things more efficiently. For instance, we might have coded:

IDENTIFIER   CALL LETTER
             FALSE L3 
        L4   CALL LETTER 
             TRUE L4       
             CALL DIGIT 
             TRUE L4       
             FLAG
        L3   RETURN

How can we express this efficiency in our metalanguage? Actually, we have replaced the <ld> constituent with its definition. So it would seem natural in the metalanguage to do the same, but after enclosing the subdefinition in the metabrackets { }, thusly:

<identifier> ::= <letter> ∫ {<letter> | <digit>}

It is clear that the problems mentioned in III E above may occur within bracketed constituents. Care should be taken to avoid such problems, and any doubts should be resolved by examining the resulting code. In many cases, the value of simplicity and ease of modification outweighs the added efficiency.
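In the same illustrative Python model (ours, with the letter and digit tests folded in as character-set membership), the final <identifier> subroutine, including the ∫ loop and the closing FLAG, can be sketched as:

```python
# Our sketch of the final IDENTIFIER subroutine:
# <identifier> ::= <letter> ∫ {<letter> | <digit>}
# Returns (flag, position); the final "return True" plays the
# role of the FLAG operation before RETURN.

LETTERS = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
DIGITS = set("0123456789")

def identifier(tape, pos=0):
    if pos >= len(tape) or tape[pos] not in LETTERS:
        return False, pos              # FALSE L3: no leading letter
    pos += 1
    # the ∫ loop: zero or more letters or digits
    while pos < len(tape) and tape[pos] in LETTERS | DIGITS:
        pos += 1
    return True, pos                   # FLAG; RETURN
```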

C. <number>

<number>   ::= <integer> <fraction> | <integer> | <fraction>
<fraction> ::= .<integer>
<integer>  ::= <digit> ∫ <digit> 

In <number> we find the common constituent problem; we might rewrite:

<number>   ::= <decimal> | <integer>
<decimal>  ::= <integer> <fraction> | <fraction>

Or we might say:

<number> ::= <integer> {<fraction> | } | <fraction>

Note in the last case the empty alternative. Our code for this case follows:

NUMBER       CALL INTEGER 
             FALSE L5 
             CALL FRACTION 
             FLAG 
             RETURN 
      L5     CALL FRACTION
             RETURN
FRACTION     MATCH .
             FALSE L6 
             CALL INTEGER 
      L6     RETURN 
 INTEGER     CALL DIGIT
             FALSE L14 
     L15     CALL DIGIT 
             TRUE L15 
             FLAG 
     L14     RETURN

D. <variable>

<variable> ::= <identifier>
VARIABLE     CALL IDENTIFIER 
             RETURN

E. <expression>

<expression> ::= <term> ∫ { + <term> | - <term> } | + <expression> | - <expression>
<term>       ::= <factor> ∫ {* <factor> | / <factor> } 
<factor>     ::= <primary> ∫ { ** <primary> } 
<primary>    ::= <variable> | <number> | ( <expression> ) 

Note here the implied precedence of the source language operations (** binds most tightly, then * and /, then + and -) and the left-to-right evaluation of elements with equal precedence.

EXPRESSION  CALL TERM
            FALSE L13  
        L7  MATCH +
            FALSE L8
        L9  CALL TERM
            TRUE L7
            RETURN
        L8  MATCH -
            TRUE L9
            FLAG
            RETURN
       L13  MATCH +
            FALSE L10
       L11  CALL EXPRESSION
            RETURN
       L10  MATCH -
            TRUE L11
            RETURN
      TERM  CALL FACTOR
            TRUE L12
            RETURN
       L12  MATCH *
            TRUE TERM
            MATCH /
            TRUE TERM
            FLAG
            RETURN
    FACTOR  CALL PRIMARY
            TRUE L15
            RETURN
       L15  CALL EXPON
            TRUE L15
            FLAG
            RETURN
     EXPON  MATCH *
            FALSE L17
            MATCH *
            FALSE L17
            CALL PRIMARY
       L17  RETURN
   PRIMARY  CALL VARIABLE
            TRUE L18
            CALL NUMBER
            TRUE L18
            MATCH (
            FALSE L18
            CALL EXPRESSION
            FALSE L18
            MATCH )
       L18  RETURN

Note the use of the <expon> subroutine for ease of reference, and the efficient use of code in <term> and <factor>.
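The same precedence structure can be sketched in Python as recursive descent (our illustration, not the paper's machine); to keep it small, variables and numbers are single cells here (one upper-case letter or one digit), and each routine returns (flag, position), leaving the position untouched on failure.

```python
# Our Python sketch of the <expression> grammar family.

def expression(s, i=0):
    # <expression> ::= <term> ∫ {+ <term> | - <term>}
    #                | + <expression> | - <expression>
    ok, j = term(s, i)
    if ok:
        while j < len(s) and s[j] in "+-":
            ok2, k = term(s, j + 1)
            if not ok2:
                return False, i        # dangling operator: reject
            j = k
        return True, j
    if i < len(s) and s[i] in "+-":    # unary sign
        ok, j = expression(s, i + 1)
        return (True, j) if ok else (False, i)
    return False, i

def term(s, i):
    # <term> ::= <factor> ∫ {* <factor> | / <factor>}
    ok, j = factor(s, i)
    if not ok:
        return False, i
    while j < len(s) and s[j] in "*/":
        ok2, k = factor(s, j + 1)
        if not ok2:
            return False, i
        j = k
    return True, j

def factor(s, i):
    # <factor> ::= <primary> ∫ {** <primary>}
    ok, j = primary(s, i)
    if not ok:
        return False, i
    while s[j:j + 2] == "**":
        ok2, k = primary(s, j + 2)
        if not ok2:
            return False, i
        j = k
    return True, j

def primary(s, i):
    # <primary> ::= <variable> | <number> | ( <expression> )
    if i < len(s) and (s[i].isupper() or s[i].isdigit()):
        return True, i + 1
    if i < len(s) and s[i] == "(":
        ok, j = expression(s, i + 1)
        if ok and j < len(s) and s[j] == ")":
            return True, j + 1
    return False, i
```

Note that <factor> consumes every ** chain before <term> looks for a single *, so the two operators never conflict.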

F. <statement>

<statement>       ::= <label list> <basic statement> 
<label list>      ::= ∫ {<label>.}
<label>           ::= <identifier>
<basic statement> ::= <set statement> | <go to statement> | <if statement>
  STATEMENT     CALL LABEL LIST 
                CALL BASIC STATE 
                RETURN 
 LABEL LIST     CALL LABDOT 
                TRUE LABEL LIST 
                FLAG 
                RETURN 
     LABDOT     CALL LABEL 
                FALSE L19 
                MATCH . 
        L19     RETURN 
BASIC STATE     CALL SET STATE 
                TRUE L20
                CALL GO TO STATE 
                TRUE L20 
                CALL IF STATE 
       L20      RETURN
     LABEL      CALL IDENTIFIER 
                RETURN 

Note that <label list> must return with flag set to true.

G. <set statement>

<set statement> ::= <variable> = <expression>
 SET STATE       CALL VARIABLE 
                 FALSE L21
                 MATCH = 
                 FALSE L21 
                 CALL EXPRESSION 
       L21       RETURN

H. <go to statement>

<go to statement> ::= GO TO <label>
GO TO STATE    CALL GO TO
               FALSE L22
               CALL LABEL
       L22     RETURN
     GO TO     MATCH G
               FALSE L23
               MATCH O 
               FALSE L23
               MATCH T 
               FALSE L23
               MATCH O 
       L23     RETURN

Note the use of the <go to> subroutine to define a source language constant symbol string. Also note that we assume blank spaces are ignored in the source language as it appears on the input tape.

I. <if statement>

<if statement> ::= IF <expression> = <condition>, GO TO <label>
<condition>    ::= 0 | + | -
  IF STATE     CALL IF
               FALSE L24 
               CALL EXPRESSION 
               FALSE L24 
               MATCH = 
               FALSE L24 
               CALL CONDITION 
               FALSE L24 
               MATCH , 
               FALSE L24 
               CALL GO TO 
               FALSE L24
               CALL LABEL 
      L24      RETURN
       IF      MATCH I 
               FALSE L26 
               MATCH F 
      L26      RETURN 
CONDITION      MATCH 0 
               FALSE L25 
               MATCH + 
               FALSE L25 
               MATCH -
      L25      RETURN

J. <declaration>

<declaration>   ::= (<variable list>) 
<variable list> ::= <variable> ∫ { ,<variable>} 
DECLARATION   MATCH ( 
              FALSE L28 
              CALL VAR LIST 
              FALSE L28 
              MATCH ) 
        L28   RETURN 
   VAR LIST   CALL VARIABLE 
              FALSE L29 
        L31   MATCH ,
              TRUE VAR LIST 
              FLAG 
        L29   RETURN

K. <program>

<program>          ::= <declaration list> $ <statement list> $ 
<declaration list> ::= <declaration> ∫ { $ <declaration>} 
<statement list>   ::= <statement> ∫ { $ <statement>} 
    PROGRAM   CALL DECL LIST
              FALSE L32
              MATCH $
              FALSE L32
              CALL STATE LIST
              FALSE L32
              MATCH $ 
       L32    RETURN
 DECL LIST    CALL DECLARATION
              FALSE L38
              MATCH $
              TRUE DECL LIST
              FLAG
       L38    RETURN 
STATE LIST    CALL STATEMENT
              FALSE L39
              MATCH $
              TRUE STATE LIST
              FLAG 
       L39    RETURN

L. Main routine

The ordering of definition subroutines is not significant. However, the main routine must appear first.

CALL PROGRAM 
STOP

We have now developed a syntax program which will parse a complete simple programming language. The program consists of around 220 steps, and will tell us whether any given source language string is well-formed or ill-formed according to our grammar. To summarize up to this point, we have defined an automatic technique for parsing phrase-structured languages. We have designed an abstract machine which implements this technique. And we have given examples of this machine in operation and of its application to a programming language. We now investigate methods for generating target language as a function of the parsing technique.

V. OUTPUT AND THE EDITOR

A. Syntax machine output

Thus far we have dealt with source language parsing. Now we look at a method for generating the elements of the target language. We shall develop the target language generation in a fashion consistent with and embedded in our parsing logic.

We must add several elements to our syntax machine in order to generate output: an output tape, a write head, and a new print operation.

The call and return operations are modified as follows:

CALL
Places the location of the output tape cell currently under the write head in the stack memory, along with the input tape cell and return locations already so placed.
RETURN
If the flag is set to false, the output tape is re-positioned to the cell whose location is given in the top entry of the stack memory (i.e. the cell that was under the write head when this subroutine was called).

The effect of repositioning the output tape along with the input tape is that all output associated with unrecognized input is essentially erased. The new operation is:

PRINT
Prints the symbol contained in the data-part on the cell of the output tape currently under the write head, and advances the tape one cell.
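The effect of the modified call and return can be sketched in Python (an illustrative model of ours, not the paper's machine): the output tape is a list, a call records the current output position, and a failing return truncates the output back to that cell and rewinds the input.

```python
# Our sketch of the modified CALL/RETURN bookkeeping: output printed
# during a failed recognition is erased when the subroutine returns
# with the flag false.

def call(subroutine, tape, pos, out):
    out_mark = len(out)                 # output cell under the write head
    ok, new_pos = subroutine(tape, pos, out)
    if not ok:
        del out[out_mark:]              # reposition the output tape
        return False, pos               # and the input tape
    return True, new_pos

def the_boy(tape, pos, out):
    """Hypothetical example: MATCH THE, PRINT DER, MATCH BOY, PRINT KNABE."""
    if tape[pos:pos + 3] == "THE":
        out.append("DER")               # PRINT happens before we know
        if tape[pos + 3:pos + 6] == "BOY":  # the whole phrase matches
            out.append("KNABE")
            return True, pos + 6
    return False, pos

out = []
ok, pos = call(the_boy, "THETREE", 0, out)
# the subroutine printed DER before failing on BOY; call() erased it,
# so out is [] and the input position is back at cell 0
```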

We must add to our metalanguage the ability to specify printed output symbols. We will enclose symbols which we wish printed in the meta-brackets [ and ], and such an enclosed set may be considered as a special kind of constituent.

Suppose we wished to translate our sentence The boy sees a tree to its German equivalent. Whenever we recognize the symbol boy we want to print Knabe, and for tree, we want to print Baum. We would change our grammar from Section III A as follows:

<noun> ::= BOY [KNABE] | TREE [BAUM] 

We can add the remaining German equivalents as follows:

<article> ::= THE [DER] | A [EINEN] 
<verb>    ::= SEES [SEHT]

(We will solve the declension problem in Section V D.)

The grammar now indicates that as we recognize the basic English words, we print the German equivalent. Let's see what our syntax code looks like for the changed grammar of Section III A:

           CALL SEN 
           STOP
     SEN   CALL SUB 
           FALSE L1 
           CALL PRE   
      L1   RETURN
     SUB   CALL NP
           RETURN
      NP   CALL ART 
           FALSE L2 
           CALL NN
      L2   RETURN 
     ART   MATCH THE 
           FALSE L3 
           PRINT DER 
           RETURN
      L3   MATCH A 
           FALSE L4 
           PRINT EINEN
      L4   RETURN 
      NN   MATCH BOY 
           FALSE L5 
           PRINT KNABE 
           RETURN
      L5   MATCH TREE 
           FALSE L6 
           PRINT BAUM
      L6   RETURN 
     PRE   CALL VB 
           FALSE L7 
           CALL OBJ
      L7   RETURN
      VB   MATCH SEES 
           FALSE L8 
           PRINT SEHT
      L8   RETURN 
      OBJ   CALL NP 
           RETURN 

Thus, if our input sentence is:

THE BOY SEES A TREE

then our output sentence will be:

DER KNABE SEHT EINEN BAUM
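The net effect of this translating grammar can be condensed into a short Python sketch (ours; word-level matching replaces the paper's letter-by-letter MATCH to keep it brief):

```python
# Our condensed rendering of the grammar above: as each English word
# is recognized in its grammatical position, the German equivalent
# is printed; an ill-formed sentence yields no output at all.

def translate(sentence):
    words, out = sentence.split(), []
    table = {"THE": "DER", "A": "EINEN", "BOY": "KNABE",
             "TREE": "BAUM", "SEES": "SEHT"}
    # <sentence> ::= <article> <noun> <verb> <article> <noun>
    pattern = [{"THE", "A"}, {"BOY", "TREE"}, {"SEES"},
               {"THE", "A"}, {"BOY", "TREE"}]
    if len(words) != len(pattern):
        return None
    for word, allowed in zip(words, pattern):
        if word not in allowed:
            return None                # no parse: output erased
        out.append(table[word])        # PRINT the German equivalent
    return " ".join(out)
```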

B. Editing and the Editor

In many cases we will want to combine and rearrange symbols on our output tape. For instance, suppose our translation objective is simply to invert the order of symbols on the input tape. Assume that the input string is a string of one or more letters. Our parsing grammar for our source language would be:

<string> ::= <letter> ∫ <letter>

Our code would be:

STRING     CALL LETTER 
           FALSE L1
    L2     CALL LETTER 
           TRUE L2 
           FLAG 
    L1     RETURN
    

We initially want to print the letters. Rather than place a print instruction for each letter in the <letter> subroutine, we will define a new operation for the syntax machine:

COPY
Copies the symbol contained in the cell of the input tape just previously read onto the cell of the output tape currently under the write head. Advances the output tape one cell.

We shall represent the copy operation in the metalanguage by the symbol K. We now can write:

<string> ::= <letter> K ∫ {<letter> K} 

and code:

STRING       CALL LETTER
             FALSE L1 
    L2       COPY
             CALL LETTER 
             TRUE L2 
             FLAG 
    L1       RETURN
             

So far, we are only copying the input tape on to the output tape. In order to discuss combining and rearranging symbols on the output tape, we must introduce the notion of an editor machine.

By the nature of our approach we are not able to manipulate the output tape except to write on it and to position it forward and backward. It would prove cumbersome, at best, for the syntax machine to attempt to read the output tape, manipulate the symbols, and rewrite them back on the output tape. However, it is quite simple to write on the output tape instructions to another machine (an editor) which would use the syntax machine's output tape as an input tape. We can write these instructions just as we write symbols on the tape, assuming they will be distinguished from symbols by some unique tape coding.

For this purpose, we introduce another new operation for the syntax machine:

EDIT

Prints the symbol contained in the data-part on the cell of the output tape currently under the write head, but in a form that differentiates it from symbols written by the print operation. This symbol is referred to as an edit code. Three such codes are:

       C     Combine 
       X     Exchange
       W     Write 
       

The output tape is advanced one cell.

In order to understand the meaning of these edit codes, we must understand the operation of the editor machine.

The editor is composed of an input tape (the syntax machine's output tape), a read head, and a stack memory.

The editor simply reads successive cells on its input tape and, if the content of the cell is a basic symbol, places that symbol in its stack memory (a push-down operation). If the cell instead contains an edit code, the corresponding operation is performed on the stack, followed by resumption of the input process.

C. Edit codes

Let us examine the combine code first. When the editor detects a combine code on its input tape, it removes the top two entries from the stack (pop-up operations), concatenates them into one entry, and places that entry back in the stack (a push-down operation). For instance, if the top entry of the stack contains a B and the next-to-top entry contains an A, the combine code will remove both entries from the stack and then place the single entry AB back into the stack.

When the editor detects an exchange code on the input tape, it simply interchanges the top entry in the stack with the next-to-top entry.

We can now finish our example of letter inversion. In our metalanguage we will represent the combine code by the symbol C and the exchange code by the symbol X .

<string> ::= <letter> K ∫ {<letter>KXC}

For example, given the input string PQRS, our output tape would contain PQXCRXCSXC . Hence the editor stack would ultimately contain SRQP. Our final syntax code for this example is:

STRING         CALL LETTER
               FALSE L1
               COPY
     L3        CALL LETTER
               FALSE L2
               COPY
               EDIT X
               EDIT C
               TRUE L3 
     L2        FLAG 
     L1        RETURN

At this point, no provision has been made for editor output. So we add an output tape to the editor machine, and employ the W edit code (W in the metalanguage). The write code writes the contents of the editor's stack on the output tape, starting with the symbols contained in the bottom entry of the stack, and ending with the symbols contained in the top entry of the stack. We now complete our letter inversion with the following main routine:

              CALL STRING
              EDIT W
              STOP
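The editor machine itself is simple enough to sketch directly in Python (our illustration; edit codes are modelled as one-element tuples, standing in for the "unique tape coding" that distinguishes them from basic symbols):

```python
# Our sketch of the editor: basic symbols are pushed on the stack;
# the edit codes C (combine), X (exchange) and W (write) operate
# on the stack as described above.

def edit(tape):
    stack, out = [], []
    for cell in tape:
        if cell == ("C",):                 # combine: concatenate top two
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif cell == ("X",):               # exchange top two entries
            stack[-1], stack[-2] = stack[-2], stack[-1]
        elif cell == ("W",):               # write stack bottom-to-top
            out.extend(stack)
        else:
            stack.append(cell)             # basic symbol: push
    return "".join(out)

# String inversion: the syntax machine emitted P Q X C R X C S X C W
tape = ["P", "Q", ("X",), ("C",),
        "R", ("X",), ("C",),
        "S", ("X",), ("C",), ("W",)]
# edit(tape) leaves SRQP on the editor's output, as in the text
```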

D. Null symbols

Suppose we wished to accept the following definition of a string to be inverted:

<string> ::= ∫ <letter>

We might rewrite this definition with output as:

<string> ::= <letter> K ∫ {<letter> KXC} |

i.e. either a string of one or more letters, or a null string.

Alternately, we can introduce a null symbol, and use it as our initial symbol in the string. We will define a new syntax machine operation as follows:

NULL
Prints a special symbol on the cell of the output tape currently under the write head, and advances the tape one cell. This special symbol when output later by the editor machine is ignored.

We may now write (using 0 for the null operation in the metalanguage):

<string> ::= 0 ∫ { <letter>KXC}

and code:

STRING       NULL
    L3       CALL LETTER
             FALSE L2
             COPY
             EDIT X
             EDIT C
             TRUE L3
    L2       FLAG
             RETURN

E. Marks

We will now take up the declension problem noted in the English-German translation example in Section V A. The problem is to decline der and ein for both the nominative and accusative (direct object) cases, as follows:

   Nom:    der, ein 
   Acc:    den, einen

Having defined the editing process, we can try the following grammar:

<sentence>    ::= <subject> <predicate>
<subject>     ::= <noun phrase> [ER] X
<noun phrase> ::= <article> <noun>
<predicate>   ::= <verb> <object>
<object>      ::= <noun phrase> [EN] X
<article>     ::= THE [D] | A [EIN]
<noun>        ::= BOY [KNABE] | TREE [BAUM]
<verb>        ::= SEES[SEHT]

Here we treat the article suffix differently according to whether it forms part of the <subject> or <object>. In the example The boy sees a tree, this grammar works well. However, if the sentence were A boy sees the tree, we end up with Einer Knabe seht den Baum instead of Ein Knabe .... We need a method of remembering that A was found in the <subject>, not the, in order to suppress the suffix.

For this purpose, we define two new operations, modify the call operation, and add a new element to the syntax machine. The new element is a set of one-bit marks (which can be set or cleared) in the syntax machine stack. Each entry in the stack has a set of these marks.

We modify the call operation as follows:

CALL
Clears the marks in the newly placed top entry of the stack.

The new operations are:

MARK
Sets the mark specified in the data-part (by an integer) in the next-to-top entry of the stack (i.e. the entry for the calling routine).
TEST
Tests the mark specified in the data-part (by an integer) in the top entry of the stack (i.e. the entry for the current subroutine). If the mark is set, the flag is set to true. If the mark is cleared, the flag is set to false.

To express these operations in the metalanguage, we use the symbol Mn to represent the mark operation, where n is an integer specifying the particular mark of the set. We use the symbol Tn to represent the test operation.

Now we can complete our English-German example properly by changing three definitions as follows:

<subject>     ::= <noun phrase> { T1 [ER] X | } 
<noun phrase> ::= <article> <noun> T1 M1    
<article>     ::= THE [D] M1 | A [EIN]

We change our code given in Section V A above as follows:

SUB       CALL NP 
          FALSE L9 
          TEST 1 
          FALSE L10 
          PRINT ER 
          EDIT X 
          RETURN 
L10       FLAG 
L9        RETURN 
 NP       CALL ART 
          FALSE L2
          CALL NN
          FALSE L2
          TEST 1
          FALSE L11
          MARK 1
L11       FLAG
 L2       RETURN
ART       MATCH THE
          FALSE L3
          PRINT D
          MARK 1
          RETURN
 L3       MATCH A
          FALSE L4
          PRINT EIN
 L4       RETURN
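The mark mechanism for the <subject> side can be sketched in Python (our model: each stack entry's marks become a set, cleared on call; MARK n sets mark n in the caller's entry, TEST n examines the current entry's; the [ER] X C editing is condensed to appending the suffix directly to the article).

```python
# Our sketch of marks: "THE" sets mark 1, which is passed up one
# stack level per MARK, so <subject> knows whether to decline.

def article(word, out, caller_marks):
    if word == "THE":
        out.append("D")
        caller_marks.add(1)            # MARK 1: remember THE was seen
        return True
    if word == "A":
        out.append("EIN")
        return True
    return False

def noun(word, out, caller_marks):
    table = {"BOY": "KNABE", "TREE": "BAUM"}
    if word in table:
        out.append(table[word])
        return True
    return False

def noun_phrase(words, out, caller_marks):
    marks = set()                      # CALL clears the new entry's marks
    if not (article(words[0], out, marks) and noun(words[1], out, marks)):
        return False
    if 1 in marks:                     # TEST 1 ...
        caller_marks.add(1)            # ... MARK 1: pass it up one level
    return True

def subject(words, out):
    marks = set()
    if not noun_phrase(words, out, marks):
        return False
    if 1 in marks:                     # TEST 1: decline only after THE
        out[-2] += "ER"                # D becomes DER; EIN stays EIN
    return True
```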
 

VI. MACHINE LANGUAGES

We shall now develop a full grammar for the simple programming language described in Section IV, including the additions necessary to specify a simple machine language.

A. A hypothetical computer and its assembly program

For simplicity, we shall assume our target computer is of the Polish notation class (e.g. Burroughs B5000, English Electric KDF9) with a stack-type accumulator. The use of a Princeton class machine (IBM 7090, CDC 1604, etc.) is somewhat complex for our purposes. We can define a symbolic assembly language as follows:

The allowable operators are of two kinds - declarative and imperative. The declarative operators, which are actually instructions to the assembly program, are:

*VAR
Defines the following operand as the location specified by the assembly location counter, and increments that counter by one.
*LAB
Identical to *VAR, but does not increment the assembly location counter.
*END
Terminates assembly.

The appearance of an operand anywhere other than immediately following a *VAR or *LAB operator assigns the address of that operand (for variables) or the value of that operand (for numbers) to the location specified by the assembly location counter, and increments that counter by one.

The appearance of an imperative operator assigns that operator to the location specified by the assembly location counter, and increments that counter by one.

Each imperative operator and operand is contained in separate words of our computer. Control steps sequentially through the program, jumping to a new sequence only when so directed by certain operators. The appearance of an operand causes that operand to be placed in the computer's stack (a push-down operation). The appearance of an operator causes execution of the corresponding function. Execution is assumed to begin following the last declarative operator and its operand. The imperative operators and their functions are as follows:

*CLA
Replaces the address in the top entry of the stack with the value in that address.
*ADD
Replaces the top two entries of the stack with their sum.
*SUB
Replaces the top two entries of the stack with their difference.
*MUL
Replaces the top two entries of the stack with their product.
*DIV
Replaces the top two entries of the stack with their quotient.
*EXP
Replaces the top two entries of the stack with their exponentiation.
*NEG
Replaces the top entry of the stack with its negative value.
*ABS
Replaces the top entry of the stack with its absolute value.
*STO
Stores the value in the top entry of the stack in the address specified by the next-to-top entry. Removes both entries from the stack.
*TRA
Transfers control to the address specified in the top entry of the stack. Removes that entry from the stack.
*TZE
Transfers control to the address specified in the top entry of the stack only if the next-to-top entry contains a zero value. Removes both entries from the stack.
*TPL
Functions as *TZE, but for a plus value.
*TMI
Functions as *TZE, but for a minus value.
*HLT
Stops execution.

B. The complete grammar

We may now write a complete grammar for our source language-target language translation. We give below only the metalanguage formulation, adding an absolute value operation to <expression>.

<letter>          ::= A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z 
<digit>           ::= 0|1|2|3|4|5|6|7|8|9
<identifier>      ::= <letter> K ∫ {{<letter> | <digit>}  KC } [,] C 
<number>          ::= <integer> {<fraction> C |} | <fraction> [,] C 
<fraction>        ::= . K <integer> C     
<integer>         ::= <digit> K ∫ {<digit> KC} 
<variable>        ::= <identifier> 
<label>           ::= <identifier> 
<expression>      ::= <term> ∫ {+ <term> [*ADD,] CC | -<term>[*SUB,] CC} | + <expression> | -<expression> [*NEG,] C 
<term>            ::= <factor> ∫ {* <factor> [*MUL,] CC | /<factor> [*DIV,] CC}
<factor>          ::= <primary> ∫ {** <primary> [*EXP,] CC} 
<primary>         ::= <variable> [*CLA,] C | <number>| (<expression>) | /<expression>/[*ABS,] C
<statement>       ::= <label list> <basic statement> C 
<label list>      ::= 0 ∫ {<label> . [*LAB,] XCC} 
<basic statement> ::= <set statement> | <go to statement> | <if statement>
<set statement>   ::= <variable> = <expression> [*STO,] CC
<go to statement> ::= GO TO <label> [*TRA,] C 
<if statement>    ::= IF <expression> = <condition> , GO TO <label> XCC
<condition>       ::= 0 [*TZE,] | + [*TPL,] | - [*TMI,] 
<declaration>     ::= (<variable list>)
<variable list>   ::= <variable> [*VAR,] XC ∫ {, <variable> [*VAR,]XCC}
<program>         ::= <declaration list> $ <statement list> $ [*END.] W
<declaration list>::= <declaration> ∫ {$ <declaration> C} 
<statement list>  ::= <statement> ∫ {$ <statement> C} [*HLT,] C
         

C. An example program

As an example, consider the following source language program, which computes the square root (B) of a number (A) to a precision of 0.0001:

    (A, B, T) $
    B=(A+1)/2$
S1. T = B$
    B = B + (A/B - B)/2 $	
    IF /B - T/ - 0.0001 = +, GO TO S1 $

The resulting target language program follows:

*VAR,A,*VAR,B,*VAR,T,B,A,*CLA,1,*ADD,2,*DIV,*STO,
*LAB,S1,T,B,*CLA,*STO,B,B,*CLA,A,*CLA,B,*CLA,*DIV,
B,*CLA,*SUB,2,*DIV,*ADD,*STO,B,*CLA,T,*CLA,*SUB,
*ABS,.0001,*SUB,S1,*TPL,*HLT,*END.
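As a check (ours, not the paper's), the tape above can be assembled and executed with a short Python model of the hypothetical machine of Section VI A; the model implements only the operators this program uses, and assumes execution begins at the first word after the *VAR block.

```python
import math

# Our sketch of the target machine: an assembler for the declarative
# operators and an interpreter for the imperative ones.  TAPE is the
# generated square-root program, verbatim.

TAPE = ("*VAR,A,*VAR,B,*VAR,T,B,A,*CLA,1,*ADD,2,*DIV,*STO,"
        "*LAB,S1,T,B,*CLA,*STO,B,B,*CLA,A,*CLA,B,*CLA,*DIV,"
        "B,*CLA,*SUB,2,*DIV,*ADD,*STO,B,*CLA,T,*CLA,*SUB,"
        "*ABS,.0001,*SUB,S1,*TPL,*HLT,*END.").split(",")

def assemble(tokens):
    mem, symbols, fixups = [], {}, []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "*VAR":                  # define symbol, reserve a word
            symbols[tokens[i + 1]] = len(mem)
            mem.append(0.0)
            i += 2
        elif tok == "*LAB":                # define symbol only
            symbols[tokens[i + 1]] = len(mem)
            i += 2
        elif tok.startswith("*END"):       # terminate assembly
            break
        elif tok.startswith("*"):          # imperative operator
            mem.append(tok)
            i += 1
        else:                              # operand: value or address
            try:
                mem.append(float(tok))
            except ValueError:
                fixups.append((len(mem), tok))
                mem.append(None)
            i += 1
    for loc, name in fixups:               # resolve symbolic addresses
        mem[loc] = symbols[name]
    return mem, symbols

def execute(mem, start):
    stack, pc = [], start
    while True:
        cell, pc = mem[pc], pc + 1
        if not isinstance(cell, str):
            stack.append(cell)             # operand: push on the stack
        elif cell == "*CLA":
            stack.append(mem[stack.pop()])
        elif cell == "*ADD":
            b, a = stack.pop(), stack.pop(); stack.append(a + b)
        elif cell == "*SUB":
            b, a = stack.pop(), stack.pop(); stack.append(a - b)
        elif cell == "*DIV":
            b, a = stack.pop(), stack.pop(); stack.append(a / b)
        elif cell == "*ABS":
            stack.append(abs(stack.pop()))
        elif cell == "*STO":
            value, address = stack.pop(), stack.pop()
            mem[address] = value
        elif cell == "*TPL":
            address, value = stack.pop(), stack.pop()
            if value > 0:
                pc = address
        elif cell == "*HLT":
            return

mem, symbols = assemble(TAPE)
mem[symbols["A"]] = 2.0                    # the input: compute sqrt(2)
execute(mem, start=3)                      # first word after the *VARs
# mem[symbols["B"]] now holds sqrt(2) to within the 0.0001 tolerance
```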

This target language program can now be assembled by the target machine and executed, giving the desired results. A more familiar target language would be that of the IBM 7090. Had we written a grammar for that target language, our generated program might have been:

       ENTRY  (1)    
A      BSS    1        
B      BSS    1        
T      BSS    1        
(1)    CLA    A        
       FAD    =1.
       FDP    =2.
       XCA             
       STO    B        
S1     BSS    0
       CLA    B        
       STO    T     
       CLA    A      
       FDP    B      
       XCA           
       FSB    B      
       FDP    =2.
       XCA
       FAD    B
       STO    B
       CLA    B
       FSB    T
       SSP
       FSB    =.0001
       TPL    S1
       HTR    *
       END

The flexibility inherent in the combined source-target language specification permits consideration of many target machine characteristics. We next examine the problems of implementing the compiler, and adding features to the compiler to take advantage of the target machine's characteristics.

VII. IMPLEMENTATION

A. Storage requirements

Thus far we have specified two abstract machines, the syntax machine and the editor machine. Let's make some broad-scale estimates of storage requirements for practical implementation.

The syntax machine requires one input and one output tape, a stack memory, a one-bit flag, and a grammar memory. In practice, there is little need for more than 100 entries in the stack, or more than 4000 syntax instructions. Each stack entry requires about 48 bits (12 each for input cell location, output cell location, return address, and marks). Each syntax instruction needs about 16 bits (4 for the function part, 12 for the data part). A program to simulate the logic of the syntax machine would require less than 500 instructions on any computer.

The editor requires one input and one output tape and a stack memory. The stack memory might be organized as a stack of threaded lists, since each entry must be of variable length. In practice, such a stack would require less than 100 entries, containing a total of 2000 symbols with their link addresses. Each such symbol might need 24 bits (12 for the symbol, 12 for the link address). A program to simulate the logic of the editor would require less than 500 instructions on any computer.

In general, we are talking about 100K bits or less, with 2 tapes. Computers in this range are the SDS 910 and 920, IBM 1620 and 1410, CCC DDP-19, CDC 160A, GE225, Philco 1000, UNIVAC SS80 and 90, Honeywell 400, NCR 315, DEC PDP-1, etc.

B. Tape movement

The tape movement in a straightforward implementation of the syntax machine would normally be excessive. A large amount of back spacing (impossible with paper tape or cards) will occur. This may be overcome by creating input and output stacks to buffer the I-O processes. As symbols are read in, they can be placed in the input stack. Repositioning of the tape can be effectively accomplished by reducing the index of the current entry of the input stack. The symbols already read remain in the stack above the new index position. Additional symbols are actually read from the input tape only when an empty stack entry is encountered. The output stack can be manipulated similarly, except overwriting of erased symbols occurs.
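The input-stack buffering described above can be sketched in Python (our model: the forward-only medium is any symbol source, and tape "repositioning" becomes resetting an index into the symbols already read):

```python
# Our sketch of buffered input: back-spacing never moves the physical
# tape; it only moves an index into the stack of symbols already read.
# Symbols are actually read from the source only when the index runs
# past the end of the stack.

class BufferedInput:
    def __init__(self, source):
        self._source = iter(source)   # e.g. a card or paper-tape reader
        self._stack = []              # symbols already read
        self.pos = 0                  # index of the next cell

    def read(self):
        if self.pos == len(self._stack):
            self._stack.append(next(self._source))  # really read the tape
        symbol = self._stack[self.pos]
        self.pos += 1
        return symbol

    def reposition(self, pos):
        self.pos = pos                # back-space without tape movement
```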

Clearly, unless the input and output strings are smaller in length than the respective stack dimensions, some method of occasionally clearing the stacks must be available. The philosophy can be adopted of segmenting the translation process; e.g. in a programming language one complete statement may be translated in a segment. In this example, whenever such a statement has been parsed and output generated, the output stack can then (and only then) be written on the actual output tape, and both the input and output stacks cleared. One simple method of accomplishing this would be to trigger the output tape write by the execution of the EDIT W operation, aside from its similar use by the editor. In the programming language example given in Section VI B, we might rewrite for this purpose:

<declaration list> ::= <declaration> W ∫ {$<declaration> W}
<statement list>   ::= <statement> W ∫ {$<statement> W} [*HLT,] W
         

Of course, if the source language is not well-formed according to the syntax defined in the routines which call the subroutines that execute EDIT W operations, we have no means of recovery. Some thought applied to deciding when to output will generally minimize this problem.

The input stack would not require for most purposes more than 1000 12-bit symbols, the output stack no more than 2000 12-bit symbols. The use of these stacks enables us to employ paper tape or punched card input-output.

C. The translator machine

A further elegance now possible would be to combine the syntax and editor machines into one translator machine. We could for this purpose consider the syntax machine's output stack as a pre-loaded editor machine stack, which the editor would scan, rather than read. The EDIT W operation would be modified to transfer control to the editor. The editor would process all symbols in the stack and output the edited string. The editor would then clear both the input and output stacks, and return control to the syntax machine at the instruction following the EDIT W operation.

We now have a single translator machine, operating in a one-pass mode, capable of translating in that one pass a source language string of indefinite length. The entire translator should fit easily into a 200K-bit computer (say an 8K 24-bit word-length machine, or a 40K character machine), such as expanded versions of those mentioned above. This combined approach yields high translation rates (e.g. on the IBM 7090, upwards of 1000 FAP instructions per minute for FORTRAN-type source languages).

D. Other features

Two other programs are necessary for a complete compiling system. First, a grammar assembly program is desirable to convert from our symbolic notation for syntax code to an absolute encoded format. Second, a target language assembly program (or load-and-go program) is necessary to convert the translator machine output to the object computer's absolute language.

It is important to note again that our translator can so far translate only those language pairs which are definable by a constituent grammar. For many other languages, we will need additional output and editing facilities. A dictionary entry and look-up feature would be very useful in both natural and programming language translation. The ability to generate internal labels is necessary in most programming languages. Format control (spacing, tab positioning, carriage return, etc.) is desirable for any language. All these and other features can be implemented through the use of additional edit codes and further extensions to the editor. In fact, it might prove useful to have several editors, since many of these features may be dependent on the target language required. An editor could even assume the duties of a target language assembly program.

For some source languages it might be advantageous in obtaining a high-quality translation to write more than one syntax program, and to use them in a multi-pass fashion, each processing the output from the previous pass. In fact, both a one-pass grammar and a multi-pass grammar might be provided for the same language pair, and, depending on the quality of translation desired, one or the other grammar would be employed.

It is beyond the scope of this report to investigate these concepts in greater detail. However, one can see that opportunities for innovation and extension exist in many directions.

VIII. APPLICATIONS

A. General

The parameterized compiler based on automatic parsing is applicable to a variety of fields. The obvious ones are programming languages and natural languages. The approach of our compiler, however, is somewhat different. Rather than write several compilers for various source and target languages, we will write one compiler only, and provide it with different parameters (i.e. grammars). We are in a position (compared to previous techniques) to modify the compiler for changes in either the source or target language with relative ease and simplicity. To a great extent, we have at the same time developed rigorous and complete documentation of both languages and the transfer function between them.

A secondary application might be to employ the compiler to translate from some convenient higher order metalanguage to our syntax code - using a metagrammar, so to speak.

A not so obvious application is in the area of man-machine communication. In information storage and retrieval problems, it is often useful to enter and query the computer in a restricted form of a natural language (say English). A small version of our translator machine could be operated as a subroutine in order to convert the natural language to encoded information and/or instructions, and vice versa. A similar application is the use of such a subroutine to accept and display procedures; e.g. to accept an algebraic or Boolean formula as parameter data in a computational program.

Parsing does not need to be restricted to languages as we commonly understand them. Any structure (linear, spatial, temporal, or otherwise) that is well-formed in the phrase-structure sense can be parsed. Our compiler conceivably might discover whether a chair or a game or a concerto or a molecule or a war plan is well-formed, and display for us the particular structure involved or some transformation of it.

In the long run, the uses to which we put our translator machine are limited mainly by the ingenuity and foresight of those who apply it.

B. A specific application

As a specific illustration of the possible uses of this compiler, we consider the case of strategic command and control systems in which it is essential that there be direct communication between high ranking military officers and the data processing system. Since the principal duties of these officers normally involve activities which are very broad in scope and are largely independent of the data processing system, it cannot be expected that they will have the time or inclination to develop an adequate capability for communicating with the machine via either the highly formalized languages or the elaborate consoles which are currently used in the more common tactical command and control situations.

The function of the technician operating a SAGE console can be defined in terms of the analysis of a small finite number of situation displays, and responses in the form of a limited number of light gun or switch movements. With adequate training, a junior officer working in a specialized area may find that communication with the machines via formatted file update or report query formats is adequate. For high-level military officers with extremely comprehensive responsibilities, and hence virtually unlimited data processing and display requirements, it is essential that communication with the system be via simple natural language. The development of the compiler discussed in this paper would be a major step in making such communication possible.

As illustrations of the types of statements which a first production model of our compiler could be expected to process, we list the following statements, taken from the military logistics field.

WHERE IS THE USS DIXIE?
HOW MANY MACH 46 TORPEDOES ARE AT SASEBO?
ADD 6 C-124 AIRCRAFT TO THE 315TH AIR DIVISION ON DAY + 4.
THE ARMY'S 217TH BATTLE GROUP NOW HAS 1760 MEN,
23 105MM HOWITZERS, 46 2-1/2 TON TRUCKS, AND NO GENERATORS.
THE UTILIZATION RATE FOR ALL CRAF AIRCRAFT IS 16 HOURS PER DAY.
HOW MANY C-130'S CAN REACH CLARK AIR BASE BY 4PM HST?
HOW MANY DAYS OF POL SUPPLIES ARE INCLUDED IN THE EMBARKATION LOAD OF THE 1ST ANGLICO?
I WOULD LIKE TO SEE THE CUMULATIVE KOREAN THEATRE POPULATION
DURING THE FIRST 30 DAYS FOLLOWING THE IMPLEMENTATION OF OPLAN XX-XX.
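As a toy illustration of converting such restricted-English statements to encoded instructions (not the paper's mechanism: regular expressions stand in here for the syntax programs, and the LOCATE/COUNT_AT operation names are invented):

```python
import re

# Each pattern recognizes one statement form from the restricted language
# and maps it to an encoded retrieval instruction, the role the small
# translator-machine subroutine would play.

PATTERNS = [
    (re.compile(r"WHERE IS (?:THE )?(.+)\?"), "LOCATE"),
    (re.compile(r"HOW MANY (.+) ARE AT (.+)\?"), "COUNT_AT"),
]

def encode_query(text):
    """Return an encoded instruction tuple, or None if not recognized."""
    for pattern, op in PATTERNS:
        m = pattern.fullmatch(text.strip())
        if m:
            return (op, *m.groups())
    return None  # statement is outside the restricted language

print(encode_query("WHERE IS THE USS DIXIE?"))
# → ('LOCATE', 'USS DIXIE')
print(encode_query("HOW MANY MACH 46 TORPEDOES ARE AT SASEBO?"))
```

A production system would of course use a full grammar rather than a fixed pattern list, so that new statement forms could be added by changing the grammar parameter alone.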

The realization of such a capability can be achieved within the next two years if adequate skills are applied now. A pilot implementation program could be started in an area in which there is already a good foundation of data processing experience on the part of both the user personnel and the compiler and grammar development personnel. By the very nature of the compiler described in this report, we need not embark on an all-or-nothing program. The compiler development can proceed simultaneously with the grammar development. The system may start with a simple grammar tailored to a narrow class of problems (for example, inter-theatre airlift analyses) and be expanded in an evolutionary manner as its usefulness is demonstrated.

Through this specific example, we can perhaps foresee some of the varied but as yet unexplored fields to which the parameterized compiler may be applied.

APPENDIX: Summary of Syntax Machine Operations

Operation              Section
CALL                   III B
RETURN                 III B
MATCH                  III B
TRUE                   III B
FALSE                  III B
FLAG                   IV  B
STOP                   III B
PRINT                  V   A
COPY                   V   B
NULL                   V   D
EDIT                   V   C
MARK                   V   E
TEST                   V   E
COMBINE                V   B
EXCHANGE               V   B
WRITE                  V   B
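The control operations above can be read as the instruction set of a small backtracking recognizer. A toy sketch with assumed, simplified semantics (MATCH consumes one input symbol, CALL/RETURN nest subgrammars, TRUE/FALSE are modelled as success or failure of an alternative; the encoding of grammars as Python data is invented):

```python
def run(grammar, goal, inp):
    """Recognize `inp` (a list of symbols) against rule `goal`.
    A rule is a list of alternatives; an alternative is a sequence of
    ('MATCH', symbol) or ('CALL', rule) operations."""
    def parse(rule, pos):
        for alt in grammar[rule]:           # try each alternative in turn
            p = pos
            ok = True
            for op, arg in alt:
                if op == "MATCH":
                    if p < len(inp) and inp[p] == arg:
                        p += 1              # consume one input symbol
                    else:
                        ok = False          # FALSE: back up, next alternative
                        break
                elif op == "CALL":
                    p2 = parse(arg, p)      # CALL a subgrammar
                    if p2 is None:
                        ok = False
                        break
                    p = p2                  # RETURN with advanced position
            if ok:
                return p                    # TRUE: alternative succeeded
        return None                         # all alternatives FALSE

    return parse(goal, 0) == len(inp)

# A tiny grammar: expr -> term '+' expr | term ;  term -> 'x'
g = {"expr": [[("CALL", "term"), ("MATCH", "+"), ("CALL", "expr")],
              [("CALL", "term")]],
     "term": [[("MATCH", "x")]]}
print(run(g, "expr", ["x", "+", "x"]))
# → True
```

The output operations (PRINT, COPY, EDIT, and the rest) would hang off the same control skeleton, emitting target text as alternatives succeed.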

REFERENCES

1. GLENNIE, A. E., On the syntax machine and the construction of a universal compiler. Technical Report No. 2, Computation Center, Carnegie Institute of Technology, 10 July 1960 (AD-240 512).

2. YNGVE, Victor H., A model and hypothesis for language structure. Proc. Amer. Phil. Soc. Vol. 104, No. 5, pp. 444-466, October 1960.

3. CHOMSKY, Noam, Syntactic Structures, Mouton, The Hague, The Netherlands, 1957.

4. SAMELSON, K. and BAUER, F. L. Sequential formula translation, Comm. Assoc. Comp. Mach. Vol. 3, No. 2, pp. 76-82, February 1960.

5. FLOYD, Robert W., A descriptive language for symbol manipulation, J. Assoc. Comp. Mach. Vol. 8, No. 4, pp. 579-584, October 1961.

6. BROOKER, R. A. and MORRIS, D., An assembly program for a phrase structure language, Comp. J. Vol. 3, pp. 168-179, October 1960.

7. LEDLEY, Robert, and WILSON, J. B., Automatic programming language translation through syntactical analysis, Comm. Assoc. Comp. Mach. Vol. 5, No. 3, pp. 145-155, March 1962.

8. IRONS, Edgar T., A syntax-directed compiler for ALGOL 60, Comm. Assoc. Comp. Mach. Vol. 4, No. 1, pp. 51-55, January 1961.

9. WARSHALL, Stephen, A syntax-directed generator, Proc. East Joint Comp. Conf. pp. 295-305, 1961.

10. NAUR, Peter et al., Report on the algorithmic language ALGOL 60, Comm. Assoc. Comp. Mach. Vol. 3, No. 5, pp. 299-314, May 1960.
