syntaxrule ::= identifier = alternativelist ; alternativelist ::= alternative | alternativelist / alternative alternative ::= test qtestlist? | <- testlist testlist ::= test | testlist test qtestlist ::= qtest | qtestlist qtest test ::= ntest | stest qtest ::= ntest | stest ? integerorstring? integerorstring ::= integer | string string ::= Any string of characters between a pair of the current string delimiter characters
A top-down method of syntax analysis is used in the TREE-META system. The main goal is to match the complete program with the Syntax Rule whose name follows .META at the head of the program. The name of the Syntax Rule appears at the start of the rule followed by the = symbol. The right-hand side of this Syntax Rule indicates which entities have to be recognised to achieve this goal. The recogniser achieves its main goal by looking for these smaller entities, from left to right, as individual subgoals. These subgoals are themselves defined in terms of other entities, and so the process of recognising a complete program becomes a matter of looking for progressively smaller entities, right down to the level at which basic symbols such as characters and numbers are recognised. For example, a simple TREE-META program is:
.META SIMPLE SIMPLE = 'A' VOWEL / 'B' SIGN ; VOWEL = 'A' / 'E' / 'I' / 'O' / 'U' ; SIGN = '+' | '-' ; .END
The right-hand side of each Syntax Rule consists of a set of alternatives separated by the / symbol (similar to the BNF | symbol). Identifiers define subgoals that need to be recognised while strings indicate that the particular characters making up the string are expected at this point. In the example, a SIMPLE program is defined as consisting of the letter A followed by a vowel, or the letter B followed by a sign. The complete set of SIMPLE programs is:
AA AE AI AO AU B+ B-
TREE-META has two forms of syntax analysis. The standard one is a restrictive top-down which does not allow backtracking other than in the first subgoal of an alternative. If non-recognition occurs at a subgoal other than the first, the translator will stop and output an error diagnostic. For example:
. META SMALL SMALL = FIRST SECOND / ONLY ; FIRST = 'AB' ; SECOND = 'C' ; ONLY = 'ABD' ; .END
This translator will only recognise the program consisting of the symbols ABC. It cannot recognise the program ABD. If this program is input, the first alternative of SMALL will be tried which, initially, matches the subgoal FIRST with the input symbols AB. The subgoal SECOND now expects the input to consist of the character C when, in fact, it is D. This failure is not in the first subgoal of an alternative and so the recogniser will stop with an error. This form of syntax analysis is not as restrictive as it may at first seem. By suitable bracketing of subgoals, many languages can be recognised without backtracking.
The second form of syntax analysis allowed is top-down with backtracking. The two forms can be mixed in a program. In fact, both forms can be used in a single Syntax Rule. Preceding an alternative by the symbol <- indicates that backtracking is allowed within this alternative. For example, the program SMALL could be made to recognise ABD by replacing the SMALL rule with:
SMALL = <- FIRST SECOND / ONLY ;
Non-recognition, in attempting to match the input D with the subgoal SECOND, will no longer cause an error halt. Instead, the input pointer will be returned to its position before subgoal FIRST was recognised and the second alternative, ONLY, tried.
Each alternative in a Syntax Rule consists of a set of subgoals (stest in BNF definitions in Section 2.1) which need to be recognised. In addition, the alternative may contain a number of tree-building commands (ntest in BNF definitions in Section 2.1) which will be described in Chapter 4. TREE-META provides facilities for generating syntax error messages. These occur whenever the second or successive subgoal in an alternative does not match the input, and backtracking has not been specified. Consider, for example, the following program:
.META DEC DEC = 'INTEGER' NAME ';' ; NAME = 'I' / 'J' / 'K' ; .END
If the input program consisted of:
REAL I ;
then the TREE-META system would just fail to recognise the rule DEC. However, if the input was:
INTEGER Z ;
then this would be treated as a syntax error. If no error message has been defined, the recogniser will stop and issue the diagnostic:
SYNTAX ERROR 0 IN INTEGER Z ; ↑
The last line of the input already read is printed, together with an arrow on the following line indicating the position in the line reached by the recogniser.
To generate an error number other than 0 for each type of syntax error, each subgoal should be followed by a number enclosed within two ? symbols. For example:
DEC = 'INTEGER' NAME ?2? ';' ?5? ;
would issue the diagnostics:
SYNTAX ERROR 2 IN INTEGER Z ; ↑ SYNTAX ERROR 5 IN INTEGER I,J ; ↑
for the two input lines shown.
To generate more meaningful diagnostics, a string enclosed between ? symbols will be printed. For example:
DEC = 'INTEGER' NAME ? INCORRECT IDENTIFIER? ';' ?SEMICOLON DOES NOT END DECLARATION? ;
This will produce a diagnostic of the form:
SYNTAX ERROR INCORRECT IDENTIFIER INTEGER Z ; ↑
The strings will have to be stored within the compiler that is generated and so its size will be larger than if simple error numbers had been used.
As it is always possible to try another alternative if a failure to match occurs on the first subgoal of an alternative (or if backtracking is specified), the following example rules are not allowed:
DEC = 'INTEGER' ?3? NAME ';' ; DEC = <- 'INTEGER' NAME ?2? ';' ; EC = 'INTEGER' NAME ';' / <- 'REAL' NAME ?3? ';' ;
The first is attempting to issue a diagnostic on failing to match the initial subgoal, while backtracking would take place in the other two.
stest ::= string | identifier | . string | @ integer | .EMPTY | ( alternativelist ) | $ stest | basictype basictype :: = .NUM | .ID | .OCT | .HEX | .SR | .CHR | .DIG | .LET
At any point in the syntax analysis, the recogniser will be attempting to match the current subgoal or stest with the input still to be read. The result of matching the input with a subgoal will always be to move the position reached in the input to a point immediately after the recognised characters. For basictype subgoals and strings preceded by the symbol fullstop, the recognised input will be placed on a stack ready for use in tree-building operations. A full description of the stacking and tree-building operations is given in Chapter 3.
The possible subgoals, which match the input with a particular entity are given below. In all cases, other than .CHR, spaces and newlines are ignored if they appear on the input preceding the entity to be recognised.
Some examples of possible rules, with input they can recognise, are:
(1) FRED = 'ABC' .'CDE' / .'GH' 'JK' ; ABC CDE GH JK (2) FRED = @26 @29 / @46 @48 ; *- NP (3) FRED = .NUM .ID .OCT .HEX .SR .CHR .DIG .LET ; 39 ABC1D 257 1A2B 'A STRING'*3A
Recursive definitions are permitted in TREE-META Syntax Rules. That is, the name of the Syntax Rule can appear as a subgoal on the right-hand side of the rule. The right-hand side must have at least one alternative not involving this rule as a subgoal, otherwise the recogniser will get into an infinite loop. It is essential that the first subgoal of the first alternative is not the name of the rule itself. For example:
FRED = FRED / ' CD' ;
Such a rule would, if called, attempt to match the input with the goal FRED. To do this, the first alternative is tried which requires the input to be matched with the subgoal FRED and so on. A more sensible use of recursion is:
SET = 'B+' SET / 'B'
which would recognise input of the form B+B+.... B+B.
The right-hand side of a TREE-META Syntax Rule usually consists of a set of alternatives separated by the / symbol. It is possible to use brackets so that a set of alternatives acts as though it was a single subgoal call. The bracketed set matches the input if one of the alternatives is recognised. For example:
EXP = FIRST '*' SECOND ; SECOND = 'A' / 'B' ;
can be written as:
EXP = FIRST '*' (' A' / 'B')
The form of the entity within the brackets is the same as a complete right-hand side as far as syntax tests, tree-building and diagnostics are concerned. Backtracking, for example, can be specified for a particular alternative within the brackets:
EXP = 'E' (<- FIRST SECOND / ONLY) ; FIRST = 'AB' ; SECOND = 'C' ; ONLY = 'ABD' ;
Factoring of alternatives, in conjunction with the .EMPTY test, is particulary useful in avoiding back-up situations. For example, an expression consisting of a single primary or two primaries separated by an asterisk could be recognised by:
EXP = <- PRIMARY '*' PRIMARY / PRIHARY ;
The single primary can only be recognised by backtracking when the asterisk does not match the input. Rewriting the rule as:
EXP = PRIMARY ('*' PRIMARY / .EMPTY) ;
avoids the need for backtracking. Non-recognition of an asterisk now occurs as the first subgoal of an alternative.
Although it is possible to define a sequence of similar items recursively, TREE-META provides a specific construction for this case which is both more efficient and easier to read. The special symbol, $, is used to signify that the following subgoal will be recognised repeatedly until a failure to match occurs. It is possible for the subgoal not to be matched at all. For example:
SET = 'C' $ 'C' ;
will recognise a sequence of any number of C characters as long as there is at least one. The subgoal following the $ symbol can be a bracketed entity. In this case, one of the alternatives within the brackets is recognised repeatedly until a failure to match occurs. For example:
SET = 'C' $ ( 'D' / 'E' ) ;
recognises any string consisting of the letter C followed by any sequence involving the letters D and E. Thus:
CDDEDDE
and
CEDDDED
would both be recognised.
Most compilers require a recogniser for arithmetic expressions. The following TREE-META program generates a compiler capable of recognising a single arithmetic expression:
. META EXP EXP = TERM ( '+' EXP / '-' EXP / .EMPTY ) ; TERM = FACTOR ( '*' TERM / '/' TERM / .EMPTY ) ; FACTOR = '+' PRIMARY / '-' PRIMARY / PRIMARY ; PRIMARY = .ID / .NUM / '(' EXP ')' ; .END
The arithmetic expression contains the standard operators and allows operands to be either integer constants or variable names. Some typical expressions which would be recognised by this compiler are:
ALPHA 27 -AL25B + 18 ALPHA * (-BETA + 19) - GAMMA/DELTA A*-B
The last expression is rather unusual. It is allowed because the syntactic unit FACTOR may be signed. It is more normal for such FACTORs to be unsigned. This would have been achieved by defining the expression as:
.META EXP EXP = ( '-' / .EMPTY) TERM ( '+' UEXP / '-' UEXP / .EMPTY ) ; UEXP = TERM ( '+' UEXP / '-' UEXP / .EMPTY ) ; TERM = PRIMARY ( '*' TERM / '/' TERM / .EMPTY ) ; PRIMARY = .ID / .NUM / '(' EXP ')' ; .END
A similar recogniser can be defined using the $ notation. The expression is now terminated by a semicolon:
.META EXP EXP = ( '-' / .EMPTY ) TERM $ ('+' TERM / '-' TERM ) ';' ; TERM = PRIMARY $ ( '*' PRIMARY / '/' PRIMARY ) ; PRIMARY = .ID / .NUM / '(' EXP ')' ;
Each rule in TREE-META is compiled into a section of code which is similar to a recursive subroutine. The appearance of a rule name on the right-hand side of a Syntax Rule is equivalent to a call of that rule. For example, to recognise:
X+3;
the translator first calls the rule EXP. The input does not start with a - symbol and so the second alternative, .EMPTY, is accepted. The rule TERM is called which, in turn, calls the rule PRIMARY. This recognises the symbol X as an identifier, .ID. On return to TERM, this rule attempts to match the input with first the * symbol and then the / symbol. Neither match the symbol + on the input and so a complete TERM has been recognised. On returning to EXP, the input symbol + is matched with the symbol + in the rule. The rule TERM is called again and the integer 3 is recognised in a similar fashion to X. The bracketed expression in the rule, EXP, has a $ in front of it and so it will be entered repeatedly. Now, however, the next input symbol is a semicolon and, as this does not match either + or -, the expression is not repeated again and the last ; symbol is matched.