Chapter 7: Syntax Analysis

7.1 Introduction

The main aim of the syntax analysis phase of the compiler is to take the S-string produced by the lexical analyser and to use some parsing algorithm to verify that the S-string consists of a legal string or strings in the S-language. In addition, it will be required to collect information about the language and to produce as output a C-structure. This could be code ready to be executed or interpreted, but is more likely to be a structural representation of the S-string which will be used to generate code. The C-structure could, for example, be passed to an optimising phase which will modify the C-structure so that better code can be produced. For example, Fig. 7.1 shows two possible C-structures for the arithmetic expression a + b*c.

Reverse Polish:  a b c * +        Tree:  + (a, * (b, c))
Fig. 7.1

The design and theory of parsing algorithms have grown enormously in recent years. Many algorithms are of only academic interest, because efficient computer programs cannot be defined for them. Only a few methods have been widely used in hand-written compilers. Other, more general, methods have become evident only in the compiler generator systems being produced, where some efficiency has to be sacrificed so that the system can handle a variety of languages. The algorithms to be described here have all been used in hand-written compilers and in compiler generator systems.

7.2 Precedence grammars

Before describing the algorithm for parsing sentences of a language defined by a precedence grammar, some additional syntactic definitions must be given. In section 5.4 the term handle of a sentential form was defined as the substring which is first reduced in the canonical parse. In the example subset of the English language defined in section 5.2, the sentence the man has a dog had, at each step in the parse, the handles shown in Table 7.1.

       sentential form                     handle
  the    <man>   has   a   dog              man
 <the      N>    has   a   dog              the N
           P    <has>  a   dog              has
           P      V   <a>  dog               a
           P      V    A  <dog>             dog
           P      V   <A    N>              A N
           P     <V      O>                 V O
          <P          Q>                    P Q
                S
Table 7.1

The symbols < and > have been used to delimit the handle from the rest of the sentence. Similarly the individual symbols in the handle can be delimited by the symbol =. The original sentence could then be written:

    the < m = a = n > has a dog

The relationships <, =, > can be thought of as existing between the two symbols that the operator delimits: in the example, e < m, m = a, and n > h. In order to simplify the meaning of the operators at the ends of the sentential form, it is assumed to be enclosed in the symbols { and }, which can be thought of as giving context to the sentential form. Therefore, for example, the relationship { < P exists. Considering all the possible sentences, the relationships given in Table 7.2 can be seen to exist between pairs of symbols.

    a  d  e  g  h  m  n  o  s  t  A  N  O  P  Q  S  V  }
  {                   <        <           <     <
  a    >           >  =     =
  d                      =
  e    *           *                 =
  g             >                                      >
  h =     =
  m =
  n             >        =                             >
  o    <     =     <                 =
  s >
  t             =
  A    <           <                 =
  N             >                                      >
  O                                                    >
  P             <                             =     <
  Q                                                    >
  S                                                    >
  V <                          <  <     =
Table 7.2

Apart from the two entries marked by *, there is a unique relationship between pairs of symbols. The relationship between the pairs (e,m) and (e,d) can be both < and >. This can be seen from the sentential form:

   {the < d = o = g > has the dog}

which later in the reduction sequence has the handle given by:

  {P V < t = h = e > dog}

A unique relationship between pairs of symbols can be obtained by replacing rule 2 by:

    P → no N 

The relationships e > d and e > m then exist. A unique relationship therefore exists between pairs of symbols that can appear adjacent to each other in legal sentential forms.

Finding the handle and therefore defining a parsing algorithm is now quite simple, assuming uniqueness of the relation between any two symbols. The sentential form is examined from left to right until a pair of symbols is found between which the relationship > holds. This position is noted and the sentential form is then examined from this point leftwards until the first two symbols with the relationship < are encountered. The set of symbols between < and > is the handle to be reduced. In the example:

  {no man has a dog}

the relationships from left to right are:

{ < n = o < m = a = n > h ...

Therefore the handle must be given by:

       < m = a = n > 

which produces:

{ < n = o = N > h ...

Therefore the next handle is no N. Notice that after the first reduction there can be no symbols with the relation > to the left of the point reached, so that the scan of the string from left to right to find the next occurrence of > may be started at the point reached so far. The definition of which precedence relation exists between a pair of symbols can be formalised as follows:

  1. The relationship A = B exists between two symbols, either of which can be terminal or non-terminal, if a production exists in the language of the form:
        C → αABβ 
    
    where α,β are two sentential forms of which either or both may be null.
  2. The relationship A < B exists between two symbols if a production exists in the language of the form:
        C → αAγβ 
    
    and γ produces Bπ for some string π.
  3. The relationship A > B exists between two symbols either if a production exists in the language of the form:
        C → αγBβ 
    
    and γ produces πA for some string π or if a production exists in the language of the form:
        C → αγδβ 
    
    and γ produces µA and δ produces bπ for some strings µ and π.

The strings denoted by Greek letters may be null. A language where a unique relationship exists between all pairs of terminal and non-terminal symbols is called a precedence grammar.
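
These definitions can be mechanised directly. The following sketch (in Python, which is also used for the later sketches in this chapter; the representation of the grammar as (left side, right side) pairs of single-character symbols, with the English words spelt letter by letter as in Table 7.2, is an assumption of the sketch) computes the closures of the first and last symbols of each non-terminal and then applies rules 1 to 3 to every adjacent pair of symbols in every production. Any pair of symbols acquiring more than one relationship shows that the grammar is not a precedence grammar.

  def closure(grammar, index):
      # FIRST+ (index 0) or LAST+ (index -1): for each non-terminal,
      # every symbol that can begin (or end) a string derived from it.
      result = {lhs: set() for lhs, _ in grammar}
      changed = True
      while changed:
          changed = False
          for lhs, rhs in grammar:
              edge = rhs[index]                        # first or last symbol
              new = {edge} | result.get(edge, set())
              if not new <= result[lhs]:
                  result[lhs] |= new
                  changed = True
      return result

  def precedence_relations(grammar):
      first, last = closure(grammar, 0), closure(grammar, -1)
      nonterminal = {lhs for lhs, _ in grammar}
      rel = {}
      def note(a, b, r):
          rel.setdefault((a, b), set()).add(r)
      for _, rhs in grammar:
          for a, b in zip(rhs, rhs[1:]):               # each adjacent pair A B
              note(a, b, '=')                          # rule 1
              for y in first.get(b, ()):               # rule 2
                  note(a, y, '<')
              for x in last.get(a, ()):                # rule 3: whatever can end A
                  for y in {b} | first.get(b, set()):  # > the terminals that can
                      if y not in nonterminal:         # follow it
                          note(x, y, '>')
      return rel

  english = [('S', 'PQ'), ('P', 'theN'), ('Q', 'VO'), ('O', 'AN'),
             ('N', 'man'), ('N', 'dog'), ('V', 'has'), ('A', 'a'), ('A', 'the')]
  conflicts = {p: r for p, r in precedence_relations(english).items() if len(r) > 1}
  # -> {('e', 'm'): {'<', '>'}, ('e', 'd'): {'<', '>'}}; with rule 2 revised
  # to P -> no N (spelt 'noN') the conflicts disappear and the table is unique.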

This immediately leads us to a very simple and straightforward parsing algorithm for precedence grammars. The symbol { is inserted in the initial position of a stack, and characters from the input S-string are added to the stack one at a time as required. Each time a symbol is added to the stack, the relationship between the top two symbols is looked up. If the relationship is < or =, then the next symbol from the input S-string is added to the stack and the procedure repeated. If the relationship > exists between the top two symbols, then all symbols are removed between the top symbol and the first symbol of the pair of symbols that last had the relationship < between them. The symbols removed are replaced by the equivalent non-terminal symbol, and the process continues by examining the relationship between the top two symbols in the stack. For example the sentence no man has a dog would be parsed as shown in Fig. 7.2.

   (1)  { n o m a n h      n > h :  m a n  reduced to N
   (2)  { n o N h          N > h :  n o N  reduced to P
   (3)  { P h a s a        s > a :  h a s  reduced to V
   (4)  { P V a d          a > d :  a      reduced to A
   (5)  { P V A d o g }    g > } :  d o g  reduced to N
   (6)  { P V A N }        N > } :  A N    reduced to O
   (7)  { P V O }          O > } :  V O    reduced to Q
   (8)  { P Q }            Q > } :  P Q    reduced to S
   (9)  { S
Fig 7.2

Symbols are added to the stack until the pair (n,h) appears at the top of the stack. This is the first pair between which the relationship > holds, and it causes the handle man to be reduced to the symbol N. As N > h, the reduction of no N to the non-terminal P occurs immediately. The parsing then proceeds as shown.
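
Using the relations computed by the previous sketch, the parsing algorithm itself might be written as follows. Two details are assumptions of the sketch rather than part of the method: the context symbols { and } are supplied by an auxiliary production Z → { S }, and the comparison is made between the top stack symbol and the next input symbol before it is added, which is equivalent to adding it and then examining the top two symbols. Rule 2 is taken in its revised form P → no N so that the relationships are unique.

  def parse(sentence, rel, rules, goal='S'):
      # rel maps a pair of symbols to '<', '=' or '>'; rules maps each
      # right side back to its left side.  A missing table entry (KeyError)
      # means the sentence is illegal.
      stack, rest = ['{'], list(sentence) + ['}']
      while not (stack == ['{', goal] and rest == ['}']):
          if rel[(stack[-1], rest[0])] in '<=':
              stack.append(rest.pop(0))         # fill
          else:                                 # '>': a handle has been found
              i = len(stack) - 1
              while rel[(stack[i - 1], stack[i])] != '<':
                  i -= 1                        # scan back to the matching <
              handle = ''.join(stack[i:])
              del stack[i:]
              stack.append(rules[handle])       # replace handle by non-terminal
      return True

  english = [('Z', '{S}'), ('S', 'PQ'), ('P', 'noN'), ('Q', 'VO'), ('O', 'AN'),
             ('N', 'man'), ('N', 'dog'), ('V', 'has'), ('A', 'a'), ('A', 'the')]
  table = {p: r.pop() for p, r in precedence_relations(english).items()}
  rules = {rhs: lhs for lhs, rhs in english if lhs != 'Z'}
  parse('nomanhasadog', table, rules)           # -> True, as in Fig 7.2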

A good example of a language with a precedence grammar, which is parsed in this way, is given in the papers on the language Euler by Wirth and Weber (1966).

7.3 Operator precedence grammars

An operator grammar is one where no productions of the form:

    C → αABβ 

occur, where A and B are both non-terminal symbols, and α, β are strings of symbols either or both of which may be null. For operator grammars, the relationships =, < and > need only be defined between terminal symbols of the language as follows:

  1. a = b if there is a production
        C → αabβ
    or  C → αaAbβ
    
    where A is non-terminal and the symbols a,b are terminal.
  2. a < b if there is a production
        C → αaAβ
    
    and A produces bπ for some π or A produces Dbπ where D is non-terminal.
  3. a > b if there is a production
        C → αAbβ
    
    and A produces πa for some π or A produces πaD where D is non-terminal.

An operator precedence grammar is an operator grammar where a unique relationship occurs between pairs of terminal symbols. These relationships are called the precedence relations of the terminal symbols.

7.4 A parser for an operator precedence grammar

Most computer languages can be altered by the lexical conversion so that the S-language has an operator precedence grammar and, as a parser for an operator precedence grammar is efficient and simple, it has therefore been widely used in compilers. Arithmetic expressions, which constitute the chief part of most scientific languages, will be used as examples in the analysis procedures to be discussed here. A simplified structure of arithmetic expressions in the S-language can be defined as:

  1    P → (E)
  2    P → <identifier>
  3    E → T
  4    E → E + T
  5    T → P
  6    T → T * P

Subtraction and division behave in a similar manner to addition and multiplication, as far as the analysis is concerned, and so have been ignored in order to simplify the examples. Also the relationship between addition and multiplication resembles the relationship between multiplication and exponentiation. Consequently exponentiation has also been ignored and examples containing only addition and multiplication will be used.

Identifiers denoting variables will have had their individual letters and digits inserted in the symbol table, and the lexical scan of the expression will produce on the S-string in place of the identifier a pointer to the symbol table entry corresponding to the particular identifier name. Therefore, when an example S-string is written as a + b*c, the symbols a,b,c stand for pointers to the symbol table and these will be marked in some way to distinguish them from the symbols + and * which will be numeric values defining these operators.

These symbol table pointers tend to be very similar to non-terminal symbols, and it is not normally necessary to define precedence relations between the pointers and the terminal symbols such as + and *. They can be thought of as non-terminal symbols that have been produced in an ad hoc fashion independent of the parser. It is as though the S-string to be parsed already contained non-terminal symbols. Identifiers and non-terminal symbols will, in the context of arithmetic statements, be called operands and the terminal symbols will be called operators.

From the definitions of the simplified arithmetic expression, the precedence relations between the terminal symbols (operators) can be defined and stored in a table (see Fig. 7.3) called the precedence table. As the precedence relations only exist between the operators, it is more convenient to use a stack with two columns rather than the one column stack used in the case of precedence grammars. One column will have operators inserted in it, while the other will contain operands. This is mainly a notational device and could in practice be implemented as a one column stack. As the parsing depends on the top two operators at each stage, it is convenient to have these next to each other in the diagrams.

    +  *  (  )  }
  { <  <  <
  + >  <  <  >  >
  * >  >  <  >  >
  ( <  <  <  =
  ) >  >     >  >
Fig 7.3

From the definitions of an operator grammar it can be seen that the handle is delimited, as before, by the symbols < and >, and the same algorithm applies for finding the handle as was used in the case of precedence grammars. As the precedence relations only exist between terminal symbols, there is some doubt about whether the non-terminal symbols also occur in the handle. For example, if a > b then in the case a A b, should A be part of the handle? Looking at the definitions for < and > it is easily seen that

  1. if a > b, then a A b has the relationship:
                      aA > b
  2. if a < b, then a A b has the relationship:
                      a < Ab

Therefore on combination all operands between < and > will be part of the handle.

For example, the expression {a + b * (c + d) + e } would be parsed by first loading { into the initial operator position of the stack and filling the stack until the next operator is encountered (see Fig. 7.4).

   (1)  { a +
   (2)  { a + b *
   (3)  { a + b * (
   (4)  { a + b * ( c +
   (5)  { a + b * ( c + d )
   (6)  { a + b * ( E1 )
   (7)  { a + b * ( E1 ) +
   (8)  { a + b * E1 +
   (9)  { a + E2 +
  (10)  { E3 +
  (11)  { E3 + e }
  (12)  { E4 }
Fig. 7.4

Looking up the precedence relation ({,+) in the precedence table will show that, as { < +, the stack should be filled. The next operand and operator will be loaded (2) and, as + < *, filling occurs again. The next character on the input is the operator ( so that this will be loaded into the operator position (3) and, as * < (, filling will occur once more until the top two operators are + and ), where + > ). Combination will now occur and, as ( < +, the handle must be c + d. This C-structure will be output (see Fig. 7.5) and the stack will be collapsed (6) with the pointer E1 to this C-structure inserted in the stack. The top two operators now have the relation ( = ) so that filling will occur (7). As ) > +, combination will occur once more. In this case the handle is (E1). This bracketing may be inherent in the C-structure used and, in this case, no additional structure is produced but instead the pointer E1 is left in the stack (8). Combination now occurs as * > + and + > + so that the structures E2 (9) and then E3 (10) are produced. Finally the stack is filled once more (11) with the operand e and operator } and combination (12) produces the final C-structure E4.

  E1:  + (c, d)      E2:  * (b, E1)      E3:  + (a, E2)      E4:  + (E3, e)
Fig. 7.5

The above example is typical of how a precedence table together with a stack would be used to deal with syntax analysis. There are many variations on this technique but all basically rely on the precedence table for their results.
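
The whole of the above process might be sketched as follows, with single letters standing in for the symbol table pointers and the C-structure built as nested tuples instead of being output; the precedence table is that of Fig. 7.3, an empty entry showing up as a missing key and hence an error.

  PREC = {                  # Fig. 7.3: row = top operator, column = next operator
      '{': {'+': '<', '*': '<', '(': '<'},
      '+': {'+': '>', '*': '<', '(': '<', ')': '>', '}': '>'},
      '*': {'+': '>', '*': '>', '(': '<', ')': '>', '}': '>'},
      '(': {'+': '<', '*': '<', '(': '<', ')': '='},
      ')': {'+': '>', '*': '>', ')': '>', '}': '>'},
  }

  def parse_expr(s):
      # Returns a tuple tree, e.g. parse_expr('a+b*c') -> ('+', 'a', ('*', 'b', 'c')).
      ops, operands = ['{'], []
      tokens = list(s) + ['}']
      i = 0
      while True:
          t = tokens[i]
          if t.isalpha():                   # operand: straight to the operand column
              operands.append(t); i += 1; continue
          if ops == ['{'] and t == '}':
              return operands.pop()         # the complete C-structure
          relation = PREC[ops[-1]][t]       # KeyError here = illegal operator pair
          if relation == '<':               # fill
              ops.append(t); i += 1
          elif relation == '=':             # matching brackets: discard both,
              ops.pop(); i += 1             # leaving the inner pointer in place
          else:                             # '>': combine the top operator with
              op = ops.pop()                # its two operands
              right, left = operands.pop(), operands.pop()
              operands.append((op, left, right))

  parse_expr('a+b*(c+d)+e')
  # -> ('+', ('+', 'a', ('*', 'b', ('+', 'c', 'd'))), 'e'), i.e. E4 of Fig. 7.5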

7.5 Floyd productions

Several compilers and compiler generator systems have used methods similar to one given by Floyd (1961), and these are collectively known as Floyd productions. This is an unfortunate misnomer as these methods depend on a matching of part of a sentential form with a pre-assigned pattern, followed by reduction of the sentential form. This method is the basis of the compiler generator systems FSL and CABAL of Carnegie-Mellon University. Parsers using Floyd productions are described by Evans (1964), Feldman (1964), and Pratt and Lindsay (1966).

Algorithms of this type consist of an ordered set of rules for reducing the input stream. A special symbol | is used to define a pointer in the input stream and the symbol ! is used to denote any allowable symbol. The rules define rewrite operations on the input stream. An actual implementation usually consists of the symbols to the left of the pointer being stored in a stack and the characters to the right still on the input stream (that is the S-string). Table 7.3 defines a recogniser for the subset of arithmetic expressions introduced in the previous section. The letter I is used to denote the appearance of a member of the class of identifiers. The reduction of the expression a + b * ( c + d) + e would then proceed as shown in Table 7.4.

RULE              PRODUCTION              NEXT RULE 
1           ( | !     →    ( ! |       1
2           I | !     →    P ! |       3
3           T * P ! | →    T ! |       5
4           P ! |     →    T ! |       5
5           T * | !   →    T * ! |     1
6           E + T ! | →    E ! |       8
7           T ! |     →    E ! |       8
8           E + | !   →    E + ! |     1
9           ( E ) | ! →    P ! |       3
10          | !       →    ! |         1
11          { E } |   →    exit          
Table 7.3
Rule           Stack after rule applied         Input
                                         |a+b*(c+d)+e}
10             {I                        | +b*(c+d)+e}
2              {P+                       |  b*(c+d)+e}
4              {T+                       |  b*(c+d)+e}
7              {E+                       |  b*(c+d)+e}
8              {E+I                      |   *(c+d)+e}
2              {E+P*                     |    (c+d)+e}
4              {E+T*                     |    (c+d)+e}
5              {E+T*(                    |     c+d)+e}
1              {E+T*(I                   |      +d)+e}
2              {E+T*(P+                  |       d)+e}
4              {E+T*(T+                  |       d)+e}
7              {E+T*(E+                  |       d)+e}
8              {E+T*(E+I                 |        )+e}
2              {E+T*(E+P)                |         +e}
4              {E+T*(E+T)                |         +e}
6              {E+T*(E  )                |         +e}
9              {E+T*P+                   |          e}
3              {E+T+                     |          e}
6              {E+                       |          e}
8              {E+I                      |           }
2              {E+P}                     |           
4              {E+T}                     |           
6              {E}                       |           
11             exit                      |  
Table 7.4

The process is straightforward. It has been assumed that the identifier pointers are transformed into the symbol I on loading into the stack. This is figuratively what might happen. More likely the appearance of I in a rule will be equivalent to causing a check for an identifier pointer to be made on the item in the stack. In Table 7.3, the string of characters to the left of the symbol → defines the stack configuration to be matched, and the string to the right of the symbol defines the symbols to replace the string of characters on the left if a match occurs.

The straightforward algorithm would search the ordered set of rules from the top after each application of a rule until the first rule that could be applied was reached. This would be executed and then the scan would begin at the top again. This tends to be inefficient and the process can be speeded up if an extra column (the final one in Table 7.3) is added which gives the rule to be tried next after the successful application of a rule. The rules are tried sequentially from the point specified until the next rule that is applicable is encountered. This is executed and the next rule is then the starting point for the sequential search for the next applicable rule. In the example the number of rules examined without the next rule column was 125 compared with 45 when the successor was specified and the initial entry was made at rule 10.
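
An interpreter for such a set of rules might be sketched as follows. Each rule of Table 7.3 is encoded as a (pattern, shift, replacement, next rule) entry: the pattern must match the top of the stack, with ! matching any one symbol; shift records whether the next input symbol is also to be moved onto the stack (identifier letters loading as I); and the search for an applicable rule starts from the given successor, wrapping round if necessary.

  RULES = {
      1:  (['('],                True,  None,       1),
      2:  (['I'],                True,  ['P'],      3),
      3:  (['T', '*', 'P', '!'], False, ['T', '!'], 5),
      4:  (['P', '!'],           False, ['T', '!'], 5),
      5:  (['T', '*'],           True,  None,       1),
      6:  (['E', '+', 'T', '!'], False, ['E', '!'], 8),
      7:  (['T', '!'],           False, ['E', '!'], 8),
      8:  (['E', '+'],           True,  None,       1),
      9:  (['(', 'E', ')'],      True,  ['P'],      3),
      10: ([],                   True,  None,       1),
      11: (['{', 'E', '}'],      False, 'exit',     None),
  }

  def matches(pattern, stack):
      if len(pattern) > len(stack):
          return False
      top = stack[len(stack) - len(pattern):]
      return all(p in ('!', s) for p, s in zip(pattern, top))

  def floyd_parse(text, start=10):
      stack, rest, rule = ['{'], list(text) + ['}'], start
      order = sorted(RULES)
      while True:
          k = order.index(rule)             # sequential search for the next
          for n in order[k:] + order[:k]:   # applicable rule, from `rule` on
              pattern, shift, repl, nxt = RULES[n]
              if not matches(pattern, stack) or (shift and not rest):
                  continue
              if repl == 'exit':
                  return True
              if repl is not None:          # rewrite the matched symbols,
                  kept = stack[len(stack) - len(pattern):]
                  del stack[len(stack) - len(pattern):]
                  stack += [kept[pattern.index('!')] if r == '!' else r
                            for r in repl]  # keeping whatever ! matched
              if shift:
                  t = rest.pop(0)           # identifier pointers load as I
                  stack.append('I' if t.isalpha() else t)
              rule = nxt
              break
          else:
              return False                  # no rule applies: a syntax error

  floyd_parse('a+b*(c+d)+e')                # -> True, tracing as in Table 7.4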

In addition to making the process more efficient it also makes the reductions easier to write. With a continuous set to be scanned from the top at each step, it is vitally important to get the rules in the correct order. With the successor given, rules can be shielded from certain configurations on the input stream and stack so that ordering is not so important. In FSL, a subroutine facility has been added to make recognition of widely used non-terminal symbols easier.

The method allows great flexibility in the form of output, as each recognised rule can have a routine associated with it which defines the required action to be taken on recognition. For example, a postfix string could be output if rule 2 caused the relevant identifier to be output and rules 3 and 6 output * and + respectively. The arithmetic expression in the example would then produce:

   a b c d + * + e +

7.6 Top-down analysis

The top-down analysis method, unlike the methods already described, attempts initially to find the final goal of the analysis. This leads it to look at the sub-goals necessary to achieve the final goal. These sub-goals will involve finding sub-goals and so on. Eventually a sub-goal will require the matching of the input stream with some terminal characters and either this sub-goal is achieved or fails. In the examples given earlier, a notation was defined for producing legal sentences of the language. This same notation can be used as the definition of a recogniser as well as a generator for the language. In the example of English sentences:

   S  →  P Q

can be defined as meaning that to recognise the non-terminal symbol S, the non-terminal symbol P must be recognised followed by recognising Q. Analysers which use the syntax definition itself as the data to control the parser are called syntax directed.

In the example sentence the man has a dog, the top-down process could be described as follows. A sentence consists of a subject phrase followed by a verb phrase. First consider the subject phrase. This must consist of the followed by a noun. The input stream starts with the, so that matches correctly. Will the next part of the input be a noun? A noun is either the word man or the word dog. The first of these appears next on the input so both the and noun have been recognised. Therefore subject phrase could be matched with the input string the man. As sentence consists of a subject phrase followed by a verb phrase, can verb phrase match the remainder of the input has a dog? The subgoals of verb phrase would eventually all be obtained in a similar manner and finally sentence, the original goal, would be recognised as subject phrase followed by verb phrase.

The top-down method described above is called goal-oriented. At each stage a goal is set up. This will in turn involve finding some sub-goals. The sub-goals are checked from left to right and if all are found, the goal itself is found. This goal will usually be a sub-goal itself, and so information that successful recognition has taken place will be passed back to the goal of which this goal is a sub-goal.

The example given above was so trivial that the only place that non-recognition of a goal took place was at the level of the terminal words. However, in general, non-recognition may occur at any level. Alternatively a sub-goal may be able to get recognition in several different ways. The process can then be loosely described as follows:

The initial goal gets recognition from a sub-goal and later gets non-recognition from a second sub-goal. In this case the first sub-goal is repeated to give another alternative recognition if possible and the second sub-goal tried again. This continual back-tracking to try and find alternative sub-goals on failure makes the algorithm quite complex and also slow when implemented. An excellent description of such an algorithm is given by Floyd (1964). A simple example requiring back-tracking once a reasonable sub-goal has been found is as follows:

    S   → A x 
    A   → v 
    A   → B
    B   → v w

where the input is v w x.

To recognise S, the sub-goal A must first be recognised. Taking the first rule involving A, the character v is matched and the sub-goal A recognised. To recognise S, the character x must match the remainder of the input stream w x. This fails and so the sub-goal A must be attempted again. This time the string v w is recognised as B and then A. The goal S is then recognised by the final matching of x.

The algorithm described above has been given the name slow-back. A more efficient but less general top-down analyser is called a fast-back. The difference between the two algorithms is that, in the fast-back, once a sub-goal has been recognised, the assumption is made that this recognition is the correct one and the original goal, if it fails when attempting a later sub-goal, will immediately return with non-recognition rather than attempt to find alternative recognitions of the first sub-goal. As well as being much more efficient, the fast-back is simpler to construct than the slow-back. The problem of keeping information about sub-goals already entered and returned from is now removed. Only information about partially recognised goals need be kept. As most computer languages can have their syntax specified in a form that can be recognised by a fast-back, the slow-back has not been used to any extent in compilers. Examples of fast-back top-down analysers can be found in Glennie (1960); Brooker, Morris and Rohl (1962); Cheatham and Sattley (1964); and Banerji (1966).
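
A slow-back recogniser for the four rules above might be sketched as follows. Each non-terminal tries its alternatives in order, and each goal yields every position at which it could finish; a failure further to the right then simply asks the earlier sub-goal for its next alternative, which is exactly the back-tracking described.

  GRAMMAR = {
      'S': [['A', 'x']],
      'A': [['v'], ['B']],
      'B': [['v', 'w']],
  }

  def ends(symbol, text, pos):
      # Yield every position at which `symbol` can finish a match starting at pos.
      if symbol not in GRAMMAR:                 # terminal: match one character
          if text[pos:pos + 1] == symbol:
              yield pos + 1
          return
      for alternative in GRAMMAR[symbol]:       # alternatives tried in order
          positions = [pos]
          for sub in alternative:               # sub-goals, left to right
              positions = [q for p in positions for q in ends(sub, text, p)]
          yield from positions

  def recognise(text):
      return len(text) in ends('S', text, 0)

  recognise('vwx')   # -> True: A -> v leaves w x unmatched, so A -> B is used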

One problem with top-down parsers is their inability to deal with left-recursion in a language. Consider, for example, the rule:

E → E + T

in the definition of an arithmetic expression. If this was presented to a top-down parser, then to recognise E it is necessary to recognise the sub-goals E, + and T. Consider the first sub-goal E. To recognise the sub-goal E, the sub-goals E, + and T must be recognised, and so on. This process obviously goes on indefinitely. The problem of left-recursion can either be solved by changing the original set of rules defining the language or by having the parser itself do this rearrangement. Changing the rules will often change the syntactic structure itself however. For example:

E → E + T
E → T

could be replaced by

E → T X
X → + T X
X → null

Descriptions of top-down fast-back parsers handling left-recursion can be found in Cheatham (1966) and Banerji (1966).
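
A fast-back recogniser for the arithmetic expressions of section 7.4 might then be sketched as follows, with the left recursion removed as just described (the multiplication rule is assumed to be transformed in the same way as the addition rule). The tail rules X → + T X and X → null appear as iteration, and once a sub-goal has returned, the parser never backs up past it.

  class Parser:
      def __init__(self, text):
          self.toks, self.pos = list(text), 0

      def match(self, ch):                  # terminal sub-goal
          if self.pos < len(self.toks) and self.toks[self.pos] == ch:
              self.pos += 1
              return True
          return False

      def P(self):                          # P -> ( E ) | <identifier>
          if self.match('('):
              return self.E() and self.match(')')
          if self.pos < len(self.toks) and self.toks[self.pos].isalpha():
              self.pos += 1
              return True
          return False

      def T(self):                          # T -> P X,  X -> * P X | null
          if not self.P():
              return False
          while self.match('*'):
              if not self.P():
                  return False
          return True

      def E(self):                          # E -> T Y,  Y -> + T Y | null
          if not self.T():
              return False
          while self.match('+'):
              if not self.T():
                  return False
          return True

  def recognise_expr(text):
      p = Parser(text)
      return p.E() and p.pos == len(p.toks)

  recognise_expr('a+b*(c+d)+e')             # -> True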

The other major problem, which is usually easier to solve, is the order in which the rules are written. As these are tried in sequential order for a particular non-terminal, it is important that a rule which is part of a larger rule comes after the larger rule. Consider:

E → B
E → B C

When presented in that order, recognition would always be achieved on the first before attempting the second, so that the second rule would never be tried. In this case they must be ordered the other way around. In addition, the efficiency of the algorithm can be improved if rules having the same stems are grouped together. For example:

E → B a
E → B b

would, as it stands, cause the first rule to recognise B and then, if failure occurs on the match with a, would try the next rule. The next rule would enter and recognise B again before recognising b. If the rules are grouped together, then the algorithm can be made to only recognise B once and, if failure to match a occurs, then it tries immediately to match b. An example of such an algorithm is given in Brooker, Morris and Rohl (1962).

7.6.1 Selective top-down analysis

In the example English sentence, the top-down analyser would first attempt to recognise the subject phrase. This in its turn required the followed by a noun to be found as sub-goals. To stop the searching of fruitless paths, one modification, which has been applied, is to deduce beforehand what are the possible words that can start any phrase and, before attempting a sub-goal, to check that the characters on the input stream do satisfy this condition. This could of course in some cases lead to a considerable saving in time spent in searching for false goals.

In the example, the following relations could be set up:

    <sentence>         the
    <subject phrase>   the
    <verb phrase>      has 
    <object phrase>    a, the

where the right hand column defines the words which could start the non-terminal symbol. Before attempting to check any of these phrases the input stream would first be checked to ensure that it commences with the correct word. This analysis could obviously be carried further. The first two possible words could be stored or, going to the extreme, all legal sentences in the language could be stored. Obviously the latter would be very inefficient and somewhere in between there is an optimum strategy. This depends on the language concerned and it is difficult to generalise. Typically, a single character has been used. With each phrase is stored an ordered set of bits equal in number to the number of terminal symbols. Each bit represents whether or not a particular terminal symbol can appear at the head of the phrase. The improvement in performance of a selective top-down is offset by the inflexibility built into the parser. With a straightforward top-down it is a simple matter to change definitions as required. Conversely in a selective top-down, such changes will require the updating of all the bits dependent on the phrase replaced. This may well be as complex as the original setting up of the selection bits. The paper by Brooker (1967) implies that, in the language of the Compiler-Compiler, selective top-down gave no appreciable improvement over top-down.
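
The selection bits might be precomputed as follows; in this sketch the English subset is written word by word (the grammar representation and word list are assumptions of the sketch), and each non-terminal receives a mask with one bit for each terminal word, set if that word can start the phrase. Before a goal is attempted, the bit for the current input word is tested.

  def starter_masks(grammar, words):
      bit = {w: 1 << i for i, w in enumerate(words)}
      mask = {lhs: 0 for lhs, _ in grammar}
      changed = True
      while changed:                        # iterate to a fixed point
          changed = False
          for lhs, rhs in grammar:
              head = rhs[0]
              add = bit[head] if head in bit else mask[head]
              if mask[lhs] | add != mask[lhs]:
                  mask[lhs] |= add
                  changed = True
      return mask

  english = [('S', ['P', 'Q']), ('P', ['the', 'N']), ('Q', ['V', 'O']),
             ('O', ['A', 'N']), ('N', ['man']), ('N', ['dog']),
             ('V', ['has']), ('A', ['a']), ('A', ['the'])]
  words = ['the', 'a', 'man', 'dog', 'has']
  starter_masks(english, words)
  # S and P can start only with 'the', Q and V with 'has', and O and A
  # with 'a' or 'the' -- the relations tabulated above.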

7.7 Bottom-up analysis

Considering the English sentence the man has a dog once again, the parsing algorithm might have proceeded as follows. The goal is to produce the non-terminal <sentence>. First consider the possible rules to see if a rule exists which starts with the word the. There are two of these:

<article>          → the
<subject phrase>   → the <noun>

Considering the first, the sentential form is now:

<article> man has a dog

The only rule starting with <article> is <object phrase>. However, <object phrase> is not completed yet so, instead of continuing to aim for the goal, <sentence>, this could be stored and a new goal <noun> set up to match the input string man has a dog. The non-terminal <noun> is recognised and so therefore is <object phrase>. The sentential form is now:

  <object phrase> has a dog

and the original goal, <sentence>, is required once more. However, no rules start with <object phrase> and so this parse is incorrect. Back-tracking, the second rule starting with the is tried. This will require the remainder of <subject phrase> to be recognised as a sub-goal before proceeding. Once <noun> has been recognised, <subject phrase> has been recognised. In order to recognise <sentence>, therefore, the sentential form has a dog must be recognised as a <verb phrase>. Once again the parsing is done by starting at the terminal symbols and seeing which rules start with has. A hypothetical <verb phrase> will eventually be built up from the terminal symbols, and finally the original goal <sentence> will be achieved. At each stage a goal is set up and a parsing is attempted by examining a sentential form to see if the goal can be reached from the sentential form.

The parse of the arithmetic expression a + b * c is shown diagrammatically in Fig. 7.6. The letter g is used to point at the current goal and the letter i points to the start of the sentential form from which this goal is hoped to be obtained initially. The pointer i is moved along the sentential form as sub-goals of the goal g are recognised.

Fig. 7.6 - the ten successive goal and pointer configurations in the parse, from the initial goal E with i at a (1) to the completed structure (10); stacked goals and pointers are shown primed (g', i' and g'', i'').

Initially in (1), (2) and (3) the only possible rules are obeyed. At this stage either the rule E → T or T → T * P could be applied. Assume the first is chosen (4); there is now a match between the goal E and the structure produced so far. However the input is not exhausted, therefore E → E + T must be used (5). At this stage an uncompleted structure defining E has been produced. The current goal could still be aimed for at this point and the match between the structure produced and the goal made. However, if the canonical parse is required, then it is best to stack the current pointer and goal (denoted by g' and i') and to set up a subsidiary goal, which is the leftmost uncompleted part of the structure, together with a subsidiary pointer, which points to the leftmost completed structure. In (5), i points to + on the input stream and g points to the + in the structure. As the structure + is complete (it is just a symbol on the input stream), a match occurs. The next goal to choose is again the leftmost one of the uncompleted structure. Diagrams (6), (7) and (8) show how the structure to this goal is made. At this point the structure T is not complete and the current goal must be stacked again and a subsidiary goal set up. In (9) the subsidiary goal is completed and the previous goal and pointer are unstacked (10). As T is now complete, this structure can be matched and the original goal reset. This again will be matched and the complete structure produced.

The parser which chooses to deal with leftmost uncompleted sub-goals first, as described above, is called left-corner bottom-up. The algorithm can be modified to be selective by again forming a table which gives the possible terminal or non-terminal symbols that can start any goal. This check can then be made each time either a new goal or pointer position is defined. This stops the processing of a large number of incorrect goals.

Mixed top-down and bottom-up strategies have been used. For example, at (6) in the above example it would have been possible to change to a top-down parser to complete the subtree before continuing. This hybrid approach can be used to keep down the number of stacked goals. Descriptions of bottom-up analysers are given by Cheatham and Sattley (1964) and Ingermann (1966).

7.8 Comparison of parsers

Unquestionably, the most efficient of the parsers described is the one using operator precedence. The advantages are that only a subset (the operators) of the total input need be examined when generating the structure and that no back tracking occurs. Its disadvantage is that it may require a more complex lexical pass to get the S-string in the desired form and some languages may not be processed at all by this method. However, with languages in current use, it is still probably the most useful of the parsing algorithms.

The methods using Floyd productions can be hand tuned to give fairly efficient recognition. However, this will depend on how quickly a production can be matched and how long the path is before the next correct production is reached.

Top-down and bottom-up parsers have been compared by Griffiths and Petrick (1965). The results were, however, inconclusive because of the artificial languages considered and the method of defining the top-down parser. It is not obvious at present which is the more useful for computer languages in use today.

7.9 Implementation of operator precedence parsers

The operator precedence method was described briefly in Section 7.3. The decision on whether to fill the stack or produce some combination of the stack elements was made by examining a precedence table. As can be seen from Fig. 7.3, there are only four possible entries in the precedence table. It can contain one of the operations <, >, = or it may be empty. An empty entry implies that in the language there are two operators that should never appear next to each other. As there are only four possibilities, these can be stored in two binary digits so that the amount of space used by the precedence table is very small. The two bits could be stored together in a two-dimensional table. The drawbacks of such an arrangement in practice are that the action to be taken at each entry must be fairly rigid, combination must occur in a set manner and no special operations on the stack can be introduced.

The alternative is to have each entry in the precedence table containing a pointer to a routine to be executed when the relevant pair of operators appear on the top of the stack. This allows housekeeping and non-standard actions to be done, for example, by the routines when special stack configurations are encountered. Consider, for example, a compiler for a statement such as:

   real a, b, c;

The identifiers will have been entered in the symbol table by the lexical scan and the pointer to this entry would appear in the S-string. The operator real would be a single symbol. The stacking would proceed as shown in Fig. 7.7. In this example, the operation defined for (real, ,) would be a special routine which would mark the position of the identifier in the symbol table as being of type real and then remove the top operand and operator from the stack before filling. The (real, ;) position would again make an entry in the dictionary followed by removal of the top two operators, and filling.

Another useful operation would be to change the operators in the stack itself when a small routine is executed. This allows the action required in certain positions to be varied depending on previous context. It allows parsing of languages which are, in theory, not operator precedence grammars. It is rather similar to the transformations produced on the stack by the Floyd productions method.

  (1)  { real
  (2)  { real a ,
  (3)  { real b ,
  (4)  { real c ;
  (5)  {
Fig. 7.7

The use of a bit table for the precedence table means that only decisions such as fill or combine are allowed on the stack. A more flexible combination action can be obtained if combination is always assumed to be between an operator and two operands. This means that the = operator is only allowed to apply between bracketing symbols such as ( and ) which can be treated specially. In Fig. 7.4, position (6) could cause the expression E1 to be moved to the free operand position on the stack level containing *, followed by the reading of the next operator into the stack. The operator = can then be thought of as a special form of combination. If this is done, then each combination occurs between a particular operator and operands. Each can be treated differently by having an operator table which specifies the desired action. Being one-dimensional this is obviously more conservative on space than the two-dimensional precedence table containing pointers.

It may be that the action required for several different operators is the same as far as the precedence table is concerned. For example, + and - can be considered the same as far as their syntactic relationship with other operators is concerned. In this case it may be reasonable to consider the precedence table entries as being defined by operator classes rather than operators. The stack could be increased to give columns for both the operator and operator class and the switching will be done on the classes. The lexical scan could easily be designed to provide both these symbols on the S-string.

For precedence tables, it is possible that two precedence functions, f and g, can be defined so that if X < Y then f(X) < g(Y), if X = Y then f(X) = g(Y) and if X > Y then f(X) > g(Y). A table giving values of f and g for the various operators might be more compact, although it is not completely obvious. Given the precedence table, a method for defining precedence functions, if they exist, is described by Floyd (1963). Floyd defines a precedence table and functions for Algol. The size of the precedence table was 34 by 35, which would require 2380 bits, compared with the precedence functions which required 5 bits each and therefore 345 bits. This difference is not large and the precedence functions do not give any means of defining error positions. In an actual implementation this difference in storage size may well be even less, so that it is not too obvious how useful precedence functions are in practice.
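
Precedence functions, where they exist, can be found by simple iteration: raise the values of f and g until every entry of the table is satisfied, and give up if they fail to settle, which is what happens when no such functions exist. The sketch below uses this plain relaxation rather than Floyd's own construction; applied to the table of Fig. 7.3 (flattening the PREC dictionary of the earlier sketch) it yields suitably small functions.

  def precedence_functions(table, symbols, limit=100):
      # table[(X, Y)] is '<', '=' or '>'.  Returns (f, g), or None if the
      # constraints never settle, i.e. no precedence functions exist.
      f = {s: 0 for s in symbols}
      g = {s: 0 for s in symbols}
      for _ in range(limit):
          changed = False
          for (x, y), r in table.items():
              if r == '<' and not f[x] < g[y]:
                  g[y] = f[x] + 1; changed = True
              elif r == '>' and not f[x] > g[y]:
                  f[x] = g[y] + 1; changed = True
              elif r == '=' and f[x] != g[y]:
                  f[x] = g[y] = max(f[x], g[y]); changed = True
          if not changed:
              return f, g
      return None

  flat = {(x, y): r for x, row in PREC.items() for y, r in row.items()}
  precedence_functions(flat, '{+*()}')
  # -> a consistent pair of small functions, each value fitting in a few
  # bits, as in the figures for Algol quoted above.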
