Compiling Techniques
Chapter 5: Language Description

5.1 Introduction

Before describing the lexical and syntax analysis of the compiler, it is necessary to define some notation for describing the syntax of a programming language and some classification scheme for different types of languages. Most of the work described here derives from the work of Chomsky (1959) in the field of natural languages. The standard method adopted for describing a language is a generative scheme whereby a procedure is given for producing all the legal sentences of the language. An alternative approach would be to have a recogniser which could decide whether or not a sentence was legal. This alternative is seldom used in practice for defining a language but does closely resemble the syntax analysis phase of a compiler.

5.2 Notation

An example from the definition of English sentences will help to describe the notation adopted. An English sentence might be defined as a subject phrase followed by a verb phrase. The subject phrase might consist of the word the followed by a noun. The verb phrase could be defined as a verb followed by an object phrase. The object phrase might consist of an article followed by a noun. Emphasis has been used here to denote the syntactic elements described in the above sentences. A subset of the complete set of English sentences could be defined by allowing only the nouns man and dog, the verb has and the articles a and the. The words above are English words which will actually appear in the English sentence produced and are therefore given the name terminal symbols. The notation adopted for describing these constructs is:

1 <sentence>       →      <subject phrase> <verb phrase> 
2 <subject phrase> →      the <noun> 
3 <verb phrase>    →      <verb> <object phrase> 
4 <object phrase>  →      <article> <noun> 
5 <verb>           →      has 
6 <article>        →      a 
7 <article>        →      the 
8 <noun>           →      man 
9 <noun>           →      dog

In this notation the syntactic elements are enclosed by < and >. An English sentence is produced by taking the element <sentence> and recursively replacing the left hand sides of rules by the equivalent right hand sides until a form is obtained containing no more syntactic elements but only words in the language. Rules of this form are often called productions. The sentence the man has a dog would be obtained by starting from <sentence> and using the following productions:

Production used          <sentence>
1                       <subject phrase> <verb phrase> 
2                       the <noun> <verb phrase> 
3                       the <noun> <verb> <object phrase> 
4                       the <noun> <verb> <article> <noun> 
9                       the <noun> <verb> <article> dog 
8                       the man <verb> <article> dog 
5                       the man has <article> dog 
6                       the man has a dog

Note that there are many different orders in which the productions can be used to produce the sentence the man has a dog from the syntactic element <sentence>. The one chosen was not the most obvious just to emphasise this point.
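
The replacement process can be sketched in code. The following Python fragment (an illustration added here, not part of the original text) stores the nine productions as data and applies a given sequence of them, always rewriting the leftmost occurrence of each left-hand side; the sequence chosen is therefore a leftmost derivation, one of the many possible orders mentioned above.

```python
# The nine productions, numbered as in the text: LHS and list of RHS symbols.
PRODUCTIONS = {
    1: ("<sentence>",       ["<subject phrase>", "<verb phrase>"]),
    2: ("<subject phrase>", ["the", "<noun>"]),
    3: ("<verb phrase>",    ["<verb>", "<object phrase>"]),
    4: ("<object phrase>",  ["<article>", "<noun>"]),
    5: ("<verb>",           ["has"]),
    6: ("<article>",        ["a"]),
    7: ("<article>",        ["the"]),
    8: ("<noun>",           ["man"]),
    9: ("<noun>",           ["dog"]),
}

def derive(start, steps):
    """Apply the numbered productions in turn, always rewriting the
    leftmost occurrence of each production's left-hand side."""
    form = [start]
    for n in steps:
        lhs, rhs = PRODUCTIONS[n]
        i = form.index(lhs)          # leftmost occurrence of the LHS
        form[i:i + 1] = rhs          # replace it by the RHS
    return form

print(" ".join(derive("<sentence>", [1, 2, 8, 3, 5, 4, 6, 9])))
# prints: the man has a dog
```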

For brevity, the syntactic elements enclosed between < and > will in future be written as single upper case letters while terminal symbols will always be written as lower case letters. Another name often given to the syntactic elements is the rather obvious one of non-terminal symbols.

The above example in the abbreviated notation would be:

   1    S     →    PQ
   2    P     →    the N
   3    Q     →    VO
   4    O     →    AN
   5    V     →    has
   6    A     →    a
   7    A     →    the
   8    N     →    man
   9    N     →    dog

Starting from the non-terminal S several intermediate strings containing both non-terminal and terminal symbols are obtained before finally producing the man has a dog. These intermediate forms together with the final string the man has a dog are given the name sentential forms. In the example:

    the N V A dog 

is a sentential form. From now on, lower case Greek letters will be used to represent sentential forms. The grammar of a language, then, consists of a set of productions having a unique non-terminal symbol, say S, which appears only on the left-hand side of productions. All the sentential forms containing no non-terminal symbols which can be produced from S will be the legal sentences obtainable in this language. The general form of productions allowed is:

 
   α  →   β 

where both α and β represent sentential forms. If during the enumeration of legal sentences a sentential form is obtained with the form α as a substring then another form can be produced by replacing α by β.

5.3 Chomsky classification

The type of language obtained when productions of the form:

 
   α  →   β 

appear in the definition of the grammar, with no restrictions existing on the sentential forms α and β, has been given the title Type 0 by Chomsky. This language type is much too general for computer languages, although even this generality is unable to cope with natural English.

Chomsky defined several other classes of language each of which is a subset of the previous class.

The Chomsky Type 1 language is one where the productions are restricted to:

 
   α A  β  →   α π  β 

where A is a single non-terminal symbol, π is not null and α, β, π are sentential forms. Languages in this class are often given the name context sensitive.

A Chomsky Type 2 language is one where the productions are restricted to the form

 
   A   →  π   

that is, the left hand side consists of only one non-terminal symbol. Most programming languages approximate to the classification Chomsky Type 2. The notation BNF [Naur et al. (1960)] is equivalent to the definition given above for Chomsky Type 2, so that the language Algol which is defined by the notation BNF is of Chomsky Type 2.

For languages in use today which do not correspond to Chomsky Type 2, it is usually possible, provided the inconsistencies are only minor, to alter them into this form at the lexical scan. Most of the discussion of translator techniques will assume that the language is either Type 2 or a subset of this class.

The Chomsky Type 3 languages are ones where productions are restricted to the form:

 
   A   →  a   
   A   →  aB  

that is, the right hand sides must consist of only a terminal symbol or a terminal symbol followed by a non-terminal symbol. The Type 3 languages coincide with the finite state languages and most of the work done at the lexical scan will be on parts of the program defined as a finite state language.
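
The correspondence with finite state machines can be made concrete. In the sketch below (an illustration added here, not an example from the text), an invented Type 3 grammar for identifiers, N → a, N → aR, R → a, R → aR, R → d, R → dR, with a standing for any letter and d for any digit, is turned into a machine: each non-terminal becomes a state, each production A → aB a transition, and each production A → a permission to stop.

```python
# Each non-terminal of the Type 3 grammar becomes a state; a production
# A -> aB becomes a transition on input class a, and a production A -> a
# makes the successor state an accepting one.  The grammar (invented for
# illustration) describes identifiers: a letter followed by any mixture
# of letters and digits.
TRANSITIONS = {
    "start": {"letter": "rest"},
    "rest":  {"letter": "rest", "digit": "rest"},
}
ACCEPTING = {"rest"}

def classify(c):
    if c.isalpha():
        return "letter"
    if c.isdigit():
        return "digit"
    return "other"

def accepts(s):
    state = "start"
    for c in s:
        state = TRANSITIONS.get(state, {}).get(classify(c))
        if state is None:            # no transition: reject at once
            return False
    return state in ACCEPTING

print(accepts("x27"), accepts("9x"))
# prints: True False
```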

5.4 Parsing

As shown already, it is possible to decide whether one sentential form can be produced from another by taking the initial sentential form and enumerating all sentential forms obtainable from this form by repeated use of the productions defined in the grammar. Given a sentential form α and a second form β, then α directly produces β if and only if

 
   α   = γAδ and   
   β   = γπδ  

and there is a production A→ π. Conversely β directly reduces into α.

Similarly, α produces β and β reduces into α if and only if there exists a sequence of sentential forms such that:

α directly produces α1, α1 directly produces α2, . . . ,
αn-1 directly produces αn, and αn directly produces β.

Given that a sentential form α produces a form β, then a parse of the string β with respect to α is a sequence of numbers defining the productions that apply at each stage in reducing β into α.

In the example, the sentence the man has a dog has a parse with respect to <sentence> of [6.5.8.9.4.3.2.1]. Note that this parse is not well defined in all cases. For example, the sentence the man has a man might have a parse [6.5.8.8.4.3.2.1], in which case it would not be obvious which of the words man was being reduced to <noun> in the two uses of production 8.

Therefore a canonical parse is defined which starts at the left hand end of the sentential form β and applies first the production which reduces the characters furthest to the left. This produces a second sentential form, and the process is repeated until α is obtained. The canonical parse in our example would be:

        [8.2.5.6.9.4.3.1]

The sub-string which is reduced by the first reduction in the canonical parse is called the handle of the sentential form. In the example, the handle of the sentence is man.

A language is said to be unambiguous if there exists only one canonical parse for every sentence in the language. Given a grammar, it can be shown that, in general, there is no algorithm for deciding whether or not the grammar is ambiguous. If, in the example, a production also existed of the form:

   <subject phrase> →  the man

then the sentence in the example could be parsed in two different ways, namely:

   <subject phrase> →   the <noun>
   <noun>           →   man
or,

   <subject phrase> →  the man
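
The ambiguity can be demonstrated mechanically. The sketch below (illustrative Python added here, not part of the original text) counts the leftmost derivations of a sentence; each canonical parse corresponds to exactly one leftmost derivation, so a count greater than one shows the grammar is ambiguous for that sentence. The extra rule is numbered 10 here purely for convenience.

```python
# Productions 1-9 of the example grammar, plus the extra rule,
# numbered 10 here for convenience.
PRODUCTIONS = {
    1:  ("<sentence>",       ["<subject phrase>", "<verb phrase>"]),
    2:  ("<subject phrase>", ["the", "<noun>"]),
    3:  ("<verb phrase>",    ["<verb>", "<object phrase>"]),
    4:  ("<object phrase>",  ["<article>", "<noun>"]),
    5:  ("<verb>",           ["has"]),
    6:  ("<article>",        ["a"]),
    7:  ("<article>",        ["the"]),
    8:  ("<noun>",           ["man"]),
    9:  ("<noun>",           ["dog"]),
    10: ("<subject phrase>", ["the", "man"]),
}

def count_leftmost(form, target):
    """Count the leftmost derivations of `target` from `form`.
    Non-terminal symbols are those enclosed in < and >."""
    # Find the leftmost non-terminal; the terminals before it must
    # already form a prefix of the target sentence.
    i = 0
    while i < len(form) and not form[i].startswith("<"):
        i += 1
    if form[:i] != target[:i] or len(form) > len(target):
        return 0                       # dead end: prune this branch
    if i == len(form):                 # no non-terminals remain
        return 1 if form == target else 0
    return sum(count_leftmost(form[:i] + rhs + form[i + 1:], target)
               for lhs, rhs in PRODUCTIONS.values() if lhs == form[i])

print(count_leftmost(["<sentence>"], "the man has a dog".split()))
# prints: 2
```

Removing rule 10 from the table brings the count back to one, since the grammar of the example is then unambiguous for this sentence.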

So far it has been assumed that it is known how the initial sentential form produces the final form, so that the canonical parse can be defined. This does not mean, however, that, given the final sentential form and the set of productions defining the grammar, the canonical parse can be produced. That is the problem to be solved by the syntax analysis phase of the compiler.

Some of the problems can be seen in the following example:

Suppose the sentence the man has a dog is to be parsed. The initial the could have been produced by production 7. Production 8 would then recognise man as a noun, giving the reduced form <article> <noun> has a dog. This would be reduced by 4 to <object phrase> has a dog. Only at this point would it be discovered that no production could be applied, and it would then be necessary to go back and try an alternative approach. The organisation of such back-tracking will be one of the main concerns of some of the parsers to be described in the chapter on syntax analysis.
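
The back-tracking search just described can be sketched as follows (illustrative Python added here, not the book's algorithm): try every possible reduction at every position, and abandon a choice whenever it leads to a dead end; the recursion undoes the choice automatically.

```python
# Productions 1-9 of the example grammar.
PRODUCTIONS = {
    1: ("<sentence>",       ["<subject phrase>", "<verb phrase>"]),
    2: ("<subject phrase>", ["the", "<noun>"]),
    3: ("<verb phrase>",    ["<verb>", "<object phrase>"]),
    4: ("<object phrase>",  ["<article>", "<noun>"]),
    5: ("<verb>",           ["has"]),
    6: ("<article>",        ["a"]),
    7: ("<article>",        ["the"]),
    8: ("<noun>",           ["man"]),
    9: ("<noun>",           ["dog"]),
}

def parse(form, goal="<sentence>"):
    """Return a sequence of production numbers reducing `form` to
    `goal`, or None if the form is not a sentence.  Trying each
    possible reduction in turn and recursing is a depth-first search
    with back-tracking: a reduction that leads nowhere (such as the
    premature use of production 7 on the initial `the`) is simply
    abandoned and the next candidate tried."""
    if form == [goal]:
        return []
    for n, (lhs, rhs) in PRODUCTIONS.items():
        k = len(rhs)
        for i in range(len(form) - k + 1):
            if form[i:i + k] == rhs:            # rhs occurs at position i
                rest = parse(form[:i] + [lhs] + form[i + k:], goal)
                if rest is not None:
                    return [n] + rest
    return None

print(parse("the man has a dog".split()) is not None)   # prints: True
print(parse("dog has the".split()) is not None)         # prints: False
```

Note that the parse found this way need not be the canonical one, and that the exhaustive search is far too slow for practical compilers; the parsers of the later chapter exist precisely to avoid or organise this back-tracking.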

Expressing it another way, the problem of parsing is to find the handle of the sentential form at each stage of the reduction process.

5.5 Recursion

Before closing this chapter it is necessary to define some more terms concerned with recursion. A grammar is said to be left recursive if productions of the form:

 
     A    →   Aα  

exist in the language.

A grammar is said to be right recursive if productions of the form:

 
     A    →   αA  

exist in the language.

A grammar is said to be self embedding if rules of the form:

 
     A    →   αAβ

exist in the language.

© Chilton Computing and UKRI Science and Technology Facilities Council