Compiling Techniques
Chapter 6: Lexical Analysis

6.1 Introduction

The aim of the lexical analysis phase of the compiler is to take the input program, which is presented to the compiler in some arbitrary form, and translate this into a string of characters which will be called the S-string. This is the input to the syntax analyser. The S-string should consist of grammatically correct sentences belonging to a language (the S-language) which can be analysed by the parser provided in the syntax analysis phase. The S-language will normally be of Chomsky Type 2 or a subset of this class, depending on the type of parser to be used.

The lexical analyser could of course produce many different S-languages which satisfy this criterion. It is therefore important to remember two other desirable properties of the S-language. First, the grammatical classes existing in the syntax of the S-language should be closely linked to the semantic meaning of the strings defined in the class. For example, A + B * C in Fortran should be parsed as A + (B * C) and not as (A + B) * C. Sometimes the lexical analyser can aid the syntax analyser here by converting or rearranging characters. Secondly, the grammar of the S-language should be uniform, in the sense that the majority of syntactic structures contained in the S-language should be parsable by the same type of algorithm as the whole language. For example, it would be unreasonable for the lexical analyser to have as output an S-language which was of Chomsky Type 2 apart from one construct which made it of Chomsky Type 1, especially if the S-language could easily be altered by the lexical analyser to resolve the anomaly.

For example, Algol can be defined as a precedence grammar [see Floyd (1963)] apart from the two uses of : as a label delimiter and also in the array declaration. If the lexical analyser is able to change the internal form of one of the uses, so that the two uses are distinct symbols to the syntax analyser, then a precedence grammar parser may be used rather than a more general and less efficient parser. This is a fairly trivial problem at the lexical level as the array use always occurs between the symbol array and the following semicolon.
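This disambiguation might be sketched as follows (a Python sketch on a token list; the symbol names BOUND_COLON and LABEL_COLON are invented for illustration). A single pass replaces the two uses of : by distinct internal symbols, using the fact that the array use lies between the symbol array and the following semicolon:

```python
# Hypothetical sketch: give the two uses of ":" distinct internal
# symbols.  Between "array" and the following ";" every ":" is the
# bound-pair delimiter of an array declaration; elsewhere it is a
# label delimiter.
def split_colon_uses(tokens):
    out = []
    in_array_decl = False
    for t in tokens:
        if t == "array":
            in_array_decl = True
        elif t == ";":
            in_array_decl = False
        if t == ":":
            out.append("BOUND_COLON" if in_array_decl else "LABEL_COLON")
        else:
            out.append(t)
    return out
```

With the two colons now distinct symbols, the syntax analyser can use the more efficient precedence parser.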

At the other extreme, it may be that parts of the language fall into the class Chomsky Type 3 which can easily be analysed at the lexical level. For example, it is usual for the lexical analyser to pack-up identifiers into a single symbol and floating point constants into the numerical value. If left to the syntax analyser, this will probably be done inefficiently. Similarly basic symbols in the programming language, which, because of hardware deficiencies, are represented by several characters on the external media, should be converted to single characters in the S-language. For example, the Algol reserved words such as begin should be recognised at the lexical level and converted to a single symbol.

6.2 Line reconstruction and input conversion

The input to a compiler will vary considerably from computer to computer and may well be different for two compilers on the same computer. Consequently this section must be a generalisation of the situation in any particular environment. Frequently the input to the compiler is a program punched on paper tape or cards and presented to the computer via some peripheral device. In this case, the characters presented to the lexical analyser as input closely resemble the binary numbers obtained by considering a punched hole as equivalent to a binary 1. Alternatively there may be some supervisory system that converts the input to a form independent of the peripheral equipment before passing it to the compiler. The input may be provided a character at a time, a line at a time or even the whole program.

If characters are being provided to the lexical analyser from the peripheral equipment one at a time, then it is likely that the numerical values of holes punched on tape or cards do not have much resemblance to the numerical values desired for characters in the S-language. However, many characters on the input have a 1-1 correspondence with characters appearing in the S-language. Consequently it is usually desirable to have a conversion table producing internal forms of the characters as they are read from the input. These characters will belong to a language called the L-language, and the aim of the conversion will be to make characters in the L-language and S-language the same where this 1-1 correspondence exists. For example, the external forms of the operators + and - on the input media might be the punched codes 01011 and 11010 on five-hole paper tape. The S-language might require these to be the numerical values 28 and 29 in which case the conversion table would ensure that these values were the ones for + and - in the L-language.

As the number of possible punch configurations on a paper tape, for example, is not high, it is normal to use a direct access table for achieving the character conversion. The conversion table may be divided into several parts corresponding to different states of the input stream, and the look-up in the conversion table is done in the part of the table corresponding to the state. For example a five-hole paper tape only has 32 different hole configurations. In order to increase the number of possible characters available, two of the hole configurations are defined as the control characters figure shift and letter shift. A hole configuration then represents one of two characters depending on which control character was the last to be punched on the tape before it.

The conversion table could be set up with each entry having two parts. The first part contains the L-form of the character, and the second contains a pointer to a small routine which is entered after the conversion and defines any special action to be taken. In the example, the most common small routine would be one which would go back to analyse the next character after dealing with the one converted. The table would consist of 64 entries with the case pointer set either at the start (for figure shift say) or half way down. The entry in the conversion table looked at would be the mth one after the case pointer, if the numerical value of the character was m. When the characters figure shift or letter shift were read, the relevant position in the conversion table would point to a small routine which would reset the position of the case pointer. Redundant characters on the input such as runout and erase could be removed by small routines which immediately returned for the next character.
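A minimal sketch of such a two-state conversion table follows (Python; the hole codes for +, - and a are illustrative, as is the choice of letter case as the starting state, and only three printable characters are loaded). Each of the 64 entries holds an L-character and an action; the shift characters move the case pointer and redundant characters are skipped:

```python
# Illustrative sketch of a direct-access conversion table for
# five-hole tape.  FIGS and LETS are the two case-pointer positions;
# hole codes below are assumptions, not any real tape code.
FIGS, LETS = 0, 32
table = [(None, 'skip')] * 64          # default: redundant character
table[FIGS + 0b01011] = ('+', 'emit')
table[FIGS + 0b11010] = ('-', 'emit')
table[LETS + 0b00011] = ('a', 'emit')
for base in (FIGS, LETS):              # shift characters act in both cases
    table[base + 0b11011] = (None, 'figs')
    table[base + 0b11111] = (None, 'lets')

def convert(holes):
    """Convert a stream of 5-bit hole patterns to L-characters."""
    case = LETS                        # assume the tape starts in letter case
    out = []
    for h in holes:
        ch, action = table[case + h]
        if action == 'emit':
            out.append(ch)
        elif action == 'figs':
            case = FIGS
        elif action == 'lets':
            case = LETS
        # 'skip': runout, erase etc. - go straight back for the next character
    return out
```

The small routines of the text correspond to the action field; in a real analyser each entry would point at actual code rather than a tag.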

It has been assumed, so far, that characters, once converted, could be passed on one at a time to be analysed. However, certain peripheral devices exist which allow the carriage to be moved by a backspace key. This allows overprinting of characters and also allows the position of characters in the input line to be different from the order in which they appear on the input tape. For example the word as could appear on the input as:

  space s backspace backspace a

In this case a line image equivalent to the original print out on the peripheral device must be formed before analysis can take place. This is called line-reconstruction.

A buffer allowing several entries in each character position of the line needs to be set up with a pointer P to the current position reached across the line and a second pointer M giving the maximum length of line reached so far. The normal small routine pointed at by the conversion table will then store the character in the buffer at position P and advance the position of P. The pointer M could be updated every time P is increased, or alternatively it is more efficient to reset it only when P is decreased. This can be done by the small routine associated with the backspace character. In this case the final value of M is max(P,M).

The number of characters allowed in any position will be two or three (overprinting more than three characters is unlikely as the print out would become unreadable). As it is not possible to decide in which order two overprinted characters have arrived, the entries in any buffer position must be re-ordered into some standard form such as numerically ascending order. This will guarantee that punched as = backspace / and / backspace = are recognised as the same character.

The problems attached to line reconstruction are all concerned with making the internal buffer equivalent to what the user sees on the print-out of his program. If the user overprints a character by itself, should this be a different symbol? If a backspace is attempted at the start of a line, what happens on the punching device? Should a character printed over an erase be ignored? What are the tab settings on the device so that these can be followed internally? Once the line image has been obtained, the lexical analysis is the same as for the normal input straight from the conversion table.
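The buffer scheme with pointers P and M might be sketched as follows (Python; 'BS' is an illustrative name for the backspace character, and each position simply collects its overprinted characters in a list):

```python
# Sketch of line reconstruction.  Each line position holds the
# characters overprinted there (in practice at most two or three);
# P is the current position, M the maximum position reached.
def reconstruct(chars, width=80):
    buf = [[] for _ in range(width)]
    P = M = 0
    for c in chars:
        if c == 'BS':            # backspace: update M, then step back
            M = max(P, M)
            P = max(P - 1, 0)
        else:
            buf[P].append(c)
            P += 1
    M = max(P, M)                # final value of M is max(P, M)
    # Re-order overprinted characters into a standard (sorted) form so
    # that "= BS /" and "/ BS =" give the same internal character.
    return [sorted(pos) for pos in buf[:M]]
```

The sorting step is what guarantees that the order of arrival of overprinted characters does not matter.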

6.3 Lexical conversion

So far, the conversion table has produced an L-string which, in the case of characters having a 1-1 correspondence between the input character and S-character, has the L-character equal to the S-character. Otherwise the conversion table has produced an L-character which will be in the most reasonable form for the lexical analyser.

For example, the composite Algol symbols such as begin are delimited in most input forms by marking the letters of the symbol in some way. Frequently the individual letters are underlined. The lexical analyser in this case can recognise the start of an Algol composite symbol by the underlining of the initial letter. Therefore, this concept can be extended so that the conversion table marks (underlines) all characters that could start a composite Algol symbol. A dialect having a non-escape underline, which would present begin as _b_e_g_i_n, could have the conversion table ignoring the underline characters apart from remembering that the next character had to be marked as underlined. Similarly begin written as 'begin' could have the primes changing the pointer to the conversion table so that the look-up of input characters would be in a part of the table which converted characters to the marked form. The final prime would reset the pointer in the conversion table. In most Algol dialects, however, there will be a non-empty set of composite symbols which are not already marked in this way. For example, the Algol symbol := often appears as : followed by =, and the exponentiation sign often appears as **. The conversion table should therefore set underline markers on : and *.

The recognition process can then be described as follows:

  (a) If the next character is not underlined then convert it as a simple symbol.
  (b) If the next character is underlined then check the L-string against entries in the composite symbol table to see if a match is found. If a match is found pass on the relevant S-character.
  (c) If no match is found convert the first character as a simple symbol.

Rule (c) ensures that, for example, : not followed by = is recognised as :. At step (b) the table look-up would probably be in a hash table with the composite symbols preloaded into it. An open hash or overflow with internal chaining would be the most usual method to use. As the conversion table gives complete control over the form of the entries, an average search length very close to 1 should be obtained. A full description of the lexical analyser for an Algol compiler is given by Hopgood and Bell (1967). It is important to notice that in extending the lexical analyser to accept an additional hardware representation of the input, all that is required is an additional conversion table with possibly minor extensions to the composite symbol table.
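Rules (a)-(c) might be sketched as follows (Python; the composite symbols, the marked-character set and the S-character names are all illustrative, and a real analyser would use the preloaded hash table described above rather than a linear scan):

```python
# Sketch of rules (a)-(c).  Composite symbols begin with a marked
# (underlined) character; everything here is illustrative.
COMPOSITES = {':=': 'ASSIGN', '**': 'POWER', 'begin': 'BEGIN'}
MARKED = {':', '*', 'b'}        # characters that may start a composite

def lex(src):
    out, i = [], 0
    while i < len(src):
        if src[i] in MARKED:    # rule (b): try the composite symbol table
            for sym, s_char in COMPOSITES.items():
                if src.startswith(sym, i):
                    out.append(s_char)
                    i += len(sym)
                    break
            else:               # rule (c): no match - simple symbol
                out.append(src[i])
                i += 1
        else:                   # rule (a): not marked - simple symbol
            out.append(src[i])
            i += 1
    return out
```

As rule (c) requires, a : not followed by = falls through and is passed on as : itself.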

Apart from the recognition of the Algol symbols, it is probably desirable to also pack up identifiers appearing in the program so that all that is passed is an S-character representing the position in the table where the identifier is stored. It is important to realise that this process is in no way connected with storage allocation for the identifier. It should be thought of as a process for replacing many character identifiers by single character identifiers. In Algol this same identifier name may be used for several different variables in different blocks. It is the task of the syntax analyser and associated semantic routines to discover this structure of the program and allocate storage.

The identifiers could be looked up and entered in a separate hash table usually called the symbol table. However, it is also possible for the composite symbol table for the Algol symbols to also be the symbol table. The basic Algol symbols are preloaded into the table and identifiers added as they appear. This ensures rapid access to the basic Algol symbols [see Batson (1965)]. Some dialects have no special delimiters for the Algol basic symbols, and they are only recognised by assuming they have reserved names which cannot be used as identifiers. In this case the recognition of Algol basic symbols and identifiers is almost identical.
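A sketch of such a combined table follows (Python, with a dictionary standing in for the hash table; the preloaded reserved words are illustrative). Identifiers are added as they appear, and the table position serves as the single S-character:

```python
# Sketch: one table serving both as the composite symbol table and the
# symbol table.  Reserved words are preloaded so that they get the
# lowest indices and rapid access; identifiers are added on first use.
symtab = {}
for word in ('begin', 'end', 'if', 'then', 'else'):   # illustrative preload
    symtab[word] = len(symtab)

def intern(name):
    """Return the single S-character (table index) for a name."""
    if name not in symtab:
        symtab[name] = len(symtab)
    return symtab[name]
```

Note that this merely replaces many-character identifiers by single-character ones; it allocates no storage, and the same index may later denote several distinct variables in different blocks.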

6.4 Number conversion

A good example of a Chomsky Type 3 subset of a language is usually the structure of floating point constants in the language. The Algol definition of a floating point number is given in the report as follows:

<unsigned number>    →    <decimal number>
<unsigned number>    →    <exponent part>
<unsigned number>    →    <decimal number> <exponent part>
<decimal number>     →    <unsigned integer>
<decimal number>     →    <decimal fraction>
<decimal number>     →    <unsigned integer> <decimal fraction>
<unsigned integer>   →    <digit>
<unsigned integer>   →    <unsigned integer> <digit>
<decimal fraction>   →    .<unsigned integer>
<exponent part>      →    E<decimal number>
<integer>            →    <unsigned integer>
<integer>            →    +<unsigned integer>
<integer>            →    -<unsigned integer>

As it stands this is not Chomsky Type 3. However, it is a simple matter to change the rules to the following set:

<unsigned number>         →    digit<rest unsigned number>
<unsigned number>         →    .<decimal fraction>
<unsigned number>         →    E<exponent part>
<rest unsigned number>    →    digit<rest unsigned number>
<rest unsigned number>    →    .<decimal fraction>
<rest unsigned number>    →    E<exponent part>
<rest unsigned number>    →    null
<decimal fraction>        →    digit<rest decimal fraction>
<rest decimal fraction>   →    null
<rest decimal fraction>   →    E<exponent part>
<rest decimal fraction>   →    digit<rest decimal fraction>
<exponent part>           →    sign<exponent integer>
<exponent part>           →    digit<rest exponent integer>
<exponent integer>        →    digit<rest exponent integer>
<rest exponent integer>   →    digit<rest exponent integer>
<rest exponent integer>   →    null

The term null is used here to denote a null character. It is easily seen that these productions conform to the definition of Chomsky Type 3. As stated previously, a language expressed in this form can be parsed by a finite state algorithm. The number of states required coincides with the number of syntactic classes needed to define the language. Table 6.1 gives a state table for a recogniser for the numbers.

State                        digit    .    E    + or -    null
1. <unsigned number>           2      3    5
2. <rest unsigned number>      2      3    5              EXIT
3. <decimal fraction>          4
4. <rest decimal fraction>     4           5              EXIT
5. <exponent part>             7                6
6. <exponent integer>          7
7. <rest exponent integer>     7                          EXIT

Table 6.1

The process is started in state 1, and the character appearing on the input stream defines the state into which the recogniser is to move. The null symbol can be assumed to include all other characters that could appear after the number. A number is recognised if one of the exit positions is reached. Non-recognition occurs if a blank entry is encountered. For example recognition of 1.45null would proceed as follows:

STATE    INPUT CHARACTER    NEW STATE
  1            1                2
  2            .                3
  3            4                4
  4            5                4
  4           null             EXIT
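Table 6.1 can be transcribed directly into a small program. The following Python sketch uses a dictionary per state, with blank (error) entries simply absent and the null symbol standing for the end of the string:

```python
# Finite state recogniser for numbers, transcribed from Table 6.1.
# The key 'd' stands for the digit column and None for the null column;
# absent keys are the blank (error) entries of the table.
import string

TABLE = {
    1: {'d': 2, '.': 3, 'E': 5},
    2: {'d': 2, '.': 3, 'E': 5, None: 'EXIT'},
    3: {'d': 4},
    4: {'d': 4, 'E': 5, None: 'EXIT'},
    5: {'d': 7, '+': 6, '-': 6},
    6: {'d': 7},
    7: {'d': 7, None: 'EXIT'},
}

def recognise(s):
    state = 1
    for c in list(s) + [None]:          # append the null symbol
        key = 'd' if c is not None and c in string.digits else c
        state = TABLE[state].get(key, 0)
        if state == 'EXIT':
            return True                 # an exit position is reached
        if state == 0:
            return False                # a blank entry is encountered
    return False
```

Running this on 1.45 follows exactly the trace shown above: states 1, 2, 3, 4, 4, then EXIT on null.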

If each entry in the state table also contains a pointer to a small routine to be executed, then the calculation of the numerical value of the constant proceeds in parallel with the recognition. Table 6.2 gives the form of the augmented state table where the small routines are as follows:

 
   V: A = 10A + d
   W: n = n + 1
      A = 10A + d
   X: E = 10E + d
   Y: A = 1
   Z: e = sign
On entry A = n = E = 0; e = +
State                        digit = d    .    E     + or -    null
1. <unsigned number>           2,V        3    5,Y
2. <rest unsigned number>      2,V        3    5               EXIT
3. <decimal fraction>          4,W
4. <rest decimal fraction>     4,W             5               EXIT
5. <exponent part>             7,X                   6,Z
6. <exponent integer>          7,X
7. <rest exponent integer>     7,X                             EXIT

On exit, Number = A × 10^(eE-n)

Table 6.2

In these routines d stands for the digit encountered. The parameters are initialised to:

A = 0 value of mantissa ignoring decimal point 
n = 0 number of decimal places 
E = 0 exponent 
e = + exponent sign

The procedure will exit with the value of the constant being A × 10^(eE-n).
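The augmented table can likewise be transcribed into code. The following Python sketch runs the small routines V, W, X, Y and Z as the transitions of Table 6.2 are taken, so the numerical value is ready on exit:

```python
# Recognition and value calculation in parallel, from Table 6.2.
# Each entry is (new state, action); 'd' is the digit column and None
# the null column.  On entry A = n = E = 0 and e = +1.
import string

def convert_number(s):
    A = n = E = 0
    e = +1
    state = 1
    table = {
        1: {'d': (2, 'V'), '.': (3, None), 'E': (5, 'Y')},
        2: {'d': (2, 'V'), '.': (3, None), 'E': (5, None), None: 'EXIT'},
        3: {'d': (4, 'W')},
        4: {'d': (4, 'W'), 'E': (5, None), None: 'EXIT'},
        5: {'d': (7, 'X'), '+': (6, 'Z'), '-': (6, 'Z')},
        6: {'d': (7, 'X')},
        7: {'d': (7, 'X'), None: 'EXIT'},
    }
    for c in list(s) + [None]:
        key = 'd' if c is not None and c in string.digits else c
        entry = table[state].get(key)
        if entry == 'EXIT':
            return A * 10.0 ** (e * E - n)   # Number = A x 10^(eE-n)
        if entry is None:
            raise ValueError('not a number: %r' % s)
        state, action = entry
        d = int(c) if key == 'd' else 0
        if action == 'V':
            A = 10 * A + d
        elif action == 'W':                  # count decimal places too
            n += 1
            A = 10 * A + d
        elif action == 'X':
            E = 10 * E + d
        elif action == 'Y':                  # exponent with no mantissa
            A = 1
        elif action == 'Z':
            e = +1 if c == '+' else -1
    raise ValueError('not a number: %r' % s)
```

For 1.45 this gives A = 145, n = 2 and hence 145 × 10^-2 = 1.45.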

Rewriting Chomsky Type 3 subsets of the language as finite state tables is straightforward but may be wasteful of space if the number of empty (error) positions is high. These empty positions can, however, be put to good use if different diagnostics are required at each one: every empty position corresponds to an explicit error in the format of the number, so good diagnostics can be obtained. For example the second entry in state 3 could give the diagnostic:

 TWO CONSECUTIVE DECIMAL POINTS APPEAR IN NUMBER

As precise diagnostics can be given, it is also often possible to define the correct recovery action.

A different notation for defining finite state recognisers has been called the syntax machine by Glennie (1960). The equivalent recogniser written as a syntax machine is given in Fig. 6.1. Unless otherwise denoted by arrows, flow of control is left to right or downwards. Comparators are enclosed in circles. The true exit is horizontal and moves the input on one character. The false exit is downwards and does not alter the input.

[Diagram not reproduced: the syntax machine recogniser for numbers, built from comparators d, ., E, + and - with Exit and Error outcomes]
Fig. 6.1

The virtue of the syntax machine is that it is useful for defining recognisers for strings that do not fit easily into any class. In particular, recognisers can be defined which are more efficient as far as storage is concerned. Fig. 6.2 gives a modified syntax machine for recognising numbers. False exits not marked are error exits. This machine can then be translated into a set of instructions for doing the testing, or alternatively written in tabular form (see Table 6.3). The machine starts in state 1 and exits true or false to the state defined, depending on whether recognition takes place with the character defined for that state. Before the next state is examined the relevant action is taken. The small routines and initialisation coincide exactly with those for the finite state table.

[Diagram not reproduced: the modified syntax machine for recognising numbers, using fewer comparators]
Fig. 6.2
State    Character    Next State    Action     Next State    Action
                       if True      if True     if False     if False
  1          d             2           V            3
  2          d             2           V            3
  3          .             4                        6            Y
  4          d             5           W          Error
  5          d             5           W            6
  6          E             7                      Exit
  7       + or -           8           Z            8
  8          d             9           X          Error
  9          d             9           X           Exit

On exit, Number = A × 10^(eE-n)

Table 6.3

6.5 A lexical analyser for Fortran

The discussion of Algol has shown the role of the lexical analyser in a language with a precise syntactic structure. The main concern has been the conversion of Algol composite symbols into single characters in the S-language. Fortran is normally punched on cards, and the character set is usually restricted to the 48 characters in the original card code set. Consequently some characters are overworked, with the same character being used in many different syntactic positions. If possible the lexical analyser should convert the different uses of the same character into distinct characters in the S-language. Once it has been discovered that a basic Fortran statement is not an assignment statement, it is possible to recognise any Fortran statement by its initial characters. The reason for the proviso that the assignment statement must first be recognised is that the following are all legal assignment statements:

      DO 10 I = 1 
      IF(I) = 3 
      CALL AB (I,J) = 4

A simple way to recognise an assignment statement is that it has no zero-level commas on the right hand side of a zero-level = sign. This information can be obtained by the conversion table changing the pointer on encountering a zero level = to a part of the table which counts zero-level commas. This process also defines the DO statement as this is the only statement with = and zero-level commas. The remaining statements can now be recognised by comparing the particular statement with a list of the first few characters of the possible Fortran statements. As two of the most frequently found statements have already been recognised, a fairly simple look-up device is probably adequate. The S-form of the statement will probably consist of a character differentiating between the different types of statements followed by the characters of the statement.
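The zero-level test might be sketched as follows (Python; Hollerith constants and statement labels are ignored for simplicity, and only the three classes needed here are distinguished):

```python
# Sketch of the zero-level test: a statement is an assignment if it
# contains a bracket-level-zero "=" with no level-zero commas to its
# right; a statement with both "=" and level-zero commas is a DO.
def classify(stmt):
    s = stmt.replace(' ', '')            # Fortran ignores blanks
    depth = 0
    seen_eq = commas_after_eq = 0
    for c in s:
        if c == '(':
            depth += 1
        elif c == ')':
            depth -= 1
        elif depth == 0:
            if c == '=':
                seen_eq = 1
            elif c == ',' and seen_eq:
                commas_after_eq += 1
    if seen_eq and not commas_after_eq:
        return 'ASSIGNMENT'
    if seen_eq and s.startswith('DO'):
        return 'DO'
    return 'OTHER'
```

Note that DO 10 I = 1, with its blanks removed, is correctly classified as an assignment to the variable DO10I, while DO 10 I = 1, 5 is a DO statement.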

It would be desirable to convert the ambiguous symbols to unique characters in the S-language but this is not always easy. For example the output statement:

   PRINT 10,A, (B(I,J), C, J = 1,3), D, (E,F)

has the , used as a separator for list elements, lists, array subscripts, and DO parameters to mention just a few. This means that either the parser has to be carefully designed to differentiate between these different uses or the lexical analyser can do the conversion. The main objection to recognition by the lexical analyser is that arithmetic expressions can replace identifiers in several positions. This means that a bracket count must be kept to locate the end of an arithmetic expression. In a language like Fortran this may well be desirable as otherwise a more complex parser may be required at the syntax analysis phase. For example, Glennie's syntax machine with a subroutine for recognising arithmetic expressions could be used for this purpose.

© Chilton Computing and UKRI Science and Technology Facilities Council webmaster@chilton-computing.org.uk