Compiling Techniques

Chapter 1: Introduction

1.1 Definition of a compiler

The name compiler has been given to a computer program which will accept as data a program in a problem-oriented language, such as Algol or Fortran, and produce as output a computer-oriented code which, after possibly some further processing by an assembler or loader, can be obeyed by a computer to produce results equivalent to those defined by the program in the problem-oriented language. This definition is rather vague, and has to be, as the word compiler is used for a rather wide range of computer programs with very different characteristics.

In this book, the problem-oriented language to be compiled will be assumed to be as machine-independent and as complex as a language like Fortran or Algol. Most examples will be drawn from one or other of these languages. There is a bias towards scientific languages of this type in the techniques to be described, although, of course, many of the techniques will be applicable in general. The output from the compiler will be defined in machine orders for a hypothetical computer similar to most of the single-accumulator computers in use today.

1.2 The structure of a compiler

A compiler can be broken down into several distinct sections, each of which performs a specific task. Each of these sections has its own input data and output, so that each can be treated as a sub-program. On small computers, where the size of program that can be fitted into the main store is severely limited, it is usual to split the compiler into these sub-programs and have one sub-program complete its task for the whole program before the next is called. If, however, the complete compiler can be fitted easily into the main store, then it is often the case that the sub-programs are entered cyclically, so that one part of the input program is almost completely compiled before the next part is examined. Although these two methods require completely different control structures, the techniques used are not very different. A sketch of the two organisations is given below.
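As a rough illustration only, the two control structures might be sketched as follows in Python. The phase names lex, parse and generate are invented stand-ins for the sections discussed in this chapter, and each phase is reduced to a triviality; this is not the book's own code.

    def lex(source):
        """Lexical analysis: split the source into symbols."""
        return source.split()

    def parse(tokens):
        """Syntax analysis: trivially wrap the symbols as one statement."""
        return [("statement", tokens)]

    def generate(tree):
        """Code generation: emit one pseudo-order per statement."""
        return ["OBEY " + " ".join(tokens) for _, tokens in tree]

    def compile_multipass(source):
        # Multi-pass organisation: each phase finishes its task for the
        # whole program before the next is called (suits a small main store).
        return generate(parse(lex(source)))

    def compile_onepass(lines):
        # Cyclic organisation: each statement is carried through all the
        # phases before the next part of the input program is examined.
        code = []
        for line in lines:
            code.extend(generate(parse(lex(line))))
        return code

    print(compile_multipass("a := b + c"))
    print(compile_onepass(["a := b", "c := d"]))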

The program in the problem-oriented language is often punched on paper tape or cards, using some external piece of equipment, and presented to the computer to be input by one of the readers on the computer. Alternatively, parts of the program may reside within the computer in previously defined files. In most systems there are several ways in which the program can be presented. Consequently, it is usual to have an initial section to the compiler whose purpose is to provide an interface between the outside world and the rest of the compiler. The aim is that the same program presented to the compiler in several different ways will be identical as far as the remainder of the compiler is concerned, once it has passed through this initial phase. Most programming languages allow a certain amount of redundancy in the preparation of the program. For example, spaces can often be inserted to aid readability, and new lines inserted as desired. These redundancies are also removed by this initial phase which is usually described as lexical analysis.
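A minimal sketch of this idea, again in Python and with an invented token set far smaller than that of any real Algol or Fortran system, shows how two different presentations of the same statement reduce to one canonical symbol stream as far as the rest of the compiler is concerned:

    import re

    # Hypothetical token set: integers, identifiers, ':=' and single
    # operator or bracket characters. Spaces and new lines are skipped.
    TOKEN = re.compile(r"\d+|[A-Za-z]+|:=|[+\-*/()=]")

    def lexical_analysis(text):
        """Return the canonical symbol stream for the source text."""
        return TOKEN.findall(text)

    # Two presentations of the same statement yield an identical stream.
    assert lexical_analysis("a := b+c") == lexical_analysis("a  :=\n  b + c")
    print(lexical_analysis("a := b + 12"))   # ['a', ':=', 'b', '+', '12']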

When lexical analysis has been completed it is necessary to recognise the individual statements that make up the program and, if a statement is complex, to break it down into its simpler parts. In general, programming languages are defined so that this syntax analysis can be done without knowing the meaning of the statement. This phase of the compiler will verify that the program is grammatically correct and the output will exhibit the structure of the program found by this analysis.
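To make the idea concrete, here is a small recursive-descent analysis of arithmetic expressions over a symbol stream such as the lexical phase might produce. The grammar (an expression is a sum of terms, a term a product of factors) is an assumption chosen for illustration, not the book's own:

    def parse_expression(tokens):
        tree, rest = parse_term(tokens)
        while rest and rest[0] in "+-":
            op, (right, rest) = rest[0], parse_term(rest[1:])
            tree = (op, tree, right)          # record the structure found
        return tree, rest

    def parse_term(tokens):
        tree, rest = parse_factor(tokens)
        while rest and rest[0] in "*/":
            op, (right, rest) = rest[0], parse_factor(rest[1:])
            tree = (op, tree, right)
        return tree, rest

    def parse_factor(tokens):
        if tokens[0] == "(":
            tree, rest = parse_expression(tokens[1:])
            return tree, rest[1:]             # skip the closing ')'
        return tokens[0], tokens[1:]          # identifier or constant

    # The output exhibits the grammatical structure of the statement:
    tree, _ = parse_expression(["a", "+", "b", "*", "c"])
    print(tree)                               # ('+', 'a', ('*', 'b', 'c'))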

Each statement of the program must now be examined and the equivalent computer code generated. For many statements this code generation will be straightforward. Certain statement types can always have a fixed sequence of code generated, with only a few fields changing according to the exact form of the statement in the program. The fixed sequence of code, which is stored away ready for use, is called a skeleton.
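A sketch of skeleton-based generation for a hypothetical single-accumulator machine is given below; the order names LOAD, ADD, STORE and JUMP are invented for illustration, as are the statement-type names:

    # Each statement type has a fixed sequence of orders stored ready for
    # use, with a few fields filled in from the particular statement.
    SKELETONS = {
        # v := a + b
        "add-assign": ["LOAD  {a}",
                       "ADD   {b}",
                       "STORE {v}"],
        # goto L
        "goto":       ["JUMP  {label}"],
    }

    def generate(statement_type, **fields):
        """Instantiate the stored skeleton with this statement's fields."""
        return [order.format(**fields) for order in SKELETONS[statement_type]]

    print(generate("add-assign", a="B", b="C", v="A"))
    # ['LOAD  B', 'ADD   C', 'STORE A']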

If it is important to use as little storage as possible or, alternatively, to make the program run as fast as possible, then it is necessary to optimise the simple code generation described above. More analysis will be required on individual statements and any interaction between neighbouring statements must be discovered. For scientific languages it is important that the code generation for arithmetic expressions is efficient as these statements will be a large proportion of the total code obeyed in any program.
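One simple optimisation of this general kind, sketched below on trees of the shape produced by the parsing example above, is the folding of constant sub-expressions at compile time, so that no orders need be generated for them. The choice of optimisation is illustrative only and is not put forward as the book's method:

    OPS = {"+": lambda x, y: x + y,
           "-": lambda x, y: x - y,
           "*": lambda x, y: x * y}

    def fold(tree):
        """Replace any operator node with constant operands by its value."""
        if not isinstance(tree, tuple):
            return tree                       # identifier or constant leaf
        op, left, right = tree
        left, right = fold(left), fold(right)
        if isinstance(left, int) and isinstance(right, int):
            return OPS[op](left, right)       # evaluated once, at compile time
        return (op, left, right)

    print(fold(("+", "a", ("*", 2, 3))))      # ('+', 'a', 6)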

1.3 A brief history

The first compilers for higher-level languages were produced in the period 1956 to 1958. The most important of these was the Fortran compiler for the IBM 704. The original compiler took eighteen man-years to write and, in some areas, used techniques as sophisticated as any in use today. As the only input medium was punched cards and the Fortran language made statement recognition simple, no special techniques arose for the lexical and syntax analysis parts of the compiler. However, the IBM 704 had only three index registers and, as one of the design aims was to produce code as efficient as that produced by hand, much effort was put into algorithms for index register allocation and optimisation of the code produced for arithmetic expressions. Two papers describing this work are Sheridan (1959) and Backus et al. (1957).

A large number of different algorithms for the syntactic and semantic analysis of arithmetic expressions were produced in the period before 1960. These included many methods of the precedence type together with more unusual algorithms: some parsed from right to left, while others continually changed direction as the statement was scanned. A history of these early methods is given by Randell and Russell (1964).

The 1960 Algol report was the first widely accepted language definition to have its syntax rigorously defined. This, together with the BNF notation (Naur et al., 1960) used in the definition, encouraged research into syntax analysis and related subjects. This research has continued until the present day, and syntax analysis is the one phase where the techniques used are well understood and the relative merits of different techniques can be estimated.

The acceptance of Algol as an international standard without any defined hardware representation has led to more effort being put into lexical analysis in recent years. In Great Britain, for example, there are over ten different external representations of Algol available, and work in this area has had to be done to aid program interchange.

Many of the early papers on compiling techniques have been brought together by Rosen (1967).
