Jump Over Left Menu
Chapter 2: Data Structures
Before describing the various parts which comprise a compiler, it is necessary to consider the abstract data structures which will be used by the compiler and also how these abstract structures are implemented using particular storage structures. The compilation process can be thought of as taking some input data structure and transforming it to produce an output data structure which in some sense is equivalent to but more desirable than the input form. The input data is often called the source program and the output the object code. One of the simplest forms the compilation process can take is to have the input data structure coming from a peripheral device (such as a card reader), and the compiler producing as output a machine code program in the store of the computer ready for execution. In most compilers, before the output form is arrived at, the input data structure may well be transformed into several internal data structures. The compiler itself may require to store additional information or alternatively be driven by information stored internally. Finally, the data structures defined in the source program in the specific programming language must be mapped into some storage structure on the background computer that is to execute the program. The problems which need to be resolved when designing a compiler are ones such as finding the most transparent abstract data structure to use at each stage and also the most efficient internal storage structure into which to map the abstract data structure.
2.2 Abstract data structures
A data structure is defined as a set of rules and constraints which show the relationships that exist between individual pieces of data. The structure says nothing about the individual pieces of data which may occur. It may require them to hold the structure in some sense, but any information contained in the data items is independent of the structure. The term item used here will denote a piece of an abstract data structure. The item itself may be another data structure so that a hierarchical set of data structures is built up.
A string is an ordered set of items. The string may either be of variable length or fixed. Each item only has knowledge of its neighbours so that accessing a particular item must be done by a sequential search starting from one of the ends of a string. Examples of operations defined on a string are:
- concatenation of two strings
- comparison of two strings item per item
- breaking strings into several parts.
An array A is a set of items which are so arranged that an ordered set of integers uniquely defines the position of each item of the array and provides a method of accessing each item directly. Using the notation of Algol, an item of the array A could be written A[j1,j2,j3, . . . , jn]. If the ordered set of integers has length >n, then the array is called n-dimensional. The individual integers in the ordered set are called subscripts. Each subscript may have its range defined in any way, and it could consist of several disjoint sub-ranges.
A particular subset of arrays, which is used in both compilers and programming languages, is the set of rectangular arrays. A rectangular array has the range of each of its subscripts constant and contiguous. For example, the Algol declaration
array A[1:10, 0:5, 3:7];
defines a three-dimensional array with its subscripts having ranges 10, 6, and 5, respectively. A typical element would be A[3,4,5].
2.2.3 Queues and stacks
Queues and stacks are dynamically changing data structures. At any time, the queue or stack contains an ordered set of items. In the case of a queue, if an item is added, it is placed at the end of the ordered set. Items can only be accessed or removed from the front of the ordered set defining the queue. The first item added to the queue is the only one accessible and must be the first removed. For a stack, items are similarly added to the end of the ordered set. The difference is that the accessing or removal of items from the ordered set is from the end of the set. The last item added is the only one accessible and is the first to be removed. The items may well be of variable length and contain the length defined within the item.
A table consists of a set of items, and associated with each item is a unique name, or key. In general, each item will consist of its key, together with some information associated with the key. An item can be added to the table by presenting its key together with the associated information. Items are accessed by presenting the key to the table.
A tree is a structure consisting of a set of nodes. Each node (or item) has the property that, apart from information that it may carry, it also contains pointers to lower-level nodes. At the lowest level of the tree, the nodes point to leaves which consist of data structures external to the tree. The idea of level comes from the fact that each tree must have one top-most node which has no pointers to it from other nodes (this is often called the root of the tree) and also no node can point to a node previously defined. This latter condition ensures that each node has a unique path to it from the root. Nodes on level 1 are those pointed at by the root; nodes on level 2 are pointed at from nodes on level 1, and so on.
Fig. 2.1 diagrammatically shows a binary tree (each node points to exactly two lower nodes) representing A*B + C*D. The letters A, B, C, D denote the leaves and the ringed quantities represent the information associated with the node. It is conventional to represent the tree with the root at the top and leaves at the bottom unlike most natural trees. The * is used to represent multiplication.
2.2.6 Directed graphs
A directed graph is similar to a tree except that the nodes may point to nodes higher up the structure. Therefore, a node may have more than one node pointing to it. As it is now no longer obvious which direction the line joining two nodes is pointing; it is usual to mark at least the upward pointing lines in a diagram. The relaxation of the rules for trees now means that it is possible to have a path from a node back to itself. This is commonly called a loop. A particular sub-class of directed graphs is the set of loop-free directed graphs. This does not coincide with the definition of a tree, however, as it is still allowable for two nodes to point to the same node which is not allowed in the case of trees. For example, the Algol compound statement:
begin a:= b:= c:=2; first: if a > b then z:=4 else c:=1; if z > 3 then goto next; b:=1; goto first; next: z:=1 ; end
could be represented by the directed graph given in Fig. 2.2.
2.3 Internal storage structures
Most computers tend to have a memory consisting of a set of ordered words, and this is the only available storage structure that the computer itself has. In general, by the use of programs and subroutines, higher level storage structures are superimposed upon the basic structure of the computer, and to the user of the subroutines, the computer itself appears to have this structure. A fairly well defined set of such internal storage structures have been used and, although they are usually designed so as to be equivalent to an abstract storage structure or at least a structure into which the abstract structure can be easily mapped, they still tend to be influenced by the basic structure of the computer memory.
Each internal storage structure will be built up from indivisible (as far as the structure is concerned) pieces of storage called elements. These are closely equivalent to the items of the abstract structure. The element may have several fields some of which could be used in defining the structure. An element could consist of a single bit of a computer word but is more likely to be at least an address field in width.
A vector in most computers is the name given to the basic storage structure of the computer. A vector is a set of elements which are physically adjacent in the computer store. A vector is defined by knowing its base address, its element size, and its length (all elements will be assumed to be the same size). Individual elements can be accessed immediately using the base of the vector and the position of the element in the vector. A vector closely corresponds to the abstract data structure, the one-dimensional array, and almost invariably one-dimensional arrays are mapped into vectors. It will be convenient, therefore, to represent a vector of length N with name A by the Algol declaration:
and individual elements will be called A[i] for i = 1 to N.
A list is a set of elements where each element consists of two fields. The second field contains a pointer to the next element in the list. The first field contains a pointer to the information defined by this element. This could be another list or some external data structure. An external data structure will be called an atom, and will be assumed to be indivisible as far as operations on the list are concerned. For example, a string of characters ABC could be represented pictorially as a list by Fig. 2.3.
A list as a whole is accessed by a pointer P, to the first element of the list. The last element of the list either contains a special entry written as O to denote a null pointer in the second field or alternatively contains a pointer to the initial element of the list, in which case the list is called circular.
An actual implementation will have the list element consisting of sufficient bits to define two computer addresses. On some computers this can be done by using one computer word; on others two words are required. It is important to note that there is no necessity for neighbouring elements to be consecutive in the computer store, and, in fact, they may well be scattered through the computer store. A full description of the operations usually associated with lists can be found in the companion Monograph by J. M. Foster entitled List Processing.
A list is the simplest form in which the sequential storage of a computer can be used in a more flexible manner.
Although the lists defined above are capable of representing the more complex abstract data structures such as trees and directed graphs, quite often they will be inefficient and also opaque. For example, the tree in Fig. 2.1 might well be represented as a list by the structure shown in Fig. 2.4.
A more efficient and transparent storage structure in this case would be a plex. Storage systems similar to the plex have been in wide use for a number of years; a precise definition and theory of plexes and plex programming is due to D. T. Ross of the Massachusetts Institute of Technology. A plex (derived from plexus meaning any complex structure containing an intricate network of interrelated parts) consists of a set of elements called beads, where each element is an N-word vector of computer storage. This N-word block is broken down into a set of fields containing information or alternatively pointers to other beads. With each type of plex is assumed a format which defines what the individual fields represent. The format is assumed to be external to the set of beads defining the plex. This can be thought of, therefore, as an extension of the two-word list element to an n-word element where fields may contain information as well as pointers. The plex processing system of D. T. Ross closely associates with each plex a set of algorithms which refer to and change the information contained in the plex. In most conventionally produced compilers this association is not usually very precise, although the plex as a storage structure is frequently used. The arithmetic expression shown as a list structure in Fig. 2.4 is shown as a plex in Fig. 2.5. In this example two types of beads are defined; the four word bead for the nodes where the format is denoted by the field N, and the two-word bead where T denotes a leaf. The plex is a fairly natural structure for representing trees and directed graphs in a computer.