# Jump Over Left Menu

### Chapter 2: Data Structures

### 2.1 Introduction

Before describing the various parts which comprise a compiler,
it is necessary to consider the *abstract data structures* which
will be used by the compiler and also how these abstract structures are
implemented using particular storage structures. The compilation process
can be thought of as taking some input data structure and transforming it
to produce an output data structure which in some sense is equivalent to
but more desirable than the input form. The input data is often called the
*source program* and the output the *object code*. One of the simplest forms
the compilation process can take is to have the input data structure
coming from a peripheral device (such as a card reader), and the compiler
producing as output a machine code program in the store of the computer
ready for execution. In most compilers, before the output form is
arrived at, the input data structure may well be transformed into
several internal data structures.
The compiler itself may require to store additional information or
alternatively be *driven* by information stored internally. Finally, the data
structures defined in the source program in the specific programming language
must be mapped into some storage structure on the background computer that is
to execute the program. The problems which need to be resolved when designing
a compiler are ones such as finding the most transparent abstract data structure
to use at each stage and also the most efficient internal storage structure
into which to map the abstract data structure.

### 2.2 Abstract data structures

A data structure is defined as a set of rules and constraints which show
the relationships that exist between individual pieces of data.
The structure says nothing about the individual pieces of data which
may occur. It may require them to *hold* the structure in some sense,
but any information contained in the data items is independent of the
structure. The term *item* used here will denote a piece of an abstract
data structure. The item itself may be another data structure so that a
hierarchical set of data structures is built up.

#### 2.2.1 String

A *string* is an ordered set of items. The string may either be
of variable length or fixed. Each item only has knowledge of its neighbours
so that accessing a particular item must be done by a sequential search
starting from one of the ends of a string. Examples of operations defined on
a string are:

- concatenation of two strings
- comparison of two strings item per item
- breaking strings into several parts.

#### 2.2.2 Array

An *array* A is a set of items which are so arranged that an ordered
set of integers uniquely defines the position of each item of the array and
provides a method of accessing each item directly. Using the notation of Algol,
an item of the array A could be written A[j1,j2,j3, . . . , jn]. If the ordered
set of integers has length *>n*, then the array is called *n-dimensional*.
The individual integers in the ordered set are called *subscripts*.
Each subscript may have its range defined in any way, and it could consist
of several disjoint sub-ranges.

A particular subset of arrays, which is used in both compilers and
programming languages, is the set of *rectangular arrays*. A rectangular
array has the range of each of its subscripts constant and contiguous.
For example, the Algol declaration

array A[1:10, 0:5, 3:7];

defines a three-dimensional array with its subscripts having ranges 10, 6, and 5, respectively. A typical element would be A[3,4,5].

#### 2.2.3 Queues and stacks

*Queues* and *stacks* are dynamically changing data structures.
At any time, the queue or stack contains an ordered set of items. In the
case of a queue, if an item is added, it is placed at the end of the
ordered set. Items can only be accessed or removed from the front of the
ordered set defining the queue. The first item added to the queue is the
only one accessible and must be the first removed. For a stack, items are
similarly added to the end of the ordered set. The difference is that the
accessing or removal of items from the ordered set is from the end of the
set. The last item added is the only one accessible and is the first to be
removed. The items may well be of variable length and contain the length
defined within the item.

#### 2.2.4 Tables

A *table* consists of a set of items, and associated with each
item is a
unique name, or *key*. In general, each item will consist of its key,
together with some information associated with the key. An item can be
added to the table by presenting its key together with the associated
information. Items are accessed by presenting the key to the table.

#### 2.2.5 Trees

A tree is a structure consisting of a set of *nodes*. Each
node (or item) has the property that, apart from information that it may
carry, it also contains pointers to lower-level nodes. At the lowest level
of the tree, the nodes point to leaves which consist of data structures
external to the tree. The idea of level comes from the fact that each tree
must have one top-most node which has no pointers to it from other
nodes (this is often called the *root* of the tree) and also no node can
point to a node previously defined. This latter condition ensures that
each node has a unique path to it from the root. Nodes on level 1 are
those pointed at by the root; nodes on level 2 are pointed at from nodes on
level 1, and so on.

Fig. 2.1 diagrammatically shows a *binary* tree
(each node points to exactly two lower nodes) representing *A*B + C*D*.
The letters A, B, C, D denote the leaves and the ringed quantities represent
the information associated with the node. It is conventional to represent
the tree with the root at the top and leaves at the bottom
unlike most natural trees. The *** is used to represent multiplication.

##### Fig 2.1

#### 2.2.6 Directed graphs

A *directed graph* is similar to a tree except that the nodes
may point to nodes higher up the structure. Therefore, a node may have
more than one node pointing to it. As it is now no longer obvious which
direction the line joining two nodes is pointing; it is usual to mark at
least the upward pointing lines in a diagram. The relaxation of the rules
for trees now means that it is possible to have a path from a node back
to itself. This is commonly called a *loop*. A particular sub-class of
directed graphs is the set of *loop-free* directed graphs.
This does not coincide with the definition of a tree, however, as it is
still allowable for two nodes to point to the same node which is not
allowed in the case of trees. For example, the Algol compound statement:

begin a:= b:= c:=2; first: if a > b then z:=4 else c:=1; if z > 3 then goto next; b:=1; goto first; next: z:=1 ; end

could be represented by the directed graph given in Fig. 2.2.

##### Fig 2.2

### 2.3 Internal storage structures

Most computers tend to have a memory consisting of a set of ordered words,
and this is the only available storage structure that the computer itself has.
In general, by the use of programs and subroutines, higher level storage
structures are superimposed upon the basic structure of the computer, and
to the user of the subroutines,
the computer itself appears to have this structure. A fairly well defined
set of such *internal storage structures* have been used and, although they are
usually designed so as to be equivalent to an abstract storage structure or
at least a structure into which the abstract structure can be easily mapped,
they still tend to be influenced by the basic structure of the computer memory.

Each internal storage structure will be built up from indivisible (as far
as the structure is concerned) pieces of storage called *elements*.
These are closely equivalent to the items of the abstract structure.
The element may have several fields some of which could be used in defining
the structure. An element could consist of a single bit of a computer word
but is more likely to be at least an address field in width.

#### 2.3.1 Vectors

A *vector* in most computers is the name given to the basic
storage structure of the computer. A vector is a set of elements which are
physically adjacent in the computer store. A vector is defined by knowing its
*base address*, its *element size*, and its *length*
(all elements will be assumed
to be the same size). Individual elements can be accessed immediately using the
base of the vector and the position of the element in the vector.
A vector closely corresponds to the abstract data structure, the one-dimensional
array, and almost invariably one-dimensional arrays are mapped into vectors.
It will be convenient, therefore, to represent a vector of length *N*
with name *A* by the Algol declaration:

array A[1:N]

and individual elements will be called *A[i]* for i = 1 to N.

#### 2.3.2 Lists

A *list* is a set of elements where each element consists of two
fields.
The second field contains a pointer to the next element in the list.
The first field contains a pointer to the information defined by this element.
This could be another list or some external data structure.
An external data structure will be called an atom, and will be assumed to be
indivisible as far as operations on the list are concerned.
For example, a string of characters *ABC* could be represented
pictorially as a list by Fig. 2.3.

##### Fig 2.3

A list as a whole is accessed by a pointer *P*, to the first element
of the list. The last element of the list either contains a special entry
written as *O* to denote a null pointer in the second field or
alternatively contains a pointer to the initial element of the list,
in which case the list is called *circular*.

An actual implementation will have the list element consisting of
sufficient bits to define two computer addresses. On some computers
this can be done by using one computer word; on others two words are required.
It is important to note that there is no necessity for neighbouring elements
to be consecutive in the computer store, and, in fact, they may well be
scattered through the computer store. A full description of the operations
usually associated with lists can be found in the companion
Monograph by J. M. Foster entitled *List Processing*.

A list is the simplest form in which the sequential storage of a computer can be used in a more flexible manner.

#### 2.3.3 Plexes

Although the lists defined above are capable of representing the more complex abstract data structures such as trees and directed graphs, quite often they will be inefficient and also opaque. For example, the tree in Fig. 2.1 might well be represented as a list by the structure shown in Fig. 2.4.

##### Fig 2.4

A more efficient and transparent storage structure in this case would be a
*plex*. Storage systems similar to the plex have been in wide use for
a number of years; a precise definition and theory of plexes and plex
programming is due to D. T. Ross of the Massachusetts Institute of Technology.
A plex (derived from *plexus* meaning *any complex structure containing
an intricate network of interrelated parts*) consists of a set of
elements called *beads*, where each element is an N-word vector of
computer storage. This N-word block is broken down into a set of fields
containing information or alternatively pointers to other beads.
With each type of plex is assumed a format which defines what the
individual fields represent.
The format is assumed to be external to the set of beads defining the plex.
This can be thought of, therefore, as an extension of the two-word list
element to an *n-word* element where fields may contain information as
well as pointers. The plex processing system of D. T.
Ross closely associates with each plex a set of algorithms which refer to
and change the information contained in the plex. In most conventionally
produced compilers this association is not usually very precise, although the
plex as a storage structure is frequently used.
The arithmetic expression shown as a list structure in Fig. 2.4 is shown as a
plex in Fig. 2.5. In this example two types of beads are defined; the four word
bead for the nodes where the format is denoted by the field N, and the two-word
bead where T denotes a leaf. The plex is a fairly natural structure for
representing trees and directed graphs in a computer.