STATISTICIANS have long dreamt of being able to perform, quickly and easily, any analysis of data that occurs to them, whether this be an accepted analysis or a new analysis suggested by the data during the course of analysis. Until a few years ago statisticians were restricted to those analyses that could be performed using the strictly limited resources of a desk calculating machine, and an imaginative approach to data analysis was not possible. The increase in computing power during the last few years should make possible a completely new approach to data analysis. So far this has not happened. The needs of the statistician are very varied, and none of the programs written so far possesses the considerable flexibility that is necessary to satisfy these needs. Within the last two years a number of statistical systems capable of performing a number of standard analyses have been written. These permit certain sequences of analyses to be performed without re-presentation of the data, and they have the advantage that the data is prepared in the same way for all analyses. Although these systems are making the work of the data analyst easier and are encouraging the more thorough analysis of data, they are found wanting in three respects: in ease of use, both for the statistician and for the experimentalist, who should also be encouraged to look more closely at his data; in the number and variety of analyses that may be performed; and in the ability to build up new analyses as sequences of instructions known to the system. For example, a number of systems can perform a regression analysis, but none of them can reference the coefficients of the fitted function in a later analysis. ASCOP is a large system which in its first version suffered from the same faults as all the other systems, although it was easier to use than most. The second version, currently being debugged, is a major revision and attempts to allow much greater flexibility to the user.
An ASCOP program consists of a sequence of instructions which may be divided into two main types. Instructions of the first type are equations specifying arithmetic operations on variables, parameters (single values), and coefficients. New variables, parameters, and coefficients may be created and referred to in subsequent instructions of either type. Instructions of the second type are English-like sentences or phrases specifying particular analyses, or making declarations to the system. Instructions of this type may similarly define new variables, parameters, and coefficients which may be referred to later in instructions of either type. Both types of instruction may be labelled, and branching statements making reference to these labels are allowed. This enables the user to specify that the performance of some analyses and arithmetic operations is conditional on the satisfaction of a particular criterion or criteria. A number of data editing operations are available, including the amalgamation of several sets of data, the selective inclusion of points in a new set of data, and the inclusion of certain parameters, defined in an analysis, as a point in a new set of data. It is also possible to define subroutines made up of ASCOP instructions and equations and to call these many times over. Their definition and call are very similar to those in FORTRAN. When a disc becomes available on ATLAS, it will be possible to have a set of standard subroutines stored on the disc and hence available on call to ASCOP users. It will also be possible for users to add to the standard set or, of course, to their own private set. Instructions are available in ASCOP to allow the user to specify that certain sets of data, including data derived during analysis, be written onto a private output tape in a form that can be presented again to ASCOP at a later time.
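The idea that an analysis instruction may define named coefficients which a later instruction then refers to can be illustrated with a minimal Python sketch. This is not ASCOP itself: the `workspace` dictionary, the function names `regress` and `linear_function`, and the restriction to a single predictor are all assumptions made for illustration.

```python
# Hypothetical store for results that an instruction has named.
workspace = {}

def regress(y, x, name):
    """Fit y = intercept + slope * x by least squares and store the
    coefficients in the workspace under the given name."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    workspace[name] = (intercept, slope)

def linear_function(x, name):
    """Define a new variable as a linear function of x, using
    coefficients named by an earlier instruction."""
    intercept, slope = workspace[name]
    return [intercept + slope * value for value in x]

# A first "instruction" fits the regression and names its coefficients.
regress(y=[3.0, 5.0, 7.0, 9.0], x=[1.0, 2.0, 3.0, 4.0], name="ACOF")

# A later "instruction" refers to those coefficients by name.
fitted_new = linear_function([5.0, 6.0], "ACOF")
```

The point of the sketch is only that the coefficients outlive the analysis that produced them and can be picked up by name in any later step.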
The basic organisational unit of data in ASCOP is the data matrix. The rows of a data matrix are referred to as POINTS and the columns as VARIABLES. Each variable may have more than one column in the data matrix and the number of columns for a variable is referred to as its replication. If variable A is replicated twice there will be two values of A in each POINT or in each row of the matrix. Thus a certain completeness in the data is implied, but in fact missing values are allowed for the incomplete situation. The fact that variables may be replicated introduces the possibility of references to point means, variances, standard deviations and numbers of replicates. Such reference is allowed in arithmetic operations and in analyses. Reference is allowed in arithmetic operations to a label associated with each point. The label may be read with the data or generated as the data is read.
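A data matrix with replicated variables, and the per-point summaries it makes possible, might be pictured as in the following Python sketch. The dictionary layout, the use of `None` for a missing value, the function name `point_summary`, and the choice of the sample variance (divisor n - 1) are assumptions for illustration; the paper does not specify ASCOP's internal representation or its variance divisor.

```python
from statistics import mean, variance

# Hypothetical layout: one dictionary per POINT, holding a label and,
# for each VARIABLE, a list with one value per replicate.
# None marks a missing value.
data_matrix = [
    {"label": "P1", "A": [1.0, 2.0, 3.0, 4.0], "B": [10.0]},
    {"label": "P2", "A": [2.0, None, 4.0, 6.0], "B": [12.0]},
]

def point_summary(point, name):
    """Return the number of replicates present, the point mean, the
    variance and the standard deviation of one variable at one point."""
    values = [v for v in point[name] if v is not None]
    n = len(values)
    m = mean(values) if n > 0 else None
    # Assumption: sample variance, defined only for two or more values.
    var = variance(values) if n > 1 else None
    sd = var ** 0.5 if var is not None else None
    return {"n": n, "mean": m, "variance": var, "sd": sd}

summary_p1 = point_summary(data_matrix[0], "A")
summary_p2 = point_summary(data_matrix[1], "A")
```

In this picture variable A is replicated four times and B once, and a missing replicate simply reduces the number of replicates counted at that point.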
Data matrices are, most commonly, read from cards but they may also be generated from other data matrices using edit operations, or generated using the random variable generation functions available as parts of the arithmetic operations. Arithmetic operations may be used to define new variables in the reading stage or in the editing stage and the inclusion of points in the data matrix may be made conditional on the values of the variables involved. Thus matrices may be formed containing those points that show specified properties. Data to be analysed in several different arrangements need be presented to the system only once and the reorganisation achieved using edit operations.
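The combination of an arithmetic operation with conditional inclusion of points during editing can be sketched in Python as follows. The names `edit_matrix`, `take_log_of_a` and `b_not_greater_than_c` are hypothetical, the natural logarithm is assumed because the paper does not state the base ASCOP uses, and the inclusion rule is an invented example.

```python
import math

# Hypothetical source data matrix: one dictionary per point.
source = [
    {"label": "P1", "A": 10.0, "B": 3.0, "C": 5.0},
    {"label": "P2", "A": 100.0, "B": 7.0, "C": 2.0},
    {"label": "P3", "A": 1000.0, "B": 4.0, "C": 4.0},
]

def edit_matrix(points, transform, keep):
    """Build a new data matrix: apply an arithmetic operation to a copy
    of each point, then include the point only if the condition holds."""
    result = []
    for point in points:
        new_point = transform(dict(point))  # copy, so the source is unchanged
        if keep(new_point):
            result.append(new_point)
    return result

def take_log_of_a(point):
    # Assumption: natural logarithm.
    point["A"] = math.log(point["A"])
    return point

def b_not_greater_than_c(point):
    # Illustrative rule: keep points where B - C is negative or zero.
    return point["B"] - point["C"] <= 0

edited = edit_matrix(source, take_log_of_a, b_not_greater_than_c)
```

Because the source matrix is left untouched, the same data can be rearranged into several different matrices without being presented again.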
ASCOP analysis instructions are made up of units of information each introduced by a particular word. A unit of information may be a list of numbers and words, or a single word the presence of which has meaning. One particular unit of information defines the type of analysis to be performed and must appear first in the instruction. Other units may appear in any order but some orders will read more naturally from an English point of view than others. Analyses that are currently included in the ASCOP system are very briefly described below:
READ DATA MATRIX BEC 2 VARIABLE NAMES A B C D POINTS 84 LABEL IN POSITION 3 IGNORE ITEMS 1 AND 2 REPLICATES 4 1 1 1
OUTPUT DETAILED SUMMARIES FOR ALL VARIABLES EXCEPT A
REGRESSION OF A ON VARIABLES B C AND D
REGRESSION OF A ON BEST 2 VARIABLES
COMPONENTS ANALYSIS USING ALL VARIABLES EXCEPT A
FACTOR ANALYSIS WITH 3 FACTORS AND USING VARIABLES B AND C AND D
DIMENSIONS 2 DOSES 6 EXPERIMENTS 5 TREATMENTS
ANOVA OF VARIABLE A FOR EXPERIMENTS 4 AND 5 AND OMITTING TREATMENT 5
DIALLEL TABLE ANALYSIS OF VARIABLES A AND B PARENT1 AND BLOCKS 4
START DATA MATRIX BEC 6 VARIABLE NAMES A B C D
ADD 24.5 39.47 84 AND PA TO DATA MATRIX BEC 6
ADD POINTS FROM STREAM 4 TO DATA MATRIX BEC 6
ADD POINTS FROM DATA MATRIX BEC 4 TO DATA MATRIX BEC 6
A = LOG(A)
IF (B-C) CONTINUE, CONTINUE, OMIT, ERROR
COMPLETE AND SAVE DATA MATRIX BEC 6
DISCRIMINATE BETWEEN DATA MATRICES BEC 4 AND 5 USING A B C AND D
NAME FITTED VALUE AFIT RESIDUALS ARES AND COEFFICIENTS ACOF
NAME COMPONENTS CA CB AND COEFFICIENTS COFA AND COFB
NAME FACTORS FA FB AND COEFFICIENTS COFA AND COFB
NAME ARES = DATA - DOSES - DOSES - TREATMENTS
NAME COEFFICIENTS DISC AND DECISION VALUE DEC
DEFINE VA AS LINEAR FUNCTION OF A B C AND D USING COEFFICIENTS COFA
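The structure of an analysis instruction, a series of units of information each introduced by a particular word with the unit naming the analysis coming first, can be sketched in Python. The set of introducing words below is a guess made for this one READ instruction and is not ASCOP's actual vocabulary; the function name `split_units` is likewise hypothetical.

```python
# Assumed introducing words for the READ example only; the real ASCOP
# vocabulary is not given in the paper.
INTRODUCING_WORDS = {"READ", "VARIABLE", "POINTS", "LABEL", "IGNORE", "REPLICATES"}

def split_units(instruction, introducing_words=INTRODUCING_WORDS):
    """Split an instruction into (introducing word, items) pairs.

    Every introducing word starts a new unit; all following words and
    numbers belong to that unit until the next introducing word.
    """
    units = []
    for token in instruction.split():
        if token in introducing_words:
            units.append((token, []))
        elif units:
            units[-1][1].append(token)
        else:
            raise ValueError(
                f"instruction must start with an introducing word, got {token!r}"
            )
    return units

units = split_units(
    "READ DATA MATRIX BEC 2 VARIABLE NAMES A B C D POINTS 84 "
    "LABEL IN POSITION 3 IGNORE ITEMS 1 AND 2 REPLICATES 4 1 1 1"
)
analysis_type = units[0][0]  # the first unit identifies the operation
```

Since each unit is recognised by its introducing word rather than by its position, the units after the first could be supplied in any order and still be split the same way.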
At present ASCOP works in the batch processing mode, taking instructions sequentially without intervention from the user. ASCOP has been written with a view to the interactive mode allowed by remote consoles. The English nature of the instructions is particularly important in this form of operation. The presence of bulk storage provided by discs and the availability of remote consoles will enable the statistician to have his data available for analysis immediately he wishes to study it. He may ask for analyses simply by typing sentences of the type illustrated above, he will be available to answer questions put to him by a later version of ASCOP, and he will be able to request further information as the need for it becomes apparent during the course of analysis. He will be able to try several analyses until he has built up an adequate picture of his data. This total process will not be performed at one session; the statistician will be able to store his results at each stage while he thinks, perhaps for a few days, about his problem. He will be able to try sequences of regression analyses, for example, when searching for an adequate description of a particular variable. He will be able to carry out a similar process using discriminant analysis, searching for an adequate discriminant function involving as few variables as possible. Having found such a function, he will be able to use it to divide further observations into the different populations. He will be able to determine the advantages of the transformation of variables provided by a components analysis before deciding the next stage of setting up the first few components. At present this must be done either in two runs or by deciding before the results of the analysis are seen. The potential of ASCOP, even in its present form, operating in the interactive mode is extremely exciting. The next developments of ASCOP will be to make such use as convenient as possible in anticipation of the availability of the necessary equipment.
The addition of further operations is not difficult, and the addition of quantal response, canonical correlation, polynomial regression and regression-within-groups analyses is already planned. Additional smaller operations, such as a PRINT statement allowing the user to arrange additional output on another stream, are also being designed. ASCOP will continue to develop towards what is needed to make the statistician's dream come true.
I would like to acknowledge the machine time allowed to me by Bell Telephone Laboratories Inc., and the Health Sciences Computing Facility at UCLA for the debugging phase of the first version of ASCOP. The analysis of variance chapter has been written by Mr. T. Gover, the factor analysis chapter by Mr. P. Charlton and the discriminant analysis chapter by Miss S. Williams, all of this laboratory.