ACLApplicationsApplied Maths :: Applied Mathematics and Statistics at Atlas

A Plan to Exploit the Rapid Preparation of Data for Analysis by Machine Methods

James E Hailstone, Donald W Hutchings, Judith M Bradley

1973

Biographical Notes:

JAMES E HAILSTONE, born 1924, England. B.Sc. (Econ) (London, 1950), FRSSS. Statistician with NPL 1950-52, Ministry of Supply 1952-55, AERE 1955-63. Atlas Computer Laboratory 1963 -, currently Head of the User Services Group. Research Associate, Oxford University Department of Educational Studies, 1972.

DONALD W HUTCHINGS, born 1923, England. B.Sc. (London 1949) AM (Oberlin 1953), MA (Oxon, 1961), Lecturer in Education, Oxford University Department of Educational Studies 1959, currently Director of Academic Motivation Unit. Author: Technology and the Sixth Form Boy, 1963, The Science Undergraduate, 1967, and Education for Industry, 1968.

JUDITH M BRADLEY, born 1945, England. BA (Sociology) Essex 1968, MA (Sociology) Essex 1969. Research Assistant, Oxford University 1970 -. Research student, Wolfson College, Oxford, 1973 -.

Introduction

Recent developments in, and the availability of, new methods of preparing basic social science data, in particular the use of machine-scorable tests in educational research, present challenging organisational problems. With the greatly increased computing power, both in hardware and software facilities, there is a need to reappraise methods and expectations in order to exploit these new facilities to the full. At present, far too much time is being spent on the supporting activities of research: for example, the checking out of data absorbs an unnecessarily large proportion of a research worker's time when he is attempting to obtain the benefits of using computer systems. The present discussion gives an account of our recent experience in this area of methodology and suggests a framework within which the relatively inexperienced computer user can plan his work in order to use the tool without becoming an expert toolmaker.

Our first experience of document reading by machine was in 1969 [1]. The initial optimism about the saving of time and effort proved ill-founded and the decision to use the machine in the end delayed the analysis by many months. However, it was felt that the delay was due to inexperience rather than intrinsic difficulty with the method itself. It is clear that great care is necessary in committing a project to the use of machine methods of preparing data for analysis. The project can stand or fall depending on the success and flexibility of the machine methods used: the data obtained by questionnaire has in the past been collected in such a form that several ways of presenting the data for analysis have been possible. It has always been open to the research worker to change to a different method of presentation if a more detailed analysis proves necessary and powerful computing facilities are needed. Introduction of machine readable information involves some inflexibility and the chance that human intervention may not be possible once the process has started.

In other words, the use of machine methods introduces equipment which may be completely outside the control of the person actually doing the research and the specification of what is wanted together with the details of what can be achieved needs careful matching. Our work in 1969 involved the preparation of a large amount of paper tape almost completely unreadable by most computer systems without special programming attention, and our assumption that the machine output would be in a generally acceptable form proved to be completely wrong. Therefore, rigorous attention to detail is essential at the planning stage, so that peculiarities of the equipment are recognised. This may seem obvious, but it has been our experience that it is vital to ensure at the outset that the particular machine employed presents the data in a usable format. It could well happen that the researcher is faced with a specification of input for a single program designed to process the basic data and to present it in an acceptable form, usually in characters on some kind of magnetic tape. Finally, the use of a standard package or program is to be recommended for the subsequent analysis and the choice of the particular package or program must be taken almost as early as the decision of what should go into the questionnaire. At least, preparation of the analysis should proceed parallel with other activities and not be left until after the data has been collected and undergone the initial processing.

A short account of our experience so far of machine scoring in an ongoing longitudinal study of secondary school pupils underlines these points and may also serve to draw attention to a number of other problems inherent in the method. We now feel that this second attempt to use the machine method was justified and is to be recommended, provided that certain organisational details are fully understood. A check-list of points to be observed is given at the end of the paper.

Aim and scope of the study

The investigation being carried out at the Oxford University Department of Educational Studies examines social factors affecting the academic motivation of secondary school pupils. It is a four year longitudinal study supported by a grant from the Social Science Research Council and is being undertaken in collaboration with the testing services of the National Foundation for Educational Research. The data analysis is being carried out at the Atlas Computer Laboratory, Chilton, and the machine scoring is provided by Document Reading Services.

Data is being collected from a series of questionnaires and standardised tests completed by the main sample of 2,000 secondary school children who were 13-years old in the summer of 1971, together with group discussions and semi-structured interviews with sub-samples. The pupils are attending 17 schools in different parts of England and Wales judged to be representative of the main systems of secondary school organisation.

Testing carried out in the first phase of field work [2] comprised the AH4 group test of general intelligence, the APU occupational interests test, the HSPQ personality questionnaire and our own questionnaire (CPI) providing background information on socio-economic status of the family, level of parents' education and their ambitions for their children, orientation of parents' occupations, pupils' level of aspiration, occupational priorities, their opinions of school subjects and their plans for the future in terms of higher education and career. The second phase of field work, completed this year, involved data collection on pupils' attainment and a follow-up questionnaire (CPII) giving further background information and eliciting details of any changes in their earlier opinions. This included questions on family size, relations with parents and siblings, educational and occupational aspirations, parental involvement and perceived influences on the decision-making process.

Clearly, a study of this kind, involving a large number of variables collected over different time periods, generates a vast amount of data and traditional methods of dealing with it are not only very tedious and time-consuming, but are liable to error. The first phase of testing when the pupils were thirteen years old in fact produced more than a million pieces of information for processing. [3]

The equipment

The need to analyse the markings of these tests and questionnaires, three of which were already available in machine-readable form, led to the decision to use equipment available via a bureau at Document Reading Services[4]. This equipment, consisting of a reader made by Westinghouse and working through a Digital Equipment Corporation PDP15 equipped with magnetic tape decks and line printer, is capable of interpreting pencil line marks. The marks may be placed on a suitably printed answer sheet in positions corresponding to the responses possible to each question. Answer sheets must be prepared using a special ink so that discrimination between answer marks and printing can be achieved. The three tests, AH4, APU and HSPQ, are already available in machine readable form [5], but our two questionnaires had to be printed under the direction of DRS.

The reading speed is in excess of 30,000 sheets per hour (A4), and due to the technique of reading by reflection from the surface of the paper, both sides of a sheet may be read simultaneously through the reader. Discrimination allows a close packing of 'linemarks' and the only effective restriction is the ease with which the sheets can be marked up. An answer sheet for a multi-sheeted questionnaire can frequently be reduced to one or two sheets, but since ease of human reading is vitally important, the questionnaire itself may be used as the answer sheet and the appropriate responses marked on it.

The basic data may be processed immediately and a limited analysis printed as part of the processing carried out on the PDP15. Since, however, it was necessary to merge and match the results of the tests and questionnaires, the basic data was output to magnetic tape written in a suitable form for reading by the Atlas Computer at Chilton [6]: ½ inch IBM compatible tape, 556 bits per inch, even parity, IBM BCD code, in records of 80 characters with variables separated by spaces. Alternative output formats and packing densities are available to meet different requirements.
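The record layout described above (fixed 80-character records, variables separated by spaces) can be illustrated with a minimal Python sketch. It assumes the tape contents have already been transcribed from IBM BCD into an ordinary character string; the function names and the sample values are hypothetical and are not taken from the original processing programs.

```python
RECORD_LENGTH = 80  # record length stated in the text


def split_records(raw: str, record_length: int = RECORD_LENGTH) -> list[str]:
    """Cut a transcribed tape image into fixed-length records."""
    if len(raw) % record_length != 0:
        # A partial record suggests a tape or transcription problem.
        raise ValueError(
            f"length {len(raw)} is not a multiple of {record_length}"
        )
    return [raw[i:i + record_length] for i in range(0, len(raw), record_length)]


def parse_record(record: str) -> list[str]:
    """Return the space-separated variables held in one record."""
    return record.split()


# Hypothetical example: two records, each padded with spaces to 80 characters.
tape_image = "0001 1 3 2".ljust(RECORD_LENGTH) + "0002 2 1 4".ljust(RECORD_LENGTH)
for rec in split_records(tape_image):
    print(parse_record(rec))
```

Rejecting any input whose length is not a whole number of records is one simple way of catching compatibility problems early, before the data reaches the analysis program.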

The Atlas Computer was used for the analysis of the first phase and a statistical package, ASCOP [7], developed at the Laboratory, was chosen for its ease of use by non-programmers and its range of general facilities. A framework within which a research worker could operate with virtually no knowledge of FORTRAN or of operating systems was set up and the computer power made available via an acoustic coupler working into the Atlas multi-access system. It was thus possible for a researcher with some knowledge of the techniques available to specify analyses at will. We would suggest, therefore, that it is not necessary for those who wish to use computers and modern techniques to know the intricate details of programming. Holding data on a computer accessible only to those with an expert knowledge of the language is too restrictive, and the aim should be to make data available as freely as possible.

Work schedule

It is likely, and our experience confirms this, that the rapid collection and scoring of data will be of little avail if the computer processes are not checked out and ready to accept the data with all the potential difficulties it may contain. In the present study, some of these problems were solved by setting up a small pilot on a single school. The basic problems of magnetic tape compatibility were dealt with and a formatting of the data acceptable to the analysis program was agreed. The data from the pilot study was set up to be running whilst the main data set was collected and showed clearly that, despite an apparently standard set of conditions for the initial processing, a number of important differences in the understanding of the workings of generally accepted options appeared. This, of course, delayed the final presentation of the full data for processing.

It should also be pointed out that a great deal of care is necessary in the handling of the data sheets themselves. The anonymous nature of the process, through rapid machine processing to the prepared basic data on magnetic tape, does not permit adjustment of the data if difficulties are found. In this study, the matching of tests carried out at different times and in many different places led to a careful check of what forms were being presented for processing, and a close watch was kept on the way in which the whole batch of data sheets was assembled. Whilst the data for each test and questionnaire are of intrinsic interest, the main interest of the analysis is in the association of the variables contained in them. It is therefore recommended that, for the first stages of a project, the following points should be considered:

Check List

  1. A small pilot study should be conducted using the machine readable sheets with either real data or generated pseudo data.
  2. The machine reading equipment specifications should be established together with an appreciation of the limitations of the process.
  3. The compatibility of magnetic tape processes should be checked by running actual tapes between the machine reading equipment and the computer to be used in the analysis.
  4. The data must be kept under strict control from the moment of assembly until it is ready for machine treatment - do not allow the data to be collected piecemeal somewhere near the machine.
  5. The researcher must allow sufficient time for the processing programs to be prepared and check that standard test sheets really have been used before and are not just similar.
  6. It is expensive and time consuming to make alterations during the project and this can frequently lead to the need for scarce expert help in making up special computer programs to deal with non-standard conditions.
  7. The nature of these processes means that unforeseen difficulties cannot now be explained and remedied quite so readily. A greater sense of what the candidate will do with a question and the subsequent marking of responses must be considered and decisions taken before the final processing of the basic data.

Notes

1. This was an investigation which formed the first stage of an extensive project financed by the Leverhulme Trust in a grant to Political and Economic Planning (PEP). The main object was to identify social and psychological factors relating to the careers of highly qualified young men and women in Britain. Designed as a longitudinal study, the research considered three important stages in the lives of these young people: at about the age of eighteen, when they are on the point of leaving the sixth form; on graduation from university; and at eight years after graduation. An early statement of the project's goals is included in Fogarty, M P, Rapoport, R and Rapoport, R N, Women and Top Jobs, 1967, London. Publications arising specifically from our enquiry included:

HUTCHINGS, D W. 1971. Career Orientation and Level of Aspiration of Sixth Form Boys and Girls. Mimeograph. University of Oxford Department of Educational Studies.

HUTCHINGS, D W and CLOWSLEY, J M. 1970. Why do Girls Settle for Less? Further Education, Autumn, 6-8.

2. The first phase of testing was carried out in the summer term of 1971. The second phase, two years later, has just been completed, and a third phase is planned for Autumn 1973 when the pupils have left school to enter employment or training or gone on into the sixth form.

3. The data from this investigation is being deposited with the Social Science Research Council Survey Archive, University of Essex.

4. Document Reading Services is a commercial organisation whose offices are at 55-57 Newman Street, London W.1.

5. AH4 group test of general intelligence parts I and II, available from the NFER Publishing Co. Ltd. Test Division; APU occupational interests guide intermediate version - male and female, available from University of London Press; High School Personality Questionnaire (Form A) Anglicised, available from the NFER Co Ltd, Test Division. Data obtained from our study has been included in the preliminary British standardisation of this test: see SAVILLE, P and FINLAYSON, L. British Supplement to the High School Personality Questionnaire (Form A) Anglicised 1967-68 Edition (NFER in press).

6. The Atlas Computer has been replaced this year by an ICL 1906A.

7. ASCOP is a statistical and data management computing system developed and written by B E Cooper at the Science Research Council, Atlas Computer Laboratory, Chilton, Didcot, Berkshire. The implementation was completed in 1966. The ASCOP program and further information can be obtained from the Atlas Computer Laboratory.

EXPERIENCES WITH MACHINE-READABLE QUESTIONNAIRE AND TEST DATA

by J E Hailstone (Atlas Computer Laboratory) and D W Hutchings (Department of Educational Studies, Oxford)

23 April, 1974

The use of machine-readable tests employing optical mark techniques was necessary in the study of factors affecting pupils' choice of courses which reflect a scientific or technological bias. In an extensive study involving over 2,000 children, the effective and efficient preparation of the basic data was important. For the three tests required, commercially prepared machine-readable data sheets already existed. In addition, two lengthy questionnaires were set up to collect background and non-standard information. The three machine-readable tests used were as follows:

AH4
designed as a group test of general intelligence, with the aim of including as many different biases and principles in problem solving as possible. Test performance is divided into two parts exemplifying a verbal-numerical bias and a diagrammatic or spatial bias.
HSPQ
(Anglicised version) - unlike many previous tests of this nature, the HSPQ does not produce scores in a simple dichotomy - such as introvert/extrovert and convergent/divergent - but measures a set of fourteen factorially independent dimensions of personality derived from Cattell's 16 P.F.
APU
this occupational interests guide gives an occupational profile for each pupil and enables correlations to be made between aptitude and interest in various careers.

Although this test material can be purchased, it is clear that there is very little information available concerning the handling of this material in bulk. Small test groups have been marked by manual methods, but we were unable to discover any extensive application in the UK which took advantage of the ability to read directly to a computer system.

The two background questionnaires were prepared in consultation with Document Reading Services, London. The special printing required was carried out efficiently and with good advice on questionnaire design for machine reading.

Data was obtained for a main sample of 2,000 secondary school children who were 13 years old in the summer term of 1971, supplemented by group discussions and semi-structured interviews with sub-samples. The pupils attended 17 schools in different parts of England and Wales, judged to be representative of the main systems of secondary school organisation. A follow-up one year later with information from the same group completed the data collection. The handling of this data has led us to recognise two areas of special difficulty which have to be faced when using machine-readable material:

  1. The validation of the process.
  2. The control and management of the interface between the user and the bureau service.

1. VALIDATION OF THE PROCESS

The usual forms of data preparation have now reached the stage where there is a good deal of in-built checking of the data as it is prepared for processing. The expensive but effective process of verification, essentially by double-punching the information, provides a good example. But there are other more subtle forms of checking which have been developed over the years. The punch girls themselves become aware of unusual events and participate in the checking process. With the automatic preparation of data, the process is completely removed from sympathetic human appraisal and checks can be applied positively only at the end of a complex chain of events. The final presentation of the data may well be in a different order from that which can be "read" from the machine-readable data sheets, making checks by eye difficult and only possible for a small sample. It was observed, moreover, that during the machine reading process, bursts of "noise" which adversely affect the "reading" of the data can occur. These are of short duration and may only affect parts of a form. It is likely that, with only limited sample checking, bad batches of data will get through. The "reading" machine operating at 30,000 sheets per hour can be at risk from transients in the reading signals, and large areas of data may be corrupted. Consistency checks must therefore be made more demanding and additional deliberate check questions built in to improve the research worker's confidence in handling the data.
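The kind of consistency checking argued for here might be sketched as follows. This is an illustration under stated assumptions, not the checks actually used in the study: the variable names, the permitted ranges, the idea of repeating a question as a deliberate check, and the threshold for a suspicious run of sheets are all hypothetical.

```python
def check_case(
    responses: dict[str, int],
    valid_ranges: dict[str, range],
    check_pairs: list[tuple[str, str]],
) -> list[str]:
    """Return a list of problems found in one pupil's coded responses."""
    problems = []
    # Range check: every expected variable must be present and permitted.
    for name, allowed in valid_ranges.items():
        value = responses.get(name)
        if value is None:
            problems.append(f"{name}: missing")
        elif value not in allowed:
            problems.append(f"{name}: value {value} out of range")
    # Deliberate check questions: a repeated question must agree with itself.
    for first, second in check_pairs:
        if first in responses and second in responses:
            if responses[first] != responses[second]:
                problems.append(f"{first}/{second}: check question disagrees")
    return problems


def bad_runs(flags: list[bool], min_run: int = 3) -> list[tuple[int, int]]:
    """Find runs of consecutive failing sheets, given in reading order.

    A run of failures may point to a burst of reader noise rather than
    to genuinely odd answers. Returns (first_index, last_index) pairs.
    """
    runs = []
    start = None
    for i, bad in enumerate(flags):
        if bad and start is None:
            start = i
        elif not bad and start is not None:
            if i - start >= min_run:
                runs.append((start, i - 1))
            start = None
    if start is not None and len(flags) - start >= min_run:
        runs.append((start, len(flags) - 1))
    return runs
```

Checking every sheet by program, and then looking for clusters of failures in reading order, is one way of compensating for the loss of the human appraisal that manual punching provided.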

2. INTERFACE BETWEEN THE USER AND THE BUREAU SERVICE

The user faced with the prospect of using a bureau service for the preparation of machine-readable information is immediately involved with a different process from the normal services offered by punched card or paper tape bureaux. The reading machine, which in this case operated at a very high speed of 30,000 sheets per hour (reading both sides of a sheet simultaneously), was linked to a small computer with special system software.

The specifications and requirements of the user's input and output material require a good deal of computer expertise and understanding of the computer process. The output must be prepared in a suitable form for processing on the user's computer facilities and a number of compatibility problems have to be faced at this stage. Perhaps, however, the greatest difficulty comes from the feeling of losing control of the data preparation process. There is no sense of involvement with the data, and the validation process during the actual run must be left to the programmers and operators of the bureau. In the project this difficulty was compounded by the need to match three different tests and a questionnaire for each sample case. In view of the many absences for one or more tests, the data handling problems were severe.
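The matching problem described above can be sketched in Python. The sketch assumes that every sheet carries a pupil identifier and that each test has already been read into a mapping from identifier to scores; the identifiers, the scores and the function name are hypothetical, and the original work was done with ASCOP rather than with code of this kind.

```python
def match_cases(
    datasets: dict[str, dict[str, list[int]]],
) -> tuple[dict[str, dict[str, list[int]]], dict[str, list[str]]]:
    """Match records from several tests by pupil identifier.

    `datasets` maps a test name to {pupil_id: scores}.
    Returns (complete, incomplete): `complete` holds pupils present in
    every test; `incomplete` lists, for each other pupil, the tests missed.
    """
    # Every pupil seen in at least one test or questionnaire.
    all_ids = set().union(*(data.keys() for data in datasets.values()))
    complete = {}
    incomplete = {}
    for pupil in sorted(all_ids):
        missing = [name for name, data in datasets.items() if pupil not in data]
        if missing:
            incomplete[pupil] = missing
        else:
            complete[pupil] = {name: data[pupil] for name, data in datasets.items()}
    return complete, incomplete


# Hypothetical example with two pupils, one absent for the HSPQ.
example = {
    "AH4": {"0001": [52], "0002": [47]},
    "APU": {"0001": [3, 1], "0002": [2, 2]},
    "HSPQ": {"0001": [5, 6]},
    "CPI": {"0001": [1, 4], "0002": [2, 3]},
}
matched, absent = match_cases(example)
print(sorted(matched))  # pupils with all four documents
print(absent)           # pupils with at least one document missing
```

Listing the incomplete cases explicitly, rather than dropping them silently, leaves the decision about how to treat absences with the research worker.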

It is certain that before undertaking projects involving machine-readable material, the research worker should have professional computer assistance available, so that the data handling part of the project does not become unwieldy and a heavy overhead.

For the small batches of data commonly gathered by social science research workers, the overheads of professional assistance may not be justified. Devoting scarce manpower to this interface problem may prove to be just too expensive. The advantages gained at the data collection stage with the use of machine-readable data could be dissipated at this interface. Some attention is therefore necessary to ensure that this part of a computer project is adequately handled and that some standard interfaces are developed.

Predictions of the development of optical mark and optical character reading devices show that growth of devices is likely during the next few years and that card punching and data preparation services of this class will gradually be overtaken by automatic reading machines. It is necessary, therefore, to consider the impact of these new devices on social science research work and to encourage appropriate uses of these new systems, bearing in mind the need for understanding of some of the requirements and difficulties.

The devices offer the possibility of speeding up the data preparation and data handling stage of some survey-type projects to allow the research worker more time for the consideration of hypotheses and structure and for consideration of the science of the project. We should be able to look forward to the time when the data processing becomes only a minor part of any project, but this will only be achieved by a greater emphasis on managing the components of the computing process.

© Chilton Computing and UKRI Science and Technology Facilities Council webmaster@chilton-computing.org.uk
Our thanks to UKRI Science and Technology Facilities Council for hosting this site