Chilton::ACL::Applications of Computers

10. Error Protection, J Boothroyd M A, AMIEE

1 - INTRODUCTION

If digital computers are to operate successfully in any application it is essential to detect and prevent errors from faulty operation of the equipment. That potential machine users are aware of this is evident from the question most often asked by visitors to a computer installation: "How do you know when the machine makes a mistake?" Phrased in this way the question relates to mistakes made by the machine during the course of some computation or data-processing task.

Machine errors must be detected, but they are not now as frequent as they were in the days of the prototype computers. Machine reliability has increased and although a high standard of machine performance is still a necessary condition of successful operation it is not sufficient. Some safeguards must exist to prevent and detect errors made by the human operators and customers. Mistakes in programmes, inadequate machine handling and errors in data presented to the computer can frequently cause as much, if not more, havoc than a faulty machine.

To obtain satisfactory results from a computer installation it is necessary to satisfy the following requirements:

The input data is free from mistakes.
The programme is fully tested.
The operating instructions are correctly carried out.
The machinery is in good order.

2 - CHECKING INPUT DATA

Mistakes will certainly exist in the original data in document form. How much effort should be expended in checking data at this stage will depend on the application. In scientific work, as handled in computing centres, it is certainly profitable. Trained staff are available and the checks can usually be made fairly quickly to test the reasonableness of the data presented by a customer. In systems handling large volumes of data originating in different parts of a large concern it may be necessary to allow such mistakes to go through and concentrate on protection from the point where information is converted into the appropriate machine input medium.

Most computers accept their information in one or both of two forms, punched cards or punched paper tape. The first task within the computer department is to ensure that the input data on original documents is correctly transcribed to the appropriate form. The methods available are similar for both types of input.

Punched Cards

Direct Verification. Using the original information in document form, a punch operator produces a corresponding set of data on cards. Using the same document, a second operator checks the cards using a hand-verifier.
Comparison. Two punch operators produce two sets of cards from the same document. These are compared using the comparator facility of a punched card reproducer. In the event of disagreement both cards of the faulty pair must be checked against the original document to decide which is correct.
Tabulation. A frequently adopted method of checking depends on the formation of redundant information in the form of a sum check. The total of all input data or totals of suitable groups of it are formed by hand calculation. Such totals are then punched on separate cards following the relevant groups. The hand calculation introduces another source of error but the accuracy of this and the accuracy of punching may be checked in one operation on a tabulator. This machine is arranged to print the data cards and form and print the sum of these, followed by reading and printing the sum check card. If these disagree then either the sum has been incorrectly formed in the hand calculation or by the tabulator, or the cards have been incorrectly punched. Which of these has occurred is readily checked from the printed record or by a repeat of the hand calculation.
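
The tabulator's sum check can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each group is taken to be a list of integer card values followed by the hand-calculated total punched on its sum check card, and the names `tabulate` and `groups` are hypothetical.

```python
def tabulate(groups):
    """List (group index, machine total, punched total, agreement) per group.

    Each group is a pair: the values on the data cards, and the total
    punched on the sum check card that follows them (assumed layout).
    """
    report = []
    for index, (data_cards, sum_card) in enumerate(groups):
        machine_total = sum(data_cards)
        report.append((index, machine_total, sum_card, machine_total == sum_card))
    return report


groups = [
    ([120, 45, 310], 475),  # hand total agrees with the cards
    ([17, 230, 8], 265),    # disagreement: the cards sum to 255
]

for row in tabulate(groups):
    print(row)
```

A disagreement shows only that something is wrong in the group. As the text says, the printed record or a repeat of the hand calculation is still needed to decide whether the punching or the hand total is at fault.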

Punched Paper Tape

Two tape comparison. Two punch operators produce, independently, two tapes from the same document. These are inserted in the readers of a high-speed tape verifier. This machine reads both tapes, compares the characters on each and if they agree, punches a third tape with a copy of the checked character. In the event of disagreement the tape feeds are stopped and the attention of the operator is required to determine from the original document which tape (if any) is correct. The correct character can be inserted from the appropriate tape or if both are wrong a space may be left for subsequent character insertion by hand.
Cascade Verification. A first tape is produced from the original document using a keyboard perforator. A second operator, the checker, inserts the tape in the reader of a hand-verifier and, working from the same document operates the keyboard of the verifier, producing a second tape. The character corresponding to whichever key is depressed is compared with the character at the reading station. If these agree this character is punched in the second tape. Failure to agree causes the keyboard to lock and the checker is expected to carry out the error procedure to determine the correct character and cause this to be punched in the second tape.
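
The logic of the two tape comparison may be sketched as follows. This is a simplified model, not a description of any particular verifier: tapes are represented as Python strings, the function name `compare_tapes` is hypothetical, and a difference in tape length is assumed to be treated as a disagreement at the point where the shorter tape ends.

```python
def compare_tapes(tape_a, tape_b):
    """Copy agreeing characters to a third tape; stop at the first disagreement.

    Returns (verified_tape, position). position is None when the tapes
    agree throughout, otherwise the index at which the feeds would stop
    and the operator would consult the original document.
    """
    verified = []
    for position, (a, b) in enumerate(zip(tape_a, tape_b)):
        if a != b:
            return "".join(verified), position
        verified.append(a)
    if len(tape_a) != len(tape_b):
        # One tape ran out first: treated here as a disagreement.
        return "".join(verified), min(len(tape_a), len(tape_b))
    return "".join(verified), None


print(compare_tapes("12345", "12345"))
print(compare_tapes("12845", "12345"))
```

Note that agreement between the tapes does not prove correctness: two operators who make the same mistake at the same place will pass the comparison.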

Each method has advantages and limitations. Tabulation is slow and useful only for small quantities of input data. It has the advantage that any programme check subsequently made on input equipment is a safeguard primarily against faulty operation. In both comparison methods it is necessary to find the place in the original document before correction can be applied. This disadvantage is overcome in the cascade method.

3 - PROGRAMME TESTING

A programme is rarely free from error at its first trial, and it has been said that programmers only make one correct programme - the first.

Coding slips can, and should, be found off the machine by careful scrutiny, but machine time is necessary to discover errors of logical thought or application and errors peculiar to particular sets of input data.

The last mentioned error is likely to occur with the most well-tried programmes. The thoroughness with which any programme is tested depends on the ingenuity and far-sightedness of the programmer concerned and short of checking the programme on every combination of input data, which is clearly unpractical, the best that can be done is to check correct operation at both ends of the design range and at other well chosen points. When a well tried programme fails because of unforeseen circumstances arising from particular input data it sets a problem for operating and maintenance staff. The knowledge that a well tried programme is in use constitutes a strong argument for suspecting the machine and these situations are usually resolved by a jury of operator, service engineer and programmer. Not until the case is proven to the satisfaction of all should any action be taken for much time can be lost if, for example, the service team is started on a wild goose chase for a non-existent machine fault.

Techniques exist for programme testing some of which depend on features engineered into the computer. Among those in use are the following:-

Programme Display

The machine carries out the steps in the programme at a speed determined by the output tape or card punch, and produces a record of the order in which instructions are obeyed. This record can be taken away from the machine and compared with the flow diagram elsewhere, thus freeing the machine for other work.

Where specially engineered features for such checks are omitted this technique can be performed by specially constructed programmes, variously called tracing programmes or post mortem routines.

The chief disadvantage of the method is the vast volume of evidence which quickly emerges in programmes having many and oft repeated loops.

What may appear as two or three instructions on a flow diagram can easily result in several cards bearing these instructions as many times as the programme requires.
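
The growth of the record can be illustrated with a toy tracing programme. The three-instruction loop and the order code (`LOAD`, `DEC`, `JNZ`, `HALT`) are invented for this sketch and do not correspond to any real machine. The count is assumed to be at least 1.

```python
def make_loop(count):
    """A hypothetical programme: load a count (at least 1), count down to zero, halt."""
    return [
        ("LOAD", count),  # 0: accumulator = count
        ("DEC",),         # 1: accumulator = accumulator - 1
        ("JNZ", 1),       # 2: jump to 1 if accumulator is not zero
        ("HALT",),        # 3: stop
    ]


def run_with_trace(program):
    """Obey the programme, recording the address of every instruction obeyed."""
    trace = []
    accumulator = 0
    address = 0
    while True:
        instruction = program[address]
        trace.append(address)
        name = instruction[0]
        if name == "LOAD":
            accumulator = instruction[1]
            address += 1
        elif name == "DEC":
            accumulator -= 1
            address += 1
        elif name == "JNZ":
            address = instruction[1] if accumulator != 0 else address + 1
        elif name == "HALT":
            return trace
        else:
            raise ValueError(f"unknown instruction {name!r}")


print(run_with_trace(make_loop(3)))
print(len(run_with_trace(make_loop(1000))))
```

The two loop instructions occupy two boxes on a flow diagram, yet a loop count of 1000 already yields a record of some two thousand entries.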

Machine Speed Control

In addition to operating at normal speed, computers are arranged to take one step each time a button is pressed or to carry out steps in a programme at say 5 steps per second. By such means sequences of instructions can be followed at human speeds and the course of a programme traced.

Request Stop and Conditional Halt

The first of these permits any programme to be stopped on arrival at a selected instruction. The second has much the same effect with the difference that wanted stopping places must be marked in the programme before it is run into the computer.

Monitor Facilities

Most computers include one or more cathode-ray tubes on which are displayed the contents of selected registers and/or blocks of the main store. Not all the store can be viewed simultaneously but selection of any desired portion is usual. By their use it may be established that given registers contain the correct information at chosen points of a programme.

Monitor lamps are also included to display the next instruction (or the last) together with the serial number of the instruction in the case of machines which proceed serially through the instructions.

4 - PROGRAMME CHECKS

It is not yet general practice to include extensive built-in checking facilities on all functions of computer operation. Programme checks are still a necessary part of computers. Other lecturers will deal more fully with programme checks and it will be sufficient to state that checks should be made wherever possible to cover:-

Validity of Input Data. Here input does not necessarily mean information supplied from outside the machine. All the four arithmetic operations require two input quantities and in, for example, a division subroutine a check should be made that correct relationships between dividend and divisor are maintained according to rule.
Computer Operation. Where possible whole calculations or parts of a calculation should be checked by an independent calculation, and storage checked by sum checks either internally or externally formed.
Computer Operators. Certain programmes require action by the operator at specified places in the programme. Checks may be devised to cover as many of the wrong things an operator can do as the programmer thinks possible.

Programme checks should be designed to permit diagnosis of a failure when this occurs. Often this diagnosis rests on the logical implications of successful operation up to the last check successfully passed. For example, if data cards passing into a machine are provided with a sum check and a failure occurs then:-

(a) The cards may be wrongly punched (an off-computer fault).
(b) The reading and conversion processes may have been incorrectly performed (a machine error).

If the cards were fully checked (e.g. by tabulation) beforehand then the possibility of (a) is considerably reduced, but (b) is not impossible. Alternatively, an operator may have dropped the cards and re-assembled them in the wrong order, which should be covered by perhaps the most important piece of error protection in any punched card installation - serial numbering of cards.
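
A programme check on serial numbering might take the following form. This is a minimal sketch: it assumes the serial numbers have already been read from the cards into a list and that a pack is numbered consecutively from 1. The function name `first_out_of_sequence` is hypothetical.

```python
def first_out_of_sequence(serials, start=1):
    """Return the position of the first card whose serial number is not
    the one expected, or None when the whole pack is in order."""
    for position, serial in enumerate(serials):
        if serial != start + position:
            return position
    return None


print(first_out_of_sequence([1, 2, 3, 4]))
print(first_out_of_sequence([1, 2, 4, 3]))
```

A sum check alone would not detect the second case, since a total is unaffected by the order in which the cards are read.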

5 - AUTOMATIC CHECKING FACILITIES

The degree of automatic checking of computers varies considerably from one machine to another. Some machines are provided with extensive equipment for checking arithmetic and storage circuits while others have little or none. Where such equipment is scarce the argument is that unless the reliability of the checking circuits is at least one order better than the computing circuits there is little point in providing them. The increase in reliability should be applied to the computing circuits themselves or there will exist a need for more circuits to check the checking circuits and so on. Some compromise can be obtained and the following facilities have been provided on various machines.

ACCUMULATOR OVERFLOW INDICATOR. The range of numbers permissible in a computer is restricted by the size of the register provided. If during a computation numbers grow too large for the register significant digits will be lost and some machines include circuits for detecting when this occurs.
PROGRAMME RHYTHM INDICATOR. The machine rhythm is a function of the programme. By means of a loudspeaker suitably connected to the control circuits this rhythm can be made audible to the operator, who, by experience can detect faulty operation by change of note and/or rhythm.
INPUT AND OUTPUT CODE VALIDITY. By suitably choosing the 5 unit code characters for numerals it is possible to arrange for automatic checking of numeric and selected alphabetic information entering the computer. Punched card machines can read a card at two reading stations and compare the result. A reading station on the output punch can be arranged to read back the information punched and compare this with the original information stored in the machine.
STORAGE CHECKS. Transfers to and from magnetic drums and magnetic tape can be checked by providing what are known as parity-bits. These are redundant information digits which act as indicators to show whether the number of ones in any group is odd or even. The group chosen may be a word, a group of words (a block), a track on a drum, one character or a number of characters. The use of parity checks has increased with the introduction of magnetic tape as a storage medium. Information is frequently written across the tape in 6-bit characters and with characters succeeding each other along the tape in groups, each group making up a word or number of words. Six channels of information are required for the six digits of each character and a seventh channel is usually added to carry the transverse character parity information. If, further, characters are grouped in blocks of ten, an eleventh parity character may be formed to check the longitudinal digit patterns within any block. In such a system parity checking is said to be provided characterwise and blockwise.
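
Characterwise and blockwise parity can be sketched as below. The sketch assumes even parity (the text does not say whether odd or even is used), represents each 6-bit character as an integer from 0 to 63, and ignores the parity of the check character itself. All function names are hypothetical.

```python
def transverse_parity(char):
    """Parity digit for one 6-bit character (even parity assumed)."""
    return bin(char & 0o77).count("1") % 2


def longitudinal_parity(block):
    """Parity character for a block: one parity digit per channel,
    formed here as the bitwise exclusive-or of the characters."""
    check = 0
    for char in block:
        check ^= char & 0o77
    return check


def write_block(block):
    """Attach a seventh-channel digit to each character and form the check character."""
    rows = [(char, transverse_parity(char)) for char in block]
    return rows, longitudinal_parity(block)


def check_block(rows, check_char):
    """Return (characters failing the transverse check, channels failing the longitudinal check)."""
    bad_rows = [i for i, (char, digit) in enumerate(rows)
                if transverse_parity(char) != digit]
    difference = longitudinal_parity([char for char, _ in rows]) ^ check_char
    bad_channels = [bit for bit in range(6) if (difference >> bit) & 1]
    return bad_rows, bad_channels


rows, check_char = write_block(list(range(10)))
rows[4] = (rows[4][0] ^ 0b000100, rows[4][1])  # simulate one digit misread
print(check_block(rows, check_char))
```

With both checks present, a single wrong digit is located by the character and the channel that fail, which is more than either check gives alone.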

6 - MACHINE MAINTENANCE & MARGINAL CHECKING

All need for both automatic and programme checks on machine operation would disappear, leaving only checks on data validity and operational mistakes, if computers were 100% reliable. Reliability is defined as

(Total hours of error free operation)/(Total hours available for use) × 100%

and when so expressed should be accompanied by figures showing the Total Hours available for use over a given period together with the Scheduled Maintenance time. Equally important is Reliability Pattern. Two machines may have the same percentage reliability and different reliability patterns. If each is handed over for a 40 hour week one may break down once only for 4 hours while the other breaks down once each day for 48 minutes, and both will show 90% serviceability.
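
The arithmetic of this example can be checked with a short calculation. Times are held in whole minutes to keep the arithmetic exact. A five-day week is assumed, and the function name `reliability_percent` is hypothetical.

```python
def reliability_percent(available_minutes, breakdown_minutes):
    """Error-free time as a percentage of the time available for use."""
    error_free = available_minutes - sum(breakdown_minutes)
    return 100 * error_free / available_minutes


week = 40 * 60                 # a 40 hour week, in minutes
one_long_breakdown = [4 * 60]  # a single breakdown of 4 hours
daily_breakdowns = [48] * 5    # 48 minutes on each of five days (assumed)

print(reliability_percent(week, one_long_breakdown))
print(reliability_percent(week, daily_breakdowns))
```

Both machines lose 240 minutes in the week, so the percentage cannot distinguish them. Only the list of individual breakdowns shows the pattern.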

The reliability of a machine depends on two main factors. These are (i) the inherent reliability engineered into the machine and (ii) the effectiveness of scheduled preventative maintenance. Basic reliability is achieved by a continual process of modification aimed at eliminating weak spots discovered in service while operating serviceability is attained by the correct use of scheduled maintenance periods to ensure that the performance of a machine never falls below an acceptable standard.

When a new machine is commissioned it is subjected to acceptance tests, chosen for their stringency, which test that its performance meets specified standards. The installation team will have done all they can to see that the machine is perfectly adjusted and operating with the widest possible safety margins. In time, and without further attention, these margins would decrease due to drift of component values and loss of emission in valves and ultimately the machine would become unreliable in a most inconsistent and frustrating manner.

To overcome this, marginal checking systems have been devised which can serve as a yardstick of machine performance. The basis of such systems is to apply measurable and controlled changes to the supply parameters which affect machine operation. When the mains input, H.T., bias and heater voltages are adjusted to their optimum value the computer should perform all test programmes and any other programme satisfactorily.

By arranging that any of these variables may be offset from the nominal value by a measurable amount while a programme is in operation, the degree of offset necessary to cause the programme to fail can be recorded. Minimum limits may be specified for the amount of variation necessary to cause programme failure and once these limits are established, test programmes can be run each day and be expected to operate at the limits of marginal conditions. This is a first and important step towards confidence in machine operation.
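
The measurement just described might be organised as in the sketch below. It is an illustration only: `runs_ok` stands for whatever procedure runs the test programme with a supply offset applied and reports success, offsets are expressed in arbitrary whole units, and every name is hypothetical.

```python
def failure_margin(runs_ok, step, max_offset):
    """Increase the offset in equal steps until the test programme fails.

    Returns the smallest offset tried at which the programme failed,
    or None if it still ran correctly at max_offset.
    """
    steps = 1
    while steps * step <= max_offset:
        offset = steps * step
        if not runs_ok(offset):
            return offset
        steps += 1
    return None


def within_limits(margin, minimum_margin):
    """True when failure occurs only at or beyond the specified minimum offset."""
    return margin is None or margin >= minimum_margin


def simulated_machine(offset):
    """Stand-in for a real test run: this imaginary machine fails at an offset of 12."""
    return offset < 12


margin = failure_margin(simulated_machine, step=1, max_offset=20)
print(margin, within_limits(margin, minimum_margin=10))
```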

Between one and two hours may be allocated to machine test each morning. Suitably chosen test programmes are run through the machine on full marginal conditions. If such tests are completed successfully it is unlikely that any fault developing due to drift of components or valves will cause the operation of the machine to deteriorate sufficiently for it to fail with the supply lines on their nominal values.

By splitting the machine into separate marginal checkable sections it is possible to measure, each day, the maximum safe limits for each section of the computer and keep a daily performance record. Inspection of such records will show trends in marginal operation before these come within the minimum safe limits. Action taken as soon as such trends are evident, to restore the machine to its former performance level, will ensure that it is always at acceptance test standard. Time spent in achieving this state of affairs is well worth while, leading to a high standard of performance and an accompanying spirit of confidence on the part of operating and maintenance staff alike.
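
One simple way to inspect a daily record for a trend is sketched here. The rule used, a fall in margin on each of the last few days, is an assumption chosen for illustration and not a method stated in the lecture. The record is assumed to be a non-empty list of daily margins for one section, and the function names are hypothetical.

```python
def drifting(record, days=3):
    """True when the margin has fallen on each of the last `days` readings."""
    recent = record[-days:]
    return len(recent) == days and all(
        earlier > later for earlier, later in zip(recent, recent[1:])
    )


def needs_attention(record, minimum_margin, days=3):
    """Flag a section whose latest margin is below the limit or is drifting towards it."""
    return record[-1] < minimum_margin or drifting(record, days)


store_section = [20, 20, 19, 18, 17]    # steadily falling, still above a limit of 10
control_section = [20, 19, 20, 19, 20]  # ordinary day-to-day scatter

print(needs_attention(store_section, minimum_margin=10))
print(needs_attention(control_section, minimum_margin=10))
```

The first section is flagged while its margin is still well above the limit, which is the point of keeping the record.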

Besides its value in providing a measurement of machine performance, a marginal checking system also assists in the location of faults when these occur.

Machine failures can be of the following types:-

(1) Complete breakdown through failure of an electronic circuit.
(2) Complete breakdown due to failure of a mechanical component such as a card reader or tape transport.
(3) Intermittent breakdown caused by marginal conditions in an electronic circuit.
(4) Intermittent (and often infrequent) breakdown unaffected by marginal conditions.

Faults (1) and (2) are straightforward. The fault persists, exhibits definite effects from which precise causes may be deduced and the appropriate repair effected. The machine will be out of service for the time taken to detect, locate and repair the fault and of these the last is usually the greatest, particularly with faults of type (2).

Intermittent marginal faults may be traced by diagnostic test programmes and the marginal checking system. Some evidence is usually available to give a clue. Let us suppose that such evidence points to a high speed store failure. Test programmes exist which can indicate which section of the high speed store is faulty and the marginal check system will indicate whether the fault is high or low. This has narrowed the field of investigation to say 25 valves. The sectionalised marginal organisation can further narrow the field until finally a special test unit can be used to isolate a particular valve and associated circuit. Pin-pointing the fault has required two grid references, one of which is provided by a test programme which points to a physical part of the machine and the other by the marginal checking system which isolates a part of that unit.

Intermittent faults are more serious than any other. The fault condition may be present only fleetingly once each day, and may occur only when certain programmes are in operation. The worst aspect of such faults is the loss of confidence created. Although the machine may give many hours of satisfactory operation the knowledge that an undetected intermittent fault is lurking in the machine is apt to cause operating morale to fall. The only course of action is to collect all evidence and devise routines to provoke the machine into a permanently faulty state and so catch the offending circuit.

All the foregoing remarks about maintenance are based on experience with valve machines. If the introduction of transistors and cores results in eliminating intermittent faults and nothing more this will be a major step towards 100% reliability.