If digital computers are to operate successfully in any application, it is essential to detect and prevent errors arising from faulty operation of the equipment. That potential machine users are aware of this is evident from the question most often asked by visitors to a computer installation: "How do you know when the machine makes a mistake?" Phrased in this way, the question relates to mistakes made by the machine during the course of some computation or data-processing task.
Machine errors must be detected, but they are not now as frequent as they were in the days of the prototype computers. Machine reliability has increased and, although a high standard of machine performance is still a necessary condition of successful operation, it is not sufficient. Some safeguards must exist to prevent and detect errors made by the human operators and customers. Mistakes in programmes, inadequate machine handling and errors in data presented to the computer can frequently cause as much havoc as a faulty machine, if not more.
To obtain satisfactory results from a computer installation it is necessary to satisfy the following requirements:
Mistakes will certainly exist in the original data in document form. How much effort should be expended in checking data at this stage will depend on the application. In scientific work, as handled in computing centres, it is certainly profitable. Trained staff are available and the checks can usually be made fairly quickly to test the reasonableness of the data presented by a customer. In systems handling large volumes of data originating in different parts of a large concern it may be necessary to allow such mistakes to go through and concentrate on protection from the point where information is converted into the appropriate machine input medium.
Most computers accept their information in one or both of two forms, punched cards or punched paper tape. The first task within the computer department is to ensure that the input data on original documents is correctly transcribed to the appropriate form. The methods available are similar for both types of input.
Each method has advantages and limitations. Tabulation is slow, and so is useful mainly for small quantities of input data. It has the advantage that any programme check subsequently made on input equipment is a safeguard primarily against faulty operation. In both comparison methods it is necessary to find the place in the original document before correction can be applied. This disadvantage is overcome in the cascade method.
A programme is rarely free from error at its first trial, and it has been said that programmers only make one correct programme - the first.
Coding slips can, and should, be found off the machine by careful scrutiny, but machine time is necessary to discover errors of logical thought or application and errors peculiar to particular sets of input data.
The last-mentioned error is likely to occur with even the most well-tried programmes. The thoroughness with which any programme is tested depends on the ingenuity and far-sightedness of the programmer concerned. Short of checking the programme on every combination of input data, which is clearly impractical, the best that can be done is to check correct operation at both ends of the design range and at other well-chosen points. When a well-tried programme fails because of unforeseen circumstances arising from particular input data it sets a problem for operating and maintenance staff. The knowledge that a well-tried programme is in use constitutes a strong argument for suspecting the machine, and these situations are usually resolved by a jury of operator, service engineer and programmer. Not until the case is proven to the satisfaction of all should any action be taken, for much time can be lost if, for example, the service team is started on a wild goose chase for a non-existent machine fault.
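The advice to check a programme at both ends of its design range and at other well-chosen points might be sketched in a present-day language as follows. This is an illustration only: the function names, the routine under test (an integer square root by repeated subtraction) and the range 0 to 10000 are all assumptions made for the example, and the library function math.isqrt stands in for a trusted reference.

```python
import math


def boundary_test_points(low, high, extra=()):
    """Return test inputs: both ends of the design range, the midpoint,
    and any extra hand-picked points that lie inside the range."""
    if low > high:
        raise ValueError("low must not exceed high")
    points = {low, high, (low + high) // 2}
    points.update(p for p in extra if low <= p <= high)
    return sorted(points)


def check_routine(routine, reference, low, high, extra=()):
    """Compare a routine with a trusted reference at the chosen points.

    Returns the list of inputs at which the two disagree.
    """
    return [x for x in boundary_test_points(low, high, extra)
            if routine(x) != reference(x)]


def isqrt_by_subtraction(n):
    """Hypothetical routine under test: subtract successive odd numbers."""
    root, odd = 0, 1
    while n >= odd:
        n -= odd
        odd += 2
        root += 1
    return root


# An empty list means no disagreement was found at the chosen points.
disagreements = check_routine(isqrt_by_subtraction, math.isqrt,
                              0, 10000, extra=(1, 99, 100))
```

An empty result shows only that the routine agrees with the reference at the points tried; it says nothing of inputs that were not chosen, which is precisely the limitation described above.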
Techniques exist for programme testing some of which depend on features engineered into the computer. Among those in use are the following:-
The machine carries out the steps in the programme at a speed determined by the output tape or card punch, and produces a record of the order in which instructions are obeyed. This record can be taken away from the machine and compared with the flow diagram elsewhere, thus freeing the machine for other work.
Where specially engineered features for such checks are omitted this technique can be performed by specially constructed programmes, variously called tracing programmes or post mortem routines.
The chief disadvantage of the method is the vast volume of evidence which quickly emerges in programmes having many and oft repeated loops.
What may appear as two or three instructions on a flow diagram can easily result in several cards bearing these instructions as many times as the programme requires.
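The growth of the trace record in a looping programme can be shown with a small simulation. The toy instruction set below (ADD, DEC, JUMP_IF_POSITIVE, STOP) is invented for the illustration and does not correspond to any particular machine; the function simply records the serial number of each instruction as it is obeyed.

```python
def run_with_trace(programme, counter_start):
    """Obey a toy programme and record the order in which instructions are obeyed.

    Instructions are tuples: ("ADD", k), ("DEC",),
    ("JUMP_IF_POSITIVE", target) or ("STOP",).
    Returns the final accumulator and the list of instruction numbers obeyed.
    """
    accumulator, counter, pc = 0, counter_start, 0
    trace = []
    while True:
        trace.append(pc)              # one trace record per instruction obeyed
        op = programme[pc]
        if op[0] == "ADD":
            accumulator += op[1]
            pc += 1
        elif op[0] == "DEC":
            counter -= 1
            pc += 1
        elif op[0] == "JUMP_IF_POSITIVE":
            pc = op[1] if counter > 0 else pc + 1
        elif op[0] == "STOP":
            return accumulator, trace
        else:
            raise ValueError(f"unknown instruction {op!r}")


# Three instructions in a loop, then a stop: four entries on the flow diagram.
LOOP = [("ADD", 2), ("DEC",), ("JUMP_IF_POSITIVE", 0), ("STOP",)]

total, trace = run_with_trace(LOOP, counter_start=100)
```

With the loop counter set to 100 the four-instruction programme yields 301 trace records, three for each pass round the loop and one for the final stop.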
In addition to operating at normal speed, computers are arranged to take one step each time a button is pressed, or to carry out the steps in a programme at, say, 5 steps per second. By such means sequences of instructions can be followed at human speeds and the course of a programme traced.
The first of these permits any programme to be stopped on arrival at a selected instruction. The second has much the same effect with the difference that wanted stopping places must be marked in the programme before it is run into the computer.
Most computers include one or more cathode-ray tubes on which are displayed the contents of selected registers and/or blocks of the main store. Not all the store can be viewed simultaneously but selection of any desired portion is usual. By their use it may be established that given registers contain the correct information at chosen points of a programme.
Monitor lamps are also included to display the next instruction (or the last) together with the serial number of the instruction in the case of machines which proceed serially through the instructions.
It is not yet general practice to include extensive built-in checking facilities on all functions of computer operation. Programme checks are still a necessary part of computers. Other lecturers will deal more fully with programme checks and it will be sufficient to state that checks should be made wherever possible to cover:-
The degree of automatic checking of computers varies considerably from one machine to another. Some machines are provided with extensive equipment for checking arithmetic and storage circuits while others have little or none. Where such equipment is scarce the argument is that unless the reliability of the checking circuits is at least one order better than the computing circuits there is little point in providing them. The increase in reliability should be applied to the computing circuits themselves or there will exist a need for more circuits to check the checking circuits and so on. Some compromise can be obtained and the following facilities have been provided on various machines.
All need for both automatic and programme checks on machine operation would disappear, leaving only checks on data validity and operational mistakes, if computers were 100% reliable. Reliability is defined as
(Total hours of error free operation)/(Total hours available for use) × 100%
and when so expressed should be accompanied by figures showing the Total Hours available for use over a given period together with the Scheduled Maintenance time. Equally important is the Reliability Pattern. Two machines may have the same percentage reliability and different reliability patterns. If each is handed over for a 40 hour week, one may break down once only for 4 hours while the other breaks down once each day for 48 minutes, and both will show 90% serviceability.
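The arithmetic of the two machines can be checked with a short calculation. The sketch below applies the definition of reliability given above, working in whole minutes to avoid rounding; the function names and the dictionary layout are assumptions made for the illustration, and a five-day week is assumed for the machine that fails once each day.

```python
def reliability_percent(error_free, available):
    """Error-free time as a percentage of the time available for use."""
    if available <= 0:
        raise ValueError("available time must be positive")
    return 100.0 * error_free / available


def summarise(available_minutes, breakdown_minutes):
    """Report the percentage figure together with the pattern of breakdowns."""
    lost = sum(breakdown_minutes)
    return {
        "reliability_percent": reliability_percent(available_minutes - lost,
                                                   available_minutes),
        "breakdowns": len(breakdown_minutes),
        "longest_breakdown_minutes": max(breakdown_minutes, default=0),
    }


WEEK = 40 * 60                              # a 40 hour week, in minutes

machine_a = summarise(WEEK, [4 * 60])       # one breakdown of 4 hours
machine_b = summarise(WEEK, [48] * 5)       # 48 minutes on each of 5 days
```

Both summaries give 90 per cent, but the first shows one breakdown of 240 minutes and the second five breakdowns of 48 minutes each, which is why the percentage alone does not describe the pattern.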
The reliability of a machine depends on two main factors. These are (i) the inherent reliability engineered into the machine and (ii) the effectiveness of scheduled preventative maintenance. Basic reliability is achieved by a continual process of modification aimed at eliminating weak spots discovered in service while operating serviceability is attained by the correct use of scheduled maintenance periods to ensure that the performance of a machine never falls below an acceptable standard.
When a new machine is commissioned it is subjected to acceptance tests, chosen for their stringency, which test that its performance meets specified standards. The installation team will have done all they can to see that the machine is perfectly adjusted and operating with the widest possible safety margins. In time, and without further attention, these margins would decrease due to drift of component values and loss of emission in valves and ultimately the machine would become unreliable in a most inconsistent and frustrating manner.
To overcome this, marginal checking systems have been devised which can serve as a yardstick of machine performance. The basis of such systems is to apply measurable and controlled changes to the supply parameters which affect machine operation. When the mains input, H.T., bias and heater voltages are adjusted to their optimum values the computer should perform all test programmes, and any other programme, satisfactorily.
By arranging that any of these variables may be offset from the nominal value by a measurable amount while a programme is in operation, the degree of offset necessary to cause the programme to fail can be recorded. Minimum limits may be specified for the amount of variation necessary to cause programme failure and once these limits are established, test programmes can be run each day and be expected to operate at the limits of marginal conditions. This is a first and important step towards confidence in machine operation.
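The procedure of offsetting a supply until the test programme fails, and comparing the result with a specified minimum, might be sketched as below. The machine is represented by a hypothetical function passes_at(offset), which reports whether the test programme ran correctly at a given offset from nominal; the step size, the limits and the simulated failure point of 12 units are invented for the illustration.

```python
def measure_margin(passes_at, step, max_offset):
    """Increase the offset from nominal in equal steps until the programme fails.

    Returns the largest offset tried at which the programme still ran
    correctly, or None if it fails even at the nominal value (offset 0).
    """
    if step <= 0:
        raise ValueError("step must be positive")
    if not passes_at(0):
        return None
    steps_passed = 0
    while ((steps_passed + 1) * step <= max_offset
           and passes_at((steps_passed + 1) * step)):
        steps_passed += 1
    return steps_passed * step


def within_limits(margin, minimum_margin):
    """True if the measured margin meets the specified minimum limit."""
    return margin is not None and margin >= minimum_margin


def simulated_machine(offset):
    """Simulated machine: assumed to fail once the offset exceeds 12 units."""
    return offset <= 12


margin = measure_margin(simulated_machine, step=2, max_offset=20)
acceptable = within_limits(margin, minimum_margin=10)
```

The measured margin is only as fine as the step chosen; a smaller step gives a closer figure at the cost of more runs of the test programme.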
Between one and two hours may be allocated to machine test each morning. Suitably chosen test programmes are run through the machine on full marginal conditions. If such tests are completed successfully it is unlikely that any fault developing due to drift of components or valves will cause the operation of the machine to deteriorate sufficiently for it to fail with the supply lines on their nominal values.
By splitting the machine into separate, marginally checkable sections it is possible to measure, each day, the maximum safe limits for each section of the computer and to keep a daily performance record. Inspection of such records will show trends in marginal operation before these come within the minimum safe limits. Action taken as soon as such trends are evident, to restore the machine to its former performance level, will ensure that it is always at acceptance test standard. Time spent in achieving this state of affairs is well worth while, leading to a high standard of performance and an accompanying spirit of confidence on the part of operating and maintenance staff alike.
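One way of inspecting a daily record for a trend is to fit a straight line to the measured margins and project when the minimum safe limit would be reached. The sketch below is an assumption about how such an inspection might be mechanised, not a method described in the lecture; the readings and the limit of 10 units are invented, and the projection is taken from the latest reading at the fitted rate of change.

```python
def slope_per_day(margins):
    """Least-squares slope of the daily margin readings (units per day)."""
    n = len(margins)
    if n < 2:
        raise ValueError("at least two daily readings are needed")
    mean_x = (n - 1) / 2
    mean_y = sum(margins) / n
    num = sum((i - mean_x) * (m - mean_y) for i, m in enumerate(margins))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def days_until_limit(margins, minimum_margin):
    """Projected days before the margin reaches the minimum safe limit.

    Returns None when the readings show no downward trend.
    """
    slope = slope_per_day(margins)
    if slope >= 0:
        return None
    return max(0.0, (margins[-1] - minimum_margin) / -slope)


# Invented record for one section of the machine over five mornings.
daily_margins = [20, 19, 18, 17, 16]
warning_days = days_until_limit(daily_margins, minimum_margin=10)
```

For the record shown the margin falls by one unit a day, so the limit of 10 would be reached in about six days, leaving time for action within a scheduled maintenance period.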
Besides its value in providing a measurement of machine performance, a marginal checking system also assists in the location of faults when these occur.
Machine failures can be of the following types:-
Faults (1) and (2) are straightforward. The fault persists, exhibits definite effects from which precise causes may be deduced and the appropriate repair effected. The machine will be out of service for the time taken to detect, locate and repair the fault and of these the last is usually the greatest, particularly with faults of type (2).
Intermittent marginal faults may be traced by diagnostic test programmes and the marginal checking system. Some evidence is usually available to give a clue. Let us suppose that such evidence points to a high speed store failure. Test programmes exist which can indicate which section of the high speed store is faulty, and the marginal check system will indicate whether the fault is high or low. This has narrowed the field of investigation to, say, 25 valves. The sectionalised marginal organisation can further narrow the field until finally a special test unit can be used to isolate a particular valve and associated circuit. Pin-pointing the fault has required two grid references, one of which is provided by a test programme which points to a physical part of the machine and the other by the marginal checking system which isolates a part of that unit.
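The idea of two grid references can be pictured as the intersection of two sets of suspects. In the sketch below each valve is labelled with the store section it serves and the marginal group that supplies it; the valve names, sections and groups are entirely hypothetical.

```python
def suspects(valves, failing_section, failing_margin_group):
    """Return the valves lying in both the failing store section (found by a
    test programme) and the failing marginal group (found by marginal checks).

    valves maps a valve name to a (store section, marginal group) pair.
    """
    return sorted(name for name, (section, group) in valves.items()
                  if section == failing_section and group == failing_margin_group)


# Hypothetical layout: two store sections, each split into two marginal groups.
VALVES = {
    "V1": ("A", 1), "V2": ("A", 2),
    "V3": ("B", 1), "V4": ("B", 2),
    "V5": ("B", 1),
}

remaining = suspects(VALVES, failing_section="B", failing_margin_group=1)
```

Neither reference alone isolates the fault: section B contains three valves and group 1 contains three, but only V3 and V5 satisfy both, and these would be passed to the special test unit.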
Intermittent faults are more serious than any other. The fault condition may be present only fleetingly once each day, and may occur only when certain programmes are in operation. The worst aspect of such faults is the loss of confidence created. Although the machine may give many hours of satisfactory operation, the knowledge that an undetected intermittent fault is lurking in the machine is apt to cause operating morale to fall. The only course of action is to collect all evidence and devise routines to provoke the machine into a permanently faulty state and so catch the offending circuit.
All the foregoing remarks about maintenance are based on experience with valve machines. If the introduction of transistors and cores results in eliminating intermittent faults and nothing more this will be a major step towards 100% reliability.