Challenges

Simon Lavington

11.04.2013

Two challenges encountered when developing the Ferranti Atlas computer.

This note describes two fault-finding issues that posed interesting problems at the time. It draws on information given in audio interviews with four Atlas pioneers [ref. 1] and on subsequent correspondence with the interviewees.

The issues are both examples of what could be described as: things can change whilst you're not looking. Between 1960 and 1963, when Atlas was being developed, the two issues were relatively novel and were brought sharply into focus by the nature of the Atlas high-performance design. In particular, Atlas had several autonomous hardware units competing asynchronously for shared resources, and an interrupt-driven Supervisor.

Issue (1): arbitration between asynchronous events.

At the Atlas hardware level, it was necessary to decide between two or more independent units that were signalling a request for a single resource. Arbitration was carried out on the basis of fixed priorities. Problems arose when two or more asynchronous requests reached the decision-making circuit very close together in time. This could cause a narrower-than-normal pulse to be produced -- for example 50 nsec. instead of 100 nsec. wide. Narrower pulses might or might not convey sufficient energy to change the state of a decision flip-flop (bistable). Under adverse conditions, the flip-flop might even hover in a metastable state, only finally reverting to the standard set or reset states after a certain settling-time. This phenomenon was recognised by the Atlas design engineers at Manchester around 1959/60, at the time of developing the Atlas Pilot Model [ref. 2]. Dai Edwards also remembers discussing the issues with David Wheeler of Cambridge University. Dai says that it was not known whether anyone else, world-wide, had yet drawn attention to the settling-time problem [ref. 2].

As is mentioned by Dai Edwards in the interview [ref. 1], one important Atlas area where the problem was known to manifest itself was in arbitrating between three relatively high-priority autonomous units (the central processor, the drum system and the magnetic tape system) when they requested access to the main core store. Some experiments were therefore carried out at Manchester, on the basis of which an allowance for settling-time was made for Atlas that appeared to give an acceptable mean time between failures (MTBF) consistent with the known maximum rate of requests to access the main core store. In the limit, of course, there was no known physical circuit that could guarantee unambiguous decisions in a reasonable finite time.
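
To make the fixed-priority rule concrete, here is a minimal present-day sketch in Python. The unit names and the priority ordering are illustrative assumptions rather than Atlas documentation, and the real decision was of course made by hardware: an idealised program like this always gives a clean answer, which is precisely what the real decision flip-flop could not guarantee when requests arrived almost simultaneously.

```python
# Minimal sketch of fixed-priority arbitration for a single shared resource
# (the main core store).  Unit names and priority ordering are illustrative
# assumptions only; they are not taken from Atlas documentation.

# Lower number = higher priority (assumed ordering).
PRIORITY = {"drum": 0, "tape": 1, "cpu": 2}

def grant(requests):
    """Return the requesting unit with the highest fixed priority, or None."""
    pending = [unit for unit in requests if unit in PRIORITY]
    if not pending:
        return None
    return min(pending, key=lambda unit: PRIORITY[unit])

# Three near-simultaneous requests: the drum wins every time.
print(grant({"cpu", "tape", "drum"}))   # -> drum
print(grant({"cpu", "tape"}))           # -> tape
print(grant(set()))                     # -> None
```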

Ten years later, in about 1970, engineers at Manchester re-visited the settling-time problem because it had become critically important during the design of the MU5 high-performance computer, the successor to Atlas. More experiments were carried out, resulting in the derivation of formal mathematical relationships between the settling-time, the gain-bandwidth product of the decision flip-flop and the resulting mean time between failures (MTBF) for a given rate of requests [ref. 3]. For a desired MTBF, the faster the flip-flop (i.e. the better its high-frequency performance), the shorter the required settling-time.
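
The relationship in question is nowadays usually quoted in a form along the following lines (the notation here is the modern textbook one and may differ from that used in ref. 3): the mean time between synchroniser failures is approximately exp(t/tau) / (Tw x f1 x f2), where t is the settling-time allowed, tau is the resolution time-constant of the flip-flop (roughly the reciprocal of its gain-bandwidth product), Tw is the width of the metastability window, and f1 and f2 are the rates of the two asynchronous event streams. A small numerical sketch, using entirely illustrative figures:

```python
import math

def mtbf(settling_time, tau, t_window, f1, f2):
    """Approximate mean time between synchroniser failures, in seconds.

    settling_time : time allowed for the decision flip-flop to resolve (s)
    tau           : resolution time-constant of the flip-flop (s), roughly
                    the reciprocal of its gain-bandwidth product
    t_window      : width of the metastability window (s)
    f1, f2        : rates of the two asynchronous event streams (Hz)
    """
    return math.exp(settling_time / tau) / (t_window * f1 * f2)

# Entirely illustrative figures -- NOT measured Atlas or MU5 values.
seconds = mtbf(settling_time=100e-9, tau=5e-9, t_window=1e-10,
               f1=1.0e6, f2=1.0e5)
print(f"MTBF about {seconds / 86400:.0f} days")   # roughly 560 days here
```

Because the exponent is t/tau, a faster flip-flop (smaller tau) needs a proportionally shorter settling-time to achieve the same MTBF, which is the point made in the last sentence above.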

Issue (2): Restart points.

When writing code at the centre of an Operating System, particularly Interrupt level subroutines, one of the most important things to decide is where to locate Restart Points. Any subroutine in the Atlas Supervisor could be interrupted by a more important Interrupt subroutine (for example, those associated with the drum system or the magnetic tape system). If this happened, then control was in due course returned to the original subroutine at a Restart Point. This meant that some instructions could be unexpectedly carried out twice, and care was needed to ensure that the contents of some index registers did not get corrupted. This was particularly important in the case of single-bit switches (of which there were several). If a Supervisor subroutine was interrupted after setting a switch and before resetting the Restart Point, it was possible to find oneself in a subroutine with an "irregular" switch setting. For Atlas, the general rule was always to use Boolean instructions (rather than arithmetic instructions) when setting switches.
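
The point about Boolean versus arithmetic instructions can be illustrated with a small sketch (written in Python for readability; the Supervisor itself was of course written in Atlas machine code, and the names below are invented). A Boolean "set this bit" operation is idempotent, so repeating it after a return to the Restart Point does no harm, whereas an arithmetic addition is not:

```python
# Sketch: why single-bit switches were set with Boolean rather than
# arithmetic instructions.  If an interrupt returns control to an earlier
# Restart Point, the instructions after that point may be executed twice.
# Variable names are invented for illustration.

SWITCH_BIT = 0x01

# Boolean (idempotent) setting: executing it twice gives the same result.
flags = 0x00
flags |= SWITCH_BIT
flags |= SWITCH_BIT            # re-executed after an interrupt: still 0x01

# Arithmetic (non-idempotent) setting: a second execution corrupts the word.
word = 0x00
word += SWITCH_BIT
word += SWITCH_BIT             # re-executed after an interrupt: now 0x02

print(hex(flags), hex(word))   # 0x1 0x2
```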

In his talk at the Atlas 50th Anniversary Symposium in Manchester on 5th December 2012, David Howarth touched on this point. The general type of incident that David discussed, where a peripheral fired off a spurious signal and the peripheral fault-handling software mis-handled it, was difficult to trace. The faults were usually inconsistent and were notoriously difficult to reproduce. Mike Wyld remembers [ref. 4] spending ages looking for a problem caused by a faulty door open/closed switch on a magnetic tape deck, which produced symptoms in an apparently unconnected part of the Supervisor system. A less-than-rigorous understanding of the importance of the Restart Point technique was the cause of many intermittent incidents in the early days of developing the Atlas Supervisor.

Someone once said that if a problem was consistent it was probably software and if inconsistent it was probably hardware. On Atlas this was no longer true. Variations in the timing or order-of-occurrence of Interrupts could cause Supervisor software to behave differently on different occasions. If an interrupt level subroutine had a very high priority (say dealing with a specific magnetic tape situation) it might very rarely get interrupted, and there were several situations where routines with incorrect Restart Point settings ran for many months without causing problems. Whilst the code might have been logically correct, incorrect positioning of the Restart Point was the cause of some interesting problems.

This Restart Point concept was used throughout the Atlas Supervisor code and was an important aid to fault finding, since the last Restart Point setting told you where the software had been when a fault occurred. The first thing one did when called to look at a suspected Supervisor fault was to check the Restart Point; the console display was usually set to show that address [ref. 4].

References.

1. Archival audio interviews with D B G Edwards, E C Y Chen, D Howarth and M T Wyld, 6th December 2012. The transcripts will appear shortly on the Atlas 50th Anniversary website: www.cs.manchester.ac.uk/Atlas50/

2. Discussion with D B G Edwards, 9th April 2013.

3. D J Kinniment & J V Woods, Synchronisation and arbitration circuits in digital computers. Proc. IEE vol. 123, 1976, pages 961 -- 966. See also: D J Kinniment, Synchronisation and arbitration in digital systems. Wiley, 2007. ISBN: 978-0-470-51082-7.

4. M T Wyld, e-mail sent to Simon Lavington dated 28th December 2012.
