The work that forms the basis for this paper was undertaken as part of an exercise to purchase two multi-user minicomputer systems to be developed as interactive facilities for grant holders supported by the Engineering Board of the United Kingdom Science Research Council.
The interactive benchmark was one of a set of machine evaluation tests, which also included an examination of the code produced by the FORTRAN compiler and several measurements of processing power.
There were two major criteria for selection:
Naturally, there was also a very stringent price restriction.
Users of the systems are expected to be developing programs, the majority of which will be written in FORTRAN and require graphics facilities. It is envisaged that storage tube (Tektronix) graphics will be the norm but that some users will require refresh display graphics.
Most manufacturers tendered systems with about 192K bytes of main store after consultations on workload characteristics and expected response times. A community of up to 60 users is to be served by each system. It was estimated that at least 70 Mbytes of exchangeable disc storage space would be required on-line at any time. For reliability reasons manufacturers were asked to tender systems which contained at least two disc drives of equal size.
The minicomputer systems will, in the fullness of time, be linked into a network, but details of this were unavailable and hence no requirement for communications software to be present was stipulated in the benchmark.
Two techniques have been employed in running interactive benchmarks. The first is to invite real users to type a script. The alternative is to use a stimulator, a piece of software resident (normally) in a front end communications processor which submits messages at predetermined rates from internally stored scripts rather than from real terminals. A number of disadvantages of the former technique are readily apparent:
At the time this benchmark was performed (August 1976) minicomputer manufacturers did not have stimulator facilities available, hence six real users were employed.
At the outset the inaccuracies that must occur in a benchmark of this type seemed such an insurmountable problem that any results obtained would be of questionable validity, but nothing ventured, nothing gained.
Although there were six users and only one basic script this does not imply that everyone was typing the same line at the same time. On the contrary, the script was designed so that each user should, if the system performed as the mythical perfect fit of our specification, have been executing a different phase of the script. This feat was accomplished by constructing a script which consisted of six stages.
As a reasonable approximation to a variable workload, it was decided that the split should be two edits, one compilation and three runs of interactive programs. The script had a specified cyclic order and each user started at a different stage. To clarify this point, if the edits are designated EDITA and EDITB and the runs RUNA, RUNB and RUNC then each user followed a script as shown in Table 1.
User | Stage 1 | Stage 2 | Stage 3 | Stage 4 | Stage 5 | Stage 6 |
---|---|---|---|---|---|---|
1 | EDITA | RUNA | COMPILE | RUNB | EDITB | RUNC |
2 | RUNA | COMPILE | RUNB | EDITB | RUNC | EDITA |
3 | COMPILE | RUNB | EDITB | RUNC | EDITA | RUNA |
4 | RUNB | EDITB | RUNC | EDITA | RUNA | COMPILE |
5 | EDITB | RUNC | EDITA | RUNA | COMPILE | RUNB |
6 | RUNC | EDITA | RUNA | COMPILE | RUNB | EDITB |
The elapsed time taken for each stage is important and so each stage was bracketed by TIME commands. All users were required to login and logout at the beginning and end of their script. Users repeated their scripts after logging out until all users had completed one cycle of the script.
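The cyclic ordering of Table 1 can be sketched in a few lines of Python. This is only an illustration of the rotation; the function names `stage_order` and `schedule` are hypothetical and were not part of the original benchmark.

```python
# Stage names in the cyclic order followed by user 1 (from Table 1).
STAGES = ["EDITA", "RUNA", "COMPILE", "RUNB", "EDITB", "RUNC"]


def stage_order(user, stages=STAGES):
    """Return the stage sequence for a 1-based user number.

    Each user follows the same cycle but starts one stage later
    than the previous user.
    """
    offset = (user - 1) % len(stages)
    return stages[offset:] + stages[:offset]


def schedule(n_users=6, stages=STAGES):
    """Build the whole table: user number -> list of stages."""
    return {user: stage_order(user, stages) for user in range(1, n_users + 1)}


if __name__ == "__main__":
    for user, order in schedule().items():
        print(user, " ".join(order))
```

If users stay in step, every column of the resulting table contains each stage exactly once, which is the balanced load the script was designed to produce.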
Each stage was designed to exhibit different characteristics and to last for about 4-5 minutes, so producing a balanced load on the system if users stayed in step. This had the advantage that a machine of inadequate power or guilty of deficiencies in scheduling or memory allocation would tend to tie itself in knots as stages began to overlap. Unfortunately, an over-powered machine finds the whole business too easy and the results, while impressive, do not yield to further analysis.
Manufacturers were supplied with a system independent version of the script from which they were required to produce a translation to run on their particular machine. A copy of the script for the PDP15 (not one of the machines tested) is contained in Appendix A. It is a feature of the design that stages are independent, ie compilation does not depend on a successful edit. This creates firewalls between stages and provides a more robust benchmark.
For easy reading and the distinguishing of significant spaces the manufacturer specific script was copied onto coding sheets. Each user was provided with a copy personalised with regard to the user's identification and terminal speed and complete with space in which to write times.
Attention to detail at this stage was invaluable in ensuring the smooth running of the benchmarking sessions.
It was feared initially that mistyping by the users might be a serious problem. Important commands were clearly marked and remedial action indicated in the scripts. The editing stages were designed to be as insensitive to typing errors as possible but clearly mistyping a string in a context search can have unfortunate effects. Users were told to use common sense as far as possible but the general guideline that if you reach the end of an edit file for the third time, give up, was laid down. In practice little trouble was experienced with typing errors. Typists are remarkably accurate when faced with the prospect of having to run the benchmark again if they make a serious error! The input to interactive programs and text entry within edits was always arbitrary.
The one detail over which the benchmark designer has no control is the operating system's command language. The benchmark certainly indicated vast differences in the conciseness of command languages.
It is to be much regretted that detailed characteristics of user behaviour on minicomputer systems are not available in the literature. Details of user behaviour have been reported by Leeds University for the KDF9 system [1], the University of Edinburgh EMAS system [2] and the Control Systems Centre at UMIST for a DEC KA 10 system [3]. Extrapolating from these figures, a workload (and scripts) that would run on a machine of 2 × 10^5 ips power was derived. It is to be hoped that future work will provide more information on user behaviour in a minicomputer graphics program development area.
EDITA can be performed using either a line editor or a context editor. The file to be edited is FORTRAN text and is approximately 200 lines in length. Strings to be typed were made as simple as possible to minimise errors since, especially in context searches, these may have undesirable consequences. In particular, differences in typing speed between users of differing keyboard experience were reduced through the use of number sequences.
EDITB includes a global edit and can only be performed by a context editor. The file to be edited is approximately 600 lines in length and is of similar construction to that used in EDITA. As in EDITA the script was optimised towards simple constant speed typing.
Both editing stages require that the file be edited to a new file of different name, that is deleted at the end of the stage. This makes it easy for users to repeat script stages.
A FORTRAN program of approximately 900 lines was used in the compilation phase. The text contained a number of deliberate errors (four were discovered by most compilers); in retrospect this was probably a mistake (see 8 (2) and 8 (3)).
The program was to be compiled with minimal listing, thus minimising delays due to typing messages on different speed terminals.
The three interactive programs to be run were different manifestations of an artificial program. The possibility of using real programs in the benchmark was investigated, but was finally rejected. It was felt that the difficulties of implementing real programs on small computers would be substantial especially as the majority of programs are written in non-standard FORTRAN, with or without the programmer's realisation. The labour cost of converting such programs was judged prohibitive. The CPU time consumed by real programs between interactions is impossible to control. Careful control over this was required for this benchmark and was to prove essential.
The artificial program used was adapted from one designed by Dr C J Pavelin for use in another benchmark. The program generates processor activity, input and output (directed at a terminal) and unformatted input and output to the filestore. The levels of each activity were controlled by parameters read from a data file.
The size of the program was governed by the sizes of three arrays.
The programs had the characteristics shown in Table 2.
Program | CPU Load (fraction of 1 MIPS machine) | Size (K bytes) | Disc I/O Transfer |
---|---|---|---|
RUNA | 0.05 | 28 | NONE |
RUNB | 0.03 | 36 | NONE |
RUNC | 0.01 | 44 | 4000 words / CPU sec |
The behaviour of the program is also affected by the time taken by the terminal to output messages. This time is a function of the terminal speed. It was decided to keep this time constant by varying the number of characters output according to the speed. This solution has the drawback that the buffer load placed on the operating system then becomes a function of the terminal speeds available. However, the converse would have the much more serious effect of rendering the elapsed times almost impossible to analyse. The quantitative effects of using terminal speeds different from those requested (1200 bps) are difficult to estimate.
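The idea of holding the terminal output time constant by scaling the message length to the line speed can be illustrated as follows. This is a minimal sketch, not the method used in the artificial program: the function name, the 2 second target and the figure of 10 bits per character (one start bit, eight data bits, one stop bit) are all assumptions, and only the 1200 bps speed comes from the text.

```python
def chars_for_constant_time(speed_bps, seconds, bits_per_char=10):
    """Number of characters that take `seconds` to print at `speed_bps`.

    bits_per_char is an assumption about the asynchronous line framing;
    some slow terminals use 11 bits per character instead of 10.
    """
    return round(speed_bps * seconds / bits_per_char)


if __name__ == "__main__":
    # Hypothetical mix of terminal speeds; 1200 bps is the requested speed.
    for speed in (300, 1200):
        print(speed, "bps ->", chars_for_constant_time(speed, seconds=2.0), "characters")
```

The drawback noted above is visible here: a faster terminal is sent proportionally more characters in the same time, so the buffer load on the operating system grows with the terminal speed.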
If a comparison of different machines is to be made then it is obviously desirable that the benchmark should be run as specified by the script on the exact hardware and software configuration proposed in each manufacturer's reply to tender. A mature filestore should also be required, as this will affect filestore access time. It should also be ensured that other normal system facilities, eg the lineprinter spooler, which are not used in the benchmark are nevertheless present, since these may contribute to the size of the operating system and therefore reduce the amount of store remaining for user programs.
It is at this point that problems start to appear. To benchmark the exact proposed system is a pipe-dream. Almost everything can, and will, be different - from the central processor model number to the size of the discs and the version of the operating system. Very little can be done to combat this directly. If a manufacturer has only certain pieces of kit available and has not finished writing the operating system yet then the benchmark can only be run on the nearest system he can offer and the results extrapolated with the agreement of the manufacturer who may then be required to reproduce those results in an acceptance test.
It is hoped that clearly and unambiguously stating the hardware and software requirements of the system avoids the benchmarking of a machine so grossly malconfigured as to make the results almost totally irrelevant. To lessen the risk of misunderstandings, either deliberate or accidental, it is essential to provide thorough documentation of the benchmark. Time is well spent when issuing the substance of the benchmark in ensuring that all cards or paper tape are labelled, tagged and tied in blue ribbons and accompanied by both listings and specimen output to make the manufacturer's job as simple and straightforward as possible. In general, manufacturers will be so surprised at receiving a benchmark so apparently well thought out that the remainder of the exercise can be carried out in a spirit of friendly cooperation!
Naturally, prior to issuing the benchmark every effort must be made to test its correctness very thoroughly indeed. But, one of the first rules of benchmarking is that something can always go wrong - and it usually does.
In the very last resort stopwatches can be used to measure elapsed times but fortunately most systems support a time of day command which responds to the nearest second. Few systems at this time have any additional information available and so analysis is based on the total elapsed time for the script and the elapsed times for each stage. Therefore a permanent record of these times is needed and those users with VDUs are required to write the times on their scripts. However, users sometimes forget this onerous task and for this, and other, reasons it is valuable to have an extra person at the session who notes each person's approximate stage times with the aid of a stopwatch. This same person can control the starting and finishing of the session and if, or when, the users get bored, perform a useful entertainment function.
With the script translated by the manufacturer and agreed by the potential purchaser the stage has been reached where the team of users may be collected together and transported to the scene of the benchmark run. Fellow programmers of the two-fingered typist level were drafted for these occasions and briefed beforehand. They were found to be preferable to the manufacturer supplied product due to their greater reliability, their similar level of competence and their familiarity with the script. After an initial few practice runs they appeared to reach a learning plateau and their typing speed did not thereafter increase.
For several reasons more than one benchmark run was performed on each machine. Firstly, the manufacturer's prime offer was benchmarked twice to establish the repeatability of the exercise. This naturally has a bearing on the validity of the acceptance test. Surprisingly, the results could be repeated always to within 5% and often to within 1%. Secondly, manufacturers on occasions liked to try to improve their ratings by changing either the hardware or software configurations. Such runs often provided useful information for both parties.
Of course, not every run went according to plan. Both software and hardware crashes tend to occur at critical moments and there was the case of the everlasting FORTRAN compilation. Many runs were never completed and a benchmarking day was almost always a long and tiring affair. However, incomplete runs need not be wasted but can be used as an extra consistency check.
But in general, while the sanity of the users holds out, it is advisable to perform as many runs as possible since they provide consistency checks and additional data on the machine's behaviour. By containing all the users within one room a certain esprit de corps can be established which helps to lengthen the life of tired fingers, though frequent infusions of coffee were also found to be useful. It is to the credit of our long-suffering users that remarkably few typing errors occurred. The smooth running of the benchmark can be aided by providing each user with details on the special characters used on that system eg line termination, line deletion and character deletion, in addition to that user's specially configured script. After giving the users some time to read the script a practice run of about ten minutes was found to be invaluable in highlighting any problems that were likely to occur with the script as well as giving the users an opportunity to become familiar with the terminal. All stages were automatically tested during the trial run due to the design of the script.
A typical set of results is shown in Table 3, with elapsed times written as minutes.seconds. Table 1 can be used as a key to interpret the figures.
User | Stage 1 | Stage 2 | Stage 3 | Stage 4 | Stage 5 | Stage 6 | Total |
---|---|---|---|---|---|---|---|
1 | 3.47 | 1.38 | 1.28 | 1.32 | 2.25 | 3.21 | 14.11 |
2 | 1.49 | 1.48 | 1.39 | 2.45 | 3.03 | 3.41 | 14.45 |
3 | 1.50 | 1.14 | 2.54 | 3.32 | 3.43 | 1.48 | 15.01 |
4 | 1.34 | 2.57 | 3.29 | 3.36 | 1.44 | 1.33 | 14.53 |
5 | 3.07 | 3.22 | 3.23 | 1.47 | 1.30 | 1.36 | 14.45 |
6 | 3.03 | 3.49 | 1.56 | 1.14 | 1.18 | 2.52 | 14.12 |
Although a requirement for a certain mix of terminal speeds may be made of a manufacturer, inevitably occasions arise when this cannot be met. While the effect can be minimised by careful design of the edit commands and the artificial interactive program, the situation is still clearly an unhappy one. Unfortunately there seems to be no solution bar that of providing the terminals oneself. However, one pitfall that can with foresight be avoided is that of using the system console as a user terminal. Many systems report login and logout messages to the console which can have a disastrous effect on the unfortunate user allocated to that terminal.
The analysis of the results is not a well-defined process. With physical users involved the number of runs on any machine is severely limited and so the figures are too few to permit a statistical analysis. The major point of comparison of the machines is the average elapsed time for the complete script. This figure has been shown to be repeatable and forms part of the acceptance tests.
Other useful figures are the average elapsed times per stage. When studied in conjunction with details on the characteristics of that stage these figures can identify weak spots in a machine's performance.
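As an illustration of these two figures, the sketch below reads the Table 3 entries as minutes.seconds, sums them per user and averages them per stage type, using the rotation of Table 1 as the key. The function names are hypothetical. The times are held as strings because a value such as 1.30 means 1 minute 30 seconds, not 1.3 minutes.

```python
STAGES = ["EDITA", "RUNA", "COMPILE", "RUNB", "EDITB", "RUNC"]

# Table 3: elapsed time per stage position for each user, as minutes.seconds.
TABLE3 = {
    1: ["3.47", "1.38", "1.28", "1.32", "2.25", "3.21"],
    2: ["1.49", "1.48", "1.39", "2.45", "3.03", "3.41"],
    3: ["1.50", "1.14", "2.54", "3.32", "3.43", "1.48"],
    4: ["1.34", "2.57", "3.29", "3.36", "1.44", "1.33"],
    5: ["3.07", "3.22", "3.23", "1.47", "1.30", "1.36"],
    6: ["3.03", "3.49", "1.56", "1.14", "1.18", "2.52"],
}


def to_seconds(mmss):
    """Convert a 'minutes.seconds' string to a number of seconds."""
    minutes, seconds = mmss.split(".")
    return int(minutes) * 60 + int(seconds)


def user_totals(table):
    """Total elapsed seconds for the whole script, per user."""
    return {user: sum(to_seconds(t) for t in row) for user, row in table.items()}


def stage_means(table, stages=STAGES):
    """Mean elapsed seconds per stage type, using Table 1 as the key."""
    sums = {name: 0 for name in stages}
    for user, row in table.items():
        for position, t in enumerate(row):
            # User u starts (u - 1) places into the cycle.
            name = stages[(user - 1 + position) % len(stages)]
            sums[name] += to_seconds(t)
    return {name: total / len(table) for name, total in sums.items()}


if __name__ == "__main__":
    totals = user_totals(TABLE3)
    print("mean total (s):", sum(totals.values()) / len(totals))
    for name, seconds in stage_means(TABLE3).items():
        print(name, round(seconds, 1))
```

Summing the rows in this way reproduces the Total column of Table 3, which is a useful check that the stage times have been transcribed correctly.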
The most valuable aid in examining the general progress of users through their scripts is given by the diagram in Figure 1. The diagram shows a well-balanced system with few stage overlaps. Figure 2 shows the results for a different system which had difficulty coping with the program RUNC.
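A rough equivalent of such a progress diagram can be derived from the Table 3 stage times. The sketch below is an illustration only: it assumes that all six users start together at time zero and it ignores login, logout and TIME command overheads, neither of which is stated in the text. The function names are hypothetical.

```python
from itertools import combinations

STAGES = ["EDITA", "RUNA", "COMPILE", "RUNB", "EDITB", "RUNC"]

# Table 3 stage times converted from minutes.seconds to seconds.
TABLE3_SECONDS = {
    1: [227, 98, 88, 92, 145, 201],
    2: [109, 108, 99, 165, 183, 221],
    3: [110, 74, 174, 212, 223, 108],
    4: [94, 177, 209, 216, 104, 93],
    5: [187, 202, 203, 107, 90, 96],
    6: [183, 229, 116, 74, 78, 172],
}


def intervals(times, stages=STAGES):
    """Map stage name -> list of (user, start, end) in seconds.

    Assumes every user starts the script at t = 0.
    """
    out = {name: [] for name in stages}
    for user, row in times.items():
        clock = 0
        for position, duration in enumerate(row):
            name = stages[(user - 1 + position) % len(stages)]
            out[name].append((user, clock, clock + duration))
            clock += duration
    return out


def overlaps(times):
    """List (stage, user_a, user_b) where two users ran the same stage at once."""
    found = []
    for name, spans in intervals(times).items():
        for (u1, s1, e1), (u2, s2, e2) in combinations(spans, 2):
            if s1 < e2 and s2 < e1:
                found.append((name, u1, u2))
    return found


if __name__ == "__main__":
    for name, user_a, user_b in overlaps(TABLE3_SECONDS):
        print(f"{name}: users {user_a} and {user_b} overlap")
```

The more pairs that are reported, the further the users have drifted out of step, which is the effect the diagrams make visible at a glance.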
The two sets of results have average elapsed times for one cycle through the script of 14 mins/34 secs and 29 mins/42 secs respectively. The benchmark, though it may be lacking in some respects, reliably distinguished machines by their performance and was able to detect changes in configurations eg the addition of a store module.
Although it is to be regretted that the opportunity of testing the effect of different disc configurations did not present itself, this exercise has shown that a benchmark of this type can aid in the monitoring of performance over changes in hardware or software in a multi-user minicomputer system.
The extent to which the system has coped with the benchmark load in a clean and balanced manner can be gauged by calculating the standard deviation of the total elapsed time for the script over as many runs as possible on the same configuration. A low figure indicates a well-balanced system.
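The calculation can be sketched as below. Since no multi-run data are reproduced here, the example applies the measure to the six per-user totals of the single run in Table 3 (converted to seconds) purely as a stand-in; in practice the totals from several runs on the same configuration would be pooled. The function name `balance` and the choice of the sample (n - 1) standard deviation are assumptions.

```python
from statistics import mean, stdev


def balance(totals_seconds):
    """Return (mean, sample standard deviation, ratio of the two)."""
    m = mean(totals_seconds)
    s = stdev(totals_seconds)
    return m, s, s / m


if __name__ == "__main__":
    # Per-user totals from Table 3, in seconds (single run only).
    table3_totals = [851, 885, 901, 893, 885, 852]
    m, s, ratio = balance(table3_totals)
    print(f"mean {m:.1f} s, standard deviation {s:.1f} s, ratio {ratio:.3f}")
```

Dividing by the mean gives a dimensionless figure, which may be more convenient when comparing machines whose total times differ widely.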
If it were necessary to repeat this exercise then naturally there are lessons learnt here which it would be valuable to incorporate. A lot of useful information has been gained and it is hoped that our efforts will be of some help to any future traveller along this stony path.
The improvements that could be considered include those listed below:
While it is possible to identify some of the areas in which the benchmark might be improved and suggest appropriate action, there are other trouble spots that present more difficult problems.
Above all, simplicity is the key to success in benchmarking this type of system. In all cases introducing complications was counter-productive.
With a limited amount of time and effort it is impossible to benchmark every aspect of a system's performance. As always a compromise must be reached which represents the optimum return from a given outlay.
It is difficult for us to be objective about our own work but interesting and relevant results were obtained. Of course, the cost of performing such a benchmark must be compared to the value of the machine being purchased. Spending £100K on benchmarking a £50K system is hardly good business. It is not easy to estimate the time taken to design, prepare and run the benchmark but an approximate figure would be 3 man-months.
The authors would like to acknowledge the help of their colleagues within the Atlas Computing Division, especially F R A Hopgood, C J Pavelin, L O Ford, J R Gallop, G W Robinson, P E Bryant, and last, but not least, the manufacturers who made it all possible.
[1] A Multi-Terminal Benchmark, D Holdsworth, G W Robinson and M Wells, Software Practice and Experience, 1, 43 (1973).
[2] Performance Measurement on the Edinburgh Multi-Access System, J C Adams and G E Millard, Proceedings of the International Computing Symposium 1975, 2-5 June 1975, Antibes, France.
[3] Performance Measurement of Time-Sharing Computers, I Scialom, M.Sc. Thesis, University of Manchester, 1975.
This is an example of how the script might be implemented for the DOS operating system of a DEC PDP15 minicomputer.
LOGIN BNC TIME

PIP T DK WORK SRC ← DK FL200 SRC <alt> (Takes local copy of FL200) EDIT OPEN WORK F ABC C /01/10/ LEND (Enter INPUT mode) 1X123456789012345 2X12345678901234567890 3X1234567890 4X123456789012345 (Arbitrary input) 5X12345678901234567890 6X1234567890 7X123456789012345 8X12345678901234567890 (Return to EDIT mode) F GHI M 1 IF F MNP98 D 10 P F QRS12 C /A/B/ T F DEF012 CLOSE EXIT PIP D DK WORK SRC <alt> TIME (End of first stage)

LOAD ←BENCHA <alt> <ctrl s> (File assignments performed internally) 11 (Terminal identifier) AB CDE FG HIJ TIME (End of second stage)

F4 B+F900 <alt> PIP D DK F900 BIN TIME (End of third stage)

LOAD ←BENCHB <alt> <ctrl s> 21 ABCDEFG HIJK LMNOPQR STUV TIME (End of fourth stage)

PIP T DK WORK SRC ← DK FL600 SRC <alt> EDIT OPEN WORK CONVERT /SER3/SER4/ T F ABC C /34/234/ L DEFGH C /6/123456789012345678906/ L KLM98 F XYZ (Fails) T F KLM98 (Return to previous position) F PQR D F STU12 C /S/W CLOSE EXIT PIP D DK WORK SRC <alt> TIME (End of fifth stage)

LOAD ←BENCHC <alt> <ctrl s> 31 ABCDEFGHIJK LMNOPQR (This program performs some I/O to the filestore) STUVWXYZABC DEFGHIJ PIP D DK FILE0 SRC <alt> (Delete file created) TIME (End of sixth stage)

LOGOUT