In almost any discussion of computer applications the subject of information retrieval is sure to arise sooner or later. It is a subject of great interest on which people have strong views. There is general agreement that "the information retrieval problem must be solved," but what this means and how it is to be done are hotly debated questions. At the Atlas Laboratory a number of us have been interested in information retrieval for some time. In May 1965 three of us (E. B. Fossey, F. R. A. Hopgood and myself) went to the U.S.A. for the I.F.I.P.S. Conference and took the opportunity of visiting a number of computer installations. At M.I.T. we met Dr. Kessler, who has set up an on-line information retrieval project as an adjunct of Project MAC. Kessler has described his project elsewhere [1] and I will not repeat his account here; but I gratefully acknowledge that our information retrieval system is a close copy of his. There are differences of detail, but all the key ideas are present in his system. We were struck immediately by its basic soundness and simplicity, and by the relatively small amount of effort it required to get working and to maintain. That it worked was beyond dispute, for one of us (F.R.A.H.) used it to retrieve one of his own papers and related papers. The results appeared on the typewriter within seconds: a very effective demonstration.
Our objectives in setting up this system are as follows:
It is still too early to write the final summary, and this will not be possible until the system is working on-line to Atlas via consoles; so this account should be regarded as a progress report, written approximately twelve months after we decided to begin work on the system.
For each book in the Atlas Laboratory library a series of cards is punched giving its author(s), title and publisher. Each card relating to a book begins with a seven-digit number which uniquely identifies the book. The first four digits indicate the subject matter of the book (according to a scheme developed elsewhere which we adopted), and digits five to seven give a serial number to the book within that class. An example will illustrate the format:
65600011 KEISTER, WILLIAM
6560001π RITCHIE, ALISTAIR, E
6560001π WASHBURN, SETH, H
65600012 THE DESIGN OF SWITCHING CIRCUITS
65600019 NEW YORK, VAN NOSTRAND, 1951
The card giving the first author (or editor, for a collection of papers etc.) has a 1 punched in column eight, subsequent authors being identified by a π punched in column eight. The title begins on a card having a 2 in column eight and continues as far as necessary on to cards having 3, 4, ... in column eight. Finally, a card with a 9 in column eight gives the publisher, and this is the last card associated with that book.
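The card layout just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Atlas machine-code program: the names `Book` and `parse_book_cards` are hypothetical, and the sample cards are written as the text describes them, with the card type (1, π, 2, 3, ..., 9) in column eight.

```python
from dataclasses import dataclass, field


@dataclass
class Book:
    number: str                                   # columns 1-7, identifies the book
    authors: list = field(default_factory=list)   # type 1 card, then type π cards
    title: str = ""                               # type 2 card plus continuations 3, 4, ...
    publisher: str = ""                           # type 9 card

    @property
    def subject_class(self):
        return self.number[:4]                    # digits one to four

    @property
    def serial(self):
        return self.number[4:7]                   # digits five to seven


def parse_book_cards(cards):
    """Group card images into Book records keyed on columns 1-7 (hypothetical helper)."""
    books = {}
    for card in cards:
        number, kind, text = card[:7], card[7], card[8:].strip()
        book = books.setdefault(number, Book(number))
        if kind in ("1", "π"):                    # first or subsequent author
            book.authors.append(text)
        elif kind == "9":                         # publisher, last card of the book
            book.publisher = text
        elif kind.isdigit():                      # 2, 3, 4, ...: title and continuations
            book.title = (book.title + " " + text).strip()
        else:
            raise ValueError(f"unknown card type {kind!r}")
    return books


cards = [
    "65600011 KEISTER, WILLIAM",
    "6560001π RITCHIE, ALISTAIR, E",
    "6560001π WASHBURN, SETH, H",
    "65600012 THE DESIGN OF SWITCHING CIRCUITS",
    "65600019 NEW YORK, VAN NOSTRAND, 1951",
]
books = parse_book_cards(cards)
```

Keying on the first seven columns means the cards for one book need not be adjacent in the input, although on the real card deck they presumably were.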
A similar method is employed for journal articles. We now have to indicate the name of the journal, volume number, year and page number, as well as author and title. The significance of the columns on the card is as follows:
Columns | Significance |
---|---|
1,2 | Journal identification number |
3,4 | Year of publication |
5-8 | Page on which article begins |
9-12 | Volume number or edition |
13,14 | Data type identification number |
Columns 15, 16 are left blank (Atlas has an 8 character word, so this is convenient). The data types identified in columns 13, 14 are:
Symbol | Meaning |
---|---|
01 | First author's name |
0π | Subsequent joint authors |
02 | Title of article, first card |
0/ | Title continuation cards |
+0 | First reference to another article (details begin in column 17) |
+& | Further references to other articles |
+9 | Total number of references to other articles |
-0 | First reference to this article by other articles |
-& | Further references to this article by other articles |
Entries belonging to the class of either of these last two data types are worked out automatically by the updating program; when a new article, X, is added to the list, all the articles to which it refers are looked up and their records are extended to include the new information that they have been referred to by article X. An example of data put into the journal updating program is
00600084000301 WINDLEY, P. F.
00600084000302 TREES, FORESTS AND REARRANGING
006000840003+0 005800710001
006000840003+& 005900010002
006000840003+& 025601340003
006000840003+& 006000150003
006000840003+9 0005
The meaning of this is that we have a record of a paper by P. F. Windley in The Computer Journal (= Journal 00), 1960, Volume 3, beginning on page 84, with the title "Trees, Forests and Rearranging". The paper contains five references; three are to articles in The Computer Journal and one to an article in The Journal of the Association for Computing Machinery (= Journal 02). The fifth article is not listed; it was in a journal which was not among those included in the system (though it has since been added).
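As an illustration of how such a card deck decodes, the following Python sketch cuts each card by column and collects the cards of one article into a single record. The function names and the dictionary layout are assumptions made for the example; only the column positions and data type symbols come from the tables above.

```python
def split_card(card):
    """Cut a card image into (article key, data type, text) by column position."""
    return card[0:12], card[12:14], card[14:].strip()


def decode_key(key):
    """Break the twelve-digit article key into its four fields."""
    return {
        "journal": key[0:2],    # columns 1, 2
        "year": key[2:4],       # columns 3, 4
        "page": key[4:8],       # columns 5-8
        "volume": key[8:12],    # columns 9-12
    }


def assemble(cards):
    """Collect the cards of each article into one record (hypothetical layout)."""
    articles = {}
    for card in cards:
        key, dtype, text = split_card(card)
        art = articles.setdefault(
            key, {"authors": [], "title": "", "refs": [], "ref_total": None}
        )
        if dtype in ("01", "0π"):       # first author, subsequent joint authors
            art["authors"].append(text)
        elif dtype in ("02", "0/"):     # title and its continuation cards
            art["title"] = (art["title"] + " " + text).strip()
        elif dtype in ("+0", "+&"):     # references to other articles
            art["refs"].append(text)
        elif dtype == "+9":             # total number of references
            art["ref_total"] = int(text)
        else:
            raise ValueError(f"unhandled data type {dtype!r}")
    return articles


cards = [
    "00600084000301 WINDLEY, P. F.",
    "00600084000302 TREES, FORESTS AND REARRANGING",
    "006000840003+0 005800710001",
    "006000840003+& 005900010002",
    "006000840003+& 025601340003",
    "006000840003+& 006000150003",
    "006000840003+9 0005",
]
windley = assemble(cards)["006000840003"]
fields = decode_key("006000840003")
```

Here `windley["refs"]` holds the four listed references while `windley["ref_total"]` is 5, which reproduces the situation described above: one cited article lay outside the journals covered at the time.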
The program is really a collection of programs for loading, updating, sorting, editing and retrieval. When new books or journal articles are to be added to the system, the appropriate magnetic tapes are loaded, the new cards read in and the tapes updated. Each book gives rise to two records on the book/author tapes, viz.
Each journal article gives rise to a number of records on the journal tape, viz.
In the world of information retrieval this last process is called "building up a citation index". It is the key to the whole system.
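The updating step that builds the citation index can be sketched as follows. This is a minimal illustration, assuming records held in a Python dictionary rather than on tape; `add_article` is a hypothetical name, the second article key is invented for the example, and creating a stub record for a cited article that has not yet been punched is an assumption of the sketch.

```python
def add_article(records, key, refs):
    """Add article `key` with its outgoing references and update every cited record.

    `records` maps an article key to a list of (data type, other key) entries.
    """
    entry = records.setdefault(key, [])
    for i, target in enumerate(refs):
        # "+0" marks the first reference to another article, "+&" the further ones
        entry.append(("+0" if i == 0 else "+&", target))
    for target in refs:
        cited = records.setdefault(target, [])   # stub record if not yet on file (assumption)
        already_cited = any(dtype.startswith("-") for dtype, _ in cited)
        # "-0" marks the first reference to this article, "-&" the further ones
        cited.append(("-&" if already_cited else "-0", key))


records = {}
add_article(records, "006000840003", ["005800710001", "005900010002"])
add_article(records, "006100200004", ["005800710001"])   # invented key, for illustration only
```

After the two calls, the record for `005800710001` carries a `-0` entry naming the first citing article and a `-&` entry naming the second, so the question "who has referred to this paper?" is answered by reading one record.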
As soon as the Atlas disc is fitted we intend to replace the tape files by disc sectors, although the tapes will remain as archive records.
The user can ask certain questions. If he wishes to know what books we have on, say, Group Theory, he can easily find out; if he wishes to know what books we have in the library written by author X, he can find that out just as easily. We have been able to do this for many months: it was, of course, very easy to implement once we had written the updating and editing programs.
The most interesting part of the retrieval process is associated with retrieval of journal articles. Let us suppose that the user wants to know if we have any references to articles on the numerical solution of integral equations. Since we do not use a key-word or abstract based retrieval system, how is the user to get going at all? The answer is that we assume that he knows at least one reference to a paper on the subject of integral equations. This is not very much to ask. Let us suppose that the user knows that Elliott has written a paper on this subject; the user then types in
Find author ELLIOTT
The machine responds with a list of all papers by ELLIOTT. Among these we find:
00630102000601 ELLIOTT, DAVID
00630102000602 A CHEBYSHEV SERIES METHOD FOR THE
0063010200060/ NUMERICAL SOLUTION OF FREDHOLM INTEGRAL EQUATIONS
If we wish we can now follow this reference up by asking
Give references from 006301020006
This will produce a list of the references given in the paper by Elliott, and we discover that there are twelve. Of these twelve, three obviously refer to integral equations and their references can be further recovered, and so on. A further line of retrieval is to make use of the citation index aspect of the system by making a request of the type
Give references to 006301020006
and a list of all papers which have referred to the paper by Elliott will be produced. Again one can ask
Give papers related to 006301020006
and the program will print out all those papers which have given one or more references to the papers cited by Elliott.
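The three requests can be expressed over a simple table of outgoing references. The sketch below is illustrative only: the function names are hypothetical, and short labels such as "ELLIOTT", "A" and "P" stand in for the twelve-digit article keys.

```python
def references_from(refs, key):
    """Papers cited by `key` (Give references from ...)."""
    return list(refs.get(key, []))


def references_to(refs, key):
    """Papers that cite `key` (Give references to ...)."""
    return sorted(k for k, cited in refs.items() if key in cited)


def related_to(refs, key):
    """Papers sharing at least one cited paper with `key` (Give papers related to ...)."""
    mine = set(refs.get(key, []))
    return sorted(k for k, cited in refs.items() if k != key and mine & set(cited))


# Stand-in labels, not real article keys: each paper maps to the papers it cites.
refs = {
    "ELLIOTT": ["A", "B"],
    "P": ["ELLIOTT", "C"],   # P cites Elliott
    "Q": ["A", "D"],         # Q shares the cited paper A with Elliott
    "R": ["D"],              # R has nothing in common with Elliott
}

backward = references_from(refs, "ELLIOTT")   # ["A", "B"]
forward = references_to(refs, "ELLIOTT")      # ["P"]
related = related_to(refs, "ELLIOTT")         # ["Q"]
```

The scan in `references_to` is linear in the number of papers; a stored list of citing papers on each record would replace it with a direct look-up.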
In Kessler's scheme one can qualify the retrieval conditions by asking, for example, that only references later than 1962 are reported, or that only articles from certain journals be given, and so on. We intend to implement a similar system. Another important facility is one enabling the user to ask how many references he will be given if he asks a certain question, and if there are too many or too few he may wish to qualify or change the request before seeing any print-outs.
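Because the journal and year are carried in the article key itself, qualifications of this kind need no extra data. The following sketch is one possible shape for them, not a description of Kessler's implementation or of ours; `qualify` is a hypothetical name, and reading the two-digit year as 19xx is an assumption.

```python
def qualify(keys, after_year=None, journals=None):
    """Keep only the keys that satisfy the optional year and journal conditions."""
    selected = []
    for key in keys:
        journal = key[0:2]                 # columns 1, 2
        year = 1900 + int(key[2:4])        # columns 3, 4, assumed to mean 19xx
        if after_year is not None and year <= after_year:
            continue
        if journals is not None and journal not in journals:
            continue
        selected.append(key)
    return selected


keys = [
    "005800710001",
    "005900010002",
    "025601340003",
    "006000150003",
    "006301020006",
]

recent = qualify(keys, after_year=1959)    # only articles later than 1959
acm_only = qualify(keys, journals={"02"})  # only articles from journal 02
how_many = len(recent)                     # the count can be reported before any print-out
```

Reporting `how_many` first lets the user tighten or relax the conditions before asking for the list itself.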
The great advantage of the citation index is that it allows retrieval to proceed forwards as well as backwards in time once a starting point has been established. It would be a difficult and expensive system to organise and maintain manually, but it is very easy and not expensive for a computer. Thus, the use of a citation index is natural in a computer-based information retrieval scheme and adds enormously to the retrieval potential.
One could clearly extend the set of retrieval commands, and we may do this later in the light of experience; but this will not be until the system is being used as it was always designed to be used, on-line from a console via a satellite computer and disc to Atlas.
One of the attractions of Kessler's system is the remarkably low cost in terms of both quality and quantity of staff. Many schemes require highly qualified abstracters. Kessler got his system working on the 7090 at M.I.T. in little over a year with a total staff that never exceeded four: himself and two or three students. Our experience has been closely parallel. During the first year of the project only the following people have worked on the project: Miss M. P. Richards (a student from Bristol College of Advanced Technology), August to December 1965; Miss E. Litherland (a student waiting to go to Oxford University), February to August 1966; Mrs. S. M. T. Harold (a junior programmer on the Atlas Laboratory staff), January to July 1966. In addition I gave perhaps one day a week to the project throughout the year. This low cost is all the more remarkable when it is pointed out that all the programming is in machine-code and that Miss Richards and Miss Litherland had never programmed in machine code before.
In addition to the programming effort the data has to be punched up. At the moment the articles are punched from about a dozen journals, all related to computers. We decided upon these after studying which were the most quoted. The number can be extended at any time. Punching of the current journals and steadily working away at the backlog makes a background job for the data preparation section of the Atlas Laboratory. Punching of the backlog is going back to 1960 for all journals on the list, and when this is complete selected journals will be punched from 1959 backwards. The effort required has not been great. One punch girl has been on the project for about three months, and in this time has punched up the articles with references from about 150 issues of journals. Included among these are The Computer Journal, Mathematics of Computation, and The Journal of the A.C.M., which have been punched back to 1960.
The present state of the system is that it has nearly reached completion so far as the off-line use of the computer is concerned. The last phase, on-line retrieval, will require the implementation of the simple language outlined in paragraph IV, and this will not take long when the satellite and consoles have been added.
[1] The M.I.T. Technical Information Project, M. M. Kessler, Physics Today, March 1966, 28-36.