ACL Applications: Information Retrieval at ACL
Information Retrieval

Bob Churchhouse

1966

I. Introduction

In almost any discussion of computer applications the subject of information retrieval is sure to arise sooner or later. It is a subject of great interest on which people have strong views. There is general agreement that "the information retrieval problem must be solved," but what this means and how it is to be done are hotly debatable questions. At the Atlas Laboratory a number of us have been interested in information retrieval for some time. In May 1965 three of us (E. B. Fossey, F. R. A. Hopgood and myself) went to the U.S.A. for the I.F.I.P.S. Conference and took the opportunity of visiting a number of computer installations. At M.I.T. we met Dr. Kessler, who has set up an on-line information retrieval project as an adjunct of Project MAC. Kessler has described his project elsewhere [1] and I will not repeat his account here; but I will now gratefully acknowledge that our information retrieval system is a close copy of his. There are differences of detail, but all the key ideas are present in his system. We were struck immediately by its basic soundness and simplicity, and by the relatively small amount of effort it required to get working and to maintain. That it worked was beyond dispute, for one of us (F.R.A.H.) used it to retrieve one of his own papers and related papers. The results appeared on the typewriter within seconds; a very effective demonstration.

Our objectives in setting up this system are as follows:

  1. to provide an information retrieval service for members of the Laboratory, visitors and users of remote consoles (when we have them) for the retrieval of information to be found in books and articles in certain journals in the Atlas Laboratory library relating to computers and their applications;
  2. to publish an account of our experiences in setting up and maintaining this system, including details of the man-months of effort involved and the efficiency of the retrieval.

It is still too early to write the final summary, and this will not be possible until the system is working on-line to Atlas via consoles; so this account should be regarded as a progress report, written approximately twelve months after we decided to begin work on the system.

II. The Information

(1) Books

For each book in the Atlas Laboratory library a series of cards is punched giving its author(s), title and publisher. Each card relating to a book begins with a seven-digit number which uniquely identifies the book. The first four digits indicate the subject matter of the book (according to a scheme developed elsewhere which we adopted), and digits five to seven give a serial number to the book within that class. An example will illustrate the format:

65600011 KEISTER, WILLIAM
6560001π RITCHIE, ALISTAIR, E
6560001π WASHBURN, SETH, H
65600012 THE DESIGN OF SWITCHING CIRCUITS
65600019 NEW YORK, VAN NOSTRAND, 1951

The card giving the first author (or editor, for a collection of papers etc.) has a 1 punched in column eight, subsequent authors being identified by a π punched in column eight. The title begins on a card having a 2 in column eight and continues as far as necessary on to cards having 3, 4, ... in column eight. Finally, a card with a 9 in column eight gives the publisher and this is the last card associated with that book.
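The card layout just described can be sketched in a modern language. The following Python fragment is only an illustration under stated assumptions: the names `Book` and `parse_book_cards` are hypothetical, each card is taken to be a text line with the text starting after column eight, and the system's own programs were written in Atlas machine code, not in this form.

```python
from dataclasses import dataclass, field


@dataclass
class Book:
    subject: str                 # digits 1-4: subject class
    serial: str                  # digits 5-7: serial number within the class
    authors: list = field(default_factory=list)
    title: str = ""
    publisher: str = ""


def parse_book_cards(cards):
    """Group card images into Book records keyed by the seven-digit number."""
    books = {}
    for card in cards:
        number, kind, text = card[:7], card[7], card[8:].strip()
        book = books.setdefault(number, Book(number[:4], number[4:]))
        if kind in ("1", "π"):       # first author, then subsequent authors
            book.authors.append(text)
        elif kind == "9":            # publisher: last card for the book
            book.publisher = text
        elif kind.isdigit():         # 2, 3, 4, ...: title and its continuations
            book.title = (book.title + " " + text).strip()
        else:
            raise ValueError(f"unknown card type {kind!r}")
    return books


cards = [
    "65600011 KEISTER, WILLIAM",
    "6560001π RITCHIE, ALISTAIR, E",
    "6560001π WASHBURN, SETH, H",
    "65600012 THE DESIGN OF SWITCHING CIRCUITS",
    "65600019 NEW YORK, VAN NOSTRAND, 1951",
]
book = parse_book_cards(cards)["6560001"]
print(book.subject, book.serial, book.authors)
print(book.title)
print(book.publisher)
```

Keying the records on the seven-digit number means the cards for one book need not arrive in order, although the sketch does assume that title continuation cards follow the first title card.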

(2) Articles in Journals

A similar method is employed here. We now have to indicate the name of the journal, volume number, year and page number, as well as author and title. The significance of the columns on the card is as follows:

Columns   Significance
1, 2      Journal identification number
3, 4      Year of publication
5-8       Page on which article begins
9-12      Volume number or edition
13, 14    Data type identification number

Columns 15, 16 are left blank (Atlas has an 8 character word, so this is convenient). The data types identified in columns 13, 14 are:

Symbol   Meaning
01       First author's name
         Subsequent joint authors
02       Title of article, first card
0/       Title continuation cards
+0       First reference to another article (details begin in column 17)
+&       Further references to other articles
+9       Total number of references to other articles
-0       First reference to this article by other articles
-&       Further references to this article by other articles

Entries of either of these last two data types are worked out automatically by the updating program; when a new article, X, is added to the list, all the articles to which it refers are looked up and their records are extended to include the new information that they have been referred to by article X. An example of data put into the journal updating program is:

00600084000301 WINDLEY, P. F.
00600084000302 TREES, FORESTS AND REARRANGING 
006000840003+0 005800710001
006000840003+& 005900010002
006000840003+& 025601340003
006000840003+& 006000150003
006000840003+9 0005

The meaning of this is that we have a record of a paper by P. F. Windley in The Computer Journal (= Journal 00), 1960, Volume 3, beginning on page 84, with the title "Trees, Forests and Rearranging". The paper contains five references; three are to articles in The Computer Journal and one to an article in The Journal of the Association for Computing Machinery (= Journal 02). The fifth article is not listed; it was in a journal which was not among those included in the system (though it has since been added).
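The decoding of a journal card can be sketched as follows. This is an illustrative Python fragment, not the updating program itself; `JournalCard` and `parse_journal_card` are hypothetical names, and the card is assumed to be a text line whose fixed columns are as tabulated above.

```python
from typing import NamedTuple


class JournalCard(NamedTuple):
    journal: str   # columns 1-2: journal identification number
    year: str      # columns 3-4: year of publication (two digits)
    page: str      # columns 5-8: page on which the article begins
    volume: str    # columns 9-12: volume number
    kind: str      # columns 13-14: data type
    text: str      # remainder of the card, surrounding blanks removed

    @property
    def key(self):
        """Twelve-character identifier of the article the card belongs to."""
        return self.journal + self.year + self.page + self.volume


def parse_journal_card(card):
    return JournalCard(card[0:2], card[2:4], card[4:8], card[8:12],
                       card[12:14], card[14:].strip())


title = parse_journal_card("00600084000302 TREES, FORESTS AND REARRANGING")
ref = parse_journal_card("006000840003+& 025601340003")
# A reference is itself a twelve-character key, so the same slicing decodes it.
cited = parse_journal_card(ref.text)

print(title.key, title.kind, title.text)
print(cited.journal, cited.year, cited.page, cited.volume)
```

The twelve-character key (journal, year, page, volume) serves both as the identifier of an article and as the form in which references to it are recorded, which is what makes the automatic cross-referencing possible.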

III. The Program

The program is really a collection of programs for loading, updating, sorting, editing and retrieval. When new books or journal articles are to be added to the system, the appropriate magnetic tapes are loaded, the new cards read in and the tapes updated. Each book gives rise to two records on the book/author tapes, viz.

  1. that a new book on the subject of X by author Y has been added to the library;
  2. that author Y has written a book on subject X.

Each journal article gives rise to a number of records on the journal tape, viz.

  1. that author Y has written an article in journal X, volume V, page P etc., with title T, with references R1, R2, ... Rm;
  2. that paper Ri has been referred to by author Y in an article in journal X etc.

In the world of information retrieval this last process is called "building up a citation index". It is the key to the whole system.
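The updating step that builds the citation index can be sketched in a few lines. The Python below is an illustration only: the in-memory dictionaries stand in for the tape records, the function name `add_article` is hypothetical, and the data is the Windley example from Section II.

```python
from collections import defaultdict

references = {}                 # article key -> keys it refers to ("+" entries)
cited_by = defaultdict(list)    # article key -> keys referring to it ("-" entries)


def add_article(key, cited):
    """Record a new article and extend the record of every article it cites."""
    references[key] = list(cited)
    for target in cited:
        if key not in cited_by[target]:
            cited_by[target].append(key)


# The four references that are listed for Windley's paper.
add_article("006000840003",
            ["005800710001", "005900010002", "025601340003", "006000150003"])

print(cited_by["025601340003"])
```

Each addition touches only the records of the articles cited, so the index grows incrementally as cards are read in and never needs to be rebuilt from scratch.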

As soon as the Atlas disc is fitted we intend to replace the tape files by disc sectors, although the tapes will remain as archive records.

IV. The Retrieval Process

The user can ask certain questions. If he wishes to know what books we have on, say, Group Theory, he can easily find out; if he wishes to know what books we have in the library written by author X, he can find that out just as easily. We have been able to do this for many months: it was, of course, very easy to implement once we had written the updating and editing programs.
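These two book queries amount to look-ups in a subject index and an author index, corresponding to the two records that each book gives rise to. The Python sketch below is illustrative only: the names `by_subject`, `by_author`, `books_on` and `books_by` are hypothetical, indexing every joint author is an assumption, and the single book is the example from Section II.

```python
from collections import defaultdict

# One book from the library: (seven-digit number, authors, title).
books = [
    ("6560001",
     ["KEISTER, WILLIAM", "RITCHIE, ALISTAIR, E", "WASHBURN, SETH, H"],
     "THE DESIGN OF SWITCHING CIRCUITS"),
]

by_subject = defaultdict(list)   # subject class (digits 1-4) -> titles
by_author = defaultdict(list)    # author surname -> titles

for number, authors, title in books:
    by_subject[number[:4]].append(title)
    for author in authors:
        surname = author.split(",")[0].strip()
        by_author[surname].append(title)


def books_on(subject_class):
    """Titles held under a four-digit subject class."""
    return by_subject.get(subject_class, [])


def books_by(surname):
    """Titles for which the named person is an author."""
    return by_author.get(surname.upper(), [])


print(books_on("6560"))
print(books_by("Ritchie"))
```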

The most interesting part of the retrieval process is associated with retrieval of journal articles. Let us suppose that the user wants to know if we have any references to articles on the numerical solution of integral equations. Since we do not use a key-word or abstract based retrieval system, how is the user to get going at all? The answer is that we assume that he knows at least one reference to a paper on the subject of integral equations. This is not very much to ask. Let us suppose that the user knows that Elliott has written a paper on this subject; the user then types in

Find author ELLIOTT

The machine responds with a list of all papers by ELLIOTT. Among these we find:

00630102000601 ELLIOTT, DAVID 
00630102000602 A CHEBYSHEV SERIES METHOD FOR THE 
0063010200060/ NUMERICAL SOLUTION OF FREDHOLM INTEGRAL EQUATIONS

If we wish we can now follow this reference up by asking

Give references from 006301020006

This will produce a list of the references given in the paper by Elliott, and we discover that there are twelve. Of these twelve, three obviously refer to integral equations and their references can be further recovered, and so on. A further line of retrieval is to make use of the citation index aspect of the system by making a request of the type

Give references to 006301020006

and a list of all papers which have referred to the paper by Elliott will be produced. Again one can ask

Give papers related to 006301020006

and the program will print out all those papers which have given one or more references to the papers cited by Elliott.
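The three journal requests can be expressed as simple operations on the reference lists. The following Python sketch is illustrative: the function names are hypothetical, the keys P1 to P7 are made-up placeholders and not records from the system, and excluding the starting paper from its own list of related papers is an assumption.

```python
# Made-up data: each placeholder key maps to the keys of the papers it cites.
refs = {
    "P1": ["P2", "P3"],
    "P4": ["P1"],
    "P5": ["P3", "P6"],
    "P7": ["P6"],
}


def references_from(refs, key):
    """'Give references from': the papers cited by the given paper."""
    return list(refs.get(key, []))


def references_to(refs, key):
    """'Give references to': the papers that cite the given paper."""
    return sorted(k for k, cited in refs.items() if key in cited)


def related_to(refs, key):
    """'Give papers related to': papers citing at least one paper it cites."""
    targets = set(refs.get(key, []))
    return sorted(k for k, cited in refs.items()
                  if k != key and targets & set(cited))


print(references_from(refs, "P1"))
print(references_to(refs, "P1"))
print(related_to(refs, "P1"))
```

Here P5 is related to P1 because both cite P3, even though neither cites the other; P7 is not, since it shares no reference with P1.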

In Kessler's scheme one can qualify the retrieval conditions by asking, for example, that only references later than 1962 are reported, or that only articles from certain journals be given, and so on. We intend to implement a similar system. Another important facility is one enabling the user to ask how many references he will be given if he asks a certain question, and if there are too many or too few he may wish to qualify or change the request before seeing any print-outs.
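Because the year and journal are part of every article key, qualifications of this kind reduce to a filter on the key itself. The Python below is a sketch under assumptions: the function name `qualify` and its parameters are hypothetical, two-digit years are taken to mean 19xx, and the keys are the references listed for Windley's paper in Section II.

```python
def qualify(keys, after_year=None, journals=None):
    """Keep only keys later than after_year and, if given, in the named journals."""
    selected = []
    for key in keys:
        journal = key[0:2]               # columns 1-2
        year = 1900 + int(key[2:4])      # columns 3-4, assumed to mean 19xx
        if after_year is not None and year <= after_year:
            continue
        if journals is not None and journal not in journals:
            continue
        selected.append(key)
    return selected


keys = ["005800710001", "005900010002", "025601340003", "006000150003"]

# Ask first how many references a request would produce, then list them.
later = qualify(keys, after_year=1958)
print(len(later), later)
print(qualify(keys, journals={"02"}))
```

Counting before printing is the same filter with only its length reported, so the user can tighten or relax the request before any list is produced.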

The great advantage of the citation index is that it allows retrieval to proceed forwards as well as backwards in time once a starting point has been established. It would be a difficult and expensive system to organise and maintain manually, but it is very easy and not expensive for a computer. Thus, the use of a citation index is natural in a computer-based information retrieval scheme and adds enormously to the retrieval potential.

One could clearly extend the set of retrieval commands, and we may do this later in the light of experience; but this will not be until the system is being used as it was always designed to be used, on-line from a console via a satellite computer and disc to Atlas.

V. The Cost

One of the attractions of Kessler's system is the remarkably low cost in terms of both quality and quantity of staff. Many schemes require highly qualified abstracters. Kessler got his system working on the 7090 at M.I.T. in little over a year with a total staff that never exceeded four: himself and two or three students. Our experience has been closely parallel. During the first year of the project only the following people have worked on it: Miss M. P. Richards (a student from Bristol College of Advanced Technology), August to December 1965; Miss E. Litherland (a student waiting to go to Oxford University), February to August 1966; Mrs. S. M. T. Harold (a junior programmer on the Atlas Laboratory staff), January to July 1966. In addition I gave perhaps one day a week to the project throughout the year. This low cost is all the more remarkable when it is pointed out that all the programming is in machine code and that Miss Richards and Miss Litherland had never programmed in machine code before.

In addition to the programming effort the data has to be punched up. At the moment the articles are punched from about a dozen journals, all related to computers. We decided upon these after studying which were the most quoted. The number can be extended at any time. Punching of the current journals and steadily working away at the backlog makes a background job for the data preparation section of the Atlas Laboratory. Punching of the backlog is going back to 1960 for all journals on the list, and when this is complete selected journals will be punched from 1959 backwards. The effort required has not been great. One punch girl has been on the project for about three months, and in this time has punched up the articles with references from about 150 issues of journals. Included among these are The Computer Journal, Mathematics of Computation, and The Journal of the A.C.M., which have been punched back to 1960.

The present state of the system is that it has nearly reached completion so far as the off-line use of the computer is concerned. The last phase, on-line retrieval, will require the implementation of the simple language outlined in paragraph IV, and this will not take long when the satellite and consoles have been added.

Reference

[1] The M.I.T. Technical Information Project, M. M. Kessler, Physics Today, March 1965, 28-36.

© Chilton Computing and UKRI Science and Technology Facilities Council webmaster@chilton-computing.org.uk