The Production of Articulated Subject Indexes by Computer

Janet Armitage

1968

Over the last three years, the Postgraduate School of Librarianship and Information Science at Sheffield University has established a research unit supported by grants from the Office of Scientific and Technical Information. Two of the projects under the direction of M. F. Lynch involve research into the use of computers in scientific information work. More specifically, they are concerned with some aspects of structure in scientific information. One project, described here, involved an analysis of the structure of English language phrases as they appear in a printed alphabetical subject index, as a basis for the automatic generation of indexes.

The other project, described by Miss J. M. Harrison in an accompanying paper, deals with certain problems in the manipulation of chemical structures by computers.

Many computer programs have been written for the production of indexes, but most of them produce permuted indexes, in which titles of articles are automatically arranged in alphabetical order under each significant word in the title [1]. Little attention has been paid to highly organized, manually produced subject indexes such as those to Chemical Abstracts.

Our work in Sheffield was based on an analysis of the Chemical Abstracts indexes, in an attempt to determine ways in which the information contained in the indexes could be used for the purpose of information retrieval [2].

In the manual production of indexes, there is a considerable amount of clerical work carried out by indexers. Having selected the phrase to be indexed, the indexer must then write down the same phrase often in five or six different ways in order to enter it under the various subject headings in the index. We wanted to find some way of automating this clerical procedure.

The Chemical Abstracts Subject Index is one of the most highly developed examples of a manually produced alphabetical subject index. A typical set of index entries, shown below, consists of a subject heading in bold face, followed by the remainder of the phrases, the modifications, indented beneath. If the first component of the modification is common to a number of phrases, it is printed once only for the first entry and the remaining phrases are further indented beneath. The function words (prepositions and connectives) are used as articulating points, that is, points at which the phrase can be broken down into component parts and displayed in a variety of ways in the printed index.

Melting points. (See also Freezing points; Softening points.): in analysis of pharmaceutical powders, 60: 5280g; atomic vibrations in relation to, 60: 14178f; crystal vacancies at, vapor pressure in relation to, 60: 7484h; detn. of, of ash, 60: 1495f; filter paper in purification of compds. for, 60: 6704d; hot stage for, 60: 14133d; pressure and, 60: 3497g; of stereoregular polymers by diffraction and crystallite properties, 60: 12124g; of elements, periodic system and, 60: 13895c; entropy of formation and, of refractory metal carbides, 60: 15218e

Analysis of 1,000 entries from Chemical Abstracts indexes revealed a logic by which indexers intuitively convert descriptive phrases into index entries, and from this we are able to suggest a model for the generation of index entries. The descriptive phrases should consist of noun phrases, capable of acting as subject headings, separated by function words.

Consider a descriptive phrase consisting of five noun phrases separated by four function words:

Advances in research in information retrieval by computer at Sheffield.

If a noun phrase, for example the third, is selected as the subject heading, the first component of the modification is chosen by taking the noun phrase and function word either from the left or from the right of the subject heading.

i.e. either

Information retrieval: research in

or

Information retrieval: by computer
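The choice of first component can be sketched in a few lines of Python. This is only an illustration of the rule just described, not the original program: the function name `first_components` and the representation of a phrase as two parallel lists are assumptions made for the example.

```python
def first_components(nouns, functions, i):
    """Return the candidate first components of the modification for heading nouns[i].

    nouns     -- the noun phrases of the descriptive phrase, in order
    functions -- the function words between them (one fewer than nouns)
    """
    candidates = []
    if i > 0:
        # Left neighbour: the preceding noun phrase followed by its function word.
        candidates.append(f"{nouns[i - 1]} {functions[i - 1]}")
    if i < len(nouns) - 1:
        # Right neighbour: the following function word and its noun phrase.
        candidates.append(f"{functions[i]} {nouns[i + 1]}")
    return candidates


# The example phrase, already split into its five noun phrases and four function words.
nouns = ["advances", "research", "information retrieval", "computer", "Sheffield"]
functions = ["in", "in", "by", "at"]

print(first_components(nouns, functions, 2))  # ['research in', 'by computer']
```

A heading at either end of the phrase has only one neighbour, so the function then returns a single candidate.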

Having chosen one noun phrase and function word, the next can be selected either from the left or from the right. Instead of selecting just one noun phrase and function word, it is possible to choose multiple sets such as two noun phrases and two function words. At each stage of building up the modification, there is a choice from left or right of the heading until one end of the phrase is reached, and a choice of single or multiple noun phrases. This results in the following eight possibilities* for the index entry under the subject heading, 'Information Retrieval'.

Information retrieval: research in, advances in, by computer at Sheffield
Information retrieval: research in, by computer, advances in, at Sheffield
Information retrieval: research in, by computer at Sheffield, advances in
Information retrieval: advances in research in, by computer at Sheffield
Information retrieval: by computer, advances in research in, at Sheffield
Information retrieval: by computer, research in, advances in, at Sheffield
Information retrieval: by computer at Sheffield, advances in research in
Information retrieval: by computer at Sheffield, research in, advances in

In the printed index, only one entry would appear for each subject heading within a phrase. The entry chosen depends on all the other phrases that are to appear under the subject heading in question. Where there are a number of possibilities for the first component of the modification, the component is chosen which occurs most frequently amongst all the other entries. This gives rise to the maximum possible organization in the display of the index.

For example, when the following phrases are indexed under 'Computer', 'information retrieval by' is chosen as the first component of the modification, since it is common to both phrases.

Indexing and information retrieval by computer, 001

Research in information retrieval by computer at Sheffield, 002

These would appear in the printed index under computer as

Computer: information retrieval by, indexing and, 001; research in, at Sheffield, 002
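The selection rule in this example can be sketched as a simple frequency count. The sketch below is an assumption-laden illustration rather than the program described in this paper: the names `first_components` and `choose_first_component` are hypothetical, the phrases are supplied already split into noun phrases and function words, and only single noun phrases (not multiple sets) are considered as candidates.

```python
from collections import Counter


def first_components(nouns, functions, i):
    """Candidate first components (left and right neighbours) for heading nouns[i]."""
    candidates = []
    if i > 0:
        candidates.append(f"{nouns[i - 1]} {functions[i - 1]}")
    if i < len(nouns) - 1:
        candidates.append(f"{functions[i]} {nouns[i + 1]}")
    return candidates


def choose_first_component(entries, heading):
    """Pick the candidate that occurs most often among all entries under the heading."""
    counts = Counter()
    for nouns, functions in entries:
        if heading in nouns:
            counts.update(first_components(nouns, functions, nouns.index(heading)))
    return counts.most_common(1)[0][0] if counts else None


# Phrases 001 and 002, pre-split into (noun phrases, function words).
entries = [
    (["indexing", "information retrieval", "computer"], ["and", "by"]),
    (["research", "information retrieval", "computer", "Sheffield"], ["in", "by", "at"]),
]

print(choose_first_component(entries, "computer"))  # information retrieval by
```

Here 'information retrieval by' is counted twice and 'at Sheffield' once, so the former is chosen and is printed only once in the index.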

A program has been written to produce an index from descriptive phrases, using SLIP, the list processing language written by Weizenbaum [4] and implemented on Atlas by Don Russell.

The phrases are input to the machine and a dictionary of prepositions is used to separate subject headings from prepositions. A list of excluded words prevents useless words from appearing as subject headings within the index. The phrases are organized in the machine in alphabetical order of subject headings, a reference being made from each subject heading to all the titles which contain that subject heading.
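The input stage can be illustrated with a short Python sketch. It is not the SLIP program itself: the word lists `FUNCTION_WORDS` and `EXCLUDED` and the function names are hypothetical, since the actual dictionaries are not reproduced here, and function words of more than one word (such as 'in relation to') are not handled.

```python
# Hypothetical word lists; the dictionaries used by the original program are not given.
FUNCTION_WORDS = {"in", "by", "at", "of", "and", "for", "to", "with", "on"}
EXCLUDED = {"research", "advances"}


def segment(phrase):
    """Split a phrase into noun phrases and the function words that separate them."""
    nouns, functions, current = [], [], []
    for word in phrase.rstrip(".").split():
        if word.lower() in FUNCTION_WORDS and current:
            # A function word closes the noun phrase collected so far.
            nouns.append(" ".join(current))
            functions.append(word.lower())
            current = []
        else:
            current.append(word)
    if current:
        nouns.append(" ".join(current))
    return nouns, functions


def build_heading_index(phrases):
    """Map each permitted subject heading to the references of the phrases containing it."""
    index = {}
    for ref, phrase in phrases.items():
        nouns, _ = segment(phrase)
        for noun in nouns:
            heading = noun.lower()
            if heading not in EXCLUDED:
                index.setdefault(heading, []).append(ref)
    # Alphabetical order of subject headings.
    return dict(sorted(index.items()))


phrases = {
    "001": "Indexing and information retrieval by computer",
    "002": "Research in information retrieval by computer at Sheffield",
}
for heading, refs in build_heading_index(phrases).items():
    print(heading, refs)
```

With these assumed lists, 'research' is suppressed as a heading, while 'computer' and 'information retrieval' each refer to both phrases.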

The entries under each subject heading in turn are analysed to determine the possible first components of the modifications. The components that occur most often among the entries are selected, so that the final printed index appears in the most highly organized form.

We are now testing the adequacy of this model for the production of indexes. Tests have been carried out on samples of 200 phrases in various subject fields. Future plans involve determining the feasibility of the system for large-scale production of indexes; also, investigation of the use of raw titles as input to the program and the possibility of rewriting automatically those titles which fail to conform to the required syntactic structure.

We gratefully acknowledge grants from the Office of Scientific and Technical Information for making the work possible, and particularly wish to thank the Director and Staff of the Atlas Computer Laboratory for the service provided and for their constant help.

References

1. Luhn, H. P. 'Keywords in context index for technical literature (KWIC index)', Am. Doc., 11, 288-295 (1960).

2. Lynch, M. F. 'Subject Indexes and Automatic Document Retrieval: The Structure of Entries in Chemical Abstracts Subject Indexes', J. Doc., 22, 167-185 (1966).

3. Armitage, J. E. and Lynch, M. F. 'Articulation in the Generation of Subject Indexes by Computer', J. Chem. Doc., 7, 170-178 (1967).

4. Weizenbaum, J. 'Symmetric List Processor', Comm. ACM, 6 (9), 524-544 (1963).

* Footnote: In general the total number of possible entries for a phrase consisting of n noun phrases separated by n-1 function words is given by the following formula, which represents also the alternate numbers in the Fibonacci series (the series in which each term is obtained by summing the two preceding terms):

This is an interesting observation but is of no practical value.