COCOA - A Word-Count and Concordance Generator

D B Russell

1965

Introduction

COCOA is a system which allows users to generate word-counts and concordances from literary (or other) texts. It was written originally for Atlas after consultation with various British Universities, and is currently being implemented for System 4-75 at Edinburgh.

The output from a COCOA word-count consists of a table containing every word in the author's vocabulary for that particular text, together with a number indicating how many times that word was used. This table is output three times in different orders: frequency ordering, with the most popular words first; alphabetic ordering, as in a conventional dictionary; and rhyme ordering, which is alphabetic on word endings. In addition a frequency profile table is produced showing how many words were used once each, twice each and so on.
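The four tables described above can be sketched in a few lines of modern Python. This is only an illustration of the idea, not COCOA's own code: the function names, the rule for splitting text into words, and the alphabetic tie-breaking in the frequency ordering are all assumptions.

```python
from collections import Counter
import re

def word_count(text):
    """Count each distinct word, ignoring case (tokenisation rule is assumed)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def frequency_order(counts):
    # Most frequent words first; ties broken alphabetically (an assumption).
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

def alphabetic_order(counts):
    # Conventional dictionary order.
    return sorted(counts.items())

def rhyme_order(counts):
    # Alphabetic on word endings: compare the words spelt backwards.
    return sorted(counts.items(), key=lambda kv: kv[0][::-1])

def frequency_profile(counts):
    # Maps n to the number of distinct words used exactly n times.
    return dict(sorted(Counter(counts.values()).items()))

sample = "to be or not to be that is the question"
counts = word_count(sample)
print(frequency_order(counts))
print(alphabetic_order(counts))
print(rhyme_order(counts))
print(frequency_profile(counts))
```

In the rhyme ordering, words sharing an ending fall next to one another because the comparison runs from the last letter backwards.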

The output from a COCOA concordance contains, for every occurrence of every word (or of a selected group of words), a line giving: a reference, e.g. HAMLET 1: 1: 173; and a limited amount of the context in which the word appears, i.e. a line, a sentence, or as much as possible. The printing of the context is adjusted on the line so that the indexed word appears in a column at the centre of the page.
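The alignment of the indexed word can be illustrated with a short Python sketch. It is a sketch under stated assumptions rather than a description of COCOA's printing routine: the function kwic_line, the field widths, and the SAMPLE references are all hypothetical, and here the indexed word starts at the centre column.

```python
def kwic_line(reference, words, index, width=40, ref_width=18):
    """Format one concordance line so that words[index] starts at a fixed column."""
    left_width = max(width // 2 - 1, 1)
    right_width = width - left_width - 1
    # Keep only as much context as fits on either side of the indexed word.
    left = " ".join(words[:index])[-left_width:]
    right = " ".join(words[index:])[:right_width]
    # Reference is left-justified; left context is right-justified against the centre.
    return f"{reference:<{ref_width}}{left:>{left_width}} {right}"

words = "the quick brown fox jumps over the lazy dog".split()
for i, w in enumerate(words):
    if w == "the":
        # Hypothetical reference; the last field here is just the word position.
        print(kwic_line(f"SAMPLE 1: 1: {i + 1}", words, i))
```

With the default widths the indexed word always begins 38 characters into the line, provided the reference is no longer than ref_width.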

A word-count is clearly a compression process: the amount of output is perhaps a tenth of the amount of input. A full concordance, by contrast, is an expansion process, producing output some ten times as large as the input. Thus a user is likely to want to be selective when concording lest he drown himself in output.

Motivation

If the input text is a list of titles of papers from a group of journals, then a concordance is nothing more nor less than a KWIC (Key Word In Context) Index, providing a crude form of Information Retrieval.
In philology it is useful to feed in works written during a particular period to concord examples of the usage of specified words.
A concordance is an aid towards compiling vocabularies for language text-books and indices for technical books.
Word-counts and concordances provide useful tools for quantifying literary style in attempts at establishing authorship.

An outline of facilities

Texts may be punched in any of the media acceptable to Atlas. Works written in non-Roman alphabets must be transliterated. The user may include his own comments in the punched text within square brackets. More importantly, he must include identification records enclosed in angle brackets. (It will be seen that square and angle brackets are special characters which cannot be used as transliterated letters.) The identification records will typically be copied from the headings of the original script, e.g.

<W SHAKESPEARE> <T HAMLET> <A 1> <S 1>

the W standing for Writer (not William), T for Title, A for Act, S for Scene, and so on. The choice of letters is arbitrary and is left to the individual user, except that L always stands for Line Number, which is automatically initialized and incremented by the system. These identification records allow COCOA to identify the different sections of text. By reference to these records, a user can program COCOA to select those sections of his archive which are appropriate for a particular study. He can also program COCOA to generate references for occurrences of words appearing in the output from a concordance.
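As a rough illustration of how identification records and comments might be separated from the text proper, consider the following Python sketch. The function read_text and the placeholder text lines are hypothetical, and the rule that any new identification record restarts the line number L is an assumption; the account above says only that L is initialized and incremented automatically.

```python
import re

RECORD = re.compile(r"<([A-Za-z])\s+([^>]*)>")   # e.g. <T HAMLET>
COMMENT = re.compile(r"\[[^\]]*\]")              # user comments in square brackets

def read_text(lines):
    """Yield (identifiers, text) for every line that carries text."""
    ids = {}
    for raw in lines:
        line = COMMENT.sub("", raw)
        for letter, value in RECORD.findall(line):
            ids[letter] = value.strip()
            ids["L"] = 0  # assumption: a new record restarts line numbering
        text = RECORD.sub("", line).strip()
        if text:
            ids["L"] = ids.get("L", 0) + 1
            yield dict(ids), text

punched = [
    "<W SHAKESPEARE> <T HAMLET> <A 1> <S 1>",
    "first line of text [a comment]",
    "second line of text",
    "<S 2>",
    "first line of the next scene",
]
records = list(read_text(punched))

# Select one section of the archive by its identifiers.
scene_two = [text for ids, text in records if ids["T"] == "HAMLET" and ids["S"] == "2"]
print(scene_two)
```

Each yielded dictionary holds everything needed to build a reference for a line, and the same dictionaries can be tested to select sections for a particular study.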

A sensible first run on newly punched data is a word-count, since this produces a relatively small amount of printing, and yet it provides the user with all the information he needs to restrict the printing from any subsequent concordances to proportions he can handle. Moreover if he looks through the list of words which appear only once or twice he will probably find amongst the rarely used words some common words mis-spelt. A subsequent concordance of these mis-spelt words would locate their exact positions in the text ready for correction.

The output from a full concordance of the works of Shakespeare would certainly stand higher and would probably weigh heavier than any prospective COCOA user. COCOA therefore provides certain controls for limiting the output from a concordance to manageable proportions. Either the user may provide a list of words of particular interest, asking for a concordance including only those words; or he may provide a list of common uninteresting words and request a concordance of all words except those cited; or else he may provide a range of frequencies which interest him, e.g. concord all words appearing more than 10 times but fewer than 100 times. Frequency control can provide dramatic savings since the most common 2% of the vocabulary is likely to account for over half the text. Along with the frequency range the user provides an alphabetic range, asking for words beginning with A, B, C or D (say). Alphabetic control also permits a large concordance to be broken up into several short runs.
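The selection controls just described amount to a filter over the word-count table. The Python sketch below shows one way to express them. The function select_words and its parameter names are hypothetical, the counts are invented for the example, and combining the controls in a single call is a simplification, since the description offers the word lists and the frequency range as alternatives.

```python
from collections import Counter

def select_words(counts, include=None, exclude=None,
                 more_than=None, fewer_than=None, first=None, last=None):
    """Return the words to concord, in alphabetic order."""
    chosen = []
    for word, n in sorted(counts.items()):
        if include is not None and word not in include:
            continue  # only the listed words of interest
        if exclude is not None and word in exclude:
            continue  # all words except the listed common ones
        if more_than is not None and n <= more_than:
            continue  # frequency range, lower bound exclusive
        if fewer_than is not None and n >= fewer_than:
            continue  # frequency range, upper bound exclusive
        if first is not None and word[:1] < first:
            continue  # alphabetic range on the initial letter
        if last is not None and word[:1] > last:
            continue
        chosen.append(word)
    return chosen

# Invented counts, for illustration only.
counts = Counter({"and": 120, "brave": 12, "crown": 40, "the": 300, "zeal": 11})

print(select_words(counts, more_than=10, fewer_than=100))
print(select_words(counts, more_than=10, fewer_than=100, first="a", last="d"))
```

Running the same frequency range with successive alphabetic ranges (A to D, E to H, and so on) is how a large concordance could be split into several short runs.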

In generating a concordance, COCOA also allows user control over the order in which the occurrences of each word are printed. The order may be decided either by the reference, which reproduces the sequence found in the text, or by the context to the immediate right or left of the indexed word. Context sorting is a help for those linguists who wish to study how words are used together.
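The three orderings can be sketched in Python as three different sort keys over the same list of occurrences. The names below are hypothetical, the word position stands in for a full reference, and comparing the left context nearest word first is an assumption about what sorting on the immediate left context means.

```python
def occurrences(words, keyword):
    """List (position, left context, right context) for each use of keyword."""
    return [(i, words[:i], words[i + 1:])
            for i, w in enumerate(words) if w == keyword]

def ordered(occ, by="reference"):
    keys = {
        "reference": lambda o: o[0],       # same sequence as in the text
        "right": lambda o: o[2],           # words following the indexed word
        "left": lambda o: o[1][::-1],      # words preceding it, nearest first (assumed)
    }
    return sorted(occ, key=keys[by])

words = "the cat saw the dog and the ant".split()
occ = occurrences(words, "the")

for by in ("reference", "right", "left"):
    print(by, [position for position, _, _ in ordered(occ, by)])
```

Sorting on the right context brings together occurrences followed by the same word, which is what makes recurring phrases easy to spot.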

Assessment

It is clear from users' comments and suggestions that COCOA could be improved even if its original aims were not extended; but the fact that in its first six months of operation it has been used for studies of works in at least six languages indicates that it provides a worthwhile tool for linguists.