The Computation of Collocations and their Relevance in Lexical Studies

G L M Berry-Roghe

April 1972

Atlas

The terms collocation and collocability were first introduced by J R Firth in his paper Modes of Meaning, published in 1951. Firth does not give any explicit definition of collocation; rather, he illustrates the notion by way of such examples as: 'One of the meanings of ass is its habitual collocation with an immediately preceding you silly...' Although some of his other contributions to linguistic and stylistic analysis (such as prosodic features) have had a considerable impact, his notion of collocation was not seriously considered until the last decade. The reasons for this neglect are probably twofold: on the one hand, the rather vague terms in which he described the notion (cf. Haskel 1971) and, on the other hand, the practical restrictions imposed by the prohibitive scale of a textual study of collocability. The latter drawback has been remedied by the introduction of the digital computer in textual analysis. As to the former, scholars have recently made several attempts to define the notion of collocation more precisely within the framework of modern linguistic theory. For relevant theoretical discussions on this topic, the reader is referred to the bibliography (Haas 1966, Halliday 1961, 1966, Lyons 1966, McIntosh 1966, Sinclair 1966, Van Buren 1967).

The present paper reports on a pilot study attempting to make explicit the notion of collocation in statistical and computational terms. Firth himself stressed that collocation was not to be equated with mere co-occurrence of lexical items: he always used the phrase habitual or usual collocation, implying a gradation in collocability among the set of words which are found to occur in the environment of a particular item. As a technically more workable definition of collocation we borrowed the one formulated by Halliday (1961) as:

'the syntagmatic association of lexical items, quantifiable, textually, as the probability that there will occur at n removes (a distance of n lexical items) from an item x, the items a, b, c ...'

In other words, the aim is to compile a list of those syntagmatic items (collocates) significantly co-occurring with a given lexical item (node) within a specified linear distance (span). Significant collocation can be defined in statistical terms as the probability of the item x co-occurring with the items a, b, c ... being greater than might be expected from pure chance. Consider the following statistical data as being given:

   Z    total number of words in the text
   A    a given node occurring in the text Fn times
   B    a collocate of A occurring in the text Fc times
   K    number of co-occurrences of B and A
   S    span size, i.e., the number of items on either side of the node considered as its environment

First, the probability of B co-occurring K times with A must be computed, on the assumption that B is distributed randomly in the text. Next, the difference between the expected number of co-occurrences and the observed number of co-occurrences must be evaluated. The probability of B occurring at any place where A does not occur is expressed by:

   p = Fc / (Z-Fn)

The expected number of co-occurrences is given by:

   E = p.Fn.S

The problem is to decide whether the difference between observed and expected frequencies is statistically significant. This can be done by means of computation of the z-score as a normal approximation to the binomial distribution (Hoel 1962) using the formula:

   z = (K-E) / sqrt(Eq)      (q = 1-p)
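The three formulas above can be sketched in a few lines of Python. This is only an illustration of the arithmetic as stated, not the original Atlas Autocode program; the function name and the example figures are assumptions, and the sketch does not attempt to reproduce the values in Table 1. It also assumes Fc > 0, since E = 0 would make the z-score undefined.

```python
import math

def collocation_z(Z, Fn, Fc, K, S):
    """Return (p, E, z) for one node/collocate pair.

    Z  total number of words in the text
    Fn frequency of the node, Fc frequency of the collocate (assumed > 0)
    K  observed co-occurrences, S span size as defined in the text
    """
    p = Fc / (Z - Fn)                # chance of the collocate at a non-node position
    E = p * Fn * S                   # expected number of co-occurrences
    q = 1.0 - p
    z = (K - E) / math.sqrt(E * q)   # normal approximation to the binomial
    return p, E, z

# Illustrative figures only (not taken from the study).
p, E, z = collocation_z(Z=100_000, Fn=100, Fc=50, K=5, S=4)
print(f"p = {p:.6f}, E = {E:.4f}, z = {z:.2f}")
```

A large positive z indicates that the collocate appears near the node far more often than chance would predict; a negative z indicates that it appears less often.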

This formula has proved highly satisfactory in that it yielded a gradation among collocates which largely corresponded with our semantic intuitions. It is, however, by no means proposed as the ultimate one. Further studies on a larger scale than the present one and possibly embodying a greater variety of texts might well require modifications to it. Thus, for some nodes a Poisson distribution would be more appropriate than a binomial one. Another flaw in the formula is that it does not account for the possibility of a node co-occurring with itself within the specified span. This possibility was temporarily dismissed on account of its relative rarity. But other studies with a stylistic rather than a lexical inclination might find such aspects highly relevant. It is nevertheless hoped that the subsequent illustrations of the application of the formula will prove its relative validity. But first a few words must be said about the nature of the data.

In order to obtain a fairly comprehensive picture of the collocational relations of lexical items a very large corpus would have to be processed. Halliday (1966) quotes the figure of some 20 million running words! It is obviously not feasible even with our largest computers to process a corpus of this size. (Even the Brown University corpus of American English has a limit of 1 million words.) The number of running words processed in the present pilot study amounted to 71,595. This figure is of course far off the 20 million mark but it proved sufficient for an initial methodological investigation. The data consisted of a 19th century prose work and two modern plays, namely A Christmas Carol by Charles Dickens, Each His Own Wilderness by Doris Lessing and Everything in the Garden by Giles Cooper. The choice of these texts was motivated by their availability in machine readable form rather than by any stylistic considerations. Prior to the analysis a concordance and statistical information on each text (such as a frequency profile, average sentence length ...) were obtained. The three texts were subsequently conflated, however, as the size of each separately did not warrant results of any statistical significance. Thus, the aim of the study was not in the first place stylistic, but rather methodological, being concerned with answering such questions as 'What is the optimal span size?' and 'Should grammatical items be ignored?'. For computational purposes, the lexical unit was defined as the graphic word, i.e., a sequence of characters delimited by specified word boundary markers such as space and the punctuation marks, thus adopting the same conventions used in the statistical analysis of American English at Brown University (Kucera and Francis 1967).
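The definition of the graphic word can be illustrated with a short Python sketch. The paper does not list its exact set of boundary markers, so the character class below is an assumption, as is the folding of words to upper case (suggested only by the capitalised entries in Table 1).

```python
import re

# Assumed boundary markers: whitespace plus common punctuation marks.
# The study's actual marker set is not given, so this is illustrative only.
BOUNDARY = re.compile(r'[\s.,;:!?()"]+')

def graphic_words(text):
    """Split text into upper-cased graphic words, dropping empty strings."""
    return [w.upper() for w in BOUNDARY.split(text) if w]

print(graphic_words("This house is full of Amazons ..."))
```

Under this definition a hyphenated form such as four-roomed stays a single unit, because the hyphen is not among the assumed markers.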

The texts were processed on the Atlas computers at Manchester and Chilton. The initial stages of the program - written in Atlas Autocode - are similar to an ordinary concordance program where the context is limited to the specified span. A particular keyword being selected as node, all items occurring within the span are conflated into an alphabetical list and their number of co-occurrences with the node is counted. This list is then tested against a previously compiled dictionary consisting of an alphabetic word-frequency table for the entire text. The total number of occurrences of each collocate is subsequently recorded, after which the z-scores are computed. As initial node the item house was chosen, partly because the word count had shown it to occur with relatively high frequency (83 instances) and partly because our intuitions about the syntagmatic bonds of this item are fairly clear, thus providing a valuable touchstone for an automatic analysis. The initial collocation set consisted of all graphic words occurring within a span of three items on either side of the node house, regardless of intervening sentence boundaries. The small size of the textual sample compelled us to discard all collocates co-occurring with house only once. This decision caused some valuable collocations to be lost (such as four-roomed house) but, conversely, the inclusion of the item Amazons (occurring in the context: This house is full of Amazons ...) was avoided. Table 1 contains a listing, in order of absolute frequency of co-occurrence, of all items collocating at least twice with house, including indications of their total frequency in the data (Fc), their observed (K) and expected frequency of co-occurrence (E) and the respective z-scores. As might have been expected, those items co-occurring most frequently with house are grammatical. Table 2 is a re-ordering of table 1 in terms of relative collocational significance.
Due to the limitations of the sample the order established is obviously not fully representative of the collocational behaviour of house in the English Language as a whole. It is, for example, highly likely that in a larger sample sold would occur more frequently in other environments than house and that Commons would stand out more clearly as forming part of the idiom House of Commons.
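The counting stage described above (select a node, gather every item within the span, count co-occurrences, discard items seen only once) can be sketched as follows. This is a minimal Python illustration, not a reconstruction of the Atlas Autocode program; the function name and the toy token list are assumptions, and the second clause of the toy text is invented for the example.

```python
from collections import Counter

def collocate_counts(tokens, node, span):
    """Count items within `span` positions on either side of each node
    occurrence, ignoring sentence boundaries as in the study."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        lo = max(0, i - span)
        hi = min(len(tokens), i + span + 1)
        for j in range(lo, hi):
            if j != i:                     # skip the node position itself
                counts[tokens[j]] += 1
    return counts

# Hypothetical toy text, already split into graphic words.
tokens = "THIS HOUSE IS FULL OF AMAZONS AND THE HOUSE IS EMPTY".split()
co = collocate_counts(tokens, "HOUSE", span=3)

# Discard collocates that co-occur with the node only once.
kept = {word: k for word, k in co.items() if k >= 2}
print(kept)
```

The whole-text frequency table needed for Fc could be built in the same way with `Counter(tokens)`.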

Where the significance limit should be drawn is in the last resort subject to the judgement of the individual investigator and to the purpose of his study. At the 1% level of significance (two-tailed) statistical tables put this limit at a z-score of 2.576 (Spiegel 1961). This figure was adopted as a working hypothesis for the present study but is not necessarily applicable to other types of data and corpus sizes. By drawing this particular limit we run the risk of excluding from consideration unusual but creative collocations (an example being Dickens's use of the adjective young with house) along with obviously irrelevant ones. Similarly, those items with a negative z-score are excluded, although they might possibly be deemed to be of particular stylistic interest as they seem to repel the node. However, it is our view that unusual collocation needs to be explained with reference to an explicit definition of usual collocation. The present study is an attempt to investigate how the latter can best be established.
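The re-ordering by significance and the application of the 2.576 cut-off can be illustrated with a handful of rows taken from Table 1. The Python sketch below is only an illustration; the variable names are assumptions.

```python
# z-scores for a few collocates of "house", copied from Table 1.
z_scores = {
    "THE": 3.2278,
    "THIS": 13.3937,
    "A": 0.9316,
    "SOLD": 24.0500,
    "YOU": -2.6034,
    "COMMONS": 21.2416,
}

THRESHOLD = 2.576  # significance limit adopted in the study

# Order by decreasing z-score, then keep only items at or above the limit.
ranked = sorted(z_scores.items(), key=lambda item: item[1], reverse=True)
significant = [word for word, z in ranked if z >= THRESHOLD]
print(significant)
```

Items such as YOU, with a strongly negative z-score, fall out at this step even though they are among the most frequent words in the vicinity of the node.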

The next question to be considered is how adequate the chosen span size was. Table 3 shows how an increase of the span size from 3 to 6 affects the sets of statistically significant collocates. All newly introduced collocates are highlighted. Some of these, such as enter, rooms, fronts, garden ... seem indeed highly relevant, whereas others such as God, Bernard, stop are to be considered intruders. An investigation at this point of the respective concordances and statistical data on each text showed that the stylistic nature of the corpus is highly relevant in defining the optimal span size. Increasing the span size resulted in effect in introducing desirable collocates from A Christmas Carol but undesirable ones from the plays. The obvious explanation lies in the marked difference between the mean sentence length, which amounts to 14.03 in A Christmas Carol but only to 6.7 in the modern plays. Although by no means all significant collocates occur solely within the sentence boundaries, the majority nevertheless do. However, a general policy concerning the span size to be adopted in this pilot study could not be based upon observations of the behaviour of a single node. Other items of medium frequency were similarly tested, including verbs, adverbs and adjectives. As it proved that the majority of significant collocates do appear in the immediate vicinity of their nodes, it seemed appropriate to adopt provisionally a span of 4 for both types of data and for all nodes which were non-grammatical items, except in the case of adjectives, where a span of only 2 seemed indicated.
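The effect of widening the span can be examined mechanically by comparing the sets of significant collocates at successive span sizes. The Python sketch below does this for the span 3 and span 4 columns of Table 3; the collocate lists are copied from that table, while the variable names are assumptions.

```python
# Significant collocates of "house" at span 3 and span 4, from Table 3.
span3 = {
    "sold", "commons", "decorate", "this", "empty", "buying", "painting",
    "opposite", "loves", "outside", "lived", "family", "full", "remember",
    "my", "into", "the", "has",
}
span4 = {
    "sold", "commons", "decorate", "this", "empty", "buying", "painting",
    "opposite", "loves", "entered", "near", "outside", "lived", "remember",
    "rooms", "flat", "big", "Bernard", "family", "my", "full", "into",
    "the", "every", "Mrs.",
}

introduced = sorted(span4 - span3)   # collocates that appear only at span 4
dropped = sorted(span3 - span4)      # collocates that fall below the limit

print("introduced:", introduced)
print("dropped:", dropped)
```

Whether an introduced item is relevant (rooms, flat) or an intruder (Bernard, Mrs.) remains a judgement for the investigator; the set difference only isolates the candidates.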

Much more could be said at this point about further stylistic differences between the two types of data as well as about the collocational relevance of grammatical items. A case might be put forward for considering as potential collocates only those items which stand in a grammatical relation to the node. Generally speaking, this would have the effect of including only relevant items. However, unless a very sophisticated grammatical analysis which takes into account anaphoric reference is applied a great many important collocates might be lost. Consider, for example, the following quote from A Christmas Carol. It is hard to envisage at the moment a grammatical model which would be powerful enough to relate to the node face all underlined items which we felt to be lexically related to it:

Marley's face. It was not in impenetrable shadow as the other objects in the yard were, but had a dismal light about it, like a bad lobster in a dark cellar. It was not angry or ferocious, but looked at Scrooge as Marley used to look with ghostly spectacles turned up on its ghostly forehead. The hair was curiously stirred, as if by breath or hot air; and, although the eyes were wide open, they were perfectly motionless. That, and its livid colour, made it horrible, but its horror seemed to be in spite of the face and beyond control, rather than part of its own expression.

As such a sophisticated computerised syntactic analysis is not yet available, it is hoped that the procedures proposed here will nevertheless yield some valuable results in semantic and stylistic studies of this kind.

Finally, we should like to anticipate the next stage in our study of collocation. The eventual aim of a collocational analysis is not just to establish sets of syntagmatically related items but to extend these to include paradigmatically related items so that eventually a semantic field might be established. A program is being designed which will take as input the set of collocates of a given node and examine their collocational relations with other items, thus forming a network of semantically related items. Starting with the node house, a preliminary attempt was made to simulate manually such an automatic scan, making use of available collocational sets and concordances. The results, which are plotted in Table 4, embody a provisional semantic field of habitation terms. Considering that this field emerged from as small a sample as some 72,000 running English words, it seems, apart from some obvious gaps, intuitively quite acceptable. The methods outlined in this paper, when applied to a corpus of considerable size, could lead to the establishment of a thesaurus of English, based on an objective and consistent analysis of how words are actually used rather than on subjective intuitions. Other stylistic applications to the analysis of the vocabulary (or even conceptual world) of an individual author are easy to envisage.
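The planned scan, which starts from a node, takes its collocates and then examines the collocates of those collocates, amounts to a breadth-first walk over a collocation map. The Python sketch below shows one possible reading of that idea. The program described in the paper was still being designed, so the function, its parameters and the toy collocation sets are all hypothetical and are not taken from Table 4.

```python
from collections import deque

def expand_network(collocates_of, start, max_depth):
    """Breadth-first walk from `start`: node -> collocates -> their collocates.

    Returns a dict mapping each reached item to its distance from `start`,
    and the list of (item, collocate) links that were followed.
    """
    depth = {start: 0}
    queue = deque([start])
    edges = []
    while queue:
        item = queue.popleft()
        if depth[item] == max_depth:
            continue                       # do not expand beyond max_depth
        for c in sorted(collocates_of.get(item, ())):
            edges.append((item, c))
            if c not in depth:
                depth[c] = depth[item] + 1
                queue.append(c)
    return depth, edges

# Hypothetical collocation sets, for illustration only.
toy = {
    "house": {"empty", "garden", "enter"},
    "empty": {"house", "room"},
    "garden": {"house", "flat"},
    "enter": {"house", "hall"},
}
depth, edges = expand_network(toy, "house", max_depth=2)
related = sorted(w for w, d in depth.items() if d == 2)
print(related)
```

Items reached at distance 2 share a collocate with the starting node, which makes them candidates for paradigmatic relation to it.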

TABLE 1: Collocates of house in order of frequency of co-occurrence. (Span = 3 items on either side of the node.)

Collocate	K	Fc	E	z-score
THE	35	2368	20.6315	3.2278
THIS	22	252	2.1955	13.3937
A	15	1358	11.5661	0.9316
OF	13	1163	10.1327	0.9096
I	12	1674	14.5849	-0.6865
IN	12	843	7.3447	1.7299
IT	9	1193	10.3941	-0.4368
MY	8	271	2.3611	3.6780
IS	7	362	3.1539	2.1721
HAVE	7	403	3.5111	1.8682
TO	7	1482	12.9121	-1.6660
SOLD	6	7	0.0609	24.0500
YOU	5	1711	14.9073	-2.6034
AND	4	1568	13.6614	-2.6488
BUT	4	383	3.3369	0.3641
COMMONS	4	4	0.0385	21.2416
FOR	4	502	4.3737	-0.1794
HIS	4	468	4.0775	-0.0385
INTO	4	92	0.8015	3.5792
NOT	4	413	3.5983	0.3221
ONE	4	244	2.1258	1.2879
'RE	4	147	1.2807	2.5999
WELL	4	254	2.2130	-0.1209
ALL	3	366	3.1888	-0.1060
COULD	3	116	1.4462	1.9887
DECORATE	3	3	0.0261	19.9000
EMPTY	3	7	0.0609	11.9020
HAS	3	67	0.5837	2.9359
OUT	3	181	1.5769	1.1348
WAS	3	572	4.9836	-0.0890
BE	2	363	3.1626	-0.6557
BEEN	2	134	1.1674	0.7713
BEFORE	2	67	0.5837	1.6451
LIKE	2	188	1.6379	0.2833
EVERY	2	59	0.5140	2.0736
ABOUT	2	168	1.3522	0.5363
BUYING	2	4	0.0348	10.5270
DID	2	146	1.2729	0.6462
DO	2	374	3.2585	-0.6993
FAMILY	2	20	0.1742	4.3744
FULL	2	25	0.2178	3.8209
GET	2	101	0.8799	1.1949
GHOST	2	89	0.7754	1.3916
IF	2	268	2.3349	-0.2197
KNOW	2	247	2.1520	-0.1038
LIVED	2	13	0.1132	5.6067
LOVES	2	10	0.0871	6.4811
MORE	2	90	0.7841	1.3740
MOTHER	2	129	1.1239	0.8558
MUCH	2	93	0.8102	1.3227
MYRA	2	73	0.6360	1.7232
NICE	2	54	0.4704	2.3208
ONLY	2	60	0.5227	2.0441
OTHER	2	74	0.6447	1.6889
OPPOSITE	2	6	0.0522	8.5192
OUTSIDE	2	12	0.1045	5.8626
PAINTING	2	4	0.0348	10.5270
PEOPLE	2	121	1.0542	0.9220
UP	2	201	1.7512	1.8829
REMEMBER	2	26	0.0226	3.9425
SEE	2	127	1.1065	0.8503
SOMETHING	2	67	0.5837	1.9026
THAT	2	758	6.6041	-1.8030
THEY	2	327	2.8490	-0.5175
TONY	2	86	0.7492	1.4459
THERE	2	95	0.9877	1.2896
WITH	2	431	4.0012	-0.9090
YEARS	2	50	0.4252	2.3712
YES	2	345	3.0900	-0.5818

TABLE 2: Collocates of house in decreasing order of significance.

SOLD      24.0500   MORE    1.3740 
COMMONS   21.2416   MUCH    1.3927 
DECORATE  19.9000   WHERE   1.2896 
THIS      13.3937   ONE     1.2879 
EMPTY     11.9090   GET     1.1949 
BUYING    10.5970   OUT     1.1348 
PAINTING  10.5970   OR      0.9316 
OPPOSITE  8.5192    PEOPLE  0.9440 
LOVES     6.4811    OF      0.9096 
OUTSIDE   5.8626    MOTHER  0.8558 
LIVED     5.6067    SEE     0.8503 
FAMILY    4.3744    BEEN    0.7713 
REMEMBER  3.9425    DID     0.6462 
FULL      3.8209    ABOUT   0.5363 
MY        3.6780    BUT     0.3641 
INTO      3.5792    NOT     0.3221 
THE       3.2978    LIKE    0.2833 
HAS       2.9359    HIS    -0.0385 
'RE       2.5333    WAS    -0.0890 
NICE      2.3908    KNOW   -0.1038 
YEARS     2.3712    ALL    -0.1060 
IS        2.1721    WELL   -0.1909 
EVERY     2.0736    YES    -0.5818 
ONLY      2.0441    WITH   -0.9090 
COULD     1.9807    ABOUT  -0.3805 
SOMETHING 1.9026    FOR    -0.1794 
UP        1.8829    IF     -0.2197 
MYRA      1.7239    IT     -0.4368 
OTHER     1.6889    THEY   -0.5175 
IN        1.7299    I      -0.6865 
HAVE      1.8689    BE     -0.6557 
BEFORE    1.6451    DO     -0.6993 
TONY      1.4459    TO     -1.6660 
GHOST     1.3916    THAT   -1.8030 
                    YOU    -2.6034 
                    AND    -2.6488

TABLE 3: Significant collocates of house for span sizes 3 to 6.

         Span = 3                                Span = 4 
collocate   K    Fc    z-score         collocate   K    Fc    z-score 

sold        6     7    24.0100         sold        6     7    20.7566 
commons     4     4    21.2416         commons     4     4    18.3415 
decorate    3     3    19.8000         decorate    3     3    15.8837 
this       22   252    13.3337         this       22   252    10.7863 
empty       3     7    11.8090         empty       3     7    10.2360 
buying      2     4    10.5970         buying      2     4     9.0697 
painting    2     4    10.5970         painting    2     4     9.0697 
opposite    2     6     8.5192         opposite    2     6     7.5951 
loves       2    10     6.4811         loves       2    10     5.5975 
outside     2    12     5.8626         entered     2    10     5.5975 
lived       2    13     5.6067         near        2    11     5.2373 
family      2    20     4.3744         outside     2    12     4.9038 
full        2    25     3.8209         lived       2    13     4.7583 
remember    2    26     3.9425         remember    3    26     4.9102 
my          8   271     3.6780         rooms       2    15     4.3255 
into        3    92     3.5792         flat        2    18     3.8170 
the        35  2368     3.2978         big         2    19     3.7878 
has         2    50     2.9359         Bernard     2    20     3.6876 
                                       family      2    20     3.6676 
                                       my          9   271     3.3055 
                                       full        2    23     3.2845 
                                       into        3    92     2.8326 
                                       the        42  2368     2.3182 
                                       every       3    59     2.7971 
                                       Mrs.        2    29     2.6713



             Span = 5                             Span = 6 
collocate   K    Fc    z-score         collocate   K    Fc    z-score 

 
sold        7     7    21.6383         sold        8     7    22.5581 
commons     4     4    16.3571         commons     4     4    14.8871 
decorate    3     3    14.1356         decorate    3     3    13.8456 
fronts      2     2    11.5635         fronts      2     2    10.5971 
this       22   252     9.6080         cracks      2     2    10.5971 
empty       3     7     9.0914         this       22   252     8.4908 
buying      2     4     8.0577         empty       3     7     8.2410 
painting    2     4     8.0577         buying      2     4     7.3117 
opposite    2     6     6.1350         painting    2     4     7.3117 
loves       2    10     4.8677         opposite    2     6     5.8695 
entered     2    10     4.8677         loves       2    10     4.3741 
near        2    11     4.6050         entered     2    10     4.3741 
outside     2    12     4.6050         black       2    12     3.9168 
black       2    12     4.3742         near        2    11     4.1308 
remember    3    26     4.2689         outside     2    12     4.1308 
lived       2    13     4.1691         remember    2    26     3.7847 
rooms       2    15     3.9122         lived       2    13     3.7118 
garden      2    17     3.5230         rooms       2    15     3.6698 
flat        2    18     3.4019         God         5    64     3.6806 
big         2    19     3.2829         stop        3    27     3.2550 
into        5    92     3.2795         garden      2    17     3.1308 
God         4    64     3.1869         flat        2    18     3.0115 
family      2    20     3.1728         every       4    59     2.9325 
Bernard     2    20     3.1728         big         2    19     2.9009 
my          9   271     2.8011         my         11   271     2.8854 
full        2    25     2.7980         into        5    92     2.8849 
                                       family      2    20     2.7310 
                                       Bernard     2    20     2.7310 
                                       whole       2    23     2.6771

TABLE 4: The Structure of the Semantic Field of "habitation terms"

          house room office home place flat hall chamber hut cellar warehouse grocer's world
live in/at  *     *          *     *     *           *    *                              *

leave       *     *          *     *          *

door        *     *                           *                 *       *

empty       *     *                                                     *

window      *          *                                  *

close                  *           *                                            *

dark              *                                  *          *

garden      *                           *

move                        *           *

let out           *    *

furnished         *    * 

stay in/at                 *      *

wall                   *                             *

pass              *                                                             *

enter       *                                  *

full        *                                                                            *

REFERENCES

Cooper, Giles (1963) Everything in the Garden New English Dramatists - 7, pp 141-221. London: Penguin

Dickens, Charles (1922) A Christmas Carol. London: Macmillan

Firth, J.R. (1957) Modes of Meaning. Papers in Linguistics 1934-51, pp. 190-215. Oxford University Press.

Firth, J.R. (1968) A Synopsis of Linguistic Theory 1930-55. Selected Papers of J.R. Firth 1952-59 (ed. Palmer, F.R.) London: Longman.

Haas, W. (1966) Linguistic Relevance. In Memory of J.R. Firth, pp. 116-148 (ed. Bazell, C.E. et al) London: Longman

Halliday, M.A.K. (1961) Categories of the Theory of Grammar. Word, 17, 241-92

Halliday, M.A.K. (1966) Lexis as a Linguistic Level. In Memory of J.R. Firth, pp.148-63 (ed. Bazell, C.E. et al) London: Longman

Haskel, Peggy (1971) Collocations as a Measure of Stylistic Variety. The Computer in Literary and Linguistic Research, pp 159-169. (ed. Wisbey, R.A.) Cambridge University Press

Hoel, P.G. (1962) Introduction to Mathematical Statistics New York: Wiley.

Kucera, H. and Francis, W.N. (1967) Computational Analysis of present-day American English. Brown University Press.

Lessing, Doris (1959) Each His Own Wilderness. New English Dramatists - 1, pp. 11-97. London: Penguin

Lyons, John (1966) Firth's Theory of Meaning. In Memory of J.R. Firth, pp. 288-302 (ed. Bazell, C.E. et al) London: Longman

McIntosh, Angus (1966) Patterns and Ranges. Papers in General, Descriptive and Applied Linguistics, pp. 183-199. London: Longman

Sinclair, J. McH. (1966) Beginning the Study of Lexis. In Memory of J.R. Firth, pp. 410-31 (ed. Bazell, C.E. et al) London: Longman

Spiegel, M.H. (1961) Theory and Problems of Statistics. London: Schaum

Van Buren, P. (1967) Preliminary Aspects of Mechanisation in Lexis. Cahiers de Lexicologie, 11, 89-112; 12, 71-84.