Jump Over Left Menu
The Computation of Collocations and their Relevance in Lexical Studies
G L M Berry-Roghe
The terms collocation and collocability were first introduced by J R Firth in his paper Modes of Meaning published in 1951. Firth does not give any explicit definition of collocation but he rather illustrates the notion by way of such examples as: 'One of the meanings of ass is its habitual collocation with an immediately preceding you silly...' Although some of his other contributions to linguistic and stylistic analysis (such as prosodic features) have had a considerable impact, his notion of collocation has not been seriously considered until the last decade. The reasons for this neglect are probably twofold: on the one hand, the rather vague terms in which he described the notion (cfr. Haskell 1970) and, on the other hand, the practical restrictions imposed by the prohibitive scale of a textual study of collocability. The latter drawback has been remedied by the introduction of the digital computer in textual analysis. As to the former, several recent attempts have been made by scholars at defining the notion collocation more precisely within the framework of modern linguistic theory. For relevant theoretical discussions on this topic, the reader is referred to the bibliography (Haas 1966, Halliday 1961, 1966, Lyons 1966, McIntosh 1966, Sinclair 1966, Van Buren 1967).
The present paper reports on a pilot study attempting to make explicit the notion of collocation in statistical and computational terms. Firth himself stressed that collocation was not to be equated with mere co-occurrence of lexical items but he always used the phrase habitual or usual collocation implying a gradation in collocability among the set of words which are found to occur in the environment of a particular item. As a technically more workable definition of collocation we borrowed the one formulated by Halliday (1961) as:
'the syntagmatic association of lexical items, quantifiable, textually, as the probability that there will occur at n removes (a distance of n lexical items) from an item x, the items a, b, c ...'
In other words, the aim is to compile a list of those syntagmatic items (collocates) significantly co-occurring with a given lexical item (node) within a specified linear distance (span). Significant collocation can be defined in statistical terms as the probability of the item x co-occurring with the items a, b, c ... being greater than might be expected from pure chance. Consider the following statistical data as being given:
Z total number of words in the text A a given node occurring in the text Fn times B a collocate of A occurring in the text Fc times K number of co-occurrences of B and A S span size, ie., the number of items on either side of the node considered as its environment.
First must be computed the probability of B co-occurring K times with A, if B were distributed randomly in the next. Next, the difference between the expected number of co-occurrences and the observed number of co-occurrences must be evaluated. The probability of B occurring at any place where A does not occur is expressed by:
p = Fc / (Z-Fn)
The expected number of co-occurrences is given by:
E = p.Fn.S
The problem is to decide whether the difference between observed and expected frequencies is statistically significant. This can be done by means of computation of the z-score as a normal approximation to the binomial distribution (Hoel 1962) using the formula:
z = (K-E) / sqrt(Eq) (q = 1-p)
This formula has proved highly satisfactory in that it yielded a gradation among collocates which largely corresponded with our semantic intuitions. It is, however, by no means proposed as the ultimate one. Further studies on a larger scale than the present one and possibly embodying a greater variety of texts might well require modifications to it. Thus, for some nodes a Poisson distribution would be more appropriate than a binomial one. Another flaw in the formula is that it does not account for the possibility of a node co-occurring with itself within the specified span. This possibility was temporarily dismissed on account of its relative rarity. But other studies with a stylistic rather than a lexical inclination might find such aspects highly relevant. It is nevertheless hoped that the subsequent illustrations of the application of the formula will prove its relative validity. But first a few words must be said about the nature of the data.
In order to obtain a fairly comprehensive picture of the collocational relations of lexical items a very large corpus would have to be processed. Halliday (1966) quotes the figure of some 20 million running words! It is obviously not feasible even with our largest computers to process a corpus of this size. (Even the Brown University corpus of American English has a limit of 1 million words). The number of running words processed in the present pilot study amounted to 71,595. This figure is of course far off the 20 million mark but it proved sufficient for an initial methodological investigation. The data consisted of a 19th century prose work and two modern plays, namely A Christmas Carol by Charles Dickens, Each his own Wilderness by Doris Lessing and Everything in the Garden by Giles Cooper. The choice of these texts was motivated by their availability in machine readable form rather than by any stylistic considerations. Prior to the analysis a concordance and statistical information on each text (such as a frequency profile, average sentence length ...) were obtained. But the three texts were subsequently conflated as the size of each separately did not warrant results of any statistical significance. Thus, the aim of the study was not in the first place stylistic, but rather mehodological, being concerned with answering such questions as What is the optimal span size?, should grammatical items be ignored?, etc. For computational purposes, the lexical unit was defined as the graphic word ie., a sequence of characters delimited by specified word boundary markers such as space and the punctuation marks, thus adopting the same conventions used in the statistical analysis of American English at Brown University (Nelson and Kucera 1971).
The texts were processed on the Atlas computers at Manchester and Chilton. The initial stages of the program - written in Atlas Autocode - are similar to an ordinary concordance program where the context is limited to the specified span. A particular keyword being selected as node, all items occurring within the span are conflated into an alphabetical list and their number of co-occurrences with the node is counted. This list is then tested against a previously compiled dictionary consisting of an alphabetic word-frequency table for the entire text. The total number of occurrences of each collocate is subsequently recorded, after which the z-scores are computed. As initial node the item house was chosen, partly because the word count had shown it to occur with relatively high frequency (83 instances) and partly, because our intuitions about the syntagmatic bonds of this item are fairly clear, thus providing a valuable touchstone for an automatic analysis. The initial collocation set consisted of all graphic words occurring within a span of three items on either side of the node house, regardless of intervening sentence boundaries. The small size of the textual sample compelled us to discard all collocates co-occurring with house only once. This decision caused some valuable collocations to be lost (such as four-roomed house) but, conversely the inclusion of the item Amazons (occurring in the context: This house is full of Amazons ...) 'was avoided. Table 1 contains a listing, in order of absolute frequency of co-occurrence, of all items collocating at least twice with house, including indications of their total frequency in the data (Fc), their observed (K) and expected frequency of co-occurrence (E) and the respective z-scores. As might have been expected, those items co-occurring most frequently with house are grammatical. Table 2 is a re-ordering of table 1 in terms of relative collocational significance. Due to the limitations of the sample the order established is obviously not fully representative of the collocational behaviour of house in the English Language as a whole. It is, for example, highly likely that in a larger sample sold would occur more frequently in other environments than house and that Commons would stand out more clearly as forming part of the idiom House of Commons.
Where the significance limit should be drawn is in the last resort subject to the judgement of the individual investigator and to the purpose of his study. At the 0.1% level of significance statistical tables put this limit at a 2.576 z-score (Spiegel 1961). This figure was adopted as a workable hypothesis for the present study but is by no means applicable to other types of data and corpus sizes. By drawing this particular limit we run the risk of excluding from consideration unusual but creative collocations (an example of which being Dickens's use of the adjective young with house) alongside with obviously irrelevant ones. Similarly, those items with a negative z-score are excluded, although they might possibly be deemed to be of a particular stylistic interest as they seem to repel the node. However, it is our view that unusual collocation needs to be explained with reference to an explicit definition of usual collocation. The present study is an attempt to investigate how the latter can be best established.
The next question to be considered is how adequate the chosen span size was. Table 3 shows how an increase of the span size from 3 to 6 affects the sets of statistically significant collocates. All newly introduced collocates are highlighted. Some of these, such as enter, rooms, fronts, garden ... seem indeed highly relevant, whereas others such as God, Bernard, stop are to be considered intruders. An investigation at this point of the respective concordances and statistical data on each text showed that the stylistic nature of the corpus is highly relevant in defining the optimal span size. Increasing the span size resulted in effect in introducing desirable collocates from A Christmas Carol but undesirable ones from the plays. The obvious explanation lies in the marked difference between the mean sentence length which amounts to 14.03 in A Christmas Carol but only to 6.7 in the modern plays. Although by no means all significant collocates occur solely within the sentence boundaries, the majority nevertheless do. However, a general policy concerning the span size to be adopted in this pilot study could not be based upon observations of the behaviour on a single node. Other items of medium frequency were similarly tested, including verbs, adverbs and adjectives: As it proved that the majority of significant collocates do appear in the immediate vicinity of their nodes, it seemed appropriate to provisionally adopt a span of 4 for both types of data and for all nodes which were non-grammatical items, except in the case of adjectives where a span of only 2 seemed indicated.
Much more could be said at this point about further stylistic differences between the two types of data as well as about the collocational relevance of grammatical items. A case might be put forward for considering as potential collocates only those items which stand in a grammatical relation to the node. Generally speaking, this would have the effect of including only relevant items. However, unless a very sophisticated grammatical analysis which takes into account anaphoric reference is applied a great many important collocates might be lost. Consider, for example, the following quote from A Christmas Carol. It is hard to envisage at the moment a grammatical model which would be powerful enough to relate to the node face all underlined items which we felt to be lexically related to it:
Marley's face. It was not in impenetrable a shadow as the other objects in the yard were, but had a dismal light about it, like a bad lobster in a dark cellar. It was not angry or ferocious, but looked at Scrooge as Marley used to look with ghostly spectacles turned up on its ghostly forehead. The hair was curiously stirred, as if by breath or hot air; and, although the eyes were wide open, they were perfectly motionless. That, and its livid colour, made it horrible, but its horror seemed to be in spite of the face and beyond control, rather than part of its own expression.
As such a sophisticated computerised syntactic analysis is not yet available, it is hoped that the procedures proposed here will nevertheless yield some valuable results in semantic and stylistic studies of this kind.
Finally, we should like to anticipate the next stage in our study of collocation. The eventual aim of a collocational analysis is not just to establish sets of syntagmatically related items but to extend these to include paradigmatically related items so that eventually a semantic field might be established. A program is being designed which will take as input the set of collocates of a given node and examine their collocational relations with other items, thus forming a network of semantically related items. Starting with the node house, a preliminary attempt was made to simulate manually such an automatic scan, hereby making use of available collocational sets and concordances. The results, which are plotted in table 4 embody a provisional semantic field of habitation terms. Considering that this field emerged from as small a sample as some 72,000 running English words it seems, apart from some obvious gaps, intuitively quite acceptable. The methods outlined in this paper when applied to a corpus of considerable size could lead to the establishment of a thesaurus of English, based on an objective and consistent analysis of how words are actually used rather than on subjective intuitions. Other stylistic applications to the analysis of the vocabulary (or even conceptual world) of an individual author are easy to envisage.
TABLE 1: Collocates of house in order of frequency of co-occurrence. (Span = 3 items on either side of the node.)
TABLE 2 Collocates of 'house' in decreasing order of significance.
SOLD 24.0500 MORE 1.3740 COMMONS 21.2416 MUCH 1.3927 DECORATE 19.9000 WHERE 1.2896 THIS 13.3937 ONE 1.2879 EMPTY 11.9090 GET 1.1949 BUYING 10.5970 OUT 1.1348 PAINTING 10.5970 OR 0.9316 OPPOSITE 8.5192 PEOPLE 0.9440 LOVES 6.4811 OF 0.9096 OUTSIDE 5.8626 MOTHER 0.8558 LIVED 5.6067 SEE 0.8503 FAMILY 4.3744 BEEN 0.7713 REMEMBER 3.9425 DID 0.6462 FULL 3.8209 ABOUT 0.5363 MY 3.6780 BUT 0.3641 INTO 3.5792 NOT 0.3221 THE 3.2978 LIKE 0.2833 HAS 2.9359 HIS -0.0385 'RE 2.5333 WAS -0.0890 NICE 2.3908 KNOW -0.1038 YEARS 2.3712 ALL -0.1060 IS 2.1721 WELL -0.1909 EVERY 2.0736 YES -0.5818 ONLY 2.0441 WITH -0.9090 COULD 1.9807 ABOUT -0.3805 SOMETHING 1.9026 FOR -0.1794 UP 1.8829 IF -0.2197 MYRA 1.7239 IT -0.4368 OTHER 1.6889 THEY -0.5175 IN 1.7299 I -0.6865 HAVE 1.8689 BE -0.6557 BEFORE 1.6451 DO -0.6993 TONY 1.4459 TO -1.6660 GHOST 1.3916 THAT -1.8030 YOU -2.6034 AND -2.6488
TABLE 3 Significant collocates of 'House'
Span = 3 Span = 4 collocate K Fc z-score collocate K Fc z-score sold 6 7 24.0100 sold 6 7 20.7566 commons 4 4 21.2416 commons 4 4 18.3415 decorate 3 3 19.8000 decorate 3 3 15.8837 this 22 252 13.3337 this 22 252 10.7863 empty 3 7 11.8090 empty 3 7 10.2360 buying 2 4 10.5970 buying 2 4 9.0697 painting 2 4 10.5970 painting 2 4 9.0697 opposite 2 6 8.5192 opposite 2 6 7.5951 loves 2 10 6.4811 loves 2 10 5.5975 outside 2 12 5.8626 entered 2 10 5.5975 lived 2 13 5.6067 near 2 11 5.2373 family 2 20 4.3744 outside 2 12 4.9038 full 2 25 3.8209 lived 2 13 4.7583 remember 2 26 3.9425 remember 3 26 4.9102 my 8 271 3.6780 rooms 2 15 4.3255 into 3 92 3.5792 flat 2 18 3.8170 the 35 2368 3.2978 big 2 19 3.7878 has 2 50 2.9359 Bernard 2 20 3.6876 family 2 20 3.6676 my 9 271 3.3055 full 2 23 3.2845 into 3 92 2.8326 the 42 2368 2.3182 every 3 59 2.7971 Mrs. 2 29 2.6713 Span = 5 Span = 6 collocate K Fc z-score collocate K Fc z-score sold 7 7 21.6383 sold 8 7 22.5581 commons 4 4 16.3571 commons 4 4 14.8871 decorate 3 3 14.1356 decorate 3 3 13.8456 fronts 2 2 11.5635 fronts 2 2 10.5971 this 22 252 9.6080 cracks 2 2 10.5971 empty 3 7 9.0914 this 22 252 8.4908 buying 2 4 8.0577 empty 3 7 8.2410 painting 2 4 8.0577 buying 2 4 7.3117 opposite 2 6 6.1350 painting 2 4 7.3117 loves 2 10 4.8677 opposite 2 6 5.8695 entered 2 10 4.8677 loves 2 10 4.3741 near 2 11 4.6050 entered 2 10 4.3741 outside 2 12 4.6050 black 2 12 3.9168 black 2 12 4.3742 near 2 11 4.1308 remember 3 26 4.2689 outside 2 12 4.1308 lived 2 13 4.1691 remember 2 26 3.7847 rooms 2 15 3.9122 lived 2 13 3.7118 garden 2 17 3.5230 rooms 2 15 3.6698 flat 2 18 3.4019 God 5 64 3.6806 big 2 19 3.2829 stop 3 27 3.2550 into 5 92 3.2795 garden 2 17 3.1308 God 4 64 3.1869 flat 2 18 3.0115 family 2 20 3.1728 every 4 59 2.9325 Bernard 2 20 3.1728 big 2 19 2.9009 my 9 271 2.8011 my 11 271 2.8854 full 2 25 2.7980 into 5 92 2.8849 family 2 20 2.7310 Bernard 2 20 2.7310 whole 2 23 2.6771
TABLE 4: The Structure of the Semantic Field of "habitation terms"
house room office home place flat hall chamber hut cellar warehouse grocer's world live in/at * * * * * * * * leave * * * * * door * * * * * empty * * * window * * * close * * * dark * * * garden * * move * * let out * * furnished * * stay in/at * * wall * * pass * * enter * * full * *
Cooper, Giles (1963) Everything in the Garden New English Dramatists - 7, pp 141-221. London: Penguin
Dickens, Charles (1922) A Christmas Carol. London: Macmillan
Firth, J.R. Modes of Meaning. Papers in Linguistics 1934-51, (1957) pp. 190-215. Oxford University Press.
Firth, J.R. (1968) A Synopsis of Linguistic Theory 1930-55. Selected Papers of J.R. Firth 1952-59 (ed. Palmer, F.R.) London: Longman.
Haas, W. (1966) Linguistic Relevance. In Memory of J.R. Firth, pp. 116-148 (ed. Bazell, C.E. et al) London: Longman
Halliday, M.A.K. (1961) Categories of the Theory of Grammar. Word, 17, 241-92
Halliday, M.A.K. (1966) Lexis as a Linguistic Level. In Memory of J.R. Firth, pp.148-63 (ed. Bazell, C.E. et al) London: Longman
Haskel, Peggy (1971) Collocations as a Measure of Stylistic Variety. The Computer in Literary and Linguistic Research, pp 159-169. (ed. Wisbey, R.A.) Cambridge University Press
Hoel, P.G. (1962) Introduction to Mathematical Statistics New York: Wiley.
Kucera, H. and Francis, W.N. (1967) Computational Analysis of present-day American English. Brown University Press.
Lessing, Dorris (1959) Each His Own Wilderness. New English Dramatists-1, pp. 11-97. London: Penguin
Lyons, John (1966) Firth's Theory of Meaning. In Memory of J.R. Firth, pp. 288-302 (ed. Bazell, C.E. et al) London: Longman
Mc Intosh, Angus (1966 ) Patterns and Ranges. Papers in General, Descri~tive and Applied Linguistics, pp. 183-199. London: Longman
Sinclair, J. McH, (1966) Beginning the Study of Lexis. In Memory of J.R. Firth, pp. 410-31 (ed. Bazell, C.E. et al) London: Longman
Spiegal, M.H. (1961) Theory and Problems of Statistics London: Schaum
Van Buren, P. (1967) Preliminary Aspects of Mechanisation in Lexis. Cahiers de Lexicology, It, 89-112, 12 71-84.