(How) Are Multi-word Sequences Universal? On the nature and degree of universality of multi-word sequences from the perspective of Korean, German and English

Buerki, Andreas 2010. (How) Are Multi-word Sequences Universal? On the nature and degree of universality of multi-word sequences from the perspective of Korean, German and English. Presented at: Seoul International Conference on Linguistics, Korea, 23 - 25 June 2010.

The availability of large language corpora and their electronic processing has challenged the notion that language in use is predominantly a product of the slot-and-filler principle of grammar and vocabulary. The extent to which language in use consists of multi-word sequences (MWS) that appear to be processed as units, rather than combined anew for each instance of use, is still a matter of some uncertainty, but it is clear that the phenomenon is prominent (Altenberg, 1998; Sinclair, 1991), and recent years have seen a flurry of work devoted to it. Most of this work was done on English data and, to a much smaller degree, on a few other European languages (see Butler, 2005 for an overview). It has generally been assumed that MWS similar to the type found in English occur in all languages, and even in studies that looked at MWS in non-European languages this assumption has not received much overt discussion (though some discussion is found in Kim, 2009). While the assumption is not unreasonable, the nature and degree of the assumed universality across languages, particularly those of differing morphological type, need urgent investigation. The present study aims to contribute to this by comparing MWS found in English, German and Korean. While English is the most isolating language of the three, German is more synthetic and Korean is a synthetic language with polysynthetic elements. These data allow us to test whether the concept of MWS depends on, or is significantly influenced by, morphological typology. In our investigation, we looked at just over 1 million words of newspaper texts from the 1980s and 1990s for each language, taken from the British National Corpus (BNC Baby, version 2, 2005), the Swiss Text Corpus (Bickel, Gasser, Häcki Buhofer, Hofer & Schön, 2009) and the Sejong Corpus (Kim, 2006) respectively.
After the extraction of plain text and preparatory formatting, each of the one-million-word corpora was split into n-grams of length 2 to 10, using Zhang's NGramTool (2009). Subsequently, our own substring reduction algorithm was used to produce a consolidated list of n-grams for each corpus. These were then filtered to produce MWS and subsequently compared. Among the comparisons drawn were those between the number of MWS types and tokens above different frequency cut-offs. Notable differences between languages were found, with the English data displaying the highest number of extracted MWS by a considerable margin, followed by German, with Korean showing the smallest number of extracted MWS at all frequency cut-offs employed. In a second round of comparisons, the German data were modified by lemmatising verbs, thus neutralising verbal inflectional morphology. This raised the number of MWS extracted by about 10%. The removal of subject, object, topic and plural suffixes in our Korean data resulted in an even more dramatic increase in MWS extracted. While the results support the notion of MWS across typologically different languages, they also suggest that MWS may require a somewhat flexible definition when universality is claimed. Finally, the results highlighted methodological challenges in MWS identification. One answer to these challenges could involve a stronger weighting of the morpheme vis-à-vis the word in MWS definition and identification.

References:
Altenberg, B. (1998). On the phraseology of spoken English: The evidence of recurrent word-combinations. In A. P. Cowie (Ed.), Phraseology: Theory, analysis and applications. Oxford: Clarendon Press.
Bickel, H., Gasser, M., Häcki Buhofer, A., Hofer, L., & Schön, Ch. (2009). Schweizer Text Korpus - theoretische Grundlagen, Korpusdesign und Abfragemöglichkeiten. Linguistik Online, 39(3).
BNC Baby, Version 2. (2005). [Data file] Oxford University Computing Services on behalf of the BNC consortium.
Butler, C. S. (2005). Formulaic language: An overview with particular reference to the cross-linguistic perspective. Pragmatics & Beyond. New Series, 140, 221-242.
Kim, H. (2006). Korean national corpus in the 21st century Sejong Project. In Language corpora: Their compilation and application. Proceedings of the 13th NIJL international symposium.
Kim, Y. (2009). Korean lexical bundles in conversation and academic texts. Corpora, 4(2), 135-165.
Sinclair, J. (1991). Corpus, concordance, collocation. Oxford: Oxford University Press.
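
Illustrative note: the extraction steps named in the abstract (splitting a corpus into n-grams of length 2 to 10, then consolidating the list by substring reduction above a frequency cut-off) might be sketched as follows. This is only a minimal sketch under stated assumptions. The abstract does not specify the authors' substring reduction algorithm, so the rule used here (drop a shorter n-gram when a longer retained n-gram contains it and has the same frequency) is a stand-in, and the function names `extract_ngrams` and `reduce_substrings` and the `min_freq` parameter are hypothetical. It does not reproduce NGramTool.

```python
from collections import Counter


def extract_ngrams(tokens, n_min=2, n_max=10):
    """Count every contiguous n-gram of length n_min..n_max in a token list."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous sub-sequence of `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))


def reduce_substrings(counts, min_freq=2):
    """Assumed reduction rule (not the authors' algorithm): keep n-grams at or
    above min_freq, then drop any n-gram that is contained in a longer kept
    n-gram with exactly the same frequency."""
    kept = {g: c for g, c in counts.items() if c >= min_freq}
    result = {}
    for g, c in kept.items():
        subsumed = any(
            len(h) > len(g) and kept[h] == c and contains(h, g)
            for h in kept
        )
        if not subsumed:
            result[g] = c
    return result


# Toy input: the sequence "at the end of the day" occurs twice.
tokens = "at the end of the day we met at the end of the day".split()
consolidated = reduce_substrings(extract_ngrams(tokens))
print(consolidated)
# {('at', 'the', 'end', 'of', 'the', 'day'): 2}
```

On this toy input, all shorter fragments such as "at the" or "of the day" are absorbed into the single six-word sequence, which is the kind of consolidation the reduction step is meant to achieve. The pairwise comparison is quadratic in the number of retained n-grams, so a corpus-scale run would need an indexed approach.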

Item Type: Conference or Workshop Item (Paper)
Date Type: Completion
Status: Unpublished
Schools: English, Communication and Philosophy
Subjects: P Language and Literature > P Philology. Linguistics
