Corpus Linguistics

[Please note: these pages are no longer maintained and may be out of date.]






These pages were created as part of the W3-Corpora Project at the University of Essex.
 


Corpus-Related Research

This is a short introduction to some of the research areas where corpora can be and have been used.

* Links to web-pages made to supplement the book "Corpus Linguistics" by Tony McEnery and Andrew Wilson.


  • Computational Linguistics

    "Computational Linguistics is an interdisciplinary field which centers around the use of computers to process or produce human language" (C. Ball). In some ways, computational linguistics and corpus linguistics can be seen as overlapping disciplines. Computational linguists depend on computer-readable linguistic data for their research, while corpus linguists often use computational methods when analysing their data. One main difference is that in corpus linguistics the data in the corpus is the main object of study, whereas in computational linguistics corpora are not studied as such but are used as a resource for solving various problems.

    Computational Linguistics is a broad term. You can read more about it in Catherine Ball's lecture About Computational Linguistics or at this page (University of Saarland).

  • Cultural Studies (*)

    The existence of comparable corpora makes it possible to compare language use in, for example, different countries. The results of such comparisons can point to differences in culture. It has been suggested, for example, that the lower proportion of expressions of future in the Kolhapur Corpus of Indian English, as compared to the LOB and Brown corpora, can be explained by cultural differences. "Maybe the Indian mind is not given to thinking much in terms of the future..." (Shastri 1988:18 in ICAME Journal 12:15-26).

    So far, the use of corpora in cultural studies is not a particularly well-developed field. Perhaps the ongoing work of compiling 20 corpora of different varieties of English within the ICE project (International Corpus of English) will help make this a more fruitful research area in the future.

  • Discourse Analysis and Pragmatics (*)

    "Pragmatics is the study of the way language is used in particular situations, and is therefore concerned with the functions of words as opposed to their forms. It deals with the intentions of the speaker, and the way in which the hearer interprets what is said"
    (from The Collins Cobuild English Language Dictionary, 1987).

    Corpora have not been used much in discourse analysis or pragmatic studies. One explanation for this is that it has been difficult to find material suitable for this kind of research. As more corpora are compiled and annotated with the relevant information, more corpus-based research is being carried out in this area. Examples of such studies can be found among the work done by scholars in Bergen (Norway) on their corpus of London teenage language, COLT. See, for example, "They like wanna see how we talk and all that. The use of like as a discourse marker in London teenage speech", and "More trends in teenage talk. A corpus-based investigation of the discourse items cos and innit".

    Read more about using corpora in discourse analysis and pragmatics in, for example,

  • Grammar/Syntax (*)

    Much research on grammar and syntax has been based on the researcher's intuition about the language, on his/her 'competence'. The existence of large corpora has made it easier to study the language as it is produced, to study the 'performance' of many people.

    "Every (formal) grammar is initially written on the basis of intuitive data; by confronting the grammar with unrestricted corpus data it can be tested on its correctness and its completeness". Jan Aarts (1991)

    Corpus data are used to a greater or lesser extent in the production of grammar books. One example of a book based entirely on corpus evidence is An Empirical Grammar of the English Verb: Modal Verbs by Dieter Mindt (1995) (external link). Another example of corpus-based research on grammar and syntax is Clause patterns in Modern British English: A corpus-based (quantitative) study by N. Oostdijk and P. de Haan (1994), in ICAME Journal 18 (abstract).

  • Historical Linguistics (*)

    The possibility of having representative samples of the language at different points in history in machine-readable form allows historical linguists to conduct their research faster and more efficiently. The Helsinki Corpus is a well-known and much used corpus of texts from different periods.

    The Lampeter Corpus of Early Modern English Tracts contains a collection of pamphlets published between 1640 and 1740. For an example of research on the Lampeter Corpus, click here.

  • Language Acquisition

    The ICLE corpus (International Corpus of Learner English) contains data produced by learners of English as a foreign language from different countries. It is being used for a variety of research purposes, some of which were presented at the AILA96 conference (abstracts). Learn more about this research in the book Learner English on Computer by Sylviane Granger.

    The CHILDES database contains transcripts of language spoken by children. This material can be used for research in a number of fields, language acquisition being one. An annotated bibliography of research in child language and language disorders can be found by using this link.

  • Language Teaching (*)

    There are several examples of how corpora can be, and have been, used in language teaching. See, for example: Classroom Concordancing / Data-driven learning Bibliography, "references to the direct use of data from linguistic corpora for language teaching and language learning".

  • Language Variation (*)

    Much work with corpora concerns language variation. Corpora are used to study how language varies between different text types, domains, times, regions, speakers, writers, etc. In these kinds of studies, one variant of the language is compared to another. These 'variants' can be different parts of one and the same corpus or similar parts of different corpora. An example of the former would be a comparison of the Science Fiction texts in the LOB corpus with the Romantic Fiction texts in the same corpus. An example of the latter would be a comparison of the Science Fiction texts in the LOB corpus with the Science Fiction texts in the Brown corpus. Language variation can also concern how speakers vary their production depending on the situation, how the language has changed over time, or how the language varies within an area (dialect).
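
    A comparison of this kind usually starts from normalised frequencies, since the samples being compared rarely have the same size. The following is a minimal sketch in Python, not a description of any particular study: the two sample texts are invented stand-ins for two text categories, and the function names tokenize and per_thousand are assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())


def per_thousand(tokens, word):
    """Occurrences of `word` per 1,000 tokens, so samples of different sizes can be compared."""
    if not tokens:
        return 0.0
    return Counter(tokens)[word] * 1000 / len(tokens)


# Invented toy samples standing in for two comparable text categories.
sample_a = "The ship will land. It will not wait."
sample_b = "The ship landed. It did not wait."

for name, text in [("sample_a", sample_a), ("sample_b", sample_b)]:
    rate = per_thousand(tokenize(text), "will")
    print(f"{name}: {rate:.1f} occurrences of 'will' per 1,000 words")
```

    With real corpus samples the same calculation would be run over whole text categories, and a difference in rates would normally be checked with a significance test before being interpreted.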

  • Lexicography (*)

    Corpora are increasingly used in lexicography today. The first to make extensive use of large corpora in dictionary and grammar book production was Collins Cobuild. Longman has consulted the British National Corpus (BNC) and the Longman Corpus Network for their latest edition of the Longman Dictionary of Contemporary English.

    You can read more about the use of corpora in dictionary making in Computer Corpus Lexicography by Vincent B. Y. Ooi. For examples of corpus-based studies in the field of lexicography, see the Bibliography of papers by Cobuild staff members.

  • Linguistics

    Corpora are important sources of data for a number of areas within the wide scope of Linguistics. Some of these are dealt with in more detail elsewhere on this page.

    "[A]nalyses of language use provide an important complementary perspective to traditional linguistic descriptions". Douglas Biber, in IJCL 1:2

  • Psycholinguistics (*)

    Observing the language found in a corpus can contribute to the creation of hypotheses about the way language is processed by the mind.
    The use of corpora can also contribute to research in language pathologies. In order to analyse a particular language impairment it is important to have a very clear picture of the structural and formal differences between the impaired language and its unimpaired counterpart.

  • Semantics (*)

    There are various ways in which you can study the meaning of words/utterances. One way is to look at the context in which the word/phrase occurs. Concordances and collocations (*) are often used for this.

    Attempts have been made to annotate corpora semantically, but the availability of such material is limited. An example of how such information can be represented can be found in WordNet, a semantically annotated lexical database for English.
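
    The idea of studying a word through its context can be illustrated with a small keyword-in-context (KWIC) concordance and a count of neighbouring words. This is only a sketch under simple assumptions: the sample sentence is invented, tokenisation is a plain regular expression, and the function names concordance and collocates are chosen for illustration. Counting raw neighbours is the simplest possible view of collocation; published studies typically add a statistical association measure.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())


def concordance(tokens, node, width=3):
    """Return one KWIC line per occurrence of `node`, with `width` tokens of context per side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


def collocates(tokens, node, window=2):
    """Count the words found within `window` tokens to the left or right of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            counts.update(tokens[max(0, i - window):i])
            counts.update(tokens[i + 1:i + 1 + window])
    return counts


# Invented sample text, used only to exercise the two functions.
text = "She drank strong tea. He likes strong coffee and strong tea."
tokens = tokenize(text)

for line in concordance(tokens, "strong", width=2):
    print(line)

print(collocates(tokens, "strong", window=1).most_common(3))
```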

  • Social Psychology (*)

  • Sociolinguistics (*)

    The existence of corpora that include sociolinguistic information about the speakers and/or authors of a text has made it possible to use corpora in sociolinguistic research.

    The British National Corpus (BNC) has been extensively annotated for various sociolinguistic parameters, such as speakers' age, sex, and social class, writers' age, sex, domicile, etc. This information is used in a number of studies, for example by Paul Rayson et al. in Social Differentiation in the Use of English Vocabulary: Some Analyses of the Conversational Component of the British National Corpus (IJCL 1997:2:1). Another corpus with sociolinguistic annotation is COLT, the corpus of London teenage language. See, for example, Girls' conflict talk: a sociolinguistic investigation of variation in the verbal disputes of adolescent females by A-B Stenstrom and I.K. Hasund.

    Historical corpora are also being used for sociolinguistic research. See, for example, the Sociolinguistics and Language History Project.
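
    To show why speaker annotation matters, the sketch below groups invented utterances by a speaker attribute and compares how often a word is used in each group. The record layout (keys sex, age and text), the sample utterances and the function name rate_by_group are all assumptions made for this illustration; they do not reflect the actual annotation format of the BNC or COLT.

```python
from collections import Counter

# Hypothetical records: each utterance carries metadata about its speaker.
utterances = [
    {"sex": "F", "age": 15, "text": "it was like really good"},
    {"sex": "F", "age": 16, "text": "she was like no way"},
    {"sex": "M", "age": 15, "text": "i like it"},
    {"sex": "M", "age": 17, "text": "we went home early today ok then"},
]


def rate_by_group(records, attribute, word, per=100):
    """Occurrences of `word` per `per` tokens for each value of the speaker `attribute`."""
    hits = Counter()
    totals = Counter()
    for record in records:
        tokens = record["text"].lower().split()
        group = record[attribute]
        hits[group] += tokens.count(word)
        totals[group] += len(tokens)
    return {group: hits[group] * per / totals[group] for group in totals if totals[group]}


print(rate_by_group(utterances, "sex", "like"))
```

    Note that this simple count does not distinguish the verb like from the discourse marker like; separating the two would require grammatical or manual annotation.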

  • Speech (*)

    The first computer-readable corpus of spoken discourse was the London-Lund Corpus (LLC). It contains about 500,000 words of spoken British English, transcribed and provided with prosodic annotation. The LLC was compiled primarily for linguistic research. Since then, many corpora of spoken text have been compiled for various other research tasks, especially for use within the fields of speech science and speech technology.

    There are several areas where spoken corpora are used. One is looking at spoken language as a variety of natural language, sometimes in comparison to written language. In such studies, a corpus of orthographically transcribed language, such as the BNC, can be (and has been) used.

    Spoken corpora are also used within the fields of speech technology and speech science. Examples of such research tasks are teaching computers to produce and understand speech. The study of acoustic and phonetic phenomena of speech is important in, for example, the expanding commercial sphere of telecommunication.

    Producing transcribed spoken corpora with detailed annotation is a time-consuming and, therefore, costly procedure, which is why few such corpora are freely available.

    Read more about spoken language corpora activities here (external link)

  • Stylistics (*)

    The availability of corpora with large collections of texts from different genres, authors, media, etc. opens up new possibilities in the research area of Stylistics. Texts of different kinds can be compared to each other to find text type specific features. General corpora can serve as a frame of reference, something to compare other texts with.

    One area of Stylistics where corpora have been used is authorship attribution (see, for example, P. de Haan, 1997).

    Another example of how corpora can be used in Stylistics is given in T. Tabata's essay on the use of statistical methods to investigate the changes of style in a corpus of Dickens's writings.



W3-Corpora project 1996-98: This page is no longer maintained.