Corpus Linguistics

[Please note: these pages are no longer maintained and may be out of date.]


These pages have been created as part of the W3-Corpora Project at the University of Essex.
 


Using corpora

Knowing your corpus
Something about corpus compilation

Combining texts into a corpus is called compiling a corpus. There are various ways of doing this, depending on what kind of corpus you want to create and on what resources (time, money, knowledge) you have at your disposal.

Even if you are not compiling your own corpus, it is important to know something about corpus compilation when you use a corpus. Using a corpus is using a selection of texts to represent the language. How the corpus has been compiled is of utmost importance for the results you get when using it. What texts are included, how these are marked up, the proportions of different text types, the size of the various texts, how the texts have been selected, etc. are all important issues.

Illustration: the language as a newspaper

Let us imagine that you have a newspaper - a collection of texts of different kinds (editorials, reportage on different topics, reviews, cartoons, letters to the editor, sports commentaries, lists of shares, etc.) written by different people. You then cut the paper into small pieces with one word on each. You put all the pieces/words into a bowl and pick a sample of ten at random. Obviously there would be several words that you know exist in the newspaper that are not found in your sample. If you were to pick another ten pieces of paper you would not expect the two sets of ten words to be exactly the same. If you picked two sets of 100 words each, you would probably find that some words, especially frequent words like function words, can be found in both samples, if not in exactly the same numbers. You would also find that many words are found in only one of the samples. If you took two very large samples you would find that the frequent words would occur to a similar extent in both. Words that occur only once in the newspaper would be found in at most one of the samples. Words that occur infrequently would not necessarily be evenly distributed across the two samples.
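
The experiment with the bowl can be tried out in a few lines of Python. What follows is only a minimal sketch: the "newspaper" is an invented sentence, and the names NEWSPAPER, draw_two_samples and shared_types are made up for this illustration. It assumes that the second handful is drawn without putting the first one back, so no slip can end up in both samples.

```python
import random
from collections import Counter

# A toy "newspaper": invented text, standing in for a real paper.
NEWSPAPER = (
    "the match ended and the fans left the ground while the editor "
    "wrote that the council and the mayor had failed the town and "
    "the review said the film was a quiet triumph"
).split()


def draw_two_samples(words, size, rng):
    """Shuffle all slips, then take two non-overlapping handfuls of `size`."""
    slips = rng.sample(words, len(words))  # shuffled copy of the bowl
    return slips[:size], slips[size:2 * size]


def shared_types(sample_a, sample_b):
    """Distinct words that turn up in both samples."""
    return set(sample_a) & set(sample_b)


rng = random.Random(0)  # fixed seed so the draw can be repeated
first, second = draw_two_samples(NEWSPAPER, 10, rng)
print("in both samples:", sorted(shared_types(first, second)))
print("most frequent in the paper:", Counter(NEWSPAPER).most_common(3))
```

Running it with different seeds should show the pattern described above: a frequent word such as "the" tends to turn up in both handfuls, while a word that occurs only once can never be in both.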

Now imagine that you divide the newspaper into sections (or classify its content into categories/text types) before cutting it up, and then put the cuttings in different bowls. By picking your paper slips from the different bowls you can influence the composition of your sample. You can choose to take slips from only one bowl or from several, in equal or different proportions. If there is a difference in the language in the bowls, there will be a difference in the language on the slips and that will affect your sample correspondingly. You can easily see that if you were to take 100 slips of paper from the 'sports' bowl and 100 slips from the 'editorial' bowl, you would probably find more occurrences of the word football in the sample taken from the 'sports' bowl than in the one taken from the 'editorial' bowl.
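
The two bowls can be sketched in the same way. The contents of the bowls below are invented, and the helper names (BOWLS, draw_from_bowl, share_in_bowl) are assumptions made for this example only. Because the toy bowls are tiny, the sketch draws with replacement, which is a simplification of taking 100 slips from a large real bowl.

```python
import random

# Hypothetical bowls with invented contents.
BOWLS = {
    "sports": "football match goal football team score football fans win".split(),
    "editorial": "policy council budget football debate tax reform vote mayor".split(),
}


def draw_from_bowl(bowl, size, rng):
    """Draw `size` slips with replacement (the toy bowls are very small)."""
    return rng.choices(bowl, k=size)


def share_in_bowl(bowl, word):
    """Proportion of the slips in a bowl that carry `word`."""
    return bowl.count(word) / len(bowl)


rng = random.Random(0)
sports = draw_from_bowl(BOWLS["sports"], 100, rng)
editorial = draw_from_bowl(BOWLS["editorial"], 100, rng)
print("football in sports sample:   ", sports.count("football"))
print("football in editorial sample:", editorial.count("football"))
```

Since a third of the slips in the invented 'sports' bowl say football, against one in nine in the 'editorial' bowl, the sports sample will usually, though not on every single draw, contain the word more often.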

Corpus compilation

We can use the image above to give a (simplified) description of how a corpus can be created. (We will not go into any practical issues here - this is merely intended to give you an understanding of why it is important to know the corpus you use.) If we imagine the language as a whole as the newspaper, we can say that the words on the slips of paper are texts (bits of spoken or written language). You create (compile) a corpus by selecting texts from the language. The composition of the corpus depends on the kind of texts you use, how you have selected them, and in what proportions. If you have divided your paper into sections you can decide to use more texts from one section, to use texts from one section only, to use a set proportion of texts from each section, to use a set number of texts from each section, etc. What kind of bowls you use will also make a difference - will you have bowls for various text types (reviews, editorials, news reportage, etc.), or sort the cuttings according to author specification (age of author, sex, education, etc.)? Or will you sort them according to the time when they were written, the intended reader, or the colour of print? How do you classify the texts? If you look at the slips before you select/discard them, the composition of your sample/corpus will reflect the choices you made (for example, you may choose to select texts which contain some particular feature/construction/vocabulary item, irrespective of what section they come from).
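
The difference between "a set number of texts from each section" and "a set proportion of texts from each section" can be made concrete with a small sketch. Everything in it is hypothetical: the bowls contain placeholder texts, and compile_corpus and quotas_from_proportions are names invented for this illustration, not functions of any corpus tool.

```python
import random

# Hypothetical bowls: each slip stands for one whole text.
BOWLS = {
    "reviews": [f"review text {i}" for i in range(50)],
    "editorials": [f"editorial text {i}" for i in range(50)],
    "sports": [f"sports text {i}" for i in range(50)],
}


def quotas_from_proportions(proportions, total):
    """Turn shares (0.5, 0.3, ...) into a number of texts per bowl.

    Rounding means the quotas may not add up to `total` exactly.
    """
    return {name: round(share * total) for name, share in proportions.items()}


def compile_corpus(bowls, quotas, rng):
    """Draw quotas[name] texts, without replacement, from each named bowl."""
    corpus = []
    for name, quota in quotas.items():
        corpus.extend((name, text) for text in rng.sample(bowls[name], quota))
    return corpus


rng = random.Random(0)
# A set number of texts from each section ...
equal = compile_corpus(BOWLS, {"reviews": 5, "editorials": 5, "sports": 5}, rng)
# ... or a set proportion of texts from each section.
quotas = quotas_from_proportions({"reviews": 0.5, "editorials": 0.3, "sports": 0.2}, 20)
weighted = compile_corpus(BOWLS, quotas, rng)
print(len(equal), len(weighted), quotas)
```

The two corpora are drawn from the same bowls, yet half of the second one consists of reviews. Any count made on it will lean towards the language of reviews, which is why a user needs to know the proportions.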

Disclaimer

The image of the language as a newspaper may give the impression that 'the language as a whole' is a well-defined and generally agreed upon notion, something that is concrete and possible to quantify. This is far from the case. We should not be tempted to forget that language is not a confined, closed entity but a very difficult notion to define, quantitatively or qualitatively. Try to decide, for example, how much language you use in a day. Do you then count only the language you produce or all the language you come into contact with? What are the proportions of written and spoken language? Should the spoken language you hear on the radio (actively listening or just overhearing) be counted differently from the spoken language directed to you? Does it make a difference if you talk/write to several people or just to one? How should language spoken to a dictaphone or answering machine be classified? Would a shopping list be counted as language? What about a diagram you make/see as an illustration to a text (spoken or written)? And so on.

When compiling a corpus, you do not only have to take into account how you define language - you also have to decide what proportions of different varieties of language you want to include in your corpus. Once that is settled, you have to get the language - acquire the texts. Articles from newspapers and books can be easy enough to get hold of, and transcripts/scripts of certain radio and TV programs as well. How do you get the more personal writings like letters and diaries, though? And records of personal conversations, confessions, information given in confidence, etc.? Moreover, as many corpus compilers can testify, much time and effort has to be spent on legal issues such as obtaining permission to use the texts and making sure that no copyright agreement is broken.

Summary

When you think of what we have described above, it is easy to understand why it is important to know something about how a corpus is compiled and what kinds of text samples are included. Among the issues that have to be considered, then, by both corpus compilers and corpus users are:
  • the language sampled (what kind of newspaper has been used?)
  • the size of the corpus (how many pieces of paper were taken from the newspaper bowl?)
  • the kinds of text included (from which bowls was the sample taken?)
  • the proportions of different text types (how many slips of paper from each bowl?)
If the corpus consists of samples from a particular variety of language (from the 'sports' bowl, for example) you will find that it may be very different from another sample taken from another bowl. Moreover, it is important to know about the size of the corpus and the size/number of samples making up the corpus. If you have a big corpus (a large proportion of the newspaper) you may be able to find even rare words. In a small sample you have a bigger chance of missing something (think of all the words you don't get if you take only ten slips from the newspaper bowl, for example). If the corpus consists of a large part of one particular bowl you get a good picture of that particular bowl. It may or may not be different to a sample from another bowl. If you have a corpus of the same size but consisting of several small samples from different bowls, you will have a broader corpus (from more areas). The samples from each bowl are still small, however, so you may not be able to say much about the language in any one bowl.
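
The effect of corpus size on rare words can also be sketched. The bowl below is invented (three frequent words and eight words that occur once each), and type_coverage is a name made up for this example. The samples are nested, that is, each larger sample contains the smaller one, which is an assumption chosen so that the effect of size alone is visible.

```python
import random

# Hypothetical bowl: a few very frequent words plus eight words that occur once.
BOWL = (
    ["the"] * 20 + ["of"] * 10 + ["and"] * 8
    + "match goal editor review cartoon shares council triumph".split()
)


def type_coverage(bowl, sample):
    """Share of the bowl's distinct words that the sample contains."""
    return len(set(sample)) / len(set(bowl))


rng = random.Random(0)
shuffled = rng.sample(BOWL, len(BOWL))  # one shuffle, then growing handfuls
for size in (10, 20, len(BOWL)):
    print(size, "slips:", round(type_coverage(BOWL, shuffled[:size]), 2))
```

Only the sample that takes in the whole bowl is certain to contain every word; the smaller handfuls find the frequent words easily but are likely to miss some of the words that occur only once.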

Among the practical matters that have to be solved by the compiler are:

  • how can the texts be obtained? Where do they exist? (in books, on the WWW, etc)
  • do you need permission to use the texts?
  • do you need to process the material to include it (transcribe, code, convert files, etc)?
  • how can the texts be converted to the format you want them in (made electronically readable by scanning, keying-in, converting files to right format, etc)?
Though the user of the corpus does not have to make decisions about these practical matters, there are other issues that are important for the user to be aware of. Among those are, for example:
  • permission to use the corpus.
    Some corpora are only available to licence holders or for particular purposes (such as non-commercial academic research, teaching, personal use, etc)
  • permission to reproduce text.
    You may be permitted to use the texts as long as you do not quote them or publish them.
  • format of the texts.
    Some texts may be available only in particular formats that cannot be read by an ordinary word processor, for example.
  • software.
    A number of programs and search engines have been developed for use on corpora in general or on specific corpora. A basic knowledge of and access to some of these tools may be necessary in order to make use of the corpus.


W3-Corpora Project 1998