- Computational Linguistics
"Computational Linguistics is an interdisciplinary field which centers around the use of computers to
process or produce human language" (C. Ball)
In some ways, computational linguistics and corpus linguistics can be
seen as overlapping disciplines. Computational linguists are dependent
on computer-readable linguistic data to use in their research, while
corpus linguists often use computational methods when analysing their data.
The main difference is that in corpus linguistics the data in the corpus are the main object of study, whereas in computational linguistics corpora are not studied as such but used as a resource for solving various problems.
Computational Linguistics is a broad term. You can read more about it in Catherine Ball's lecture About Computational Linguistics or at this page (University of Saarland).
- Cultural Studies
(*)
The existence of
comparable
corpora makes it possible to compare the
language use in, for example, different countries. The result of such comparisons
can point to differences in culture. It has been suggested, for example,
that the lower proportion of expressions of future in the
Kolhapur Corpus of Indian English, as compared to the LOB and Brown
corpora, can be explained by cultural differences. "Maybe the Indian mind is not given to thinking much in terms of the
future..." (Shastri 1988:18 in ICAME Journal
12:15-26).
So far, the use of corpora in cultural studies is not a particularly
well-developed field. Perhaps the ongoing work of compiling 20 corpora of
different varieties of English within the
ICE project
(International Corpus of English) will help make this a more fruitful research
area in the future.
- Discourse Analysis and Pragmatics
(*)
"Pragmatics is the study of the way language is used in particular
situations, and is therefore concerned with the functions of words as opposed
to their forms. It deals with the intentions of the speaker, and the way in
which the hearer interprets what is said"
(from The Collins Cobuild English Language Dictionary, 1987).
Corpora have not been used much in discourse analysis or pragmatic studies.
One explanation for this is that it has been difficult to find material
suitable for this
kind of research. As more corpora are being
compiled and
annotated with
the relevant information, more corpus-based research is also being
performed in this area. Examples of such studies can be found among
the work done by scholars in Bergen (Norway), on their corpus of London
teenage language,
COLT. See, for example,
"They like wanna see how we talk and all that. The use of like
as a discourse marker in London teenage speech", and
"More trends in teenage talk.
A corpus-based investigation of the discourse items cos
and innit".
Read more about using corpora in discourse analysis and pragmatics in, for
example,
- Grammar/Syntax
(*)
Much research on grammar and syntax has been based on the researcher's
intuition about the language, on his/her 'competence'. The existence of
large corpora has made it easier to study the language as it is
produced, to study the 'performance' of many people.
"Every (formal) grammar is initially written on the basis of intuitive data; by
confronting the grammar with unrestricted corpus data it can be tested on its
correctness and its completeness". Jan Aarts
(1991)
Corpus data are used to a greater or lesser extent in the production
of grammar books. One example of a book based completely on corpus evidence is
An Empirical Grammar of the English Verb: Modal Verbs by Dieter Mindt (1995).
Another example of corpus-based research on grammar and syntax is
Clause patterns in Modern British English: A corpus-based (quantitative) study
by N. Oostdijk and P. de Haan (1994). In
ICAME Journal 18,
(abstract)
- Historical Linguistics
(*)
The possibility of having representative samples of the language at
different points in history in machine-readable form allows historical
linguists to conduct their research faster and more efficiently.
The Helsinki Corpus is
a well-known and much used corpus of texts from different periods.
The
Lampeter Corpus
of Early Modern English Tracts contains a collection of pamphlets
published between 1640 and 1740. For an example of research on the Lampeter
Corpus, click
here.
- Language Acquisition
The ICLE corpus (International Corpus of Learner English) contains data produced by learners of English as a foreign
language from different countries. It is being used for a variety of research purposes,
some of which were presented at the AILA96 conference
(abstracts). Learn more about this research in the book
Learner English on Computer by Sylviane Granger.
The CHILDES
database contains transcripts of language spoken by children. This material
can be used for research in a number of fields, language acquisition being one.
An annotated bibliography of research in child language and language disorders
can be found by using
this link.
- Language Teaching
(*)
There are several examples of how corpora can be, and have been, used in
language teaching. See, for example:
Classroom Concordancing / Data-driven learning Bibliography
"references to the direct use of data from
linguistic corpora for language teaching and language learning".
- Language Variation
(*)
Much work with corpora concerns language variation. Corpora are used to study how language varies between different text types, domains, times, regions, speakers, writers, etc. In these kinds of studies, one variant of the language is compared to another. These 'variants' can be different parts of one and the same corpus or similar parts of different corpora. An example of the former would be the Science Fiction texts in the LOB corpus compared to the Romantic Fiction texts in the same corpus. An example of the latter would be the Science Fiction texts in the LOB corpus compared to the Science Fiction texts in the Brown corpus. Language variation can also concern how speakers vary their production depending on the situation, how the language has changed over time, or how the language varies within an area (dialect).
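A comparison of this kind usually rests on relative rather than raw frequencies, because the parts being compared rarely have the same size. The following is a minimal sketch in Python, assuming the two sub-corpora are available as plain text. The sample sentences are invented stand-ins, not extracts from LOB or Brown, and the function names are illustrative only.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def per_thousand(text, word):
    """Relative frequency of `word` per 1,000 tokens of `text`."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return 1000 * Counter(tokens)[word] / len(tokens)

# Invented stand-ins for two text categories (not real corpus data).
sample_a = "The ship will land soon. We will see the stars."
sample_b = "She smiled. He said he would stay, and he did."

freq_a = per_thousand(sample_a, "will")
freq_b = per_thousand(sample_b, "will")
print(f"will: {freq_a:.1f} vs {freq_b:.1f} per 1,000 tokens")
```

With real corpora, the same normalisation would be applied to each text category before the figures are compared, and a significance test would normally follow.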
- Lexicography
(*)
Corpora are increasingly used in lexicography today. The first to make
extensive use of large corpora in dictionary and grammar book production was Collins Cobuild.
Longman has consulted the British National Corpus (BNC) and the Longman Corpus Network for their latest edition of
the
Longman Dictionary of Contemporary English.
You can read more about the use of corpora in dictionary making in
Computer Corpus Lexicography
by Vincent B. Y. Ooi.
For corpus-based studies in the field of lexicography,
see, for example, the
Bibliography of papers by Cobuild staff members.
- Linguistics
Corpora are important sources of data for a number of areas within the
wide scope of Linguistics. Some of these are dealt with in more detail
elsewhere on this page.
"[A]nalyses of language use provide an important
complementary perspective to traditional linguistic descriptions".
Douglas Biber, in
IJCL 1:2
- Psycholinguistics
(*)
Observing the language found in a corpus can contribute to the creation of
hypotheses about the way language is processed by the mind.
The use of corpora can also contribute to research in language
pathologies. In order to analyse a particular language impairment it
is important to have a very clear picture of the structural and formal
differences between the impaired language and its unimpaired form.
- Semantics
(*)
There are various ways in which you can study the meaning of words/utterances. One way is to look at the context in which the word/phrase occurs. Concordances and collocations (*) are often used for this.
Attempts have been made to provide corpora with semantic annotation, but the availability of such material is limited. An example of how such information can be represented is
WordNet, a semantically annotated lexical database for English.
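The two context-based techniques mentioned above, concordances and collocations, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the method of any particular concordancing tool: the tokenizer is naive, the window sizes are arbitrary, and the sample text is invented.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def concordance(tokens, node, width=3):
    """Return key-word-in-context lines for every occurrence of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

def collocates(tokens, node, window=2):
    """Count the words occurring within `window` tokens of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            counts.update(tokens[max(0, i - window):i])
            counts.update(tokens[i + 1:i + 1 + window])
    return counts

# Invented sample text (not taken from any corpus).
text = "Strong tea is popular. She made strong tea and he drank strong coffee."
tokens = tokenize(text)
kwic = concordance(tokens, "strong")
colls = collocates(tokens, "strong")

for line in kwic:
    print(line)
print(colls.most_common(3))
```

Raw co-occurrence counts like these are only a starting point. Studies of collocation typically weight them with an association measure, so that very frequent words do not dominate the list.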
- Social Psychology
(*)
- Sociolinguistics
(*)
The existence of corpora that include sociolinguistic information about the speakers and/or authors of a text has made it possible to use corpora in sociolinguistic research.
The British National Corpus (BNC) has been extensively
annotated for various sociolinguistic parameters, such as speakers' age, sex,
and social class, writers' age, sex, domicile, etc. This information is
used in a number of studies, for example by Paul Rayson et al. in
Social Differentiation in the Use of English Vocabulary: Some Analyses of
the Conversational Component of the British National Corpus (IJCL 1997:2:1).
Another corpus with sociolinguistic annotation is COLT, the
corpus of London teenage language. See, for example,
Girls' conflict talk: a sociolinguistic investigation of variation in the verbal
disputes of adolescent females by A-B Stenstrom and I.K. Hasund.
Historical corpora are also being used
for sociolinguistic research. See, for example, the
Sociolinguistics and Language History Project.
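Once utterances carry speaker information, a feature can be counted separately for each speaker group. The sketch below assumes a hypothetical record layout (a `sex`, an `age` and a `text` field per utterance). It does not reproduce the actual annotation format of the BNC or COLT, and the utterances are invented.

```python
from collections import defaultdict

# Invented utterance records with hypothetical metadata fields.
utterances = [
    {"sex": "F", "age": 15, "text": "it was like really good innit"},
    {"sex": "M", "age": 16, "text": "yeah cos he was like late"},
    {"sex": "F", "age": 42, "text": "I thought it was rather good"},
]

def rate_by_group(records, key, word):
    """Occurrences of `word` per 100 tokens for each value of `key`."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for rec in records:
        tokens = rec["text"].lower().split()
        hits[rec[key]] += tokens.count(word)
        totals[rec[key]] += len(tokens)
    # Skip groups without any tokens to avoid dividing by zero.
    return {g: 100 * hits[g] / totals[g] for g in totals if totals[g]}

rates = rate_by_group(utterances, "sex", "like")
for group, rate in sorted(rates.items()):
    print(f"{group}: {rate:.1f} per 100 tokens")
```

The same function can group by any other field in the records, for example `age`, which is how one parameter at a time could be examined in a sketch like this.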
- Speech
(*)
The first computer-readable corpus of spoken discourse was the
London-Lund
Corpus (LLC). It contains about 500,000 words of spoken British English, transcribed
and provided with prosodic annotation. The LLC was compiled primarily for
linguistic research. Since then, many corpora of spoken
text have been compiled for various other research tasks, especially
for use within
the fields of speech science and speech technology.
There are several areas where spoken corpora are used. One is looking at
spoken language as a variety of natural language, sometimes in comparison
to written language. In such studies, a corpus of orthographically
transcribed language,
such as the BNC, can be
(and has been) used.
Spoken
corpora are also used within the fields of speech technology and speech
science. Examples of such research tasks are teaching computers to
produce and understand speech. The study of acoustic and
phonetic phenomena of speech is important in, for example, the expanding
commercial sphere of telecommunication.
Producing transcribed spoken corpora with detailed
annotation is a time-consuming and, therefore, costly procedure,
which is why few such corpora are freely available.
Read more about spoken language corpora activities
here (external link)
- Stylistics
(*)
The availability of corpora with large collections of texts from different genres, authors, media, etc. opens up new possibilities in the research area of Stylistics. Texts of different kinds can be compared to each other to find text type specific features. General corpora can serve as a frame of reference, something to compare other texts with.
One area of Stylistics where corpora have been used is in authorship attribution
(see, for example, P. de Haan, 1997).
Another example of how corpora can be used in Stylistics is given in T. Tabata's
essay on the use of statistical methods to investigate the changes of style in a corpus of Dickens's writings.