List of Corpora
Not all resources listed here conform to a strict definition of a corpus (see, for example, this).
They have, nevertheless, been included for the convenience of those who need them.
For an interesting approach that uses the whole World Wide Web as a
corpus, see http://www.webcorp.org.uk/.
- The Air Traffic Control Corpus
- ACL/DCI Association for Computational Linguistics Data Collection Initiative
- ATIS Air Travel Information System (Held by LDC)
- Bank of English
- BNC The British National Corpus
- The Brown Corpus 1 million word American English, standard reference
- CALLFRIEND Collection Unscripted telephone conversations in 12 languages and 3 dialects. (Held by LDC)
- CALLHOME Collection Unscripted telephone conversations in six languages (Held by LDC)
- CCAT Archive Classical, Literary, Historical and Religious Texts.
- The CHILDES Child Language Data Exchange System
"Child language transcript data from scores of projects in dozens of languages"
- COLT The Bergen Corpus of London Teenage Language.
- Contemporary Portuguese Corpus
Written (40 million words) and spoken (1.5 million words) texts from various text types.
- CRATER Multilingual Aligned Annotated Corpus (English, French, Spanish)
- CSLU Speech corpora collected/distributed by the Center for Spoken Language Understanding (Oregon, USA)
- CSPA Corpus of Spoken, Professional American-English.
2 million words from 1994-98. Compiled by M. Barlow.
- CSR Continuous Speech Recognition Corpora (Held by LDC)
- English-Norwegian Parallel Corpus
- English Turkish Aligned Parallel Corpora
- Corpus of Estonian Written Texts
1 million words. Texts from 1983-87, various text-types.
- European Corpus Initiative Multilingual Corpus I (ECI/MCI)
- The Gutenberg Project
- The Canadian Hansard proceedings in English and French.
(Site where you type a word/phrase in one language and get the equivalent in the other).
- The Helsinki Corpus of English Texts
- Hypermedia Corpus of Japanese Conversation
- The International Corpus of English (ICE)
- The International Corpus of Learner English (ICLE)
- Japanese Speech Corpora of Major City Dialects (in Japanese)
- The Kolhapur Corpus
- Lampeter Corpus of Early Modern English Tracts.
- The Lancaster Parsed Corpus
- The Lancaster/IBM Spoken English Corpus (external link)
- LDC Materials:
- The LOB Corpus
- The London-Lund Corpus (LLC)
- The Market Research Corpus
- MARSEC: The Machine Readable Spoken English Corpus
- The Middle English Collection 40 titles, most of which are publicly accessible.
- The Modern English Collection The Electronic Text Center at The University of Virginia. 1,144 titles including 3,000 manuscript and book illustrations, many of which are publicly accessible.
- Old English Corpus
".. contains all surviving OE material ..". Search on-line.
- The Complete Corpus of Old English. UVa users only.
- The Oslo Corpus of Bosnian Texts Written. Approx. 1.6 million words from several different genres, mostly published in the 1990s.
- The PEDANT Project, Gothenburg, Sweden.
Search interface for language pairs of Swedish and English, German, or French. Texts are mostly official EU documents and some press releases from Volvo and Skanska.
- Penn Treebank (LDC)
- Regeringsforklaringen
The yearly declaration of the Swedish government issued in Swedish, English, French,
German, and Spanish.
Searchable on-line.
- The Religious and Sacred Texts Page.
Collection of various sacred and religious texts
- LDC)
- ShATR The Sheffield-ATR Multiple-Simultaneous Speaker Database. A Speech Science Corpus
- Corpus of Spoken Bulgarian collected by Krasimira Aleksova.
"transcribed conversations in family contexts". Freely available.
- Corpus of Spoken Bulgarian collected by Cvetanka Nikolova.
Transcribed conversations from 1975-1977. Approx. 50,000 words. Freely available.
- SPIDRE Corpus - Recorded Telephone Conversations
- The Susanne Corpus Parsed subset (130,000 words) of the Brown corpus.
- Swedish component of the Parole project.
Approx. 10 million words. POS tagged. Search online.
- TIMIT (English Speech Corpus)
- TIPSTER Information Retrieval Text Research Collection (LDC)
- United Nations Parallel Text Corpus (English, French, Spanish) (LDC)
- Wellington Corpus of Spoken New Zealand English (WSC)
1 million words; monologue (scripted and unscripted) and dialogue (private and public).
- Wellington Corpus of Written New Zealand English (WWC)
1 million words, structure parallel to Brown corpus.
- WHO bilingual documents
Aligned texts English-French, English-Spanish
- Corpus of Spoken Bulgarian collected by Krasimira Aleksova.
"transcribed conversations in family contexts". Freely available.
- Corpus of Spoken Bulgarian collected by Cvetanka Nikolova.
Transcribed conversations from 1975-1977. Approx. 50,000 words. Freely available.
- CRATER Multilingual Aligned Annotated Corpus (English, French, Spanish)
- Oslo Corpus of Bosnian Texts.
- Corpus of Estonian Written Texts
1 million words. Texts from 1983-87, various text-types.
- The Canadian Hansard proceedings in English and French.
(Site where you type a word/phrase in one language and get the equivalent in the other).
- Hypermedia Corpus of Japanese Conversation
- Japanese Speech Corpora of Major City Dialects (in Japanese)
- English-Norwegian Parallel Corpus
Original texts and translations to and from both languages.
- Contemporary Portuguese Corpus
Written (40 million words) and spoken (1.5 million words) texts from various text types.
- Swedish component of the Parole project.
Approx. 10 million words. POS tagged.
- The PEDANT Project, Gothenburg, Sweden.
Search interface for language pairs of Swedish and English, German, or French. Texts are mostly official EU documents and some press releases from Volvo and Skanska.
- Regeringsforklaringen
The yearly declaration of the Swedish government issued in Swedish, English, French,
German, and Spanish.
Searchable on-line.
- English Turkish Aligned Parallel Corpora
- CALLFRIEND Collection Unscripted telephone conversations in American English, Canadian French, Egyptian Arabic, Farsi, German, Hindi, Japanese, Korean, Mandarin, Spanish, Tamil, and Vietnamese. (Held by LDC)
- LDC)
- WHO bilingual documents
Aligned texts English-French, English-Spanish
Available on-line
- free, directly available
- BNC: Simple searches on-line.
(limited number of hits, limited information about the hits)
- Gutenberg Project Texts (via the W3-Corpora search engine)
- Demo of CobuildDirect: On-line searches in a sub-component of the Bank of English (limited number of hits, limited information about the hits)
- COLT Bergen Corpus of London Teenage Language
Search in the pilot version.
- The Canadian Hansard proceedings in English and French.
(Site where you type a word/phrase in one language and get the equivalent in the other).
- Old English Corpus
".. contains all surviving OE material ..".
- Regeringsforklaringen
The yearly declaration of the Swedish government issued in Swedish, English, French,
German, and Spanish.
- The Canadian Hansard proceedings in English and French.
(Type a word/phrase in one language and get the equivalent in the other).
- free, registered users
- registered users (subject to subscription/licence fee)
Links to online text resources other than corpora. An extensive list can be found on the
Project Gutenberg site (link here).
- ATHENA
Collections of texts free for non-commercial use. Many are by Swiss and French authors.
- Electronic Newsstand
Magazines of many different kinds (not all available on-line).
- The Electronic Text Center
University of Virginia.
- The English Server
Arts and Humanities texts online. Arranged by topics.
- Internet Corpora Index
"online resources which may serve as corpora for psychologists and other scientists".
- The Internet Public Library
"the first public library of and for the Internet community"
- KidPub
Collection of stories written by "kids from all over the planet!"
- On Media Directory
Search by geographic location or media type.
- Oxford Text Archive (OTA)
"The OTA collects high-quality scholarly electronic texts and linguistic
corpora (and any related resources) of long-term interest and use across
the range of humanities disciplines"
- Project Gutenberg
Extensive collection of whole books.
- Project Runeberg
Center for Nordic literature. About 200 titles available online.