collocate: to appear together, or words that appear together.
(In the collocations 'apple tree', 'apple pie', and 'Adam's apple',
'apple' collocates with 'tree', 'pie', and 'Adam's'. They are
collocates.)
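As a rough illustration, collocates can be counted by looking at the words that fall within a small window around a chosen word. The sketch below is only one simple way to do this; the function name `collocates`, the `window` parameter, and the sample sentence are assumptions made for the example, not part of any particular corpus tool.

```python
from collections import Counter


def collocates(tokens, node, window=1):
    """Count the words found within `window` positions of each `node` occurrence.

    Hypothetical helper: a plain co-occurrence count, with no statistical
    weighting of the kind real collocation measures usually apply.
    """
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                counts[tokens[j]] += 1
    return counts


# Invented sample sentence, split on whitespace.
sample = "the apple tree shades the apple pie stand".split()
neighbours = collocates(sample, "apple")
```

Here `neighbours` records 'the' twice and 'tree' and 'pie' once each as collocates of 'apple'.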
hit: When your search string is found in the corpus,
it is referred to as a hit or match.
KWIC (Key-Word In Context): a form of concordance where the
hit is shown with a certain amount of context, often presented
with the hit in the centre of the page.
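A KWIC display can be sketched in a few lines. The code below is a minimal illustration, assuming whitespace-separated tokens and a context measured in words; the names `kwic`, `format_kwic`, and `width` are made up for the example.

```python
def kwic(tokens, query, width=3):
    """Return (left context, hit, right context) for every match of `query`.

    Hypothetical helper: matching is case-insensitive and the context is
    `width` tokens on each side.
    """
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == query.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


def format_kwic(lines):
    """Right-align the left context so that the hits line up in one column."""
    pad = max((len(left) for left, _, _ in lines), default=0)
    return [f"{left:>{pad}}  {hit}  {right}" for left, hit, right in lines]


sentence = "I see a cat and a dog".split()
for line in format_kwic(kwic(sentence, "a", width=2)):
    print(line)
# prints:
#   I see  a  cat and
# cat and  a  dog
```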
lemma: the set of different forms of a word, such as the inflected
forms of a verb. For example, 'sing', 'sang', and 'sung' belong to one
lemma; 'boy' and 'boys' belong to another.
lemmatisation: the process or result of dividing a text into
lemmas.
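One simple way to picture lemmatisation is a lookup from each word form to its lemma. The sketch below assumes a tiny hand-made table (`LEMMA_TABLE`, a hypothetical name) that covers only the forms used in the lemma entry; real lemmatisers rely on much larger lexicons or on morphological rules.

```python
from collections import defaultdict

# Hypothetical, hand-made table: word form -> lemma.
LEMMA_TABLE = {
    "sing": "sing",
    "sang": "sing",
    "sung": "sing",
    "boy": "boy",
    "boys": "boy",
}


def lemmatise(tokens, table=LEMMA_TABLE):
    """Map each token to its lemma; forms not in the table are kept, lowercased."""
    return [table.get(tok.lower(), tok.lower()) for tok in tokens]


def group_by_lemma(tokens, table=LEMMA_TABLE):
    """Collect the word forms found in `tokens` under their lemma."""
    groups = defaultdict(set)
    for tok in tokens:
        form = tok.lower()
        groups[table.get(form, form)].add(form)
    return dict(groups)


print(lemmatise(["Boys", "sang"]))  # prints ['boy', 'sing']
```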
TEI (Text Encoding Initiative): international project set up "to develop
guidelines for the preparation and interchange of electronic texts...".
Uses SGML as starting-point.
thin: remove certain hits, either automatically or manually.
thinning: the process or result of removing certain hits,
either by selecting the desired ones, by selecting the ones to discard, or by
selecting or discarding a set number of hits.
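The three ways of thinning can be sketched as small functions over a list of hits. This is only an illustration: the function names are hypothetical, hits are identified by their position in the list, and drawing a random sample is an assumption about how "a set number of hits" might be chosen automatically.

```python
import random


def thin_keep(hits, keep_indices):
    """Manual thinning: keep only the hits at the selected positions."""
    keep = set(keep_indices)
    return [hit for i, hit in enumerate(hits) if i in keep]


def thin_discard(hits, discard_indices):
    """Manual thinning: drop the hits at the selected positions."""
    discard = set(discard_indices)
    return [hit for i, hit in enumerate(hits) if i not in discard]


def thin_random(hits, n, seed=0):
    """Automatic thinning: keep `n` randomly chosen hits, in their original order.

    Assumption: a seeded random sample; a tool could equally keep every k-th hit.
    """
    if n >= len(hits):
        return list(hits)
    rng = random.Random(seed)
    chosen = sorted(rng.sample(range(len(hits)), n))
    return [hits[i] for i in chosen]
```

For example, `thin_random(hits, 50)` would reduce a long list of hits to 50, while `thin_keep(hits, [0, 2])` keeps only the first and third hit.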
token: individual word. Compare type.
tokenisation: the process or result of dividing a text or list of
words into tokens.
treebank: term sometimes used for parsed corpora.
type: wordform. "I see a cat and a dog" contains seven tokens
but only six types (the type 'a' occurs twice).
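The token and type counts for this sentence can be checked with a short sketch. It assumes the simplest possible tokenisation (splitting on whitespace) and treats word forms as case-sensitive; the function names are made up for the example.

```python
from collections import Counter


def tokenise(text):
    """Split a text into tokens on whitespace (a deliberately naive assumption)."""
    return text.split()


def type_counts(tokens):
    """Return each type (distinct word form) with its number of occurrences."""
    return Counter(tokens)


tokens = tokenise("I see a cat and a dog")
types = type_counts(tokens)

print(len(tokens))  # prints 7
print(len(types))   # prints 6
print(types["a"])   # prints 2
```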
* Links to web-pages made to supplement the book
"Corpus Linguistics" by Tony McEnery and Andrew Wilson.
Are any terms missing that you think should be included in the list above?
Please let us know and we will try to update the glossary. Comments and
suggestions are also welcome.
Thank you!