A Quick Overview of the W3Corpora Search Engine

Doug Arnold
University of Essex
doug@essex.ac.uk

The W3Corpora WebPage gives you access to:

This document is intended to give you a quick overview of how you can use the search engine.

Suppose, for the sake of argument, that you are interested in the difference in meaning between strong and powerful, and that to begin with you are just going to look at strong.

Before you start, open the search engine in a new browser window (so you can see it and this page at the same time).

To begin with, you have to choose a corpus, and give a search string:

  1. Choose a corpus. Clicking the "Corpus" button sends you to a new page; the only option here is the Gutenberg Corpus, so select it, and press "Confirm".

    Now you have to select a sub-corpus. Choose the first one or two, and press "Confirm".

  2. Choose a search string. You have to decide whether to look for whole words, beginnings of words, or parts of words. This allows you to control whether, as well as strong as a single word, you will find examples of things like:

    Since this is just a first investigation, suppose we ignore these, and just look for strong as a whole word: select "Exact Match", and enter "strong" in the box below. Press "Confirm".
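The three kinds of search string (whole words, beginnings of words, parts of words) can be sketched with regular expressions. This is only an illustration of the idea, not the search engine's own code: the function name `build_pattern`, the mode names, and the sample sentence are all invented here, and the sketch assumes case-sensitive matching, which may not be how the real engine behaves.

```python
import re

def build_pattern(term, mode):
    """Return a compiled regex for one of three hypothetical match modes.

    Matching is case-sensitive in this sketch; the real engine may differ.
    """
    escaped = re.escape(term)
    if mode == "exact":       # the term as a whole word only
        return re.compile(r"\b" + escaped + r"\b")
    if mode == "prefix":      # any word beginning with the term
        return re.compile(r"\b" + escaped + r"\w*")
    if mode == "substring":   # any word containing the term
        return re.compile(r"\w*" + escaped + r"\w*")
    raise ValueError(f"unknown mode: {mode}")

# Invented sample sentence, not taken from the corpus.
text = "Armstrong was strong, but the stronger man had a stronghold."

results = {}
for mode in ("exact", "prefix", "substring"):
    pattern = build_pattern("strong", mode)
    results[mode] = [m.group(0) for m in pattern.finditer(text)]
    print(mode, results[mode])
```

On this sample, the whole-word search finds only "strong", the beginning-of-word search also finds "stronger" and "stronghold", and the part-of-word search additionally finds "Armstrong".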

Now press the "Search" button. Quite quickly (because you have only selected a couple of documents from the whole corpus), you will see information about frequency. This is not very interesting, but it gives you an idea of what you are getting into (if the number of matches is very large, you might want to rethink what you are doing).

However, you can also look at subcorpus frequencies and lexical frequencies (click on these buttons). The subcorpus frequencies allow you to compare the frequency of a word in different documents (this might help you to focus your investigation on particular documents).

The lexical frequency information is not very interesting here, but it would be more interesting if we had done another kind of search (it would tell us how many of our matches were "Armstrong", "stronger", etc.).
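To make the two kinds of frequency concrete, here is a minimal sketch that counts matches per document (subcorpus frequency) and per distinct matched word form (lexical frequency). The document names and texts are invented, and a part-of-word search is assumed so that the lexical counts have something to distinguish; the real engine computes these figures for you.

```python
import re
from collections import Counter

# Invented stand-ins for two documents of a sub-corpus.
documents = {
    "doc_a": "A strong king and a stronger queen. Armstrong was strong.",
    "doc_b": "The strong wind grew stronger.",
}

# Part-of-word search: any word containing "strong".
pattern = re.compile(r"\w*strong\w*")

subcorpus_freq = {}       # matches per document
lexical_freq = Counter()  # matches per distinct word form
for name, text in documents.items():
    forms = pattern.findall(text)
    subcorpus_freq[name] = len(forms)
    lexical_freq.update(forms)

print(subcorpus_freq)
print(lexical_freq.most_common())
```

With a whole-word search every match is the same form, which is why the lexical frequencies are uninformative in that case.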

However, it is much more interesting to click on the "Display" button, which gives you a KWIC (Key Word in Context) display. Now you begin to get an idea of how the word is used. You initially see 10 of the search results. There are buttons to move you forward and backward to see others. If you click on a particular instance of the key word, you will see a much larger context.
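A KWIC display simply lines up each match with a fixed amount of text on either side. The sketch below shows one way this could be done; the function name `kwic`, the `width` parameter, the bracket formatting, and the sample text are assumptions made for illustration and say nothing about how the search engine itself is implemented.

```python
import re

def kwic(text, term, width=30):
    """Return one aligned line per whole-word match of `term` in `text`."""
    lines = []
    for m in re.finditer(r"\b" + re.escape(term) + r"\b", text):
        # Take up to `width` characters of context on each side.
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the key words line up in a column.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines

# Invented sample text, not taken from the corpus.
sample = "He had a strong suspicion that the wind was strong enough to fell the tree."

kwic_lines = kwic(sample, "strong", width=20)
for line in kwic_lines:
    print(line)
```

Increasing `width` corresponds roughly to the larger context you get by clicking on an individual instance.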

Now you should play around with this. For example:

In short, you can use this search engine to get information about the frequency and context of words and phrases within and across a selection of documents.

It isn't the only place you can get this information (see the pages on Corpus Linguistics on the W3Corpora WebPage).

Looking at individual words isn't the only thing you can do (see the Tutorial on the W3Corpora WebPage for suggestions on how you might investigate other things).

[You might reasonably ask what you will learn about strong vs powerful by this sort of investigation. You should bear in mind that a search engine is only a tool, and you can use it for what you are interested in. It will be better for some things than others. The aim here was just to give you a feel for the way the search engine works. However, you might, if you look closely, see that Dickens, in A Child's History of England, uses strong as a sort of intensifier "Elizabeth , conceived a strong idea", "...a strong suspicion", and does not use powerful in this way. Moreover, when he describes someone as strong it seems to convey some kind of intrinsic property: "(the) Earl of Salisbury , who was past sixty , and had never been strong", whereas powerful generally refers to something like political power.

There is a classic paper about this issue: Church and Hanks (1990): 'Using statistics in lexical analysis' Lexical Acquisition: Using On-Line Resources to Build a Lexicon. Ed. Uri Zernik. Hillsdale: Lawrence Erlbaum, 1991. Church and Hanks point out that there appears to be a generalized motivation underlying the choice: intrinsic vs. extrinsic, giving examples like: strong tea/defense/constitution vs. powerful drug/military/neighbor. Their investigation is based on a statistically based examination of collocations, not just looking at examples.]
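As a rough illustration of what a statistical examination of collocations involves, the sketch below scores word pairs with pointwise mutual information, one common association measure. This is an assumption chosen for illustration: it is not claimed to be the exact statistic or procedure used in the paper, the function name `pmi_scores` is hypothetical, and the token list is a toy example built from the phrases quoted above rather than real corpus data.

```python
import math
from collections import Counter

def pmi_scores(tokens, node, window=1):
    """PMI between `node` and each word following it within `window` tokens."""
    n = len(tokens)
    word_counts = Counter(tokens)
    pair_counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            for other in tokens[i + 1:i + 1 + window]:
                pair_counts[other] += 1
    scores = {}
    for other, joint in pair_counts.items():
        p_joint = joint / n
        p_node = word_counts[node] / n
        p_other = word_counts[other] / n
        # log2 of observed co-occurrence over what independence would predict
        scores[other] = math.log2(p_joint / (p_node * p_other))
    return scores

# Toy token sequence (invented), far too small for real conclusions.
tokens = "strong tea and powerful drug and strong tea and powerful military".split()

strong_scores = pmi_scores(tokens, "strong")
powerful_scores = pmi_scores(tokens, "powerful")
print(strong_scores)
print(powerful_scores)
```

A positive score means the pair occurs together more often than chance would predict; on real data such scores are only meaningful with a large corpus and reasonable frequency thresholds.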

Doug Arnold
University of Essex
Jan 1999