Sentences are made up of words, traditionally categorised into parts of speech or categories including nouns, verbs, adjectives, adverbs and prepositions (normally abbreviated to N, V, A, ADV, and P). A grammar of a language is a set of rules which says how these parts of speech can be put together to make grammatical, or `well-formed' sentences.
For English, these rules should indicate that (a) is grammatical, but that (b) is not (we indicate this by marking it with a `*').
Here are some simple rules for English grammar, with examples. A sentence consists of a noun phrase, such as the user, followed by a modal or an auxiliary verb, such as should, followed by a verb phrase, such as clean the printer:
A noun phrase can consist of a determiner, or article, such as the, or a, and a noun, such as printer (a). In some circumstances, the determiner can be omitted (b).
`Sentence' is often abbreviated to S, `noun phrase' to NP, `verb phrase'
to VP, `auxiliary' to AUX, and `determiner' to DET.
This information is easily visualized by means of a labelled bracketing of a string of words, or as a tree diagram, as in the figure below.
Figure: A Tree Structure for a Simple Sentence (users should clean the printer)
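A labelled bracketing can be produced mechanically from a tree. The following is a minimal sketch, assuming the tree for users should clean the printer follows the informal rules given so far (S made up of NP, AUX and VP; a subject NP with no determiner). The nested-tuple encoding and the function name `bracketing` are illustrative choices, not part of the text.

```python
# Tree for "users should clean the printer", written as nested tuples:
# (label, daughter, daughter, ...). A plain string is a word.
# Assumption: the category assignments follow the informal rules in the text.
TREE = (
    "S",
    ("NP", ("N", "users")),
    ("AUX", "should"),
    ("VP", ("V", "clean"), ("NP", ("DET", "the"), ("N", "printer"))),
)


def bracketing(node):
    """Render a tree as a labelled bracketing such as [NP [N users]]."""
    if isinstance(node, str):  # a word: print it as it is
        return node
    label, *daughters = node
    return "[" + label + " " + " ".join(bracketing(d) for d in daughters) + "]"


print(bracketing(TREE))
# [S [NP [N users]] [AUX should] [VP [V clean] [NP [DET the] [N printer]]]]
```

Each pair of brackets corresponds to one node of the tree diagram, so the two notations carry the same information.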
The auxiliary verb is optional, and the verb phrase can consist of just a verb (such as stopped), as in The printer stopped.
NP and VP can contain prepositional phrases (PPs), made up of prepositions (on, in, with, etc.) and NPs:
The reader may recall that traditional grammar distinguishes between phrases and clauses. The phrases in the examples above are parts of the sentence which cannot be used by themselves to form independent sentences. Taking The printer stopped, neither its NP nor its VP can be used as an independent sentence:
By contrast, many types of clause can stand as independent sentences. For example, (a) is a sentence which consists of a single clause --- The printer stopped. As the bracketing indicates, (b) consists of two clauses co-ordinated by and. The sentence (c) also consists of two clauses, one (that the printer stops) embedded in the other, as a sentential complement of the verb.
There is a wide range of criteria that linguists use for deciding whether something is a phrase, and if it is, what sort of phrase it is, that is, what category it belongs to. As regards the first issue, the leading idea is that phrases consist of classes of words which normally group together. If we consider again the example The user should clean the printer, one can see that there are good reasons for grouping the and user together as a phrase, rather than grouping user and should. The point is that the and user can be found together in many other contexts, while user and should cannot.
As regards what category a phrase like the user belongs to, one can observe that it contains a noun as its `chief' element (one can omit the determiner more easily than the noun), and the positions it occurs in are also the positions where one gets proper nouns (e.g. names such as Sam).
This is not to say that questions about constituency and category are all clear cut. For example, we have supposed that auxiliary verbs are part of the sentence, but not part of the VP. One could easily find arguments to show that this is wrong, and that should clean the printer should be a VP, just like clean the printer, giving the alternative structure shown in the following figure.
Figure: An Alternative Analysis (users should clean the printer)
Moreover, from a practical point of view, making the right assumptions about constituency can be important, since making wrong ones can lead to having to write grammars that are much more complex than otherwise. For example, suppose that we decided that determiners and nouns did not, in fact, form constituents. Instead of being able to say that a sentence is an NP followed by an auxiliary, followed by a VP, we would have to say that it was a determiner followed by a noun, followed by an auxiliary, followed by a VP. This may not seem like much, but notice that we would have to complicate the rules we gave for VP and for PP in the same way. Not only this, but our rule for NP is rather simplified, since we have not allowed for adjectives before the noun, or PPs after the noun. So everywhere we could have written `NP', we would have to write something very much longer. In practice, we would quickly see that our grammar was unnecessarily complex, and simplify it by introducing something like an NP constituent.
For convenience, linguists often use a special notation to write out grammar rules. In this notation, a rule consists of a `left-hand-side' (LHS) and a `right-hand-side' (RHS) connected by an arrow (→):
S → NP (AUX) VP
VP → V (NP) PP*
NP → (DET) (ADJ) N PP*
PP → P NP
N → user
N → users
N → printer
N → printers
V → clean
V → cleans
AUX → should
DET → the
DET → a
P → with
The first rule says that a Sentence can be rewritten as (or decomposes into, or consists of) an NP followed by an optional AUX, followed by a VP (optionality is indicated by brackets). Another rule says that a PP can consist of a P and an NP. Looked at the other way, the first rule can be interpreted as saying that an NP, an AUX and a VP make up a sentence. Items marked with a star (`*') can appear any number of times (including zero) --- so the second rule allows there to be any number of PPs in a VP. The rules with `real words' like user on their RHS serve as a sort of primitive dictionary. Thus the first one says that user is a noun, the fifth one that clean is a verb. Since the NP rule says that an N by itself can make up an NP, we can also infer that printers is an NP, and since (by the VP rule) a V and an NP make up a VP, clean printers is a VP. Thus, a grammar such as this gives information about what the constituents of a sentence are, and what categories they belong to, in the same way as our informal rules at the start of the section.
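To make the reading of brackets and stars concrete, the rules can be written down as data and checked against strings of words. The following is a minimal sketch of a recognizer, not a description of how any particular system does it. The encoding (`RULES`, `LEXICON`, the markers `""`, `"?"` and `"*"`) and the function names `ends` and `is_sentence` are assumptions made for illustration.

```python
# Each RHS item is (category, marker): "" = exactly once,
# "?" = optional (brackets in the notation), "*" = any number of times.
RULES = {
    "S": [("NP", ""), ("AUX", "?"), ("VP", "")],
    "VP": [("V", ""), ("NP", "?"), ("PP", "*")],
    "NP": [("DET", "?"), ("ADJ", "?"), ("N", ""), ("PP", "*")],
    "PP": [("P", ""), ("NP", "")],
}
# The dictionary rules, stored as word -> category.
LEXICON = {
    "user": "N", "users": "N", "printer": "N", "printers": "N",
    "clean": "V", "cleans": "V", "should": "AUX",
    "the": "DET", "a": "DET", "with": "P",
}


def ends(cat, words, start):
    """Positions at which a constituent of category `cat` beginning at `start` can end."""
    if cat not in RULES:  # a dictionary category such as N or V
        if start < len(words) and LEXICON.get(words[start]) == cat:
            return {start + 1}
        return set()
    positions = {start}
    for item, marker in RULES[cat]:
        matched = set()
        for p in positions:
            matched |= ends(item, words, p)
        if marker == "":
            positions = matched
        elif marker == "?":
            positions = positions | matched
        else:  # "*": keep matching until no new end positions appear
            result = set(positions)
            frontier = matched - result
            while frontier:
                result |= frontier
                reached = set()
                for p in frontier:
                    reached |= ends(item, words, p)
                frontier = reached - result
            positions = result
    return positions


def is_sentence(text):
    """True if the whole string can be analysed as an S by the rules above."""
    words = text.lower().split()
    return len(words) in ends("S", words, 0)


print(is_sentence("users should clean the printer"))  # True
print(2 in ends("VP", ["clean", "printers"], 0))      # True: clean printers is a VP
```

The sketch only answers yes or no. Building the tree as well would mean recording which rule matched at each step.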
Returning to the tree representation of users should clean the printer, each node in the tree (and each bracketed part of the string representation) corresponds to the LHS of a particular rule, while the daughters of each node correspond to the RHS of that rule. If the RHS has two constituents, as in NP → DET N, there will be two branches and two daughters; if there are three constituents, there will be three branches and three daughters, and so on.
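The correspondence between nodes and rules can be read off a tree directly. A small sketch, assuming the tree is stored as nested tuples of the form (label, daughter, ...) with plain strings for words; the encoding and the name `rules_used` are hypothetical.

```python
# Assumed encoding: (label, daughter, ...); a plain string is a word.
TREE = (
    "S",
    ("NP", ("N", "users")),
    ("AUX", "should"),
    ("VP", ("V", "clean"), ("NP", ("DET", "the"), ("N", "printer"))),
)


def rules_used(node):
    """Return one (LHS, RHS) pair per node, top-down and left to right."""
    if isinstance(node, str):  # words are not the LHS of any rule
        return []
    label, *daughters = node
    # The RHS is the sequence of daughter labels (or the word itself).
    rhs = tuple(d if isinstance(d, str) else d[0] for d in daughters)
    found = [(label, rhs)]
    for daughter in daughters:
        found.extend(rules_used(daughter))
    return found


for lhs, rhs in rules_used(TREE):
    print(lhs, "->", " ".join(rhs))
```

The S node yields S -> NP AUX VP, with three daughters, while the object NP yields NP -> DET N, with two.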
It is worthwhile to have some terminology for talking about trees.
Looking from the top,
the trees above start from (or `are
rooted in') a sentence node --- the LHS of our sentence rule. Near
the bottom of the trees, we have a series of nodes corresponding to the
LHS's of dictionary rules and, immediately below them at the very
bottom of the trees, actual words from the corresponding RHS's of the
dictionary rules. These are called the `leaves' or terminal nodes of
the tree. It is normal to speak of `mother' nodes and `daughter' nodes
(e.g. the S node is the mother of the NP, AUX, and VP nodes), and of
mothers `dominating' daughters.
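This terminology maps onto simple operations on a tree. The sketch below again assumes a nested-tuple encoding, and the helper names (`leaves`, `daughters`, `dominates`) are illustrative. Note one assumption: dominance is taken here to extend from a mother to the daughters of her daughters, and so on down, which is the usual convention but goes slightly beyond what is stated above.

```python
# Assumed encoding: (label, daughter, ...); a plain string is a word.
TREE = (
    "S",
    ("NP", ("N", "users")),
    ("AUX", "should"),
    ("VP", ("V", "clean"), ("NP", ("DET", "the"), ("N", "printer"))),
)


def label_of(node):
    """Label of a node; a word counts as its own label."""
    return node if isinstance(node, str) else node[0]


def leaves(node):
    """The words at the very bottom of the tree, left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for daughter in node[1:]:
        words.extend(leaves(daughter))
    return words


def daughters(node):
    """Labels of the nodes immediately below `node`."""
    return [] if isinstance(node, str) else [label_of(d) for d in node[1:]]


def dominates(node, label):
    """True if something labelled `label` occurs anywhere below `node`."""
    if isinstance(node, str):
        return False
    return any(label_of(d) == label or dominates(d, label) for d in node[1:])


print(leaves(TREE))            # ['users', 'should', 'clean', 'the', 'printer']
print(daughters(TREE))         # ['NP', 'AUX', 'VP']
print(dominates(TREE, "DET"))  # True
```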
In practice most sentences are longer and more complicated than our example. If we add adjectives and prepositional phrases, and some more words, more complex trees can be produced, as shown in the figure captioned A More Complex Tree Structure. There, the NP which is the left daughter of the S node contains an adjective and a noun but no determiner (the NP rule in our grammar above allows for noun phrases of this form), and the NP in the VP contains a determiner and a PP.
A large collection of such rules will constitute a formal grammar for a language --- formal, because it attempts to give a mathematically precise account of what it is for a sentence to be grammatical. As well as being more concise than the informal descriptions at the beginning of the section, the precision of formal grammars is an advantage when it comes to providing computational treatments.
Figure: A More Complex Tree Structure
We should emphasise that the little grammar we have given is not the
only possible grammar for the fragment of English it is supposed
to describe. The question of which grammar is `best' is a matter for
investigation. One question is that of completeness -- does the
grammar describe all sentences of the language? In this respect,
one can see that our example above is woefully inadequate. Another
issue is whether a grammar is correct in the sense of allowing only
sentences that are in fact grammatical: our example grammar falls down
in this respect, since it also allows many ungrammatical sentences.
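One way to see this kind of overgeneration is to enumerate some of what the grammar licenses. The sketch below covers only strings of the shape N (AUX) V (N), which the rules S → NP (AUX) VP, NP → N and VP → V (NP) allow. The function name `generate` is hypothetical, and the strings singled out are illustrative choices, not necessarily the examples the text has in mind.

```python
from itertools import product

# Words taken from the dictionary rules of the example grammar.
NOUNS = ["user", "users", "printer", "printers"]
VERBS = ["clean", "cleans"]
AUXES = ["should"]


def generate():
    """Yield every string of the shape N (AUX) V (N) that the grammar licenses."""
    for subj, aux, verb, obj in product(NOUNS, [None] + AUXES, VERBS, [None] + NOUNS):
        # None stands for an optional item that has been left out.
        yield " ".join(w for w in (subj, aux, verb, obj) if w is not None)


sentences = list(generate())
print(len(sentences))                               # 80
print("users clean printers" in sentences)          # True, and fine in English
print("users should cleans printers" in sentences)  # True, but not English
```

Nothing in the rules relates the form of the verb to the auxiliary or to the subject, so the grammar cannot tell the acceptable string from the unacceptable one.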
A grammar may also be incorrect in associating constituents with the
wrong categories. For example, as we noted above, one would probably
prefer a grammar which recognizes that determiners and nouns make up
NPs, and that the NPs that occur in S (i.e. subject NPs) and those that
appear in VP (object NPs) are the same (as our grammar does) to a
grammar which treats them as belonging to different categories ---
this would suggest (wrongly) that there are things that can appear as
subjects, but not as objects, and vice versa. This is obviously not
true, except for some pronouns that can appear as subjects but not as
objects: I, he, she, etc.
A worse defect of this kind is the treatment of words -- the grammar gives far too little information about them, and completely misses the fact that clean and cleans are actually different forms of the same verb. We will show how this problem can be overcome in a later chapter.
In a practical context, a further issue is how easy it is to understand the grammar, and to modify it (by extending it, or fixing mistakes), and how easy it is to use it for automatic processing (an issue to which we will return). Of course, all these matters are often related.