Next: Further Analysis: Grammatical Up: Representing Linguistic Knowledge Previous: Representing Linguistic Knowledge

Grammars and Constituent Structure

 

Sentences are made up of words, traditionally categorised into parts of speech, or categories, including nouns, verbs, adjectives, adverbs and prepositions (normally abbreviated to N, V, A, ADV, and P). A grammar of a language is a set of rules which says how these parts of speech can be put together to make grammatical, or `well-formed', sentences.

For English, these rules should indicate that (a) is grammatical, but that (b) is not (we indicate this by marking it with a `*').

Here are some simple rules for English grammar, with examples. A sentence consists of a noun phrase, such as the user, followed by a modal or an auxiliary verb, such as should, followed by a verb phrase, such as clean the printer:

A noun phrase can consist of a determiner, or article, such as the or a, and a noun, such as printer (a). In some circumstances, the determiner can be omitted (b).

`Sentence' is often abbreviated to S, `noun phrase' to NP, `verb phrase' to VP, `auxiliary' to AUX, and `determiner' to DET. This information is easily visualized by means of a labelled bracketing of a string of words, as follows, or as a tree diagram, as in the figure `A Tree Structure for a Simple Sentence'.
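As an illustration, the sketch below encodes the structure just described for The user should clean the printer as nested tuples and prints a labelled bracketing of it. The tuple encoding and the function name `bracket` are assumptions made for this illustration, and the bracketing style it prints is only one of several in common use.

```python
# A tree is (label, daughter, daughter, ...); a bare string is a word.
# This encoding is a hypothetical choice made for the sketch.
TREE = ("S",
        ("NP", ("DET", "the"), ("N", "user")),
        ("AUX", "should"),
        ("VP", ("V", "clean"),
               ("NP", ("DET", "the"), ("N", "printer"))))


def bracket(node):
    """Return a labelled bracketing such as [NP [DET the] [N user]]."""
    if isinstance(node, str):
        return node
    label, *daughters = node
    return "[" + label + " " + " ".join(bracket(d) for d in daughters) + "]"


print(bracket(TREE))
# [S [NP [DET the] [N user]] [AUX should] [VP [V clean] [NP [DET the] [N printer]]]]
```

Each pair of brackets corresponds to one node of the tree diagram, and the label after the opening bracket gives its category.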

  
Figure: A Tree Structure for a Simple Sentence

The auxiliary verb is optional, as can be seen from ( ), and the verb phrase can consist of just a verb (such as stopped):

NP and VP can contain prepositional phrases (PPs), made up of prepositions (on, in, with, etc.) and NPs:

The reader may recall that traditional grammar distinguishes between phrases and clauses. The phrases in the examples above are parts of the sentence which cannot be used by themselves to form independent sentences. Taking The printer stopped, neither its NP nor its VP can be used as an independent sentence:

By contrast, many types of clause can stand as independent sentences. For example, (a) is a sentence which consists of a single clause, The printer stopped. As the bracketing indicates, (b) consists of two clauses co-ordinated by and. The sentence (c) also consists of two clauses, one (that the printer stops) embedded in the other, as a sentential complement of the verb.

There is a wide range of criteria that linguists use for deciding whether something is a phrase and, if it is, what sort of phrase it is, that is, what category it belongs to. As regards the first issue, the leading idea is that phrases consist of classes of words which normally group together. If we consider the example The user should clean the printer again, one can see that there are good reasons for grouping the and user together as a phrase, rather than grouping user and should. The point is that the and user can be found together in many other contexts, while user and should cannot.

As regards what category a phrase like the user belongs to, one can observe that it contains a noun as its `chief' element (one can omit the determiner more easily than the noun), and the positions it occurs in are also the positions where one gets proper nouns (e.g. names such as Sam). This is not to say that questions about constituency and category are all clear cut. For example, we have supposed that auxiliary verbs are part of the sentence, but not part of the VP. One could easily find arguments to show that this is wrong, and that should clean the printer should be a VP, just like clean the printer, giving a structure like the following, shown in the figure `An Alternative Analysis':

Moreover, from a practical point of view, making the right assumptions about constituency can be important, since wrong ones can force us to write grammars that are much more complex than necessary. For example, suppose that we decided that determiners and nouns did not, in fact, form constituents. Instead of being able to say that a sentence is an NP followed by an auxiliary, followed by a VP, we would have to say that it was a determiner followed by a noun, followed by an auxiliary, followed by a VP. This may not seem like much, but notice that we would have to complicate the rules we gave for VP and for PP in the same way. Not only this, but our rule for NP is rather simplified, since we have not allowed for adjectives before the noun, or PPs after the noun. So everywhere we could have written `NP', we would have to write something very much longer. In practice, we would quickly see that our grammar was unnecessarily complex, and simplify it by introducing something like an NP constituent.

  
Figure: An Alternative Analysis

For convenience, linguists often use a special notation to write out grammar rules. In this notation, a rule consists of a `left-hand-side' (LHS) and a `right-hand-side' (RHS) connected by an arrow (→):

S → NP (AUX) VP
VP → V (NP) PP*
NP → (DET) (ADJ) N PP*
PP → P NP
N → user
N → users
N → printer
N → printers
V → clean
V → cleans
AUX → should
DET → the
DET → a
P → with

The first rule says that a Sentence can be rewritten as (or decomposes into, or consists of) an NP followed by an optional AUX, followed by a VP (optionality is indicated by brackets). Another rule says that a PP can consist of a P and an NP. Looked at the other way, the first rule can be interpreted as saying that an NP, an AUX and a VP make up a sentence. Items marked with a star (`*') can appear any number of times (including zero), so the second rule allows there to be any number of PPs in a VP. The rules with `real words' like user on their RHS serve as a sort of primitive dictionary. Thus the first of these says that user is a noun, and the fifth that clean is a verb. Since the NP rule says that an N by itself can make up an NP, we can also infer that printers is an NP, and since (by the VP rule) a V and an NP make up a VP, clean printers is a VP. Thus, a grammar such as this gives information about what the constituents of a sentence are, and what categories they belong to, in the same way as our informal rules at the start of the section.
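The inferences just described (printers is an NP, clean printers is a VP) can be checked mechanically. The following is a minimal sketch of a recogniser for the grammar above, not an efficient or standard parsing algorithm. The data layout and the names `RULES`, `LEXICON`, `ends`, `match` and `is_a` are assumptions made for this illustration.

```python
# Each phrase rule is (LHS, RHS). An RHS item is (category, mark), where
# mark is "" (exactly once), "?" (optional, written with brackets in the
# text) or "*" (any number of times, including zero).
RULES = [
    ("S",  [("NP", ""), ("AUX", "?"), ("VP", "")]),
    ("VP", [("V", ""), ("NP", "?"), ("PP", "*")]),
    ("NP", [("DET", "?"), ("ADJ", "?"), ("N", ""), ("PP", "*")]),
    ("PP", [("P", ""), ("NP", "")]),
]
# The dictionary rules; the grammar lists no ADJ words, so none appear here.
LEXICON = {
    "user": "N", "users": "N", "printer": "N", "printers": "N",
    "clean": "V", "cleans": "V", "should": "AUX",
    "the": "DET", "a": "DET", "with": "P",
}


def ends(cat, words, i):
    """Positions j such that words[i:j] can be analysed as a `cat`."""
    result = set()
    if i < len(words) and LEXICON.get(words[i]) == cat:
        result.add(i + 1)
    for lhs, rhs in RULES:
        if lhs == cat:
            result |= match(rhs, words, {i})
    return result


def match(rhs, words, starts):
    """Positions reachable from `starts` after matching every RHS item."""
    for item, mark in rhs:
        step = {j for s in starts for j in ends(item, words, s)}
        if mark == "?":
            starts = starts | step          # skip the item or use it once
        elif mark == "*":
            reached = set(starts)           # zero repetitions
            while step - reached:           # keep adding repetitions
                reached |= step
                step = {j for s in step for j in ends(item, words, s)}
            starts = reached
        else:
            starts = step                   # the item is obligatory
    return starts


def is_a(cat, text):
    """True if the whole word string can be analysed as a `cat`."""
    words = text.split()
    return len(words) in ends(cat, words, 0)


print(is_a("NP", "printers"))                            # True
print(is_a("VP", "clean printers"))                      # True
print(is_a("S", "the user should clean the printer"))    # True
print(is_a("S", "user should the printer"))              # False
```

The sketch relies on the fact that no rule in this grammar begins with its own LHS category, so every recursive call consumes at least one word first; a grammar with such left-recursive rules would need a different strategy.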

Returning to the tree representation in the figure `A Tree Structure for a Simple Sentence', each node in the tree (and each bracketed part of the string representation) corresponds to the LHS of a particular rule, while the daughters of each node correspond to the RHS of that rule. If the RHS has two constituents, as in NP → DET N, there will be two branches and two daughters; if there are three constituents, there will be three branches and three daughters, and so on.
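This correspondence between nodes and rules can be made concrete by reading the rules back off a tree. The sketch below assumes a hypothetical encoding of the simple sentence tree as nested tuples; the function names are likewise illustrative.

```python
# A tree is (label, daughter, daughter, ...); a bare string is a word.
TREE = ("S",
        ("NP", ("DET", "the"), ("N", "user")),
        ("AUX", "should"),
        ("VP", ("V", "clean"),
               ("NP", ("DET", "the"), ("N", "printer"))))


def label(node):
    """The category of a node, or the word itself for a leaf."""
    return node if isinstance(node, str) else node[0]


def rules_used(node):
    """List (LHS, RHS) for every node, mother first, then its daughters."""
    if isinstance(node, str):
        return []
    lhs, *daughters = node
    rules = [(lhs, [label(d) for d in daughters])]
    for d in daughters:
        rules.extend(rules_used(d))
    return rules


for lhs, rhs in rules_used(TREE):
    print(lhs, "->", " ".join(rhs))
```

The first line printed is `S -> NP AUX VP`: the S node has three daughters because the RHS used there has three constituents, while each NP node has two, matching NP → DET N.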

It is worthwhile to have some terminology for talking about trees. Looking from the top, the trees above start from (or `are rooted in') a sentence node, the LHS of our sentence rule. Near the bottom of the trees, we have a series of nodes corresponding to the LHSs of dictionary rules and, immediately below them at the very bottom of the trees, actual words from the corresponding RHSs of the dictionary rules. These words are called the `leaves' or terminal nodes of the tree. It is normal to speak of `mother' nodes and `daughter' nodes (e.g. the S node is the mother of the NP, AUX, and VP nodes), and of mothers `dominating' daughters.

In practice most sentences are longer and more complicated than our example. If we add adjectives and prepositional phrases, and some more words, more complex trees can be produced, as shown in the figure `A More Complex Tree Structure'. There, the NP which is the left daughter of the S node contains an adjective and a noun but no determiner (the NP rule in our grammar above allows for noun phrases of this form), and the NP in the VP contains a determiner and a PP.
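The tree terminology introduced above (root, leaves, mothers, daughters, dominance) can be sketched in code. The example below uses the simple sentence tree rather than the more complex one, and addresses each node by its path of daughter positions from the root. Both the path scheme and the function names are assumptions made for this illustration, and dominance is taken here in the transitive sense (a node dominates its daughters, their daughters, and so on), which goes slightly beyond what the text states.

```python
# A tree is (label, daughter, daughter, ...); a bare string is a word.
TREE = ("S",
        ("NP", ("DET", "the"), ("N", "user")),
        ("AUX", "should"),
        ("VP", ("V", "clean"),
               ("NP", ("DET", "the"), ("N", "printer"))))

# A node is addressed by a path: () is the root, (0,) its first daughter,
# (2, 1) the second daughter of its third daughter, and so on.


def node_at(tree, path):
    """Follow a path of daughter positions down from the root."""
    for i in path:
        tree = tree[i + 1]          # slot 0 of each tuple holds the label
    return tree


def leaves(node):
    """The words at the bottom of the tree, from left to right."""
    if isinstance(node, str):
        return [node]
    return [w for d in node[1:] for w in leaves(d)]


def mother(path):
    """Path of the mother node, or None for the root."""
    return path[:-1] if path else None


def dominates(a, b):
    """Node a dominates node b if a's path is a proper prefix of b's."""
    return len(a) < len(b) and b[:len(a)] == a


print(node_at(TREE, ())[0])                 # S, the root
print(leaves(TREE))
print(node_at(TREE, mother((2, 1)))[0])     # VP, mother of the object NP
print(dominates((), (2, 1)), dominates((0,), (2, 1)))   # True False
```

Paths are used instead of the tuples themselves because the two NP nodes have different positions even where their internal structure looks alike.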

A large collection of such rules will constitute a formal grammar for a language --- formal, because it attempts to give a mathematically precise account of what it is for a sentence to be grammatical. As well as being more concise than the informal descriptions at the beginning of the section, the precision of formal grammars is an advantage when it comes to providing computational treatments.

  
Figure: A More Complex Tree Structure

We should emphasise that the little grammar we have given is not the only possible grammar for the fragment of English it is supposed to describe. The question of which grammar is `best' is a matter for investigation. One question is that of completeness: does the grammar describe all sentences of the language? In this respect, one can see that our example above is woefully inadequate. Another issue is whether a grammar is correct in the sense of allowing only sentences that are in fact grammatical: our example grammar falls down in this respect, since it allows the examples in ( ), among many others.
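To see how readily the grammar allows ungrammatical strings, one can enumerate what it generates. The sketch below covers only a finite part of the grammar (it leaves out ADJ and PP*, which is an assumption made to keep the set finite), and the two strings it checks are examples chosen for this illustration.

```python
from itertools import product

DETS = ["", "the", "a"]                 # "" stands for an omitted (DET)
NOUNS = ["user", "users", "printer", "printers"]
VERBS = ["clean", "cleans"]
AUXES = ["", "should"]                  # "" stands for an omitted (AUX)


def join(parts):
    """Join the non-empty parts with single spaces."""
    return " ".join(p for p in parts if p)


# NP -> (DET) N, ignoring (ADJ) and PP*
NPS = [join(p) for p in product(DETS, NOUNS)]
# VP -> V (NP), ignoring PP*
VPS = [join(p) for p in product(VERBS, [""] + NPS)]
# S -> NP (AUX) VP
SENTENCES = {join(p) for p in product(NPS, AUXES, VPS)}

for s in ["a users cleans printer", "the user should cleans a printers"]:
    print(s, "->", s in SENTENCES)
```

Both strings are reported as generated, because nothing in the rules relates the form of the determiner to that of the noun, or the form of the verb to the subject or the auxiliary.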

A grammar may also be incorrect in associating constituents with the wrong categories. For example, as we noted above, one would probably prefer a grammar which recognizes that determiners and nouns make up NPs, and that the NPs that occur in S (i.e. subject NPs) and those that appear in VP (object NPs) are the same (as our grammar does) to a grammar which treats them as belonging to different categories. The latter would suggest (wrongly) that there are things that can appear as subjects, but not as objects, and vice versa. This is obviously not true, except for some pronouns that can appear as subjects but not as objects: I, he, she, etc. A worse defect of this kind is the treatment of words: the grammar gives far too little information about them, and completely misses the fact that clean and cleans are actually different forms of the same verb. We will show how this problem can be overcome in a later chapter.

In a practical context, a further issue is how easy it is to understand the grammar, and to modify it (by extending it, or fixing mistakes), and how easy it is to use it for automatic processing (an issue to which we will return). Of course, all these matters are often related.






Arnold D J
Thu Dec 21 10:52:49 GMT 1995