web-sls 97.04 03.11.97

WEB-SLS

The European Student Journal of Language and Speech

 


 

Translation Completeness for Context-Free Grammars

 

Willem-Olaf Huijsen

 

Utrecht Institute for Linguistics OTS, Utrecht University

Trans 10, NL-3512 JK, Utrecht, The Netherlands

Willem-Olaf.Huijsen@let.ruu.nl

 


Note: The HTML markup for this article includes three fonts: Times, Arial (Helvetica) and Symbol. The symbol font is used for the many mathematical expressions in the article and needs to be resident on your machine - it is not downloaded with this article. Please report any difficulties to the Technical Editor, Mark Tatham - Mark.Tatham@essex.ac.uk.


  1. Introduction
  2. Motivation
  3. Compositional Grammars
  4. Compositional Translation
  5. Completeness of Machine Translation
  6. CFG-based Compositional Grammar
  7. Functional Category Correspondence
  8. Relational Category Correspondence
  9. Conclusion
  10. References

1 Introduction

Systems for translation of controlled language require the source text to be expressed within severe syntactic and lexical limits. One of the objectives of such systems is that an author who conforms to the restrictions is rewarded with a reliable and fully automatic translation of the text into one or more target languages. Thus a proof of the completeness of such systems is of great importance. A machine-translation system is complete if and only if all expressions that are correct according to the source language grammar have at least one translation in the target language. The present research is inspired by the method of compositional machine translation developed in the Rosetta project [Rosetta 1994]. It focuses on the provability of completeness for relatively simple grammar formalisms, which are more appropriate for machine translation of controlled languages.

Section 2 elaborates on the motivation of our research.

Sections 3 and 4 describe compositional grammar and compositional translation.

Section 5 then presents the theme of the paper, completeness of compositional translation.

Section 6 shows how context-free grammars can be rephrased as compositional grammars.

Sections 7 and 8 work out completeness conditions for compositional grammars based on context-free grammars. Finally,

Section 9 recapitulates the findings and sketches future research.

top of article


2 Motivation

The research reported in this paper is motivated primarily by the potential benefit of the use of controlled language to the reliability of machine translation. A controlled language is a precisely defined alteration of a natural language, on the one hand constrained in its lexicon, grammar and style, on the other hand possibly extended with domain specific terminology and grammatical constructions. Controlled languages are mostly used for technical texts, procedures, and instructions, which discuss limited and well defined domains, so that the restrictions imposed are more readily accepted. Due to the reduction in ambiguity, redundancy, lexicon size, and grammatical complexity, texts written in a controlled language become easier to analyse and process for humans and computers. Humans may find such texts easier to read and understand, so that they can execute described tasks more efficiently and effectively. Computational processing can also benefit from the use of controlled language. Natural language is notoriously difficult for computers to process. By choosing the controlled version of a natural language right, it can become possible to guarantee successful computational processing of texts written in that controlled language. In the case of a machine-translation system, adherence to a controlled source language could guarantee the successful analysis of any source-language text. In this paper we seek to guarantee that the subsequent translation will succeed as well. That is, we find conditions under which the machine-translation system can be guaranteed to be complete. Thus, authors who conform to a controlled language that includes such a completeness condition could be rewarded with a reliable and fully automatic translation of the text into one or more languages.

We now discuss two practical examples of the combination of controlled language and machine translation. This is the type of system we have in mind for the application of our research. Already in the 1960s, Caterpillar Inc. created one of the first controlled languages, Caterpillar Fundamental English (CFE) [Caterpillar 1974]. Currently, its successor, Caterpillar Technical English (CTE), is used to facilitate the translation in up to 35 languages of the large volume of documentation (over 100,000 new pages each year) [Hayes et al. 1996]. An exemplary success story on the combination of controlled language and machine translation is that of Perkins Engines Ltd., a world-wide leader in the manufacturing of diesel and other engines. The frequent introduction of new products and the modification of existing products calls for the rapid production of documentation in five languages — English, French, German, Spanish, and Italian. To simplify English publication for non-native speakers and to aid translation — be it conventional or computer-assisted — Perkins introduced a controlled language called Perkins Approved Clear English (PACE) in 1980 [Pym 1990, Newton 1992]. Through the combination of PACE with Weidner's MicroCat machine-translation system, Perkins obtained an ensured consistency of the use of terminology and greatly reduced translation time and translation costs, leading to a higher quality of the documentation and a shorter time-to-market, thus gaining a competitive edge.

top of article


3 Compositional Grammars

Compositional translation assumes that the source language (SL) and the target language (TL) are defined as compositional grammars, i.e. grammars that obey the well-known compositionality principle (cf. [Partee et al. 1993, Janssen 1986]). We define a compositional grammar as a finite set of basic expressions BE, a finite set of syntactic rules Rules, a finite set of syntactic categories Cats, and a syntactic type-assignment function Type(.). Basic expressions are, roughly, the smallest meaningful units in a language (more or less the stems of content words). Syntactic rules are operations that recursively build derived expressions from basic expressions. Expressions are either basic or derived expressions. Syntactic categories describe the syntactic properties of expressions. Basic expressions b all have a syntactic category Cat(b). Syntactic rules restrict their arguments to certain categories, and specify the category of the derived expression they yield. The syntactic type-assignment function associates every syntactic rule R with an ordered pair, Type(R), consisting of the so-called argument list, ArgList(R), of the categories of its arguments and its result category, ResCat(R). The arity of a syntactic rule is the number of categories in the rule's argument list. We require that syntactic rules are total: They must be applicable to any combination of arguments that matches their argument list.

According to the compositionality principle the semantic component of a compositional grammar assigns an object from a semantic domain to each basic expression and an operation on objects in that domain to each syntactic rule. We define translation as a relation between expressions, assuming that the translation relation implies equivalence of meaning. We do not include a semantic component (cf. Montague's Universal Grammar, [Thomason 1974]), because its inclusion would obscure our general discussion, and there is no need for a separate semantic component in the context of controlled languages where syntactic elements can be assumed to be unambiguous.

Derivational histories of expressions are represented by syntactic derivation trees. A syntactic derivation tree t is either a tree consisting of a single node b, where b is the name of a basic expression, or a tree of the form R[t1, ..., tn], where R is the name of a syntactic rule, and t1, ..., tn is an ordered list of syntactic derivation trees. However, not all syntactic derivation trees describe expressions: For a given syntactic derivation tree R[t1, ..., tn], assume that the subtrees t1, ..., tn describe expressions e1, ..., en of certain syntactic categories C1, ..., Cn. If these categories do not match the syntactic categories of the argument list of syntactic rule R, then the rule is not applicable to expressions e1, ..., en, so that the syntactic derivation tree R[t1, ..., tn] does not describe an expression. This issue is formalised as the notion of well-formedness. A syntactic derivation tree t is a well-formed syntactic derivation tree of syntactic category C if and only if it consists of a single basic expression b of category C, or if tree t is of the form R[t1, ..., tn], where the arity of R is n; subtrees t1 through tn are well-formed syntactic derivation trees of categories ArgList(R), respectively; and category C is ResCat(R). The syntactic category of a syntactic derivation tree is denoted Cat(R). For convenience, we sometimes annotate syntactic derivation trees with their syntactic category t:Cat(t).

top of article


4 Compositional Translation

Compositional translation between compositional grammars is based on the notion of translation equivalence of their syntactic elements. We assume that for compositional translation between grammars G and G', a translation-equivalence relation `~' is specified between their basic expressions and their syntactic rules: ~ Í BEG x BEG' È RulesG x RulesG', where translation equivalent rules must have the same arity. Given this, we say that syntactic derivation tree t of grammar G is translation equivalent to syntactic derivation tree t' of grammar G', denoted t ~ t', if and only if the trees are isomorphic and all corresponding syntactic elements are translation equivalent. For compositional grammars G and G', the compositional translation of an SL expression is a set of TL expressions, obtained as follows:

Source-Language Expression

 

Target-Language Expression

¯ analysis

 

generation ­

SL Syntactic Derivation Trees

® transfer

TL Syntactic Derivation Trees

Figure 1. The Method of Compositional Translation

Analysis performs morphological and syntactic analysis of an SL expression, yielding the set of all syntactic derivation trees that correspond to the expression; transfer of an SL syntactic derivation tree yields the set of all translation-equivalent TL syntactic derivation trees; and generation of a well-formed syntactic derivation tree produces the corresponding TL expression.

top of article


5 Completeness of machine translation

An important question regarding the reliability of compositional translation is what we call the completeness issue: Can the translation method be guaranteed to produce at least one translation for all SL expressions? This description does not make precise from which stage onward translation must be guaranteed to succeed, however. Depending on this, one may distinguish at least two levels of completeness. Naively, one would like a machine-translation system to produce at least one translation for every well-formed SL expression. We call this expression completeness. However, it is well-known that natural-language expressions are often ambiguous, so that one expression may have several distinct syntactic analyses. For each of these analyses, such an ambiguous expression may have a different translation. In that event, a machine-translation system should be able to provide at least one translation for each of the syntactic analyses of the SL expression. We define syntactic completeness as the case where, for each syntactic derivation tree of each well-formed SL expression, translation yields at least one well-formed TL expression. The central issue of the paper is the question of how to guarantee syntactic completeness: What conditions on the SL and TL grammars are sufficient for guaranteeing that, after successful analysis, transfer and generation produce a well-formed TL expression? Generation evaluates the syntactic derivation trees yielded by transfer by recursive rule application. As stated in Section 3, we assume that all syntactic rules are total for the categories of their arguments. Rule application therefore succeeds if and only if the arguments are of the correct categories. To ensure this we must move `upstream' to transfer. Transfer simply replaces the basic expressions and syntactic rules of the source language by translation-equivalent counterparts of the target language. An obvious necessary and sufficient condition for the success of transfer is that there be at least one translation-equivalent counterpart in the TL grammar for each possible syntactic element in the SL syntactic derivation tree. A compositional grammar pair satisfying this condition is called a homomorphic grammar pair. However, grammar homomorphism is only a necessary -- but not a sufficient -- condition for completeness. It guarantees that for every well-formed SL semantic derivation tree there is a translation-equivalent TL syntactic derivation tree, but it does not ensure that this TL syntactic derivation tree is well-formed. Sections 7 and 8 work out conditions that do guarantee this (for compositional grammars based on context free grammars, which are presented in the next section).

top of article


6 CFG-Based Compositional Grammar

A CFG grammar can be rephrased as a compositional grammar in the following way: (i) The nonterminals become the syntactic categories of the compositional grammar. (ii) The lexical rules, i.e. the rules of the form A ® j , where j consists of terminal symbols only, become basic expressions b of the form j : A, where Cat(b) = A. (iii) The rewrite rules of the CFG are mapped onto compositional rules as follows. The general format of a rewrite rule is C ® j 0 C1 j 1 ... j n-1 Cn j n, where each j i is a string of terminal symbols, and C, C1, ..., Cn are nonterminals. This rule is mapped onto a compositional rule of type <<Ci1, ..., Cin>,C>, of which operation R(x1 ,..., xn) is defined as j 0 x1 j 1 ... j n-1 xn j n. Ci1, ..., Cin is a permutation of C1, ..., Cn, which implies that the order of the arguments of the rule need not correspond to the order of the nonterminals in the right-hand side of the rewrite rule. However, the latter order does correspond to the concatenation order of the operation that the rule performs.

What about the `translation power' of CFG-based compositional grammars? Our translation method demands that SL basic expressions correspond to TL basic expressions, and that SL syntactic rules correspond to TL syntactic rules of the same arity. This restricts the translation power considerably. The latitude in the translation relation amounts to the following. SL basic expressions can, obviously, be translated into different TL basic expressions. In the syntactic rules, the nonterminals need not occur in the same order as in the argument list. This allows translation-equivalent rules to describe word-order differences between languages. Syntactic rules may also introduce lexical material in addition to that of the arguments. Finally, there is an amount of freedom in the correspondence between the SL and TL syntactic categories.

top of article


7 Functional Category Correspondence

This section shows how a restriction of the correspondence between SL and TL syntactic categories can lead to completeness. First we formally define such a restriction.

Definition 1 Functional Category Correspondence

There is a functional category correspondence between compositional grammars G and G' if and only if there is a function f: Cats(G) ® Cats(G') such that:

(R ~ R' Ù TypeG(R) = <<c1,...,cn>, c>) Þ TypeG'(R') = <<f(c1), ..., f(cn)>,f(c)>

The restriction of compositional grammars to functional category correspondence, in conjunction with the grammar homomorphism condition gives us completeness:

Theorem 1. CFG Completeness for Functional Category Correspondence

If a compositional grammar pair <G, G'> is homomorphic from G to G' and there is a functional category correspondence f between their syntactic categories, then compositional translation from G to G' is complete, i.e. for every well-formed SL syntactic derivation tree t of category C there is a well-formed translation-equivalent TL syntactic derivation tree t' of category f(C).

Proof: By induction on the depth m of SL syntactic derivation trees t.

Induction Base m = 0. An SL syntactic derivation tree of depth 0 is an SL basic expression b. Homomorphism guarantees that there is at least one translation-equivalent TL basic expression b'. Because of the functional category correspondence f we know that the category of b' is f(C), where C is the category of b. Basic expressions are trivially well-formed syntactic derivation trees.

Induction Step m+1. Every well-formed SL syntactic derivation tree of depth m+1 is of the form R[t1, ..., tn]:A, where for each 1 £ i £ n, subtree ti is of category Ci, for ArgList(R) = <C1,...,Cn>. Homomorphism guarantees that rule R has at least one translation-equivalent syntactic rule R' with ArgList(R') = <C'1, ..., C'n>, say. Because of the functional category correspondence f we know that <C'1, ..., C'n> = <f(C1), ..., f(Cn)>, and that ResCat(R') = f(ResCat(R)). Since for each 1 £ i £ n, subtree ti is of depth m or less, we may apply the induction hypothesis, which yields well-formed translation-equivalent syntactic derivation trees t'1, ..., t'n of respective categories f(C1), ..., f(Cn). By the definition of translation equivalence we may now derive that R[t1, ..., tn] ~ R'[t'1, ..., t'n], the latter of which is a well-formed TL syntactic derivation tree of category ResCat(R') = f(ResCat(R)).

top of article


8 Relational Category Correspondence

The functional category correspondence condition is rather restrictive: Every SL expression must be translated into expressions of one and the same TL syntactic category. A looser category correspondence is preferable. For example, consider the following grammars for English and French noun phrases, which reflect the fact that French uses agreement on determiners and nouns:

English Syntax

   

French Syntax

R1 : NP ® DET N

~   R'1a : NP' ® DET'm N'm
  ~   R'1b : NP' ® DET'f N'f

b1 : DET ® the

~   b'1a : DET'm ® la
  ~   b'1b : DET'f ® le

b2 : N ® bike

~   b'2 : N'f ® vélo

b3 : N ® car

~   b'3 : N'm ® voiture

Table 2. English/French CFG Grammar Pair.

Here we would like to associate the SL syntactic category DET with the TL syntactic categories DET'm and DET'f, and the SL syntactic category N with the TL syntactic categories N'm and N'f. In order to be able to do so it must be allowed that every SL syntactic category is associated with a number of TL syntactic categories, instead of with just one. We call this relational category correspondence:

Definition 2 Relational Category Correspondence, Category-Correspondence Set ~{C}

There is a relational category correspondence between compositional grammars G and G' if and only if there is a relation Rel Í CatsG x CatsG' such that:

" b Î BEG, b' Î BEG' b ~ b' Þ <CatG(b), CatG'(b')> Î Rel

R Î RulesG, R' Î RulesG'

(R ~ R' Ù TypeG(R) = <<c1,...,cn>,c>) Þ TypeG'(R') = <<c1',...,cn'>,c'>,

where " i (1 £ i £ n) <ci,ci'> Î Rel and <c,c'> Î Rel

For any SL syntactic category C the set of corresponding TL syntactic categories {C' | <C, C'> Î Rel} is called the category-correspondence set of C and is denoted ~{C}.

For this new situation we must adjust the completeness condition. The naive extension of the previous condition is unrealistic: In the English/French example it would require a French syntactic rule for all four argument lists <DET'm, N'm>,<DET'm, N'f>,<DET'f, N'm>, and <DET'f, N'f>. But the requirement, for example, that there be a TL syntactic rule R' that combines a masculine determiner DET'm and a feminine noun N'f is nonsensical. The underlying problem is that agreement dependencies cannot be expressed explicitly in the CFG grammar formalism, while they should be taken into account. We present a way of encoding this information, and then use this to arrive at completeness. First, we distinguish two kinds of category correspondence:

Definition 3 Conjunctive/Disjunctive Category

For the translation between compositional grammars G and G', an SL syntactic category C is a conjunctive category if and only if for every well-formed SL syntactic derivation tree t of category C, for every corresponding category C' in ~{C}, there exists at least one translation-equivalent well-formed TL syntactic derivation tree t' of category C'. Any SL syntactic category that is not a conjunctive category is called a disjunctive category. Of course, SL syntactic categories with only one TL syntactic category in their category-correspondence set are always conjunctive categories.

For example, in the case of the English/French NP rules, category DET corresponds conjunctively to categories DET'm and DET'f (any determiner has both a masculine and a feminine form), whilst category N corresponds disjunctively to categories N'm and N'f (in general, nouns can be claimed to have either masculine or feminine gender, not both). Category NP corresponds to only one category, NP', and is therefore a conjunctive category.

How can we use this to determine under which conditions completeness is guaranteed? The key idea is that some of the well-formed SL syntactic derivation trees of some category C may be guaranteed to translate into at least one well-formed TL syntactic derivation tree for all categories in ~{C}, instead of ‘for at least one'. Category C then corresponds conjunctively to the categories in ~{C}. As opposed to a disjunctive category, a conjunctive category does not require every rule R' to have translation-equivalent variants for all categories in ~{C}. Hence for every SL basic expression b, we simply require that if its category C is a disjunctive category, then for at least one category C' in ~{C}, there must be at least one translation-equivalent TL basic expression b' of category C'. But if category C of SL basic expression b is a conjunctive category, then for every category C' in ~{C}, there must exist at least one translation-equivalent TL basic expression b' of category C'. With respect to the syntactic rules, we impose conditions on their translation-equivalent TL counterparts. When deriving a TL syntactic derivation tree from an SL syntactic derivation tree for subtrees ti that are of a conjunctive category C, we can guarantee a tree ti' for every category in ~{C}. For subtrees ti that are of a disjunctive category C we can guarantee a tree ti for only one category in ~{C}, though we may not know which one. However, the theorem below says that there is completeness if the following condition is satisfied:

The Category-Correspondence Condition — For any rule R with argument list ArgList(R) = <C1, ..., Cn>, define ArgListC(R) as the tuple consisting of the occurrences of conjunctive categories in ArgList(R), and define ArgListD(R) as the tuple consisting of the occurrences of disjunctive categories in ArgList(R). Now, the category-correspondence condition is met if and only if for every SL rule R of type <<C1, ..., Cn>,C>, where ArgListC(R) = <Ci1, ..., Cimc> and ArgListD(R) = <Cj1, ..., Cjmd>, it holds that for every tuple <C'j1, ..., C'jmd> such that C'j1 Î ~{Cj1}, ..., C'jmd Î ~{Cjmd}, there is at least one TL rule R' of type <<B1, ..., Bn>,B>, such that:

Theorem 2 CFG Completeness for Relational Category Correspondence

For any CFG-based compositional grammar pair <G, G'>, compositional translation from G to G' is complete if the grammar pair is homomorphic from G to G' and there is a relational category correspondence between the SL and TL syntactic categories that meets the above category-correspondence condition.

Because of space limitations we do not include the proof. We trust that the above discussion suggests how the proof can be given.

top of article


9 Conclusion

In this paper we presented the issue of completeness for compositional translation, and discussed two conditions under which completeness can be shown to hold. We examined the completeness issue for context-free grammars and established completeness conditions for grammars with a functional category correspondence. As this condition is rather restrictive, we relaxed it to relational category correspondence. A first attempt led to unrealistic conditions on the grammar rules. We countered that by introducing the distinction between conjunctive and disjunctive categories. We adjusted the relational category-correspondence condition accordingly, and obtained a completeness condition for grammars with a relational category correspondence.

Further research is concerned with proving completeness for more sophisticated grammar formalisms and more complicated translation relations. An example of the former is our current study of definite-clause grammars. An example of the latter is our study of so-called polynomial translation. The basic idea here is the generalisation of the unit of translation equivalence from single syntactic elements to combinations of these, algebraically known as polynomials. Polynomial translation provides a high degree of freedom in expressing translation relations. On the other hand, proving completeness may become a fairly complex matter.

top of article


10 References

[Caterpillar 1974] Caterpillar, 1974, Dictionary for Caterpillar Fundamental English. Caterpillar Corporation, East Peoria, Illinois.

[Hayes et al. 1996] Hayes, Phil, Maxwell, Steve, and Schmandt, Linda, 1996, Controlled English Advantages for Translated and Original English Documents. In the Proceedings of the First International Workshop on Controlled Language Applications, Leuven, Belgium, pp. 84-92.

[Janssen 1986] Janssen, T.M.V., Foundations and Applications of Montague Grammar. Mathematisch Centrum (now CWI), Amsterdam, The Netherlands.

[Newton 1992] Newton, John, 1992, The Perkins Experience. In John Newton (ed.), Computers in Translation: A Practical Appraisal, Routledge, London, pp. 46-57.

[Partee et al. 1993] Partee, Barbara H., Alice ter Meulen and Robert E. Wall, 1993. Mathematical Methods in Linguistics (second edition). Kluwer Academic Publishers, Dordrecht, The Netherlands.

[Pym 1990] Pym, P.J., 1990, Pre-Editing and the Use of Simplified Writing for MT: An Engineer's Experience of Operating an MT System. In Pamela Mayorcas (ed.), Translating and the Computer 10: The Translation Environment 10 Years On, Aslib, pp. 80-95.

[Rosetta 1994] Rosetta, M.T., 1994. Compositional Translation. Kluwer Academic Publishers, Dordrecht, The Netherlands.

[Thomason 1974] Thomason, Richard H. (ed.), 1974. Formal Philosophy: Selected Papers of Richard Montague. Yale University Press, New Haven, Conneticut.


top of article


web-sls front page


This article accessed times since 1st May 1998.

web-sls article: #97.04
published: 3rd November 1997
archived: n/a

academic comments to: the author
publishing comments to: Mark Tatham, Technical and Publishing Editor

Your attention is drawn to the web-sls copyright notice on the Journal's front page.