In the previous section, we looked at a method of controlling the input to an MT system, simplifying it by avoiding certain uses of words and potentially ambiguous constructions. Since the success of the METEO MT system, which we mentioned briefly in Chapter , an important strand of MT has involved concentrating on what we could loosely call `MT for Special Purpose Languages', or sublanguage MT. Here, rather than imposing controls or simplifications on writers, one tries to exploit the restrictions in vocabulary and constructions that users of the language for specialized purposes normally accept, or simply observe without reflection. The term sublanguage refers to the specialized language used (predominantly for communication between experts) in certain fields of knowledge, for example, the language of weather reports, of stock market reports, of some kinds of medical discussion, and of aeronautical engineering. Specialized vocabulary is one characteristic of such `languages' (they typically contain words not known to the non-specialist and also words used in different or more precise ways). However, sublanguages are also often characterized by special or restricted grammatical patterns. In MT, it is quite common to use the term sublanguage rather loosely to refer not just to such a specialized language, but to its use in a particular type of text (e.g. installation manuals, instruction booklets, diagnostic reports, learned articles), or with a particular communicative purpose (communication between experts, giving instructions to non-experts, etc.).
The chief attraction of sublanguage and text type restriction to MT researchers is the promise of improved output, without the need to artificially restrict the input. Restricting the coverage to texts of particular types in certain subject domains allows one to profit from regularities and restrictions in syntactic form and lexical content. This may be important enough to permit significant simplification of the architecture, and certainly leads to a reduction in the overall coverage required. We reproduce an example of English-to-French output from METEO:
Of course, the language of meteorological reports is special in happening to combine a rather small vocabulary with a simple, telegraphic style of writing (notice in particular the complete absence of tenses from these extracts: the few verbs there are appear in non-finite forms). Nonetheless, a simplification of lexical and possibly syntactic coverage can be expected in less extreme cases. To give an example with respect to lexical coverage, it is reported that 114 of the 125 occurrences of the verb to match in a computer software manual translate into the Japanese icchi-suru, which is listed as one of the less frequent of the 15 translations given in a small-size English-Japanese dictionary. In the extract from a corpus of telecommunications text given below, traffic always corresponds to the French trafic and never to circulation (which applies only to road traffic). Moreover, the dictionary writer can safely ignore the meaning of both trafic and traffic concerning dealings in illegal merchandise (`drug traffic'). Such examples can be multiplied almost at will. Also, for an increasing number of sublanguages one can rely on the availability of a termbank (an on-line, possibly multilingual, terminological dictionary) defining and stating equivalences for many of the technical terms that will be encountered. This greatly eases the job of dictionary construction.
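The effect of sublanguage restriction on lexical choice can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not a description of how any actual MT system stores its dictionaries: the two lexicons, the domain label `telecommunications`, and the function name `candidate_translations` are all hypothetical, and the entries simply restate the traffic/trafic example.

```python
# Hypothetical toy lexicons; the entries restate the traffic/trafic example.
GENERAL_LEXICON = {
    "traffic": ["circulation", "trafic"],  # road traffic vs. other senses
}

# One restricted lexicon per sublanguage (the domain label is an assumption).
SUBLANGUAGE_LEXICONS = {
    "telecommunications": {"traffic": ["trafic"]},
}


def candidate_translations(word, domain=None):
    """Return the target candidates for `word`.

    If a lexicon exists for the given sublanguage and lists the word,
    only its (smaller) set of candidates is returned; otherwise the
    general lexicon is consulted.
    """
    restricted = SUBLANGUAGE_LEXICONS.get(domain, {})
    if word in restricted:
        return restricted[word]
    return GENERAL_LEXICON.get(word, [])


if __name__ == "__main__":
    print(candidate_translations("traffic"))
    print(candidate_translations("traffic", domain="telecommunications"))
```

Within the sublanguage the lookup yields a single candidate, so no disambiguation step is needed for this word; outside it, the choice between the candidates is still open.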
As for syntactic coverage, instruction manuals and other
informative documentation typically share a number of common features. There will probably be no idioms, and only a restricted set of sentential patterns. Another common feature is the relatively simple temporal dimension of the text, e.g. predominant use of the simple present. There is also the common occurrence of enumeration as a form of conjunction, usually either numbered or inset by dashes, etc. Some of these features can be seen by comparing the examples of English and French given below, which are drawn from a corpus of texts about telecommunications. All of these features are of great benefit to the developer or user of an MT system. For the developer, they mean that there are fewer problems of ambiguity, and development effort can be concentrated on a smaller range of constructions. For the user, this should mean that better coverage is obtained, and that the system performs better.
It is not, of course, the case that expository texts in different languages always exploit the same devices for a particular communicative purpose. The following extracts from the same corpus show that English and French differ in their use of impersonal constructions, with French favouring such constructions with the impersonal subject pronoun il (`it') far more in this type of text than English does. But even in these cases, it is generally easier to choose the correct translation, simply because the range of possibilities in such texts is smaller. (Literal translations of the phrases we have picked out would be: `It is advisable to take account of...', `It is manifestly much more difficult...', and `It is advisable to take...'.)
Text type can strongly influence translation, not just because certain syntactic constructions are favoured (e.g. conjunction by enumeration), but also by giving special meanings to certain forms. An example of how the text type can be useful in determining translational equivalents is the translation of infinitive verb forms from French or German into English. Infinitives normally correspond to English infinitives, but are usually translated as English imperatives in instructional texts. Thus, in a printer manual one would see (b) as the translation of (a), rather than the literal translation.
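The way text type can select between translational equivalents might be sketched as follows. This is an illustrative assumption rather than an account of any real system: the function name `render_infinitive`, the text type label `instructional`, and the sample phrase are hypothetical and are not taken from the examples in the text. The sketch works on the English side only, taking the bare form of the verb phrase as given.

```python
def render_infinitive(bare_verb_phrase, text_type):
    """Render a source-language infinitive in English.

    `bare_verb_phrase` is the English verb phrase in its bare form,
    e.g. "press the button" (a hypothetical sample). In instructional
    texts the infinitive is rendered as an imperative, which in English
    is just the bare form; elsewhere it is rendered as a to-infinitive.
    """
    if text_type == "instructional":
        return bare_verb_phrase
    return "to " + bare_verb_phrase


if __name__ == "__main__":
    print(render_infinitive("press the button", "instructional"))
    print(render_infinitive("press the button", "descriptive"))
```

The point of the sketch is only that the text type acts as a single extra parameter that removes a choice the system would otherwise have to make from context.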
Thus, concentration on a sublanguage not only restricts the vocabulary
and the number of source and target language constructions to be
considered, it can also restrict the number of possible target
translations. Given the potential that sublanguages provide for
improvements in the quality of output of MT systems, and the fact that
most commercial institutions do in fact have their major translation
needs in restricted areas, it is not surprising that many research
prototypes concentrate on restricted input in various ways, and that
the design of tools and resources supporting sublanguage analysis is a
major area of research.