In the previous section, we looked at a method of controlling the
input to an MT system, simplifying it by avoiding certain uses of
words, and avoiding potentially ambiguous constructions. Since the
success of the METEO MT system, which we mentioned briefly in an
earlier chapter, an important strand of MT has involved
concentrating on what we could loosely call `MT for Special Purpose
Languages', or sublanguage MT. Here, rather than imposing controls or
simplifications on writers, one tries to exploit the restrictions in
terms of vocabulary and constructions that users of the language for
specialized purposes normally accept, or simply observe without
reflection. The term sublanguage refers to the specialized
language used (predominantly for communication between experts) in
certain fields of knowledge, for example, the language of weather
reports, stockmarket reports, the language of some kinds of medical
discussion, the language of aeronautical engineering. Specialized
vocabulary is one characteristic of such `languages' (they typically
contain words not known to the non-specialist and also words used in
different or more precise ways). However, sublanguages are also often
characterized by special or restricted grammatical patterns. In MT,
it is quite common to use the term sublanguage rather loosely to refer
not just to such a specialized language, but to its use in a
particular type of text (e.g. installation manuals, instruction
booklets, diagnostic reports, learned articles), or with a
particular communicative purpose (communication between experts,
giving instructions to non-experts, etc.).
The chief attraction of sublanguage and text type restriction to MT researchers is the promise of improved output, without the need to artificially restrict the input. Restricting the coverage to texts of particular types in certain subject domains allows one to profit from regularities and restrictions in syntactic form and lexical content. This may be important enough to permit significant simplification of the architecture, and certainly leads to a reduction in the overall coverage required. We reproduce an example of English-to-French output from METEO:
Of course, the language of meteorological reports is special in happening to combine a rather small vocabulary with a simple, telegraphic style of writing (notice in particular the complete absence of tenses from these extracts: the few verbs there are appear in non-finite forms). Nonetheless, a simplification of lexical and possibly syntactic coverage can be expected in less extreme cases. To give an example with respect to lexical coverage, it is reported that 114 of the 125 occurrences of the verb to match in a computer software manual translate into the Japanese icchi-suru, which is listed as one of the less frequent of the 15 translations given in a small-size English-Japanese dictionary. In the extract from a corpus of telecommunications text given below, traffic always corresponds to the French trafic and never to circulation (which applies only to road traffic). Moreover, the dictionary writer can safely ignore the meaning of both trafic and traffic concerning dealings in illegal merchandise (`drug traffic'). Such examples can be multiplied almost at will. Also, for an increasing number of sublanguages one can rely on the availability of a termbank (an on-line, possibly multilingual, terminological dictionary) defining and stating equivalences for many of the technical terms that will be encountered. This greatly eases the job of dictionary construction.
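The lexical point can be made concrete with a minimal sketch. It assumes a toy lexicon layout that is not part of any system described here: the names `GENERAL_LEXICON`, `SUBLANGUAGE_LEXICON`, `candidate_translations` and `dominant_share` are all hypothetical, and the only data used are the traffic example and the 114-out-of-125 figure quoted above.

```python
from collections import Counter

# Hypothetical general-purpose entry: every sense a general dictionary lists.
GENERAL_LEXICON = {
    "traffic": ["circulation", "trafic"],
}

# Hypothetical sublanguage entry: only the senses that occur in the domain.
SUBLANGUAGE_LEXICON = {
    "telecommunications": {"traffic": ["trafic"]},
}


def candidate_translations(word, domain=None):
    """Return target candidates, preferring the sublanguage entry if one exists."""
    domain_entries = SUBLANGUAGE_LEXICON.get(domain, {})
    if word in domain_entries:
        return domain_entries[word]
    # Fall back to the full, more ambiguous general entry.
    return GENERAL_LEXICON.get(word, [])


def dominant_share(counts):
    """Return the most frequent translation label and its share of all occurrences."""
    total = sum(counts.values())
    label, freq = counts.most_common(1)[0]
    return label, freq / total


# Figures quoted in the text: 114 of 125 occurrences share one translation.
# The labels "preferred" and "others" are placeholders, not real dictionary forms.
match_counts = Counter({"preferred": 114, "others": 11})

print(candidate_translations("traffic", "telecommunications"))
print(candidate_translations("traffic"))
print(dominant_share(match_counts))
```

Within the sublanguage the lookup returns a single candidate, so no disambiguation step is needed for that word; outside it, both candidates remain and some further choice has to be made.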
As for syntactic coverage, instruction manuals and other forms of
informative documentation typically share a number of common
features. There will probably be no idioms, and a restricted set of
sentential patterns. Another common feature is the relatively simple
temporal dimension of the text, e.g. predominant use of the simple
present. There is also the common occurrence of enumeration as a form
of conjunction, usually either numbered or inset by dashes, etc. Some
of these features can be seen by comparing the examples of English and
French given below, which are drawn from a corpus
of texts about telecommunications. All of these features are of great
benefit to the developer or user of an MT system. For the developer, they
mean that there are fewer problems of ambiguity, and development effort can be
concentrated on a smaller range of constructions. For the user, this
should mean that better coverage is obtained, and that the system
performs better.
It is not, of course, the case that expository texts in different languages always exploit the same devices for a particular communicative purpose. The following extracts from the same corpus show that English and French differ in their use of impersonal constructions, with French favouring such constructions with the impersonal subject pronoun il (`it') far more in this type of text than English does. But even in these cases, it is generally easier to choose the correct translation, simply because the range of possibilities in such texts is smaller. (Literal translations of the phrases we have picked out would be: `It is advisable to take account of...', `It is manifestly much more difficult...', and `It is advisable to take...'.)
Text type can strongly influence translation, not just because certain
syntactic constructions are favoured (e.g. conjunction by
enumeration), but also by giving special meanings to certain
forms. An example of how the text type can be useful in determining
translational equivalents is the translation of infinitive verb forms
from French or German into English. Infinitives normally correspond to
English infinitives, but are usually translated as English imperatives
in instructional texts. Thus, in a printer manual one would see (b)
as the translation of (a), rather than the literal translation.
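As a rough illustration of how text type can select between translational equivalents, the following sketch treats the choice as a single rule. The function name `render_infinitive` and the text type labels are hypothetical, and the rule is deliberately simplified: it assumes the source infinitive has already been mapped to an English verb lemma.

```python
def render_infinitive(english_lemma, text_type):
    """Render a source-language infinitive as English, given the text type.

    Assumption: in instructional texts the infinitive is rendered as an
    imperative; in any other text type it is rendered as a to-infinitive.
    """
    if text_type == "instructional":
        # Imperative: bare verb form, capitalized as a sentence opener.
        return english_lemma.capitalize()
    return "to " + english_lemma


print(render_infinitive("install", "instructional"))
print(render_infinitive("install", "expository"))
```

A realistic system would apply such a rule inside transfer or generation rather than on isolated lemmas, but the sketch shows why knowing the text type removes one source of translational ambiguity.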
Thus, concentration on a sublanguage not only restricts the vocabulary
and the number of source and target language constructions to be
considered, it can also restrict the number of possible target
translations. Given the potential that sublanguages provide for
improvements in the quality of output of MT systems, and the fact that
most commercial institutions do in fact have their major translation
needs in restricted areas, it is not surprising that many research
prototypes concentrate on restricted input in various ways, and that
the design of tools and resources supporting sublanguage analysis is a
major area of research.