The topic of the book is the art or science of Automatic Translation, or Machine Translation (MT) as it is generally known: the attempt to automate all, or part, of the process of translating from one human language to another. The aim of the book is to introduce this topic to the general reader, that is, anyone interested in human language, translation, or computers. The idea is to give the reader a clear basic understanding of the state of the art, both in terms of what is currently possible and how it is achieved, and in terms of what developments are on the horizon. This should be especially interesting to anyone who is associated with what are sometimes called ``the language industries'', particularly translators, those training to be translators, and those who commission or use translations extensively. But the topics the book deals with are of general and lasting interest, as we hope the book will demonstrate, and no specialist knowledge is presupposed: no background in Computer Science, Artificial Intelligence (AI), Linguistics, or Translation Studies is needed.
Though the purpose of this book is introductory, it is not
just introductory. For one thing, we will, in
Chapter
, bring the reader up to date with the
most recent developments. For another, as well as giving an accurate
picture of the state of the art, both practically and theoretically,
we have taken a position on some of what seem to us to be the key
issues in MT today --- the fact is that we have some axes to grind.
From the earliest days, MT has been bedevilled by grandiose claims and exaggerated expectations. MT researchers and developers should stop over-selling. The general public should stop over-expecting. One of the main aims of this book is for the reader to come to appreciate where we are today in terms of actual achievement, reasonable expectation, and unreasonable hype. This is not the kind of thing that one can sum up in a catchy headline (``No Prospect for MT'' or ``MT Removes the Language Barrier''), but it is something one can absorb, and which one can thereafter use to distill the essence of truth that lies behind reports of products and research.
With all this in mind, we begin (after some introductory remarks in this chapter) with a description of what it might be like to work with a hypothetical state of the art MT system. This should allow the reader to get an overall picture of what is involved, and a realistic notion of what is actually possible. The context we have chosen for this description is that of a large organization where relatively sophisticated tools are used in the preparation of documents, and where translation is integrated into document preparation. This is partly because we think this context shows MT at its most useful. In any case, the reader unfamiliar with this situation should have no trouble understanding what is involved.
The aim of the following chapters is to `lift the lid' on the core component of an MT system and give an idea of what goes on inside. More precisely, since there are several different basic designs for MT systems, the aim is to give an idea of what the main approaches are, and to point out their strengths and weaknesses.
Unfortunately, even a basic understanding of what goes on inside an MT
system requires a grasp of some relatively simple ideas and
terminology, mainly from Linguistics and Computational Linguistics,
and this has to be given `up front'. This is the purpose of
Chapter
. In this chapter, we describe some fundamental
ideas about how the most basic sort of knowledge that is required for
translation can be represented in, and used by, a computer.
In Chapter
we look at how the main kinds of MT system
actually translate, by describing the operation of the
`Translation Engine'. We begin by describing the simplest design,
which we call the transformer architecture. Though now somewhat
old hat as regards the research community, this is still the design
used in most commercial MT systems. In the second part of the chapter,
we describe approaches which involve more extensive and sophisticated
kinds of linguistic knowledge. We call these Linguistic
Knowledge (LK) systems. They include the two approaches that have
dominated MT research over most of the past twenty years. The first is
the so-called interlingual approach, where translation proceeds
in two stages, by analyzing input sentences into some abstract and
ideally language independent meaning representation, from which
translations in several different languages can potentially be
produced. The second is the so-called transfer approach, where
translation proceeds in three stages. First, input sentences are
analyzed into a representation which still retains characteristics of
the original, source language text. This is then input to a special
component (called a transfer component) which produces a
representation which has characteristics of the target (output)
language. Finally, a target sentence is produced from this representation.
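As a rough illustration of how the two LK designs differ in shape, the following Python sketch may help. Everything in it is an assumption made for illustration: the tiny word lists, the concept labels such as `CAT`, and the function names are invented, and a real system would use far richer linguistic representations than lists of words. The only point is the number and arrangement of the stages.

```python
# Toy illustration only: all lexicons, labels, and function names are
# hypothetical and are not taken from any real MT system.

# Transfer design: a source-to-target mapping for one language pair.
TRANSFER_LEXICON = {"the": "le", "cat": "chat", "sleeps": "dort"}

# Interlingual design: source words map to language-neutral concept
# labels, and each target language maps those labels to its own words.
ANALYSIS_LEXICON = {"the": "DEF", "cat": "CAT", "sleeps": "SLEEP"}
GENERATION_LEXICONS = {
    "fr": {"DEF": "le", "CAT": "chat", "SLEEP": "dort"},
    "de": {"DEF": "die", "CAT": "Katze", "SLEEP": "schläft"},
}


def analyse(sentence):
    """Turn a source sentence into a (very shallow) source representation."""
    return sentence.lower().rstrip(".").split()


def transfer(source_rep):
    """Map a source-oriented representation to a target-oriented one."""
    return [TRANSFER_LEXICON.get(word, word) for word in source_rep]


def synthesise(target_rep):
    """Produce a target sentence from a target-oriented representation."""
    text = " ".join(target_rep)
    return text[0].upper() + text[1:] + "."


def translate_by_transfer(sentence):
    """Three stages: analysis, transfer, synthesis."""
    return synthesise(transfer(analyse(sentence)))


def to_interlingua(sentence):
    """Stage one of two: analyse into language-neutral concept labels."""
    return [ANALYSIS_LEXICON.get(word, word) for word in analyse(sentence)]


def from_interlingua(meaning, lang):
    """Stage two of two: generate a sentence in the chosen target language."""
    lexicon = GENERATION_LEXICONS[lang]
    return synthesise([lexicon.get(concept, concept) for concept in meaning])


if __name__ == "__main__":
    print(translate_by_transfer("The cat sleeps."))
    meaning = to_interlingua("The cat sleeps.")
    for lang in ("fr", "de"):
        print(lang, from_interlingua(meaning, lang))
```

Note that in the interlingual sketch the same `meaning` is reused for both target languages, whereas the transfer sketch would need a separate transfer lexicon for each language pair.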
The still somewhat schematic picture that this provides will be
amplified in the two following chapters. In
Chapter
, we focus on what is
probably the single most important component in an MT system, the
dictionary, and describe the sorts of issue that arise in designing,
constructing, or modifying the sort of dictionary one is likely to
find in an MT system.
Chapter
will go into more detail about some of the
problems that arise in designing and building MT systems, and, where
possible, describe how they are, or could be, solved. This chapter will
give an idea of why MT is `hard', and of the limitations of current
technology. It will also begin to introduce some of the open questions for
MT research that are the topic of the final chapter.
Such questions are also introduced in Chapter
. Here we
return to questions of representation and processing, which we began
to look at in Chapter
, but whereas we focused previously on
morphological, syntactic, and relatively superficial semantic issues,
in this chapter we turn to more abstract, `deeper' representations:
representations of various kinds of meaning.
One of the features of the scenario we imagine in
Chapter
is that texts are mainly created,
stored, and manipulated electronically (for example, by word processors). In
Chapter
we look in more detail at what this
involves (or ideally would involve), and how it can be exploited to
yield further benefits from MT. In particular, we will describe how
standardization of electronic document formats and the general notion of
standardized markup (which separates the content of a document
from details of its realization, so that a writer, for example, specifies that
a word is to be emphasised, but need not specify which typeface must
be used for this) can be exploited when one is dealing with documents
and their translations. This will go beyond what some readers will
immediately need to know. However, we consider its inclusion
important since the integration of MT into the document processing
environment is an important step towards the successful use of MT.
In this chapter we will also look at the benefits and practicalities
of using controlled languages --- specially simplified versions
of, for example, English, and sublanguages --- specialized languages
of sub-domains. Although
these notions are not central to a proper understanding of the
principles of MT, they are widely thought to be critical for the
successful application of MT in practice.
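The idea of markup that separates content from realization, mentioned above, can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the segment roles (`text`, `emph`), the two style tables, and the function names are all hypothetical, and translating segment by segment is a deliberate oversimplification, since real translation often reorders material across segments.

```python
# Toy illustration only: roles, style tables, and function names are
# hypothetical and do not follow any particular markup standard.

# The document records what each segment is, not how it should look.
document = [("text", "Press the "), ("emph", "red"), ("text", " button.")]

# Each style table decides how a logical role is realized on output.
PLAIN_STYLE = {"text": "{}", "emph": "*{}*"}
HTML_STYLE = {"text": "{}", "emph": "<em>{}</em>"}


def render(doc, style):
    """Realize a logically marked-up document using the given style table."""
    return "".join(style[role].format(content) for role, content in doc)


def translate_segments(doc, translate):
    """Apply a translation function to the content, keeping the markup."""
    return [(role, translate(content)) for role, content in doc]


if __name__ == "__main__":
    print(render(document, PLAIN_STYLE))
    print(render(document, HTML_STYLE))
    # str.upper stands in for a real translation function here.
    shouted = translate_segments(document, str.upper)
    print(render(shouted, HTML_STYLE))
```

Because the writer marks only that a word is emphasised, the same marked-up text can be rendered with either style table, and the markup survives the stand-in translation step unchanged.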
Continuing the orientation towards matters of more practical than
theoretical importance, Chapter
addresses the issue
of the evaluation of MT systems --- of how to tell if an MT system
is `good'. We will go into some detail about this, partly because it
is such an obvious and important question to ask, and partly because
there is no other accessible discussion of the standard methods for
evaluating MT systems that an interested reader can refer to.
By this time, the reader should have a reasonably good idea of what
the `state of the art' of MT is. The aim of the final chapter
(Chapter
) is to try to
give the reader an idea of what the future holds by describing where
MT research is going and what are currently thought to be the most
promising lines of research.
Throughout the book, the reader may encounter terms and concepts with which she is unfamiliar. If necessary, she can refer to the Glossary at the back of the book, where such terms are defined.