Over the last few years there has been a growing interest in the research community in statistical approaches to Natural Language Processing. With respect to MT, the term `statistical approaches' can be understood in a narrow sense to refer to approaches which try to do away with explicitly formulating linguistic knowledge, or in a broad sense to denote the application of statistically or probabilistically based techniques to parts of the MT task (e.g. as a word sense disambiguation component). We will give a flavour of this work by describing a purely statistical approach to MT.
The approach can be thought of as trying
to apply to MT techniques which have been
highly successful in Speech Recognition, and though the details
require a reasonable amount of statistical sophistication, the basic
idea can be grasped quite simply. The two key notions involved
are those of the language model and the translation
model. The language model provides us with probabilities for
strings of words (in fact sentences), which we can denote by
$\Pr(S)$ (for a source sentence $S$) and $\Pr(T)$ (for any given
target sentence $T$). Intuitively, $\Pr(S)$ is the
probability of a string of source words $S$ occurring, and likewise for
$\Pr(T)$. The translation model also provides us with
probabilities: $\Pr(T \mid S)$ is the conditional probability that
a target sentence $T$ will occur in a
target text which translates a text containing the source sentence
$S$. The product of this and the probability of $S$ itself, that is
$\Pr(S) \times \Pr(T \mid S)$, gives the probability of source-target
pairs of sentences occurring, written $\Pr(S, T)$.
One task, then, is to find out the probability of a source string (or
sentence) occurring (i.e. $\Pr(S)$). This can be decomposed into the
probability of the first word, multiplied by the conditional
probabilities of the succeeding words, as follows:
$\Pr(s_1) \times \Pr(s_2 \mid s_1) \times \Pr(s_3 \mid s_1, s_2)$, etc.
Intuitively, the conditional probability $\Pr(s_2 \mid s_1)$ is the
probability that $s_2$ will occur, given that $s_1$ has occurred (for
example, the probability that am and are occur in a
text might be approximately the same, but the probability of
am occurring after I is quite high, while that of
are is much lower). To keep things within manageable limits, it is
common practice to take into account only the preceding one or two
words in calculating these conditional probabilities (these are known
respectively as `bigram' and `trigram' models). In order to calculate
these source language probabilities (producing the source language
model by estimating the parameters), a large amount of monolingual
data is required, since of course the validity, usefulness or accuracy
of the model will depend mainly on the size of the corpus.
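As a rough illustration of the bigram case, the following Python sketch estimates conditional probabilities by relative frequency from a tiny invented corpus and multiplies them together to score a sentence. The corpus, the `<s>` start-of-sentence marker and the function names are assumptions made for the example only; a realistic model would need far more data and some treatment of unseen word pairs.

```python
from collections import Counter


def train_bigram(sentences):
    """Count how often each word follows each preceding word.

    A "<s>" marker is prepended so that the first word of a sentence
    is also conditioned on something.
    """
    context_counts = Counter()  # how often a word occurs as a left context
    bigram_counts = Counter()   # how often (previous, current) occurs
    for sentence in sentences:
        words = ["<s>"] + sentence.split()
        for prev, cur in zip(words, words[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts


def bigram_prob(prev, cur, context_counts, bigram_counts):
    """Relative-frequency estimate of Pr(cur | prev); 0.0 for unseen contexts."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, cur)] / context_counts[prev]


def sentence_prob(sentence, context_counts, bigram_counts):
    """Multiply the bigram probabilities of the successive words."""
    words = ["<s>"] + sentence.split()
    prob = 1.0
    for prev, cur in zip(words, words[1:]):
        prob *= bigram_prob(prev, cur, context_counts, bigram_counts)
    return prob


# Invented toy corpus: "am" and "are" are equally frequent overall.
corpus = ["I am here", "I am tired", "you are here", "they are tired"]
context_counts, bigram_counts = train_bigram(corpus)

p_am_after_i = bigram_prob("I", "am", context_counts, bigram_counts)
p_are_after_i = bigram_prob("I", "are", context_counts, bigram_counts)
p_sentence = sentence_prob("I am here", context_counts, bigram_counts)
print(p_am_after_i, p_are_after_i, p_sentence)
```

In this toy corpus am and are occur equally often, yet am is estimated as far more probable than are after I, which mirrors the contrast described above. A trigram model would condition on the two preceding words in the same way.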
The second task requiring large amounts of data is specifying the
parameters of the translation model, which requires a large bilingual
aligned corpus. As we observed above, there are rather few such
resources; however, the research group at IBM which has been mainly
responsible for developing this approach had access to three million
sentence pairs from the Canadian (French-English) Hansard --- the
official record of proceedings in the Canadian Parliament (cf. the
extract given above), from which they have developed a (sentence-)
aligned corpus, where each source sentence is paired with its
translation in the target language, as can be seen on
page
.
It is worth noting in passing that the usefulness of corpus resources depends very much on the state in which they are available to the researcher. Corpus clean-up and especially the correction of errors is a time-consuming and expensive business, and some would argue that it detracts from the `purity' of the data. But the extract given here illustrates a potential source of problems if a corpus is not cleaned up in some ways --- the penultimate French sentence contains a false start, followed by ..., while the English text (presumably produced by a human translator) contains just a complete sentence. This sort of divergence could in principle affect the statistics for word-level alignment.
In order to get some idea of how the translation model works, it is useful to introduce some further notions. In a word-aligned sentence-pair, it is indicated which target words correspond to each source word. An example of this (which takes French as the source language) is given in the second extract.
The numbers after the source words indicate the string position of the corresponding target word or words. If there is no target correspondence, then no bracketed numbers appear after the source word (e.g. a in a demandé). If more than one word in the target corresponds, then this is also indicated. The fertility of a source word is the number of words corresponding to it in the target string. For example, the fertility of asked with English as source language is 2, since it aligns with a demandé. A third notion is that of distortion, which refers to the fact that source words and their target correspondences do not necessarily appear in the same string position (compare tout acheter and buy everything, for example).
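The notions of fertility and distortion can be made concrete with a small Python sketch. The sentence fragments and alignments below are invented for illustration (they are not taken from the Hansard extract), and the representation of an alignment as a dictionary from 1-based source positions to lists of target positions is simply an assumption of this example.

```python
def fertilities(source_words, alignment):
    """Pair each source word with the number of target words aligned to it.

    `alignment` maps a 1-based source position to a list of 1-based
    target positions; a source word with no entry has fertility 0.
    """
    return [(word, len(alignment.get(pos, [])))
            for pos, word in enumerate(source_words, start=1)]


def displaced_links(alignment):
    """List the (source_pos, target_pos) links whose positions differ."""
    return [(s, t) for s in sorted(alignment) for t in alignment[s] if s != t]


# English as source; hypothetical target: il(1) a(2) demandé(3)
english = ["he", "asked"]
english_alignment = {1: [1], 2: [2, 3]}
english_fertilities = fertilities(english, english_alignment)

# French as source; hypothetical target: he(1) asked(2)
french = ["il", "a", "demandé"]
french_alignment = {1: [1], 3: [2]}  # "a" has no target correspondence
french_fertilities = fertilities(french, french_alignment)

# French as source; hypothetical target: buy(1) everything(2)
reordered_alignment = {1: [2], 2: [1]}  # tout -> everything, acheter -> buy
reordered_links = displaced_links(reordered_alignment)

print(english_fertilities)
print(french_fertilities)
print(reordered_links)
```

Here asked receives fertility 2 and a receives fertility 0, while the tout acheter example yields two links whose source and target positions differ, which is the kind of information the distortion parameters are meant to capture.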
The parameters which must be calculated from the bilingual sentence
aligned corpus are then (i) the fertility probabilities for each
source word (i.e. the likelihood of it translating as one, two, three,
etc., words respectively), (ii) the word-pair or translation
probabilities for each word in each language and (iii) the set of
distortion probabilities for each source and target position. With
this information (which is extracted automatically from the corpus),
the translation model can, for a given $S$, calculate $\Pr(T \mid S)$
(that is, the probability of $T$, given $S$). This is the essence of the
approach to statistically-based MT, although the procedure is itself
slightly more complicated in involving search through possible source
language sentences for the one which maximises
$\Pr(S) \times \Pr(T \mid S)$, translation being essentially viewed as the problem of
finding the $S$ that is most probable given $T$, i.e. one wants to
maximise $\Pr(S \mid T)$. Given that
$\Pr(S \mid T) = \Pr(S) \times \Pr(T \mid S) / \Pr(T)$,
and that $\Pr(T)$ is the same for every candidate $S$,
one just needs to choose the $S$ that maximises the product of
$\Pr(S)$ and $\Pr(T \mid S)$.
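The search step can be sketched in a few lines of Python. This is only a toy under stated assumptions: the candidate sentences and all probability values are invented, the two models are represented as plain lookup tables, and the candidates are enumerated by hand, whereas a real system would have to search a very large space of possible sentences.

```python
def best_source(target, candidates, language_model, translation_model):
    """Return the candidate S with the highest Pr(S) * Pr(T | S).

    `language_model` maps S to Pr(S); `translation_model` maps the pair
    (T, S) to Pr(T | S). Missing entries count as probability 0.0.
    Pr(T) is ignored because it is the same for every candidate S.
    """
    def score(source):
        return (language_model.get(source, 0.0)
                * translation_model.get((target, source), 0.0))

    return max(candidates, key=score)


# Hypothetical values, chosen only to illustrate the computation.
target = "il a demandé"
candidates = ["he asked", "he has asked", "asked he"]
language_model = {
    "he asked": 0.02,
    "he has asked": 0.01,
    "asked he": 0.0001,
}
translation_model = {
    (target, "he asked"): 0.3,
    (target, "he has asked"): 0.4,
    (target, "asked he"): 0.3,
}

best = best_source(target, candidates, language_model, translation_model)
print(best)
```

With these invented numbers the translation model alone would slightly prefer he has asked, and it cannot tell he asked from asked he, but the language model's preferences tip the product in favour of he asked. This is the sense in which the two models divide the work between them.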
It should be clear that in an approach such as this there is no role whatsoever for the explicit encoding of linguistic information, and thus the knowledge acquisition problem is solved. On the other hand, the general applicability of the method might be doubted, since as we observed above, it is heavily dependent on the availability of good quality bilingual or multilingual data in very large quantities, something which is currently lacking for most languages.
Results to date in terms of accuracy have not been overly impressive,
with a 39% rate of correct translation reported on a set of 100 short
test sentences. A defect of this approach is that morphologically
related words are treated as completely separate from each other, so
that, for example, distributional information about sees cannot
contribute to the calculation of parameters for see and
saw, etc. In an attempt to remedy this defect, researchers at IBM
have started to add low level grammatical information piecemeal to
their system, moving in essence towards an analysis-transfer-synthesis
model of statistically-based translation. The information in question
includes morphological information, the neutralisation of case
distinctions (upper and lower case) and minor transformations to input
sentences (such as the movement of adverbs) to create a more canonical
form. The currently reported success rate with 100 test sentences is a
quite respectable 60%. A major criticism of this move is of course
precisely that linguistic information is being added piecemeal,
without a real view of its appropriacy or completeness, and there must
be serious doubts about how far the approach can be extended without
further additions of explicit linguistic knowledge, i.e. a more
systematic notion of grammar. Putting the matter more positively, it
seems clear that there is a useful role for information about
probabilities. However, the poor success rate for the `pure' approach
without any linguistic knowledge (less than 40%) suggests that
the real question is how one can best combine statistical and
rule-based approaches.