Every text that is not delivered as an electronic document on a floppy disc, a magnetic tape, or via a computer network will have to be put into the system manually. Re-typing a text into the computer solely to make it available for MT is unlikely to be cost-effective --- it would often be quicker to have the text translated directly by a human translator. In recent years it has become practicable to use an optical character reader (OCR) to input text available only in printed form. Clearly this is much quicker than re-typing, but checking for and correcting scanning errors can be time-consuming.
However, if, as is the case with ETRANS as described in Chapter , the MT system fits into an overall document production system (DPS), then text can be created, translated, re-edited and generally prepared for publication within the same electronic environment. In the first part of this chapter we will explore this notion of an electronic document in some detail.
Electronic text is simply text which is available in a machine
readable form. For example, electronic text is produced by ordinary
office word processors. At its simplest, such a text is just a sequence of characters, and for the characters in general computing use (i.e. the English alphabet, the normal punctuation characters, plus characters such as `space', `line-feed', etc.) there is a standard representation provided by the ASCII codes, which associate each character with a seven or eight bit code, i.e. a number: `a' is ASCII 97, `b' is ASCII 98, `A' is ASCII 65, and the `space' character is ASCII 32.
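These codes can be inspected directly in any modern programming language; the following lines of Python, for instance, confirm the values just mentioned:

    # The ASCII mapping between characters and numeric codes.
    print(ord("a"))   # 97
    print(ord("b"))   # 98
    print(ord("A"))   # 65
    print(ord(" "))   # 32
    print(chr(97))    # a -- the mapping works in both directions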
Unfortunately, this standard is not sufficient for encoding
the letters of foreign alphabets and their accents, even those based
on the Roman alphabet, let alone non-Roman alphabets, and characters
in non-alphabetic scripts, such as Japanese characters (Kanji). One
approach to such alphabets is to extend the ASCII codes beyond those
needed by English. Another is to represent foreign accents and special characters by sequences of standard ASCII characters. For example, a German u with umlaut (ü) might be represented by a sequence of plain characters, such as the conventional German transliteration ue.
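As a sketch of this second approach, the following Python fragment transliterates accented characters into plain ASCII sequences (the particular table is illustrative; real systems have used a variety of conventions):

    # Replace accented characters by sequences of standard ASCII
    # characters, using the conventional German transliterations.
    TRANSLIT = {"ü": "ue", "ö": "oe", "ä": "ae", "ß": "ss"}

    def to_ascii(text):
        return "".join(TRANSLIT.get(ch, ch) for ch in text)

    print(to_ascii("Müller"))   # Mueller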
One problem is that there is (as yet) no genuinely accepted standard beyond basic ASCII, with the further complication that many word processors use non-ASCII representations `internally', as a way of representing text format (e.g. information about typeface, underlining, etc.). This lack of standards means that special conversion programs are needed if one wants to import and export text freely between different languages and a variety of DPSs (such as word processors). Even when such programs exist, they do not always preserve all the information (e.g. some information about format may be lost).
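The difficulty is easy to demonstrate with the character sets in use today: the same character can have quite different machine representations, so moving text between systems means translating between them. In Python, for instance:

    # One character, three different machine representations.
    ch = "ü"
    print(ch.encode("latin-1"))   # b'\xfc'       (a single 8-bit code)
    print(ch.encode("utf-8"))     # b'\xc3\xbc'   (a two-byte sequence)
    print(ch.encode("cp437"))     # b'\x81'       (an old PC code page)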
Part of a general solution to these problems, however, is to distinguish two components of a printed document: the text itself (a sequence of words and characters); and its rendition --- the form in which it appears on the page (or screen). For example, consider a title or heading. There are the words which make up the title --- perhaps a noun phrase such as `The Electronic Document' --- and the particular presentation or rendition of those words on the page. In this book all section and chapter headings are aligned with the left margin and different levels of heading (chapter, section, subsection) are printed in a distinctive typeface and separated by a standard space from the preceding and following paragraphs of text.
If we think about this distinction between text and rendition in
electronic terms, it is easy to see that we have to code both the
characters in the text, and indicate how we intend parts of that text
to appear on screen or in printed form. In the early days of
electronic text handling, this problem was solved in a rather direct
and obvious fashion: the author would type in not only the substance
of the text but also some special codes at appropriate places to tell
the printer to switch into the appropriate typefaces and point sizes.
For example, in typing in a title the author would carefully insert an
appropriate number of carriage returns (non-printing characters which
start a new line) to get nice spacing before and after. She would also make sure the title was centred or left-aligned as required, and finally she would type in special printer codes before and after the title string to switch the printer into a bold typeface at a size of 24 `points' and back to its usual font and size immediately afterwards.
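To caricature this procedural style, the stored form of such a title might look something like the following Python string, in which the control tokens are invented for illustration and belong to no real printer:

    # Carriage returns for spacing, an invented code BF24 to switch
    # to 24 point bold, and an invented code NF to switch back.
    title = "\r\r\x1bBF24The Electronic Document\x1bNF\r\r"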
There are three evident problems with such a procedure: it is tedious and error-prone for the author; it requires the author to know the low-level codes of the particular printer; and, since different printers use different codes, a document prepared in this way cannot easily be moved to another device.
The printer codes are a sort of tiny little program for a particular
printer. The next development was to replace these rather specific
programs by some means of stating directly ``I want this in 24 point
Roman boldface'' --- perhaps by a `markup' like this: `\roman\24pt\bf'. Each printer or word processor can then be
equipped with a special program (a so-called `driver') which
interprets this high-level code and sends the printer or screen
appropriate specific low-level codes. Providing everyone used exactly
the same high-level codes in all systems, the problem of portability
would be solved.
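A driver of this kind is essentially a translation table from high-level rendition codes to device-specific ones. Here is a minimal sketch in Python, with invented low-level codes for two imaginary printers:

    # Translate high-level rendition markup into low-level codes
    # for a particular device.  All the codes are invented.
    DRIVERS = {
        "printer-A": {"\\roman": "\x1bR", "\\24pt": "\x1bS24", "\\bf": "\x1bB"},
        "printer-B": {"\\roman": "@f0", "\\24pt": "@s24", "\\bf": "@b1"},
    }

    def drive(markup, device):
        # The same high-level markup yields different device codes.
        return "".join(DRIVERS[device][code] for code in markup)

    print(repr(drive(["\\roman", "\\24pt", "\\bf"], "printer-A")))
    print(repr(drive(["\\roman", "\\24pt", "\\bf"], "printer-B")))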
However, there is another way of tackling the rendition problem. When one thinks about it abstractly, the only thing that the author really needs to put into the text is some markup which says (in effect) `This is a heading', or `This is a footnote' or `This is an item in an item list' and so on. Each piece of text is thus identified as being an instance of some class of text elements. With such markup, the author no longer has to worry about how each such marked document element is going to be printed or shown on screen --- that task can be delegated to the document designer (the modern equivalent of a compositor). The document designer can specify an association between each type of document element and the high-level rendition codes she wants it to have. In other words, she can say that she wants all headings to be printed in 24 point boldface Roman. The document handling system then ensures that headings etc. are displayed and printed as required.
This type of markup, where the author simply identifies particular pieces of text as being instances of particular document elements, is known as descriptive or `intensional' markup; a heading, for example, might be marked up as <Heading>The Electronic Document</Heading>. This notion is fundamental to all modern document processing systems and techniques. Not only does it provide flexibility in how text is rendered; provided that the markup conventions are consistent from system to system, it also means that electronic documents can be passed freely between systems.
We can now be a little more precise about the notion of an electronic document: it contains electronic or machine-readable text together with descriptive markup codes, which may be used to determine the rendition of the document and to support other kinds of processing. Before we go on to give an idea of how this can be exploited for MT, it is worth briefly describing the standard for descriptive markup: SGML (Standard Generalized Markup Language), which is specified by the International Organization for Standardization (ISO). It is our belief that in the next few years no serious commercial MT system will be supplied without some means of handling SGML.
SGML specifies that, ordinarily, text will be marked up in the way shown in the last example above, i.e. with document elements surrounded by their names in angle brackets. An office memo marked up in SGML might look like the example below (the memo text itself is left as `...'):

    <Memo>
    <To> ... </To>
    <From> ... </From>
    <Body>
    <P> ... </P>
    <P> ... </P>
    </Body>
    </Memo>

In addition to the actual text, various pairs of SGML tags delimiting the memo elements can be seen here. The memo as a whole starts with <Memo> and ends with </Memo> (where the / indicates the closing delimiter). Between the Memo tag pair we find the sub-elements of the memo, also marked up with paired tags (<To>...</To>, <From>...</From>, <Body>...<P>...</P>...</Body>).
The relationship between SGML tags and the way text is actually rendered is given by an association table; such a table might say, for example, that the body of a memo should be separated from the previous part by a horizontal line. When actually printed, this memo might look as in the figure below:
Figure: How a Memo Marked Up in SGML Might Appear When Printed
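To make the idea of an association table concrete, here is a small Python sketch in which each element name is mapped to an (invented) rendition rule, including the horizontal line before the body mentioned above:

    # An invented association table: each element name is mapped to
    # a function saying how that element's content is to be shown.
    ASSOCIATION = {
        "To":   lambda text: "To:   " + text,
        "From": lambda text: "From: " + text,
        "Body": lambda text: "-" * 40 + "\n" + text,   # line before the body
    }

    def render(element, text):
        return ASSOCIATION[element](text)

    print(render("To", "..."))
    print(render("From", "..."))
    print(render("Body", "..."))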
The tagging principles of SGML are intended to extend to very complex and highly structured documents. Imposing such a structure not only allows very fine and flexible control over how documents are printed; it also allows easy access to, and manipulation of, information in documents, and straightforward consistency checking.
One thing the SGML standard does not do is try to specify a standard inventory of all possible document elements. Users are perfectly free to define their own document types and to specify the elements in those documents. SGML provides a special method of doing this, known as a Document Type Definition (DTD). A DTD is a sort of formal grammar specifying which elements can occur in a particular type of document, and in what order and combination. For example, such a grammar might say that all Memos (all our Memos at least) contain a To element followed by a From element followed by a Body element, which itself contains at least one Paragraph followed by zero or more Paragraphs. This means that a Memo has the following sort of DTD (grossly simplified):
    Memo → To, From, Body
    Body → Paragraph, Paragraph*
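Because a DTD is a formal grammar, conformance to it can be checked mechanically. The following toy Python checker (not a real SGML parser; the Body element is flattened into its Paragraphs for simplicity) tests whether a sequence of element names fits the simplified grammar above:

    import re

    # Toy conformance check for the simplified Memo grammar:
    # a To, then a From, then one or more Paragraphs.
    MEMO_PATTERN = re.compile(r"To From (Paragraph )+$")

    def conforms(elements):
        return bool(MEMO_PATTERN.match(" ".join(elements) + " "))

    print(conforms(["To", "From", "Paragraph"]))                # True
    print(conforms(["To", "From", "Paragraph", "Paragraph"]))   # True
    print(conforms(["From", "To", "Paragraph"]))                # False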
Using a DTD has several advantages: documents can be checked automatically for conformance to their type, so structural errors (a Memo without a From element, say) are caught early; software such as editors, formatters and translation tools can rely on a known structure when accessing and manipulating the contents of a document; and rendition can be specified once for a whole document type rather than separately for each document.
With this general idea of Electronic Documents and markup, we can look at how an MT system can exploit the fact that texts are represented in this way.