## Basic Ideas

Every text that is not delivered as an electronic document on a floppy disc, a magnetic tape, or via a computer network will have to be put into the system manually. Re-typing a text into the computer solely to make it available for MT is unlikely to be cost-effective --- it would often be quicker to have the text translated directly by a human translator. In recent years it has become practicable to use an optical character reader (OCR) to input text available only in printed form. Clearly this is much quicker than re-typing, but checking for and correcting scanning errors can be time-consuming.

However, if, as is the case with ETRANS as described in Chapter , the MT system fits into an overall document production system (DPS), then text can be created, translated, re-edited and generally prepared for publication within the same electronic environment. In the first part of this chapter we will explore this notion of an electronic document in some detail.

Electronic text is simply text which is available in a machine-readable form. For example, electronic text is produced by ordinary office word processors. At its simplest, such a text is just a sequence of characters, and, for the characters in use in general computing (i.e. the English alphabet, normal punctuation characters, plus characters such as the 'space' character, the 'line-feed' character, etc.) there is a standard representation provided by the ASCII codes, which associate each character with a seven-bit code (i.e. a number --- e.g. a is ASCII 97, b is ASCII 98, A is ASCII 65, the 'space' character is ASCII 32). Unfortunately, this standard is not sufficient for encoding the letters of foreign alphabets and their accents, even those based on the Roman alphabet, let alone non-Roman alphabets, and characters in non-alphabetic scripts, such as Japanese characters (Kanji). One approach to such alphabets is to extend the ASCII codes beyond those needed by English. Another is to represent foreign accents and special characters by sequences of standard ASCII characters. For example, a German u with umlaut (ü) might be represented thus: \"{u}.

One problem is that there is (as yet) no genuinely accepted standard beyond basic ASCII, with the further complication that many word processors use non-ASCII representations 'internally', as a way of representing text format (e.g. information about typeface, underlining, etc.). This lack of standards means that special conversion programs are needed if one wants to import and export text freely between different languages and a variety of DPSs (such as word processors). Even when such programs exist, they do not always preserve all the information (e.g. some information about format may be lost).

Part of a general solution to these problems, however, is to distinguish two components of a printed document: the text itself (a sequence of words and characters); and its rendition --- the form in which it appears on the page (or screen). For example, consider a title or heading. There are the words which make up the title --- perhaps a noun phrase such as 'The Electronic Document' --- and the particular presentation or rendition of those words on the page. In this book all section and chapter headings are aligned with the left margin and different levels of heading (chapter, section, subsection) are printed in a distinctive typeface and separated by a standard space from the preceding and following paragraphs of text.

If we think about this distinction between text and rendition in electronic terms, it is easy to see that we have to code both the characters in the text, and indicate how we intend parts of that text to appear on screen or in printed form. In the early days of electronic text handling, this problem was solved in a rather direct and obvious fashion: the author would type in not only the substance of the text but also some special codes at appropriate places to tell the printer to switch into the appropriate typefaces and point sizes. For example, in typing in a title the author would carefully insert an appropriate number of carriage returns (non-printing characters which start a new line) to get a nice spacing before and after. She would also make sure the title was centred or left-aligned as required, and finally she would type in special codes (printer-specific escape sequences) before and after the title string to switch the printer into a 24 point bold typeface and back to its usual font and size immediately afterwards.

There are three evident problems with such a procedure:

1. The codes used are likely to be specific to particular printers or word processing set-ups and hence the electronic document will not be directly portable to other systems for revision, integration with other documents or printing.
2. The author is required to spend some of her time dealing with rendition problems --- a task that (prior to the advent of electronic systems) had always been conveniently delegated to the compositor in a printing house.
3. If at some point it is decided that a different rendition of headings is required, someone has to go through the entire document and replace all the codes and characters associated with the rendition of each heading.

The printer codes are a sort of small program for a particular printer. The next development was to replace these rather specific programs by some means of stating directly 'I want this in 24 point Roman boldface' --- perhaps by a 'markup' like this: \roman\24pt\bf. Each printer or word processor can then be equipped with a special program (a so-called 'driver') which interprets this high-level code and sends the printer or screen appropriate specific low-level codes. Provided everyone used exactly the same high-level codes in all systems, the problem of portability would be solved.

However, there is another way of tackling the rendition problem. When one thinks about it abstractly, the only thing that the author really needs to put into the text is some markup which says (in effect) 'This is a heading', or 'This is a footnote' or 'This is an item in an item list' and so on. Each piece of text is thus identified as being an instance of some class of text elements. With such markup, the author no longer has to worry about how each such marked document element is going to be printed or shown on screen --- that task can be delegated to the document designer (the modern equivalent of a compositor). The document designer can specify an association between each type of document element and the high-level rendition codes she wants it to have. In other words, she can say that she wants all headings to be printed in 24 point boldface Roman. The document handling system then ensures that headings etc. are displayed and printed as required.

This type of markup, where the author simply identifies particular pieces of text as being instances of particular document elements, is known as descriptive or 'intensional' markup. This notion is fundamental to all modern document processing systems and techniques. Not only does it provide flexibility in how text is rendered; provided that the markup conventions are consistent from system to system, it also means that electronic documents can be freely passed between systems.

We can now be a little more precise about the notion of an electronic document: it contains electronic or machine-readable text with descriptive markup codes which may be used to determine the rendition and other usages of the document. Before we go on to give an idea of how this can be exploited for MT, it may be worth briefly describing the standard descriptive markup language: SGML (Standard Generalized Markup Language), which is specified by the International Organization for Standardization (ISO). It is our belief that in the next few years no serious commercial MT system will be supplied without some means of handling SGML.

SGML specifies that, ordinarily, text will be marked up with document elements surrounded by their names in angle brackets. An office memo marked up in SGML might look like the example below. In addition to the actual text, various pairs of SGML tags delimiting the memo elements can be seen. The memo as a whole starts with <Memo> and ends with </Memo> (where / indicates the closing delimiter). Between the Memo tag pair we find the sub-elements of the memo, also marked up with paired tags (<To>...</To>, <From>...</From>, <Body>...<P>...</P>...</Body>).
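A sketch of such a memo follows; the tag names come from the description above, while the memo's content is invented purely for illustration:

```sgml
<Memo>
<To>All staff</To>
<From>The Director</From>
<Body>
<P>The office will be closed this Friday for maintenance work.</P>
<P>Normal working hours resume on Monday.</P>
</Body>
</Memo>
```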

The relationship between SGML tags and the way text is actually rendered is given by an association table; such a table might say, for example, that the body of a memo should be separated from the previous part by a horizontal line. When actually printed, this memo might look as in Figure :

Figure: How a Memo Marked Up in SGML Might Appear When Printed
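For illustration, an association table of the kind just described might pair each memo element with rendition instructions along these lines (the particular renditions are invented; only the horizontal line before the Body comes from the text above):

| Element  | Rendition                                          |
|----------|----------------------------------------------------|
| To, From | label printed in boldface, each on its own line    |
| Body     | separated from the preceding part by a horizontal line |
| P        | set in the normal text font, separated by blank lines |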

The tagging principles of SGML are intended to extend to very complex and highly structured documents. Imposing such a structure not only allows very fine and flexible control of how documents are printed; it can also allow easy access to and manipulation of information in documents, and straightforward consistency checking.

One thing the SGML standard does not do is try to specify a standard inventory of all possible document elements. Users are perfectly free to define their own document types and to specify the elements in those documents. SGML provides a special method of doing this, known as a Document Type Definition (DTD). A DTD is a sort of formal grammar specifying the elements of a particular type of document and the structural relations between them. For example, such a grammar might say that all Memos (all our Memos at least) contain a To element followed by a From element followed by a Body element, which itself contains at least one Paragraph followed by zero or more Paragraphs. This means that a Memo has the following sort of DTD (grossly simplified):

```
Memo  →  To, From, Body
Body  →  Paragraph, Paragraph*
```
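In actual SGML syntax, a grammar of this sort might be written as element declarations along the following lines. This is a minimal sketch only: the P tag from the memo example stands in for Paragraph, #PCDATA marks elements containing plain character data, and the '- -' pairs simply state that neither start nor end tags may be omitted:

```sgml
<!ELEMENT Memo - - (To, From, Body) >
<!ELEMENT To   - - (#PCDATA)        >
<!ELEMENT From - - (#PCDATA)        >
<!ELEMENT Body - - (P, P*)          >
<!ELEMENT P    - - (#PCDATA)        >
```

Here `(P, P*)` directly mirrors the 'at least one Paragraph followed by zero or more Paragraphs' of the grammar above.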

Using a DTD has several advantages:

1. The DTD makes sure that documents are truly portable between different SGML document systems; the document system reads the accompanying DTD to find out what sort of elements will be in the document and how they will be arranged with respect to each other. Thus, the document processing system knows what to expect when it encounters a document which is an instance of a certain DTD.
2. It ensures that documents of a particular type (e.g. user manuals) are always structurally consistent with each other. It suffices to define a DTD for the class of user manuals, and the SGML document processing system will then ensure that all documents written to that DTD do indeed have the same overall structure. In short, DTDs help to promote a certain rigour which is extremely desirable in technical documentation.
3. The use of DTDs in document preparation allows authors to deal directly with the content of texts whilst having little or no direct contact with the actual markup used. What happens with the usual sort of SGML system is that a window offers the author a choice of document elements appropriate for the document she is preparing or revising. This list of document elements is obtained by reading the DTD for the document. For example, in a memo, there will be a choice of To, From, and Body. The author clicks on the appropriate element and the markup is entered into the text (perhaps invisibly). When actually typing in the Body, the choice is narrowed down to Paragraph. Whilst this is not particularly interesting for simple documents like memos, it is clear that it would be immensely useful in constructing complex documents, and in document retrieval.

With this general idea of Electronic Documents and markup, we can look at how an MT system can exploit the fact that texts are represented in this way.

Arnold D J
Thu Dec 21 10:52:49 GMT 1995