[This local archive copy is from the official and canonical URL, http://www.ascc.net/xml/en/utf-8/ercsretro.html; please refer to the canonical source document if possible.]

Retrospective on ERCS: the Extended Reference Concrete Syntax

or "ASCII: My Part in its Downfall"

Rick Jelliffe
Computing Center
Academia Sinica
Taipei, Taiwan
1999-05-20
[email protected]

In mid 1994 I was given a background project at Allette Systems, Sydney, where I worked as Senior SGML Consultant, to identify any technical reasons why the SGML market in East Asia and South East Asia was stagnant.

This project soon consumed quite a lot of my time. It involved trips to Japan and discussions with various contacts, and I learned a lot more about SGML and characters, along with many arcane tidbits about CJK (China/Japan/Korea) publishing.

Out of this research evolved some proposals called the Extended Reference Concrete Syntax (ERCS), which received much useful feedback, especially from Japanese respondents. The ERCS proposals were publicized through Robin Cover's website, and the vendors' consortium SGML-Open (the previous incarnation of OASIS-Open) placed them on its website as a discussion resource for developers. There was also a link from the W3C internationalization page. This retrospective looks at the issues and solutions raised by ERCS and what has become of them after five years.

ERCS is not now promoted as a syntax with its own identity: it has found adequate expression in XML, whose development started at the very time and meetings that the minor corrections to SGML required for ERCS (Annex J, the Extended Naming Rules) were adopted.

Internationalization as Re-Colonialism?

From the start, I took the approach of saying "If I have found something important in SGML, as a Western user with Western documents, CJK users will also find it useful". In other words, even though each locale may have unique problems, there is a base level of functionality which is shared; the onus of proof can be on those who dispute that something I find useful is also useful for someone in another locale.

This approach has been very useful to me in my current work at Academia Sinica, on the Chinese XML Now! project. Users develop a certain acquiescence to the problems of a technology, which distorts the reporting of problems and localization issues, and therefore foreign decision-makers' awareness of them. So my approach has been to actually try to process Chinese documents, to see what problems arise, and to observe Chinese users. (Some results can be found in the Chinese XML FAQ and the Chinese Text Processing FAQ.) It is a kind of usability testing, I suppose.

At first I was hesitant to try this "walk-in-their-shoes" approach: as a foreigner I had no language skills; as an SGML expert I would miss beginners' errors; and as an outsider I would be opening myself up to accusations of patriarchal, or rather re-colonial, meddling. (As a matter of personal introspection, I wonder how much my career direction has been influenced by my family history, which is completely riddled with colonial and post-colonial influences.)

I did receive some flak from some quarters, but I was greatly encouraged along the way, in particular by Mr Yasuhiro Okui and Dr Yushi Komachi. Certainly I found the view that foreigners are automatically experts in their own language to be very dubious; finding people who had critical awareness of both SGML and their own language in 1994 was beyond my ability for many regions; and I also noted a tendency of some people to want to solve local issues with local solutions, which seemed to me a spurious kind of humility. However, being paid allows a certain abandon in the face of criticism, so the lion's share of kudos for this project should belong to Nick Carr at Allette Systems.

In any case, the key approach was that I was bringing my foreign expectations to judge the technology, so criticisms that I am an outsider were otiose, though understandable (especially in the light of my many gaffes). (Readers interested in the question of how a Westerner approaches the East will find Edward Said's book "Orientalism" fascinating.)

ERCS Proposals

ERCS proposed several ideas:

native language markup (not to be confused with "natural language"): that "users should be able to use their customary letters and symbols to mark documents up", because I (as an English-speaking Westerner) found it useful to have names in SGML that I can understand, rather than mnemonics from another language;
ISO 10646 Universal Character Set (UCS) as a character catalog: that all the characters of Unicode should be available in any document, to allow transcoding, because I (as an English-speaking Westerner) found it useful to not worry about whether a character was available or not;
Reference ISO 10646 characters by number not name because most Unicode characters do not have a useful name in English; I (as an English-speaking Westerner) had found it tedious to convert numeric character references from hexadecimal numbers often found in tables, to the decimal numeric character references required by SGML, when localizing SDATA entity sets to a particular character set;
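The hexadecimal-to-decimal chore described in the last item can be illustrated with a small sketch. The function names are hypothetical and are not part of ERCS or SPREAD; the sketch only shows the conversion that SGML'86 users had to do by hand when a code table listed hexadecimal values.

```python
def hex_to_decimal_ncr(hex_code: str) -> str:
    """Turn a hexadecimal code from a table (e.g. '4E2D') into the
    decimal numeric character reference that SGML'86 required."""
    code_point = int(hex_code, 16)  # parse the table value as base 16
    return f"&#{code_point};"


def hex_to_hex_ncr(hex_code: str) -> str:
    """The hexadecimal form (HNCR): the table value can be used as-is."""
    return f"&#x{hex_code.upper()};"


if __name__ == "__main__":
    for code in ("00E9", "4E2D"):
        print(code, hex_to_decimal_ncr(code), hex_to_hex_ncr(code))
```

With the hexadecimal form, the number in the reference is the same number printed in the code table, so no conversion step is needed.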

Proposals

For native language markup:

SGML's SGML declarations (which declare the syntax used by a document type) required certain extensions to allow large character sets to be used in names: ellipsis to allow character ranges and simplified case-mapping declarations for characters which do not have case;
Unicode character properties should be used to determine which roles (NAMESTART, etc) each character should use: the specific mappings (notably the treatment of white-space) will always be contentious and the Unicode Consortium subsequently has published guidelines for using UCS characters in identifiers, though I am not sure whether this represented a response to ERCS in particular;
because different characters are not available in every encoding, and because SGML/XML does not allow numeric character references in markup, I proposed that people should restrict themselves to using subset repertoires.
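As a rough illustration of the second point, the sketch below derives a name role from a character's Unicode general category. The category sets are an assumption modelled on the rule of thumb XML 1.0 later described (letters and letter-numbers may start a name; marks, modifier letters and decimal digits may continue one); they are not the exact ERCS tables, and Python's `unicodedata` reflects a much newer Unicode version than the one available in 1995.

```python
import unicodedata

# Assumed, simplified category sets; not the actual ERCS or XML tables.
NAME_START_CATEGORIES = {"Lu", "Ll", "Lt", "Lo", "Nl"}
NAME_CHAR_CATEGORIES = {"Mn", "Mc", "Me", "Lm", "Nd"}


def name_role(ch: str) -> str:
    """Return 'NAMESTART', 'NAMECHAR' or 'OTHER' for one character."""
    if ch in "_:":          # punctuation XML allows at the start of a name
        return "NAMESTART"
    if ch in "-.":          # punctuation XML allows inside a name
        return "NAMECHAR"
    category = unicodedata.category(ch)
    if category in NAME_START_CATEGORIES:
        return "NAMESTART"
    if category in NAME_CHAR_CATEGORIES:
        return "NAMECHAR"
    return "OTHER"


def is_name(text: str) -> bool:
    """True if the first character can start a name and the rest can continue it."""
    if not text or name_role(text[0]) != "NAMESTART":
        return False
    return all(name_role(ch) != "OTHER" for ch in text[1:])


if __name__ == "__main__":
    for candidate in ("title", "書名", "2nd", "書 名"):
        print(repr(candidate), is_name(candidate))
```

Deriving roles from properties rather than listing characters one by one is what keeps the declaration manageable for a repertoire the size of UCS.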

For using UCS as a character catalog, I proposed a very large public entity set in which all UCS characters would be available, using the hexadecimal code as their name. I used the form &U-ABCD; and later the smaller &UABCD; as the entity names.
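The naming scheme can be sketched as follows. The helper names, the four-digit uppercase padding and the expansion routine are assumptions made for illustration; the real SPREAD entity sets were SGML entity declarations, not code.

```python
import re


def spread_entity_name(ch: str, hyphen: bool = False) -> str:
    """Build an entity name of the form UABCD (or U-ABCD) for one character.
    Four-digit uppercase hexadecimal is an assumption of this sketch."""
    prefix = "U-" if hyphen else "U"
    return f"{prefix}{ord(ch):04X}"


# Matches &UABCD; and &U-ABCD; (both spellings mentioned in the text).
_SPREAD_REF = re.compile(r"&U-?([0-9A-F]{4});")


def expand_spread_refs(text: str) -> str:
    """Replace each entity reference by the character its name encodes."""
    return _SPREAD_REF.sub(lambda m: chr(int(m.group(1), 16)), text)


if __name__ == "__main__":
    print(spread_entity_name("中"))                # U4E2D
    print(expand_spread_refs("&U4E2D;&U-6587;"))   # the two characters of "中文"
```

Because the name itself carries the code, a processor can resolve such references algorithmically instead of reading a very large entity file.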

Use Numbers not Names morphed into the SPREAD entity set and into the push for allowing hexadecimal numeric character references (HNCR).

None of these problems represented challenges to SGML's model: SGML had an explicit user requirement that "there should be no natural language bias", and the Unicode character set had not been created then. (The SGML declarations did not, and still do not, support the encodings Big5 and ShiftJIS, but neither of these was a registered ISO encoding in 1986, and in any case they are to some extent pathological encodings.)

I had made the decision to avoid DTD-related issues, for example, how to represent ruby annotations, and concentrated on the particular issues of the SGML declaration, which I regarded as being logically prior. CJK document structures could only be addressed after a supporting infrastructure was in place.

The 2nd version of the ERCS recommendations can be found on Robin Cover's site. The SPREAD-2 entities can be found at the Chinese XML Now! website.

Successes

It has been very exciting to me to follow the results and flow-ons from ERCS. Sometimes these are direct, sometimes very unexpected (the dictionary!), and sometimes they can be attributed to memes, which is more or less the notion that ideas can spring up from many places in a society at the same time, because many people will simultaneously be considering the problems of the day: I have been very aware of this with ERCS, so much so that I find it difficult to think of anyone as an original thinker now (let alone me!). ERCS has definitely been part of a tide.

Here is a rough chronological listing of the highlights:

1994, ERCS referenced in Gavin Nicol's seminal "The Multilingual World Wide Web";
1995, ERCS discussed by W3C HTML working group at length during discussions of HTML 2.0, again championed flatteringly by Gavin Nicol: "If we do want to define a single SGML declaration that is locale-independent, and which offers a clean, and efficient, implementation strategy, let's just use ERCS. I have not seen, and very much doubt I will ever see, anything superior." (But there was a superior way to implement ERCS, and it was Gavin who soon found it: see below.)
1st public version released
1995, Gavin Nicol proposed for HTML that the document character set should be UCS regardless of file encoding (based on work by James Clark and others at ISO WG8 for the ISO HyTime standard, which separated the storage encoding from the document character set): this was a much more elegant mechanism than ERCS's SPREAD entities, but nevertheless it is at heart the same idea, that of using UCS as a character catalog despite the document encoding; Gavin made many insightful comments during ERCS's development, especially confirming my view that "everyone should adopt UCS-2 encoding" was not a practical option (without some kind of migration technology);
1995, ERCS mentioned in articles in <TAG> magazine "Scrutable Asia" by Rick Jelliffe and "Postcard from Tokyo" by Gavin Nicol;
1995, the CJK Document Processing Group (CJK DOCP), a group of East Asian academics and technocrats in liaison with ISO and some business consortia, kindly endorsed ERCS, and allowed me to bring out the entity set under their project name SPREAD, the Standardization Project Regarding East Asian Documents;
1995, presented ERCS ideas to ISO JTC1/SC18/WG 8;
2nd Version released;
1996, the Extended Naming Rules (ENR) were voted into SGML as ISO 8879 Annex J, and HNCR were accepted in principle for a future XML revision; Japan had already been requesting something like ENR for some time;
1996, James Clark implements ENR in the SP parser;
1997, ERCS kindly referenced in RFC 2070, "Internationalization of the Hypertext Markup Language" by Yergeau, Nicol, Adams, Duerst;
1997, Christian Wittern and the Japanese IRIZ project adopt the same naming approach as SPREAD for their historical collection of characters, and endorse SPREAD;
1997, a major European publishing company informs me it is using SPREAD;
1997, HTML 4 uses HNCR, and uses UCS as the document character set regardless of file encoding;
1997, HNCR were voted into SGML in ISO 8879 Annex K;
1997, various ISO standards allow Unicode numeric character references using the U-ABCD or UABCD form (in the case of the ISO keyboarding method for UCS, this is partly because I acted as the Australian reviewer and requested it);
1997, TimeLux implements Native Language Markup in their EditTime editor;
1997, Balise implement Native Language Markup in their text processing software, notably the subset checking, called by them sanity checking;
1998, XML specification allows native language markup, with name-roles determined by Unicode character category, an SGML declaration declared using ENR, HNCR, and uses UCS as the document character set, regardless of file encoding; it also provides the kind of migration path to allow staged insertion of UCS-based components (XML as a "Trojan horse" or "Typhoid Mary" for Unicode);
1999, XHTML is proposed to allow HTML-in-XML;
1999, several SGML editing companies request the SPREAD entities to send out as their standard distribution, to allow gradual moving from SGML'86 systems to SGML'97 (e.g., XML);
1999, a Japanese dictionary is published giving as part of its entry the UCS number in the form U-ABCD (however, this is just as likely to be following on from Java);
1999, WWW browsers from Microsoft and Netscape handle HTML4 and XML pretty well, including HNCRs (with certain problems in IE5 using XML and CSS);
1999, Academia Sinica Computing Centre starts to develop "lossless transcoders" which replace missing characters with HNCRs.

My strongest memory is reporting on ERCS at three consecutive SGML Asia/Pacific conferences: in 1995 as something needed but unlikely, in 1996 as something possible in future SGML (ENR) systems, and in 1997 as something achieved through XML.

Failures

All in all, the main part of ERCS that has not been widely implemented has been the subset repertoire "sanity-checking". However, I think that most people using XML will adopt it without much thinking, for practicality. This area represents a continued problem for XML internationalization and I respect the opinion that it should be solved by allowing numeric character references in markup as well as data in SGML/XML.

I hold a different point of view, or at least I weigh matters slightly differently. I think it is important to never underestimate the value in an XML document of having the markup readable and editable as plain text: I think this is so important that the cost of having some documents that cannot be transcoded (because native language markup is being used) is slight. To me, not being able to read markup on a system represents a bigger interoperability problem than the interoperability problem caused if some documents cannot be transcoded.

In any case, this minor inconvenience will promote UTF-8 adoption, which is a good thing. And in any case, I do not expect people to create documents using non-Latin markup for consumption outside their territory; a Chinese user would not expect a Westerner to have an input method to be able to specify the id of an element and can be expected to prefer ASCII-repertoire characters. The goal of native language markup is to allow users to use their own letters and symbols if they so choose, not that they must.
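The subset repertoire "sanity-checking" discussed above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name is hypothetical, the list of markup names is assumed to have been collected already, and Python's codec names stand in for the output encodings.

```python
def unavailable_markup_chars(names, encoding):
    """Map each markup name to the characters that `encoding` cannot
    represent. An empty result means the names are safe to write out."""
    problems = {}
    for name in names:
        missing = []
        for ch in name:
            try:
                ch.encode(encoding)      # can this character be written?
            except UnicodeEncodeError:
                missing.append(ch)
        if missing:
            problems[name] = missing
    return problems


if __name__ == "__main__":
    element_names = ["title", "書名"]
    for target in ("ascii", "utf-8"):
        print(target, unavailable_markup_chars(element_names, target))
```

An editing or exporting tool could run such a check before saving and warn the user, since a character missing from markup cannot be rescued by a numeric character reference.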

Indeed, there can be good reasons to use non-native markup: an Indian typesetter of Thai documents in Singapore told me that Latin script is good for markup because it visually stands out more; several Chinese programmers have commented how easy it is to keyboard ASCII characters rather than Chinese ideograms. Even accepting these, one of the key uses of names in a document is to provide ID attributes; these are often generated automatically from a subelement's contents; it is important to be able to use

The other part of ERCS that has failed to thrive is the idea that customary symbols should be available as part of native language markup. For those from HTML and XML backgrounds, the major reason for SGML's complication is that not only can you remap delimiters to tags, but you can also define your own delimiters to create little-languages. For example, in a mathematical formula, you might put n^2, where the "^" is equivalent to the start-tag for a "superscript" element type. That this feature is useful can be seen in discussions of MathML, where many tool-providers continue to provide their own little-languages because they are more convenient for data entry.

However, there are two good reasons why using "customary symbols" may not be a good idea. The first is accessibility: a speech synthesizer may have difficulty reading symbol characters in a useful way. The second is that software systems may be confused by characters that are expected to be separators, not tokens. And then there is a policy consideration: the Unicode identifier properties have been published, and collaboration dictates that we should try to confine ourselves to standards, unless the standards themselves provide a mechanism for extensibility.

Finally, XML did not follow ERCS in treating non-ASCII whitespace as well-formed token separators. I think this is an error, because it allows well-formedness errors that are not visually apparent, which goes against the ideal of human-readable markup. However, it does minimize the number of token-separator characters that a parser needs to look for; this makes the XML specification simpler to read and implement, which are laudable goals. So I do not regard it as an error with bad consequences, or one which needs to be corrected.
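The effect can be observed with any conforming XML parser. The sketch below uses Python's standard `xml.etree.ElementTree`; the helper name is hypothetical. U+3000 (IDEOGRAPHIC SPACE) looks like a space on screen but is neither XML whitespace nor a name character, so a tag that uses it as a separator is not well-formed.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document: str) -> bool:
    """Return True if the parser accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    ascii_space = '<a b="1"/>'
    ideographic_space = '<a\u3000b="1"/>'   # U+3000 between name and attribute
    print(is_well_formed(ascii_space))
    print(is_well_formed(ideographic_space))
```

On a typical display the two documents look almost identical, which is exactly the kind of invisible error described above.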

1995 Suggested Implementation Strategy for ERCS

ERCS had a list of things that vendors should consider doing in order to implement ERCS. To show what progress has been made, each item of the 1995 list is marked "1995:" below and is followed by the strategy that would be appropriate today, marked "1999:".

1995: Vendors should support ellipses in the SGML declaration syntax declaration, for NAMING and SHORTREFs. (Proposed for ISO 8879)
1999: Support XML. The SGML declaration for XML uses ranges, as in ISO 8879 Annex J.
1995: Vendors should support NMCHAR and NMSTRT character classes in NAMING declaration. (Proposed for ISO 8879)
1999: Support XML. The SGML declaration for XML uses NAMESTART and NAMECHAR, as in ISO 8879 Annex J.
1995: Vendors should support invocation of syntax declaration through public identifiers.
1999: Support XML, always using the XML header and making sure the MIME media-types are sent correctly. The XML header implies an SGML declaration.
1995: Vendors should specify (in the system declaration) whether their system allows 16-bit markup characters, and whether there are restrictions on the number of shortrefs or SDATA entities. Any 8-bit dependencies should be noted.
1999: Support XML. All XML systems must support UTF-8 and UTF-16. However, vendors should make clear which character encodings their systems support.
1995: Vendors should make sure that their products support long LITLEN.
1999: Support XML. The SGML declaration for XML has a long LITLEN.
1995: Include a copy of "-//SPREAD ERCS//SYNTAX Extended (ISO/IEC 10646-1:1993 repertoire)//EN"
1999: Support XML. The SGML declaration for XML follows closely and supersedes the ERCS syntaxes.
1995: Include copies of "-//SPREAD ERCS//SYNTAX Extended (ISO 8859-1 repertoire)//EN" to "-//SPREAD ERCS//SYNTAX Extended (ISO 8859-9 repertoire)//EN" as they are developed with distributions of the SGML application.
1999: There is currently no standard approach to this. Vendors of editing or exporting products should allow some kind of subset repertoire "sanity checking" to warn if a character used in markup is not available in the output encoding. At the least the user should be notified; a system should never generate corrupted markup. If SGML and XML are altered to allow numeric character references in names, this issue will largely go away. And the increasing use of UTF-8 will also make it less important.
1995: Implement the algorithmic processing of ISO/IEC 10646-1:1993 public entity characters (i.e. those beginning with U), and build them in so they don't need to be read in from a file. (If the proposed CHAR declarations are not accepted into ISO 8879.)
1999: Support XML. Because ISO 10646 is the document character set, numeric character references subsume the role of the SPREAD entity sets: the effect is algorithmic processing of the characters using their names, but the delimiter is different and the model is much more elegant.
1995: Consider building or bundling tools to help clean up incoming data; in error handling of name misspelling, consider the half-width full-width equivalence and the small and normal katakana equivalence for Japanese; for Indic and Middle Eastern languages consider canonical ordering of characters; for European and accented languages consider the equivalence of combined characters and of base+accent sequences. This will help robustness, without sacrificing validatability.
1999: The World Wide Web Consortium and the Unicode Consortium are currently discussing various normalization strategies, such as those suggested. Support these when they arrive. The current plans seem to rely on "early normalization" (i.e., generate good data); however, I think that prudent software makers will allow some kinds of normalization at import.
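Some of the equivalences in the last item correspond to what later became the Unicode normalization forms, which Python exposes through `unicodedata.normalize`. The sketch below is an illustration only: the function names are hypothetical, and treating NFC as the import form and NFKC as the loose comparison for misspelled names is an assumption of this example, not something ERCS specified.

```python
import unicodedata


def normalize_on_import(text: str) -> str:
    """Compose base+accent sequences into precomposed characters (NFC)."""
    return unicodedata.normalize("NFC", text)


def loosely_equal_names(a: str, b: str) -> bool:
    """Compare two names while ignoring compatibility differences such as
    half-width versus full-width forms (NFKC)."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)


if __name__ == "__main__":
    # 'e' followed by COMBINING ACUTE ACCENT becomes one precomposed character.
    print(len(normalize_on_import("e\u0301")))
    # HALFWIDTH KATAKANA LETTER KA versus the normal katakana KA.
    print(loosely_equal_names("\uff76", "\u30ab"))
    # Small versus normal katakana is NOT unified by Unicode normalization;
    # that equivalence would need its own table.
    print(loosely_equal_names("\u30a1", "\u30a2"))
```

A tool would use the loose comparison only to suggest corrections for a misspelled name, so that validation itself stays strict.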

1999 Suggested Implementation Strategy for ERCS

Here is the updated list:

Use XML, always using an XML header;
Provide "lossless transcoding" so that when an output encoding does not have a character available, the user is warned (if in a CDATA section or markup), or a numeric character reference is generated (if in data);
Provide some character subset "sanity-checking" mechanism;
Generate normalized data, and consider building normalization into your import routines also (these can be cheap: they only need affect data that is in unnormalized form);
Make sure scripting languages and programming languages support the non-Latin characters that can be used in XML names and identifiers (see below);
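The "lossless transcoding" item in this list can be sketched as follows. The function names are hypothetical, and the sketch assumes a stateless, ASCII-compatible output encoding (character-by-character encoding would misbehave with stateful encodings such as ISO-2022-JP or with encodings that emit a byte-order mark).

```python
def transcode_data(text: str, encoding: str) -> bytes:
    """Encode character data, replacing any character the output encoding
    lacks with a hexadecimal numeric character reference."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(encoding)
        except UnicodeEncodeError:
            # Assumes the output encoding is ASCII-compatible.
            out += f"&#x{ord(ch):X};".encode("ascii")
    return bytes(out)


def transcode_markup(name: str, encoding: str) -> bytes:
    """Encode a markup name. References are not allowed in markup, so a
    missing character is reported instead of being replaced."""
    try:
        return name.encode(encoding)
    except UnicodeEncodeError as err:
        raise ValueError(
            f"name {name!r} cannot be written in {encoding}"
        ) from err


if __name__ == "__main__":
    print(transcode_data("caf\u00e9 \u4e2d", "latin-1"))
    try:
        transcode_markup("\u66f8\u540d", "latin-1")
    except ValueError as warning:
        print("warning:", warning)
```

Python's built-in `errors="xmlcharrefreplace"` handler does the same job for data but emits decimal references; the loop above is written out only to produce the hexadecimal form.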

The biggest thing needed to complete ERCS is that all scripting and programming languages should allow native language identifiers, using the XML and Unicode rules.

It is a common processing technique to use SGML/XML names (attribute names, element type names) and values (NMTOKEN values, NOTATION names, ID values, entity names) as the keys to arrays, or as class names. If a document uses native language markup with non-ASCII characters, most script and programming languages will not be able to process them using the common techniques just mentioned.
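The techniques mentioned (names as keys, names as class names) look like this in a language that does accept non-Latin identifiers. Python 3 is used here as one example of such a language; the sample document and its element names are invented for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# An invented document using native language markup.
SAMPLE = "<書><章 編號='一'>甲</章><章 編號='二'>乙</章></書>"

root = ET.fromstring(SAMPLE)

# Element type names used directly as dictionary keys.
counts = Counter(element.tag for element in root.iter())

# Element type names used as class names, one class per element type.
classes = {tag: type(tag, (object,), {}) for tag in counts}

# Attribute values collected under the attribute's own name.
numbers = [chapter.get("編號") for chapter in root.iter("章")]

if __name__ == "__main__":
    print(counts["章"], classes["章"].__name__, numbers)
    print("書".isidentifier())  # True when the language accepts the name
```

In a language whose identifiers are restricted to ASCII, the dictionary keys would usually still work, but the class-name technique would not.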

So let me recommend to any standards-developers or systems-makers who are reading: in your next versions, support non-Latin native language identifiers. If this is outside the current standards for that language, you could make it a user option.

If you allow non-standard identifiers, you will also need to allow programmers to enter them in the GUI and in program source code. To make this easy, and to help debugging and testing, you will need to provide some form of numeric character references. For the form of these, let me strongly urge explicit end-delimiters: the form \uABCD where the end delimiter is a space or any character other than [0-9A-F] should be avoided at all costs: it is not clear visually, and it can cause an editor to insert spurious line-breaks when word-wrapping on spaces. A form like SPREAD's &UABCD; or XML's &#xABCD; is much better.
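The ambiguity of an escape without an end delimiter can be shown with two toy decoders. Both are hypothetical: the first models a variable-length \u form that stops at the first character outside [0-9A-F], the second models the delimited &#x...; form. Neither is the escape syntax of any particular language.

```python
import re


def decode_undelimited(text: str) -> str:
    """Toy decoder: \\u followed by as many hex digits as can be found."""
    return re.sub(
        r"\\u([0-9A-F]+)",
        lambda m: chr(int(m.group(1), 16)),
        text,
    )


def decode_delimited(text: str) -> str:
    """Toy decoder: &#x, hex digits, then an explicit ';' end delimiter."""
    return re.sub(
        r"&#x([0-9A-Fa-f]+);",
        lambda m: chr(int(m.group(1), 16)),
        text,
    )


if __name__ == "__main__":
    # Intended: the letter A (U+0041) followed by the text "BCD".
    print(decode_delimited("&#x41;BCD") == "ABCD")
    # Without an end delimiter the following letters are swallowed:
    print(decode_undelimited("\\u41BCD") == chr(0x41BCD))
```

With the undelimited form the writer has to insert a space or other separator after the escape, which is the source of the visual confusion and word-wrapping problems described above.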