[This local archive copy is from the official and canonical URL, http://www.ascc.net/xml/en/utf-8/ercsretro.html; please refer to the canonical source document if possible.]


Retrospective on ERCS: the Extended Reference Concrete Syntax

or "ASCII: My Part in its Downfall"

Rick Jelliffe
Computing Center
Academia Sinica
Taipei, Taiwan
1999-05-20
ricko@gate.sinica.edu.au

In mid 1994 I was given a background project at Allette Systems, Sydney, where I worked as Senior SGML Consultant, to identify any technical reasons why the SGML market in East Asia and South East Asia was stagnant.

This project soon consumed quite a lot of my time, involving trips to Japan and discussions with various contacts, and taught me a lot more about SGML, about characters, and about many arcane tidbits of CJK (China/Japan/Korea) publishing.

Out of this research evolved some proposals called the Extended Reference Concrete Syntax (ERCS), which received much useful feedback, especially from Japan. The ERCS proposals were publicized through Robin Cover's website, and the vendors' consortium SGML-Open (the previous incarnation of OASIS-Open) placed them on its website as a discussion resource for developers. There was also a link from the W3C internationalization page. This retrospective looks at the issues and solutions raised by ERCS and what has become of them after five years.

ERCS is not now promoted as a syntax with its own identity: it has found adequate expression in XML, whose development started at the very time and meetings that the minor corrections to SGML required for ERCS (Annex J, the Extended Naming Rules) were adopted.

Internationalization as Re-Colonialism?

From the start, I took the approach of saying "If I have found something important in SGML, as a Western user with Western documents, CJK users will also find it useful". In other words, even though each locale may have unique problems, there is a base level of functionality which is shared; the onus of proof can be on those who dispute that something I find useful is also useful for someone in another locale.

This approach has been very useful to me in my current work at Academia Sinica, on the Chinese XML Now! project. Users tend to acquiesce to the problems of a technology they are given, which distorts the reporting of problems and localization issues, and therefore foreign decision-makers' awareness of them. So my approach has been to actually try to process Chinese documents, to see what problems arise, and to observe Chinese users. (Some results can be found in the Chinese XML FAQ and the Chinese Text Processing FAQ.) It is a kind of usability testing, I suppose.

At first I was hesitant to try this "walk-in-their-shoes" approach: as a foreigner I had no language skills; as an SGML expert I would miss beginners' errors; and as an outsider I would be opening myself up to accusations of patriarchal, or rather re-colonial, meddling. (As a matter of personal introspection, I wonder how much my career direction has been influenced by my family history, which is completely riddled with colonial and post-colonial influences.)

I did receive some flak from some quarters; but I was greatly encouraged in particular by Mr Yasuhiro Okui and Dr Yushi Komachi along the way. Certainly I found the view that foreigners are automatically experts in their own language to be very dubious; finding people who had critical awareness of both SGML and their own language in 1994 was beyond my ability for many regions; and I also noted a tendency of some people to want to solve local issues with local solutions, which seemed to me a spurious kind of humility. However, being paid allows a certain abandon in the face of criticism, so the lion's share of kudos for this project should belong to Nick Carr at Allette Systems.

In any case, the key approach was that I was bringing my foreign expectations to judge the technology, so criticisms that I am an outsider were otiose, though understandable (especially in the light of my many gaffes). (Readers interested in the question of how a Westerner approaches the East will find Edward Said's book "Orientalism" fascinating.)

ERCS Proposals

ERCS proposed several ideas:

Proposals

For native language markup:

For using UCS as a character catalog, I proposed a very large public entity set in which all UCS characters would be available, using the hexadecimal code as their name. I used the form &U-ABCD; and later the smaller &UABCD; as the entity names.

Use Numbers not Names morphed into the SPREAD entity set and into the push for allowing hexadecimal numeric character references (HNCR).
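As a rough illustration of the numbers-not-names idea, the Python sketch below generates XML-style entity declarations whose names are the hexadecimal code points, in both the &U-ABCD; and &UABCD; naming forms. The function name, and the choice of an XML hexadecimal character reference as the replacement text, are assumptions made for this example; the real SPREAD entity set was written for SGML and its declarations may well look different.

```python
def spread_entity_declarations(first, last, hyphen=False):
    """Yield one entity declaration per code point in [first, last].

    Hypothetical helper: the declaration syntax is XML-style and is only
    an illustration of naming entities by hexadecimal code.
    """
    prefix = "U-" if hyphen else "U"
    for cp in range(first, last + 1):
        # Skip the surrogate block, which does not encode characters.
        if 0xD800 <= cp <= 0xDFFF:
            continue
        name = f"{prefix}{cp:04X}"
        yield f'<!ENTITY {name} "&#x{cp:04X};">'


for declaration in spread_entity_declarations(0x4E00, 0x4E02):
    print(declaration)
```

Because the name is derived mechanically from the code point, no one has to invent or memorize a mnemonic name for each of the tens of thousands of characters.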

None of these problems represented challenges to SGML's model: SGML had an explicit user requirement that "there should be no natural language bias", and the Unicode character set had not been created then. (The SGML declarations did not, and still do not, support the encodings Big5 and ShiftJIS, but neither of these was a registered ISO encoding in 1986, and in any case they are to some extent pathological encodings.)

I had made the decision to avoid DTD-related issues, for example, how to represent ruby annotations, and concentrated on the particular issues of the SGML declaration, which I regarded as being logically prior. CJK document structures could only be addressed after a supporting infrastructure was in place.

The second version of the ERCS recommendations can be found on Robin Cover's site. The SPREAD-2 entities can be found at the Chinese XML Now! website.

Successes

It has been very exciting to me to follow the results and flow-ons from ERCS. Sometimes these are direct, sometimes very unexpected (the dictionary!), and sometimes they can be attributed to memes, which is more or less the notion that ideas can spring up from many places in a society at the same time, because many people will simultaneously be considering the problems of the day: I have been very aware of this with ERCS, so much so that I find it difficult to think of anyone as an original thinker now (let alone me!). ERCS has definitely been part of a tide.

Here is a rough chronological listing of the highlights:

My strongest memory is reporting on ERCS at three consecutive SGML Asia/Pacific conferences: in 1995 as something needed but unlikely, in 1996 as something possible in future SGML (ENR) systems, and in 1997 as something achieved through XML.

Failures

All in all, the main part of ERCS that has not been widely implemented has been the subset repertoire "sanity-checking". However, I think that most people using XML will adopt it without much thinking, for practicality. This area represents a continued problem for XML internationalization and I respect the opinion that it should be solved by allowing numeric character references in markup as well as data in SGML/XML.

I hold a different point of view, or at least I weigh matters slightly differently. I think it is important to never underestimate the value in an XML document of having the markup readable and editable as plain text: I think this is so important that the cost of having some documents that cannot be transcoded (because native language markup is being used) is slight. To me, not being able to read markup on a system represents a bigger interoperability problem than the interoperability problem caused if some documents cannot be transcoded.

In any case, this minor inconvenience will promote UTF-8 adoption, which is a good thing. And in any case, I do not expect people to create documents using non-Latin markup for consumption outside their territory; a Chinese user would not expect a Westerner to have an input method to be able to specify the id of an element and can be expected to prefer ASCII-repertoire characters. The goal of native language markup is to allow users to use their own letters and symbols if they so choose, not that they must.

Indeed, there can be good reasons to use non-native markup: an Indian typesetter of Thai documents in Singapore told me that Latin script is good for markup because it visually stands out more; several Chinese programmers have commented how easy it is to keyboard ASCII characters rather than Chinese ideograms. Even accepting these, one of the key uses of names in a document is to provide ID attributes; these are often generated automatically from a subelement's contents; it is important to be able to use

The other part of ERCS that has failed to thrive is the idea that customary symbols should be available as part of native language markup. For those from HTML and XML backgrounds, the major reason for SGML's complication is that not only can you remap delimiters to tags, but you can also define your own delimiters to create little-languages. For example, in a mathematical formula, you might put n^2, where the "^" is equivalent to the start-tag for a "superscript" element type. That this feature is useful can be seen in discussions of MathML, where many tool-providers continue to provide their own little-languages because they are more convenient for data entry.
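To make the little-language idea concrete, here is a minimal Python sketch that expands the caret notation into explicit markup. It only imitates the effect: in SGML the mapping is declared to the parser rather than done with a regular expression, and the element name "sup" and the function name are assumptions for this example.

```python
import re

# A caret followed by one run of letters or digits is read as a superscript.
_SUPERSCRIPT = re.compile(r"\^(\w+)")


def expand_superscripts(formula):
    """Rewrite x^y as x<sup>y</sup> (hypothetical element name)."""
    return _SUPERSCRIPT.sub(r"<sup>\1</sup>", formula)


print(expand_superscripts("n^2"))
```

The author keys three characters, and the processing system sees ordinary element structure.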

However, there are two good reasons why using "customary symbols" may not be a good idea. The first is accessibility: a speech synthesizer may have difficulty reading symbol characters in a useful way. The second is that software systems may be confused by characters that are expected to be separators, not tokens. And then there is a policy consideration: the Unicode identifier properties have been published, and collaboration dictates that we should try to confine ourselves to standards, unless the standards themselves provide a mechanism for extensibility.

Finally, XML did not follow ERCS in treating non-ASCII whitespace as well-formed token separators. I think this is an error, because it allows well-formedness errors that are not visually apparent, which goes against the ideal of human-readable markup. However, it does minimize the number of token-separator characters that a parser needs to look for; this makes the XML specification simpler to read and implement, which are laudable goals. So I do not regard it as an error with bad consequences, or one which needs to be corrected.
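The kind of invisible error meant here can be shown with a short Python sketch using the standard library's xml.etree.ElementTree parser. The two test documents differ only in the character between the element name and the attribute: an ASCII space in one, an ideographic space (U+3000) in the other. The helper name is an assumption for this example.

```python
import xml.etree.ElementTree as ET

ASCII_SPACE = "\u0020"
IDEOGRAPHIC_SPACE = "\u3000"  # renders as a blank, like a space, in CJK text


def is_well_formed(text):
    """Return True if the parser accepts the text, False on a parse error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


for separator in (ASCII_SPACE, IDEOGRAPHIC_SPACE):
    document = f'<doc{separator}id="a1"/>'
    print(f"U+{ord(separator):04X}", is_well_formed(document))
```

Both documents look the same on screen in many fonts, yet only the first is accepted, which is the readability cost weighed against the simpler specification.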

1995 Suggested Implementation Strategy for ERCS

ERCS had a list of things that vendors should consider doing in order to implement ERCS. In order to see what progress has been made, the 1995 list is in italics, and followed by the strategy that would be appropriate today.

1999 Suggested Implementation Strategy for ERCS

Here is the updated list:

The biggest thing needed to complete ERCS is that all scripting and programming languages should allow native language identifiers, using the XML and Unicode rules.

It is a common processing technique to use SGML/XML names (attribute names, element type names) and values (NMTOKEN values, NOTATION names, ID values, entity names) as the keys to arrays, or as class names. If a document uses native language markup with non-ASCII characters, most script and programming languages will not be able to process them using the common techniques just mentioned.
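For a sense of what these techniques look like when the language does permit non-ASCII identifiers, here is a small sketch in Python 3, which accepts them. Element type names are used both as dictionary keys and as class names. The sample document, the class names, and the "label" attribute are all invented for this illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter


class 書:  # hypothetical handler class named after the element type
    label = "book"


class 題:  # hypothetical handler class named after the element type
    label = "title"


# Element type names as dictionary keys, mapped to same-named classes.
HANDLERS = {cls.__name__: cls for cls in (書, 題)}


def count_element_types(xml_text):
    """Count occurrences of each element type name in the document."""
    root = ET.fromstring(xml_text)
    return Counter(element.tag for element in root.iter())


doc = "<書><題>紅樓夢</題><題>西遊記</題></書>"
counts = count_element_types(doc)
for tag, n in counts.items():
    print(tag, n, HANDLERS[tag].label)
```

Using names as keys only needs Unicode strings; using them as class names is the part that depends on the language's identifier rules.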

So let me recommend to any standards-developers or systems-makers who are reading: in your next versions, support non-Latin native language identifiers. If this is outside the current standards for that language, you could make it a user option.

If you allow non-standard identifiers, you will also need to allow programmers to enter them in the GUI and in program source code. To make this easy, and to help debugging and testing, you will need to provide some form of numeric character references. For the form of these, let me strongly urge explicit end-delimiters: the form \uABCD, where the end delimiter is a space or any character other than [0-9A-F], should be avoided at all costs: it is not clear visually, and it can cause an editor to insert spurious line-breaks when word-wrapping on spaces. A form like SPREAD's &UABCD; or XML's &#xABCD; is much better.
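A minimal Python sketch of why the explicit end-delimiter helps: one regular expression can expand both the SPREAD-style and the XML-style references, because the closing ";" always marks where the hexadecimal digits stop. The function name and the limit of six hexadecimal digits are assumptions for this example.

```python
import re

# Both forms end with an explicit ";", so the reference boundary is unambiguous
# even when the following text begins with characters that are hex digits.
_REFERENCE = re.compile(r"&(?:#x|U-?)([0-9A-Fa-f]{1,6});")


def _replace(match):
    code_point = int(match.group(1), 16)
    # Leave out-of-range values untouched rather than raising an error.
    return chr(code_point) if code_point <= 0x10FFFF else match.group(0)


def expand_references(text):
    """Expand &UABCD;, &U-ABCD; and &#xABCD; style references."""
    return _REFERENCE.sub(_replace, text)


print(expand_references("&U4E2D;&#x6587;ABC"))
```

In the sample input the text after the second reference begins with "ABC", all of which are hexadecimal digits; without the ";" a reader or a variable-length scanner could not tell where the reference stops.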


Copyright (C) 1999 Rick Jelliffe. Please feel free to publish this in any way you like, but try to update it to the most recent version, and keep my name on it.