SGML: ISO entity files - Rick Jelliffe Explains

'ISO entity files' - Explained by Rick Jelliffe


Date: Sun, 20 Jul 1997 14:53:34 +1000
From: Rick Jelliffe - ricko@allette.com.au
Subject: (ISOdia, ISOtech, etc explained)
To: xml-dev@ic.ac.uk

      ---------------------------------------------------------------

Someone on this list has asked what ISOdia, ISOtech etc are.

The SGML standard (ISO 8879) included several sets of entity definitions for many special characters:

* ISOlat1 gives the characters in the extended Latin alphabet #1, which is also the upper part of ISO 8859-1
* ISOlat2 gives a whole lot of extra Latin characters
* ISOgrk1 and ISOgrk2 give simple modern Greek characters
* ISOcyr1 and ISOcyr2 give modern Russian and non-Russian Cyrillic characters
* ISOdia gives spacing versions of diacritical marks (and is therefore not very useful, I think)
* ISOpub, ISOtech give symbols used in publishing and science
* ISOnum, ISOgrk3 and ISOgrk4 give symbols used in mathematics
* ISObox gives the box characters (yuck)

These entity sets allow you to use special characters in your document, regardless of which document character set you are using. Well-known examples are "&lt;", "&amp;", and "&mdash;".
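As a rough illustration of what a receiving system does with such references, here is a minimal Python sketch. The table ENTITY_TO_CHAR and the function expand_entities are made-up names for this example, the table is a tiny illustrative subset rather than a full entity set, and the name pattern is a simplification of what SGML and XML actually allow.

```python
import re

# Hypothetical lookup table for a recipient that can display Unicode.
# Only a handful of names are listed, purely for illustration.
ENTITY_TO_CHAR = {
    "lt": "<",
    "amp": "&",
    "mdash": "\u2014",   # em dash
    "eacute": "\u00e9",  # small e with acute accent
}

# Simplified pattern for a general entity reference such as &mdash;
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);")


def expand_entities(text, table=ENTITY_TO_CHAR):
    """Replace known entity references; leave unknown ones untouched."""
    def substitute(match):
        return table.get(match.group(1), match.group(0))
    return ENTITY_REF.sub(substitute, text)


if __name__ == "__main__":
    print(expand_entities("caf&eacute; &mdash; R&amp;D"))
```

The replacement is done in a single pass, so the "&" produced from "&amp;" is not scanned again as the start of a new reference.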

They are "SDATA" entity sets, which means that it is the job of the recipient to map them to something locally useful. XML uses ISO 10646 (=Unicode) as its document character set, so I made up versions of some of the public entity sets resolved for use with ISO 10646. Those were the sets I posted.
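To make the difference concrete, the following Python sketch prints an SGML-style SDATA declaration next to XML-style declarations resolved to ISO 10646 code points. It is only an illustration under stated assumptions: the names SAMPLE_ENTITIES, sdata_declaration, xml_declaration and xml_entity_set are invented for this example, the five entities are a small sample, and the exact layout of the real entity files (padding, comments) is not reproduced.

```python
# Illustrative sample only: entity name -> (source set, ISO 10646 code point).
SAMPLE_ENTITIES = {
    "eacute": ("ISOlat1", 0x00E9),
    "mdash": ("ISOpub", 0x2014),
    "copy": ("ISOnum", 0x00A9),
    "alpha": ("ISOgrk3", 0x03B1),
    "infin": ("ISOtech", 0x221E),
}


def sdata_declaration(name):
    """SGML style: the replacement is a system-specific SDATA string
    that the recipient must map to something it can render."""
    return f'<!ENTITY {name} SDATA "[{name}]">'


def xml_declaration(name, code_point):
    """XML style: the replacement is a numeric character reference,
    so no recipient-specific mapping is needed."""
    return f'<!ENTITY {name} "&#x{code_point:04X};">'


def xml_entity_set(entities):
    """Build the XML declarations for a whole table, sorted by name."""
    return "\n".join(
        xml_declaration(name, code_point)
        for name, (_source_set, code_point) in sorted(entities.items())
    )


if __name__ == "__main__":
    print(sdata_declaration("mdash"))
    print(xml_entity_set(SAMPLE_ENTITIES))
```

With declarations of the second kind, a reference such as &mdash; resolves directly to a character in the document character set.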

The ISO standard character entity sets are almost universally used in SGML documents, and giving the XML versions of them makes translation from SGML to XML easier.

W3C has put out its own versions (HTMLsymbol, HTMLlat2, and HTMLmisc), which contain a selection of the most ubiquitous special characters from the ISO sets: basically, the characters in the so-called ANSI code page used on Windows and the Adobe Symbol font, again resolved for ISO 10646.

Other public entity sets of interest are:

* Martin Bryan has put together ISOchem (in ISO 9573, Techniques for using SGML) for chemical symbols;
* TEI (Harry Gaylord, etc) has put together entity sets for Arabic and Hebrew;
* I have put together a set (SPREAD) for representing all Unicode characters as entities: in the case of XML, this is redundant, but it does allow transport between XML and SGML fairly trivially;
* The American Mathematical Society has contributed, and ISO has standardised, several sets of characters for mathematical use (ISOamsr etc.);
* Anders Berglund has revised some of the ISO 8879 public entity sets and the TEI sets, and added some others for other languages (e.g. Thai), as part of ISO 9573 part 15, "Entity sets for non-Latin Languages". These sets have apparently been voted YES by ISO national bodies, but have sat in limbo for several years, apparently due to some obstruction from somewhere, which (if true) is a very bad thing. (I hope members of ISO bodies will get their national bodies to investigate and push for distribution of the public entity sets.)

I hope this answers the question from the list member who asked.

In closing, ISO 10646 does not contain all the symbols or letters in the world, especially in technical fields (the chemical symbols in ISOchem are one example). So there is still a need for XML to provide a standard way to request special glyphs over the web. There will always be this need, since the number of glyphs and characters is unbounded. I hope that the XML WG will look at this issue. Gavin Nicol has started a mailing list on this subject.

Rick Jelliffe