Comments on the GEDCOM Future Directions document

[This local archive copy mirrored from the canonical site: http://home.iclweb.com/icl2/mhkay/gedcomments9808.html; links may not have complete integrity, so use the canonical document at this URL if possible.]

Michael H. Kay, 17 May 1998

Introduction

A proposal for a revision to the GEDCOM standard has been published at ftp://gedcom.org/pub/genealogy/gedcom. I shall refer to this as the Future Directions document.

Most of the document is concerned with revisions to the information model, with which I have no quarrel. My comments are confined to the small part of the document that deals with the data format (specifically Chapter 1, and appendices C, D, and E).

The new standard drops compatibility with GEDCOM 5.5, for very good reasons. This creates a one-off opportunity to move forwards to the use of modern industry-standard data encodings. This paper argues the case for such an approach.

XML - a summary

XML has been developed by the World Wide Web Consortium (W3C) as a standard for encoding documents and datasets for use on the web. Version 1.0 of the standard was ratified in February 1998 and it is widely supported by IT vendors such as Microsoft and Netscape. A large number of products supporting XML are already available for free download on the Web.

XML is a syntax for tagging documents. Like HTML, it complies with the internationally-accepted SGML standard, but unlike HTML, it allows application-defined tags. Thus one can define a set of tags for mathematical equations, a set of tags for musical notation -- and a set of tags for genealogical data. Each such set of tags is specified in a Document Type Definition or DTD, which can either be defined by individual users or enterprises, or standardized within a user community.

XML is ideally suited as a tagging syntax for the GEDCOM information model. It can do everything that the proposed GEDCOM data format does, and a great deal more. Moreover, it is as easy to convert existing GEDCOM files to a new XML-based encoding as it is to convert them to the new data format proposed in Future Directions.

I have written a prototype DTD called GedML as a proof of concept. A description, and sample software, can be downloaded from http://home.iclweb.com/icl2/mhkay/gedml.html.

Benefits of an XML-based encoding

The immediate benefits of using XML for encoding the GEDCOM information model are as follows:

Better validation. XML documents are validated against the rules in the DTD. If a DTD is published for GedML, parsers will check that the documents conform to the GedML syntax rules. This provides for a far better level of validation, and hence quality of interchange, than with the current GEDCOM encoding where validation is at the discretion of each implementor.
Widespread availability of XML parsers. There are already at least five XML parsers available on the web for free download, and they all offer a standard interface (SAX 1.0), so any application works with any parser. Using widely-available and standardized parser technology makes it far easier to write applications that process the data, and it makes it far more likely that applications will interpret the standard in the same way. So we will have more GEDCOM applications and better GEDCOM applications as a result. To illustrate this, with GedML I was able to write an application to produce a sorted list of surnames appearing in a dataset in about 120 lines of Java code; the same program written to access raw GEDCOM 5.5 data is about 750 lines of code.
Hiding character set issues. XML deals with the problem of character coding, hiding it entirely from applications. It offers full support for Unicode, with all the complexity of doing so handled by the parser. Character sets are discussed further below.
Improved extensibility. The GEDCOM encoding is already quite extensible, but certain changes are difficult to make. XML increases the extensibility still further. For example, it becomes possible to define optional tags that can be applied to the components of a personal or place name, without duplication of the data itself: software that does not understand these tags can simply ignore them and use the untagged value.
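
As a rough illustration of the surname-listing application mentioned above, here is a minimal sketch in Python using the standard library's SAX interface. It is not the GedML software itself. The element names (GED, INDI, NAME) and the sample data are assumptions made for the example: they simply reuse GEDCOM tag names, with the surname delimited by slashes as in GEDCOM 5.5.

```python
import xml.sax

# Hypothetical GedML-style sample: element names reuse GEDCOM tags,
# and the surname sits between slashes as in GEDCOM 5.5.
SAMPLE = """<GED>
  <INDI ID="I1"><NAME>James /Smith/</NAME></INDI>
  <INDI ID="I2"><NAME>Mary /Ironside/</NAME></INDI>
  <INDI ID="I3"><NAME>Anne /Smith/</NAME></INDI>
</GED>"""


class SurnameHandler(xml.sax.ContentHandler):
    """Collects the text between slashes in every NAME element."""

    def __init__(self):
        super().__init__()
        self.surnames = set()
        self._buffer = None  # None while we are outside a NAME element

    def startElement(self, name, attrs):
        if name == "NAME":
            self._buffer = []

    def characters(self, content):
        # The parser may deliver text in several chunks, so accumulate them.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "NAME" and self._buffer is not None:
            parts = "".join(self._buffer).split("/")
            if len(parts) >= 3 and parts[1].strip():
                self.surnames.add(parts[1].strip())
            self._buffer = None


def sorted_surnames(xml_text):
    """Return the distinct surnames in the document, in sorted order."""
    handler = SurnameHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return sorted(handler.surnames)


print(sorted_surnames(SAMPLE))
```

All of the tokenizing, character decoding and well-formedness checking is done by the parser; the application only has to say what it wants to do with NAME elements.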

As well as these immediate benefits, there are longer-term benefits which will accrue as XML extends in scope -- many of these will start to be seen in the next 12 months. These include:

A rich hyperlinking model, allowing data in one GEDCOM file to refer to data in another either by hard-coded pointers (record number 189 in file xyz.ged) or by softer pointers (the individual called "James Smith" born in 1865) which are more resilient to change in the referenced dataset.
Incorporation of XML parsing technology as standard within web browsers, allowing GEDCOM files to be stored directly on the web and displayed by the user's favorite display application. This is a considerable improvement on the current situation, where the dataset owner typically converts the data to an HTML display form for access on the web, and is accessible to users only in that rendition.
It is likely that internet search engines will become XML-aware. They will probably recognize XML files and categorize them according to the DTD used, so if GedML files were widely available on the internet, it would be possible to ask a general-purpose search engine such as Infoseek for all GedML files containing "England" as a surname. This is a considerable advance on current search possibilities.

Implementation concerns

My experience over a wide range of IT-standards activities suggests a golden rule: the success of a proposed standard depends largely on the cost of implementing it.

There are two kinds of GEDCOM implementor, the professional and the amateur. Professional implementors (the vendors of genealogy products) tend to employ professional programmers and they can invest a reasonable amount of time and effort in ensuring that they conform to all details of the standard. For example, they can afford to test that their products interwork with all their major competitors.

Amateur implementors may be professional programmers working in their spare time, or keen genealogists who are self-taught programmers. Generally, their aim is to get something working quickly rather than to test it to exhaustion. Increasingly, however, the tools and utilities produced by amateur implementors are appearing on the web and some of them are becoming highly popular.

It should therefore be a key aim of the GEDCOM standard to ensure:

That amateur implementors can write programs to access GEDCOM data quickly and easily
That when they do so, they implement the standard correctly

An encoding based on XML furthers both these aims, because amateurs will be able to use standard off-the-shelf XML parsers rather than writing their own GEDCOM parsing layer within the application.

Character set issues

I have not been able to read Appendix D of Future Directions, as it appears blank in my copy. I am relying here on the section Character set on page 61.

Genealogical data covers the globe. The only character set that meets the requirement is Unicode, also known as ISO 10646.

Future Directions still allows use of the ANSEL character set. I think the genealogical community should definitely take this opportunity to get rid of ANSEL once and for all. ANSEL was never standardized outside the US, and even within the USA it achieved only very limited success. The only other significant user is the MARC standard for library cataloguing, and MARC is itself undergoing revision (in all probability to use XML and Unicode). ANSEL is not widely understood, and its specification is available only at inordinate cost. As a result, much genealogical software has implemented it incorrectly. Searching for names containing accented characters on a genealogical database such as GENSERV is therefore a nightmare.

XML is firmly based on Unicode, though it also permits encoding using other popular character sets such as ISO 8859-1 (Latin Alphabet Number 1, known in the Microsoft world as "ANSI") if required.

Unicode is supported by the latest programming tools of choice, for example Java and current versions of Visual Basic. The normal method of encoding Unicode data (UTF-8) uses only one byte for characters that appear in the standard ASCII set, so for genealogies in which most of the names are English, there will be no overhead in the space occupancy on disc.

Incidentally, the document Future Directions on page 61 makes a distinction between ASCII and 7-bit ASCII, and also refers to ANSI. This is very confusing. The term ASCII is almost invariably used by experts to mean the US national variant of ISO 646, as defined in ANSI X3.4-1968. Other 7-bit standards (such as the UK or French national variants of ISO 646) are not ASCII (the A in ASCII means American). Moreover they are almost totally obsolete, although I suppose we need to cater for the fact that some genealogists have very old computers! ASCII is a 7-bit standard; there is no such thing as "8-bit ASCII". A standard of that name was once developed but it was never really used; instead on DOS, IBM proprietary 8-bit character sets (called code pages) were used. Apple also defined a proprietary superset of ASCII on the Macintosh. These code pages are supersets of ASCII, but they are not "8-bit ASCII".

The use of the term "ANSI" to mean ISO 8859-1, or some Microsoft-defined variant of it, is also unfortunate, as is the term "code pages" to refer to variants of it. Clearly the document needs input here from someone with more detailed knowledge of character coding standards.

The key point is that all this is history! Today we have an opportunity to define Unicode as the single interchange format for character data. Unicode is viable today: it is widely supported by today's software, and it is "the new ASCII" for a global community. We should grasp this opportunity to get rid of the confusion and parochialism of the past.

Binary data

XML includes facilities to support binary data, for example image or sound files. These can either be represented inline (using Base64 encoding as a character stream) or out-of-line as an external entity.
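
To make the inline option concrete, here is a minimal sketch of embedding a few bytes in an element as Base64 text and recovering them again. The element name BLOB and the ENCODING attribute are hypothetical, chosen for the example; they are not taken from Future Directions or from GedML.

```python
import base64
import xml.etree.ElementTree as ET


def embed_binary(tag, data):
    """Wrap raw bytes in an element as Base64 character data."""
    elem = ET.Element(tag, {"ENCODING": "base64"})  # attribute name is an assumption
    elem.text = base64.b64encode(data).decode("ascii")
    return elem


def extract_binary(elem):
    """Recover the original bytes from an element built by embed_binary."""
    return base64.b64decode(elem.text)


# A few arbitrary bytes standing in for the start of an image file.
payload = bytes([0x89, 0x50, 0x4E, 0x47, 0x00, 0xFF])

xml_text = ET.tostring(embed_binary("BLOB", payload), encoding="unicode")
round_trip = extract_binary(ET.fromstring(xml_text))
print(xml_text)
print(round_trip == payload)
```

Because Base64 uses only letters, digits and a few punctuation characters, the encoded data needs no escaping inside the XML document.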

If the encoding of the data is properly declared in the DTD it is possible for the parser to take away some of the legwork from the application. For example, it can hide the difference between internal and external objects, and it can in principle invoke automatic format conversion e.g. from one image format to another.

There are a number of design choices in this area and support for binary data in the existing XML products is probably not very advanced. In the longer term, however, there are distinct benefits in handling encoding of binary data at a low level of the system. Certainly the current GEDCOM 5.5 makes life very difficult for applications by exposing details such as continuation records at the level of the information model, while the Future Directions proposal appears to remove the convenient option of embedding binary data inline.

HTML tags within GEDCOM text

The Future Directions document proposes that it should be possible to use a limited number of standard HTML tags within GEDCOM text, the chosen selection being , , , , <BLOCKQUOTE>, and <CENTER>. In principle this seems a good idea, but I don't think it goes far enough.

Firstly (and most importantly) some of the examples show cases where logical markup should be used rather than textual styling. The example puts , tags around the name "John Henry Stephens" appearing in a textual extract from a book. This is a case where logical markup (say a <PN> tag to mark this as a personal name) would be far better. The <PN> could be interpreted as implying italic rendition for display applications, but it would also act as a signal for more intelligent applications, e.g. applications to produce an index of names appearing within text.

So I think the priority is to define logical tags for use in text, the most obvious ones being tags for personal names, place names, and dates.
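
To show what a more intelligent application might do with such markup, here is a small sketch that builds an index of the personal names tagged with <PN> in a passage of text. The enclosing <TEXT> element and the wording of the passage are invented for the example.

```python
import xml.etree.ElementTree as ET

# Invented passage; only the <PN> tag comes from the proposal above.
PASSAGE = (
    "<TEXT>The farm passed to <PN>John Henry Stephens</PN> in 1842, "
    "and later to his daughter <PN>Mary Stephens</PN>.</TEXT>"
)


def names_in_text(xml_fragment):
    """Return a sorted list of the personal names marked up with <PN>."""
    root = ET.fromstring(xml_fragment)
    return sorted("".join(pn.itertext()).strip() for pn in root.iter("PN"))


def plain_text(xml_fragment):
    """Return the passage with all markup removed, for simple display."""
    return "".join(ET.fromstring(xml_fragment).itertext())


print(names_in_text(PASSAGE))
print(plain_text(PASSAGE))
```

A display application is free to render <PN> in italics, while an indexing application uses the same tag to find the names; neither needs to guess from the styling.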

By all means use a selection of HTML-like tags as well where it is only the appearance of the text we are concerned with. This can be easily achieved by borrowing a familiar subset of HTML tags in the GedML DTD. They can appear exactly like their HTML counterparts in the GedML document, with the caveat that all tags must be properly paired and nested, and unpaired tags (such as ) must be identified as such (the equivalent XML syntax is " ").
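
The pairing and nesting rule can be checked mechanically by any XML parser. In the sketch below, BR, I and B are used purely as illustrative HTML-like tags inside a hypothetical <TEXT> element; I am not asserting that they are the tags selected in Future Directions.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment):
    """Return True if the fragment parses as well-formed XML."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


# HTML-style unpaired tag: rejected, because <BR> is never closed.
print(is_well_formed("<TEXT>line one<BR>line two</TEXT>"))

# XML empty-element syntax: accepted.
print(is_well_formed("<TEXT>line one<BR/>line two</TEXT>"))

# Overlapping tags: rejected, because <B> is closed after <I>.
print(is_well_formed("<TEXT><I><B>name</I></B></TEXT>"))

# Properly nested tags: accepted.
print(is_well_formed("<TEXT><I><B>name</B></I></TEXT>"))
```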

Conversion and compatibility

I have demonstrated with my GedML prototype that conversion is not a problem. Conversion from GEDCOM 5.5 to GedML is done in about 500 lines of code, conversion in the reverse direction in slightly less. There is no loss of information in doing the conversion, because the data model is the same. The proposal to use XML is all about how the information is encoded; it does not affect the content of the information in any way.
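
The core of such a conversion is simple, because the level numbers in a GEDCOM file map directly onto element nesting. The following is a deliberately stripped-down sketch, not the GedML converter: it assumes well-formed input, ignores CONT/CONC continuation lines and character set handling, and uses a root element GED and attributes ID and REF that are my own choices for the example.

```python
import xml.etree.ElementTree as ET


def gedcom_to_xml(lines, root_tag="GED"):
    """Convert GEDCOM 5.5 lines ("level [@xref@] TAG [value]") to an element tree."""
    root = ET.Element(root_tag)
    stack = [root]  # stack[n] is the parent element for a level-n line
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        parts = line.split(" ", 2)
        level = int(parts[0])
        if parts[1].startswith("@"):
            # Record line such as "0 @I1@ INDI": the identifier precedes the tag.
            xref = parts[1].strip("@")
            rest = parts[2].split(" ", 1)
            tag = rest[0]
            value = rest[1] if len(rest) > 1 else ""
        else:
            xref = None
            tag = parts[1]
            value = parts[2] if len(parts) > 2 else ""
        del stack[level + 1:]  # close any elements deeper than this line
        elem = ET.SubElement(stack[level], tag)
        if xref:
            elem.set("ID", xref)
        if len(value) > 2 and value.startswith("@") and value.endswith("@"):
            elem.set("REF", value.strip("@"))  # pointer to another record
        elif value:
            elem.text = value
        stack.append(elem)
    return root


# Invented sample data for the example.
SAMPLE = """0 @I1@ INDI
1 NAME James /Smith/
1 BIRT
2 DATE 1865
1 FAMS @F1@
0 @F1@ FAM
1 HUSB @I1@"""

root = gedcom_to_xml(SAMPLE.splitlines())
print(ET.tostring(root, encoding="unicode"))
```

Most of the remaining code in a real converter deals with the details left out here: continuation lines, character sets, and input that deviates from the standard.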

The only problems I have had in converting existing GEDCOM datasets to GedML have been in handling proprietary extensions or deviations from the GEDCOM 5.5 standard: these are detected and result in validation failures when the GedML dataset is parsed. Thus the conversion exercise also provides the opportunity to clean up existing datasets and achieve a higher level of conformance to the standard.

Is there a downside?

Here I try to anticipate some possible objections, and answer them.

1	GEDCOM files encoded in XML will take more space. A: Yes, they will: about 40% more. This is mainly due to the use of end tags (Unicode hardly affects the issue). But the overhead disappears entirely when files are compressed. Disc space is getting cheaper all the time, and bandwidth is also getting cheaper, so the overhead is affordable.
2	Existing software will have to be rewritten. A: Not a serious problem. It will need changing anyway to cope with the new proposals. Coping with a new encoding is far less trouble than coping with data model changes. If necessary, a vendor can bolt-on a simple converter which converts the data back to the classic GEDCOM encoding before processing it. A great deal of the work in accepting GEDCOM data is in dealing with proprietary extensions, and this will become much less of a concern with an XML-based syntax.
3	XML is not stable or mature. A: XML itself is definitely stable, though some of the ancillary standards (such as the style language) are still under development. XML has been endorsed by all the major players in the industry, many of whom have already released software. There are no rival standards competing for the same territory. Therefore, there is every reason to expect that XML will take off rapidly within the next 12 months.
4	XML is more complicated than we need. A: XML is not a complex standard; the specification is 39 pages long and is freely available on the web. It does contain some features which at first sight are not needed for GEDCOM, but these do not intrude and do not make applications harder to write, because they are largely taken care of within the parser. On closer examination, some of the features can probably be used to add value: for example, conditional sections in the DTD could be used to define modules of the GEDCOM data model (for example a module defining details of LDS submissions) that are of interest to some applications and not others.

Conclusion

In conclusion, I believe strongly that the next version of the GEDCOM standard should use an XML-based encoding. It is the right standard at the right time.

If you want to comment on my proposal, or criticize it, please contact me.

If you want to add support to the proposal, please write to gedcom@gedcom.org

About the author

I am a professional software engineer working for the systems and services company ICL in the UK, and an amateur genealogist.

I encountered XML in my professional work and have written XML-based applications for clients, but this proposal has no connection with my employers.

I have written genealogical software for my own use in managing a one-name study (of the surname IRONSIDE) but I have no connection with any commercial genealogical software vendor.