GedML: Genealogical Data in XML

[This local archive copy mirrored from the canonical site: http://home.iclweb.com/icl2/mhkay/gedml.html; links may not have complete integrity, so use the canonical document at this URL if possible.]

These pages describe GedML, a way of encoding genealogical data sets in XML. It combines the well-established GEDCOM data model with the new XML standard for encoding complex information. The result is a representation that can easily be converted to and from GEDCOM, but can be manipulated much more easily using standard tools.

Updated 8 June 1998, with improved software: see changes

On other pages:

Rationale: Why do it?
Principles: Design considerations
Proposal: Proposed specification of GedML
Example: A sample GedML dataset
DTD: The Document Type Definition (for XML buffs only)
Software: Software currently available

Feedback

I want your feedback, both on the principles and on the current (very early) software.

Email: mhkay@iclweb.com

Changes

8 Jun 1998

Since I announced GedML in April 1998 there has been a fair trickle of feedback, which I have responded to privately. I've been working on the software and have some exciting developments in progress. At this stage I'm not releasing any pretty user interfaces, but the basic GEDCOM-to-GedML and GedML-to-GEDCOM converters are now supplemented by:

A class library that builds a set of Java objects representing the objects in the GEDCOM data model (Individuals, Families, Events, and so on). This is only useful to you if you want to do some Java programming.
A utility which builds the object model, as above, and then generates a set of comma-separated-value (CSV) files which can be used to load the data into a relational database or spreadsheet. As well as being an example of how the object model can be used, I hope this is a useful utility in its own right.
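
To give a feel for the second utility, here is a minimal sketch of writing individuals out as CSV. The `Individual` class, its fields, and the column names are assumptions made for this example; they are not the class library's actual API.

```java
import java.util.List;

final class CsvSketch {

    // Hypothetical stand-in for one object in the GEDCOM data model.
    static final class Individual {
        final String id;
        final String givenName;
        final String surname;
        final String birthDate;

        Individual(String id, String givenName, String surname, String birthDate) {
            this.id = id;
            this.givenName = givenName;
            this.surname = surname;
            this.birthDate = birthDate;
        }
    }

    // Quote a field if it contains a comma, a quote, or a line break,
    // doubling any embedded quotes. A null value becomes an empty field.
    static String csvField(String value) {
        if (value == null) {
            return "";
        }
        boolean needsQuotes = value.contains(",") || value.contains("\"")
                || value.contains("\n") || value.contains("\r");
        String escaped = value.replace("\"", "\"\"");
        return needsQuotes ? "\"" + escaped + "\"" : escaped;
    }

    // One header row, then one row per individual (column layout is assumed).
    static String toCsv(List<Individual> people) {
        StringBuilder out = new StringBuilder("id,given_name,surname,birth_date\n");
        for (Individual p : people) {
            out.append(csvField(p.id)).append(',')
               .append(csvField(p.givenName)).append(',')
               .append(csvField(p.surname)).append(',')
               .append(csvField(p.birthDate)).append('\n');
        }
        return out.toString();
    }
}
```

A file produced this way can be opened directly in a spreadsheet, or loaded into a relational table with one column per field.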

The converters themselves are much improved:

The GEDCOM parsing is now much more rigorous, with far better error handling when it fails.
I now detect a wide range of vendor extensions in the GEDCOM file, which previously resulted in a GedML file that failed validation against the GedML DTD. The vendor extensions no longer cause the conversion to fail; instead they result in explicit <VENDOR> tags, which are permitted by the DTD. On conversion back to GEDCOM, the original tags are restored.
There are improvements in the ANSEL to UNICODE conversion (and vice versa) though it still isn't perfect. It's OK provided you only use characters found in Latin-1 and Latin-2 (Western and Eastern European Repertoires, respectively). If you generate a GedML file containing UNICODE non-spacing accents, the conversion back to GEDCOM will give the wrong answer. (Thanks to John Cowan for his help on this.)
Multi-line text is now handled much better, including trying to do sensible line breaks and wordwrap in the GedML-to-GEDCOM direction. In fact the "from GedML" direction has had a lot of work to ensure that a valid GEDCOM file results no matter what strange things you do when editing the GedML file, though in some cases this is at the expense of losing data.
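
The multi-line handling in the GedML-to-GEDCOM direction can be pictured with the sketch below. It is not the converter's actual code: the class name, method names, and the `maxChunk` parameter are hypothetical, and the splitting rule is a simplification. Line breaks in the text become CONT lines, and over-long lines are split with CONC. The split is made inside a word rather than at a space, on the assumption that some GEDCOM readers strip leading or trailing spaces.

```java
import java.util.ArrayList;
import java.util.List;

final class GedcomTextWriter {

    // Emit GEDCOM lines for a text value: the first line carries the tag,
    // each line break becomes CONT, each forced split becomes CONC.
    // maxChunk is the longest value allowed on one line (must be >= 1).
    static List<String> toGedcomLines(int level, String tag, String text, int maxChunk) {
        List<String> lines = new ArrayList<>();
        String[] paragraphs = text.split("\n", -1);
        for (int p = 0; p < paragraphs.length; p++) {
            List<String> pieces = chunk(paragraphs[p], maxChunk);
            for (int c = 0; c < pieces.size(); c++) {
                String prefix;
                if (p == 0 && c == 0) {
                    prefix = level + " " + tag;
                } else if (c == 0) {
                    prefix = (level + 1) + " CONT";
                } else {
                    prefix = (level + 1) + " CONC";
                }
                String piece = pieces.get(c);
                lines.add(piece.isEmpty() ? prefix : prefix + " " + piece);
            }
        }
        return lines;
    }

    // Split s into pieces of at most maxChunk characters, moving each
    // break point left until neither side of the break is a space.
    static List<String> chunk(String s, int maxChunk) {
        List<String> pieces = new ArrayList<>();
        int start = 0;
        while (s.length() - start > maxChunk) {
            int end = start + maxChunk;
            while (end > start + 1
                    && (s.charAt(end - 1) == ' ' || s.charAt(end) == ' ')) {
                end--;
            }
            pieces.add(s.substring(start, end));
            start = end;
        }
        pieces.add(s.substring(start));
        return pieces;
    }
}
```

Concatenating the pieces returned by `chunk` always reproduces the original line, so no text is lost in the split.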

As far as the GedML spec is concerned, most of the changes were already anticipated:

I've defined what I hope is a reasonably clean way of handling vendor extensions with a <VENDOR> tag. It turns out every GEDCOM file I looked at had a vendor extension in it somewhere.
Line breaks and continuations are now done in a cleaner way. I use the empty tag <BR/> for a line break (as in HTML), equivalent to the GEDCOM CONT tag, and a similar <CC/> tag for the GEDCOM CONC (concatenation). I could have just thrown the CONC information away, but keeping it ensures that the original file can be regenerated. I have yet to discover how (or whether) CONC is used in practice.
I now use the tag <S> to mark the surname in a name.
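
The CONT and CONC mapping described above can be sketched as follows. This is an illustration under stated assumptions, not the converter itself: the class and method names are hypothetical, the input is assumed to be one well-formed GEDCOM line followed only by its CONT and CONC sub-lines, and the placement of <BR/> and <CC/> as mixed content inside the element is a guess at the layout.

```java
import java.util.List;

final class ContinuationSketch {

    // Escape the characters that are special in XML text content.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Convert one GEDCOM line plus its CONT/CONC sub-lines into a single
    // element: CONT becomes <BR/>, CONC becomes <CC/>.
    static String toGedml(List<String> gedcomLines) {
        StringBuilder out = new StringBuilder();
        String elementName = null;
        for (String line : gedcomLines) {
            String[] parts = line.split(" ", 3); // level, tag, optional value
            String tag = parts[1];
            String value = parts.length > 2 ? parts[2] : "";
            if (elementName == null) {
                elementName = tag;
                out.append('<').append(tag).append('>');
            } else if (tag.equals("CONT")) {
                out.append("<BR/>");
            } else if (tag.equals("CONC")) {
                out.append("<CC/>");
            } else {
                throw new IllegalArgumentException("Unexpected tag: " + tag);
            }
            out.append(escape(value));
        }
        if (elementName == null) {
            throw new IllegalArgumentException("No input lines");
        }
        return out.append("</").append(elementName).append('>').toString();
    }
}
```

Because <BR/> and <CC/> stay distinct in the output, the reverse conversion can tell which GEDCOM tag each one came from.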

25 May 1998

Since I first wrote these pages, a paper has been published on Future Directions for GEDCOM. This paper suggests a considerable advance in the GEDCOM information model, but it does very little to improve GEDCOM at the encoding level. I have produced a comment on the proposals.

In some ways it is a disappointment that the paper makes no progress on the encoding level at all, concentrating entirely on improving the GEDCOM data model itself. In fact the encoding proposals are very messy indeed, displaying a poor understanding of character coding standards and other such issues. But I prefer to see this as an opportunity: since no work has been done in this area, the field is wide open.

Michael H. Kay
15 June 1998