GedML: Genealogical Data in XML
[This local archive copy mirrored from the canonical site: http://home.iclweb.com/icl2/mhkay/gedml.html; links may not have complete integrity, so use the canonical document at this URL if possible. See other references]
These pages describe GedML, a way of encoding genealogical data
sets in XML. It combines the well-established GEDCOM data model
with the new XML standard for encoding complex information.
The result is a representation that can easily be converted to
and from GEDCOM, but can be manipulated much more easily using
standard tools.
Updated 8 June 1998, with improved software:
see changes
Feedback
I want your feedback, both on the principles and on the current
(very early) software.
Email: mhkay@iclweb.com
8 Jun 1998
Since I announced GedML in April 1998 there has been a fair trickle of feedback,
which I have responded to privately. I've been working on the software and have some
exciting developments in progress. At this stage I'm not releasing any pretty
user interfaces, but the basic GEDCOM-to-GedML and GedML-to-GEDCOM converters
are now supplemented by:
- A class library that builds a set of Java objects representing the objects
in the GEDCOM data model (Individuals, Families, Events, and so on). This is only
useful to you if you want to do some Java programming.
- A utility which builds the object model, as above, and then generates a set
of comma-separated-value (CSV) files which can be used to load the data into
a relational database or spreadsheet.
As well as being an example of how the object model can
be used, this is, I hope, a useful utility in its own right.
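To make the CSV step concrete, here is a minimal sketch of how a list of individuals might be flattened into CSV text ready for a spreadsheet or a database loader. It is an illustration only, not the code in the GedML distribution: the Individual class, its three fields, and the column names are all hypothetical stand-ins for whatever the real object model exposes.

```java
import java.util.List;

final class CsvExportSketch {

    /** Hypothetical stand-in for one individual in the object model. */
    static final class Individual {
        final String id;
        final String name;
        final String birthDate;

        Individual(String id, String name, String birthDate) {
            this.id = id;
            this.name = name;
            this.birthDate = birthDate;
        }
    }

    /** Quotes a field only when it contains a comma, a quote, or a line break. */
    static String quote(String field) {
        if (field == null) {
            return ""; // a missing value becomes an empty column
        }
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        String escaped = field.replace("\"", "\"\""); // double any embedded quotes
        return needsQuotes ? "\"" + escaped + "\"" : escaped;
    }

    /** Builds one header row plus one row per individual (column names are assumed). */
    static String toCsv(List<Individual> people) {
        StringBuilder out = new StringBuilder("id,name,birth_date\n");
        for (Individual p : people) {
            out.append(quote(p.id)).append(',')
               .append(quote(p.name)).append(',')
               .append(quote(p.birthDate)).append('\n');
        }
        return out.toString();
    }
}
```

Calling CsvExportSketch.toCsv with a list of Individual objects returns the text of one CSV file. A full export would presumably write one such file per record type (individuals, families, events), with the cross-reference identifiers serving as keys between them.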
The converters themselves are much improved:
- The GEDCOM parsing is now much more rigorous, with far better error handling when
it fails.
- I now detect a wide range of vendor extensions in the GEDCOM file, which previously
resulted in a GedML file that failed validation against the GedML DTD. The vendor extensions
don't cause the conversion to fail; instead they result in explicit <VENDOR> tags, which are
permitted by the DTD. On conversion back to GEDCOM, the original tags are restored.
- There are improvements in the ANSEL-to-UNICODE conversion (and vice versa), though it
still isn't perfect. It's OK provided you only use characters found
in Latin-1 and Latin-2 (the Western and Eastern European repertoires, respectively). If you
generate a GedML file containing UNICODE non-spacing accents, the conversion back to GEDCOM
will give the wrong answer. (Thanks to John Cowan for his help on this.)
- Multi-line text is now handled much better, including trying to do sensible
line breaks and wordwrap in the GedML-to-GEDCOM direction. In fact the "from GedML"
direction has had a lot of work to ensure that a valid GEDCOM file results no matter what strange
things you do when editing the GedML file, though in some cases this is at the expense
of losing data.
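The difficulty with non-spacing accents comes from an ordering difference: ANSEL places a diacritic byte before the letter it modifies, whereas UNICODE places a combining mark after its base character. The sketch below illustrates the ANSEL-to-UNICODE direction for a handful of accents only. It is not the converter described above: the class name is hypothetical, the table covers just five of the ANSEL diacritics, and every other non-ASCII byte is replaced with U+FFFD rather than translated.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

final class AnselSketch {

    /** Partial table: ANSEL diacritic byte -> UNICODE combining mark. */
    private static final Map<Integer, Character> COMBINING = new HashMap<>();
    static {
        COMBINING.put(0xE1, (char) 0x0300); // grave
        COMBINING.put(0xE2, (char) 0x0301); // acute
        COMBINING.put(0xE3, (char) 0x0302); // circumflex
        COMBINING.put(0xE4, (char) 0x0303); // tilde
        COMBINING.put(0xE8, (char) 0x0308); // diaeresis (umlaut)
    }

    static String decode(byte[] ansel) {
        StringBuilder out = new StringBuilder();
        StringBuilder pending = new StringBuilder(); // accents waiting for their base letter
        for (byte raw : ansel) {
            int b = raw & 0xFF;
            Character mark = COMBINING.get(b);
            if (mark != null) {
                pending.append(mark);              // hold the accent back
            } else {
                // ASCII passes through; other bytes are outside this sketch's table
                out.append(b < 0x80 ? (char) b : (char) 0xFFFD);
                out.append(pending);               // accents now follow the base
                pending.setLength(0);
            }
        }
        out.append(pending); // accents with no base letter are kept at the end
        // Compose base + mark into a single precomposed character where one exists
        return Normalizer.normalize(out, Normalizer.Form.NFC);
    }
}
```

The final normalization step is what turns "e" followed by a combining acute into the single Latin-1 character. The reverse direction would need the opposite steps: decompose each character, then move the marks in front of the base letter before mapping them to ANSEL bytes.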
As far as the GedML spec is concerned, most of the changes were already anticipated:
- I've defined what I hope is a reasonably clean way of handling vendor
extensions with a <VENDOR> tag. It turns out every GEDCOM file I looked
at had a vendor extension in it somewhere.
- Line breaks and continuations are now done in a cleaner way. I use the
empty tag <BR/> for a line break (as in HTML), equivalent to the GEDCOM CONT tag,
and a similar <CC/> tag for the GEDCOM CONC (concatenation) tag. I could have just
discarded CONC, but keeping it ensures that the original file can be regenerated.
I have yet to discover how (or whether) CONC is used in practice.
- I now use the tag <S> to mark the surname in a name.
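The two text conventions above can be sketched in a few lines of Java. This is an illustration under stated assumptions, not the converter itself: it works on the serialized text of a single element rather than on a parsed XML tree, it does not handle escaped characters or wordwrap, and the class and method names are hypothetical. The first method turns <BR/> and <CC/> markers back into GEDCOM CONT and CONC lines one level below the parent tag. The second marks a surname, relying on the GEDCOM convention of delimiting the surname with slashes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GedmlTextSketch {

    private static final Pattern MARKER = Pattern.compile("<(BR|CC)/>");

    /** Splits element text at <BR/> and <CC/> and emits GEDCOM lines. */
    static List<String> toGedcomLines(int level, String tag, String content) {
        List<String> lines = new ArrayList<>();
        Matcher m = MARKER.matcher(content);
        String prefix = level + " " + tag; // the first line carries the parent tag
        int start = 0;
        while (m.find()) {
            lines.add(join(prefix, content.substring(start, m.start())));
            // <BR/> maps to CONT (new line), <CC/> to CONC (concatenation)
            String next = m.group(1).equals("BR") ? "CONT" : "CONC";
            prefix = (level + 1) + " " + next;
            start = m.end();
        }
        lines.add(join(prefix, content.substring(start)));
        return lines;
    }

    private static String join(String prefix, String text) {
        return text.isEmpty() ? prefix : prefix + " " + text;
    }

    /** Rewrites a GEDCOM name such as "John /Smith/" with an <S> element. */
    static String markSurname(String gedcomName) {
        return gedcomName.replaceAll("/([^/]*)/", "<S>$1</S>");
    }
}
```

For example, toGedcomLines(1, "NOTE", "Born in York.<BR/>Moved to Lon<CC/>don.") yields the three lines "1 NOTE Born in York.", "2 CONT Moved to Lon" and "2 CONC don.", and markSurname("John /Smith/") yields "John <S>Smith</S>".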
25 May 1998
Since I first wrote these pages, a paper has been published on
Future Directions for GEDCOM.
This paper suggests a considerable advance in the GEDCOM information model, but
it does very little to improve GEDCOM at the encoding level. I have produced
a comment on the proposals.
In some
ways it is a disappointment that the paper makes no progress on the encoding level at all,
concentrating entirely on improving the GEDCOM data model itself. In fact the encoding
proposals are very messy indeed, displaying a poor understanding of character coding
standards and other such issues. But I prefer to see this as an opportunity: since
no work has been done in this area, the field is wide open.
Michael H. Kay
15 June 1998