GedML: Genealogical Data in XML
[Snapshot of document at http://users.iclway.co.uk/mhkay/gedml/ 2002-12-27; please see the official/canonical URL and content if possible.]
These pages describe GedML, a way of encoding genealogical data
sets in XML. It combines the well-established GEDCOM data model
with the XML standard for encoding complex information.
The result is a representation that can easily be converted to
and from GEDCOM, but can be manipulated much more easily using
standard tools.
Updated 12 September 1999: This is a major overhaul, effectively
GedML Mark 2. The principal changes are:
- The rules for converting from the traditional GEDCOM encoding to the XML encoding
now take into account only the basic GEDCOM grammar, described in Chapter 1 of the
GEDCOM standard. This change does not affect the lineage-linked GEDCOM form, described in
Chapter 2 of the standard, which is by far the most commonly used application of the
GEDCOM grammar.
- The GedML software is completely rewritten. It is now designed as a pipeline. Each
step in the pipeline (known as a "filter") can perform its own processing independently
of the other steps in the pipeline. The first step is a parser which processes the
input file directly: there are two versions of this parser, one of which takes XML input,
the other takes traditional GEDCOM input. The other stages of the pipeline are
independent of whether the original input was XML or GEDCOM. This means that GEDCOM files,
in their traditional format, can be input to many applications such as XSL stylesheets
that expect to get their input from an XML parser.
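As a rough illustration of the pipeline idea (not the GedML software itself), the sketch below parses traditional GEDCOM lines into a stream of start/text/end events, passes them through one filter, and serialises the result as XML. The function names gedcom_events, drop_tag and to_xml, the event tuples, and the ID attribute are assumptions made for this example; the line pattern is a simplified reading of the basic GEDCOM grammar, and no root element is added.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# level, optional @xref@, tag, optional value (simplified line grammar)
LINE = re.compile(r"^\s*(\d+)\s+(?:@([^@]+)@\s+)?(\w+)(?: (.*))?$")

def gedcom_events(lines):
    """Parser step: turn GEDCOM lines into (kind, name, data) events."""
    open_tags = []                       # tags of the structures still open
    for line in lines:
        m = LINE.match(line.rstrip("\r\n"))
        if not m:
            continue                     # a real parser would report an error
        level, xref, tag, value = int(m.group(1)), m.group(2), m.group(3), m.group(4)
        while len(open_tags) > level:    # close deeper and sibling structures
            yield ("end", open_tags.pop(), None)
        open_tags.append(tag)
        yield ("start", tag, xref)
        if value:
            yield ("text", tag, value)
    while open_tags:                     # close whatever is left at end of input
        yield ("end", open_tags.pop(), None)

def drop_tag(events, unwanted):
    """Example filter: remove every subtree whose tag is `unwanted`."""
    depth = 0                            # > 0 while inside a dropped subtree
    for kind, name, data in events:
        if depth or (kind == "start" and name == unwanted):
            depth += (kind == "start") - (kind == "end")
            continue
        yield (kind, name, data)

def to_xml(events):
    """Final step: serialise the event stream as XML text."""
    out = []
    for kind, name, data in events:
        if kind == "start":
            out.append(f"<{name} ID={quoteattr(data)}>" if data else f"<{name}>")
        elif kind == "text":
            out.append(escape(data))
        else:
            out.append(f"</{name}>")
    return "".join(out)
```

Because drop_tag and to_xml see only events, they would work unchanged if gedcom_events were replaced by a step that reads the same events from an XML parser.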
I have withdrawn the software temporarily from the site because the build was wrong.
I hope to reissue it after the next version of SAXON is out, in the next week or so.
(8 Oct 1999)
Feedback
I want your feedback, both on the principles and on the current
software.
Email: mhkay@iclway.co.uk
12 May 1999
Software updated to work with SAXON 4.2
16 February 1999
I've updated the software to work with SAXON 4.0, and made a few updates to the
accompanying text.
Now is a good opportunity to take stock. There's been a fair amount of feedback, and XML
is now much better understood.
Many people have given encouragement for the ideas behind GedML, and there is widespread
acceptance in principle that XML is a good way forward for genealogy, but there have been
two frequently-expressed reservations:
- What is really wrong with GEDCOM is not the encoding, it's the data model. To achieve better
data interchange, we need a better data model; the encoding could be improved, but it's not the
real problem.
- GEDCOM is too well entrenched: no product vendor is interested in being the first to support
some alternative standard.
I accept both these points. Producing a better data model is a hard problem, which is why I
haven't attempted it. Sticking with the GEDCOM data model helps with the second point: the software
provided here can freely convert between the traditional GEDCOM and GedML encodings, so a product
needs to support only one of them natively.
The position I totally reject is that the idea of an interchange format is intrinsically flawed.
Bob Velke has been pushing an alternative approach called GenBridge, which is proprietary technology
to convert between a number of database formats used by popular American software packages. This can
produce good results for a finite number of products, but it is a closed approach: it doesn't allow
the shareware author to write new useful tools that can import their data from anywhere.
8 Jun 1998
Since I announced GedML in April 1998 there has been a fair trickle of feedback,
which I have responded to privately. I've been working on the software and have some
exciting developments in progress [which I still haven't got round to publishing
- 16 Feb 1999]. For the present, though, I'm just releasing an
improved spec and an improved version of the converters. Still nothing that does
anything very useful, but that will come.
As far as the spec is concerned, most of the changes were already anticipated:
- I've defined what I hope is a reasonably clean way of handling vendor
extensions with a <VENDOR> tag. It turns out every GEDCOM file I looked
at had a vendor extension in it somewhere.
- Line breaks and continuations are now done in a cleaner way. I use the
empty tag <BR/> for a line break (as in HTML), equivalent to the GEDCOM CONT tag,
and a similar <CC/> tag for the GEDCOM CONC (concatenation). I could have just
thrown this away, but keeping it ensures that the original file can be regenerated.
I have yet to discover how (or whether) CONC is used in practice.
- I now use the tag <S> to mark the surname in a name.
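The line-break and surname conventions above can be sketched in a few lines of Python. This is an illustration under assumptions, not the converter's actual code: the helper names text_to_gedml and name_to_gedml are hypothetical, and the choice to drop the GEDCOM slashes around the surname once it is wrapped in <S> is a guess at the output format.

```python
import re
from xml.sax.saxutils import escape

def text_to_gedml(first, continuations):
    """Join a value with its CONT/CONC sub-lines, keeping the break markers.

    `continuations` is a list of (tag, value) pairs, tag being "CONT" or "CONC".
    """
    parts = [escape(first)]
    for tag, value in continuations:
        # CONT is a real line break; CONC only records where the text was split
        parts.append("<BR/>" if tag == "CONT" else "<CC/>")
        parts.append(escape(value))
    return "".join(parts)

def name_to_gedml(name):
    """Mark the /surname/ part of a GEDCOM name value with <S>."""
    m = re.match(r"^([^/]*)/([^/]*)/(.*)$", name)
    if not m:
        return escape(name)              # no surname delimiters present
    before, surname, after = (escape(g) for g in m.groups())
    return f"{before}<S>{surname}</S>{after}"
```

Keeping <CC/> in the output is what lets the original split points be restored when the file is converted back to GEDCOM.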
25 May 1998
Since I first wrote these pages, a paper has been published on
Future Directions for GEDCOM.
This paper suggests a considerable advance in the GEDCOM information model, but
it does very little to improve GEDCOM at the encoding level. I have produced
a comment on the proposals.
In some ways it is a disappointment that the paper makes no progress on the encoding level at all,
concentrating entirely on improving the GEDCOM data model itself. In fact the encoding
proposals are very messy indeed, displaying a poor understanding of character coding
standards and other such issues. But I prefer to see this as an opportunity: since
no work has been done in this area, the field is wide open. My response to the Future
Directions paper is here.
The software
As well as converters to and from GedML, there is now a utility to generate CSV files
for loading data into a relational database or spreadsheet.
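To give a feel for why an XML encoding makes this kind of export easy, here is a minimal sketch that turns individuals into CSV rows using only standard Python tools. It is not the utility described above. The element and attribute names (GED, INDI, ID, NAME, BIRT, DATE) are assumptions modelled on GEDCOM tags rather than taken from the GedML DTD, and the function name individuals_to_csv is hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

def individuals_to_csv(gedml_text):
    """Write one CSV row per INDI element: id, name, birth date."""
    root = ET.fromstring(gedml_text)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id", "name", "birth_date"])
    for indi in root.iter("INDI"):
        name_el = indi.find("NAME")
        # itertext() flattens mixed content such as "John <S>Smith</S>"
        name = "".join(name_el.itertext()).strip() if name_el is not None else ""
        birth = indi.findtext("BIRT/DATE", default="")
        writer.writerow([indi.get("ID", ""), name, birth])
    return out.getvalue()
```

The resulting text can be saved to a file and loaded into a spreadsheet or a database table with the three columns shown.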
I've improved the converters considerably:
- The GEDCOM parsing is now much more rigorous, with far better error handling when
it fails.
- I now detect a wide range of vendor extensions in the GEDCOM file, which previously
resulted in a GedML file that failed validation against the GedML DTD. The vendor extensions
no longer cause the conversion to fail; instead they result in explicit <VENDOR> tags, which are
permitted by the DTD. On conversion back to GEDCOM, the original tags are restored.
- There are improvements in the ANSEL to UNICODE conversion (and vice versa) though it
still isn't perfect. It's OK provided you only use characters found
in Latin-1 and Latin-2 (Western and Eastern European Repertoires, respectively). If you
generate a GedML file containing UNICODE non-spacing accents, the conversion back to GEDCOM
will give the wrong answer. (Thanks to John Cowan for his help on this.)
- Multi-line text is now handled much better, including trying to do sensible
line breaks and wordwrap in the GedML-to-GEDCOM direction. In fact the "from GedML"
direction has had a lot of work to ensure that a valid GEDCOM file results no matter what strange
things you do when editing the GedML file, though in some cases this is at the expense
of losing data.
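Given the limitation with non-spacing accents noted above, one precaution a user might take before converting a hand-edited GedML file back to GEDCOM is to precompose accented characters and check whether any non-spacing marks remain. The sketch below uses Python's standard unicodedata module; it is not part of the GedML software, the function name precompose is hypothetical, and whether precomposed text then converts correctly still depends on the character being in the supported Latin-1 and Latin-2 repertoires.

```python
import unicodedata

def precompose(text):
    """Return NFC-normalised text plus any combining marks that remain."""
    # NFC merges base letter + non-spacing accent where a single code point exists
    composed = unicodedata.normalize("NFC", text)
    # combining() is non-zero for non-spacing marks that could not be merged
    leftovers = [ch for ch in composed if unicodedata.combining(ch)]
    return composed, leftovers
```

An empty leftovers list means every accent was folded into a precomposed character; a non-empty list points at text that is likely to be converted incorrectly.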
Michael H. Kay
16 February 1999