GedML: Genealogical Data in XML

[Snapshot of document at http://users.iclway.co.uk/mhkay/gedml/ 2002-12-27; please see the official/canonical URL and content if possible.]

These pages describe GedML, a way of encoding genealogical data sets in XML. It combines the well-established GEDCOM data model with the XML standard for encoding complex information. The result is a representation that can easily be converted to and from GEDCOM, but can be manipulated much more easily using standard tools.

Updated 12 September 1999: This is a major overhaul, effectively GedML Mark 2. The principal changes are:

The rules for converting from the traditional GEDCOM encoding to the XML encoding now take into account only the basic GEDCOM grammar, described in Chapter 1 of the GEDCOM standard; it does not affect the lineage-linked GEDCOM file, described in Chapter 2 of the standard, which is by far the most commonly used application of the GEDCOM grammar.
The GedML software is completely rewritten. It is now designed as a pipeline. Each step in the pipeline (known as a "filter") can perform its own processing independently of the other steps in the pipeline. The first step is a parser which processes the input file directly: there are two versions of this parser, one of which takes XML input, while the other takes traditional GEDCOM input. The other stages of the pipeline are independent of whether the original input was XML or GEDCOM. This means that GEDCOM files, in their traditional format, can be input to many applications, such as XSL stylesheets, that expect to get their input from an XML parser.
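The pipeline idea can be illustrated with a minimal Python sketch. This is not the GedML software itself: it uses Python's standard xml.sax classes purely for illustration, and the names parse_gedcom, DropElements and gedcom_to_xml, the wrapper element GED, and the ID attribute for cross-reference identifiers are all assumptions. The point it tries to show is that a GEDCOM reader which fires the same events as an XML parser can feed any later step unchanged.

```python
import io
import re
from xml.sax.saxutils import XMLFilterBase, XMLGenerator
from xml.sax.xmlreader import AttributesImpl

# Basic GEDCOM line grammar: level, optional @xref@, tag, optional value.
LINE = re.compile(r"^\s*(\d+)\s+(?:@([^@]+)@\s+)?(\w+)(?: (.*))?$")


def parse_gedcom(lines, handler):
    """Read GEDCOM lines and fire SAX events, as an XML parser would."""
    open_tags = []  # tags of the elements currently open, one per level
    handler.startDocument()
    handler.startElement("GED", AttributesImpl({}))  # hypothetical root
    for line in lines:
        match = LINE.match(line.rstrip("\r\n"))
        if not match:
            continue  # a real parser would report the error
        level, xref, tag, value = match.groups()
        # Close every element at this level or deeper.
        while len(open_tags) > int(level):
            handler.endElement(open_tags.pop())
        attrs = {"ID": xref} if xref else {}
        handler.startElement(tag, AttributesImpl(attrs))
        open_tags.append(tag)
        if value:
            handler.characters(value)
    while open_tags:
        handler.endElement(open_tags.pop())
    handler.endElement("GED")
    handler.endDocument()


class DropElements(XMLFilterBase):
    """One pipeline step: pass events on, omitting chosen elements."""

    def __init__(self, names):
        super().__init__()
        self.names = set(names)
        self.skip_depth = 0  # greater than 0 while inside a dropped element

    def startElement(self, name, attrs):
        if self.skip_depth or name in self.names:
            self.skip_depth += 1
        else:
            super().startElement(name, attrs)

    def endElement(self, name):
        if self.skip_depth:
            self.skip_depth -= 1
        else:
            super().endElement(name)

    def characters(self, content):
        if not self.skip_depth:
            super().characters(content)


def gedcom_to_xml(text, drop=()):
    """Chain the steps: GEDCOM parser, then filter, then XML serializer."""
    out = io.StringIO()
    step = DropElements(drop)
    step.setContentHandler(XMLGenerator(out, encoding="utf-8"))
    parse_gedcom(text.splitlines(), step)
    return out.getvalue()


SAMPLE = """0 @I1@ INDI
1 NAME John /Smith/
1 NOTE private remark
0 TRLR"""

print(gedcom_to_xml(SAMPLE, drop=["NOTE"]))
```

The filter never learns whether its events came from a GEDCOM file or an XML file, which is the independence described above. Swapping parse_gedcom for a standard XML parser would leave DropElements and the serializer untouched.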

I have withdrawn the software temporarily from the site because the build was wrong. I hope to reissue it after the next version of SAXON is out, in the next week or so. (8 Oct 1999)

Rationale: Why do it?
Principles: Design considerations
Proposal: Proposed specification of GedML
Example: A sample GedML dataset
DTD: The Document Type Definition (for XML buffs only)
Software: Software currently available

Feedback

I want your feedback, both on the principles and on the current software.

Email: mhkay@iclway.co.uk

Changes

12 May 1999

Software updated to work with SAXON 4.2

16 February 1999

I've updated the software to work with SAXON 4.0, and made a few updates to the accompanying text.

Now is a good opportunity to take stock. There's been a fair amount of feedback, and XML is now much better understood.

Many people have given encouragement for the ideas behind GedML, and there is widespread acceptance in principle that XML is a good way forward for genealogy, but there have been two frequently-expressed reservations:

What is really wrong with GEDCOM is not the encoding, it's the data model. To achieve better data interchange, we need a better data model; the encoding could be improved, but it's not the real problem.
GEDCOM is too well entrenched: no product vendor is interested in being the first to support some alternative standard.

I accept both these points. Producing a better data model is a hard problem, which is why I haven't attempted it. Sticking with the GEDCOM data model helps with the second point: the software provided here can freely convert between the traditional GEDCOM and GedML encodings, so a product only needs to support one of them natively.

The position I totally reject is that the idea of an interchange format is intrinsically flawed. Bob Velke has been pushing an alternative approach called GenBridge, which is proprietary technology to convert between a number of database formats used by popular American software packages. This can produce good results for a finite number of products, but it is a closed approach: it doesn't allow the shareware author to write new useful tools that can import their data from anywhere.

8 Jun 1998

Since I announced GedML in April 1998 there has been a fair trickle of feedback, which I have responded to privately. I've been working on the software and have some exciting developments in progress [which I still haven't got round to publishing - 16 Feb 1999]. For the present, though, I'm just releasing an improved spec and an improved version of the converters. Still nothing that does anything very useful, but that will come.

As far as the spec is concerned, most of the changes were already anticipated:

I've defined what I hope is a reasonably clean way of handling vendor extensions with a <VENDOR> tag. It turns out every GEDCOM file I looked at had a vendor extension in it somewhere.
Line breaks and continuations are now done in a cleaner way. I use the empty tag <BR/> for a line break (as in HTML), equivalent to the GEDCOM CONT tag, and a similar <CC/> tag for the GEDCOM CONC (concatenation). I could have just thrown this away, but keeping it ensures that the original file can be regenerated. I have yet to discover how (or whether) CONC is used in practice.
I now use the tag <S> to mark the surname in a name.
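As a rough illustration of the last two points, the sketch below joins a GEDCOM value with its CONT and CONC lines using <BR/> and <CC/> markers, regenerates the original lines from the result, and rewrites the slash-delimited surname of a GEDCOM name with an <S> tag. The function names are hypothetical, and the code works on plain strings with no XML escaping, so it shows the mapping rather than a real converter.

```python
import re

MARKERS = {"CONT": "<BR/>", "CONC": "<CC/>"}


def to_gedml_text(first_value, continuations):
    """Join a value with its CONT/CONC lines, marking each join point."""
    text = first_value
    for tag, value in continuations:
        text += MARKERS[tag] + value
    return text


def to_gedcom_lines(level, tag, text):
    """Regenerate the original GEDCOM lines from the marked-up text."""
    parts = re.split(r"(<BR/>|<CC/>)", text)
    lines = [f"{level} {tag} {parts[0]}"]
    # After the first piece, parts alternate: marker, value, marker, value.
    for marker, value in zip(parts[1::2], parts[2::2]):
        sub_tag = "CONT" if marker == "<BR/>" else "CONC"
        lines.append(f"{level + 1} {sub_tag} {value}")
    return lines


def mark_surname(name):
    """Replace GEDCOM's /Surname/ slashes with an <S> element."""
    return re.sub(r"/([^/]*)/", r"<S>\1</S>", name)


note = to_gedml_text(
    "Born in a small",
    [("CONT", "village near Leeds."), ("CONC", "Moved south.")],
)
print(note)
print(to_gedcom_lines(1, "NOTE", note))
print(mark_surname("John /Smith/"))
```

Because each marker records which of the two GEDCOM tags produced the join, the round trip gives back the same lines that went in, which is the reason given above for keeping <CC/> rather than discarding it.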

25 May 1998

Since I first wrote these pages, a paper has been published on Future Directions for GEDCOM. This paper suggests a considerable advance in the GEDCOM information model, but it does very little to improve GEDCOM at the encoding level. I have produced a comment on the proposals.

In some ways it is a disappointment that the paper makes no progress on the encoding level at all, concentrating entirely on improving the GEDCOM data model itself. In fact the encoding proposals are very messy indeed, displaying a poor understanding of character coding standards and other such issues. But I prefer to see this as an opportunity: since no work has been done in this area, the field is wide open. My response to the Future Directions paper is here.

The software

As well as converters to and from GedML, there is now a utility to generate CSV files for loading data into a relational database or spreadsheet.

I've improved the converters considerably:

The GEDCOM parsing is now much more rigorous, with far better error handling when it fails.
I now detect a wide range of vendor extensions in the GEDCOM file, which previously resulted in a GedML file that failed validation against the GedML DTD. The vendor extensions don't cause the conversion to fail, they result in explicit <VENDOR> tags which are permitted by the DTD; on conversion back to GEDCOM, the original tags are restored.
There are improvements in the ANSEL to UNICODE conversion (and vice versa) though it still isn't perfect. It's OK provided you only use characters found in Latin-1 and Latin-2 (Western and Eastern European Repertoires, respectively). If you generate a GedML file containing UNICODE non-spacing accents, the conversion back to GEDCOM will give the wrong answer. (Thanks to John Cowan for his help on this.)
Multi-line text is now handled much better, including trying to do sensible line breaks and wordwrap in the GedML-to-GEDCOM direction. In fact the "from GedML" direction has had a lot of work to ensure that a valid GEDCOM file results no matter what strange things you do when editing the GedML file, though in some cases this is at the expense of losing data.
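The difficulty with non-spacing accents comes from ordering: ANSEL writes a diacritic before the letter it modifies, while Unicode writes a combining mark after it. The sketch below shows one way the reordering could be handled for a handful of accents. It is not the converter's actual code; the function names are hypothetical, the table covers only five diacritics, its code values should be checked against the ANSEL standard before use, and keeping several marks on one letter in their original relative order is an assumption.

```python
import unicodedata

# A small subset of ANSEL non-spacing diacritics and the matching
# Unicode combining marks (assumed values; verify against the standard).
ANSEL_TO_COMBINING = {
    0xE1: "\u0300",  # grave
    0xE2: "\u0301",  # acute
    0xE3: "\u0302",  # circumflex
    0xE4: "\u0303",  # tilde
    0xE8: "\u0308",  # diaeresis (umlaut)
}
COMBINING_TO_ANSEL = {mark: code for code, mark in ANSEL_TO_COMBINING.items()}


def ansel_to_unicode(data: bytes) -> str:
    """Move each diacritic after its base letter, then precompose."""
    out = []
    pending = ""  # diacritics seen but not yet attached to a letter
    for byte in data:
        if byte in ANSEL_TO_COMBINING:
            pending += ANSEL_TO_COMBINING[byte]
        elif byte < 0x80:
            out.append(chr(byte) + pending)
            pending = ""
        else:
            raise ValueError(f"byte 0x{byte:02X} is outside this sketch's table")
    out.append(pending)  # keep any trailing marks rather than lose them
    return unicodedata.normalize("NFC", "".join(out))


def unicode_to_ansel(text: str) -> bytes:
    """Decompose, then move each combining mark before its base letter."""
    out = bytearray()
    insert_at = 0  # position of the most recent base letter
    for char in unicodedata.normalize("NFD", text):
        if char in COMBINING_TO_ANSEL:
            out.insert(insert_at, COMBINING_TO_ANSEL[char])
            insert_at += 1  # later marks on the same letter follow this one
        elif ord(char) < 0x80:
            insert_at = len(out)
            out.append(ord(char))
        else:
            raise ValueError(f"character {char!r} is outside this sketch's table")
    return bytes(out)


print(ansel_to_unicode(b"Ren\xe2ee"))
print(unicode_to_ansel("Rene\u0301e"))
```

Decomposing with NFD before encoding means a precomposed letter and the same letter written with a separate combining accent produce identical ANSEL bytes.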

Michael H. Kay
16 February 1999
