[Cached version from http://www.gentech.org/gdm/lex98081.htm; use the canonical/official source if possible.]

GENTECH -- Genealogical Data Model -- Article


Lexicon Data Model

The following article is from Eastman's Online Genealogy Newsletter and is copyright 1998 by Richard W. Eastman and Ancestry, Inc. It is re-published here with the permission of the author. The issue of the newsletter containing this article was originally published to subscribers on August 15, 1998, and is available at the Ancestry website at http://www.ancestry.com/columns/eastman/eastAug17-98.htm#concepts.

The first and last paragraphs are Dick Eastman's words. The remainder are from John Wylie. Both are GENTECH Board members and John is a member of the Lexicon Working Group.


 
Concepts from the GENTECH Lexicon Data Model

GENTECH is one of the leading organizations involved in setting standards for the use of technology for genealogical purposes. A major work in progress is its development of the GENTECH Lexicon Data Model. John Wylie has written the following report on the project:

Over the next few months, and specifically at the FGS conference later this month in Cincinnati, the GENTECH Lexicon Data Model will be publicly released. Beau Sharbrough mentioned this in his guest article here last month, but there is value in knowing more about the thing.

First, a brief history. Over three years ago, in response to a call by genealogical software developers, GENTECH asked its Technology Committee to address the need for an effective genealogical lexicon they could use in developing software. As the project grew, it became apparent that a data model should be developed first. The Lexicon Data Model effort began as a joint venture of GENTECH and the Federation of Genealogical Societies. Many other groups have since endorsed the effort: the New England Historic Genealogical Society, the National Genealogical Society, the Association of Professional Genealogists, and the Board for Certification of Genealogists. Eight knowledgeable volunteers (genealogists, developers, and a facilitator) were selected and immediately got busy developing this important tool. Six of those eight volunteers completed the project this month.

The Data Model isn't simple but it does have some basic elements that relate to all genealogists, whether they are involved with software development or not. It is a schema that uses 'entities' to describe the genealogical process, including how all data is defined, recorded and related to all other data. I'll address a few of these entities in this article. Keep in mind that the Lexicon Data Model is a logical model and not a physical model. It is not a set of instructions for software developers. It is a description of the genealogical process written in terms that developers can understand.

The heart of the Data Model is the ASSERTION. [Note: Throughout this article I will use all caps to indicate an entity title. Also, in data modeling jargon, the plural of an entity title is always that title followed by a lower case 's'. Apologies to English teachers are offered in advance.] The ASSERTION records the act of analyzing evidence and coming to a conclusion about that evidence. While ASSERTIONs initially address evidence, they can also address prior ASSERTIONs. We often talk in genealogy about the need to cite sources and use evidence. The ASSERTION is the basic tool for recording that analysis. Except when addressing previous ASSERTIONs, it will usually address the lowest level of SOURCE. This is sometimes called a snippet. It may be that part of a source that addresses a particular person (what we call a PERSONA). For example, the will of a fictitious John Smith, executed on 1 May 1850, may say "...and to my daughter Polly Adams, I give $100." From this I could assert that John had a daughter called Polly, that she was alive on 1 May 1850, and that she married an Adams. From another source, my general knowledge of genealogy, I could also assert that this Polly was the same person as his daughter Mary (Polly being a common nickname for Mary). Note that these ASSERTIONs will have different levels of surety, and when combined with other ASSERTIONs that address Polly/Mary SMITH/ADAMS, will document an ancestor.
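To make this concrete, here is a minimal sketch of how those conclusions about the will might be recorded as ASSERTIONs. The Python classes and field names below are my own illustrative assumptions, not the formal entity definitions in the Data Model.

    from dataclasses import dataclass

    # Illustrative sketch only: class and field names are assumptions,
    # not the Data Model's formal entity definitions.

    @dataclass
    class Source:
        description: str      # the snippet of evidence, e.g. a clause of a will

    @dataclass
    class Persona:
        name: str             # a person as evidenced in a source

    @dataclass
    class Assertion:
        source: Source        # the evidence being analyzed
        subject: Persona      # the PERSONA the conclusion is about
        conclusion: str       # the researcher's conclusion
        surety: int           # confidence in the conclusion, e.g. 1 (low) to 5 (high)

    will = Source("Will of John Smith, executed 1 May 1850: "
                  "'...and to my daughter Polly Adams, I give $100.'")
    polly = Persona("Polly Adams")

    # Three separate conclusions drawn from the same snippet:
    a1 = Assertion(will, polly, "Polly was a daughter of John Smith", surety=5)
    a2 = Assertion(will, polly, "Polly was alive on 1 May 1850", surety=5)
    a3 = Assertion(will, polly, "Polly had married a man named Adams", surety=4)

    # A further conclusion based on general knowledge, with a lower surety:
    nickname = Source("General knowledge: 'Polly' is a common nickname for 'Mary'")
    a4 = Assertion(nickname, polly,
                   "Polly Adams is the same person as John's daughter Mary", surety=3)

Note how each conclusion is recorded separately, each with its own surety, rather than being merged into a single statement about Polly.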

ASSERTION is linked to many other entities: SOURCE, EVENT, GROUP, PERSONA, PLACE, CHARACTERISTIC, SURETY, RESEARCHER and another ASSERTION. One can see that software that implements the Data Model would be very powerful. With all this detail recorded (or at least recordable) and linked at the lowest level, we can audit (or backtrack) all of the hundreds of decisions we make when we enter data into our software. We can also find the decisions of those from whom we import data. Using just the links I've listed above, there are nine relationships other data may have to an ASSERTION. With the important recursive power of ASSERTIONs (that is, ASSERTIONs about ASSERTIONs), this becomes the fundamental tool for documenting the genealogical process and strengthening genealogy software: the ultimate goal of the entire project, and the reason GENTECH took on this task in the first place.
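As a rough illustration of that recursive power (again using assumed names rather than the model's formal schema), an ASSERTION might rest either on a source snippet or on prior ASSERTIONs, so the chain of reasoning can be walked back step by step:

    from dataclasses import dataclass, field
    from typing import Optional

    # Sketch of the recursive case; structure and names are assumptions.

    @dataclass
    class Assertion:
        conclusion: str
        source: Optional[str] = None                  # evidence snippet, if any
        based_on: list["Assertion"] = field(default_factory=list)  # prior ASSERTIONs

        def audit_trail(self, depth: int = 0) -> None:
            """Print the chain of reasoning behind this conclusion."""
            cite = "  [source: " + self.source + "]" if self.source else ""
            print("  " * depth + "- " + self.conclusion + cite)
            for prior in self.based_on:
                prior.audit_trail(depth + 1)

    a1 = Assertion("John had a daughter called Polly",
                   source="Will of John Smith, 1 May 1850")
    a2 = Assertion("'Polly' is a common nickname for 'Mary'",
                   source="General onomastic knowledge")
    a3 = Assertion("Polly Adams and Mary Smith are the same person",
                   based_on=[a1, a2])

    a3.audit_trail()   # walks back from the conclusion to the underlying evidence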

Another aspect of the Data Model is how it handles sources. A source can exist as a hierarchy of SOURCE entities, each a subset of a more general SOURCE. At the lowest level (but not limited to that level), this would then be reported in a REPRESENTATION, that snippet from the source that leads to an ASSERTION. In software applications, our Model would expect the ability to store those REPRESENTATIONs. For example, a REPRESENTATION could be a scanned image of the original document, but it is not limited to this medium. Thus, recording sources in software must include the ability to manage REPRESENTATIONs and to link these to ASSERTIONs.
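A sketch of that hierarchy might look like the following. The example sources and file name are invented for illustration, and the class names are assumptions rather than the model's own definitions.

    from dataclasses import dataclass, field
    from typing import Optional

    # Illustrative sketch of a SOURCE hierarchy with attached REPRESENTATIONs.

    @dataclass
    class Representation:
        medium: str        # e.g. "scanned image", "transcription", "photocopy"
        location: str      # where the representation is stored

    @dataclass
    class Source:
        title: str
        parent: Optional["Source"] = None          # the more general SOURCE
        representations: list[Representation] = field(default_factory=list)

    # From the general collection down to the snippet an ASSERTION would cite:
    records   = Source("County probate records (fictitious example)")
    will_book = Source("Will Book 12", parent=records)
    will      = Source("Will of John Smith, 1 May 1850", parent=will_book)
    snippet   = Source("Bequest clause naming 'my daughter Polly Adams'", parent=will)

    # A scanned image of the page is stored as a REPRESENTATION of the snippet.
    snippet.representations.append(
        Representation("scanned image", "images/smith_will_page3.png"))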

Note that we do _not_ combine SOURCEs to collect evidence in support of a thesis. We combine ASSERTIONs (often ASSERTIONs made about sources). This thinking is essential to the model and to good genealogical methodology. Think of the software you use and see if you can fit this complex, but correct, thinking into the way it records, stores and retrieves conclusions.

One of the most important (and most difficult to manage) requirements of the Data Model was our insistence that all work can be audited. An audit (or attribution) trail must exist for every recorded action that results in a conclusion. The Data Model will insist that data be stored at its lowest workable level. For example, if I assert (in the ASSERTION entity) that John is the father of Stephen, that may be based on, say, eight pieces of evidence, each with its own ASSERTION (#1, #2, #3...#8). This collected ASSERTION is thus #9. Later I find evidence that two of the earlier ASSERTIONs (#4 and #7) are wrong (perhaps a disproved compiled family history). The software must allow me to "de-construct" the previously collected ASSERTION (#9) of the eight pieces of evidence and then reconstruct it based on what I subsequently learned. We call this working at the atomic level, and it is essential to understanding the model.
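A small sketch of what that "de-construction" might look like in practice. The data structures are my own assumptions, chosen only to show the atomic assertions being kept and the collected conclusion being rebuilt from whatever still stands.

    # Sketch of working at the atomic level; structures are illustrative assumptions.

    # Eight evidence-level assertions, numbered #1..#8, each kept individually.
    atomic = {n: f"ASSERTION #{n} (one piece of evidence)" for n in range(1, 9)}
    disproved = set()    # nothing has been rejected yet

    def rebuild_collected_assertion(atomic, disproved):
        """Reconstruct the collected conclusion (#9) from the assertions that still stand."""
        surviving = [a for n, a in sorted(atomic.items()) if n not in disproved]
        return {"conclusion": "John is the father of Stephen", "based_on": surviving}

    nine = rebuild_collected_assertion(atomic, disproved)   # rests on #1..#8

    # Later, a disproved compiled family history shows #4 and #7 were wrong.
    # They are flagged rather than deleted, and #9 is rebuilt from the other six.
    disproved.update({4, 7})
    nine = rebuild_collected_assertion(atomic, disproved)   # now rests on six assertions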

It is equally important that genealogists have the ability to retain ASSERTIONs that have subsequently been proven false or have changed. We don't want software to destroy data, even if it is proven (or suspected of being) false. We want to identify the error to ensure that we, or others, don't repeat it.

If this has whetted your appetite for more, stay tuned. The formal (currently over 90 pages) Request For Comments (RFC) document will be released at the GENTECH Luncheon in Cincinnati on August 21st, by Robert Charles Anderson, Chairman of the project. We plan to have it available in digital form for downloading at the GENTECH website shortly thereafter. We want you to look it over and to tell us what you think.

I, personally, expect that the Data Model will improve genealogy as a discipline. We can only get better as we learn more about what we do and how we do it.

John Wylie

This will be a major topic of discussion this week in Cincinnati. However, there will be even more discussion later, after everyone has read and digested the report. I expect that the GENTECH conference next January 22 and 23 in Salt Lake City will be very interesting. For details on both the Lexicon Data Model and the 1999 GENTECH conference, keep an eye on http://www.gentech.org/.