A text retrieval protocol for a distributed library

Overview

As classicists like the CHS workshop participants build collections of TEI-conformant texts, we need a service allowing us to retrieve pieces of text by a canonical reference. I'll call this service a TextServer.

The GetCapabilities request for a TextServer will identify:

what texts the server includes (discussed in "1. Defining a text")
what forms of canonical reference are syntactically valid for each text (discussed in "2. Defining a canonical reference")

A request would, at a minimum, identify a text and normally include a canonical reference to a passage; the server response would be an XML-encoded version of the corresponding passage. (Discussed in "3. Requests.")

This is a very simple idea, but many applications could be built from a text retrieval service. Examples:

A text pager or browser is an obvious example (but not a trivial one: to work effectively it needs information about valid ranges of canonical references in addition to information on the format of references; more on that below).
A citation expander could automatically turn cited references into footnotes with the relevant passage.
A difference engine could retrieve passages from two different editions of a text and highlight their differences. This is a good example because the editions could even come from two different servers each of which is unaware of the other's existence. (See more on "Defining a text" below)

1. Defining a text

The main job of the GetCapabilities description of a TextServer is to return an inventory of the texts it serves. I'll call this complex element a SrcInventory. The SrcInventory is organized hierarchically so that clients can request texts to retrieve in ways corresponding to how we think of "texts."

The root-level organization of the SrcInventory groups texts as "work" elements under an element currently (misleadingly) named "author." These elements are organizational conveniences, not historical judgments. It might be convenient for many purposes to group a work with a title value "Seventh Letter" under an author with name value "Plato" whatever your views on the historical authorship of the work. Author elements have an attribute that uniquely identifies them on the server; work elements have an attribute that uniquely identifies them within an "author" grouping. Author/work combinations therefore uniquely identify a given work on the server. Author elements are required to have at least one name element with a lang attribute, and may have several. Work elements are required to have at least one title element with a lang attribute, and may have several. Application developers thus should never need to expose the unique identifiers to end users, but may if they choose.

The element names at this stage are by way of example only and are open to suggested improvements: "author" and "work" would suggest literary texts, but a SrcInventory should handle epigraphic or papyrological texts equally gracefully. For inscriptions, the "work" element would correspond to an individual epigraphic text; under the "author" grouping, a TextServer might gather together a corpus from a single site, monument or other convenient grouping.

The author/work identifiers are the minimum requirements for a client request to identify a text: in response to a request that only asks for author/work without further qualification, the server may default to any edition or translation found under that author/work element.

The work element can contain up to two further levels of information about a text. At the first level, every text available on the server must contain at least one of the two elements "edition" or "translatededition". These are repeatable: a server may have any number of editions and/or translations of a given work. Optional request elements allow clients to specify particular translations or editions.

The edition and translatededition represent specific but replicated versions of a text. One level deeper in the SrcInventory hierarchy, the "edition" and "translatededition" elements may contain zero or more "exemplar" elements that identify a specific physical copy of a text. Giving clients the ability to identify specific exemplars allows (e.g.) for comparison of two damaged physical copies of the same edition. A direct digital transcription of a manuscript or inscription (as opposed to a digital version of a printed transcription) would not have distinct editions and exemplars: in such a case, the digital edition is simultaneously a direct representation of the exemplar.
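To make the hierarchy concrete, here is a minimal Python sketch that reads a made-up SrcInventory fragment and looks up a work by its author/work identifiers. The attribute name "id", the use of SrcInventory as the root element, and all sample identifiers are assumptions for illustration only; the draft DTD may name things differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical inventory fragment. The "id" attributes and the sample
# values are assumptions for illustration, not taken from the draft DTD.
SAMPLE = """
<SrcInventory>
  <author id="a1">
    <name lang="en">Plato</name>
    <work id="w7">
      <title lang="en">Seventh Letter</title>
      <edition id="e1"/>
      <translatededition id="t1" lang="en"/>
    </work>
  </author>
</SrcInventory>
"""

def find_work(inventory, author_id, work_id):
    """Return the work element for an author/work pair, or None."""
    for author in inventory.findall("author"):
        if author.get("id") == author_id:
            for work in author.findall("work"):
                if work.get("id") == work_id:
                    return work
    return None

def versions(work):
    """List (kind, id) for every edition or translation of a work."""
    return [(child.tag, child.get("id"))
            for child in work
            if child.tag in ("edition", "translatededition")]

inventory = ET.fromstring(SAMPLE)
work = find_work(inventory, "a1", "w7")
print(work.find("title").text)  # a display title an application could show
print(versions(work))
```

An application would show the name and title strings to end users and keep the identifiers behind the scenes.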

May 11: I have roughed out a DTD that defines the SrcInventory element as described here, and am annotating it now. I will add a link to the DTD and annotations here soon.

2. Defining a canonical reference

At a minimum, the GetCapabilities response must identify the syntax of valid requests for a given text. I assume that canonical references are hierarchical, and that right-most elements can always be dropped. A prose work organized as book/chapter/section could have legitimate references of the form 1.1.1 to refer to a specific section, 1.1 to refer to a whole chapter, 1 to refer to a whole book, or null to refer to the whole work.
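The rule that right-most elements can always be dropped is easy to sketch in Python. The function name is hypothetical, and representing the "null" reference to the whole work as an empty string is my assumption.

```python
def truncations(ref):
    """Return ref followed by every shorter reference obtained by dropping
    right-most components, ending with "" for the whole work."""
    parts = ref.split(".") if ref else []
    return [".".join(parts[:n]) for n in range(len(parts), -1, -1)]

print(truncations("1.1.1"))  # ['1.1.1', '1.1', '1', '']
```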

At present, the draft SrcInventory DTD I am using wraps this information in a single element; the text value of the element follows the TLG convention of naming the constituent parts separated by a slash, so that the element for a typical classical prose work might contain the string "book/chapter/section"; for the Iliad, the string would be "book/line"; for some poems, or many inscriptions, it might be simply "line" while other inscriptions might have the citation form "column/line". In other words, the maximum number of components allowed in citation of a given work is always 1 more than the number of slashes; labels for the components are separated by slashes.
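A client could interpret the slash-separated string along these lines. This is only a sketch: the function names are hypothetical, and it assumes references are written with dot-separated components and that an empty string stands for the whole work.

```python
def scheme_labels(scheme):
    """Split a citation scheme such as "book/chapter/section" into labels."""
    return scheme.split("/")

def max_depth(scheme):
    """Maximum number of reference components: one more than the slashes."""
    return scheme.count("/") + 1

def is_syntactically_valid(ref, scheme):
    """True if ref has no empty components and no more components than
    the scheme allows. The empty string refers to the whole work."""
    if ref == "":
        return True
    parts = ref.split(".")
    return len(parts) <= max_depth(scheme) and all(parts)

print(scheme_labels("book/chapter/section"))
print(max_depth("book/line"))
print(is_syntactically_valid("1.1.1", "book/line"))
```

Note that this checks syntax only. Whether book 25 of the Iliad exists is a separate question about valid ranges.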

Yes, this is a complete kludge, due only to the historical accident that I was interested in working immediately with data from the TLG canon. We should rework this into a coherent XML representation.

3. Requests

Requests allow clients to retrieve parts of TEI-conformant texts, or gather information about those texts. I would suggest that we follow the OGIS model of allowing either key-value parameters with the HTTP GET method or XML encoding with the HTTP POST method.

Required

Every TextServer must support a request named GetText.

GetText

The GetText request must include an "author" and a "work" parameter. In addition, it may include either a single "ref" parameter or a range element that contains a start and an end ref. The author and work parameters must give unique identifiers as listed in the SrcInventory element of the GetCapabilities response: anything else is an error.
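As a sketch of the key-value form of this request, the following Python builds a GET URL. Only "author", "work" and "ref" come from the description above; the "request" parameter, the "start" and "end" names for a range, the base URL and the sample identifiers are all assumptions for illustration.

```python
from urllib.parse import urlencode

def get_text_url(base_url, author, work, ref=None, start=None, end=None):
    """Build a key-value GetText request URL.

    Either a single ref or a start/end range may be given, never both.
    The "request", "start" and "end" parameter names are hypothetical.
    """
    if ref is not None and (start is not None or end is not None):
        raise ValueError("give either ref or a start/end range, not both")
    if (start is None) != (end is None):
        raise ValueError("a range needs both start and end")
    params = {"request": "GetText", "author": author, "work": work}
    if ref is not None:
        params["ref"] = ref
    elif start is not None:
        params["start"] = start
        params["end"] = end
    return base_url + "?" + urlencode(params)

print(get_text_url("http://example.org/textserver", "a1", "w1", ref="1.1"))
# http://example.org/textserver?request=GetText&author=a1&work=w1&ref=1.1
```

Leaving out both ref and the range asks for the whole work.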

Remember that the TextServer is not an end-user application: it is a service people will use to build applications. End users never need to see these unique identifiers, and one typical piece of functionality that applications will supply is a convenient way of translating names to these unique identifiers. Note that it is a trivial transformation to expose name and title strings from author and work elements to end users and keep track of the unique identifying attributes: the SrcInventory guarantees that you will have at least one name and title at hand to use.

The draft form I have been using further expects the text value of "ref" element to consist of dot-separated components. A request for the first line of the Iliad would include a ref element with the value "1.1" in this scheme.

Is this a poor choice for the server protocol? Should we instead force the application to break that down into something like

<ref> <refpart type="book">1</refpart> <refpart type="line">1</refpart> </ref>

where values of the type attribute "book" and "line" are derived from the SrcInventory?
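Whichever form the protocol settles on, converting between the two is mechanical. Here is a Python sketch that expands a dotted reference into the ref/refpart structure, taking its type labels from a slash-separated citation string such as "book/line". The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

def ref_to_xml(ref, scheme):
    """Expand a dotted reference into <ref>/<refpart> markup, labelling
    each component with the matching part of the citation scheme."""
    labels = scheme.split("/")
    values = ref.split(".") if ref else []
    if len(values) > len(labels):
        raise ValueError("reference has more components than the scheme")
    root = ET.Element("ref")
    for label, value in zip(labels, values):
        part = ET.SubElement(root, "refpart", type=label)
        part.text = value
    return ET.tostring(root, encoding="unicode")

print(ref_to_xml("1.1", "book/line"))
# <ref><refpart type="book">1</refpart><refpart type="line">1</refpart></ref>
```

Since the expansion needs nothing beyond the citation string, the dotted form loses no information. The structured form mainly saves the server a parsing step.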

The GetText reply includes the author/work elements of the SrcInventory for the text delivered, together with the corresponding elements of the source TEI document's body.

Optional

Implementation of the following requests would be optional: the server's GetCapabilities document would indicate whether or not each of them is implemented.

GetWorks

We should define a method named GetWorks that would return all the work data for a given author. This functionality is redundant, since the same information could be derived entirely from the GetCapabilities response, but it would be convenient for application developers, and would avoid the possibly significant overhead of transferring an entire SrcInventory from a large text server.

The GetWorks request would require one parameter identifying the author by unique ID.

The GetWorks reply should return the entire author structure of the SrcInventory.
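On the server side, GetWorks amounts to selecting one author subtree from the inventory. A minimal Python sketch follows. The "id" attribute name and the sample data are assumptions for illustration, not taken from the draft DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical inventory: note that work identifiers only need to be
# unique within one author grouping, so "w1" appears twice.
SAMPLE = """<SrcInventory>
  <author id="a1"><name lang="en">Homer</name>
    <work id="w1"><title lang="en">Iliad</title><edition id="e1"/></work>
    <work id="w2"><title lang="en">Odyssey</title><edition id="e1"/></work>
  </author>
  <author id="a2"><name lang="en">Plato</name>
    <work id="w1"><title lang="en">Seventh Letter</title><edition id="e1"/></work>
  </author>
</SrcInventory>"""

def get_works(inventory, author_id):
    """Return the entire author element for the given unique ID."""
    for author in inventory.findall("author"):
        if author.get("id") == author_id:
            return author
    raise KeyError(author_id)

inventory = ET.fromstring(SAMPLE)
reply = get_works(inventory, "a1")
print([w.findtext("title") for w in reply.findall("work")])
# ['Iliad', 'Odyssey']
```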

GetHeader

The optional GetHeader request requires an author and work parameter.

The GetHeader reply returns the author and work data from the SrcInventory, together with the teiHeader from the corresponding source document.

Should this request be required?

Leveraging existing inventories of literary texts

I want to add some notes here on how we could take advantage of existing inventories of texts such as the TLG Canon to coordinate the offerings of various text servers (so that your Apollonius poet of the Argonautica is not confused with my Apollonius author of the great mathematical work on Conics).

This server is hosted in the St. Isidore of Seville Research Lab at the College of the Holy Cross, Worcester MA. It is not an official server of the college: see www.holycross.edu.
