[This local archive copy is from the official and canonical URL, http://www.sgmltech.com/papers/jlp1198.htm; please refer to the canonical source document if possible.]
Jorge Leal Portela
Many organizations, such as museums, police forces, and insurance companies, are faced with the problem of identification of stolen and recovered objects of art and have difficulties in sharing relevant information. GRASP (Global Retrieval, Access and information System for Property items) is a project which addresses the problem of sharing information by demonstrating how descriptions of objects can be captured, stored in a heterogeneous database, and widely distributed across a network environment.
This paper addresses the issue of how SGML (Standard Generalized Markup Language) was successfully used for numerous aspects of the project, ranging from data storage and specification of the exchange structure, to distributed database synchronization control, in combination with a programmable processing tool.
Jorge Leal Portela is a project manager at ACSE sa/nv, a member of the SGML Technologies Group. He is a software engineer and systems architect specializing in object-oriented distributed applications and document management applications. All these systems use SGML either as a document storage and exchange medium or as a formal message specification tool for communications between distributed application processes. A graduate of engineering specializing in nuclear physics, he studied at the Free University of Brussels and may be contacted at jlp@sgmltech.com.
Many organizations, such as museums, police forces, and insurance companies, are faced with the problem of identification of stolen and recovered objects of art and have difficulties in sharing relevant information. GRASP is a project which addresses the problem of sharing information by demonstrating how information about objects can be captured, stored in a heterogeneous database, and widely distributed across a network environment.
GRASP AD1008 is a Telematics for Administrations project funded by the European Union. It started in January 1996 and is currently (summer 1998) in the final validation phase. The partners include prestigious organizations in the fields of crime, cultural heritage, and museums, such as the Metropolitan Police Service (Scotland Yard) in London, the Spanish Ministry of Culture, and the Dutch Central Information Research Department CRI (Centrale Recherche Informatiedienst).
Resulting from the requirements analysis phase (see [GRASPD51] for full functional specifications), the GRASP partners decided to focus on several functionalities. In particular, the system should:
The originality of the approach of the GRASP partners was to combine relevant existing technologies in order to meet the project requirements.
Some of the technologies include:
This paper will focus on SGML. The following sections describe how SGML, in combination with the other technologies, helped the project requirements to be met.
The GRASP network is organized in terms of nodes interconnected through a wide area network (WAN). Typically, each organization will have one GRASP node. This node may, in turn, serve one or multiple local workstations equipped with the GRASP front-end user interface.
The system consists of various modules:
An object-oriented approach is used for the GRASP design; there are a number of nodes composed of software modules that are distributed across a heterogeneous network of workstations and high-end servers. All interoperations among these components are handled by a CORBA-compliant Object Request Broker.
Actual content of the messages being exchanged between these components is specified and manipulated in SGML. As will be seen in a later section, SGML provides an elegant way of manipulating object descriptions.
This component-based architecture enables users to reconfigure the system by replacing individual components according to their needs.
Each art object is described by a set of fifty-five descriptors, each of which describes one particular distinguishing feature of the object. Some descriptors are factual (colour, size, shape, signature text and position, and so on), while others are more subject to interpretation (style, region, period in time when the object was manufactured, subject, and so on). Because GRASP has to provide access to a heterogeneous database, a common vocabulary is required. This standardized vocabulary is the Ontology.
The GRASP ontology is based on the AAT (Art & Architecture Thesaurus) developed at the Getty Research Institute. It contains 28,400 main terms and categories and about 100,000 terms including alternative spellings, synonyms, etc. The AAT is structured according to 33 hierarchies for different classes of concepts (material, processes and techniques, styles and periods, for example). Each hierarchy contains a root concept and a tree of sub-concepts.
To support multilingual use, the descriptor values are transformed into integer values before storage in the Adaptor and transformed back into actual words on any other node, depending on the local language settings. Currently supported languages include English, Dutch, and Spanish (partially).
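The round trip from words to integers and back can be pictured with a small sketch. It is only an illustration under stated assumptions: the table `LABELS`, the ids in it, and the functions `encode` and `decode` are hypothetical and are not taken from the GRASP Ontology module. The language codes follow those declared in the DTD (EN, NL, SP), and the fallback to English stands in for a language that is only partially supported.

```python
# Hypothetical term table: concept id -> label per language.
# The ids are invented for illustration; they are not real ontology ids.
LABELS = {
    1001: {"EN": "brown", "NL": "bruin", "SP": "marron"},
    1002: {"EN": "green", "NL": "groen"},  # no Spanish label yet
}


def encode(term: str, lang: str) -> int:
    """Translate a local-language term into its language-neutral integer id."""
    for concept_id, labels in LABELS.items():
        if labels.get(lang, "").lower() == term.lower():
            return concept_id
    raise KeyError(f"unknown term {term!r} for language {lang}")


def decode(concept_id: int, lang: str, fallback: str = "EN") -> str:
    """Translate an id back into a word, falling back if the label is missing."""
    labels = LABELS[concept_id]
    return labels.get(lang, labels[fallback])


stored = encode("brown", "EN")   # integer kept in the description
shown = decode(stored, "NL")     # word displayed on a Dutch-language node
```

Because only the integer is stored and exchanged, a node needs no knowledge of the language in which a description was originally entered.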
Consider the descriptor "category". It is composed of three values which define the object category according to a classification. The first level of the classification groups objects into three types: instruments, fine art objects, and decorative arts. If a decorative art object is being described there will be, at the second level, the choice between ceramics, textiles, costumes, and glassware. And finally, if it is a textile that is being described, then at the third level the choice will be among tapestry, laceware, silkware, etc.
The user assigns values to descriptors with the HCI module; during this description process the Ontology system guides the user by computing, on the fly, progressively smaller sets of permitted values on the basis of the description entered so far.
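The narrowing of choices for the category descriptor can be sketched as a walk down a tree. This is a minimal illustration, not the Ontology module itself: `CATEGORY_TREE` and `allowed_values` are hypothetical names, and the tree holds only the values named in the text above (the third level is incomplete, and branches the text does not detail are left empty).

```python
# Classification tree limited to the values mentioned in the text.
CATEGORY_TREE = {
    "instruments": {},
    "fine art objects": {},
    "decorative arts": {
        "ceramics": {},
        "textiles": {"tapestry": {}, "laceware": {}, "silkware": {}},
        "costumes": {},
        "glassware": {},
    },
}


def allowed_values(path):
    """Return the choices open at the next level, given the choices made so far."""
    node = CATEGORY_TREE
    for choice in path:
        node = node[choice]  # raises KeyError on a value that is not permitted
    return sorted(node)


first_level = allowed_values([])
third_level = allowed_values(["decorative arts", "textiles"])
```

Each choice shrinks the set offered at the next step, which is the behaviour described for the HCI and Ontology modules.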
This section will briefly describe a typical usage scenario.
A police officer located in London describes a Ming vase which was reported stolen by its owner.
The object is described in the English version of the HCI module with the assistance of the Ontology module.
The description is forwarded to the Adaptor module which handles the storage and the distribution. The object is marked as Lost/Stolen.
A few days later a Dutch police officer finds an object of suspect provenance in an abandoned van in the outskirts of Amsterdam - our Ming vase.
Back in the office he introduces a description of the object into the GRASP system using the Dutch version of the HCI module.
The object is forwarded to the Adaptor module and marked as Found/Recovered.
The system reports a list of possible matching candidates for the newly-introduced object, among which is the Ming vase previously reported stolen. The objects are illustrated by a thumbnail sketch.
The Dutch officer identifies the vase.
The Dutch officer contacts his colleagues in London and checks whether the two vases are indeed one and the same object.
The Ming vase is returned to its rightful owner.
The descriptors are organized in a hierarchical structure: the description tree. Additionally, description trees must conform to a set of rules that define the structure and occurrence of each descriptor.
The description structure is, from a conceptual viewpoint, very close to the hierarchical structure of a document. SGML thus appeared the obvious candidate for structuring, validating, versioning, and handling this kind of data.
A start was made by specifying the structure of the description tree (document) in a DTD (Document Type Definition). To do so the descriptors were mapped to elements, their structure and occurrence to element models, and the descriptor values to numerical attributes. The Ontology thesaurus takes care of the translation of the descriptor's terms (actually words) into numerical values according to local language settings.
Owing to the similarity between documents and object descriptions, SGML proved to be far more compact and elegant as a data definition language than a classical structured data definition language such as CORBA's Interface Definition Language (IDL). The resulting architecture uses CORBA as a low-level inter-component messaging system, while SGML takes care of defining the actual content of the messages that are exchanged.
SGML was also chosen for the storage of the descriptions themselves. The alternative technology, a relational database, is very efficient for handling many-to-many relationships but very clumsy for manipulating linear and hierarchical structures like the GRASP descriptions. The system still uses a relational database, but only as a back-end storage system; the descriptions are stored as SGML instances.
The object description DTD of the GRASP project and an example instance are given in the annexes.
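To make the mapping of descriptors to elements and of values to numerical attributes concrete, the sketch below reads a shortened copy of the example instance from the annexes. It rests on two assumptions: the instance happens to be well-formed XML, so Python's standard `xml.etree.ElementTree` parser is used here as a stand-in for an SGML parser, and no validation against the DTD is performed. The function name `descriptor_values` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the annex instance (some attributes and elements omitted).
INSTANCE = (
    '<PIT AID="Test Org." STS="FR" ITT="WHOLE">'
    '<SUM LAN="EN">a brown walnut queen anne kneehole desk </SUM>'
    '<WHL DID="0" QTY="1">'
    '<CAT><TOP V="60000139"/><MGR V="50037335"/><GRP V="50037680"/></CAT>'
    '<PHY><MCOLL><MCOL B="50127490" V="50127490"/></MCOLL>'
    '<COLCAR V="1"/></PHY>'
    '</WHL></PIT>'
)


def descriptor_values(text):
    """Map each element carrying a V attribute to its integer value.

    Keyed by element name, so a repeated descriptor keeps only its last value.
    """
    root = ET.fromstring(text)
    return {el.tag: int(el.attrib["V"]) for el in root.iter() if "V" in el.attrib}


values = descriptor_values(INSTANCE)
```

The free-text summary stays readable in the instance, while every ontology-backed descriptor is reduced to an integer attribute.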
Ranking is used to find matching candidates for a given reference object. Assuming that a distance can be defined between two descriptions, a successful ranking presents the objects in the database so that those closest to the reference object appear first.
The ranking strategy used for GRASP consists of three phases.
A pre-selection phase excludes most descriptions according to a very coarse classification based on a few elements of the description.
For each of the remaining candidates the system computes the distance to the reference object.
The list of descriptions is sorted and presented to the user.
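The three phases can be sketched as a short pipeline. This is an illustration only: the function `rank`, the dictionary-based descriptions, and in particular `mismatch_distance` are hypothetical placeholders, and the real GRASP distance function is not reproduced here.

```python
def mismatch_distance(a, b):
    """Placeholder distance: share of common descriptors whose values differ."""
    shared = [k for k in a if k in b and k != "id"]
    if not shared:
        return 1.0
    return sum(a[k] != b[k] for k in shared) / len(shared)


def rank(reference, candidates, coarse_key, distance=mismatch_distance):
    # Phase 1: pre-selection on a coarse classification.
    shortlist = [c for c in candidates if coarse_key(c) == coarse_key(reference)]
    # Phase 2: distance of each remaining candidate to the reference object.
    scored = [(distance(reference, c), c) for c in shortlist]
    # Phase 3: closest candidates first.
    scored.sort(key=lambda pair: pair[0])
    return scored


reference = {"id": "found", "top": 3, "colour": 10, "material": 20}
candidates = [
    {"id": "A", "top": 3, "colour": 10, "material": 21},
    {"id": "B", "top": 1, "colour": 10, "material": 20},  # dropped in phase 1
    {"id": "C", "top": 3, "colour": 10, "material": 20},
]
result = rank(reference, candidates, coarse_key=lambda d: d["top"])
ranked_ids = [c["id"] for _, c in result]
```

The pre-selection keeps the costly distance computation away from descriptions that cannot plausibly match.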
The main difficulties in finding the distance between two descriptions are:
To summarize, and without going into too much detail, the following assumptions were made:
As an example to illustrate the second assumption, suppose two values for the colour descriptor are being compared. The system will reduce the detail level of colour to a common level; comparing light brown with dark brown will therefore give a positive result because both colours are derived from brown; on the other hand comparing light brown with dark green will give a negative result because the first colour is derived from brown and the second from green. This simple technique was applied to all descriptors.
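The colour example can be sketched as follows, assuming that the common level is the base colour from which each term derives, as the example suggests. The table `PARENT` and the functions `base_term` and `matches` are hypothetical and hold only the terms used in the example.

```python
# Child term -> parent term; base colours have no parent.
PARENT = {
    "light brown": "brown",
    "dark brown": "brown",
    "dark green": "green",
    "brown": None,
    "green": None,
}


def base_term(term):
    """Climb the hierarchy until the base colour is reached."""
    while PARENT[term] is not None:
        term = PARENT[term]
    return term


def matches(a, b):
    """Compare two colour terms after reducing both to their base colour."""
    return base_term(a) == base_term(b)


brown_pair = matches("light brown", "dark brown")
mixed_pair = matches("light brown", "dark green")
```

Comparing at the reduced level makes the result tolerant of two observers describing the same object with slightly different precision.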
Although the test base is still rather limited, preliminary tests gave good results in terms of selectivity and resistance to perturbation (ranking of incomplete or slightly perturbed object descriptions).
It is possible that the distance calculation algorithm could be extended to other classes of documents.
The distance calculation is processed by an SGML parser. A reference object and a list of object descriptions are given as input; on output the application produces a list of relative distances. Owing to the stream-based way of processing the data, a large number of descriptions can be processed efficiently.
As stated before, SGML is used for handling object descriptions that are stored in the Adaptor module. But to ensure distribution, once the descriptions are stored in the Adaptor of one node, they have to be replicated in other nodes so as to ensure data availability to all nodes. This is performed by a transaction record and replay system.
All operations on a node - insertion, modification, and deletion of object descriptions - are logged (recorded) in the Adaptor. These transactions can be requested by other nodes that will repeat (replay) them in order to synchronize their database contents.
SGML was the natural choice as an exchange format for these transactions. Additionally, the generation and the processing of these transactions are carried out by means of an SGML parsing system coupled with an application language. Owing to the stream-based way of processing the information, large transaction files can be processed efficiently.
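The record and replay mechanism can be sketched as an ordered log that a peer reads from the point where it last stopped. This is a minimal sketch under assumptions: the class `Node`, its methods, and the tuple layout of a transaction are hypothetical, and the SGML encoding of the transactions is reduced to an opaque string.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    store: dict = field(default_factory=dict)    # description id -> SGML text
    log: list = field(default_factory=list)      # ordered local transactions
    applied: dict = field(default_factory=dict)  # peer name -> entries replayed

    def _apply(self, op, desc_id, body):
        if op == "delete":
            self.store.pop(desc_id, None)
        else:  # "insert" or "modify"
            self.store[desc_id] = body

    def record(self, op, desc_id, body=None):
        """Apply a local operation and log it for other nodes."""
        self._apply(op, desc_id, body)
        self.log.append((op, desc_id, body))

    def sync_from(self, peer):
        """Replay the peer's transactions not yet seen; return how many."""
        start = self.applied.get(peer.name, 0)
        for op, desc_id, body in peer.log[start:]:
            self._apply(op, desc_id, body)
        self.applied[peer.name] = len(peer.log)
        return len(peer.log) - start


london, amsterdam = Node("london"), Node("amsterdam")
london.record("insert", "46", "<PIT STS='LS'></PIT>")
london.record("modify", "46", "<PIT STS='FR'></PIT>")
replayed = amsterdam.sync_from(london)
replayed_again = amsterdam.sync_from(london)
```

In this sketch, replayed transactions are deliberately not appended to the receiving node's own log, so that they are not echoed back to their origin.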
GRASP is not considered to be a typical document management application. However, SGML proved well suited because of the similarities in structure between GRASP object descriptions and traditional documents. The benefits of using this technology were clear, particularly in terms of reduced system complexity and reduced development time.
<!--****************************************************************
Copyright (c) GRASP Consortium/ACSE, 1996-1998
*****************************************************************-->
<!--DTD of the SGML description message that allows to describe any art object
and containing the case properties of the item..-->
<!--****************************************************************
History:
*****************************************************************-->
<!--****************************************************************
Ranking
*****************************************************************-->
<!ENTITY % RANK-MODE-ON "IGNORE"> <!-- Do not modify! -->
<!ENTITY % RANK-MODE-OFF "INCLUDE">
<!--****************************************************************
Pseudo types
*****************************************************************-->
<!-- ISO-Latin with entities for Greek alphabet -->
<!ENTITY % date "CDATA"> <!-- +/-yyyymmdd -->
<!ENTITY % ontology "CDATA"> <!-- integer -->
<!ENTITY % closed-list "CDATA"> <!-- integer -->
<!ENTITY % living-list "CDATA"> <!-- integer -->
<!ENTITY % millimeter "CDATA"> <!-- integer -->
<!ENTITY % text "#PCDATA">
<!--***********************************************************
Textual elements definition
*****************************************************************-->
<!ENTITY % langCodes "(FR|SP|EN|NL|IT)">
<!ENTITY % textDscr
"(%short_description; |
%full_description; |
%title; |
%serial_number; |
%engravings; |
%inscriptions; |
%distinctive_features;)">
<!ELEMENT %textDscr; - - (%text;)>
<!ATTLIST %textDscr; %language; %langCodes; #REQUIRED>
<!--****************************************************************
onthology descriptors definition
*****************************************************************-->
<!ENTITY % dsc-onthology
"%technique;|
%period;|
%style;|
%main_material-1;|
%main_material-2;|
%additional_material;|
%region;|
%place;|
%form;|
%visual_texture;|
%intended_location;|
%main_colour;|
%background_colour;|
%other_colour;|
%object_type;|
%functional_context;|
%subject_matter_type;|
%pattern;">
<!ELEMENT (%dsc-onthology;) - - EMPTY>
<!ATTLIST (%dsc-onthology;)
%value; %ontology; #REQUIRED
%baseterm; %ontology; #REQUIRED>
<!--****************************************************************
Property Item definition
*****************************************************************-->
<![%RANK-MODE-OFF;[
<!ELEMENT %ptyitem; - -
((%short_description;)+,
(%full_description;)*,
(%whole;)+,
(%part;)*,
(%url;)?,
(%imgs;)?)>
]]>
<![%RANK-MODE-ON;[
<!ELEMENT %ptyitem; - -
((%short_description;)+,
(%full_description;)*,
(%whole;,candids?)+,
(%part;,candids?)*,
(%url;)?,
(%imgs;)?)>
<!ELEMENT candids - - ((%ptyitem;)*)>
]]>
<!ATTLIST %ptyitem;
%status; (ls|fr|dm) ls
%authority_id; CDATA #REQUIRED
%private_id; CDATA #REQUIRED
%officer; CDATA #REQUIRED
%case_id; CDATA #IMPLIED
%set_id; CDATA #IMPLIED
%date_happened; %date; #REQUIRED
%date_reported; %date; #REQUIRED
%date_inserted; %date; #REQUIRED
%date_modified; %date; #IMPLIED
%grasp_node; CDATA #IMPLIED
%grasp_id; CDATA #IMPLIED
%db_origin; CDATA #IMPLIED
%item_type; (whole|set|part) whole>
<!-- case-id, set-id are optional -->
<!--****************************************************************
whole part definition
*****************************************************************-->
<!ELEMENT (%whole;|%part;) - -
(%category;,
%production;,
%physical;,
(%object_type;)?,
(%functional_context;)?,
(%subject_matter_type;)?,
(%subject_matter_content-l;)?,
(%component-l;)?,
(%pattern-l;)?,
(%distinctive_features;)?)>
<!ATTLIST (%whole;|%part;)
%descr_id; CDATA #REQUIRED
%quantity; CDATA "1">
<!ELEMENT %category; - - (%top_level;,%main_group;,%group;)>
<!ENTITY % dsc-category "%top_level;|%main_group;|%group;">
<!ELEMENT (%dsc-category;) - - EMPTY>
<!ATTLIST (%dsc-category;)
%value; %ontology; #REQUIRED>
<!ELEMENT %subject_matter_content-l; - - (%subject_matter_content;)*> <!-- SWI -->
<!ELEMENT %component-l; - - (%component;)*> <!-- SWI -->
<!ELEMENT %pattern-l; - - (%pattern;)*>
<!ELEMENT %subject_matter_content; - - EMPTY>
<!ATTLIST %subject_matter_content;
%subject_matter_content_contents; %ontology; #REQUIRED -- SWI --
%subject_matter_quantity; CDATA "1" -- SWI --
%subject_matter_property; %ontology; #IMPLIED> <!-- SWI -->
<!ELEMENT %component; - - EMPTY>
<!ATTLIST %component; -- SWI --
%component_type; %ontology; #REQUIRED -- SWI --
%component_quantity; CDATA "1" -- SWI --
%component_property; %ontology; #IMPLIED> <!-- SWI -->
<!--****************************************************************
production definition
*****************************************************************-->
<!ELEMENT %production; - -
((%maker-l;)?,
(%technique-l;)?,
(%period;)?,
(%time_period;)?,
(%year_of_make;)?,
(%style;)?,
%main_material-1;,
(%main_material-2;)?,
(%additional_material-l;)?,
(%place;)?,
(%region;)?)>
<!ELEMENT %maker-l; - - (%maker;)*>
<!ELEMENT %maker; - - EMPTY>
<!ATTLIST %maker;
%value; %living-list; #REQUIRED>
<!ELEMENT %technique-l; - - (%technique;)*>
<!ELEMENT %year_of_make; - - EMPTY>
<!ATTLIST %year_of_make;
d %date; #REQUIRED>
<!ELEMENT %time_period; - - EMPTY>
<!ATTLIST %time_period;
%from; %date; #REQUIRED
%to; %date; #REQUIRED>
<!ELEMENT %additional_material-l; - - (%additional_material;)*>
<!--****************************************************************
physical definition
*****************************************************************-->
<!ELEMENT %physical; - -
((%form;)?,
(%visual_texture-l;)?,
%measurements;,
(%intended_location;)?,
(%main_colour-l;)?,
(%background_colour;)?,
%colour_cardinality;,
(%other_colour-l;)?,
(%markings;)?)>
<!ELEMENT %visual_texture-l; - - (%visual_texture;)*>
<!ELEMENT %main_colour-l; - - (%main_colour;)*>
<!ELEMENT %other_colour-l; - - (%other_colour;)*>
<!ELEMENT %colour_cardinality; - - EMPTY>
<!ATTLIST %colour_cardinality;
%value; %closed-list; #REQUIRED> <!-- 0|1|2|9=many -->
<!ELEMENT %measurements; - - EMPTY>
<!ATTLIST %measurements;
%weight; CDATA #REQUIRED
%height; %millimeter; #REQUIRED
%length; %millimeter; #REQUIRED
%width; %millimeter; #REQUIRED>
<!-- weight in grams
heigth: when object is in its natural position (could be 0 for lying objects ...)
length: longest other dimension
width: shortest other dimension (could be 0 if wall-suspended ...)
-->
<!--****************************************************************
markings definition
*****************************************************************-->
<!ELEMENT %markings; - -
((%seals_marks-l;)?,
(%title;)*,
(%serial_number;)*,
(%signature;)?,
(%engravings;)?,
(%inscriptions;)?)>
<!ELEMENT %seals_marks-l; - - (%seals_marks;)*>
<!ELEMENT %seals_marks; - - EMPTY>
<!ATTLIST %seals_marks;
%value; %living-list; #REQUIRED> <!-- SWI -->
<!-- %baseterm; %ontology; #REQUIRED-- -- SWI!!! -->
<!ELEMENT %signature; - - EMPTY>
<!ATTLIST %signature;
%position; %closed-list; #REQUIRED
%year; CDATA #IMPLIED
%signature_name; CDATA #IMPLIED>
<!-- position
0 = no signature
1 = bottom right
2 = bottom left
3 = yes, undefined position
4 = yes, on the back
5 = top right
6 = top left
-->
<!--****************************************************************
others definitions
*****************************************************************-->
<!-- url where item is further described
(not replicated with ptyitem on other nodes as an url may change locally
out of GRASP control) -->
<!ELEMENT %url; - - EMPTY>
<!ATTLIST %url;
name CDATA #REQUIRED>
<!-- Images and thumbnails
'''''''''''''''''''''
Images and thumbnails are referred by IDs and are not directly included
in the SGML flows. IDs could be file names in a certain context, TBD by ASTRA -->
<!ELEMENT %imgs; - - EMPTY>
<!ATTLIST %imgs;
folderid CDATA #REQUIRED>
<PIT AID="Test Org." DHAP="19980202" DINS="19980223" DMOD="19980610" DREP="19980202" GID="46" GND="0" ITT="WHOLE" OFC="t" PID="se" STS="FR">
<SUM LAN="EN">a brown walnut queen anne kneehole desk </SUM>
<WHL DID="0" QTY="1">
<CAT><TOP V="60000139"/><MGR V="50037335"/><GRP V="50037680"/></CAT>
<PRD><PER B="50021047" V="50021047"/><TPR F="1700" T="1720"/><MM1 B="50132451" V="50012476"/><RGN B="50020656" V="60000187"/></PRD>
<PHY><MSR HGH="0" LGT="0" WDT="0" WGH="0"/><MCOLL><MCOL B="50127490" V="50127490"/></MCOLL><COLCAR V="1"/></PHY>
<OBT B="50136379" V="50136379"/>
</WHL>
</PIT>
[GRASPD51]: The GRASP consortium: Project Deliverable 5.1 Functional Specifications; Public project deliverable available on GRASP's Web site (http://www.arttic.com/GRASP/).
Please e-mail your comments to Jorge Leal Portela at jlp@sgmltech.com.
This paper was first published in the Conference Proceedings of Markup Technologies '98 US, November 1998, pp 145-52. © The SGML Technologies Group 1998