Beyond DTDs: constraining data content
Jose Carlos Ramalho
Computer Science Department (Departamento de Informática), University of Minho, Largo do Paço, 4709 Braga CODEX, Portugal. Email: jcr@di.uminho.pt Web: http://www.di.uminho.pt/~jcr/
Biographical notice: |
Jose Carlos Ramalho |
J.C. Ramalho is a teacher at the Computer Science Department of the University of Minho. He holds a Masters in Compiler Construction and is currently working on his PhD thesis on the subject "Document Semantics and Processing". He has managed several SGML projects and has started doing some consulting.
Pedro Henriques |
Computer Science Department (Departamento de Informática), University of Minho, Largo do Paço, 4709 Braga CODEX, Portugal. Email: prh@di.uminho.pt Web: http://www.di.uminho.pt/~prh/
Biographical notice: |
Pedro Henriques |
ABSTRACT: |
To conclude the paper, we present a simple solution that implements the discussed constraint language and puts it to work with existing SGML applications (our case studies).
Introduction |
This problem gets even worse if, instead of a single author, we have an authoring team; if, instead of authoring, we are transcribing; or if we are working in a hurry.
At this stage you could say: "Relax, we have SGML!". |
Yes, we do have SGML. And SGML can solve up to 50% or 60% of the problem. |
With this scenario in mind, it is easy to conclude that any automatic validation task will help to improve the quality of electronic publishing.
SGML plays an important role: it provides automatic structural validation and, in certain cases, can guide an inexperienced author through writing a particular kind of document.
However, one piece is still missing from this validation universe: semantic validation. Of all the validation tasks, it is the last that should be performed, and the most complex and difficult to implement.
In the next sections, we will address this problem, aiming to specify a path towards a solution.
Case studies: What can go wrong with existing SGML applications |
In this section we present two case studies that will illustrate our proposal. |
These two case studies emerge from a project in which we are in charge of collecting information from various sources and making it available through the Internet.
In both of them, there are problems that could be solved if some simple automatic semantic validation was available. |
Parish Registers |
Archaeology |
Another information source we have is a group of archaeologists. |
SGML and Semantics |
Can we just add constraints to SGML in order to process semantic validations? What is missing? Do we need more than just constraints? |
It seems that a simple constraint language could do the job. However, we can distinguish two completely different steps on the way to a semantic validation model:
These two steps have different aims and correspond to different levels of implementation difficulty.
The definition step involves defining a new language or adopting an existing one.
For the processing step, we need to create an engine capable of processing statements written in that language.
Somewhere between these two steps, we will face the need for typed information, with all its inherent complexity.
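The processing step can be made concrete with a minimal sketch in Python (our choice of language; the paper prescribes none): an engine that checks constraint predicates against values extracted from a document. The predicate-table representation of constraints is an assumption made here for illustration, not part of the model itself.

```python
# Minimal sketch of a constraint engine: constraints map an element
# name to a predicate, and the engine reports every violating value.
# Both the table layout and the (element, value) pair list are
# illustrative assumptions.

def check(constraints, values):
    """Return the (element, value) pairs that violate a constraint."""
    errors = []
    for element, value in values:
        predicate = constraints.get(element)
        if predicate is not None and not predicate(value):
            errors.append((element, value))
    return errors

# The latitude constraint from the second case study.
constraints = {"latitude": lambda v: 39 < v < 43}
values = [("latitude", 41.32), ("latitude", 55.0)]
print(check(constraints, values))  # [('latitude', 55.0)]
```

A real engine would of course parse the constraint statements from their textual form; here they are written directly as predicates to keep the sketch short.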
At this moment, you could ask: "Why do we need this extra complexity? Can't we live without types?". |
Look at the following example, taken from the second case study:
SGML document:
    ... <latitude>41.32</latitude> ...

Constraint:
    latitude > 39 and latitude < 43
In this example we want to ensure that every latitude value is within a certain range. We are performing a domain range validation. |
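This domain-range check can be sketched in Python, using an XML parser as a stand-in for an SGML one; the surrounding site element is a hypothetical wrapper invented for the example.

```python
import xml.etree.ElementTree as ET  # XML parser standing in for SGML

# Hypothetical fragment modeled on the second case study; the second
# latitude is deliberately out of range.
doc = "<site><latitude>41.32</latitude><latitude>55.00</latitude></site>"

def out_of_range(source):
    """Collect latitude values violating: latitude > 39 and latitude < 43."""
    tree = ET.fromstring(source)
    return [float(e.text) for e in tree.iter("latitude")
            if not (39 < float(e.text) < 43)]

print(out_of_range(doc))  # [55.0]
```

Note that the check already depends on reading the PCDATA content as a float, which is exactly the type-inference need discussed next.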
For the time being, we envisage two solutions to deal with the problem of data normalization and type inference: |
Obviously, the second is the one to follow: it is simple, and some DTD developers have already been concerned with data normalization and have implemented solutions.
There are probably others, but the best we have seen so far is the TEI DTD.
Example: |
... it happened in <date value="1853.10.05"> the fifth of October of the year 1853 </date> ... |
Here, the normalized form adopted for date elements is the ANSI format.
The examples above would look like the following: |
Example (latitude): |
... <latitude type="float">41.32</latitude> ... |
We can assume that when the value attribute is not instantiated the element content is already written in a normalized form. |
Example (dates): |
... it happened at <date type="date" value="1853.10.05">the fifth of October of the year 1853</date> ...
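A processor might combine the two attributes along the following lines: value supplies the normalized form (falling back to the element content when absent), and type selects a converter. This is a sketch under our own assumptions; the converter table and the entry wrapper element are invented for illustration.

```python
import xml.etree.ElementTree as ET
from datetime import date

doc = ('<entry>it happened at <date type="date" value="1853.10.05">'
       'the fifth of October of the year 1853</date></entry>')

# Assumed table mapping a type name to a converter; the "date" entry
# parses the ANSI-style yyyy.mm.dd form used in the examples.
CONVERTERS = {
    "float": float,
    "date": lambda s: date(*map(int, s.split("."))),
}

def typed_value(elem):
    # When value is present it holds the normalized form; otherwise the
    # element content is assumed to be normalized already.
    raw = elem.get("value", elem.text)
    convert = CONVERTERS.get(elem.get("type"), str)
    return convert(raw)

tree = ET.fromstring(doc)
print(typed_value(tree.find("date")))  # 1853-10-05
```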
At this point we could ask the following question: do we need to type every element, or just the atomic ones (PCDATA)? We will answer this question in the next section.
Designing a Model for Constraining |
The question raised at the end of the previous section is a yes-or-no question, but either answer carries a heavy weight.
It will be very easy to adapt that processing model to work with the changes we have proposed so far: the value attribute to deal with data normalization, and the type attribute to help with type inference.
So, for the moment, the model we propose to deal with constraints is represented in the following figures. |
Proposed SGML authoring and processing model |
CAMILA Validation Process |
The figure illustrates the new validation process. Both the designer and the user must provide information to set up this process.
Are We Reinventing the Wheel? |
In this particular case, we are trying to establish a path towards the implementation of semantic analysis applied to SGML documents. |
Looking around the Computer Science area, we find a very similar problem with existing solutions: programming languages.
We can easily map what we are trying to do with SGML documents to what has been done with programming languages.
The following table shows a comparison between these two worlds:
Programming Languages | SGML Documents
program               | document instance
language              | DTD
terminal symbols      | SGML declaration
grammar rules         | DTD
We can look at an SGML DTD as a formal grammar. From there, it is easy to conclude the correspondence shown in the table above.
Just as grammars are the heart of programming languages, the DTD is the heart of an SGML document.
In the beginning, programming languages had to be processed in order to obtain the runnable code corresponding to a program. The processing of a program, compilation, comprised the following steps:
In 1968, Donald Knuth introduced a new approach: Attribute Grammars.
We feel that the approach we are following is very close to the Attribute Grammars approach: the problems we are facing have already been faced there in the past.
Perhaps the solutions found for Attribute Grammars can help us solve our problems. For now, this parallelism makes us feel we are on the right track.
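To make the parallel concrete, here is a toy Python illustration of the Attribute Grammars idea: a synthesized attribute computed bottom-up over a tree, much as a semantic validator would decorate an SGML document tree with typed values. The tuple-based node encoding is invented purely for this sketch.

```python
# Toy synthesized-attribute evaluation over an expression tree.
# Each node is (rule, *children); "num" nodes carry a literal value,
# "add"/"mul" nodes synthesize their value from their children's.

def evaluate(node):
    """Synthesized attribute: the numeric value of an expression tree."""
    op, *children = node
    if op == "num":
        return children[0]
    left, right = (evaluate(c) for c in children)
    return left + right if op == "add" else left * right

# 2 + (3 * 4)
tree = ("add", ("num", 2), ("mul", ("num", 3), ("num", 4)))
print(evaluate(tree))  # 14
```

In the SGML setting, the grammar rules would be the DTD's content models and the synthesized attributes would be the typed values the constraint checker works on.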
Future Work |
We hope that we will help to improve document quality in the future. |
Acknowledgments |