[This local archive copy is from the official and canonical URL, http://www.juridicum.su.se/iri/corpus/expersys.html; please refer to the canonical source document if possible.]


The Comparative Part of the Corpus Legis Project - Using SGML for Intelligent Information Retrieval of Legal Documents

by Georg Haider 1,2, Cecilia Magnusson Sjöberg 1, Gerald Quirchmayr 2, Verena Sebald 1,2

1 The Swedish Law & Informatics Research Institute
Faculty of Law
Stockholm University
S-106 91 Stockholm

2 Institut für Angewandte Informatik und Informationssysteme
Universität Wien
Liebiggasse 4
A-1010 Wien

Abstract This paper describes the comparative part of the Corpus Legis Project, carried out at Stockholm University with the participation of members of the Institute for Applied Computer Science at the University of Vienna. The goal of this part of the project is to experiment with new ways of representing legal structures in electronic documents in order to facilitate information retrieval, in particular by means of new standards such as SGML and HyTime.

Keywords: legal information retrieval, SGML, structured documents


1.1 The Corpus Legis Project

Corpus Legis is a project carried out by the Law and Informatics Research Institute at the Faculty of Law, together with the Department of Computational Linguistics, both at Stockholm University. Two diploma students from the Institute of Applied Computer Science and Information Systems at the University of Vienna joined the project for six months and concentrated on its technical, that is, computer-related part.

The overall aim of the project was to create a permanent, computerised legal text resource at Stockholm University for legal-linguistic studies. This resource is not meant to consist of Swedish legal texts only; laws valid in other countries (and written in other languages) are to be included as well. Users of the system (such as lawyers, teachers, or students) should be able to read through the legal resources interactively, which means that the system must provide efficient information retrieval.

It was decided to use SGML as the standard for representing the information, since it allows both the structure and the legally relevant content of a document to be marked up explicitly.

The Corpus Legis project is split into three parts:

• the part dealing with the "chain of law" focuses on parliamentary legal information, such as government bills, acts, etc.

• the "comparative part" deals with a selection of acts from different European countries.

• the "historical part" treats historical documents within the field of company law.

1.2 The Comparative Part of the Project

As mentioned above, the comparative part of the project deals with acts from different (European) countries and their translations into different languages (which can exist in different versions, such as official legal translations or commercial versions). Its purpose is to make it possible to work efficiently with these texts: users should be able to search legal texts, browse through different documents, and compare them with each other in a useful and effective way.

As Sweden has recently joined the European Union, the internationalisation of the Swedish legal system has become more important (for the legal aspects of the project cf. [Flaherty 89], [Tiberg et al. 94]). Therefore, a comparison between different international legal texts is turning into a necessity.

The following legal documents were used as a basis for our work (both in the original language and in the official English translation):

• the Data Protection Directive of the European Union (95/C 93/01),

• the Swedish Data Protection Act (1973:289),

• the German Data Protection Law (BGBl. I, 2378, 2909),

• the Austrian Data Protection Law (BGBl 1978/565).


2.1 Goals

One main goal of the comparative part of the Corpus Legis project is to provide an efficient information retrieval system. The system is supposed to support users (lawyers, students, teachers, etc.) in their reading of the resources in an interactive way, so the following functionality must be provided by the system:

• Searching for words: The system supports the user in finding specific information by specifying a search expression. This search can be done in more than one document.

• Searching for legal elements: It is possible to search the documents for specific legal information. This is realised by allowing a search for SGML markup elements. The user can, for instance, search for legal definitions by searching for a <LegalDef> element.

• Browsing by means of hyperlinks: Information related to the text that is displayed at the moment can be shown by activating a hyperlink. For example, whenever a legal term is mentioned that is explained somewhere else in the document, a hyperlink can point to this definition.

• Comparing different documents: It is possible to compare parts of a legal text with related information contained in other legal documents by means of hyperlinks. For instance, a user can compare the legal definition of the term "data file" in the Swedish data protection act with the definition in the German or Austrian law.

In order to reach the goals mentioned above, we had to fulfil the following tasks:

• developing one single DTD (Document Type Definition) suitable for all documents we worked with,

• transforming documents into machine-readable formats,

• adding the markup to the documents,

• creating all files that are necessary to allow browsing of the documents (using SoftQuad's PanoramaPro browser).

2.2 The DTD

As the main goal was to compare different legal texts, we had to build one single DTD suitable for all of these documents (for SGML cf. [Goldfarb 90], [Herwijnen 94], [Burnard 91]).

2.2.1 HTML versus SGML

HTML (Hypertext Markup Language) is a special SGML application, which defines elements for the markup of hypermedia documents. Those documents are mostly displayed on the Internet by WorldWideWeb applications. We considered using HTML instead of SGML for our purposes. However, HTML allows only a fixed set of elements, and one of our main goals was to mark up legal information, so it was not possible to use HTML.

2.2.2 SGML Elements

Each document consists of different kinds of elements. We have decided to group them into the following classes of elements:

• Structure elements represent the structure of the documents (for instance, Preamble, Main Provisions, or Final Provisions).

• Legal elements contain the legal information of documents (for example, Legal Definitions). Markup of these elements must be added in order to show the legally relevant information of a document.

• Format elements represent the layout of a document, for example counted lists or bold characters. Parts of text that require the use of format elements can be found in each legal text: some words, such as definitions, are often displayed in italic or bold characters. That is why the DTD needs to contain format elements that allow markup of such a special layout.

To use SGML, every part of a document has to be assigned to one of these elements, so all elements occurring in any of the documents have to be identified. Some of them, like parts, chapters, or articles, exist in all of these documents, while others occur in a single document only (for example, a preamble is contained only in the EU Directive). An element must be included in the DTD even if it is found in just one document. For elements occurring in several documents, a common element name must be found.

2.2.3 DTD Structure

A further basic question is to what extent one should provide rules for applying the DTD to a document. A strict structure, which exactly defines the set of elements that can be used within other elements, means that the correct structure of the document is always guaranteed. Allowing a looser way of applying a DTD makes it much easier to change the DTD structure.
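
The trade-off can be illustrated with a small Python sketch. It is not the Comparative DTD: the content model and the element names Law, Part, Chapter, Article and Para are assumptions made for this example, and well-formed XML stands in for SGML so that the standard library parser can be used.

```python
import xml.etree.ElementTree as ET

# Hypothetical strict content model: each element names the children it may contain.
STRICT_MODEL = {
    "Law": {"Part"},
    "Part": {"Chapter"},
    "Chapter": {"Article"},
    "Article": {"Para"},
    "Para": set(),
}


def structure_errors(element, model):
    """Return (parent, child) pairs that the content model does not allow."""
    errors = []
    allowed = model.get(element.tag, set())
    for child in element:
        if child.tag not in allowed:
            errors.append((element.tag, child.tag))
        errors.extend(structure_errors(child, model))
    return errors


# A document that follows the strict nesting, and one that skips two levels.
good = ET.fromstring(
    "<Law><Part><Chapter><Article><Para>Text</Para></Article></Chapter></Part></Law>"
)
bad = ET.fromstring("<Law><Article><Para>Text</Para></Article></Law>")

good_errors = structure_errors(good, STRICT_MODEL)
bad_errors = structure_errors(bad, STRICT_MODEL)
print(good_errors)
print(bad_errors)
```

Under the strict model the second document is rejected because an Article appears directly inside Law. A looser model (for example, allowing Article under Law as well) would accept it, at the price of no longer guaranteeing a uniform nesting.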

2.3 The Linking Structure

Since only hyperlinks allow the efficient comparison of different documents, providing a linking structure was one important goal of the project.

Two kinds of links were to be implemented: internal links pointing to a target which is situated in the same document as the anchor, and external links pointing to information located in a second, different document.

The real problems that occur when working with hyperlinks relate to maintaining the entire linking structure.

• The problem starts with the question of who sets the links. The person doing this must, on the one hand, have a legal background and, on the other hand, know at least a little about working with computers. When working with legal texts in several languages, this person must additionally have a good knowledge of the languages used.

• Inserting links in large documents can be an extensive job.

• Creating "interpretative" links is even more difficult. These links occur when a certain part of the document is implicitly related to another part of the text. Deciding whether there is such a relation is very subjective; it depends on the view of the SGML author. The person performing this task must be very familiar with the legal text.

• It is not always a trivial task to create names for the ID attributes. All names must be unique within one document, and, furthermore, they should represent the content of the target in order to obtain a clear linking structure.

• A good compromise between too little and too extensive linking must be found. Especially in a large document, too many links may lead to a very complex structure, which is extremely difficult to maintain.

• All of these tasks get even more difficult when references between several legal texts are to be made, that is, when external links must be used.

• It can be very difficult to keep an overview of all links. If several persons are working together, it can become difficult to coordinate their links with each other.

After all the links are created, it is necessary to maintain the linking structure. When a new document is released and is to be inserted into the structure, new links must be created. The same problems as mentioned above occur here. When an old document is to be removed from the structure, all existing links must be updated. It must be guaranteed that all links still point to a valid target.

Neither creating nor maintaining the linking structure is a trivial task, and at the moment neither can be done automatically. As far as we know, both still have to be carried out manually by one or more persons. A very interesting approach toward automatic linking for HTML documents is described in [Mowbray et al. 96].
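
Deciding where links belong remains manual work, but two of the consistency conditions named above (IDs unique within one document, every link pointing to a valid target) can at least be checked mechanically. The following Python sketch shows one way to do so. It is an illustration only: the element name Link, the attributes id, idref and doc, and the document keys "se" and "de" are assumptions, well-formed XML stands in for SGML, and the HyTime mechanism used in the project is not modelled.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def duplicate_ids(root):
    """Return the IDs that occur more than once within one document."""
    counts = Counter(el.get("id") for el in root.iter() if el.get("id") is not None)
    return sorted(i for i, n in counts.items() if n > 1)


def dangling_links(corpus):
    """Return (source_doc, target_doc, target_id) for links without a valid target.

    corpus maps a (hypothetical) document name to its parsed root element.
    """
    ids = {
        name: {el.get("id") for el in root.iter() if el.get("id") is not None}
        for name, root in corpus.items()
    }
    broken = []
    for name, root in corpus.items():
        for link in root.iter("Link"):
            target_doc = link.get("doc", name)  # no doc attribute: internal link
            target_id = link.get("idref")
            if target_id not in ids.get(target_doc, set()):
                broken.append((name, target_doc, target_id))
    return broken


dup = duplicate_ids(ET.fromstring('<Law><Para id="a"/><Para id="a"/><Para id="b"/></Law>'))

corpus = {
    "se": ET.fromstring(
        '<Law><LegalDef id="def-datafile">placeholder</LegalDef>'
        '<Link doc="de" idref="def-datei">external link</Link></Law>'
    ),
    "de": ET.fromstring(
        '<Law><LegalDef id="def-datei">placeholder</LegalDef>'
        '<Link idref="def-datei">internal link</Link></Law>'
    ),
}
before = dangling_links(corpus)  # all targets exist
del corpus["de"]                 # an old document is removed from the structure
after = dangling_links(corpus)   # the external link from "se" is now dangling
print(dup, before, after)
```

Such a check only reports links that have become invalid; repairing them, or deciding which new links a newly added document needs, still requires a person familiar with the legal texts.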


3.1 The Comparative DTD

The considerations described in the previous chapter led to the Document Type Definition for the comparative part of the Corpus Legis project: the Comparative DTD (cf. [Goldfarb 90], [Herwijnen 94]).

3.1.1 Structure Elements

We found five basic structure elements. Some of them are obligatory, while others do not occur in all legal texts. Within one document all of these elements may occur in any order.

Document Information: Document information is the text at the beginning of the law which contains general information about the legal text. For instance, it shows the title of the legal text, date of release, etc. It may also contain some text elements.

Preamble: A preamble is introductory text, which may, for instance, contain a table of contents. A preamble was found only in the EU Directive.

Main provisions: Main provisions contain the main legal information itself. All legal texts are of course divided into several parts and chapters, etc. We had to compare all those divisions found in all documents and name them. We decided to use a very strict structure of these elements.

Final Provisions: Final provisions can be described as a kind of additional legal information. They are not part of the main provisions, although they contain legal text. Final provisions are usually located at the end of the law. They may contain temporary provisions, information about entry into force of this legal text, or an appendix.

Additional Information: Additional information is not part of the legal text. It is rather information about the text, added by the author of the SGML file, that is, the person who adds the markup to the document. It contains information about the country where the law is valid, the type of the legal text (law, statute, directive, etc.), the organisation responsible for the content, etc. Since this is internal information, it will usually not be displayed by the browser. It can be called upon when needed, though.

3.1.2 Legal Elements

Examples of legal elements are legal definitions, juridical remedies, or charges. Finding these legal elements had to be done by a legal expert, as it is not possible for non-lawyers to perform this task.

3.1.3 Format Elements

Character format elements are used to specify the format of text characters. To begin with, we added the elements Bold and Ital, for bold and italic characters.

Paragraph layout elements specify the layout of a whole paragraph. For example, the counted list element can be used to number successive paragraphs, which is needed for most legal texts. Further elements are used to mark a simple paragraph and a carriage return.

3.1.4 Other Elements

Other elements that had to be included in the DTD are elements that can be used to realise hyperlinks (see next section), elements for editorial comments, and footnotes.

3.1.5 Hyperlinks

There are several ways in which both kinds of links can be implemented. As we were using SoftQuad's PanoramaPro browser, we chose approaches that are accepted by Panorama.

For internal links we used the "ID / IDRef" approach. Each target is uniquely named by adding an attribute "ID". A hyperlink can then point to such a target by carrying an "IDRef" attribute that contains the ID value of the target.
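
How an internal link is resolved can be sketched in a few lines of Python. This is a minimal illustration, not the mechanism implemented by Panorama: the element name TermRef and the lower-case attribute names id and idref are assumptions, the text content is placeholder text, and well-formed XML stands in for SGML. Only <LegalDef> is an element name taken from this paper.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<Law>"
    '<Article><LegalDef id="def-datafile">placeholder definition text</LegalDef></Article>'
    '<Article><Para>A <TermRef idref="def-datafile">data file</TermRef> is mentioned.</Para></Article>'
    "</Law>"
)

# Index every element that carries an ID; IDs are unique within one document.
targets = {el.get("id"): el for el in doc.iter() if el.get("id") is not None}


def follow(link):
    """Return the element that the link's idref attribute points to."""
    return targets[link.get("idref")]


link = doc.find(".//TermRef")
target = follow(link)
print(target.tag, target.get("id"))
```

Because the anchor stores only the ID value, the target can be moved within the document without touching the link, as long as its ID stays the same.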

The HyTime standard was chosen to realise external links that actually allow the comparison of different legal texts. In particular, we used HyTime's nameloc function. How such links are implemented is not described here, since an exact description would be too extensive to be shown in this report.

It should be mentioned that HyTime uses the same ID attribute as the ID / IDRef approach. That means, the two methods are "compatible" with each other, which is a big advantage for using both of them together.

3.2 Used Software

Near & Far is used to create the Document Type Definition within a graphical environment. The resulting DTD file can be exported and stored on the hard disk.

The next step is to create ASCII text files of all legal documents, which we did by scanning them from paper copies.

After that, markup can be added to the text files. This job is done by using AuthorEditor. In order to import DTD files, they have to be converted into a special format that can be read by AuthorEditor. This conversion is done by RulesBuilder, which creates a rules file out of a DTD. AuthorEditor supports the insertion of markup elements in a convenient way. A Rules Checking function allows the user to check the document for errors in the markup structure. After the markup is finished, the resulting SGML file can be exported to the hard disk.

In our case, PanoramaPro is used to display the finished SGML documents. Style Sheets must be created to determine how each of the SGML elements is displayed on the screen. The Panorama user can specify the font, paragraph layout, colour, etc. Furthermore, it is possible to display a second window that shows the structure of the document. This can be realised by generating Navigator files.

3.3 Sample Markup

The following documents were marked with the Comparative DTD: the EU Data Protection Directive (in English), the Swedish Data Protection Act (Swedish original and English translation), the German Data Protection Law (German original and English translation), and the Austrian Data Protection Law (German original).

To demonstrate HTML features and possibilities, we also used HTML to mark up all of these documents. Of course, HTML can only be used to mark up certain structure elements, which are given by the HTML DTD; legal elements cannot be marked in the way that was desired in this project.

3.4 The Panorama Browser

As mentioned above, we have used PanoramaPro as our browsing software. A style sheet file contains information about how to display all of the elements contained in the DTD. A navigator file can be created to generate a browsing tree, which is shown by Panorama in a separate window, together with the document text itself.

Both files were created to display all elements in an attractive way and to allow efficient navigation through the documents.

Unfortunately, Panorama supports only limited searching functions: full-text search allows a search for words within the displayed text, and context search enables the user to search for SGML elements. Both search methods can be combined with each other; for example, "data file in <LegalDef>" searches for the words "data file" in all "legal definition" elements.
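
The idea behind such a combined search can be sketched in Python as follows. This is not Panorama's implementation or query syntax: the function name search_in_element is hypothetical, the sample text is placeholder text, the element name Para is an assumption, and well-formed XML stands in for SGML.

```python
import xml.etree.ElementTree as ET


def search_in_element(root, words, tag):
    """Return the elements named `tag` whose text contains `words` (case-insensitive)."""
    needle = words.lower()
    return [
        el for el in root.iter(tag)
        if needle in "".join(el.itertext()).lower()
    ]


sample = ET.fromstring(
    "<Law>"
    "<LegalDef>data file: placeholder definition</LegalDef>"
    "<LegalDef>data subject: placeholder definition</LegalDef>"
    "<Para>A data file is mentioned here outside any definition.</Para>"
    "</Law>"
)

# Rough equivalent of the query "data file in <LegalDef>".
hits = search_in_element(sample, "data file", "LegalDef")
for el in hits:
    print(el.text)
```

The occurrence of "data file" in the ordinary paragraph is not reported, since the search is restricted to <LegalDef> elements. Applying the same function to several parsed documents in turn would give a simple search across documents.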


The results of our work within this project are the basis for further development. A possible next step is the enlargement of the corpus by adding more legal documents and their translations. Panorama supports the distribution of SGML documents through WorldWideWeb, which is an advantage, since it is desired to publish documents via the Internet too. The problem of maintaining linking structures must be taken into consideration. Browsing and especially searching facilities must be improved. It must be possible to search for specific legal elements in several documents. A useful application could be the release of a CD-ROM containing all legal texts with SGML markup, a DTD, and searching facilities.


[Burnard 91] Burnard Lou, Sperberg-McQueen Michael; An Introduction to TEI Tagging, Oxford University Computing Service; 1991.

[Flaherty 89] Flaherty David H; Protecting Privacy in Surveillance Societies, The University of North Carolina Press; 1989.

[Goldfarb 90] Goldfarb Charles F.; The SGML Handbook, Oxford University Press; 1990.

[Herwijnen 94] Herwijnen Eric van; Practical SGML, Kluwer Academic Publishers; 2nd edition, 1994.

[Mowbray et al. 96] Mowbray Andrew, King Geoffrey, Greenleaf Graham; AUSTLII - Technology and politics of law on the net, 4th International Conference on Substantive Technology in the Law School, Centre de Recherche en Droit Public, Faculté de Droit, Université de Montreal, 1996.

[Tiberg et al. 94] Tiberg Hugo, Sterzel Fredrik, Cronhult Pär; Swedish Law, a survey, The authors and Juristförlaget JF AB; Stockholm, 1994.