© Martin Bryan, The SGML Centre, 29 Oldbury Orchard, Churchdown, Glos. GL3 2PU, UK
This article explains why HTML, SGML and HyTime form a natural progression in terms of information management, and how the three techniques can be used to complement one another.
First let me explain what HTML, SGML and HyTime stand for, for those who have not come across these acronyms before:
The World Wide Web (WWW) is a set of documents and related data (indexes, search tools, forms, etc.) that are coded in a standardized format, HTML, to allow open interchange between computers.
In most cases WWW documents are transmitted from computer to computer via an international communications routing system referred to as the Internet, but HTML-coded documents can also be interchanged on discs or tape where on-line communication is not possible.
HTML-coded documents can be read on any HTML document browser. A wide range of browsers are available, often at little or no cost. The two most popular are:
All current browsers support Version 2 of the HTML standard (HTML2). Figure 1 shows the elements that can be identified in an HTML2 document. The next release of products will support Version 3 of the standard (HTML3). Figure 2 indicates some of the elements that could be added to the list of valid document components when HTML3 has finally been agreed.
   <HTML>         HTML document
   <HEAD>         Document head
   <TITLE>        Title of document
   <LINK...>      Link from this document
   <ISINDEX>      Document is a searchable index
   <BASE...>      Base context document
   <NEXTID...>    Next ID to use for link name
   <META...>      Generic metainformation
   <BODY>         Document body
   <H1>           Heading, level 1
   <H2>           Heading, level 2
   <H3>           Heading, level 3
   <H4>           Heading, level 4
   <H5>           Heading, level 5
   <H6>           Heading, level 6
   <P>            Paragraph
   <TT>           Typewriter text
   <B>            Bold text
   <I>            Italic text
   <EM>           Emphasized phrase
   <STRONG>       Strong emphasis
   <CODE>         Source code phrase
   <SAMP>         Sample text or characters
   <KBD>          Keyboard phrase, e.g. user input
   <VAR>          Variable phrase or substitutable
   <CITE>         Name or title of cited work
   <BR>           Line break
   <HR>           Horizontal rule
   <A...>         Anchor; source/destination of link
   <IMG...>       Image; icon, glyph or illustration
   <PRE...>       Preformatted text
   <XMP>          Example section
   <LISTING>      Computer listing
   <PLAINTEXT>    Plain text passage
   <DL>           Definition list, or glossary
   <DT>           Term in definition list
   <DD>           Definition of term
   <UL>           Unordered list
   <OL>           Ordered, or numbered, list
   <DIR>          Directory list
   <MENU>         Menu list
   <LI>           List item
   <BLOCKQUOTE>   Quoted passage
   <ADDRESS>      Address, signature, or byline
   <FORM...>      Fill-out or data-entry form
   <INPUT...>     Form input datum
   <SELECT...>    Selection of option(s)
   <OPTION...>    A selection option
   <TEXTAREA...>  An area for text input

Figure 1: Elements provided in Version 2 of the HTML DTD
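By way of illustration, a minimal document using a handful of the elements listed in Figure 1 might look like this (the content and the address in the link are, of course, invented):

    <HTML>
    <HEAD>
    <TITLE>A Sample HTML2 Document</TITLE>
    </HEAD>
    <BODY>
    <H1>Product News</H1>
    <P>This paragraph contains an <EM>emphasized</EM> phrase and a
    link to <A HREF="http://www.example.com/details.html">another
    document</A>.</P>
    <UL>
    <LI>First point
    <LI>Second point
    </UL>
    </BODY>
    </HTML>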
   <BANNER>    Banner headline to scrollable file
   <DIV>       Division within document
   <TABLE>     Table
   <TR>        Table row
   <TH>        Table heading
   <TD>        Table data cell
   <FIG>       Figure
   <OVERLAY>   Image overlay
   <FIGTEXT>   Text explaining figure
   <CREDIT>    Credit (source) for figure
   <CAPTION>   Caption to figure or table
   <SPOT>      Anchor point without contents
   <SUB>       Subscript
   <SUP>       Superscript
   <FN>        Footnote
   <MATH>      Mathematical formula*
   <TAB>       Tab stop definition
   <LH>        List header
   <STYLE>     Style sheet information for file

   * with appropriate sub-elements not listed here

Figure 2: Elements to be added to Version 3 of the HTML DTD
The key to the success of the WWW is the use of Uniform Resource Locators (URLs), which allow documents on other computers to be easily referenced from within a WWW document. URLs can contain a number of components, all of which can be defaulted. The main components are:
The main point about URLs is that, except for the last component, they do not rely on any information stored in the file being pointed to. As long as you know who owns the file, where a copy of it is stored, and the name it is stored under you can make a reference to it from any document you create.
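As an illustration, here is how a purely hypothetical URL might break down into the components discussed in the surrounding paragraphs:

    http://www.example.com/docs/reports/annual.html#summary

    machine identifier   www.example.com
    directory path       /docs/reports/
    file name            annual.html
    named segment        summary

The first three components locate the file itself; the named segment is the only part that depends on information stored inside the file being pointed to.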
One problem that authors typically come across when referencing documents on the WWW is that they need to refer their readers to a particular part of a long document that contains no named segments that can be referenced using the last component of the URL. In such cases all you can do is reference the file and tell people what to search for, or which segment of the file to look in. Unfortunately traditional forms of bibliographic reference, such as see pages nn-nn, do not work on systems, such as the WWW, that provide a continuous stream of text rather than a set of discrete pages. As we will see when we discuss the use of HyTime, you need to adopt a more generalized approach if you want to be able to reference any point in an electronic document.
Another problem area with URLs occurs when the owner of the referenced file decides to delete or move the file. There is no standard mechanism built into existing systems for tracking file movements. There is also no mechanism for letting the owners of referenced documents know who has referenced their documents, and from where. This means that, in most cases, it is not currently possible to notify referencing authors that an identifier needs to be updated.
While it is possible to leave out the machine identifier, directory path name and even the file name from a URL, all of which can then be presumed to be those of the referencing document, it is not possible at present to build URLs by providing separate pieces for the four components. As we will see when we discuss the role of HyTime later, this is a vital requirement for the efficient management of locators.
It is important to realise that WWW URLs can only identify one file at a time. If you want to identify multiple data sources you must define multiple link points. URLs can only identify where the data being referenced starts: they provide no indication of where the section of text being referenced ends. This makes it impossible to provide a URL-based equivalent for the phrase see pages nn-nn.
At present WWW browsers tend to give users very little control over how documents are presented. This is because there is no agreed mechanism for attaching style specifications to the elements defined by HTML2, so style information cannot be interchanged between systems. This is one of the areas being actively debated by the IETF working party responsible for developing the HTML3 standard. An additional hazard in this respect is the generally loose control placed on the nesting of markup tags within HTML2, which permits a very large number of possible nestings of styles. HTML3 will deprecate the use of loosely nested elements and recommend that elements be much more rigorously structured in future documents. This should then make it possible to provide a simple style language for the interchange of presentation information for HTML3 documents.
Note: The possibility of using SGML Open's DSSSL-Lite subset of ISO's Document Style Semantics and Specification Language (DSSSL) is currently being mooted.
Documenting products and services is a collaborative effort. In most cases multiple authors create sections that have to be integrated to form one or more output documents. Before a collaborative effort can commence an overall structure for the end result(s) needs to be defined, and suitable providers need to be identified for each component of the document.
While, traditionally, workflow management techniques have been used to combine multiply sourced data into a single structure, users are now beginning to realise that many different delivery forms may be required for any information set. In such cases the same data may need to be reused in different output structures. Rather than create a single file from the work of the collaborators, the various sub-components of the information set need to be treated as separate units that can be accessed in any relevant order. SGML's ability to create compound structures from a set of entities that represent logical packages of information, rather than physical files, makes it ideally suited to the management of collaborative work.
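As a minimal sketch of this approach (the document type, entity names and file names below are all hypothetical), a master document can pull in each collaborator's contribution as an external entity:

    <!DOCTYPE report SYSTEM "report.dtd" [
    <!ENTITY intro   SYSTEM "intro.sgm">
    <!ENTITY design  SYSTEM "design.sgm">
    <!ENTITY testing SYSTEM "testing.sgm">
    ]>
    <report>
    &intro;
    &design;
    &testing;
    </report>

Because the entities are logical packages rather than fixed physical files, the same contributions can be recombined, in a different order or selection, to build other deliverables.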
A set of interlinked files that form a deliverable data set can be thought of as a work web. The files can be combined into different sets of logical entities. These logical sets can be treated as individually addressable units as well as part of a larger construct. Larger structures can consist of either a simple sequence of logical units or a structured tree of information. Structured trees allow the way in which users find information to be guided by the application, rather than just relying on a sequential search of files. This is the method of accessing data which has traditionally been provided in books in the form of a contents list, but it is only recently that the idea of structuring data sets has become common in the computing industry.
There are three aspects to the creation of work webs:
ISO's Standard Generalized Markup Language can be used to formally define the structure of output documents. Once the model has been formally defined in an SGML Document Type Definition (DTD) it can be used to control and segment the data capture and management process. Each of the collaborators is assigned identified start and end points within the data set being created, so that the scope of his/her part of the project is clearly defined. By defining a set of elements that are relevant within each part of the structure it is possible to ensure that a common set of structuring rules is applied across a project, even when different parts are being tackled in different organizations, which may be using different tools for data generation.
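A fragment of such a DTD might look like the following sketch (the element names and content models are hypothetical; the two characters after each element name are SGML's tag-omission indicators):

    <!ELEMENT manual        - - (title, intro, chapter+, appendix*)>
    <!ELEMENT chapter       - - (title, section+)>
    <!ELEMENT section       - - (title, para+)>
    <!ELEMENT (title|para)  - O (#PCDATA)>

Each collaborator can be given responsibility for, say, one chapter, knowing that a validating parser will reject any contribution that does not fit the agreed model.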
ISO's Document Style Semantics and Specification Language (DSSSL) can be used to identify the house style to be assigned to each of the data elements identified by the SGML DTD. Unlike traditional style sheet approaches, DSSSL allows the formatting rules assigned to a particular element to change depending on the context in which the element occurs. For example, a paragraph in an appendix can be set in a smaller point size without having to assign it a different element name from a paragraph in the body of the text.
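In DSSSL's Scheme-based style language the appendix example might be sketched roughly as follows (the construction rules are simplified and the point sizes invented):

    (element para
      (make paragraph
        font-size: 10pt
        line-spacing: 12pt))

    (element (appendix para)
      (make paragraph
        font-size: 8pt
        line-spacing: 10pt))

The second rule only applies to paragraphs that occur within an appendix, so the same para element is formatted differently according to its context.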
For collaborative projects where multiple outputs are required DSSSL is particularly important as it allows multiple output structures to be generated from a single set of source files. This matters most in situations where a number of companies are collaborating on a project from which they wish to create document sets that conform to their own house styles. DSSSL can be used to turn files stored in a shared database into output documents that conform to each company's house style and data distribution structure.
Workflow management systems can be used to ensure that appropriate staff create, review and approve data elements in a timely manner. Each segment of an SGML document can be assigned attributes that can be used to ensure that appropriate control mechanisms have been, or will be, applied. Controls can be added to ensure that the document cannot be forwarded to the next stage of the process until these control attributes have been set to relevant values.
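One hypothetical way of attaching such controls is through attributes declared on each major element, along the following lines:

    <!ATTLIST chapter
              status    (draft|reviewed|approved)  draft
              author    CDATA   #REQUIRED
              reviewer  CDATA   #IMPLIED>

A workflow system can then refuse to pass a chapter on to the publication stage until, for example, its status attribute has been set to approved and a reviewer has been named.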
Managing a work web involves the integration of these three components. The work web takes the individual contributions and uses them to build both publishable documents and information trails. It can also be used as an audit method, checking that no required component is missing prior to publication. SGML's ability to ensure that items referenced by IDREF attributes occur in the same document set provides a simple control mechanism for ensuring that documents cannot be output until all components have been validated against the overall model for the output document.
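As a small illustration of this mechanism (the element and attribute names are hypothetical):

    <!ATTLIST figure  id     ID     #REQUIRED>
    <!ATTLIST figref  refid  IDREF  #REQUIRED>

    <figref refid="fig1">see the flow diagram</figref>

An SGML parser will report an error if no element with the identifier fig1 exists in the document set being validated, so the output can be held back until the missing component arrives.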
With a title like this you may be dreading a diatribe on the philosophical aspects of knowledge! Fear not - I am not about to put myself up as a whipping boy for future philosophical debates. What interests me more is the mechanics of wisdom and its related subject, knowledge.
In fact it is knowledge that I want to use as my starting point for this section. The OED defines knowledge in a number of ways, including:
The first of these definitions is the philosopher's one. People like Ted Nelson hope that something akin to a super-WWW will allow us all to have access to the sum of what is known. Unfortunately that will not happen in our lifetimes, if ever.
The second definition is more practical. It states that knowledge is the set of information that we can gain access to. However, it is difficult to define this set of information. For most of us the range of information we have encountered during our lifetime is much larger than the range of information that is currently within our reach. Yet it is not this currently restricted information set that is used to judge our knowledge, or wisdom.
By far the best definition of knowledge is the third one: that set of information that we are, through various aspects of our life, familiar with. Perhaps surprisingly, this definition of knowledge need not be restricted to the information we actually know. For example, you may not know someone's telephone number, but you may know how to use a telephone directory, or an enquiry service, to get the number you need. The knowledge you have therefore exceeds the range of information you have encountered to date.
Where does wisdom fit into this? The OED definition of wisdom is possession of experience and knowledge together with the power of applying them critically or practically. Note the stress on the combination of experience and knowledge with the ability to apply them. If we are to introduce wisdom into the mechanical process of information creation and dissemination we must provide a mechanism whereby experience can be captured and then applied in a practical manner.
No, I am not about to go off into the realms of knowledge bases and artificial intelligence, though these techniques are both adjuncts to the sort of processes that are needed for intelligent information management. What I want to look at in this paper is how we can enhance the documents we read by bringing our experience to bear.
Many of the most commonly accessed documents on the WWW are actually records of their creator's experience rather than actual sources of information. When you start using the WWW the first thing you probably do is go to the home page of the network provider. This will normally include pointers on where to go for more information on a number of subjects. In some cases you will be pointed to a search engine where an expert's experience of searching the Web for information has been captured in the form of a program that can undertake the search on your behalf, normally by employing a form to prompt you to state what you are searching for.
How can you record your experience so that it can be reused on the WWW, by yourself or others, at a later date, without having to repeat all the searching it took to gain it? To record your findings you need to be able to:
Information sources can be identified in a number of ways, some of which have been used for decades, others of which are only just becoming possible. Traditionally such references have been made in bibliographies, or within the text of documents. Typical forms of reference include:
For non-printed material, referencing is normally restricted to a catalogue reference of some sort, such as an International Standard Music Number (ISMN) or an International Standard Recording Code (ISRC), or to a reference to a particular segment of a recording, such as the first movement, or the opening shots, of a titled object.
One characteristic of the above types of reference is that they are all indirect: they rely on user intervention to find the source information. In these days of instant response to questions, users want instant access to information, and expect the computer to deliver the information they need to the desktop. Obviously this can only be done where the referenced information is available on-line.
What is the electronic equivalent of a citation? While it is possible to set up search engines that will take a standard form of citation and identify whether a copy of that document is available in a known set of repositories, such engines would not provide efficient access to data as the search would need to be performed each time the data is requested. A more efficient approach is to do the search once and record the result so that it can be reused. This is, in essence, the approach currently taken on the WWW, where the result of the query is stored in the form of a URL to the relevant document. There are, however, problems with this approach, as was pointed out earlier. In particular there is no way of knowing where data may have been moved to since the original search.
How can we overcome the limitations of permanently recorded URLs? One way is to record, alongside the result, details of the searches originally carried out to find the data, so that the original search can be repeated when necessary. A slightly more efficient method would be to set up a mechanism for identifying where files have been moved to. How can this be achieved?
The first requirement is that the component parts of the identifier be separated. For example, if the machine identifier used by the information's publisher is separated from the file location, then a change to the machine identifier can be made once for all related files, without having to trace the multitude of documents that reference that source. By providing directory services, such as those defined by ITU's X.500 standard, the name of a service provider can be divorced from the identification of the machine containing the required files.
The same applies to directory path names. Not only do these change as data is moved from machine to machine, but they also change from time to time on a single machine to accommodate the changes organizations make to either their data access structures or their overall management structures. If path names can be replaced by unique identifiers within the referencing document, these identifiers can then be resolved into the appropriate directory path string at the time the reference is traced.
Having ensured that we can track changes to the machine and directory identifiers, we need to provide a similar mechanism to ensure that we can redirect filenames when they have been changed from those originally referenced. Again a form of aliasing can be used. By giving each file a unique identifier, and then associating this with a file name that locates it in the selected directory, you can start to manage data sets much more efficiently.
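One existing realization of this kind of aliasing is the catalog format defined by the SGML Open consortium, in which a stable identifier is mapped to whatever file name currently holds the data (the identifier and file name shown are invented):

    PUBLIC "-//Example Corp//TEXT Product Guide//EN"  "guides/prodguide.sgm"

When the file is moved or renamed, only the catalog entry needs to change; every document that cites the stable identifier remains valid.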
Note: Those of you who have used the new Shortcut facility in Windows 95 will have seen a very simple example of how to decouple a file locator from its physical source.
The final problem is to identify where in the file the information being referred to sits. Here we have a major problem. In an unpaginated file, such as one coded in HTML, you cannot use page references to find information. If the file contains unique identifiers you can refer to these. If the file is suitably structured you can count things like chapters and sections to find the data you need. But if the file contains no identifying features, how can you find the relevant part of the file? The simplest answer is just to count the number of bits and bytes that need to be skipped to get to the referenced data. Whilst this works in some cases it does not work in all. For example, what happens if you want to make a reference to a particular part of a figure? This will not be stored as a contiguous set of bits and bytes: you will need to select some bits from one scan line, then some bits from an adjacent scan line, and so on. If the image is moving, your problem becomes even more complex!
The location model provided in ISO's Hypermedia/Time-based Structuring Language (HyTime) has been specifically designed to allow the various segments of a location's address to be decoupled. HyTime introduces the concept of a location ladder which allows you to identify a component part of a document using a number of small increments.
The first stage of the file identification process is to clearly identify the storage unit containing the required document. This can be done using Formal System Identifiers (FSIs) that conform to the rules specified in a recently submitted Annex to the HyTime standard. As well as providing a mechanism for specifying which storage managers have been used for the files, an attribute can be used to identify the base directory to be used for resolving partial file references.
SGML's standard external entity mechanism can be used to provide files with unique identifiers that are independent of file names. Each external entity referenced from a document is given a locally unique name within the referencing document. This unique term can either be directly linked to the system-specific file identifier, or it can be associated with a string of text that forms a globally unique public identifier for the file. Where relevant, the public identifier can be a formal public identifier, conforming to rules laid down in the SGML standard, or to the rules specified for public text object identifiers in ISO 9070. For a reference to a book the object identifier would typically be a formal reference to its International Standard Book Number (ISBN). For a journal article, reference would be made through the journal's International Standard Serial Number (ISSN).
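A sketch of the sort of declarations involved (the entity names, titles and ISBN shown are purely illustrative):

    <!ENTITY guide     SYSTEM "guides/prodguide.sgm">
    <!ENTITY handbook  PUBLIC "+//ISBN 0-000-00000-0//TEXT A Cited Handbook//EN"
                              "handbook.sgm">

The first declaration ties the local name guide directly to a system-specific file identifier; the second associates the local name handbook with a globally unique public identifier, here built on an ISBN in the ISO 9070 style, which each site can resolve to its own local copy of the file.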
Once the relevant file has been identified the referenced text can then be identified using one of a number of HyTime techniques, depending on the type of data being identified. Options include:
In addition, locations can be specified with respect to previously identified locations. For example, a location ladder can be created that looks for the article whose unique identifier is Bryan87, then looks for the second section element within that article, the first diagram within that section and then identifies a particular segment of that diagram.
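A rough sketch of such a ladder, using HyTime's location address architectural forms (the identifiers are hypothetical and the attributes have been simplified):

    <nameloc id="art">
      <nmlist nametype="element">Bryan87</nmlist>
    </nameloc>
    <treeloc id="sect2" locsrc="art">1 2</treeloc>
    <treeloc id="diag1" locsrc="sect2">1 1</treeloc>

Each rung resolves relative to the one before it: the nameloc finds the article from its unique identifier, each treeloc then counts its way down the document tree, and a further rung (a dataloc, for example) could finally select the required segment of the diagram's data.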
One of the advantages of splitting location identifiers into small sections in the way HyTime does is that it becomes easier to maintain links both within and between documents. Within documents you can identify components that were not provided with unique identifiers when the document was created. In addition you can identify overlapping data sets, so that the same data can serve multiple roles. When referencing between documents you can use unique public identifiers as a highly portable, shared form of object identification.
HyTime location addresses typically have a start and end point. In structured documents, such as those provided by SGML, the start and end of the selected object are automatically defined. The same is true for data identified from the position of their bits within the file. For other types of references users can identify the location of the start and end of a data span.
Another useful feature of HyTime is its ability to provide n-n links. This means that one reference can identify data in a number of documents, and that these points can, in turn, be referenced from more than one place without having to define a new link set each time.
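HyTime handles this through its independent link (ilink) form, which can be declared outside all of the documents involved. A rough sketch (identifiers hypothetical, attributes simplified):

    <nameloc id="sources">
      <nmlist nametype="element">passageA passageB</nmlist>
    </nameloc>
    <ilink id="obslink" anchrole="observation source"
           linkends="myobs sources">

Here a single observation is tied, through one aggregate location, to two named passages, and the same anchors can be reused by any number of other links without being redefined.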
But the real power behind HyTime object identifiers is that they do not need to be stored in either the source document or the referencing document. HyTime locators can be placed in a separate database, from which relevant sets can be selected to create reference documents that are associated with a particular application of the information set they interconnect. By allowing the pointers to be managed outside of the documents using them HyTime makes it possible to set up managed work webs that can operate on a worldwide platform.
Not only does HyTime allow locations to be defined outside of both the referencing and referenced documents, it also allows its data management files to contain other information. For example, they could contain a list of indexing terms that could be used to associate links between two documents, or two points in the same document, with a particular term, or with a process that depends on the selected term.
Because HyTime does not require you to include location descriptions within either the referencing or the referenced files, records of observations need no longer be cluttered up with phrases, numbers or icons pointing to other sources of data. With a HyTime system, for example, it becomes possible to identify the source of a particular observation as a note at the foot of the screen, in much the same way that helpful comments are now displayed by Windows programs when you place your cursor on a particular button. A button or key sequence can be provided to activate the currently displayed link so that the referenced data can be viewed alongside your observations. As we have seen with the development of operating systems, active responses of this type can make systems much more user-friendly.
While links between individual pieces of data can be created using standard point-and-click procedures, a number of additional features can be provided by a HyTime system. Because it can identify the start and end of the referenced area, a HyTime system can ask you to highlight the part of the referenced data being referred to. Because the same location can be linked to multiple places, a HyTime system can ask you whether you want to link all occurrences of a particular phrase or structure to the referenced data. In addition a HyTime system can associate processes with links, so that a particular action takes place every time that link is selected or displayed.
Even more exciting is the prospect of automating observations. If, for instance, you have already defined a link that connects a particular phrase to an observation, this phrase can then be used to link your observations automatically to new documents. In this way your observations themselves become active, seeking out new occurrences of information that could benefit from your experience.
To help this process HyTime allows locations to be identified using queries rather than absolute addresses. By regularly running these queries against expanding or changing data sets it is possible to build up n-n link sets related to a particular topic. Because HyTime links can be navigated in both directions it is possible to use these queries as a starting point for finding both information and observations related to a particular topic.
Live observations can be self-broadcasting. If you extend the original search to include data sources around the world, the links that your queries create with your observations can then be made available to others. By providing others with access to your set of links you can also use their expertise to extend yours.
With HyTime it becomes possible to subset existing links to create a new set of broadcastable data. Because the links can be maintained independently of the associated text it becomes possible to differentiate between the selection processes applied to links and to files. For example, it would become possible to create a set of observations that are specific to items stored on file servers located in the UK, or provided by a specific organization.
The main reason for the success of the WWW has not, as most commentators seem to suggest, simply been the vast amounts of data it provides access to. What has really caused the explosion is the ease with which you have been able, over the last few years, to find links that represent other people's experience of the relationships between data. No longer do we need to rely solely on our own experience. Now we have access to a worldwide web of experience that can help us to extend the bounds of our knowledge. HyTime can help overcome some of the limitations of the current direct referencing methods used on the WWW by allowing us to manage our experience records more efficiently.
At present, HyTime-based systems, such as SoftQuad's Panorama browser for SGML-coded files transmitted over the Internet, have limited capabilities for link management, and only offer a subset of the wide range of addressing options provided by HyTime. Very few systems are able to utilise HyTime's ability to address audiovisual and sound data in addition to text, or its facilities for scheduling the display of audiovisual and textual material. As complex applications requiring the integration of text and audiovisual material develop, the need to manage the inter-relationships between the various types of data will come to be appreciated more fully. The management, over a long period of time, of complex data sets of the types now beginning to appear is the realm of HyTime. Over the next decade it is probable that HyTime will follow the trend seen for SGML, where initial resistance to the controls imposed by SGML has given way to acceptance that data management is a vital part of long-term information dissemination policies.
One final thought. Perhaps it is time the OED revised its definition of knowledge to keep track of what the Internet has taught us. Dare I suggest that in the next edition it should read the sum of recorded human experience?
Martin Bryan is an Information Management Consultant based in Churchdown, Gloucestershire. As Chairman of the BSI panel responsible for monitoring the development of SGML, HyTime and related standards he provides a vital link between UK industry and the international standards community. As a member of the team working on the European Commission's Open Information Interchange (OII) initiative he is responsible for ensuring that information on currently available, and forthcoming, standards for information interchange is available throughout Europe.