[Mirrored from: http://www.ukoln.ac.uk/ariadne/issue6/sgml/intro.html]
David Houghton
International Institute for Electronic Library Research,
Division of Learning Development,
De Montfort University,
The Gateway, Leicester LE1 9BH
Email: djh@dmu.ac.uk
This article discusses a method by which documents marked up using Standard Generalised Markup Language (SGML) can be used to generate a database for use in conjunction with the World Wide Web. The tools discussed in this article and those that were used in experiments are all public domain or shareware packages. This demonstrates that the power and flexibility of SGML can be utilised by the Internet community at little or no cost. The motivation for this work stems from the lack of standardisation on display techniques for SGML presentation.
Ever since the World Wide Web came into being in the early 1990s, the SGML community has been excited about the possibility of realising the potential of such a markup method on a global scale. The concept of an SGML WWW browser became a real possibility. SGML was suddenly recognised as the parent of Hypertext Markup Language (HTML), and as such it could be used to develop the next generation of Web browsers.
Sadly, the initial enthusiasm of the SGML community has been dampened by the software industry's failure to pick up on the concept of SGML use, despite the efforts of such people as C. M. Sperberg-McQueen and Robert F. Goldstein [1]. The reasons for this failure are open to dispute, but chief among them is the question of how SGML should be presented. The lack of a suitable standard in this area has led SGML product manufacturers to develop their own methods of presentation. As far as standardisation of HTML is concerned, the dominance of the Netscape WWW browser has complicated this issue.
Given this state of affairs, how can SGML users harness the power of the Internet and the flexibility of their data? The solutions presented in this report relate to the use of public domain and shareware products to provide a mechanism for using SGML-based documents in a WWW environment. There are no doubt many commercial alternatives, but as we shall discuss shortly, the presentation of SGML is the major problem.
Those companies and institutes who use SGML on a regular basis will no doubt be aware of the issue of presentation. Unlike other markup methods such as LaTeX and Word, SGML documents themselves have little or no presentation markup information. This is because they are written using the concept of logical markup rather than presentation. SGML emphasises the structure of the document rather than how it appears. This makes it possible to construct documents that are independent of the system for representing the document.
For those readers who are not familiar with this principle it is suggested that references [3] and [5] should be consulted. SGML's power lies in the fact that logical documents can be manipulated and used in a wide range of applications such as databases, without the overheads that relate to presentation aspects.
So how are SGML documents presented? There is no easy answer to this, as the method of presentation depends on the software product used to 'view' the documents. There is at present a whole range of products, from an ever-increasing number of vendors, that attempt to provide an easy-to-use and flexible presentation method. The list of public and commercial products provided in [6] illustrates the huge range of packages available. The reader will soon discover that the common element of all these packages is the lack of standardisation on presentation method.
In order to overcome this lack of standardisation, a great deal of effort has gone into producing ISO/IEC DIS 10179.2, the Document Style Semantics and Specification Language (DSSSL). Unfortunately, at the time of writing, no software manufacturer has implemented this standard.
The pragmatic approach taken in this report is to accept the lack of a presentation standard and to accept that products such as Netscape provide sufficient flexibility to offer an 'acceptable' viewing platform. Of course, Netscape uses HTML as its markup language and includes vendor-specific features. This fact too will require some digestion.
The motivation for this study has arisen from work carried out as part of the Electronic SGML Applications (ELSA) [14] project at the IIELR, De Montfort University. The ELSA project is concerned with the investigation of the use of SGML as a method of delivering scientific journal articles for use in an electronic library environment. The SGML articles were provided by Elsevier Science.
The work carried out for the ELSA project was intended to demonstrate that an on-line journal article service could be set up easily and efficiently. A prototype system using HTML versions of the SGML articles was set up as described in the following sections. The primary goal of the prototype system was to assess the user interface and to gain valuable information on the users' reaction to such a system.
The problem of converting documents marked up in SGML into some form of HTML has been solved by numerous methods. In essence, the DSSSL standard mentioned above, when implemented, may remove the necessity of even this step, as DSSSL includes a transformation process (the SGML Tree Transformation Process, STTP). This study uses the Copenhagen SGML Tool (CoST), a publicly available product developed by several people and available from [8]. The methods of conversion we have used in this study are based on the use of UNIX as the operating platform. This is because the majority of public domain products used in the study are only available for UNIX, and because the processing power associated with computers running UNIX is required for dealing with large numbers of documents.
CoST (Copenhagen SGML Tool) is a structure-controlled SGML application programming tool that uses TCL (Tool Command Language) as the programming language and the SGMLS SGML parser written by James Clark. Details of this converter can be found in the accompanying documentation of [8].
CoST enables the user to 'map' SGML elements and entities to a corresponding target markup format. The target format need not be SGML but can be any format that can be written in an ASCII format. The example in Appendix 1 shows how CoST can be used to translate SGML into LaTeX.
Using CoST it has been possible to map a given DTD into a corresponding HTML equivalent. It should be pointed out at this point that the limitations of HTML for presentation markup require compromises to be made and lead inevitably to a reduction in 'richness'. The transformation process cannot be a 1 to 1 mapping for the majority of DTDs and so acceptable compromises must be sought. Appendix 2 shows a sample of the CoST conversion mapping.
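The mapping idea can be illustrated outside CoST. The following is a minimal sketch in Python, not the CoST implementation: it assumes a small, well-formed sample document (hypothetical) that an XML parser can read, and it reuses a handful of the start and end strings listed in Appendix 2. Real SGML input would still need an SGML parser such as SGMLS.

```python
import xml.etree.ElementTree as ET

# (start, end) strings per element, copied from a few Appendix 2 mappings.
MAPPING = {
    "ATL": ("<H1>", "</H1><P>"),
    "P": ("<P>", ""),
    "BF": ("<B>", "</B>"),
    "ST": ("<BR><H3>", "</H3><BR>"),
}


def convert(node):
    """Emit the start string, the element content, then the end string."""
    start, end = MAPPING.get(node.tag, ("", ""))  # unmapped elements add no markup
    parts = [start, node.text or ""]
    for child in node:
        parts.append(convert(child))
        parts.append(child.tail or "")  # text that follows the child element
    parts.append(end)
    return "".join(parts)


# Hypothetical sample, kept well-formed so that ElementTree can parse it.
SAMPLE = "<ART><ATL>A Title</ATL><P>Some <BF>bold</BF> text.</P></ART>"

html = convert(ET.fromstring(SAMPLE))
print(html)
```

Elements without an entry in the table simply pass their content through, which is one way the loss of 'richness' described above shows up in practice.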
Having made the SGML to HTML conversion, there are a number of other factors that now need to be addressed. The presentation method of some SGML viewers requires that features such as external figures or pictures be treated in a particular way. Some viewers require figures to be treated as external entities that need defining in the DTD, while others use the <LINK> feature of SGML. It should be borne in mind that the original SGML document may have features that HTML browsers cannot handle. It is therefore recommended that HTML be studied in detail before any SGML transformation process is undertaken.
Additional HTML browser features such as the use of BASE REF and BGCOLOR will need to be added to the target HTML by using UNIX scripts.
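As an illustration of this kind of post-processing, the sketch below adds a BGCOLOR attribute to the BODY tag of a converted page. It is written in Python rather than as a UNIX script, the function name is hypothetical, and the colour value is simply the one that appears in the search page of Appendix 6.

```python
import re


def add_bgcolor(html, colour="#FFE6C9"):
    """Add a BGCOLOR attribute to the first <BODY> tag that lacks one."""

    def patch(match):
        tag = match.group(0)
        if re.search(r"\bBGCOLOR\s*=", tag, re.IGNORECASE):
            return tag  # already coloured, leave the tag untouched
        return tag[:-1] + ' BGCOLOR="' + colour + '">'

    return re.sub(r"<BODY\b[^>]*>", patch, html, count=1, flags=re.IGNORECASE)


page = add_bgcolor("<HTML><BODY><P>text</BODY></HTML>")
print(page)
```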
Image formats that are supported in Netscape include GIF and JPEG. If the source images are not in this format then it will be necessary to perform conversions. If the amount of data is large, a UNIX script will be required and tools such as convert from ImageMagick will need to be employed. Appendix 3 shows a sample of such a UNIX conversion script.
Often it will be necessary to change the size of images and/or make the images transparent. Again a UNIX script is ideal for this and an example of image size conversion is illustrated in Appendix 3.
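The three steps applied to each image in Appendix 3 (resize and reduce colours, convert to interlaced GIF, make one colour transparent) can be sketched as follows. This Python fragment only builds the command lines and does not run them; the function name is hypothetical, and the options are those used in the Appendix 3 script.

```python
from pathlib import Path


def conversion_commands(tif_name, width=600):
    """Return the argument lists for the three steps applied to one TIFF file."""
    stem = Path(tif_name).stem
    resized = f"{stem}.new.tif"
    gif = f"{stem}.gif"
    return [
        # Step 1: scale to the given width and reduce to two colours.
        ["convert", "-geometry", str(width), "-colors", "2",
         f"tif:{tif_name}", f"tif:{resized}"],
        # Step 2: write an interlaced GIF.
        ["convert", "-interlace", "LINE", f"tif:{resized}", f"gif:{gif}"],
        # Step 3: giftrans writes the transparent GIF to standard output,
        # so the caller must redirect it to a file, as Appendix 3 does.
        ["giftrans", "-t", "1", gif],
    ]


commands = conversion_commands("fig1.tif")
for argv in commands:
    print(" ".join(argv))
```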
Once information has been transformed to HTML, the majority of users will require a new database system to be set up. SGML data may well be structured in the form of a database that uses the SGML fields. Although HTML will be used for presenting documents, there is no reason why the original SGML database cannot be used as long as it is accessible from the WWW. Proprietary database systems may present a problem of access and so a new database may require setting up. A typical SGML database is described in Appendix 4 and involves several thousand journal articles. The articles are arranged in directories that relate to the specific journals that they are associated with. Each journal is associated with a subject area. No attempt has been made to cross reference material in this database. Journals that appear in more than one subject area appear to be duplicated via the use of UNIX symbolic links.
The WWW and Internet communities have adopted a number of database technologies; amongst them FreeWAIS-SF [9] stands out as being the most powerful and flexible. Other database systems exist that may be of more use to the reader. These include Glimpse [10], ICE [11] and Harvest [12]. This study used the freeWAIS-SF package on a DEC Alpha OSF platform which allowed the database to be indexed and queried via a client-server architecture. The actual techniques adopted are discussed in [9] and an example indexing and query set are shown in Appendix 5.
The user interface to the system described in this study is a set of HTML front end pages that enable users to browse and assemble queries. The input data from the query page is converted by a CGI script to interface with the database search mechanism in use.
In order to provide an HTML front end page for the HTML documents, the kidofwais.pl Perl script written by Mike Grady [13] was modified so as to provide support for HTML forms. Users enter information into this search form, and the parameters for the search are extracted and sent to the server in the form of a WAIS query command.
The ability to browse the database is provided by a subject area selection page and a journal list page. The latter allows the user to select information via a journal cover thumbnail image or an articles image. The thumbnail is linked to the journal information held on the journal home server, while the articles image is linked to a list of known articles in the current database for that specific journal.
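The step from form parameters to a query string might look like the following Python sketch. It is not the modified kidofwais.pl script, which is not reproduced in this article. The field names term1, op1, term2, op2 and term3 are taken from the search form in Appendix 6; the lower-case and/or operators in the resulting string are an assumption about the query syntax.

```python
def build_query(form):
    """Join the non-empty search terms with the operators chosen between them."""
    parts = []
    operator = "and"
    for index in (1, 2, 3):
        term = form.get(f"term{index}", "").strip()
        if term:
            if parts:
                parts.append(operator)
            parts.append(term)
        # Remember the operator that follows this box; it joins whatever
        # non-empty term comes next (there is no op3, so AND is the default).
        operator = form.get(f"op{index}", "AND").lower()
    return " ".join(parts)


query = build_query(
    {"term1": "astro*", "op1": "AND", "term2": "physics", "op2": "OR", "term3": ""}
)
print(query)
```

Empty boxes are skipped, so a user who fills in only one term gets a plain single-word query.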
The front end pages for browsing and searching are shown in Appendix 6.
A test bed of 5000 SGML files has been set up on a Digital Alpha Workstation and indexed using FreeWAIS. Access to this database is via an HTML form that allows users to browse and search. Although the database is potentially available to the WWW user community, access is restricted by the use of the htaccess mechanism [15] to the De Montfort University domain user base. The freeWAIS search engine is sufficiently fast for local use, although response times across the Internet are, of course, unpredictable as network congestion needs to be considered. However, this is a matter which is outside the scope of this project.
Complex search queries are still not possible with FreeWAIS, but boolean algebra is supported and provides sufficient functionality to make the system useful.
In order to evaluate the system a questionnaire has been designed and will be used on-line in trials on the university campus-wide network. The system will provide a method of delivering online electronic journal texts originally marked up in SGML using the publicly available Elsevier 2.0.3 DTD.
The study has shown that by using public domain software it is possible to provide a powerful and useful database system that allows full text search and retrieval. Whilst the data has been transformed from SGML into its HTML equivalent form and hence lost an element of its 'richness', it has been converted into a form that is more accessible by a larger community. It is worth noting that by attracting a larger audience the cost of production remains constant while circulation has effectively and significantly increased.
The author wishes to thank Elsevier Science for its support and permission to reproduce for this study extracts of its set of scientific journals provided to De Montfort University via its collaboration in Project ELSA.
The author is also eternally grateful to his colleagues Owen Williams and Anil Sharma for their support and understanding, and for their help in setting up the experiments mentioned in this report.
The author would also like to thank the people who have dedicated long hours to the production of the publicly available software packages that have been used in this project.
element ART {
    start {puts stdout "\\documentstyle\{article\}\\begin\{document\}"}
    end {puts stdout "\\end\{document\}"}
}
element TITLE {
    start {puts stdout "\\begin\{center\}\{\\LARGE \\bf"}
    end {puts stdout "\}\\end\{center\}"}
}
element PAR {
    start {puts stdout "\\vspace\{0.25in\}"}
}
element REF {
    start {puts stdout "\{\\it"}
    end {puts stdout "\}\\ \\ "}
}
element ADDRESS {
    start {puts stdout "\{\\it"}
    end {puts stdout "\}"}
}
element SECTION {
    start {puts stdout "\{\\section*\{"}
    end {puts stdout "\}"}
}
element ATL {
    start { puts "<H1>" }
    end { puts "</H1><P>" }
}
element IT {
    start { puts stdout "<A>" }
    end { puts stdout "</A>" }
}
element P {
    start { puts stdout "<P>" }
}
element BB {
    end { puts stdout "<BR>" }
}
element BIBL {
    start { puts stdout "<P><H2>Bibliography</H2><BR><ADDRESS>" }
}
element AU {
    start { puts stdout "<H2>" }
    end { puts stdout "</H2>" }
}
element SNM {
    start { puts stdout " " }
}
element COR {
    start { puts stdout "<BR><ADDRESS>" }
    end { puts stdout "</ADDRESS><BR>" }
}
element AFF {
    start { puts stdout "<BR><H3>Affiliation</H3><ADDRESS>" }
    end { puts stdout "</ADDRESS><BR>" }
}
element RV {
    start { puts stdout " " }
}
element ABS {
    start { puts stdout "<BR><H3>Abstract</H3> <ADDRESS> " }
    end { puts stdout "</ADDRESS>" }
}
element BDY { }
element BF {
    start { puts stdout "<B>" }
    end { puts stdout "</B>" }
}
element ST {
    start { puts stdout "<BR><H3>" }
    end { puts stdout "</H3><BR>" }
}
element KWDG {
    start { puts stdout "<BR><H4>Keywords : </H4>" }
}
element KWD {
    start { puts stdout " " }
}
element SUP {
    start { puts stdout "\^" }
}
element TBL {
    start {
        if {[attrValue ID]=="table_1"} {
            puts stdout "<IMG SRC=table_1.gif>"
        }
    }
}
element FIG {
    start {
        if {[isImplicit ID]} {
            puts stdout "No Fig ID"
        } else {
            if {[attrValue ID]=="1"} {
                puts stdout "<IMG SRC=fig1.gif>"
            }
            if {[attrValue ID]=="2"} {
                puts stdout "<IMG SRC=fig2.gif>"
            }
            if {[attrValue ID]=="scheme_1"} {
                puts stdout "<IMG SRC=scheme_1.gif>"
            }
        }
    }
    end { puts stdout "<P>" }
}
SDATA {
    case $data in {
        {˜} {set out \~}
        {<} {set out \{lt\}}
        {ß} {set out \{szlig\}}
        {°} {set out \{deg\}}
        {−} {set out -}
        {"} {set out \"}
        {¯} {set out \^}
        {π} {set out \{pi\} }
        {∫} {set out \{int\}}
        {ρ} {set out \{rho\}}
        {λ} {set out \{lambda\}}
        {μ} {set out \{mu\}}
        {β} {set out \{beta\}}
        {α} {set out \{alpha\}}
        {γ} {set out \{gamma\}}
        {∝} {set out \{prop\}}
        {′} {set out \{prime\}}
        {×} {set out *}
        {±} {set out \{plusmn\}}
        {–} {set out -}
        {•} {set out \{bull\}}
        {○} {set out \{cir\}}
        {&} {set out \& }
        {□} {set out \{squ\}}
        default {set out ? }
    }
    puts stdout $out nonewline
}
#!/bin/ksh
set -x
TOP=/connect/mail/djh/elsevier/data
PATH=$PATH:/connect/mail/djh
JOURNALS="jpms ns cs corel infman lrp aca mlblue"
for journal in $JOURNALS
do
    cd $TOP/$journal
    for article in *
    do
        cd $TOP/$journal/$article
        for tif in *.tif
        do
            file=${tif%*.tif}
            if [ ! -f ${file}.gif ]
            then
                convert -geometry 600 -colors 2 tif:${file}.tif \
                    tif:${file}.new.tif
                convert -interlace LINE tif:${file}.new.tif \
                    gif:${file}.gif
                giftrans -t 1 ${file}.gif > ${file}.tmp.gif
                rm -f ${file}.new.tif
                mv ${file}.tmp.gif ${file}.gif
            fi
        done
    done
done
The above diagram shows the hierarchy of directories for the experimental SGML/HTML database.
The SA numbers reflect a Subject Area division defined by the data authors.
The fourth column represents the Journal ID codes, for example, cs could be Computer Science.
Each article is uniquely defined by an article ID number which is used to name the corresponding HTML file.
cat listoffiles | waisindex -t URL /usr/local/elsa/public_html \
    http://www.elsa.dmu.ac.uk/~elsa -d \
    /elsa3/elsa-waisindex/GASS -t fields \
    -stdin
where listoffiles is something of the format
/usr/local/elsa/public_html/GASS/aca/00016451/00016451.html
/usr/local/elsa/public_html/GASS/aca/00016452/00016452.html
/usr/local/elsa/public_html/GASS/aca/00016453/00016453.html
Where
/usr/local/elsa/public_html is the bit to chop off
http://www.elsa.dmu.ac.uk/~elsa is the bit to add
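The effect of the two URL arguments on each indexed file name can be sketched as below. This is a Python illustration of the prefix substitution only, not of how waisindex implements it; the function name is hypothetical and the two prefixes are the ones given above.

```python
TRIM = "/usr/local/elsa/public_html"   # the bit to chop off
ADD = "http://www.elsa.dmu.ac.uk/~elsa"  # the bit to add


def path_to_url(path, trim=TRIM, add=ADD):
    """Replace the leading filesystem prefix with the public URL prefix."""
    if not path.startswith(trim):
        raise ValueError(f"{path} is not under {trim}")
    return add + path[len(trim):]


url = path_to_url("/usr/local/elsa/public_html/GASS/aca/00016451/00016451.html")
print(url)
```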
SFgate, the freeWAIS-SF gateway, is VERY customizable and can be set up to do all sorts of interesting things. The Search HTML source code for GASS is provided in Appendix 6.
When compiling SFgate, you must give it an application directory, something like elsa/public_html/SFgate. This is where things like headers and footers are kept.
If you want headers and footers, then you add the tag <INPUT TYPE="hidden" NAME="application" VALUE="FILEPREFIX">. A FILEPREFIX of GASS would tell SFgate to insert the header and footer files GASS_header and GASS_footer from the application directory (compiled in to be /usr/local/elsa/public_html/SFgate).
The maxhits is set to 40.
The database field is used to specify the database to use: local/GASS means use the local GASS database. The location of the local database files is compiled in to be /elsa3/elsa-waisindex/
SFgate can also be used to search databases over the WWW. The hidden type is used so that the field doesn't show up as a button. A menu could be used to select different databases to be searched.
With SFgate, you can either have a text field that is dedicated solely to doing one type of search (e.g. an author search), or you can tell it that the text field searches whichever field the user picks from a specified list. We needed the second, more complex, way ....
<INPUT SIZE=35 NAME="fieldsel_0_content">
<SELECT NAME="fieldsel_0_name">
<OPTION VALUE="ke" SELECTED>Keyword
<OPTION VALUE="ti">Title
<OPTION VALUE="au">Author
<OPTION VALUE="ab">Abstract
<OPTION VALUE="bi">Bibliographic
<OPTION VALUE="text">Default
</SELECT>
As you can see, there are 2 parts to it: the text field <INPUT SIZE=35 NAME="fieldsel_0_content"> and the selector <SELECT NAME="fieldsel_0_name">
To bind the two together, both have to have the identical prefix, here fieldsel_0_. (Also note that fieldsel_0_ is the value given to group_1 earlier.)
content is the word that you will be looking for; name is the name of the database field that will be looked in. ke, ti and so forth are the database fields, as specified in the GASS.fmt file, which is used during indexing.
<SELECT NAME="group_2_tie">
<OPTION VALUE="or" SELECTED>OR
<OPTION VALUE="and">AND
</SELECT>
Basically, this links the first field search to the second. The selector attaches itself to the second field search through its name: group_2_tie -> group2 -> fieldsel_1 (the name of the second field).
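The naming convention can be illustrated by pairing up the submitted parameters on their shared prefix. The Python sketch below is only an illustration of that convention and is not taken from SFgate; the function name and the sample values are hypothetical, and falling back to the text field when no selector is present is an assumption based on the "Default" option above.

```python
def paired_fields(form):
    """Group '<prefix>content' and '<prefix>name' parameters by shared prefix."""
    pairs = []
    for key in sorted(form):
        if key.endswith("_content") and form[key].strip():
            prefix = key[: -len("content")]  # e.g. "fieldsel_0_"
            field = form.get(prefix + "name", "text")  # assumed default field
            pairs.append((field, form[key].strip()))
    return pairs


# Hypothetical submission: a title search tied to an author search.
form = {
    "fieldsel_0_content": "markup",
    "fieldsel_0_name": "ti",
    "fieldsel_1_content": "houghton",
    "fieldsel_1_name": "au",
    "group_2_tie": "and",
}
pairs = paired_fields(form)
print(pairs)
```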
<HEAD>
<TITLE>Search the Elsa GASS Collection</TITLE>
</HEAD>
<BODY BGCOLOR="#FFE6C9">
<BASE REF="">
<H2><IMG SRC="/~elsa/images/tree.gif" ALIGN=MIDDLE> Search the GASS Database</H2>
<P>
<FORM METHOD="POST" ACTION="/cgi-bin/GASStest.pl">
<B>Search Term 1:</B> <INPUT NAME="term1" SIZE="40">
<SELECT NAME="op1">
<OPTION VALUE="AND">AND
<OPTION VALUE="OR">OR
</SELECT>
<BR>
<B>Search Term 2:</B> <INPUT NAME="term2" SIZE="40">
<SELECT NAME="op2">
<OPTION VALUE="AND">AND
<OPTION VALUE="OR">OR
</SELECT>
<BR>
<B>Search Term 3: </B> <INPUT NAME="term3" SIZE="40">
<P>
<B>Maximum Number Of Hits: </B> <INPUT NAME="MaximumHits" SIZE="3" VALUE=60>
<P>
<INPUT TYPE="SUBMIT" VALUE="Submit Query">
<INPUT TYPE="RESET" VALUE="Reset Form">
<P><HR>
</FORM>
The <I>Elsa GASS Database</I> can be searched using the interface shown above.
In order to search the database, search terms can be entered into any of the
<I>Search Term Boxes</I> provided. A search term consists of any word the user
wishes to search for.<P>
By using the <I>Boolean Operators</I>, <B>AND</B> and <B>OR</B>, the user can
restrict or broaden their query so as to receive as much useful information as
they require.
<P>
As well as providing support for Boolean Searches, the Elsa GASS Search
interface also provides support for <B>Right-hand truncation </B>. This feature
allows a user to enter a search term, such as '<I>astro*</I>' and the search
will return documents containing the words '<I>astrophysics</I>' and
'<I>astronomy</I>'. The Right-hand truncation operator is the asterisk
(<B>*</B>).
<P>
<A HREF="/~elsa/GASS/Search/help.html"><B>Example searches</B></A> can be
found, by following this link.<P>
<HR>
This page is maintained by <A HREF=http://www.elsa.dmu.ac.uk/~djh>
D.Houghton</A>, and it was last modified on Apr 12, 1996.<p>
<HR>
<A HREF ="/~elsa/GASS/Search"> <IMG SRC="/~elsa/Search.gif" ALT="[Search]"></A>
<A HREF ="/~elsa/GASS/TheJournals"> <IMG SRC="/~elsa/elsa-brws.gif" ALT="[Browse]"></A>
<A HREF ="/~elsa/GASS"> <IMG SRC="/~elsa/elsa-back.gif" ALT="[Home]"></A>
<A HREF ="/~djh/ELSA/research/feedback.html"> <IMG SRC="/~elsa/elsa-fdbk.gif" ALT="[Feedback]"></A>
<A HREF = "/~elsa/GASScopyright.html"> <IMG SRC="/~elsa/copyright.gif" ALT="[Copyright]"></A>
</BODY>

<HTML><HEAD>
<TITLE>Browse the ELSA Database</TITLE>
</HEAD>
<BODY>
<H2><IMG SRC="/~elsa/images/tree.gif" ALIGN="MIDDLE"> Browse the ELSA Database</H2>
<P>
<HR SIZE=5>
<H3>Areas</H3>
<UL>
<LI> <A HREF="/~elsa/TheJournals/SA0/">Multidiscipline</A>
<LI> <A HREF="/~elsa/TheJournals/SA1/">Life and Medical Sciences</A>
<LI> <A HREF="/~elsa/TheJournals/SA2/">Physical &amp; Environmental Sciences</A>
<LI> <A HREF="/~elsa/TheJournals/SA3/">Materials Science</A>
<LI> <A HREF="/~elsa/TheJournals/SA4/">Engineering</A>
<LI> <A HREF="/~elsa/TheJournals/SA5/">Social / Behavioral Sciences and Humanities</A>
<LI> <A HREF="/~elsa/TheJournals/all.html">All Journals</A>
</UL>
<HR>
<A HREF ="/~elsa/Search"> <IMG SRC="/~elsa/elsa-srch.gif" ALT="[Search]"></A>
<A HREF ="/~elsa/TheJournals"> <IMG SRC="/~elsa/elsa-brws.gif" ALT="[Browse]"></A>
<A HREF ="/~elsa"> <IMG SRC="/~elsa/elsa-back.gif" ALT="[Home]"></A>
<A HREF ="http://www.cms.dmu.ac.uk/~djh/ELSA/research/feedback.html"> <IMG SRC="/~elsa/elsa-fdbk.gif" ALT="[Feedback]"></A>
<A HREF ="http://www.cms.dmu.ac.uk/~djh/ELSA/research/copyright.html"> <IMG SRC="/~elsa/elsa-copy.gif" ALT="[Copyright]"></A>
</BODY>
</HTML>
Material on this page is copyright Ariadne/original authors. This page last updated on November 20th 1996