[Mirrored from: http://www5conf.inria.fr/fich_html/slides/dday/sgml/all.htm]
An SGML-Based Web Server
Jon Bosak, SunSoft
What is SGML?
- Standard Generalized Markup Language: the international standard
for structured document interchange, ISO 8879 (1986)
- Developed out of a search for a universal typesetting language
begun in the late 1960s
- Descriptive, not procedural - the procedures are left to
formatting/presentation systems
- Not a single markup language, but a metalanguage for the
specification of an unlimited number of markup languages, each
optimized for a particular category of documents
- The SGML description of a markup language is called a Document
Type Definition (DTD)
Major industry DTDs (markup languages)
ATA 2100  | aircraft industry
CALS      | military, aerospace
CMC       | pharmaceuticals
PCIS      | semiconductors
DocBook   | computer software
IBMIDDoc  | IBM software
SAE J2008 | automobile manufacturing
TMC T2008 | truck manufacturing
TIM       | telecommunications
EDGAR     | Securities and Exchange Commission
ISO 12083 | journal, book, and magazine publishing
ICADD     | publishing for the print-disabled
TEI       | academic and scholarly publishing
UTF       | news media
HTML      | World Wide Web
HTML is just one of many standardized special-purpose SGML markup languages.
HTML vs. most other SGML languages
Basic document model from DocBook:
<!ELEMENT Book - - ((Title, TitleAbbrev?)?, BookInfo?, ToC?, LoT*, Preface*,
(((%chapter.gp;)+, Reference*) | Part+ | Reference+ |
Article+), (%appendix.gp;)*, Glossary?, Bibliography?,
(%index.gp;)*, LoT*, ToC? ) +(%ubiq.gp;) >
[...]
<!ELEMENT Chapter - - (DocInfo?, Title, TitleAbbrev?, (%sect1.gp;), (Index |
Glossary | Bibliography)*) +(%ubiq.gp;) >
[...previously defined:]
<!ENTITY % sect1.gp "((%component.gp;)+, (Sect1* | RefEntry*)) | Sect1+ |
RefEntry+" >
[...]
<!ELEMENT Sect1 - - (Title, TitleAbbrev?, (%nav.gp;)*, (((%component.gp;)+,
(RefEntry* | Sect2*)) | RefEntry+ | Sect2+), (%nav.gp;)*)
+(%ubiq.gp;) >
Basic document model from HTML 2.0:
<!ENTITY % html.content "HEAD, BODY">
<!ELEMENT HTML O O (%html.content)>
<!ENTITY % body.content "(%heading | %text | %block | HR | ADDRESS)*">
<!ELEMENT BODY O O %body.content>
HTML documents differ from documents marked up in most other
standard SGML languages in that they lack a controlled hierarchical
structure.
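The contrast can be sketched in Python by treating each content model as a regular expression over the names of an element's children. Both models below are drastically simplified stand-ins invented for illustration (the names STRICT_CHAPTER, FLAT_BODY, and conforms are assumptions); they imitate the shape of the DTD fragments above and are not the real DocBook or HTML 2.0 declarations.

```python
import re

# Hypothetical, simplified content models written as regular expressions
# over space-terminated child element names.
STRICT_CHAPTER = re.compile(r"(Title )(Para )*(Sect1 )*")
FLAT_BODY = re.compile(r"((H1|H2|H3|P|UL|HR|ADDRESS) )*")


def conforms(model, children):
    """Return True if the sequence of child names matches the content model."""
    return model.fullmatch("".join(name + " " for name in children)) is not None


# A chapter must start with a title; a section cannot precede it.
good_chapter = ["Title", "Para", "Sect1", "Sect1"]
bad_chapter = ["Sect1", "Title", "Para"]

# The flat model accepts headings in any order, so an H3 before an H1 passes.
jumbled_body = ["H3", "P", "H1", "HR", "P"]

print(conforms(STRICT_CHAPTER, good_chapter))
print(conforms(STRICT_CHAPTER, bad_chapter))
print(conforms(FLAT_BODY, jumbled_body))
```

The hierarchical model rejects the out-of-order chapter, while the flat model has no basis for rejecting a heading sequence that makes no structural sense.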
Implications of the HTML content model
- Difficult or impossible to validate document data structures (so
that documents can be safely dropped into a database, for example) or
to impose editorial control in projects with multiple authors
- Difficult or impossible to automatically generate navigational
aids (dynamic tables of contents, etc.) directly from the document
itself
- Navigation must generally be implemented by adding handcrafted
hypertext links
- HTML browsers have no concept of entity management (no modular
reuse)
- Context searching becomes difficult or impossible
HTML is too limited to serve as an adequate data format for
large-scale commercial publishing.
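As a rough illustration of context searching, the sketch below restricts a word search to one element type. It uses XML and Python's standard xml.etree.ElementTree as a convenient stand-in for an SGML parser (an assumption, since Python has no SGML parser in its standard library); the sample document and the search_in_context helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document with DocBook-like element names.
DOC = """
<Book>
  <Chapter><Title>Installing the server</Title>
    <Para>Backup is covered in a later chapter.</Para>
  </Chapter>
  <Chapter><Title>Backup procedures</Title>
    <Para>Run a backup before installing updates.</Para>
  </Chapter>
</Book>
"""


def search_in_context(root, element_name, word):
    """Return the text of every element_name element whose text contains word."""
    word = word.lower()
    return [el.text for el in root.iter(element_name)
            if el.text and word in el.text.lower()]


root = ET.fromstring(DOC)
title_hits = search_in_context(root, "Title", "backup")  # titles only
para_hits = search_in_context(root, "Para", "backup")    # body text only
print(title_hits)
print(len(para_hits))
```

Searching only titles finds the one chapter that is about backups, whereas the same word matches body text in both chapters. Without structural markup, the two kinds of hit cannot be told apart.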
HTML tools vs. generic SGML tools
- HTML tools are hardwired to a particular tag set, often containing
proprietary extensions. Generic SGML tools allow designers to add tags
at will; there is no such thing as a proprietary extension in generic
SGML.
- Current HTML tools are hardwired to a particular typographic
format. Generic SGML tools allow formatting and other presentational
characteristics to be controlled by one or more output specifications.
- Future HTML tools will support stylesheets, but not other aspects
of output control typically supported by generic SGML tools.
(For example, in technical publishing, it is often necessary to
generate different versions of an SGML document by selectively showing
different portions of the source. The same capability is needed in
magazine publishing to generate market-specific editions. It is also
often necessary to show elements in an order different from their
order in the source.)
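A minimal sketch of this kind of output control, assuming a hypothetical "edition" attribute on each section and using XML with Python's xml.etree.ElementTree as a stand-in for SGML tooling. The element names, attribute values, and the build_edition helper are all invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical single source containing material for several editions.
SOURCE = """
<Article>
  <Section edition="all"><Title>Overview</Title></Section>
  <Section edition="us"><Title>US pricing</Title></Section>
  <Section edition="eu"><Title>EU pricing</Title></Section>
  <Section edition="all"><Title>Summary</Title></Section>
</Article>
"""


def build_edition(root, edition, summary_first=False):
    """List the section titles shown in one edition, optionally reordered."""
    titles = [sec.findtext("Title") for sec in root.findall("Section")
              if sec.get("edition") in ("all", edition)]
    if summary_first and "Summary" in titles:
        # Present an element in a different position than in the source.
        titles.remove("Summary")
        titles.insert(0, "Summary")
    return titles


root = ET.fromstring(SOURCE)
us_view = build_edition(root, "us")
eu_view = build_edition(root, "eu", summary_first=True)
print(us_view)
print(eu_view)
```

Both views come from the same source; only the output specification differs, which is the capability a typographic stylesheet alone does not provide.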
HTML as a server format
Advantages of HTML on the server
. . . but it does not scale
Example:
- Book with a five-level hierarchy
- Five subdivisions at each level
3125 manually created hypertext links are required to make a table
of contents for this one book
And what about revisions?
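The arithmetic behind the example can be checked in a few lines of Python. The figure 3125 equals 5^5, the number of subdivisions at the lowest level; a table of contents that also links every higher-level heading would need more entries. The variable names are illustrative.

```python
LEVELS = 5
FANOUT = 5

# Number of headings at each level of the hierarchy: 5, 25, 125, 625, 3125.
entries_per_level = [FANOUT ** depth for depth in range(1, LEVELS + 1)]

lowest_level_entries = entries_per_level[-1]  # 5**5
all_entries = sum(entries_per_level)          # every heading at every level

print(lowest_level_entries)
print(all_entries)
```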
HTML as a client format
An HTML-based Web server is limited to a flat, unorganized (or
tediously handcrafted) document space. A generic SGML Web server, on
the other hand, delivers the power of a hierarchical object-oriented
document database.
Advantages of generic SGML on the server
- Based on international standards immune to current and future
Internet politics, competing vendor strategies, and ad hoc HTML
extensions
- Fully extensible markup language is completely under publisher
control and precisely suited to the documents
- Documents can be formally validated and editorial guidelines
enforced by structured authoring tools
- Links and navigational aids can be generated directly from the
structure of documents
- Context searching increases the speed of user access 10-20 times
over flat document databases
- Documents can easily be reused for different purposes
- All versions of a document (printed and online) can be generated
from the same source
- User-selectable stylesheets allow dynamically configurable views
of a document (not just different typographical treatments)
- System administration of large document repositories is vastly
simplified
- Lays the foundation for future deployment of object-oriented
authoring/publishing systems
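As one illustration of generating navigational aids from structure, the sketch below derives a table of contents by walking a document tree. XML and Python's xml.etree.ElementTree stand in for an SGML system (an assumption), and the sample document and toc helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document with DocBook-like element names.
DOC = """
<Book><Title>Admin Guide</Title>
  <Chapter><Title>Installation</Title>
    <Sect1><Title>Requirements</Title></Sect1>
    <Sect1><Title>Setup</Title></Sect1>
  </Chapter>
  <Chapter><Title>Maintenance</Title></Chapter>
</Book>
"""


def toc(element, depth=0):
    """Return (depth, title) pairs for every titled division below element."""
    entries = []
    for child in element:
        title = child.findtext("Title")
        if title is not None:
            entries.append((depth, title))
            entries.extend(toc(child, depth + 1))
    return entries


book = ET.fromstring(DOC)
toc_entries = toc(book)
toc_lines = ["  " * depth + title for depth, title in toc_entries]
print("\n".join(toc_lines))
```

Because the table of contents is computed from the document itself, a revision that adds or moves a section needs no manual link maintenance.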
Some SGML-based Web servers
http://occam.sjf.novell.com:8080/docs/toc.pubs_server.html
http://www.sgi.com/Technology/TechPubs/lib/display.cgi?4097
http://cobweb.sybase.com:8000/
And see:
http://www.w3.org/pub/Conferences/WWW4/Papers/112
Case study: Novell
Note: The speaker left Novell to work for SunSoft in January 1996.
All descriptions of Novell's document server are valid as of
that date but should not be taken as necessarily descriptive of
Novell's current direction. However, everything said in this
presentation may be taken in a general way as applying to SunSoft's
current direction.
Novell's problem (1991-1994)
They started with this...
And needed to get to all of these...
Wrong answer
This is an m x n solution.
Right answer
This is an m + n solution.
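The difference in scale can be sketched as follows, reading m as the number of source formats and n as the number of delivery formats (an assumption; the slide text does not define them). The format names are hypothetical and chosen only to make the counts concrete.

```python
from itertools import product

# Hypothetical format names.
sources = ["wordproc_a", "wordproc_b", "dtp_c"]    # m = 3
targets = ["print", "online_help", "cdrom", "web"]  # n = 4

# Pairwise approach: one converter for every (source, target) combination.
pairwise = list(product(sources, targets))

# Hub approach: one importer per source into a common format,
# plus one exporter per target out of it.
hub = [(s, "common") for s in sources] + [("common", t) for t in targets]

print(len(pairwise))  # m * n converters
print(len(hub))       # m + n converters
```

Adding a new delivery format costs m new converters in the pairwise design but only one in the hub design.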
The Novell Publications Server (January 1996)
The Novell Publications Server (future)
The next step: generic SGML on the Web
Example:
http://www.ncsa.uiuc.edu/SDG/Software/WinMosaic/Viewers/panorama.htm
then
http://www.sq.com
The case for generic SGML on the Web
- SGML-to-HTML servers are complex and CPU-intensive; only large
corporations can afford them. By shifting more of the processing load
to the client, generic SGML browsers can deliver many of the
advantages of structured documents at a much lower cost.
- Structured SGML provides the basis for presentational controls far
beyond what can be accomplished with any current form of HTML.
- Generic SGML allows for the transmission of much richer data to
client-side applications (especially Java applets); "SGML gives Java
something to do."
Generic SGML is rich enough to support distributed document
processing (not just distributed document rendering); HTML is not.
- Generic SGML is required whenever structured data from a database
system must be processed at the client before transmission to some
other database system.
The interchange format must be capable of capturing all of the
information in the source and conveying it to the target. HTML cannot
do this.
Examples of distributed document processing requiring generic
SGML
- TOCs (tables of contents) and library catalogs downloaded to the
browser for increased performance
  - Key requirement: hierarchical data structure
- Patient medical histories from a hospital pasted into the entrance
forms for a home care agency
  - Key requirements: data validation, controlled authoring,
  industry-specific standardized markup
- Semiconductor data from multiple manufacturers used to drive a
circuit-modeling application
  - Key requirements: industry-specific standardized markup, rich
  semantic tagging
- Distributed interactive airline scheduling with client-side
itinerary optimization
  - Key requirements: controlled authoring, rich semantic tagging
- Corporate travel authorization/expense program implemented as an
"intranet" application
  - Key requirements: controlled authoring, easily extensible markup,
  tools that automatically adapt to changed markup