[This local archive copy mirrored from the canonical site: http://www.w3.org/MarkUp/future/papers/roconnor.html; links may not have complete integrity, so use the canonical document at this URL if possible.]

HTML and Architectural Forms

Russell O'Connor <roconnor@uwaterloo.ca>

Most authors use an SGML document type definition (DTD) to define a document's class. The problem is that a DTD can define only a document's syntax, not its semantics. To define a document class properly, authors should use an SGML architecture. Authors can declare that their documents conform to an SGML architecture by adding either an architecture notation or an architecture processing instruction to their document.

The [2]HTML 4.0 specification forces the [3]HTML 4.0 DTD to define document semantics by stating that authors are not allowed to modify the document type definition. This prevents authors from adding logical elements that do not exist in HTML to their documents, and from declaring their own entities for use in those documents. If HTML becomes an SGML architecture, these restrictions can be removed, and authors will be free to use whatever DTD suits their documents.

Creating an HTML architecture will facilitate the progress to full SGML on the web. Once authors create a mapping between the HTML architectural forms and their document elements, user agents will be able to read their SGML documents.
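
The mapping idea can be illustrated with a minimal Python sketch. It assumes the usual architectural-forms convention in which each author element carries an attribute (here hypothetically named HTML) whose value names the HTML form the element derives from. The element names memo, heading, and note, and the function to_architectural_instance, are invented for illustration; an element without the attribute is assumed to be an HTML element already. Real SGML architecture processing involves considerably more than this renaming.

```python
import xml.etree.ElementTree as ET

# Assumed name of the architectural form attribute (not taken from the paper).
ARCH_ATTR = "HTML"

def to_architectural_instance(elem):
    """Return a copy of the tree with each element renamed to its HTML form.

    Elements that lack the form attribute keep their own name, on the
    assumption that they are HTML elements already.
    """
    form = elem.get(ARCH_ATTR, elem.tag)
    # Drop the mapping attribute itself; copy every other attribute unchanged.
    attrs = {k: v for k, v in elem.attrib.items() if k != ARCH_ATTR}
    out = ET.Element(form, attrs)
    out.text = elem.text
    out.tail = elem.tail
    for child in elem:
        out.append(to_architectural_instance(child))
    return out

# Hypothetical author document written against the author's own DTD.
source = (
    '<memo HTML="BODY">'
    '<heading HTML="H1">Status</heading>'
    '<note HTML="P">All is well.</note>'
    '</memo>'
)

html_view = to_architectural_instance(ET.fromstring(source))
print(ET.tostring(html_view, encoding="unicode"))
```

A user agent that understands only the HTML forms could then work from the derived view, while the author keeps validating against a DTD of their own choosing.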

Converting HTML to an SGML architecture is easy. Since a set of SGML architectural forms is almost identical to a DTD, the only thing that needs to change is the way that HTML documents are declared. Currently HTML documents must begin with some variant of the following line:


Using an SGML architecture, documents would instead be required to begin with something like the following processing instruction:

  name="HTML 5.0"
  public-id="-//W3C//NOTATION HTML 5.0 ARCHITECTURE//EN"
  dtd-public-id="-//W3C//DTD HTML 5.0//EN"

Since XML is a subset of SGML, it could also be used to make HTML documents. The following is an example of an XML HTML document.

<?XML VERSION="1.0"?>
  name="HTML 5.0"
  public-id="-//W3C//NOTATION HTML 5.0 ARCHITECTURE//EN"
  dtd-public-id="-//W3C//DTD HTML 5.0//EN"
    <TITLE>Short Example</TITLE>
    <P>This is a short example of an HTML document.</P>

The transition to supporting HTML architectures can be made easier by making user agent support optional. User agents would be required to support only those documents that explicitly use the HTML DTD. This will probably be the most common use of the architecture anyway. Most authors will validate their documents against the HTML DTD, since it provides enough structure, but the option of using another DTD will be available to them. An example of the common use of the HTML architecture would be the following:

  name="HTML 5.0"
  public-id="-//W3C//NOTATION HTML 5.0 ARCHITECTURE//EN"
  dtd-public-id="-//W3C//DTD HTML 5.0//EN"
<TITLE>Short Example</TITLE>
<P>This is another short example of an HTML document.
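
One way a user agent might decide whether support is mandatory is sketched below in Python. This is only an illustration under stated assumptions: the rule "support is required when dtd-public-id names the HTML DTD" is one reading of the proposal, the function names are hypothetical, and the public identifier "-//Example//DTD Memo//EN" is invented for the counterexample.

```python
import re

# Public identifier of the HTML DTD, as it appears in the example declaration.
HTML_DTD_PUBLIC_ID = "-//W3C//DTD HTML 5.0//EN"

def parse_pseudo_attributes(declaration):
    """Collect name="value" pairs from the text of an architecture declaration."""
    return dict(re.findall(r'([\w-]+)="([^"]*)"', declaration))

def must_support(declaration):
    """Hypothetical rule: support is mandatory only when the HTML DTD is named."""
    attrs = parse_pseudo_attributes(declaration)
    return attrs.get("dtd-public-id") == HTML_DTD_PUBLIC_ID

# The common case: the document is validated against the HTML DTD itself.
common_case = '''
  name="HTML 5.0"
  public-id="-//W3C//NOTATION HTML 5.0 ARCHITECTURE//EN"
  dtd-public-id="-//W3C//DTD HTML 5.0//EN"
'''

# An author-defined DTD (invented identifier): support would be optional.
custom_case = common_case.replace(
    "-//W3C//DTD HTML 5.0//EN", "-//Example//DTD Memo//EN"
)

print(must_support(common_case))
print(must_support(custom_case))
```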

So we see that creating an HTML architecture opens up a world of flexibility to those authors that want to take advantage of it. It maintains the structure that some user agents require. It eases the transition of SGML and XML onto the web. And it is extremely easy to implement. This is where the future of HTML lies.

Works Cited

   C. M. Sperberg-McQueen, Robert F. Goldstein ``HTML to the Max: A
        Manifesto for Adding SGML Intelligence to the World-Wide Web.''
        1994-09-18. 1998-03-15

   ``HTML 4.0 Specification.'' Ed. Dave Raggett et al. 1997-12-18. [5]W3C.
        1998-01-23 <URL:[6]http://www.w3.org/TR/REC-html40/>

   Kimber, W. Eliot ``A Tutorial Introduction to SGML Architectures.'' 1997.
        [7]ISOGEN International Corp. 1998-03-16

   Kimber, W. Eliot ``Re: Is XML < SGML? For how long?...'' Online posting.  
        1998-02-17. [9]comp.text.sgml

   Kimber, W. Eliot ``Re: Is XML < SGML? For how long?...'' Online posting.  
        1998-02-18. [11]comp.text.sgml 

   Newcomb, Steven R. ``SGML Architectures: Implications and Opportunities
        for Industry.'' [13]<TAG> 1995-08. SGML Associates, Inc. 1998-03-15


   1. mailto:roconnor@uwaterloo.ca
   2. http://www.w3.org/TR/REC-html40/
   3. http://www.w3.org/TR/REC-html40/sgml/dtd.html
   4. http://www.ncsa.uiuc.edu/SDG/IT94/Proceedings/Autools/sperberg-mcqueen/sperberg.html
   5. http://www.w3.org/
   6. http://www.w3.org/TR/REC-html40/
   7. http://www.isogen.com/
   8. http://www.isogen.com/papers/archintro.html
   9. news:comp.text.sgml
  10. news:34E9CBC9.401B6BB0@isogen.com
  11. news:comp.text.sgml
  12. news:34EB13C0.B7FD0F02@isogen.com
  13. http://tag.sgml.com/
  14. http://tag.sgml.com/08080101.htm
  15. file://localhost/u3/roconnor/public_html/
  16. mailto:roconnor@uwaterloo.ca

Additional Comments

I'd like to add that my motivation for wanting an HTML Architecture comes from a desire to allow authors to create their own entities for client-side includes. For example, I want the following to be legal HTML:

"http://www.w3.org/TR/REC-html40/strict.dtd" [
<!ENTITY header SYSTEM "http://www.example.com/header.inc">
<!ENTITY footer SYSTEM "http://www.example.com/footer.inc">
<HTML lang=en-CA>



<!-- main body here -->


I'm sure this has been discussed before.

Compared with server-side includes, client-side includes offer better caching, less load on the server, and less network traffic.

An HTML architecture is a more general solution to a more general problem, but I'd at least like to see this ability for client-side includes in the next version of HTML.

Transclusion using entities and transclusion using objects actually differ in significant ways. An HTML file inserted with an OBJECT element is a stand-alone document. It has its own URL resolution hierarchy. It has its own style sheet. It has its own grove. An HTML segment inserted with an entity becomes part of the root document. The root document's style sheet applies to it. Its insertion affects the grove structure of the root document.

These subtle differences can have major consequences when the document is affected by CSS and the DOM. I don't think one method is better than the other. I think each has its place, depending on the situation.
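
The structural difference between the two kinds of transclusion can be modeled with a short Python sketch. It is a simplification under stated assumptions: entity inclusion is modeled as text substitution before a single parse, OBJECT inclusion as two independent parses, and an XML parser stands in for an SGML one. The header content and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical content of the included header resource.
HEADER = "<H1>Site header</H1>"

def include_as_entity(root_markup, name, replacement):
    """Model entity inclusion: splice the text in, then parse one document."""
    return ET.fromstring(root_markup.replace(f"&{name};", replacement))

def include_as_object(root_markup, included_markup):
    """Model OBJECT inclusion: parse the root and the included document separately."""
    return ET.fromstring(root_markup), ET.fromstring(included_markup)

# Entity inclusion: the header becomes part of the root document's tree.
entity_tree = include_as_entity(
    "<BODY>&header;<P>Main body</P></BODY>", "header", HEADER
)

# OBJECT inclusion: the root tree holds only a reference; the header is its own tree.
object_root, object_doc = include_as_object(
    '<BODY><OBJECT data="header.html"/><P>Main body</P></BODY>', HEADER
)

print([child.tag for child in entity_tree])
print([child.tag for child in object_root])
print(object_doc.tag)
```

In the entity case a style rule or script that walks the root tree encounters the H1 directly; in the OBJECT case it encounters only the OBJECT element, and the included document has to be handled on its own terms.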