The following report was obtained from the Exeter SGML Project FTP server as Report No. 9, in UNIX "tar" and "compress" (.Z) format. It is unchanged here except for the conversion of SGML markup characters into entity references, in support of HTML.
THE SGML PROJECT
SGML/R25 CONFERENCE REPORT

SGML '93
BOSTON, MA, USA
6TH-9TH DECEMBER 1993

Issued by Michael G Popham
22nd December 1993

BACKGROUND

The conference opened with a welcome from Norm Scharpf of the GCA. This year, around 450 attendees were anticipated, about 50% of whom were attending their first SGML conference. Norm announced that the GCA will be taking over the maintenance of the AAP/ISO 12083 DTDs from EPSIG. In his initial remarks, the Conference Chair (Yuri Rubinsky) noted that this was the largest SGML event in history. The size of the conference meant that it was split into two tracks running concurrently, one for SGML novices, the other for experts. (I attended the technical track for experts.)

SESSIONS ATTENDED

1. "The SGML Year in Review" - Yuri Rubinsky, B Tommie Usdin, and Debbie Lapeyre
2. Keynote Address: "TEI Vision, Answers and Questions: SGML for the Rest of Us" - Lou Burnard (Text Encoding Initiative)
3. Poster Session
4. Reports from the Front
5. "Multi-company SGML Application Standard Development Process" - Bob Yencha (National Semiconductor), Patricia O'Sullivan (Intel Corporation), Jeff Barton (Texas Instruments), Tom Jeffery (Hitachi Micro Systems, Inc.), Alfred Elkerbout (Philips Semiconductors)
6. "Archetypal Early Adopters? Documentation of the Computer Industry"
6.1 "Information Development Strategy and Tools - IBM and SGML" - Eliot Kimber and Wayne Wohler (IBM)
6.2 Eve Maler (Digital Equipment Corporation)
6.3 "Implementation of a Corporate Publishing System" - Beth Micksh (Intergraph)
6.4 Jon Bosak (Novell)
7. International SGML Users' Group Meeting
8. "Real World Publishing with SGML" - Terry Allen (Digital Media Group, O'Reilly & Associates, Inc.)
9. "HyTime Concepts and Tools" - Dr. Charles Goldfarb (IBM)
10. "HyTime: Today's Toolset" - Dr. Charles Goldfarb (IBM) and Erik Naggum (Naggum Software)
11. "Charles Goldfarb, Please Come Home: SGML, the Law, and Public Interest" - Ina Schiff (Attorney at Law) and Allen H Renear (Brown University)
12. "Online Information Distribution" - Dave Hollander (Hewlett Packard)
13. "Xhelp: What? Why? How? Who?" - Kent Summers (Electronic Book Technologies)
14. "Digital Publications Standards Development (DPSD): A Modular Approach" - Horace Layton (Computer Sciences Corporation)
15. "A Practical Introduction to SGML Document Transformation" - David Sklar (Electronic Book Technologies)
16. "SGML Transformers: Five Ways" - Chair: Pam Gennusa (Database Publishing Systems Limited)
17. "The Scribner Writers Series on CD-ROM: From a Great Pile of Paper to SGML and Hypertext on a Platter" - Harry I. Summerfield (Zandar Corporation), Anna Sabasteanski (Macmillan New Media)
18. "The Attachment of Processing Information to SGML Data in Large Systems" - Lloyd Harding (Mead Data Central)
19. "ISO 12083 Announcement" - Beth Micksch (Intergraph Corp.)
20. Reports from the SGML Open Technical Committees
21. "A Technical Look at Authoring in SGML" - Paul Grosso (ArborText)
22. "Implementing an Interactive Electronic Technical Manual" - Geoffrey von Limbach, Bill Kirk (InfoAccess Inc.)
23. "The Conversion of Legacy Technical Documents into Interactive Electronic Technical Manuals: A NAVAIR Phase II SBIR Status Report" - Timothy Billington, Robert F. Fye (Aquidneck Management Associates, Ltd.)
24. New Product Announcements and Product Table Top Demonstrations
25. Poster Session
26. "Implementing SGML Structures in the Real World" - Tim Bray (Open Text Corp.)
"User Requirements for SGML Data Management" - Paula Angerstein (Texcel) 28. "A Document Query Language for SGML Databases" - Ping-Li Pang, Bernd Nordhausen, Lim Jyh Jang, Desai Narasimhalu (Institute of Systems Science, National University of Singapore) 29. Closing Keynote - Michael Sperberg-McQueen (Text Encoding Initiative) 1. "The SGML Year in Review" - Yuri Rubinsky, B Tommie Usdin, and Debbie Lapeyre The full text of "The Year in Review" will be published in <TAG> and also posted to comp.text.sgml. The review was loosely split into a number of categories, the first of which focused on Standards Activity. The interim report of the review of ISO 8879 will be published in <TAG>, however it is clear that changes to the SGML Standard will be required. ISO 10744, the HyTime Standard is being adopted by IBM, whilst TechnoTeacher are producing relevant tools; also, at least 3 HyTime-related books are currently in preparation. A revised 3- level version of DSSSL will be out early next year, and the SPDL Standard is now ready to go to press. Information on ISO 12083 (the fast-tracked version of the revised AAP DTD) will be given at this conference. User group activity - the Swedish group has been very active in the last twelve months as has the Japanese SGML forum, which attracted 400 people to an open meeting on SGML. Erik Naggum was welcomed as the new Chair of SGML SIGhyper. Major Public Initiatives - SGML Open was founded this year, more information was given later in the week. The NAA and IPTC (both major news industry bodies), have been working on an SGML-based universal text-format, the "News Industry Markup Language" [?], for interchanging news service data. Co-ordinators of The Text Encoding Initiative (TEI) met with people developing the World Wide Web (WWW) to discuss the production of HTML+ (a revised version of the Hypertext Markup Language, the markup scheme recognized by WWW browsers). The TEI have now completed all their major goals and some supplementary work, which is now publicly available (via ftp). The DAVENPORT group made an amicable split into DAVENPORT and CApH; more details were given later in the conference. In the US, 18 other states have followed Texas in requiring text books to be produced following using SGML and following the ICADD guidelines (the International Committee for Accessible Document Design) - several companies have said they will be providing tools to handle the ICADD tagset. By 1995, all companies in the US will be able to provide their financial information according to EDGAR. British Airways is developing an ATA DTD-based system, and Lufthansa already have an SGML-based system in place (details of which were given at SGML `93 Europe). Publications - SGML has received an increasing amount of coverage in the mainstream computer press. Prentice Hall will be publishing a series of books to do with open information interchange, under the guidance of Charles Goldfarb. Kluwer will publish "Making Hypermedia Work" a handbook on HyTime by Steve de Rose and David Barnard. A new version of Eric van Herwijnen's "Practical SGML" will be available early in 1994. Van Nostrand Reinhold will be publishing a manager s' guide to SGML. Exoterica released their "Compleat SGML CD-ROM" in 1993, and will be releasing a conformance suite CD-ROM next year. Elliot Kimber has written a HyQ tutorial (HyQ is the query language described in the HyTime standard), which is available via ftp. 
Major corporations and government initiatives - The American Memory Project (run by the Library of Congress) has chosen to use SGML to create a text base of materials. IBM has developed an internal SGML-based system, called IBMIDDOC, which they will use to create, manage and produce/deliver all their product documentation. The OCLC have been selected to develop an SGML-based publishing system for use by the ACM. The British National Corpus, a 100 million word tagged corpus, is to be made available next year. Springer Verlag is currently producing 50 journals a year using an SGML-based system, and next year this figure will rise to 100 [150?]. Various patent/trademark bodies, including the US Patent and Trademark Office, the European Patent Office and the Japanese Patent Office, are adopting SGML-based approaches. In France, SGML is being used by a number of key players in various industries (maritime, aerospace, power industry, publishing), whilst SGML uptake in Australia is also increasing. UCLA is adopting SGML for all its campus-wide information publishing, both on-line and on paper, as is the Royal Institute of Technology in Sweden.

Miscellaneous - Adobe has an agreement with Avalanche to develop SGML filters to move information to/from Acrobat. Lotus Development Corporation is looking at incorporating SGML awareness into a future version of Ami Pro. Microsoft announced the development of an SGML-based add-on for Word (to be called "SGML Author"). The American Chemical Society is updating its SGML DTDs, whilst the IEEE is developing a suite of DTDs for publishing its standards. The Oxford Text Archive now has about 100 works tagged with TEI-conformant SGML, which are available for ftp over the Internet. The first SGML summer school was organized in France, and attracted 35 attendees. Joan Smith, the "Godmother of SGML" and founder of the International SGML Users' Group, has retired this year; Yuri remarked that her presence will be greatly missed and thanked her for all her efforts over the years. And in the "believe it or not" category came the news that IBM will be producing an SGML-tagged CD-ROM of interviews published in Playboy magazine between 1962 and 1992.

2. Keynote Address: "TEI Vision, Answers and Questions: SGML for the Rest of Us" - Lou Burnard (Text Encoding Initiative)

This presentation looked at the wider relevance of the Text Encoding Initiative (TEI) - its origins, goals and achievements. It included an overview (tasting?) of the TEI's DTD pizza model and the TEI class system, and a look at some TEI techniques and applications. The TEI comes primarily from the efforts of the Humanities research community. Sponsors include the Association for Computational Linguistics (ACL), the Association for Literary and Linguistic Computing (ALLC), and the Association for Computers and the Humanities (ACH). Funding bodies include the US National Endowment for the Humanities, the Mellon Foundation, DG XIII of the European Commission, and the Social Science and Humanities Research Council of Canada. The TEI addresses itself to a number of the fundamental problems facing Humanities researchers, although their findings are widely applicable throughout academia and beyond. It looks at the re-usability of information, particularly with regard to issues of platform-, application-, and language-independence. It accepts the need to support varied information sources (such as text, image, audio, transcription, editorial, linking, analysis etc.).
The developers of the TEI's guidelines have also given careful consideration to the interchange of information - issues such as what and how to encode, generality vs. specificity, providing an extensible architecture and so on. The basics of the TEI's approach were published in the first draft of "Guidelines for the Encoding and Interchange of Machine-Readable Texts" (also known as TEI P1). Specialist workgroups were set up as part of the process to develop the guidelines. They focused on identifying significant particularities, ensuring that the guidelines were independent of notation or realization, avoiding controversy, over-delicacy, or inadequacy, and seeking generalizable solutions. The second draft, TEI P2, covers such areas as: segmentation, feature-structures, certainty, manuscripts, historical sources, graphs, graphics, formulae and tables, dictionaries, terminology, corpora, spoken texts, and so on. The consequences of the TEI's approach mean that they have focused on content, not presentation. The guidelines are descriptive, not prescriptive, in nature - and any redundancy has been cut out. The aim of the TEI's work (in addition to producing the guidelines) was to create a modular, extensible DTD.

To date, the TEI have produced a number of outputs. It has created a coherent set of extensible recommendations (contained in the guidelines). It has made available a large number of SGML tagsets, which can be downloaded from several public ftp servers around the world. In addition, the TEI has developed a powerful general purpose SGML mechanism (using global attributes, etc.). TEI P2 serves as a reference manual for all of these aspects. The current version of TEI P2 is now available as a series of fascicles (available via ftp) which outline the basic architecture and core tagsets. For each element, TEI P2 provides descriptive documentation, examples of usage, and reference information. It also contains some DTD fragments (i.e. tagsets) which can be used to create TEI-conformant applications.

Lou then raised the more general question of "How many DTDs does the world really need?" - to which there are several possible answers. One massive and/or general (vague) DTD might meet the lowest common denominator of peoples' needs, but the TEI could only develop it if they adopted a "we know what's best for you" philosophy. At the other extreme, one could argue that the world does not need any generalized DTDs, because no-one could write a DTD (or set of DTDs) which could truly address all the specific concerns of individual users. An intermediate alternative would be to develop "as many [DTDs] as it takes" to support a particular level of user requirements. However, the TEI believes that it is possible to adopt an entirely different approach to the three (extremes) mentioned above, which it calls "The Chicago Pizza Model". The Pizza Model allows users to make selections from a prescribed variety of options, in order to meet their particular needs as closely as possible. The overall model assumes that a pizza is made up of a choice of bases, mandatory tomato sauce and cheese, plus a choice of any combination from a given list of toppings. In terms of the "TEI Menu", the base consists of a choice of one tagset to describe prose, verse, drama, transcribed speech, letters and memoranda, dictionaries, or terminology. The TEI's header and core tag sets are mandatory (i.e.
they are the equivalent of the cheese and tomato sauce which come with every pizza), after which the user can select one or more tagsets to describe linking, analysis, feature structures etc.

The mandatory parts of the TEI's model cover a number of vital aspects. The TEI header consists of an AACR2-compatible bibliographic description of an electronic document and its sources. The header also contains standardized descriptions of the encoding systems applied, codebooks and other metadata, and the document's revision status. The core tag set provides markup for highlighted phrases (e.g. emphasis, technical terms, foreign language matter, titles, quotations, mention, glosses etc.), "data" (e.g. names, numbers, dates, times, addresses), editorial intervention (e.g. corrections, regularizations, additions, omissions, etc.) as well as lists of all kinds, notes, links and pointers, bibliographic references, page and line breaks, verse and drama. Lou gave some examples of the use of core tags, and how to customize the TEI DTD to rename elements, undefine elements and select tagsets.

The TEI have also adopted a class system. Element classes consist of semantically related elements which may all appear at the same place in a content model. Attribute classes are made up of semantically related elements which share a common set of attributes. The classes are implemented using parameter entities, which simplifies documentation and facilitates controlled modification of the DTD by the user. Lou gave an example of how new elements could be added to a class using the TEI's technique (a simplified sketch is given at the end of this section). Unfortunately, due to time constraints, Lou did not present his slides on the TEI's use of global attributes, nor how the problems of alignment, using identifiers, pointers and links are dealt with.

Several major scholarly publishers (e.g. Chadwyck Healey, Oxford University Press, Cambridge University Press) have begun to adopt the TEI's guidelines and implement their techniques. The same is true of a number of large academic projects - such as the Women Writers project (at Brown University), CURIA, the Wittgenstein Archive, the Oxford Text Archive, etc. - and some Linguistic Research and Engineering (LRE) projects (e.g. EAGLES, and the British National Corpus). The TEI's work is also being taken up by librarians and archivists (such as in the Library of Congress' American Memory Project, at the Center for Electronic Texts in the Humanities (CETH), and so on).

Developing a TEI conformant application still requires the essential processes of systems analysis/document design. Once this has been done, the designers can choose the relevant TEI tagsets using the "pizza model" approach. Any restrictions, modifications or extensions must be carefully identified and documented (to help ensure the usefulness of the tagged data to later scholars). The TEI DTD (i.e. its various constituent tagsets) is now available for beta test, and can be downloaded from a number of sites around the world (e.g. ftp from sgml1.ex.ac.uk [144.173.6.61] in directories tei/p2/drafts and tei/p2/dtds; or send email to listserv@uicvm.uic.edu containing the single line: sub tei-l <Your Real Name>). Once the testing of the TEI P2 DTD is complete, any revisions will be incorporated into a final version of the "Guidelines..." to be published as TEI P3. The next phase of work will involve the development of application-specific tutorials (including electronic versions), development of appropriate software tools (e.g. TEI-aware application packages), and the creation of new tagsets to extend the TEI's guidelines to support new kinds of application.
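Returning to the class mechanism mentioned above: Lou's actual declarations are not reproduced in this report, but a simplified sketch of the parameter-entity technique might look something like the following (the entity and element names follow the general TEI naming pattern, but the declarations are illustrative rather than copied from TEI P2):

   <!-- In the standard tagset: a phrase-level class used in content models -->
   <!ENTITY % x.phrase "" >          <!-- hook for user extensions; empty by default -->
   <!ENTITY % m.phrase "%x.phrase; emph | foreign | term | title" >
   <!ELEMENT p - O (#PCDATA | %m.phrase;)* >

   <!-- In the user's extension file, which is read first (in SGML the first
        declaration of an entity is the one that takes effect): add an element -->
   <!ENTITY % x.phrase "formula |" >
   <!ELEMENT formula - - (#PCDATA) >

Because content models refer to the class entity rather than listing elements directly, the user's redefinition propagates automatically to every model that includes the class, which is what makes the controlled modification described above practical.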
3. Poster Session

Poster sessions formed a much larger part of this year's conference schedule than on previous occasions. The idea behind the sessions is that they allow speakers to give a short, informal presentation on any topic, after which they are available for questions and discussion. Each speaker can support his/her presentation with one or more specially created posters, which may consist of anything from summary diagrams or a list of points to the full text of a presentation. It would be impossible to provide full details of all the poster presentations. However, given below are the title and extracts from the poster abstracts for each of the presentations mentioned in the programme. (N.B. Some poster sessions were put on impromptu, and these are not shown below.) The posters were loosely grouped into categories.

SGML Transformations:

"From SGML to Acrobat Using Shrinkwrapped Tools" - the transformational process the BCS uses to transform its documents into pdf files freely distributed over the internet. (Sam Hunting, BCS Magazine)
"SGML Transform GUI" - describes a language-based, syntax-independent GUI for SGML structure and style semantic transformation, that supports both declarative and procedural processing models. (Michael Levanthal, Oracle Corporation)
"A Tale of Two Translations" - a comparison of the development of translation programs using Exoterica's OmniMark, and Avalanche's SGML Hammer. (Peter MacHarrie, ATLIS Consulting Group)
"Data Conversion Mappings" - mapping old data formats to new, using 1 to 1, 1 to many, 0 to 1 and 0 to many conversions, and how these mappings can be automated. (David Silverman, Data Conversion Laboratory)
"DTD to DTD Conversion for Producing Braille Automatically from SGML" - following the techniques created by the International Committee on Accessible Document Design (ICADD) to produce braille, large print and voice synthesized books from SGML source files. (David Slocombe, SoftQuad)
"Let's Play UnTag!" - untagging an SGML document to get a proprietary format. (Harry Summerfield, Zandar Inc.)
"Introducing Rainbow" - a DTD archetype for representing a wide variety of proprietary word processor data formats to facilitate proprietary-to-SGML interchange and transformation. (Kent Summers, Electronic Book Technologies)
"Converting Tables to SGML" - converting legacy table data from typesetting files or formatted visual representation into SGML. (Brian Travis, SGML Associates)

Business Case for SGML:

"Fear and Loathing in SGML: Life After CALS" - an overview of a recent study of SGML products and markets. (Antoinette Azevedo, InterConsult)
"Designing Open Systems with SGML" - the business role and benefits of using SGML, and how to design an SGML based system. (Larry Bohn, Interleaf)
"The Commercialization of SGML" - a review of the strengths and benefits of SGML and its current perception in the commercial business world. (Allen Brown, XSoft)
"SGML: Setting up the Business Case" - an approach to making the business case for SGML. (Eric Severson, Avalanche Development Corporation)
"Document Management Lingo: Why Executives Buy SGML" - a framework for selling SGML in relation to document management. (Ludo Van Vooren, Interleaf)

How To...

"To 'INCLUDE' or 'EXCLUDE': That is the Question" - the use of INCLUSION and EXCLUSION in DTDs. (Bob Barlow, Barlow Associates)
"Communicating Table Structures Using Word-processor Ruler Lines" - a method for writers to indicate the structures of tables using simple, mnemonic "ruler line encoding". (Gary Benson, Fluke Corporation)
"Pre-Fab Documents: Modularization in DTD design" - groups of related document types are traditionally described in large, monolithic DTDs. If the related document types contain similar structures, they can be described as a series of related DTD modules. (Michael Hahn, ATLIS Consulting Group)
"SGML + RFB = Access to Documents" - how Recording for the Blind, Inc. (RFB) provides E-Text materials to print-disabled persons.
"Handling Format/Style Information" - using FOSIs to describe formatting/style information, and how to develop FOSIs using the Output Specification DTD provided in MIL-M-28001B. (Denise Kusinski and Pushpa Merchant, World Computer Systems Inc.)
"Remodeling Ambiguous Content Models Through ORs" - using factorization to avoid ambiguity in model groups resulting from improper use of occurrence indicators and OR connectors. (John Oster, McAfee & McAdam, Ltd.)
"Reuse and conditional processing in IBM IBMDOC" - [not listed in programme] (Wayne Wohler and Eliot Kimber, IBM)
"An easy way to write DTDs" - [not listed in programme] ([speaker unknown])

Case Studies:

"SGML and Natural Language Processing of Wire Service Articles" - the Mitre Corporation's use of Natural Language Processing and SGML-tagging to add value to news-wire articles and other kinds of document. (John D Burger, MITRE)
"SGML Databases" - (Mike Doyle, CTMG Officesmiths)
"Active Information Sharing System (AISS) SGML Database API" - the applications interface to the AISS SGML database to produce a total solution integrated SGML environment. (Hee Joh Lian, Information Technology Institute (ITI) Singapore)
"AISS Document Formatting API" - a processing model of a native SGML formatter, the issues involved and architectural forms to define them. (Yasuhiro Okui, NIHON UNITEC CO. LTD.)
"AISS Document Management API" - controlling workflow using SGML and HyTime. (Roger Connelly, Fujitsu; Steven R Newcomb, TechnoTeacher)
"Paperless Classroom" - building an interface to large amounts of information through the use of SGML-based hypermedia technology. (Barbara A Morris, Navy Personnel Research and Development Center)
"Integrating SGML into On-line Component Information Delivery" - a comparison of using manual and SGML-based processes for database loading. (Javier Romeu, Info Enterprises Inc. - A Motorola Company)
"SGML Support for Software Reuse" - using SGML to markup software for reuse. (John Shockro, CEA Inc.)
"What is an IETM?" - what constitutes an "Interactive Electronic Technical Manual" (IETM)? The different classes of IETM and how to build one. (Geoffrey von Limbach, InfoAccess Corporation)
"Use of SGML to Model Semiconductor Documents" - tagging information on electronic components, which can be used to produce printed documents and treated as machine-readable data. (Various speakers, Pinnacles Group)
"Producing a CALS-Compliant Technical Manual" - ensuring that the DTDs, Tagged Instances, and FOSIs support the users' needs for creating CALS compliant Air Force Technical Manual Standards and Specifications. (Susan Yucker and Matthew Voisard, RJO Enterprises Inc.)

HyTime:

"SGML/HyTime and the Object Paradigm" - a comparison of the object-oriented and SGML/HyTime ways of representing information. (Steven R Newcomb, TechnoTeacher Inc.)
"An object-oriented API to SGML/HyTime documents" - the design of some of the HyMinder C++ library (developed by TechnoTeacher), and a comparison of SGML/HyTime constructs and HyMinder object classes. (Steven R Newcomb, TechnoTeacher Inc.) "SlideShow: An Application of HyTime Application" - the system design and architectural forms used to create a sample HyTime application called SlideShow. (Lloyd Rutledge, University of Massachusetts). "HyQ: HyTime's ISO-standard SGML query language" - a discussion of the main features and advantages of HyQ. (Steven R Newcomb, TechnoTeacher/Fujitsu/ISI) Technical Gems: "A Document Manipulation System Based on Natural Semantics" - Natural Semantics, its relationship to SGML, and the results of some document manipulation experiments. (Dennis S Arnon, Xerox PARC; Isabelle Attali, INRIA Sophia Antipolis; Poul Franchi-Zannettacii, University of Nice Sophia Antipolis). "Digital Signatures Using SGML" - a schema for digital signatures of electronic documents using SGML. (Bernd Nordhausen, Chee Yeow Meng, Roland Yeo, and Daneel Pang Swee Chee, National Information Infrastructure Division). "CADE - Computer Aided Document Engineering" - a framework of six methodologies for the Document Development Life Cycle. (G Ken Holman, Microstar Software Ltd.) "Using SGML to Address the 'Real' Problems in Electronic Publishing" - using Requirements Driven Development (RDD) for the generation of information capture and production environments, so as to ensure the balance of the supply and demand for data. (Barry Schaeffer, Information Strategies Inc.) "Recursion in Complex Tables" - a recursive-row table model, that can model the structure of most tables with multi-row subheads. (Dave Peterson) 4. Reports from the Front In this session, several speakers briefly outlined major SGML- related industry activities. Beth Micksh summarized the history and purpose behind parts of CALS MIL-M-28001, dealing with the use of SGML. The latest version MIL-M-28001B was made available last summer, and it supports the electronic review of documents. Early in 1994, the publication of a "MIL SGML Handbook" is expected; it will cover some of the fundamental and general aspects of SGML, as well as providing some valuable CALS-specific information. Terry Allen described the work of the Davenport Group. He outlined their general purpose and main members, full details of which are given in the publicly available material circulated by the Group. the DOCBOOK DTD has been the Davenport Group's major accomplishment to date - and v2.1 should be ready for release in the week commencing December 13th 1993; announcements will be posted to the comp.text.sgml newsgroup. Steve Newcomb talked about CApH, a breakaway group from Davenport, which focuses on Conventions for the Application of HyTime. They intend to provide guidance on how to use HyTime for anyone who wishes to adopt ISO 10744 for the interchange of multimedia and/or hypertext information. CApH will provide general policies and guidelines on how to tackle typical problems (e.g. how to generate a master index for a set of documents which all have their own separate indexes), but will not deal with how to enforce such policies The next speaker discussed the joint efforts of the International Press and Telecommunications Council (IPTC) and the Newspapers Association of America (NAA) to devise a base set of tags for marking up newswire text. This work will effect not only the providers of this kind of information (e.g. 
companies such as Reuters), but also the newspapers and broadcast services which make use of it, and the database/archiving specialists (such as Mead Data Central) who will want to store it.

Eddie Nelson spoke about the ATA DTDs, which he stressed were designed for interchange only. The ATA DTDs will influence the documentation activities of all the manufacturers, component suppliers, operators etc. in the commercial aviation industry. To date, the ATA has released 8-9 DTDs, mostly dealing with the major types of technical manual.

Dianne Kennedy discussed the Society of Automotive Engineers (SAE) J2008 standard, which has been prompted by the emissions regulations in the Clean Air Act. By model year 1998, all new automobiles sold in the US must be supported by documentation conforming to the J2008 standard. J2008 is actually a suite of standards covering such aspects as the use of text, graphics and the interchange of electronic information - it is not just a DTD. Also, J2008 includes a data modelling (database) approach, which is separate from the actual documentation considerations; information required by the former is essentially relational in nature, whilst that for the latter is hierarchical. Careful thought has been required to ensure that the documentation DTD will sufficiently support mapping into (i.e. populating) the data model. The next meeting of those involved in developing J2008 will take place in January 1994.

5. "Multi-company SGML Application Standard Development Process" - Bob Yencha (National Semiconductor), Patricia O'Sullivan (Intel Corporation), Jeff Barton (Texas Instruments), Tom Jeffery (Hitachi Micro Systems, Inc.), Alfred Elkerbout (Philips Semiconductors)

This presentation described the work of the Pinnacles Group, a consortium of the five major semiconductor manufacturers, to develop a common form of electronic product datasheet for interchange between themselves and their customers. Product datasheets are relatively few (typically <10,000 per company) and small (typically < 100 pages), but they are very complex in terms of their structure and content. The decision to adopt a common, SGML-based electronic solution was based on the need to simultaneously resolve business problems (i.e. collect and deliver information efficiently), and respond to market pressures (i.e. customers wanted the information quickly and in electronic form). Developing the DTD jointly by all the members of the Pinnacles group ensured harmonization, distributed the development costs, and encouraged the development of tools for both information providers and users. The speakers repeatedly stressed the importance of having access to the knowledge of content experts during the document analysis phase, and the benefits of having observers to ensure continuity between the various analysis sessions that were held. A cumulative document analysis process was strongly recommended. The draft DTD is due out at the end of 1993. Following a period for review and revision, it is expected to become an industry standard by April 1994. Each individual company still needs to consider how it will customize the DTD for its own use, and how the standard will be implemented within the company. The speakers felt that if members of any other industries were considering forming a similar group, companies should join early, plan carefully, and ensure that the anticipated benefits are continually "sold" to participants and stakeholders throughout the entire development process.

6. "Archetypal Early Adopters? Documentation of the Computer Industry"
6.1 "Information Development Strategy and Tools - IBM and SGML" - Eliot Kimber and Wayne Wohler (IBM)

Wayne Wohler began by describing the "BookMaster Legacy". BookMaster is a GML application used by IBM Information Developers to create IBM product documentation. GML is very like SGML, although it lacks a concept comparable to the DTDs of ISO 8879. BookMaster is a fairly extensive authoring language, which has met IBM's information interchange and reuse requirements for several years. However, now that IBM supports more platforms and delivery vehicles (and wishes to interchange information with other enterprises), and to answer the growing demand from users, IBM has decided to migrate its Information Development (ID) operations to SGML.

Eliot Kimber described how this migration is being carried out. The procedure first involved the design of a processing architecture (InfoMaster) on which to base the application language and semantics (IBMIDDOC). Tools had to be found for authors, editors and users, existing data had to be migrated to the new environment, with documentation and educational materials being developed along the way. InfoMaster is an architecture for technical documentation that defines the base set of element classes for technical documents; it defines how DTDs should specify the semantics of such documents and how programs should use the information. IBM drew on the HyTime concept of Architectural Forms to standardize the application semantics, and to facilitate the interchange of information between different DTDs. IBMIDDOC is based on industry standards, and was designed without bias towards any particular processing model. It makes use of explicit containment to describe all containing relationships and uses elements as the basis of all processing semantics. All relationships between elements are treated as (HyTime-conforming) hyperlinks. IBMIDDOC does not use inclusions or exclusions, short references, or #CURRENT attributes. Documents conforming to IBMIDDOC are organized into conventional high level structures (e.g. prolog, front-, body- and back-matter), each of which can contain recursive divisions. Below the division level, elements are classified either as information unit elements (e.g. paragraphs, lists, figures etc.), or as data pool elements (e.g. phrases and other flowed material). IBMIDDOC supports multimedia objects, and hyperlinks that can either be cross references or explicit hyperlink phrases. Any element can be a hyperlink source or target anchor, and IBMIDDOC also supports HyTime's nameloc, dataloc, and name query features.

6.2 Eve Maler (Digital Equipment Corporation)

(I missed this session, as I was slogging round Boston trying to get my ailing laptop repaired.)

6.3 "Implementation of a Corporate Publishing System" - Beth Micksh (Intergraph)

(I also missed this session, as I was still slogging round Boston trying to get my ailing laptop repaired - this brief write up is based upon the copy of Beth's overheads included in the conference proceedings.)

Intergraph has 15 documentation departments with 120 full-time writers and a number of third party documentation suppliers. They publish and maintain over 400,000 pages per year at an annual cost of $20 million. The objectives behind developing and implementing a corporate publishing system were to standardize and facilitate the document creation and maintenance process, and to create a corporate documentation production system.
The new system would be required to provide a standard documentation source format capable of supporting four different document types. The system would need to be robust enough to handle the production of large software and hardware manuals (in all the required data formats), and also facilitate the reuse of source data in a multi-platform environment. In addition, it should also make possible the provision of on-line information, allow the translation of existing system data, and support multilingual documents. SGML was the obvious solution to many of these problems, and was adopted accordingly. The anticipated benefits to both the corporation and to users were comparable to those that have been outlined previously in other presentations (i.e. cost savings and improved productivity, consistent document structures, greater information interchange and re-use etc.). The new system was implemented in a two phase process - the first phase designed to prove the principal concept(s); the second to produce the production-ready system. Development was done jointly by three divisions (systems integration, electronic publishing and corporate publishing services) using a modular approach. One DTD provides the structure necessary to support all the required variants of Intergraph user documentation. Filters have been developed to allow the conversion of legacy data from existing tagged ASCII and FrameMaker formats. The success of the introduction of the new system has depended upon the cooperative efforts of all concerned - developers, input from users, and support from management.

6.4 Jon Bosak (Novell)

(I also missed this session.)

7. International SGML Users' Group Meeting

This was the mid-year meeting of the International SGML Users' Group (ISUG), the AGM having been held at SGML Europe '93 in Rotterdam earlier this year. After a welcome from Pam Gennusa (President of the ISUG), representatives from many of the Users' Group National Chapters addressed the meeting. Most of the Chapters claimed a membership of around 45-70 individual members and a handful of corporate members. Several people took the opportunity to announce the recent formation of new National/local Chapters (e.g. in Denmark, Sweden, US Tri-State, US Northern California), whilst others expressed an interest in setting up such groups (e.g. in US Alabama/South East [Beth Micksh] and Ottawa [Ken Holman]). Pam announced that Richard Light, an independent consultant based in the UK, had taken over from Francis Cave as the Treasurer of ISUG following a vote by the ISUG Committee.

Some Chapters had been extremely active in the preceding twelve months, staging numerous well-attended events (often vendor-oriented). The first event staged by the newly formed Swedish Chapter attracted around 200 attendees to a special one-off SGML conference, although they do not yet have a clear idea of the size of their ordinary membership. Several Chapter representatives/members reported feelings of apathy within their groups. Some Chapters had held only one or two events in the past year and were having difficulties attracting active members or developing programmes which would re-enthuse existing members to support Chapter activities. Members from those Chapters which had organized several successful events outlined the strategies and policies that they had used, and a brief discussion ensued for the benefit of existing and newly forming Chapters.

Pam spoke briefly about the software which has been released through the ISUG (i.e.
the ARCSGML Parser Materials, and the IADS software). She wished to re-emphasize that any software so released is not in any way endorsed, approved or checked by the ISUG. The ISUG does not have the resources to undertake software investigation or evaluation, but is willing to consider facilitating the distribution of any software which might be of interest to members of the SGML user community. Pam also outlined the relationship that has been established between ISUG and the SGML Open industry consortium; the ISUG has a non-voting place on the committee of SGML Open, and has put forward a proposal that ISUG members may be willing to participate in a case study exercise for SGML Open.

Brian Travis reminded all present that the monthly newsletter "<TAG>" is available at a special subscription rate to ISUG members. Anyone interested should contact him for more details (Phone: +1 303-680-0875 Fax: +1 303-680-4906). Daily editions of the <TAG> newsletter were also being circulated throughout the duration of the conference. The next meeting of the ISUG will be the AGM, to be held at SGML Europe '94 (Montreux, Switzerland, May 15-19th 1994).

8. "Real World Publishing with SGML" - Terry Allen (Digital Media Group, O'Reilly & Associates, Inc.)

O'Reilly's online "Whole Internet Catalog" first appeared in printed form, produced using a troff-based approach. The structure of the information was very loose, with the title being the only common structural element shared between all the entries. Terry described how he managed to develop an online version of the "Whole Internet Catalog" by using a combination of sed and awk scripts to translate the source files into versions tagged with HTML markup. HTML, the Hypertext Markup Language, is a DTD developed for providing information on the World Wide Web (WWW) - a global network of information servers linked together over the Internet. HTML was first designed as a tag set, and although it can be used as a DTD there is no requirement that it should be. The HTML-aware browsers which are used to access information on WWW tend to treat HTML as if it is a procedural (rather than a descriptive) markup language. It is left to the creator of the information to decide whether or not to validate markup against the HTML DTD, as current implementations of the browsers do not parse any document they process, and are very (perhaps too) forgiving of any markup discrepancies. HTML is fundamentally lacking in terms of imposing any strong structural conventions (the occurrence and ordering of many elements are often optional), and its designers appear to have made some rather surprising decisions - such as the use of an empty <P> element to indicate the breaks between paragraphs.

Terry described how he has decided to filter all his files into a more constraining but still simple DTD. The source files for the "Whole Internet Catalog" are now more like records in a database than narrative text files, but they can be easily converted (using OmniMark and an awk script) into HTML. This approach has proved successful up till now, because Terry has been working alone to maintain and process the data. New authors for the Catalog will need to be trained and provided with tools to support authoring with Terry's DTD. Limited testing has shown that the most successful approach to adopt with new authors is likely to involve the use of SoftQuad's Author/Editor and a template file of tags to generate the source information.
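Terry's DTD is not reproduced in the report, but a minimal sketch of the record-like approach he describes might look like the following (the element names and the sample entry are hypothetical illustrations, not O'Reilly's actual markup):

   <!-- A record-style source entry, constrained by a simple DTD -->
   <entry>
   <title>Internet Relay Chat</title>
   <subject>Communication</subject>
   <descrip>Real-time group discussion carried over the Internet.</descrip>
   <access>telnet irc.example.org</access>
   </entry>

   <!-- One possible HTML rendering produced by a down-translation;
        note the empty <P> used as a paragraph break, as discussed above -->
   <h2>Internet Relay Chat</h2>
   <P>
   <em>Communication</em>
   <P>
   Real-time group discussion carried over the Internet.
   <P>
   Access: telnet irc.example.org

Because the source entries are this regular, a line-oriented tool such as awk, or an SGML-aware one such as OmniMark, can emit the HTML form mechanically; the structural information lives in the record-oriented DTD rather than in the delivery markup.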
Terry said that his experiences had shown that the authoring/editing process was less likely to be error-prone if HTML attributes were actually represented as ordinary elements in his DTD - and if good use was made of display facilities (e.g. use of fonts and colour) to facilitate at-a-glance structural checks by humans. He has also adopted a macro-based approach to map the source SGML files into gtroff for printing on paper - although developing more robust filters may be required in the future. Terry felt that the lessons he had learned were that the production and handling of SGML source is (currently) likely to be done in a heterogeneous environment (i.e. authoring using one set of tools, linking and processing using another, and so on). He is still looking for good, cheap tools which will support the use of arbitrary DTDs, but an ideal future tool might be a sophisticated browser (such as Xmosaic) which also provided an authoring mode which supported user-supplied DTDs. Terry reported that he has also been closely following the development of HTML+, a revised version of HTML, which might provide a more robust/constraining DTD.

9. "HyTime Concepts and Tools" - Dr. Charles Goldfarb (IBM)

Dr. Goldfarb began by providing a very brief overview of SGML, and the advantages to be gained from its use. He showed a simple example of some typical SGML markup, and then discussed the impact of SGML since its release as an ISO standard in 1986. SGML has become the dominant tool for producing structured documents, and has been widely adopted by industry, government and education. However, the real impact of SGML is that information owners, not programs, now control the format of their data; information creators/providers are no longer at the mercy of the proprietary solutions developed by vendors. Widespread adoption of SGML will in turn encourage the creation of Integrated Open Hypermedia (IOH) information and systems. IOH information is integrated in as much as all information is linkable, whether or not it was specially prepared with linking in mind. It is "Open" because the addressing of the linked location is not bound to a physical address until the link is "traversed" and the "anchor" accessed. "Hypermedia" represents the union of hypertext (information which can be accessed in a random order) and multimedia (information communicated by more than one means, e.g. text + graphics + audio + animation etc.).

HyTime, the Hypermedia/Time-based Structuring Language (ISO/IEC 10744), is an application of SGML that has been developed to enable IOH. HyTime standardizes the most useful and general constructs and concepts to do with linking hypermedia/time-based information. It facilitates hyperlinking to any information object (whether or not it is SGML), has a rich model for representing time, and supports an isomorphic representation of time and space. The success of SGML and HyTime is being driven by the fact that users are now demanding products that use real Standards, rather than those which merely offer proprietary solutions. HyTime standardizes the use of SGML for hypertext and multimedia. It provides sets of standardized SGML attributes (called "architectural forms"), which convey the information used by the useful/general hypermedia constructs mentioned in the paragraph above. Architectural forms can be recognized by suitable processing software, and decisions or actions taken on the basis of the values of the attributes.
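The talk itself did not include the following, but a minimal sketch may make the idea concrete: HyTime's contextual link ("clink") form is conveyed by fixed attribute values which a HyTime engine can recognize on an otherwise application-specific element (the element name xref is an illustrative choice here, not something mandated by the standard):

   <!-- Declaring an application element as a HyTime contextual link -->
   <!ELEMENT xref - - (#PCDATA)>
   <!ATTLIST xref
             HyTime  NAME  #FIXED "clink"   -- architectural form name --
             linkend IDREF #REQUIRED        -- ID of the element at the other anchor -->

   <!-- Used in a document instance -->
   See <xref linkend="wiring">the wiring diagram</xref> before starting work.

Because the form is expressed through attributes rather than through reserved element names, each application keeps its own vocabulary while any HyTime-aware tool can still recognize and traverse the link.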
HyTime extends SGML hierarchical structures by facilitating the lexical modelling of data content and attribute values, and also by supporting inheritance of attribute values. HyTime extends SGML hyperlinking (i.e. IDREF) capabilities, and adds co-ordinate structures (called Finite Co-ordinate Spaces) to handle the alignment and synchronization of information objects in time and space. Using HyTime means that we no longer have to deal with a single SGML document, but can make seamless use of whole libraries of documents or pieces of documents.

Dr Goldfarb then talked about the development and release of two Public Domain HyTime tools: ObjectSGML and POEM. ObjectSGML is an Object-Oriented SGML parser which supports incremental parsing, entity structures as they were originally envisaged during the development of ISO 8879, and the processing LINK feature. It also offers native HyTime support - validating architectural forms, handling location addressing, and processing HyQ queries and properties. The source code will be made publicly available for free. POEM, the Portable Object-oriented Entity Manager, provides a platform-independent interface to real physical storage. It supports ISO/IEC 9070 public identifiers for universal object identification, ISO/IEC 10744 Standard Bento (SBENTO) and ISO 9069, the SGML Document Interchange Format (SDIF). It has no parser dependencies, so it can be used with any of the existing SGML parsers. As with ObjectSGML, the source code will be made publicly available for free.

ObjectSGML and POEM are the results of Project YAO, conducted by the International consortium for free SGML software development kits (SDK). Participants included Yuan-ze Institute of Technology (Taiwan, ROC), IBM (in both the US and France), Naggum Software (Norway) and TechnoTeacher Inc (in the US). Between them, the development team had extensive experience of developing SGML-aware tools and systems. Implementations of SDK will use the standard C++ class library, be entirely platform-independent, and be based on proven products. The architecture of ObjectSGML is built around a low-level "parser event" API, a variable-persistence cache, a high-level "information object" API, and uses POEM for entity management. Most existing SGML parsers use the low-level "parser event" approach (i.e. passing all start/end tags, attributes, data etc. when found, whilst only retaining the current structural context). The use of a high-level "information object" API means ObjectSGML will provide access to both the element and entity structure of a document; addressing can be done using the HyTime location model ("proploc" and HyQ). The variable-persistence cache will be maintained by the application in a proprietary format; it will allow rapid access to information found by the parser event API, and can be optimized for an application. Using such a cache avoids re-parsing, since it will hold such values as "next event", "element", "entity", "parsing context" etc. Alpha test versions of the software were shipped to test sites last Thursday. The results of testing will determine what revisions may be required. However, the current intention is that both ObjectSGML and POEM should be publicly available no later than the end of the first quarter of 1994.

10. "HyTime: Today's Toolset" - Dr. Charles Goldfarb (IBM) and Erik Naggum (Naggum Software)

SGML was developed for the benefit of information owners.
It requires that information representation should be independent of applications and systems, which means that more than one representation is always necessary: an abstract (logical) representation, one or more perceivable presentations, and an internal storage representation. The real storage of information is obviously platform dependent at some stage, but SGML liberates information from such dependencies through the use of entities and an entity manager. SGML entities can take a variety of forms. External identifiers, such as those in DOCTYPE, ENTITY, LINKTYPE, NOTATION and SGML declarations, all declare (and some also reference) entities. A system identifier is really a "locator" in as much as it specifies the physical location where an entity can be found; it involves the use of a Storage Object Specification (SOS), and a storage manager. There is no requirement that an entity should occupy all the content of a storage object, so the manager must be able to extract substrings (as well as handle things like record boundary insertion and omitted system identifiers). The use of public identifiers means that registered and formally identified entities can be recognized by conforming SGML systems. The practical implementation of the formal registration procedures outlined in ISO 9070 has yet to be finally sorted out by the selected registration authority (the GCA). Until this is done, it is perfectly possible to formally identify public entities using ISBNs - and this approach has already been adopted by companies such as IBM. It should also be remembered that the entities associated with public identifiers can also exist in different versions; for example some SGML software requires that DTDs are precompiled before the system can use them.

Erik Naggum spoke briefly about Formal System Identifiers (FSIs). An FSI lists a Storage Object Specification, giving details of the storage system type, storage object identifications, record boundary indicators, and substring specifications. The record boundary indicator was felt to be necessary because many people now move files between DOS, Unix, and Mac systems, and each of these has a different way of indicating record boundaries. Erik showed an example of how files on two different types of systems could be concatenated and passed to an Entity Manager as if they were a single unit.

Interchange facilities involve a separation of the virtual entity management from the real physical storage entity management. SDIF (ISO 9069) details how SGML objects can be combined into a single stream for the purposes of interchange. SBENTO is described in the HyTime standard (ISO/IEC 10744); whereas conventional Bento uses a directory-based approach to control the packing of objects into a single stream, SBENTO uses SGML entities to make the process simpler. It is also possible to interchange SGML objects packaged using conventional archiving tools (e.g. PKZIP). Good entity management is very important to the success of any SGML-based system. Entity structure is a "virtual storage system" which isolates the element structure from system dependencies, and allows storage-related properties and processes. Entity structure is (literally) the foundation of SGML; it supports both SGML-aware and non-SGML aware storage and programs, and also allows the successful interworking of both types. The soon-to-be released POEM (Portable Object-oriented Entity Manager), announced in Dr Goldfarb's previous presentation, implements all the principles of good entity management. A copy of the POEM specification (version 1.0, alpha level 1.0) was distributed to attendees.
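As a concrete illustration of the public identifier mechanism described above (a hedged sketch, not an example from the talk; the ISOlat1 identifier is the standard one, while "Example Corp" and chap3.sgm are hypothetical):

   <!-- A registered public identifier for a standard entity set; the entity
        manager, not the document, decides where the corresponding text lives -->
   <!ENTITY % ISOlat1 PUBLIC "ISO 8879:1986//ENTITIES Added Latin 1//EN">
   %ISOlat1;

   <!-- An owner-specific public identifier; the trailing system identifier is
        optional, and can be omitted and resolved locally by the entity manager -->
   <!ENTITY chap3 PUBLIC "-//Example Corp//TEXT Service Manual Chapter 3//EN" "chap3.sgm">

Declarations like these are what allow the same document to be moved between systems: only the mapping from public identifier to real storage object has to change, which is precisely the kind of work an entity manager such as POEM is intended to take over.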
11. "Charles Goldfarb, Please Come Home: SGML, the Law, and Public Interest" - Ina Schiff (Attorney at Law) and Allen H Renear (Brown University)

(Ina Schiff was unable to give her part of this presentation, which was given on her behalf by another speaker.)

The conventional wisdom is that SGML is best-suited to long-lived, structured documentation such as technical manuals. However, the purpose of this presentation was to suggest that it could be effectively applied to the handling of structured legal documentation of the sort regularly produced by attorneys on behalf of their clients. Attorneys are used to researching and supplying the information content that comprises the key parts of most legal documents. However, they generally leave the structuring of the information and the inclusion of legally required ("boilerplate") text to paralegal and secretarial staff. Adopting an SGML-based approach to the creation of such structured documents would mean that attorneys would no longer have to rely on other members of staff to correctly structure their texts and include required elements etc. A well-featured SGML-based system could easily provide a good authoring and editing environment in which to create and revise these kinds of documents. It is possible to imagine a future in which electronic multimedia structured documents are acceptable as submissions in court. If such documents were also archived in the main legal text databases, it would greatly facilitate the generation, delivery, interchange and reuse of legal information. This would represent a better service to clients, and hopefully lessen the tremendous amount of paperwork which is currently required as part of the legal process.

Allen Renear strongly argued for the collaborative development of any DTDs for the sorts of documents mentioned above. He suggested that the legal community could usefully benefit from modelling the approach adopted by the academic community when developing the Text Encoding Initiative's (TEI) Guidelines. Allen gave an entertaining account of how he had been called as an expert witness to defend Ina Schiff's use of SGML to produce her own structured documents, after an opponent who had lost a case to Ina contested the size of her fees (having had costs awarded against them). The case against Ina suggested that by entering the data content herself, she had actually been performing a secretarial task, and so the work should have been charged at an appropriate rate; they also alleged that Ina had made substantially less-than-average use of paralegal and secretarial assistance when preparing her case. Allen argued that her use of an SGML-based, structured document authoring environment had allowed her to get on with the job of producing information content, and to do it more efficiently; Ina won the case. Allen said he would now be looking forward to the time when attorneys could be taken to court for not using SGML when preparing documents for a case.

12. "Online Information Distribution" - Dave Hollander (Hewlett Packard)

There are several current barriers to the delivery of online information. Authors only have a limited selection of authoring tools, they are not used to reusing information, and the information they receive/deliver can be inconsistent. Publishers have no standard tools to process the information they receive from authors, whilst customers have specific hardware and software requirements which may be unique to them.
This means that throughout the process of developing online information, different convertors are required for every environment - and this creates particular difficulties for large companies like HP, who now produce 3-4 gigabytes of information each month. The kinds of information that might be distributed online vary considerably. A typical list might include such things as: context-sensitive and task-oriented application help, online reference manuals, multimedia (graphics, audio, video), hypertext, information that is conditional on the current environment, history or other factors, and so on.

HP have come up with a short term solution, which is not ideal in SGML terms, but fits their purpose. The Semantic Delivery Language (SDL) is a delivery format defined by a DTD; it provides an intermediate language/format in which to deliver information, and it also facilitates tool development and information reuse. SDL's development was entirely driven by practicalities, rather than a wish to experiment with SGML-based techniques. Achieving good performance was the number one issue, so documents were broken down to allow for multiple entry points. The designers of SDL also had to give up some SGML features (e.g. markup minimization), use pre-calculated counters for chapters etc., use a DTD which would work with simple/cheap parsers, and use a small number of elements. Certain parts of the documents (e.g. the table of contents, indexes etc.) are pre-computed before display to improve performance. SDL had to be flexible. The system had to support font and page specifications which are separated out from the document content (but which allow flexible displays). A normalized set of semantics is included, although the designers also allowed fourteen different types of system dependencies [?]. There are a variety of link types, but SDL is not (yet) HyTime compliant. A version id is placed on all containing elements to support version control/viewing if the tools being used are powerful enough. SDL is intended to provide a structure which can help a reader get to the right information at the right time. Source semantic identifier attributes allow individual and/or groups of elements to be easily identified - but this approach is not ideal, and HP may adopt a HyTime-based approach in future. SDL's designers put a lot of thought into the development of filters; from the first, SDL was planned with filtering in mind. The inclusion of display level semantics facilitates filtering from SDL, and from other procedural formats that are still being used. SDL is a complete information model, in that it includes formatting (DSSSL-type) information as well as structuring information. SDL hierarchical modelling also makes it easier for non-SGML aware programmers to develop tools quickly, and gets them interested in the concepts of SGML. However, the real value of SDL lies in the fact that HP's customers can now get consistent information distributed online.

13. "Xhelp: What? Why? How? Who?" - Kent Summers (Electronic Book Technologies)

Xhelp is a standard online help solution for the Unix/X-Windows environment. Xhelp was not produced by EBT in isolation, but is the result of collaboration between numerous X/Unix developers. The current situation is that each vendor often has their own solution to providing online help - which makes the whole situation very complex, and makes life difficult for the end users.
Current solutions to this problem either provide less effective help, or are more expensive. The people involved in producing online help (writers, designers, programmers, managers etc.) are not always able to collaborate, and any new solution has to bear this in mind. Online help must be consistent, and should exist as a service which is separate from the client applications. Applications and online help often have different release cycles, so online help should remain uncoupled from applications in order to support incremental updating of content. Moreover, any solution to the problems of providing online help must be truly cross-platform for both applications and documents. The Xhelp solution makes use of the existing standard communication layer in X. Information representation and display favours using SGML, but the display system also supports PostScript, ASCII etc. to allow for the easy inclusion of legacy data. Kent then spoke about the Xhelp architecture (which separates the communication logic, data representation, formatting and display tool layers), and the Xhelp protocol. He described the various parameters that comprise the Xhelp client message content, and outlined the advantages to be gained from using the Xhelp architecture in relation to the Xhelp protocol and Xhelp.dtd. Advantages to be gained from Xhelp's procedural approach included such factors as: programmers no longer have to do anything to support on-line help; the time spent on writer/programmer collaboration is reduced; the fallback algorithm works to provide some help, even if the user's query cannot be answered directly; it supports context-sensitive and task-oriented help; and it offers "fill-in-the-blank" templates for linking to help information. The Xhelp.dtd means that authors gain reusable skills, since they become familiar with a single content model and set of authoring tools. Other advantages of Xhelp included: no installation, with help distributed independently of applications; good performance; and the encouragement of a good environment for business/competition. Kent talked through the online help production process from the programmers' and writers' points of view, comparing the traditional approach with an approach based on the use of Xhelp. The Xhelp approach requires far fewer steps (4 as opposed to 12) and the programmers' input is reduced to zero. Xhelp provides numerous benefits to Unix systems vendors, independent software vendors, and to end users. It provides a cost-effective, cheaper solution, with positive benefits for all involved. (The Xhelp developers' forum and DTD are maintained on the O'Reilly & Associates ftp site.)

14. "Digital Publications Standards Development (DPSD): A Modular Approach" - Horace Layton (Computer Sciences Corporation)

DPSD is a three-phase programme to streamline and modernize the acquisition, preparation, production and distribution of information for the US Army. The developers hope to have the finished standard available next summer. The MIL-STD-361A standard is the flagship product of the DPSD programme. It is task-oriented, and consolidates six former technical manual standards into one. The DTD eliminates chapters, sections, and most paragraph numbering requirements, and focuses on structuring information content rather than style or formatting issues. Horace talked through the evolution of MIL-STD-361A, and the products produced by the end of DPSD phase II (in June 1993) - which included several DTDs and a number of FOSIs. He then looked at the programme for DPSD phase III.
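[To give a feel for what such content-driven, task-oriented markup might look like, the following is a purely hypothetical sketch; the declarations are invented for illustration and are not taken from MIL-STD-361A. A maintenance task is captured as a titled sequence of steps and warnings, with no chapter or paragraph numbers stored in the data, so that numbering can be generated at presentation time and the same module type can be reused wherever the same functional requirements apply.]

   <!-- Hypothetical sketch only: illustrative names, not MIL-STD-361A declarations -->
   <!ELEMENT mtask   - - (title, setup?, step+) >
   <!ELEMENT step    - - (warning*, para+) >
   <!ELEMENT (title | setup | warning | para) - - (#PCDATA) >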
The concept behind MIL-STD-361A is that the majority of maintenance manuals, regardless of the level of maintenance, contain similar functional requirements. Knowledge of technical manuals and analysis of technical manual data allows one to group functional requirements into modules of information. Creation of DTDs for these modules allows use of the same modules wherever the same functional requirements are imposed. There is only one DTD requirement for each technical content volume. The approach can be used for all levels of maintenance (operator, unit, direct support/general support, and depot). A single DTD is used to assemble all the required modules into a complete technical manual (TM). The DTDs in MIL-STD-361A are content driven, and comply with both ISO 8879 and MIL-M-28001. The DTDs are quite small in size (8-10 pages), and currently only seven DTDs cover all the Army's TM requirements contained in MIL-STD-361A. The DTDs are intended to be easy for authors to understand and use. Horace showed a diagram of the MIL-STD-361A technical manual structure, and another giving an overview of the MIL-STD-361A concept. He then briefly described the relationship between the MIL-STD-361A and MIL-D-87269 (IETM - Interactive Electronic Technical Manual) standards. Horace closed with a description of the field test/proof of concept objectives, and the test sites and schedule. Testing is due to finish by the end of April 1994, with the final document ready for submission by May/June. The DPSD programme means that the US Army is well on its way to achieving a full digital delivery capability.

15. "A Practical Introduction to SGML Document Transformation" - David Sklar (Electronic Book Technologies)

This presentation looked at the requirements for, and features of, an "SGML transformation engine", which David likened to "the Swiss Army pocket knife of the SGML community" (i.e. an extremely useful, multi-featured tool - although this description somewhat loses the humour of David's remarks). David began by proposing his personal nomenclature for distinguishing between "conversions" and "transformations". Conversions are "Up and Lexical" in that they involve the (upward) conversion of non-SGML unstructured source information into SGML, on the basis of a content-identified (i.e. lexically-based) process; it is difficult to achieve 100% accuracy during conversion. Transformations are "SGML and Down" in that they involve the (downward) translation of SGML, content-identified, validated data into a non-SGML form; it is possible to achieve 100% accuracy during these types of transformation. This presentation focused on SGML transformations (rather than conversions). A dreaded "chasm" exists between the optimal in-house DTD, and the processor needs/distribution DTD. An optimal in-house DTD is based on a content-driven design, is formatting-free, and supports omission of generatable data. A "processor needs" DTD is useful for people who want to output hardcopy or to publish online - but this is still a relatively young industry, and the processors have limitations (e.g. a processor like DynaText cannot do such things as the auto-numbering of heterogeneous element types; it is inefficient at calculating list-adornment types (e.g. bullets) based on context-sensitive algorithms; and it should not be used to auto-generate text, as this will be unknown to the search engine). A "distribution DTD" might be something like an industry-standard DTD (e.g. ATA, J2008 etc.) or a DTD required by a major customer.
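[A minimal, hypothetical illustration of the gap described above; the declarations are invented for this report and were not shown in the talk. In the optimal in-house form the chapter number is never stored, because it is generatable; the processor-oriented form carries it as an attribute, so that a delivery engine with no auto-numbering capability does not have to compute it.]

   <!-- Hypothetical in-house (author-oriented) declaration: no generated data is stored -->
   <!ELEMENT chapter - - (title, section+) >

   <!-- Hypothetical processor-oriented declaration: a transformation fills in
        the values that the delivery processor cannot compute for itself -->
   <!ELEMENT chapter - - (title, section+) >
   <!ATTLIST chapter chapnum  CDATA #REQUIRED
                     tocentry CDATA #IMPLIED >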
This situation usually results in the in-house DTD being compromised to bridge the gap between the optimum and processor needs/distribution DTDs. Compromising the optimal DTD is the wrong solution to adopt. It leads to a number of additional costs (e.g. disruption of the current authoring/conversion process, the DTD's portability/lifespan is shortened etc.). The solution is distasteful in so far as it represents a change in long-term strategy to compensate for temporary deficiencies in current SGML processors. Ultimately, the aim should be to bring the SGML consumer to the data, and not to force the data to travel to the customer. The correct way to bridge the chasm is to build an "Xform Bridge" (automated SGML transformation) - to transform data which conforms to the optimal in-house DTD into a form that complies with an interim, "process-ready" DTD. This way, the author/review environment is not affected by the transformation process, and the possibility that the transformation can be 100% automated means that it can be potentially 100% accurate. Moreover, the interim instance need not conform to an actual DTD, as this is not required by some processors (e.g. DynaText), and validation may be deemed optional if the transformation process is itself validated. David then compared the main characteristics of an "Author-Ready" DTD with those of a "Process-Ready" DTD. An Author-Ready DTD will contain no information that can or should be auto-generated; authors are allowed to focus on content which they alone can produce. David called this approach the "noble but not usually practical" form of SGML (emphasizing that it is not riddled with "compromise" attributes designed to satisfy the processor). A Process-Ready DTD will contain auto-generated information that is needed by the processor - and David called this the "real-world" form of SGML (in the sense that this is how SGML systems are typically implemented). Transformations can also be used for a variety of other purposes, such as importing data from external sources (e.g. generating tables from information held in a database). Transformations can also be used to automate "mindless" or "formatting-based" authoring tasks, such as calculating a list-adornment type. They can be used to perform certain types of document analysis, for example producing a report on some data (e.g. statistics, frequency counts etc.). A transformation engine can also assist the semantic validation of data; for example, it could check the validity of the value of a "date-month-year" attribute (which an ordinary SGML validating parser will not check). Transformations could also help to extract/duplicate data (e.g. generating an index or table of contents), and to hide data on the basis of an intended audience. Furthermore, a transformation engine could apply style information during the transformation process - which would facilitate the fast on-line display of the resulting SGML documents. Most of the existing transformation engines fall into what David called the category of "parser with a print statement" utilities (i.e. the current context is maintained using a simple element stack etc.). This approach is limited inasmuch as it offers no lookahead capability, and only limited "lookbehind" (typically only to ancestors and left siblings). The output side often has no SGML intelligence, and so it is quite easy to output invalid documents. Another category of transformation is the "tree analyzer with a print statement" utility.
There are very few tools of this type (e.g. EBT System Integrators Toolkit), and although they allow unlimited access to the document tree (with arbitrary look-ahead/behind, and re-ordering of input elements), there is still no SGML-awareness on the output side. There is a third category of transformations, which David called "tree transformation" utilities (e.g. the Polypus library extensions to Balise from AIS). To date, this would appear to be the only product with SGML-awareness on both the input and output sides of the transformation process. David then proposed a number of tips that might prove useful when comparing transformation engines. Check the software externals, such as the range of platforms supported, speed, RAM requirements etc. Also check the software internals, i.e. whether or not the scripting language is comfortable/easy to learn/offers good diagnostics/is extensible, and look at the script-debugging aids, the error recovery during parsing of an input document, access to the input tree, pattern matching, and access to external data. David had hoped to talk about the role of GLTP (the General Language Transformation Processor) as defined in the DSSSL standard, but unfortunately he did not have time. (N.B. GLTP is likely to be renamed as STTL, the SGML Tree Transformation Language, in the next draft of DSSSL.) David also wanted to talk about "GLTPeeWee", the demonstration prototype GLTP transformation engine that he hopes to release into the Public Domain, but this had to be left until a special evening session.

16. "SGML Transformers: Five Ways" - Chair: Pam Gennusa (Database Publishing Systems Limited)

This session was designed to give an overview of how various tools would solve some typical transformation problems (e.g. transforming document instances from one DTD to another, or between an SGML document instance and something else). A number of tools were represented, and the problems were specified in advance by the speakers themselves.

The Tools:
Balise and Polypus (AIS/Berger Levrault) - Christophe LeCluse
SGML Hammer (Avalanche) - Ludo van Vooren
The Copenhagen SGML Tool [CoST] (Public Domain, University of Copenhagen) - Klaus Harbo
GLTPeeWee (Public Domain, David Sklar) - David Sklar [unfortunately, David was not able to present his GLTPeeWee solutions to the problems, because his notes had disappeared]
OmniMark (Exoterica Corporation) - Andrew Kowai
TagWrite (Zandar Corporation) - Harry Summerfield

The Problems: (N.B. most problems were accompanied by a sample case of the type of DTD outlined in the problem specification, and some also included sample post-transformation output.)

1) INSTANCE NORMALIZATION: Starting from an arbitrary source SGML instance with potential markup minimization, generate a non-minimized result instance in which all SGML information according to ESIS has been made explicit, and which can still be parsed using the same DTD. The program should be written to be DTD independent. This kind of processing is extremely frequent. The result instance can be easily post-processed by non-SGML systems (e.g. typesetting system loaders) or by minimal SGML systems. [Set by AIS]

2) DICTIONARY INSTANCE SPLIT/MERGE: Given a dictionary made of entries, split this instance into as many files as there are entries, while generating an SGML skeleton which keeps track of the entry file names. In a second step, perform the inverse operation: use this skeleton to gather all entries (stored one per file) and re-create the original instance.
This exercise is a simplified version of a very common operation in database loading and extraction situations. It stresses the input/output capabilities of the application language. [Set by AIS]

3) STRUCTURAL TRANSFORMATION: Starting from a source instance described by a hierarchical type DTD, generate a "flat" instance described by a "flat type" DTD. (This kind of problem is very typical of problems encountered when generating input for a non-SGML word processor or DTP tool; "flattening" to some kind of "transit" DTD is often needed.) In a second step, do the opposite: from the generated "flat" instance, re-generate the original hierarchical instance. (This kind of problem is very common in retroconversion.) Note: the two sample DTDs are designed to illustrate recursive definitions, and are not meant to be really useful in the real world. [Set by AIS]

4) ID TYPE CHECKING / ID COMPUTATION: An SGML parser checks the uniqueness of ID attributes in the instance and checks that IDREF attributes are resolved, but this does not by itself guarantee the correctness of the cross-references. In cases where cross-references are "typed" (i.e. a cross-reference to a figure is not the same as a cross-reference to a table), checking the type of the elements associated with the ID/IDREF attribute pairs provides an additional checking level. This problem examines how to handle such a task. As an auxiliary task, participants were asked to fill in implied ID attributes.

5) FLOATING ELEMENTS CONSISTENCY CHECKING: Floating (empty) elements are often used in DTDs to mark up revisions or internationalization. These empty elements occur in pairs with a "starting" and an "ending" element. When such elements are declared floating, an SGML parser cannot check much about the way they are used. This problem examines how to handle that task with an application language.

6) CALS TO AAP TABLE CONVERSION: The general problem consists of transforming SGML structured tables from the CALS table DTD to the AAP table DTD. The transformation program should handle: the general structure of the table, spanning, and borders. The program should be DTD independent. That is, given a DTD including the CALS table description, the program can take any instance of this DTD and output the given instance where all CALS tables have been replaced by AAP tables. [Set by AIS]

7) DUPLICATION OF DESTINATION TEXT AT SOURCE OF XREF: Consider a DTD that contains SECTREF elements that represent cross-references to SECTION elements [DTD example omitted]. The transformation should replace each input-DTD SECTREF with an enhanced output-DTD SECTREF [example omitted]. When a SECTREF is encountered during the transformation, the engine should automatically generate this text as the content of the output: "SECTREF: see 'XXXXXX'", where XXXXXX is a copy of the contents of the TITLE of the SECTION that is referenced by the SECTREF. Note that TITLE is not atomic; a TITLE instance can contain an arbitrary number of EMPH subelements. Also note that it is possible for a SECTREF to appear before the SECTION to which it refers. The transformation should perform error checking to ensure that exactly one SECTION is referenced, and report problems appropriately in a 'non-fatal' way (i.e. it should continue processing and produce a usable document instance). [Set by AIS]

8) EXTRACTION/SORTING: Consider a DTD that allows FIGURE elements to be placed anywhere in the body of the document. Some figure elements have caption attributes, but not all of them do: [DTD example omitted].
Create a transformation that appends an element called FIGREVIEW to the end of the ENDMATTER's child list: <!ELEMENT FIGREVIEW - - (FIGURE+)>. FIGREVIEW simply contains copies of all the FIGURE elements found in the body of the document, sorted lexicographically based on the figure's caption. (Assume all text is ISO 8859-1.) Uncaptioned figures should not appear at all in the FIGREVIEW section. Note that each FIGURE element is actually the root of a subtree of arbitrary size (I intentionally don't show the content model for HOTSPOT); make sure the entire subtree of each captioned FIGURE is duplicated in the FIGREVIEW section. (I am interested in knowing if that is possible, in your technology, without full knowledge of the content model of HOTSPOT). [Set by EBT]

9) ROW COUNTING: While converting an SGML marked-up table to some other form (e.g. a typesetting language), count the number of rows and columns in the table and put the total in a specified position at the start of the table. [Set by Exoterica]

10) LIST MARK FORMATTING: Given a <list> element that can have other <list> elements within its content, and which has an optional attribute whose presence determines the manner in which items in each list are marked or numbered, and whose absence indicates that the mark or numbering form is to be deduced from the ancestry of the <list> (e.g. a <list> within a decimal-numbered list is to use lower-case letters), correctly mark or number each list item. Note that alignment of the text following the marks/numbers can be dealt with as another problem. [Set by Exoterica]

11) DATE: Output the current date in a form determined by an attribute value in the SGML document. [Set by Exoterica]

12) LINE BREAKING: In the process of converting an SGML document to some other form (or even SGML), produce output that has a given number of text characters on each line, not counting any commands or tags in that other form in the total, and adjusting the length of each line so that breaks only occur at "allowed" points (such as at spaces and existing line breaks). Require that there be no trailing spaces at the points of the breaks (i.e. none of the preceding lines have trailing spaces).

13) RESOLVE EXTERNAL CROSS REFERENCES: Given two or more SGML documents, each of which contains references to objects and locations in themselves and other documents in the set, replace each reference by an identification of the referenced object or location, the identification having been defined by the processing of the target, whether it be in the document doing the referencing or some other. [Set by Exoterica]

14) ICADD TRANSFORMATION: This transformation exercise gives all of the participants the opportunity to implement (and if necessary make suggestions for) the mapping techniques which have been designed to allow any DTD to carry with it the information needed to be turned into the input (with an instance) to an automated Braille translation process. This is, admittedly, quite a complicated exercise by the end, and people may wish to build processors only for the earlier attributes. [Yuri Rubinsky/ICADD]

The Solutions:

1) INSTANCE NORMALIZATION: Christophe said that Balise/Polypus could solve this problem in about three lines of code (but this was not shown). Ludo devised a simple SGML Hammer module to handle any start-tags (and attributes), data content, and element ends; a couple of other simple, very short modules were also required to provide a complete solution.
Klaus said that CoST could also cope with these requirements, and summarized his solution. Andrew said that OmniMark could solve the main bulk of the problem, but could not provide a solution that would be truly DTD independent (however, this would only require a simple fix, and it will be corrected in future versions of OmniMark). Harry said that TagWrite could solve the general problem using two or three simple rules and its notion of supertokens; however, the solution would probably not be completely DTD independent. David Sklar suggested that the problems for GLTP will come when trying to handle empty elements.

2) DICTIONARY INSTANCE SPLIT/MERGE: The panel agreed that the key to any solution is handling the fact that the application will need to address several external files. Christophe said that this problem would be quite simple for Balise. SGML Hammer cannot do multiple file output, so Ludo suggested that this case might cause difficulties for the product. CoST could handle the problem fairly easily, although some thought would need to be given as to how to make the filenames unique, and making the filenames out of PCDATA content might also be a bit tricky. OmniMark would find the problem simple to solve, whilst TagWrite could provide only a partial solution (the remainder could be solved via the use of simple WordPerfect/Word macros).

3) STRUCTURAL TRANSFORMATION & 10) LIST MARK FORMATTING: These problems were combined, because they share fundamental similarities. Balise/Polypus can solve the list mark problem, although the nesting and un-nesting might be a difficult area. A Polypus library function could be used to do the structural transformation. OmniMark could solve the list mark formatting issue by using pattern matching and text rebuilding, and the structural transformation problem can also be solved (in each case the code used in the modules was shown). SGML Hammer (and the Louise scripting language) could cope with the list mark formatting problem (by using arrays, and acting on the basis of context). TagWrite can solve the list problem (although Harry admitted that it entailed using hideous SGML); the structural transformation problem is quite familiar and solvable with TagWrite.

4) ID TYPE CHECKING / ID COMPUTATION: Balise can solve the type checking of ID/IDREF elements - by going through the document building a data structure of IDs and IDREFs and then checking that everything matches up and resolves correctly (if not, it reports an error). The Louise/SGML Hammer solution would be similar to that adopted by Balise. Using an array-based approach, it would first read the whole file into the array (to avoid having to perform any lookahead), then pass through the array to check that all ID/IDREFs resolve. At the end of the process, Louise/SGML Hammer will output a report identifying any problems (e.g. any footnotes that have been used but not referenced). CoST offers two ways to solve this problem. One is the Balise/Louise approach. The other uses CoST's "tree mode", making use of object-oriented techniques to build a parse tree which it can then use to check ID/IDREFs. This would also be a two-pass process, in which the first pass builds the parse tree and the second does the checking. OmniMark would also solve the problem by first constructing associative arrays, which are then used to check ID/IDREFs. Any errors would be output to an error log. TagWrite does not take this approach, but instead uses counters to keep track of things within the document.
The counter-based approach can be used to place IDs in a document that has no IDs, but if some IDs are present/missing, then TagWrite would not be able to cope (and it would be necessary to develop some supporting solutions).

5) FLOATING ELEMENTS CONSISTENCY CHECKING: [I do not appear to have any notes on the answers to this problem, other than what appeared in subsequent handouts. I believe that most of the panel considered that the solution to this problem would be essentially similar in technique to that described for problem number 4).]

6) CALS TO AAP TABLE CONVERSION: Christophe said that AIS did not really expect anyone to be able to provide a simple solution to this common problem, as the real issue is one of tree building. Balise/Polypus solves the problem by using the Polypus library. Ludo said that although the problem appears complicated, it is not actually very difficult, because all the data stays in the same order (therefore, SGML Hammer could provide a linear rather than a tree-based solution). Klaus said that he did not even attempt to provide a solution using CoST, because he was not familiar with the DTDs and it would have taken him too long to understand them. Andrew offered a skeletal OmniMark solution, which relied on the use of OmniMark's macro facilities rather than a wholly SGML-based approach.

7) DUPLICATION OF DESTINATION TEXT AT SOURCE OF XREF & 13) RESOLVE EXTERNAL CROSS REFERENCES: In setting his problem, David Sklar (of EBT) wanted to see if the Xrefs could be resolved in a single pass. Exoterica also wanted to see if it could be done in a single (or more) pass. (Here "pass" means taking in the data stream only once.) OmniMark can solve the second problem using two passes (by building an intermediate file) and then working from that to resolve the external cross references. The first problem can be solved using a single pass. The solutions to both problems offered by Balise/Polypus make use of several functions defined in the Polypus library. Essentially the problems are solved by the building and manipulation of a parse tree. SGML Hammer, like OmniMark, also has to do multiple passes to solve the second problem. The first problem could be solved in a single pass if the document has been written using entity references - where the entity references will be resolved during the parsing process [?] - otherwise this would also require multiple passes. TagWrite - the duplication issue is easy to solve using TagWrite supertokens. The second problem is non-trivial, and TagWrite could not resolve it. Using CoST, the first problem can be solved in a single "pass", because it can build and subsequently interrogate a parse tree. Klaus suggested that the second problem is really one of co-ordination, which could be resolved in various ways with CoST.

8) EXTRACTION/SORTING: David Sklar said that the GLTP solution to this problem is very elegant and simple, but he was unable to demonstrate this for the reasons mentioned above. Balise/Polypus can solve this problem using the Polypus library (i.e. using trees). Christophe showed some code, including the function called after the document tree has been built, which is required to perform some of the global actions. This problem is not very difficult to solve with SGML Hammer. It collects the subtrees during the parse phase, and handles them accordingly at the end. Ludo suggested that interesting solutions could also be created by using different types of DTD for authoring and processing the document.
CoST handles the problem by processing the figures in CoST's tree mode (ensuring that they will not be deleted), then extracting them from the tree and moving them to the appropriate place. OmniMark also makes this problem simple to solve. It puts the figures in an associative array at the end of the document, which can then be sorted. TagWrite could not do this, because it does not store associative arrays (nor build parse trees). Harry suggested that this was really a question of how to use SGML, not how to create/transform SGML, and that TagWrite was therefore never designed to handle this kind of problem.

9) ROW COUNTING: All the tools seemed able to cope with this problem (although a TagWrite solution was not offered, as Harry was speaking elsewhere). OmniMark solves the problem quite easily in one pass of the document (the code of the solution was shown). Balise can also solve the problem in a single pass (without recourse to using the Polypus library), using an approach similar in spirit to that offered for the ID/IDREF checking problem. SGML Hammer would simply store the relevant information in an array, then output it as appropriate. CoST would use a two-pass solution whilst in tree mode.

11) DATE: OmniMark is able to solve this in a single pass, taking advantage of OmniMark's built-in function to get the date. Balise would use a system call function, the results of which would be processed accordingly. SGML Hammer follows the same approach as Balise, but also uses a string processing function. For CoST the problem lies in parsing the attribute value; obtaining the date information and handling the formatting can be done internally.

14) ICADD TRANSFORMATION: Yuri suggested that this is quite a complicated problem, but that this is primarily due to the difficulties that arise from handling the multiplicity of possible inputs in the specification attributes. A solution is possible using Balise, but Christophe acknowledged that handling all the possible things that could occur in the attributes would be quite hard. Ludo suggested that this problem is a good demonstration of the use of architectural forms. Since processing is attached to specific attributes defined in the architectural forms, SGML Hammer would be well-suited to handling this kind of thing. Klaus said that CoST, like SGML Hammer, was very proficient at handling architectural forms, and so a solution would be possible. Andrew stated that a real-world solution (using OmniMark) to the problems of handling ICADD transformations was currently being developed at California University. At the end of the session, all the panel agreed that they would post their solutions in electronic form to the newsgroup comp.text.sgml.

17. "The Scribner Writers Series on CD-ROM: From a Great Pile of Paper to SGML and Hypertext on a Platter" - Harry I. Summerfield (Zandar Corporation), Anna Sabasteanski (Macmillan New Media)

Anna Sabasteanski works for the electronic publishing division of Macmillan New Media, which currently publishes about 50 electronic titles, mostly to do with medicine. The Scribner Writers Series represents a mix of writers in English - American, British etc. - who are also classed into numerous genres (such as children's authors). The decision about which authors to put on the CD-ROM was based on a survey of the texts most used in schools (about 550 authors are represented on the CD-ROM). Anna talked generally about the development process.
Many of the texts are only available as hot-metal/printed text, rather than in electronic form - therefore the initial data conversion costs were quite high. The developers had to decide how to differentiate the electronic form of a text and provide added benefits to encourage its use over the existing paper version. Copies of the paper books were physically broken up, so that the pages could be easily scanned. A specialist company was used to handle the conversion from scanned pages to SGML-tagged and validated files. The company guaranteed an accuracy rate of 99.5% (which was equivalent to about two errors per page). Markup errors were fairly easy to find and correct (using SGML-aware validation tools), although correcting these required human editorial intervention. The markup conformed to the AAP Book DTD, with a few corrections/amendments (e.g. to allow for extra entities necessary for ancient texts). Attributes were also added to indicate genre, language, nationality, race, sex etc. One of the main aims of the project was to use SGML markup to tag documents for inclusion in what is effectively a database (i.e. the header information of each text is used to facilitate organizing and searching). Microsoft's Multimedia Viewer was used to present the information, with the SGML-tagged files converted into RTF (the format recognized by Multimedia Viewer) by Zandar Corporation. A number of problems were encountered when developing the CD-ROM, for example handling certain special characters (which could not be represented in RTF), and deciding how to represent links to other texts, in part or in whole. A particular headache arose when converting the original bibliographic sections - the designers of the CD-ROM version wanted all bibliographies to follow the same conventions, but senior editors at Macmillan also imposed the requirement that the bibliographies could be re-presented in the style used in the original source text. A final quality control procedure was necessary to check the end product, from the point of view of both software and content. The testers found several bugs in the beta version of the Multimedia Viewer software, which took some time to get corrected. Harry Summerfield then described how his company had approached the project. As described above, Zandar were contracted to carry out the transformation of the scanned, then hand-edited, SGML files into RTF. However, they first carried out an important feasibility study to ensure that they were capable of doing the job and meeting Macmillan's deadlines. When they began to design the transformation process, Zandar did not just want to see the DTD being used, but also examples of live documents containing real tags. Zandar was aware that DTDs change, and that the tagging actually used in files may or may not always match up with the current version of the DTD. The conversion had to handle 50Mb of data in 510 files. The actual conversion process was done in five passes, because this approach was cheaper to develop. The first pass had to find any (SGML) invalid characters, and convert them to SGML entities. The second pass was to make the first letter of the text of every file a dropped cap (in RTF) - this was made possible by having used special SGML markup. The third pass was to do the conversion of all special characters into RTF. The last two passes [?]
had to strip out the SGML tagging in the main texts and bibliographies, and format them appropriately for RTF (converting the different kinds of bibliographic entries to a uniform structure was tricky, and intentionally omitted end-tags in the entries made things even harder). The entire conversion process (excluding building hyperlinks) took six hours of CPU time on a 486 PC. As part of the project, Zandar developed a separate tool called HyperTagWrite to handle the creation of the hyperlinking markup, which could be converted in a subsequent pass into the format used by Multimedia Viewer. New writers will be added into future versions of the CD-ROM. Changes will be made in the SGML database, from which the RTF (or whatever future target formats might be required) can be generated. Using an SGML-based approach should greatly facilitate the production of future editions. In the subsequent questions and answers session, a number of points were raised. Proper names in the original texts were identified and processed on the basis of the punctuation in the data content. The conversion process was relatively cheap in terms of man months (i.e. Macmillan put one person on the project full-time for only three months). The quality control checking took seven people six months, and every hyperlink was hand-tested. When proofing uncovered errors, all corrections were made to the SGML source files. The retrieval engine indexes every word, but they also built several specialist indexes on the basis of markup (e.g. indexes of authors, titles, genre etc.).

18. "The Attachment of Processing Information to SGML Data in Large Systems" - Lloyd Harding (Mead Data Central)

Mead Data Central collects information from over 2000 sources, but has no control over the received format; they currently have about 200 million documents available on-line. Conversion/handling on this scale has to be automated as much as possible, bearing in mind that the target is to produce an on-line information system and that they are not concerned about delivery on paper or other media. Lloyd compared the handling of (electronic) information to the process of industrial manufacturing - especially in its infancy. Standardized solutions have resolved many of the problems that faced early manufacturing industry, but can the same be done for information fabrication systems? He proposed a new paradigm in which the author marks up the information content, an "Information Fabrication System" adds value useful for the target system (publishing, linguistic analysis etc.), and the target systems use the information. The new middle process extends the traditional paradigm. Traditional SGML techniques may provide solutions within this new Information Fabrication paradigm. Markup standardization - that is, agreed common DTD development along the lines of the work of the AAP, the TEI, and existing efforts at Mead Data Central - might help to provide markup relevant to specific target applications (publishing etc.), but will only ever be a partial solution. The use of architectural forms may also provide some benefits, but adding them to existing DTDs requires skill. Link Process Definitions are another possibility, but they also require skill, and they are not supported by many existing tools. FOSIs are fairly straightforward to use, but may not be generalizable enough to use as part of an Information Fabrication process.
DSSSL's Association Specification and Output DTD, combined with GLTP, appear to offer the greatest promise of a solution, but implementing them will require programmer expertise, and DSSSL is still "shimmerware". As none of these traditional approaches really provides a complete and/or ideal solution, Lloyd proposed his own Information Fabrication System Architecture. For the Information Fabrication Paradigm, the most viable concept is a generalization of the FOSI to accommodate any type of fabrication process. The Architecture required to do this consists of two components: the Application Interface DTD (AID), and the Processing Output Specification Instance (POSI). The AID provides the syntax for specifying the attachment and association information. The POSI specifies the attachment and association information for a specific application and raw material. Lloyd then talked through two examples of the steps required to develop and use an application using this Architecture. The Architecture-based approach solves the attachment and association problems, alleviates some of the expense issues, and reduces the skill-set requirements involved. The goal of developing an Information Fabrication System is to free the author from target system constraints, thereby permitting him/her to focus on content (e.g. authors should not have to worry too much about authoring links). This requires Information Fabrication Systems that can accept any author's creation and cost-effectively prepare that creation for a target system. AIDs and POSIs can provide the basic underpinnings for such Information Fabrication Systems. During the question and answer session, Lloyd said that his was not necessarily the ideal/only/whole solution, but that he would like to see people talking about "Information Fabrication Assembly Lines" - where as much as possible of the process of generating marked-up information for target systems could be automated. His approach has not yet been adopted at Mead Data Central, but it will be.

19. "ISO 12083 Announcement" - Beth Micksch (Intergraph Corp.)

This presentation was intended to provide a brief history of, and update on, ISO 12083, the "Electronic Manuscript Preparation and Markup" Standard. Formerly an ANSI standard (Z39.59, but generally referred to as the "AAP"), ISO 12083 is now being fast-tracked through the ISO standards process. The first ballot on the Draft International Standard (DIS) was in November 1992, and the voting went as follows: 14 positive, 5 negative, and one abstention. Eric van Herwijnen was asked to be the editor and to set up a small technical committee. Eric was required to resolve all the comments received on the DIS into the Standard, as fast-tracking means that a second vote would not be needed before the Standard is approved. The Standard is intended to facilitate the creation and interchange of books, articles and serials in electronic form. It is meant to provide a basic toolkit which users can pick up and modify according to their needs. The Standard is meant for use by authors, publishers, libraries, library users, and database vendors. Use of the Standard is indicated by its public identifier (e.g. ISO 12083:1993//DTD Book//EN - for the Book DTD). Elements or entity references may be removed or modified as needed. Users can declare their own elements in external parameter entities, and the parameter entities defined in ISO 12083 can be overridden to modify order and occurrence or to specify user-defined elements/attributes; alias elements are not permitted.
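[As a rough illustration of this customization mechanism - the parameter entity name and the added element in the following fragment are inventions for this report, not actual ISO 12083 names. A user might reference the Book DTD by its public identifier and override one of its parameter entities in the document's declaration subset; since the first declaration of an entity is the one that binds in SGML, the redefinition takes precedence over the corresponding declaration in the external DTD.]

   <!DOCTYPE book PUBLIC "ISO 12083:1993//DTD Book//EN" [
     <!-- Hypothetical override: "p.extra" and "sidebar" are illustrative names only -->
     <!ENTITY % p.extra "| sidebar" >
     <!ELEMENT sidebar - - (#PCDATA) >
   ]>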
The Standard allows SHORTTAG and OMITTAG, although the revised usage examples will be fully normalized. The application must conform to ISO 8879:1986. ISO 12083 contains four DTDs: Book, Article, Serial, and Mathematics. It has a very large Annex (A) which comments on the DTDs and covers such things as design philosophy, structure descriptions, special characters, electronic review, mathematics, tables, braille/large print/computer voice facilities, and HyTime facilities. Annex B contains descriptions of the elements, and indicates how all the elements relate to one another. Annex C contains examples, some of which are normalized versions of the examples which first appeared in the ANSI standard. Numerous changes have been made to ANSI Z39.59. Element names have been changed and additions made, to make them less cryptic than in the original; there are new elements for things like poetry. The Mathematics DTD is based on the work of the AAP update committee (which has met at a number of SGML conferences, and corresponded over the internet). ISO 12083 currently offers minimal HyTime capability, but this should be enough to get people started. The Standard also supports the use of ICADD's Structured Document Access (SDA) attributes, to facilitate mapping to braille, large-print or voice-synthesizing systems. The use of SHORTREFs is deprecated but still possible. An alphabet attribute has been added to title, p (paragraph) and q (quotes) - to allow the use of special characters in these elements. Electronic review is also supported, and this was achieved by incorporating the CALS electronic review declaration subset. The names and descriptions of the elements and attributes are now more explicit and meaningful, to make the DTDs more "user-friendly"; the number of illustrative examples has also been increased. The Standard will be published very shortly. It will be available from ANSI and NISO (and reserving a copy before January 15th can save $10). NISO will email electronic copies of the DTDs to anyone who wants them. To get a copy contact:

National Information Standards Organization
P.O. Box 1056
Bethesda, MD 20827
Phone: (301) 975-2814
Fax: (301) 869-8071
Email: niso@ehn.nist.gov

[This information probably only applies to people in the United States. Elsewhere, people should first try contacting their own national standards body.] Beth closed by remarking that the second edition of Eric van Herwijnen's book "Practical SGML" has been produced using the ISO 12083 Standard (including the HyTime capabilities), and things seem to have worked pretty well. The indications from other tests which are currently underway have been equally positive.

20. Reports from the SGML Open Technical Committees

Paul Grosso (Chair of SGML Open's Technical Committee) reported that the committees had not yet met - although the first meeting would be held on the Friday immediately after this conference. On this one occasion, anyone who wished to attend would be allowed to do so; future attendance at such meetings would be restricted to people connected with SGML Open. Paul suggested that the general role of the Technical Committees will be to look at interoperability issues and make sure that SGML solutions work (i.e. that SGML applications can successfully interact). The main Technical Committee will form specifically-tasked/short-lived sub-committees as necessary.
The Technical Committee will need to get input from all the SGML Open member companies, and no-one should expect to be able to "piggy back" on the efforts of a few dedicated companies or experts. Particular problems which may be considered by the Technical Committee include things like entity management, how to package together and exchange SGML files (cf. SDIF, SBENTO etc.), handling tables and math (where many issues go beyond the area covered by ISO 8879), HyTime issues, and so on.

21. "A Technical Look at Authoring in SGML" - Paul Grosso (ArborText)

There are a number of ways of authoring in SGML. Approaches include using a standard "ASCII" editor to author both the text and the markup, using a conversion program to add SGML markup to existing content, using an SGML-aware (non-"ASCII") editor that provides the proper markup during the initial authoring process, and recreating an SGML document from a repository of existing SGML or SGML-type documents. When discussing authoring in SGML, it is useful to distinguish the roles of the parser and the editor. A parser turns an ASCII character stream into tokenized SGML constructs (it also expands any markup "minimization"). However, the parser also leaves many things to the application for checking. A non-ASCII SGML editor is such an application, and only it can associate meaning to the SGML constructs returned by the parser. Such an editor is not just an interface to the parser - it is an application optimized to author structured documents that represent valid SGML document instances. A non-ASCII SGML editor should provide an interface that transcends the syntactic details; it should represent the document using internal data structures that are isomorphic to the basic constructs of SGML. There are three levels of "understanding" an SGML document. The lowest is the recognition of SGML syntax (e.g. recognizing the individual characters in the string </para> as an end tag for a particular element called "para"). The middle level entails understanding and providing an interface to SGML semantics, for example what it means to be an element, an attribute, an entity, or a marked section, and what it means to have a content model, a declared value, a default etc. The top-most level of understanding performs the attachment of application-specific semantics, for example in a composition system it determines how to format a paragraph element. All non-ASCII SGML editors must convert the ASCII SGML input into an internal representation, but an SGML editor that inherently understands SGML semantics can provide much greater benefit to the end user than an editor - even a structured one - that "imports" and "exports" SGML by converting to an alternate view that does not maintain a real-time comprehension of, and compliance with, SGML semantics. When it comes to measuring an SGML editor's performance, the parsing component of an SGML system is defined in Annex G of ISO 8879. Conformance of an editor application is often measured by examining what it can import and export; it should be able to read/output a wide range of valid SGML. However, real-time context checking is also important in an SGML-aware editing system. The system should guide the author whilst creating a valid SGML document, as continual validation and checking will make life easier for the user.
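[A small, purely hypothetical illustration of such context checking; the declarations below are invented for this report. Given this content model, an editor that maintains real-time compliance with SGML semantics would offer only title as the next valid element at the start of front, would require at least one author after it, and would refuse a second title.]

   <!-- Hypothetical declarations, for illustration only -->
   <!ELEMENT front - - (title, author+, abstract?) >
   <!ELEMENT (title | author | abstract) - - (#PCDATA) >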
However, it must be remembered that during the authoring process it is possible to have "incomplete" as opposed to "invalid" documents - the incomplete document contains nothing that is actually wrong, it just does not yet contain everything that is required. The SGML Conformance Standard uses the concept of ESIS (Element Structure Information Set) to define conformance for a parser. However, the definition of ESIS is not inclusive enough to describe all that an SGML editor must do (and this has led to the notion of "ESIS+"). ESIS+ suggests that things of importance to an SGML editor could include comments, the existence of ignored marked sections, and the existence and name of internal general text entities. An SGML editor's view of SGML is really dependent on its view of ESIS+. Therefore, an SGML editor could/should be evaluated on the basis of what it recognizes as the scope of ESIS and ESIS+. The interfaces to complex constructs can cause problems for SGML editors; for example, handling such things as marked sections should be done properly, even when they are specified using a parameter entity reference. The editor should allow for marked sections with unbalanced markup (i.e. which include the start tag of an element but not its corresponding end tag); it should also allow the synchronous changing of the values of parameter entities, so that the final result is valid even though an intermediate state may be invalid. Subdocs are another complex structure to be considered. A subdoc is basically an external SGML entity with its own DTD. The authoring interface to a subdoc can be similar to that for a regular external SGML entity, but there are additional issues to be considered, such as the different ID name/entity name space, and the need for a potentially different presentation style for the subdoc, to say nothing of what it might mean to actually compose a document containing a subdoc. A third type of complex structure involves the use of the CONCUR feature. Using CONCUR allows a document instance to contain completely independent and orthogonal structural hierarchies in the same ASCII SGML file. At any given time, the document must be parsed according to the currently active DTD. When a different DTD is made active, it is equivalent to reading in a different document from a different document type. Although the character data of the different views of the document remains the same in both cases, some thought needs to be given as to which tag sets should be displayed to the user - only the active DTD's, or one or more of the others that apply.

22. "Implementing an Interactive Electronic Technical Manual" - Geoffrey von Limbach (InfoAccess Inc.)

There are two specifications which relate to IETMs. The first is GCSFUI (General Content, Style, Format, and User Interaction Requirement - MIL-M-87268), which specifies the on-screen layout and behaviour of an IETM (e.g. how an IETM should handle warnings - their duration, iconization etc.). The second is IETMDB (MIL-M-87269); although this mentions "Data Base" in the title, it is primarily a set of architectural forms or templates for IETMDB-compliant DTDs. IETMDB also specifies a linking mechanism for sharing content between document instances, based on the use of HyTime ILINKs. Several classes of ETMs and IETMs have been identified in a recent paper by Eric Jorgeson ("Classes of Automated TM Systems", Eric L Jorgeson, Carderock Division, Naval Surface Warfare Center, 11 August 1993).
In summary, these are as follows:

Class 1: stored page images (+ display mechanism and access via an index)
Class 2: as 1, but adds hypertext links
Class 3: real IETMs (display conforms to GCSFUI, file is tagged in accordance with IETMDB)
Class 4: as 3, but authored with a relational or object-oriented underlying database
Class 5: as 4, but the display is integrated with other tools (such as an expert system to assist diagnostics).

Geoffrey then described the implementation of a prototype IETM at the David Taylor Research Center (DTRC). The DTRC prototype was implemented using Guide Professional Publisher (GPP). GPP includes a scripting language which enabled the flow control required by IETMDB, as well as a flexible user interface which can be adapted to meet the requirements of GCSFUI. The DTRC provided a DTD and document instance derived from the architectural forms specified in IETMDB. DTRC also specified the screen layout beyond the requirements of GCSFUI. Geoffrey then showed an example of some sample warning source text and markup. It became clear during the DTRC project that IETM production requires flexible software. The user interface must be adaptable to user requirements. The DTD involved does not remain constant, and is likely to be frequently revised. There is also a need for tools which can handle things like HyTime's ILINKs. Geoffrey recommended that anyone working on a similar project should try to use inherently flexible tools (such as an adaptable user interface, and a good scripting language). Other tools which offer an API that can be called from a development language such as C or C++ are also worth considering, although they can lead to higher development costs and greater overheads as requirements change.

23. "The Conversion of Legacy Technical Documents into Interactive Electronic Technical Manuals: A NAVAIR Phase II SBIR Status Report" - Timothy Billington, Robert F. Fye (Aquidneck Management Associates, Ltd.)

[This presentation followed on closely from the previous one. It involved a large number of slides (all of which are included in the proceedings), so I shall only attempt to summarize the main points.] The Navy's ETM strategy depended upon a comparison of the Class 2 and Class 4 ETMs/IETMs outlined above. Class 2 ETMs are typically linear/sequential in nature. An SGML instance and document DTD are fed into an indexing tool, and the resulting files are fed into an ETM SGML and graphics browser (with style declarations being applied to control formatted display). In Class 4 IETMs, the logic sequence is much more complex - since the underlying SGML-tagged information has to be displayed on the basis of interactive inputs from the user (e.g. branching on the basis of user responses to dialogue boxes). Aquidneck Management Associates were awarded a second-phase Small Business Innovative Research (SBIR) contract to develop processes and procedures for the transition of legacy, paper-based NAVAIR Work Packages into Class 4 IETMs in accordance with Tri-Service specifications. Timothy described the migration process. This involves scanning massive amounts of legacy data, and particular attention needs to be given to potential problem areas - such as how to mark up tables to support interactive decision making. Migration has to be a phased process, with increasingly sophisticated markup being added at each phase. Timothy talked briefly about data enhancement and authoring, as well as data storage and maintenance.
He stressed that the subsequent data quality assurance process is very important to users, and should not be neglected or played down. He identified some of the features of an IETM presentation system (e.g. tracking user interactions, setting and clearing logic states, navigating to the next logical node etc.), and showed some diagrams illustrating the operation of a frame-oriented IETM. Looking to the future of IETMs, Timothy said that there is a need to find a cost-effective automated means of converting paper-based legacy data to Class 4 SGML-based IETMs; this is a prerequisite for the widespread implementation of IETM technology. He noted with regret that this is something of a circular argument, since the IETMs will not appear without the cost-effective conversion technologies, but those technologies will not be developed until there is sufficient demand for IETMs.

24. New Product Announcements and Product Table Top Demonstrations

[These announcements were made in quick succession, so I apologize in advance for anything I missed or mis-heard.]

Xyvision Publishing Systems announced that they have built an SGML publishing solution around their document management system (and also around FrameMaker and Ventura).

InContext announced the release of their Windows 3.1-based SGML editor (which uses Excel to handle tables).

Electronic Book Technologies (EBT) announced that DynaText now has a new multibyte core display language, which means that it can display Asian languages (e.g. Kanji). The Rainbow DTD being shown at the Poster Sessions will be made publicly available to facilitate translations from word processors to SGML.

SoftQuad announced a new version of Application Builder which allows Author/Editor to be used as an interface to document management systems. They would also be showing the latest version of Author/Editor (v.3.0).

Recording for the Blind announced the creation of their Etags product, to assist the production of electronic texts that can be made accessible to the print-disabled.

Open Text Corporation announced the creation of new client/server extensions to support easy combination of hardware. Their product has also been extended to support multibyte character sets (e.g. Kanji).

Datalogics announced that their WriterStation P/M product has now been ported to Windows 3.1.

Schaeffer [?] consultants announced that they would be showing some of the integrations they have done (based on Open Text) to facilitate data management.

Grif SA announced that they would be showing GATE [tools to integrate Grif into your system?], Grif SGML Editor for Windows, and CAT (Corporate Authoring Tool) for authoring in BookMaster (GML to SGML).

Oracle announced the release of OracleBook, an online multimedia document delivery system. Version 2 will be SGML-aware, and will be demonstrated at this conference.

Texcel announced the release of Information Manager, a package for building document management/collaborative authoring SGML systems.

Tachyon Data Services announced that they would be showing the customizable Tagger software that they have developed to convert files to SGML.

Synex Information [?] announced the release of SGML Darc, an SGML-based document management and archiving system for Windows.

ArborText announced the recent release of version 5 of the ADEPT Series of software (now including a FOSI editor). They were also demonstrating Powerpaste (a set of conversion tools), and the Windows version of the ADEPT SGML Editor, which is due to be released in the second quarter of next year.
Interleaf announced the recent release of the Interleaf 5 <SGML> Editor, and the Interleaf 5 <SGML> Toolkit for developing translations.

Exoterica announced that they would be demonstrating a new release of OmniMark for use on Macintosh machines.

Passage Systems announced the release of PassagePro, a document management and production system. It is currently available on SGI machines, and soon on Suns; they hope to have it running under Windows by next year.

Frame Technology announced that they would be demonstrating FrameBuilder, and announcing/demonstrating their SGML Toolkit (which facilitates mapping between SGML and Frame's WYSIWYG environment).

Zandar Corporation announced that they would be demonstrating the latest version of their TagWrite conversion tools (currently at v.3.1).

The following companies demonstrated/exhibited their products [this list is taken from the proceedings, so some late entries may have been omitted]: ArborText, Inc.; AIS/Berger-Levrault; CTMG/Officesmiths; Data Conversion Laboratory; Datalogics Inc.; Electronic Book Technologies; Exoterica Corporation; InContext Corporation; InfoAccess; Information Strategies Inc.; Information Dimensions Inc.; Intergraph Corporation; ISSC Publishing Consulting; Microstar Software Ltd; Ntergaid; Open Text Corporation; Recording for the Blind; Saztec International Inc.; SoftQuad Inc.; STEP Sturtz Electronic Publishing GmbH; Synex Information; Tachyon Inc.; Texcel; WordPerfect Corporation; Xyvision Inc.; Zandar Corporation.

25. Poster Session

As above.

26. "Implementing SGML Structures in the Real World" - Tim Bray (Open Text Corp.)

Tim began by remarking that the number of people attending the conference indicates that there is a vast amount of SGML-tagged information being created - but where is it all being put? Some of the possible technologies and systems have been shown at the vendor demonstrations at this conference. At SGML '92, Neil Shapiro said "I can model SGML with a slight extension to SFQL and store SGML in a relational database". In response, Charles Goldfarb said "SGML should be stored in a flat file in a native form". This presentation looked at some of the different possible approaches.

Computer systems, at the operating system level, have an extremely linear view of the world (i.e. they see everything as a row of bytes). The application program, which actually uses files, has a different view of the world; it sees SGML etc. as a sequence of characters, although it still goes through the file in a linear fashion. However, you might want to jump right into the middle of an SGML file (for example, to pick up a particular entity), accessing information in the same way that humans do.

The design of a system that needs to access SGML-tagged information in this very direct way must incorporate a number of goals. The design must be open - that is, it must provide full SGML text to any application (with or without entity management? - Tim commented that it is good to see that relevant PD software tools will soon appear, such as the POEM software announced earlier in this conference). The design must make information highly accessible: retrievable on the basis of structure, content, and linkage. It should also be "alive", allowing information to be updated quickly and safely. The design should support "Document Management" (i.e. versioning and workflow), and it should be able to do all of the above quickly. What does "Open" mean in this context?
Flat filesystem access (real or emulated) is the lowest common denominator shared by all open systems, but being able to pass SGML files between systems requires additional sophistication. You really need to develop an "element server" (effectively an element API) to have truly open SGML.

There are four possible strategies which can be adopted when developing a solution.

The first involves the use of standalone flat files (where the SGML-tagged information sits in separate files); this approach offers complete SGML fidelity, maximal openness, and relative ease of update. However, retrieval can be hampered by poor tools and performance, and it is difficult to perform updates safely and securely. From the standpoint of data management, this approach appears neutral (neither especially good nor bad), because although there are no tools, there are also no barriers.

The second strategy involves the use of indexed flat files; this is the approach adopted by the OpenText product (i.e. it builds structure and content indexes which can then be used to access/retrieve information, support updating, etc.). The arguments in favour of this approach are that it allows complete SGML fidelity, excellent openness, and excellent retrieval. However, the implementation of updating is complex, as it is difficult to insert bytes into the middle of a large indexed flat file without creating problems.

The third strategy requires the use of a relational database (and Tim pointed out that 90% of the world's existing business information is already stored in relational databases - so there is a great deal of expertise with this kind of database). In a relational database, SGML elements or entities are mapped into relations, and extra information is included to model the SGML element hierarchy. Some extensions to SQL may be provided to support this approach. Tim gave some examples of how relational databases for handling SGML have been implemented in some of the products demonstrated at this conference. The first used a single table of three columns (Context Representation, Properties, and SGML Text); record boundaries are inserted at the starts of a subset of "distinguished" elements. The hierarchy is stored and rebuilt via the Context Rep [?], whilst metadata (versioning and presentation) is stored in Properties. This means that the SGML can be reconstructed on demand, and the product can perform structure and/or content queries on the distinguished elements. In his second example, the relational database was built from a simple decomposition process - breaking the documents down into distinguished elements (e.g. paragraphs); each table record is one such element, with one field for the text and the rest for the attributes. In his final example, the SGML text is stored in BLOBs (Binary Large Objects), divided purely for implementation reasons; in this case, elements are stored as table entries with the fields BLOB ID, parent ID, sibling ID, first child ID, and attributes.

There are advantages to using the relational database approach. SQL is a world-standard tool, and the theory and practice for safe and efficient updating of relational databases is well understood. It also offers the possibility of excellent integration with existing document management techniques. However, using relational databases offers poor SGML fidelity, the openness of the information is compromised, and retrieval performance can also be poor.
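[A minimal sketch of the general idea behind these relational decompositions - the table layout, element names, and helper function below are invented for illustration and do not reproduce the schema of any product Tim described:]

    # One table row per SGML element, with a parent pointer to model the
    # hierarchy, a field for serialized attributes, and a field for
    # character data.  The hierarchy can then be rebuilt on demand.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE element (
            id         INTEGER PRIMARY KEY,
            gi         TEXT,      -- generic identifier (element type name)
            parent     INTEGER,   -- NULL for the document element
            attributes TEXT,      -- serialized name="value" pairs
            content    TEXT       -- character data for leaf elements
        )""")

    # A two-element fragment: a <warning> containing a single <para>.
    conn.execute("INSERT INTO element VALUES (?, ?, ?, ?, ?)",
                 (1, "warning", None, 'severity="high"', None))
    conn.execute("INSERT INTO element VALUES (?, ?, ?, ?, ?)",
                 (2, "para", 1, "", "Disconnect power before servicing."))

    def serialize(element_id):
        """Reconstruct an SGML fragment by walking the stored hierarchy."""
        gi, attrs, content = conn.execute(
            "SELECT gi, attributes, content FROM element WHERE id = ?",
            (element_id,)).fetchone()
        children = conn.execute(
            "SELECT id FROM element WHERE parent = ? ORDER BY id",
            (element_id,)).fetchall()
        inner = "".join(serialize(cid) for (cid,) in children) or (content or "")
        start = "<%s %s>" % (gi, attrs) if attrs else "<%s>" % gi
        return "%s%s</%s>" % (start, inner, gi)

    print(serialize(1))
    # <warning severity="high"><para>Disconnect power before servicing.</para></warning>

[Structure and content queries then become ordinary SELECT statements over such tables, which is the source of both the strengths (standard SQL, well-understood updating) and the weaknesses (loss of SGML fidelity) noted above.]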
With this in mind, Tim proposed his fourth strategy for developing an open system for handling SGML information - using a "native SGML" (object-oriented type) database. In this case the Database Definition Language (DDL) is SGML, and the Data Manipulation Language (DML) is also SGML-oriented; there is no hidden or relational machinery. As an example of a product that has implemented this approach, Tim talked about SGML/Store; the input is via an SGML parser, and the database essentially stores the resulting ESIS. The software has API primitives for tree navigation and content queries, and treats both elements and attributes as nodes. It uses a transaction-oriented, versioned-object model, and supports multiple instances and DTDs per database.

The advantages of using a native SGML database are that it allows for complete SGML fidelity, it gives an opportunity to implement commercial database security and integrity features, and it also makes it possible to optimize for performance. The disadvantages are that it involves a proprietary implementation, and the use of a proprietary API and/or DML.

In conclusion, Tim recommended that if it is possible to get away with storing SGML documents in flat files, then this is the preferred solution, as it is simple and thus safe. He felt that there is still a major requirement for the development of relevant standards (e.g. an SQL for handling SGML documents), and hoped that this need would be met sooner rather than later.

27. "User Requirements for SGML Data Management" - Paula Angerstein (Texcel)

[Paula suggested that an appropriate alternative title to this presentation could well have been the question, "Why do we want to have SGML-based approaches to data management in the first place?".]

The current business trend shows a growing awareness of documents. Documents are at last being recognized as a corporate asset, although they are often not managed as well as other types of corporate information (such as financial data). There also appears to be a gap in the corporate information infrastructure (for example, many large companies cannot share documents effectively and efficiently internally) - so strategies are needed to track and share information.

There are a number of common document-related problems in business. For example: finding the right version of the right document, keeping related documents compatible with each other, and synchronizing documents with the latest data. Other typical problems include getting a document customized for a particular job, reusing appropriate existing document parts, and coordinating the multiple access and update of documents.

An effective document repository management system should provide a number of benefits: it should support automated quality control, maintain document security, account for and trace any amendments, facilitate document reuse, and assist worker coordination. The successful solution to providing such a system must offer facilities for the automation of the main business processes, provide for the collaborative reuse of information, and be easy to integrate with existing data management practices (although this is often more of a cultural than a technical problem).

The key to a successful information management strategy is a centralized information repository. It represents a "logical vault" for all the documents and information relevant to a workgroup. As a managed collection, documents can be browsed, queried, and used by all members of a workgroup to collaborate on projects.
Versions, configurations, and status of information can be centrally kept up-to-date, providing automated quality control, accountability, and traceability. Information can be shared and reused, guaranteeing the integrity of data and boosting productivity.

SGML-based repository management goes beyond document image and/or traditional document management. It makes it possible to use the rich set of information in a document - namely the markup as well as the content. Document components can be shared, reused, and subject to configuration management. Moreover, it means that automated processes can be driven by the document contents (i.e. by the data), and so do not have to be left to other tools. This approach is different from traditional product configuration management, in so far as documents are structured but have "unpredictable" data (i.e. they contain elements which have no fixed size, order, or content type). The typically hierarchical structure of documents does not model naturally as relational tables, and the level of granularity required to track changes etc. probably does not correspond to a document as it is presented to the end user. Similarly, most people probably would not wish to adopt a forms-based document editing environment.

Although SGML is often thought of as an interchange format only, it also provides a small but powerful set of semantics for document data modelling. Element and attribute definitions provide a "schema" and attributes for objects in a repository. Entity definitions provide a way to share units of information, and IDs/IDREFs (together with appropriate external identifier resolution or HyTime mechanisms) provide repository-wide linking. The benefits of using SGML modelling in a data repository stem from the fact that SGML is optimized for documents. Document markup and structure contribute to the process of retrieval. It also means that you need only one model and language for information creation, management, and interchange.

SGML repository management enables new business processes. It becomes possible to have on-demand generation of new information products, through dynamic document assembly and "data analysis". Element-based access control makes it easier to share document components amongst the members of a collaborative workgroup. It also becomes possible to track the life-cycle of document components through element-based configuration control, whilst the use of structure-based queries allows dynamic online viewing and navigation around document components within the repository.

SGML repository management also facilitates existing business processes such as storage, filing, browsing, query, retrieval, access control, auditing, routing, versioning, job tracking and reporting, archiving and backup. For many of these functions, an SGML repository enables operation on individual elements in addition to whole documents.

Documents are being recognized as an increasingly important part of the information infrastructure of a company. With the introduction of SGML-based approaches, we should witness a gradual movement from the current notion of "document management" to that of "information management".

28. "A Document Query Language for SGML Databases" - Ping-Li Pang, Bernd Nordhausen, Lim Jyh Jang, Desai Narasimhalu (Institute of Systems Science, National University of Singapore)

[This presentation was delivered by Ms Ping-Li Pang.]

The Institute of Systems Science is one of the main research departments at the National University of Singapore.
Recently, they have been looking at managing documents, especially using SGML-based approaches, and this has led to the development of DQL, a document query language for SGML databases. The main requirements for a language to query a database of SGML-structured documents are that it must support queries on the basis of complex structural information, facilitate information retrieval, and assist document development.

Ping-Li talked through some examples of query expression in DQL, showing the typical form of a query (select....from.....where....) and the use of path expressions (for elements, database attributes, and SGML attributes). She then discussed the DQL method for expressing the following types of query: a DTD structural query, a document instance structural query, a document instance content query, a query on versioning information, and a query on links. [Readers who would like to see examples of these queries should probably contact the DQL team at the Institute of Systems Science.]

DQL was an attempt to implement an SQL-like language for use with SGML. It has the expressive power to query on structural and/or content information at any granularity. DQL is being implemented at the Institute of Systems Science, and an initial prototype that has all the features of DQL will be ready in March 1995.

During questions, Ping-Li stated that the "database attributes" mentioned in her examples are defined when the database model is developed (i.e. the database attribute "title" maps from the content of the SGML element "title" in the relevant DTD[?]). The DQL development team have not looked at HyQ for querying SGML documents.

29. Closing Keynote - Michael Sperberg-McQueen (Text Encoding Initiative)

Following the success of the presentation he gave last year, Michael Sperberg-McQueen had again been invited to deliver the closing keynote speech. The full text of his address will be posted to the comp.text.sgml newsgroup. Michael was really posing the same question that he asked last year: "What will happen to SGML and the market for SGML information in the future?" He emphasised that he would be stating his personal opinions only, and that they should not be taken as representative of the TEI or any other institution.

Michael noted that some progress had been made on several of the issues that he raised in his closing address at SGML '92. The growing expertise in SGML has meant that improved styles of DTD design are being adopted. DTDs are being developed to meet the users' information handling needs, not the requirements of a system or application (often evidenced in earlier DTDs by the presence of processing-motivated "tweaks"). The HyTime engines which are now appearing should be able to close some of the "gaps" in using SGML, since they will make it possible to use architectural forms to convey some sense of system and/or application awareness without compromising the SGML source. SGML promises data portability, but it may also lead to application portability. The HyTime and DSSSL standards should facilitate this process, and developers will need to know about them.

The biggest change since SGML '92 is the amount by which the volume knob on the so-called "Quiet revolution" has been turned up. SGML has already begun its entry into mainstream information processing. It is already a standard that is being adopted world-wide.
Whilst the use of HTML (the Hypertext Markup Language) on the World Wide Web is perhaps not an ideal demonstration of what SGML-based approaches can achieve, it is doing a very good job of selling SGML to people who were previously ignorant or resistant. Michael said that he would like to see SGML-awareness embedded into as many mainstream applications as possible, as a matter of course. When SGML-awareness is embedded almost incidentally in an application, then SGML will be able to realize its full potential - and it could be argued that this is already beginning to happen. The future is perhaps not only closer than we think, it may already be here now; the products demonstrated at this conference, and public domain tools such as ObjectSGML, POEM and CoST, are perhaps examples of this. Michael predicted that it might not be too long before we see SGML being used in the area of literate programming, which is clearly an ideal case for an SGML application.

SGML-to-SGML transformations have been a key topic at this conference. This is important because, in future, only legacy data that is being converted to SGML, and outputs from SGML systems (e.g. for printing a document), will exist in non-SGML form. All other information interchange will be done using SGML, and thus DTD-to-DTD transformations will be a fundamental issue. The points raised by Dave Sklar in his presentation on GLTP (now to be re-named STTL) will become highly pertinent, as will the requirement to use GLTP and DSSSL in general. Things are going to become more complicated, so we will all need new and better tools for things like DTD design and development, SGML systems development, DTD transformation design, and the actual transformations themselves.

There is clearly still much more work to do. Mainstream vendors will need to understand more about SGML, and if they do not change their products then in the long run they will lose out, as customers come to expect and demand embedded SGML support. It is certain that technology will keep on changing, and superficially attractive (non-SGML-based) solutions to managing information will always ultimately fail. When this situation comes about, SGML will not just be in the mainstream - it will be the mainstream!

For further details of any of the speakers or presentations, please contact the conference organizers:

Graphic Communications Association
100 Daingerfield Road, 4th Fl
Alexandria, VA 22314-2888
United States of America
Phone: +1 (703) 519-8160
Fax: +1 (703) 548-2867

-------------------------------------------------------------------------------

You are free to distribute this material in any form, provided that you acknowledge the source and provide details of how to contact The SGML Project. None of the remarks in this report should necessarily be taken as an accurate reflection of the speakers' opinions, or as in any way representative of their employers' policies. Before citing from this report, please confirm that the original speaker has no objections and has given permission.

-------------------------------------------------------------------------------

Michael G Popham
SGML Project - Computer Development Officer
Computer Unit - Laver Building
North Park Road, University of Exeter
Exeter EX4 4QE, United Kingdom
email: sgml@uk.ac.exeter
       M.G.Popham@uk.ac.exeter (INTERNET)
Phone: +44 392 263946
Fax: +44 392 211630

-----------------------------------------------------------------------------