Abstract: "This paper traces the history of the Text Encoding Initiative, through the Vassar Conference and Poughkeepsie Principles to the publication, in May 1994, of the Guidelines for Electronic Text Encoding and Interchange. The authors explain the types of questions that were raised, the attempts made to resolve them, the TEI project's aims, the general organization of the TEI committees, and they discuss the project's future."
Abstract: "Parameter entities were once thought to be the domain of only DTD designers. Parameter entities, and their references, can also be placed in the internal DTD subset of document instances. By doing so, authors can indirectly include shared entity declarations or collections of entity declarations. Such indirection can enable groups of authors to share and reuse entities that change frequently. Whereas parameter entities enable entity sharing and reuse, HyTime content location addressing can provide granular reuse of elements within file entities. When combined, parameter entities and content location addressing can enable sharing and reuse of SGML components in either local or far-flung environments."
"The scenario [discussed] is not fictitious; the problems are real and the requirements and objectives are quite common for Company X, as they are for many organizations, large and small. Of course, this paper was written in the referenced DTD and uses all of the features discussed; the SGML markup for this document is available from publish@ibm.net. The creative and judicious use of the features described in this paper provides a reasonable degree of reuse and data management across an organization of virtually any size, without requiring the use of an SGML-enabled data manager. However, a capable SGML-enabled data manager, combined with one or more of these features, can provide an organization with a formidable, extensible, and highly automated reuse environment."
This paper was delivered as part of the "User" track in the SGML/XML '97 Conference.
Note: The SGML/XML '97 conference proceedings volume is available from the Graphic Communications Association, 100 Daingerfield Road, Alexandria, VA 22314-2888; Tel: +1 (703) 519-8160; FAX: +1 (703) 548-2867. Complete information about the conference (e.g., program listing, tutorials, show guide, DTDs, conference reports) is provided in the dedicated conference section of the SGML/XML Web Page and via the GCA Web server. The electronic proceedings on CDROM was produced courtesy of Jouve Data Management (Jouve PubUser).
The volume apparently has not yet been published (November 1995).
Summary: "Beyond the need to query and retrieve based on tags which exist in a TEI document, a means to manipulate and query classes of objects is also desirable. The TEI DTD uses SGML entity definitions to create "classes" of elements and attributes, in particular, for groups of elements with common structural properties (e.g., all elements that can appear between paragraphs), groups of attributes which apply to certain classes of elements (e.g., attributes for pointer elements), etc. In addition to grouping together elements and attributes with common structural properties, the definition of such classes recognizes common semantic properties among elements and attributes. However, the SGML entity definition mechanism provides only for string substitution within the DTD itself, thereby enabling easy reference to these classes in later element definitions; the common semantic properties that are implicit in the classification scheme are lost for the purposes of retrieval and document manipulation. Obviously, a means to refer to and manipulate classes of elements and attributes in a query and retrieval system would provide substantial additional power for the user."
"We are experimenting with the representation of a DTD and associated documents (i.e., documents conformant to the DTD) in a knowledge representation (KR) system, in order to provide more sophisticated query and retrieval from TEI documents than current systems provide. We are using CLASSIC, a frame-based representation system developed at AT&T Bell Laboratories. Like many KR systems, CLASSIC enables the definition of structured concepts/frames, their organization into taxonomies, the creation and manipulation of individual instances of such concepts, and inference such as inheritance, relation transitivity, inverses, etc. In addition, CLASSIC provides for the key inferences of subsumption and classification. By representing a document as an individual instance of a hierarchy of concepts derived from the DTD, and by allowing the creation of additional user-defined concepts and relations, sophisticated query and retrieval operations can be performed. This paper briefly describes the CLASSIC system, the representation of a DTD and a document conforming to that DTD in CLASSIC, and provides an overview of the kind of query and retrieval that can be performed."
The extended abstract for the document is available online: http://www.stg.brown.edu/webs/tei10/tei10.papers/ide.html; [local archive copy]. Also on the Vassar server: http://www.cs.vassar.edu/~ide/papers/tei10.html. See the main database entry for additional information about the conference, or the Brown University web site.
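The contrast drawn in the summary above, between classes that exist only as string substitution and classes that remain available for querying, can be sketched as follows. This is a minimal illustration only: the class names, their members, and the helper functions are hypothetical, are not copied from the TEI DTD, and have nothing to do with the CLASSIC system itself.

```python
import re

# Hypothetical, much-simplified class declarations; the names and members
# are illustrative assumptions, not taken from the TEI DTD.
CLASSES = {
    "m.inter": ["list", "note", "quote"],
    "m.phrase": ["emph", "term", "quote"],
}

# Matches a parameter-entity reference such as %m.inter;
ENTITY_REF = re.compile(r"%([A-Za-z][\w.-]*);")


def expand(content_model, classes=CLASSES):
    """Mimic parameter-entity substitution: each reference becomes plain text."""
    return ENTITY_REF.sub(lambda m: " | ".join(classes[m.group(1)]), content_model)


def classes_of(element, classes=CLASSES):
    """Class membership kept as data, so that a query can still use it."""
    return sorted(name for name, members in classes.items() if element in members)


# After expansion, nothing in the string records that the three names
# once formed a class; the lookup table is what preserves that knowledge.
model = expand("(p | %m.inter;)*")
quote_classes = classes_of("quote")
```

Once `expand` has run, the content model is an ordinary string, which is the loss the authors describe; `classes_of` shows the kind of question that becomes answerable when the classification is retained separately.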
Abstract: "This article describes the major problems in devising a TEI encoding format for dictionaries, which, because of their high degree of structuring and compression of information, are among the most complex text types treated in the TEI. The major problems for this task were: (1) the tension between generality of the description, in order to be widely applicable across dictionaries, and descriptive power, that is, the ability to describe with precision the particular structure of any given dictionary; and (2) the need to accommodate different views and uses of the encoded dictionary, for example, as printed object and as a database of information."
Abstract: "MULTEXT (Multilingual Text Tools and Corpora) is the largest project funded in the Commission of European Communities Linguistic Research and Engineering Program. The project will contribute to the development of generally usable software tools to manipulate and analyse text corpora and to create multi-lingual text corpora with structural and linguistic markup. It will attempt to establish conventions for the encoding of such corpora, building on and contributing to the preliminary recommendations of the relevant international and European standardization initiatives. MULTEXT will also work towards establishing a set of guidelines for text software development, which will be widely published in order to enable future development by others. All tools and data developed within the project will be made freely and publicly available."
The goals of MULTEXT are the creation of "reusable software for multi-lingual linguistic corpus annotation and exploitation; software standard for tool design; TEI-based markup standard for corpus encoding; multi-lingual corpus (English, Dutch, German, French, Italian, Spanish), including a small speech corpus, partially parallel, portions marked up and validated for part of speech and alignment." As for markup: "The TEI Guidelines provide the basis for markup at levels 0 (the TEI header), 1 and 2 as well as many elements of level 3. In collaboration with Eagles, MULTEXT is extending the TEI scheme in order to specify a TEI-conformant Corpus Encoding Style (CES) that is optimally suited to NLP research and can therefore serve as a widely accepted TEI-based style for European corpus work. Application of the CES to CEE languages, which may require minor modifications to accommodate CEE language-specific information and structures, will provide a test of both the TEI Guidelines and MULTEXT and Eagles' extensions to it."
The paper is available on the Internet at ftp://ftp.aist-nara.ac.jp/pub/nlp/conferences/SNLR/papers/14.ps.gz in conjunction with the online proceedings; see also the mirror copy in Postscript format and in PDF format. MULTEXT work sponsored by the Commission of European Communities Linguistic Research and Engineering Project 62-050. For more on the project, see the main entry. For other SNLR conference papers, see the online TOC: http://cactus.aist-nara.ac.jp/lab/events/SNLR/snlr.html.
"The Text Encoding Initiative (TEI) Guidelines for Electronic Text Encoding and Interchange are the result of over six years' work by dozens of scholars from all over the world. As such, they represent a pioneer effort in an area where only occasional and isolated attempts were made before. They will certainly serve as the primary basis for encoding texts in electronic form for the foreseeable future. The work of participants in the TEI not only involved consideration of problems of text encoding that are likely to be with us for decades to come, but also required the development of a methodology - from scratch - for approaching these problems. These pioneering efforts, while likely to be refined and extended, must not be lost: they provide the intellectual basis upon which text encoding practices will build in the future. This collection therefore documents the course of these efforts. `The TEI Guidelines are extraordinary. Even if they were never adopted they would stand as a significant contribution to scholarship for their detailed analysis of the information sets of a huge range of complex text types.' (From the Preface by Charles F. Goldfarb, inventor of the Standard Generalized Markup Language)."
The contents of this volume are also published as a special triple-issue of Computers and the Humanities (CHUM volume 29, numbers 1-3, 1995). The volume bibliography on SGML/TEI (pages 233-242), however, is included only in this book version. Articles in the first CHUM issue: Charles F. Goldfarb, Preface; Nancy Ide and Michael Sperberg-McQueen, The Text Encoding Initiative: Its History, Goals, and Future Development; C. M. Sperberg-McQueen and Lou Burnard, The Design of the TEI Encoding Scheme; Lou Burnard, What is SGML and How Does It Help; Harry Gaylord, Character Representation; Richard Giordano, The TEI Header and the Documentation of Electronic Texts; Dominic Dunlop, Practical Considerations in the Use of TEI Headers in Large Corpora. Articles in the second issue: David Chisholm and David Robey, Encoding Verse Texts; John Lavagnino and Elli Mylonas, The Show Must Go On: Problems of Tagging Performance Texts; Robin Cover and Peter Robinson, Encoding Textual Criticism; Daniel Greenstein and Lou Burnard, Speaking With One Voice: Encoding Standards and the Prospects for an Integrated Approach to Computing in History; Stig Johansson, The Encoding of Spoken Texts; Alan Melby, E-TIF: An Electronic Terminology Interchange Format; Nancy Ide and Jean Véronis, Encoding Dictionaries. Articles in the third issue: Steven J. DeRose and David Durand, The TEI Hypertext Guidelines; David Barnard, Lou Burnard, Jean-Pierre Gaspart, Lynne A. Price, C.M. Sperberg-McQueen, and Giovanni Battista Varile, Hierarchical Encoding of Text: Technical Problems and SGML Solutions; D. Terence Langendoen and Gary Simons, Rationale for the TEI Recommendations for Feature-Structure Markup.
See a volume description for further details, and the order blank from Kluwer.
For other journal special issues and monographs dedicated to the Text Encoding Initiative, see the relevant subentry for TEI.
See a similar article below.
The growing availability of dictionaries in electronic form calls for a model sophisticated enough to represent the richness of entries and enable complex information retrieval. Electronic dictionaries are a special kind of object, intermediary between a text and a database. Textual models are not powerful enough to handle complex information retrieval, and conventional database models are not flexible enough to handle the richness of their information. In this paper, we outline a scheme for representing electronic dictionaries which departs from previously proposed models. In particular, it allows for a full representation of sense nesting and defines an inheritance mechanism which enables the elimination of redundant information. The model provides flexibility which seems able to handle the varying structures of different monolingual dictionaries.
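The two features the abstract singles out, nested senses and inheritance that removes redundant information, can be illustrated with a small sketch. It is an assumption-laden toy, not the authors' model: the `Sense` class, its method names, and the sample entry are all invented for illustration.

```python
class Sense:
    """A node in a dictionary entry; sub-senses nest to any depth (hypothetical design)."""

    def __init__(self, parent=None, **features):
        self.parent = parent
        self.features = features
        self.children = []

    def add(self, **features):
        """Attach a nested sub-sense that states only what differs from its parent."""
        child = Sense(parent=self, **features)
        self.children.append(child)
        return child

    def get(self, name):
        """Look a feature up locally, then in enclosing senses (inheritance)."""
        node = self
        while node is not None:
            if name in node.features:
                return node.features[name]
            node = node.parent
        return None


# Invented sample entry: part of speech is stated once at the top level
# and overridden only where a sense differs.
entry = Sense(headword="bank", pos="noun")
s1 = entry.add(domain="finance")
s1a = s1.add(definition="an institution that holds deposits")
s2 = entry.add(pos="verb", definition="to tilt an aircraft in a turn")
```

Here `s1a` stores only its definition; its part of speech and domain are found by walking up the nesting, while `s2` overrides the inherited part of speech locally.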
Abstract: "SGML, which is used for document interchange among various environments, is a meta language to describe documents. Before marking up a document, we need to prepare a DTD that defines a document structure.
In general, a DTD applicable to diverse document classes is incompatible with a DTD focusing on the semantic features of documents. If the number of DTDs grows, the costs of developing application programs for the DTDs would also skyrocket.
To apply a DTD focusing on the semantic features to diverse document classes, we developed a system which, from a base generic DTD, derives a different DTD for each document class. Our system also has a function that translates derived DTD instances to base DTD instances. This function frees us from the burden of developing application programs separately for each of the derived DTDs."
Note: The above presentation was part of the "SGML Case Studies" track at SGML '96. The SGML '96 Conference Proceedings volume containing the full text of the paper may be obtained from GCA.
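The translation step described in the abstract, turning an instance of a derived DTD into an instance of the base DTD so that one application can serve both, might look roughly like the sketch below. It assumes well-formed XML rather than full SGML, and the element names and the mapping table are hypothetical; the paper's actual system is not shown here.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from element names in a derived DTD to the
# corresponding names in the base DTD (names invented for illustration).
DERIVED_TO_BASE = {"manual": "doc", "chapter": "div", "heading": "title"}


def to_base(xml_text, mapping=DERIVED_TO_BASE):
    """Rename derived element types to their base types; content is untouched."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        # Elements without a mapping entry are assumed to be shared by both DTDs.
        elem.tag = mapping.get(elem.tag, elem.tag)
    return ET.tostring(root, encoding="unicode")


derived = "<manual><chapter><heading>Scope</heading><p>Text.</p></chapter></manual>"
base = to_base(derived)
```

An application written against the base element names can then process `base` without knowing which derived DTD the document was authored in.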
The article summarizes important findings in a recent market research report published by InterConsult, Inc. The 1994 market study on SGML (Standard Generalized Markup Language) "asserts that SGML expenditures now represent 21% of the overall publishing software market, and predicts that the percentage will rise to 30% by 1998, as worldwide revenues for SGML software and services continue to grow more than 30% annually." According to the report, revenues from SGML services for 1993 were "$77 million higher than what was predicted in 1992..." "The study predicts market revenues for the next four years in nine market segments: integration services, conversion services, electronic delivery, parsing, composing, graphics, database and document management, autotagging and conversion software and authoring. Some of these will still be growing well in four years; other segments will peak as SGML becomes more of a mainstream technology..." Contact: InterConsult, 366 Massachusetts Ave., Arlington, MA 02174; Tel: (617) 646-9600, FAX (617) 646-9615.
No personal author is given. The volume was announced as available for free: call 1-800-955-5323; or contact via surface mail: Interleaf, Inc., Prospect Place, 9 Hillside Avenue, Waltham, MA 02154. A copy of the work is also available via HTTP (HTML format): connect with a WWW client to Interleaf, or in case of link failure, use this mirror copy.
Revision and addition of part 2 (alphabetic 3-character codes) is underway: see ISO 639-2 below. For ISO 639:1988, see provisionally (a) the primary data from the 1988 standard as given here from Keld, or (b) a different compilation of the ISO 639:1988 language codes, or (c) the comparable MARC 3-character language codes, from about 1991, and (d) now, the update to USMARC Code List for Languages from November 15, 1996; [mirror copy].
ISO 639:1988 is a technical revision of ISO 639:1967, prepared by Technical Committee ISO/TC 37. The two-character language codes of ISO 639 are relevant to SGML encoding in two respects. First, the SGML standard (ISO 8879) itself specifies that declaration of 'public text language' should be given using the language code(s) from ISO 639; see ISO 8879-1986(E) page 36, section 10.2.2.3. Second, the WSD (Writing System Declaration) implemented in the Text Encoding Initiative uses the two-character language code of ISO 639 (as amended) as a 'language.code' attribute of the 'nat.language' declaration, specifying the language in which the WSD is written.
ISO 639 contains much other information about the use of language symbols, registration of new symbols, etc. The language codes of ISO 639 are said to be "devised primarily for use in terminology, lexicography and linguistics, but they may be used for any application requiring the expression of languages in coded form." The registration authority for ISO 639 is given as Infoterm, Österreichisches Normungsinstitut (ON), Postfach 130, A-1021 Vienna, AUSTRIA.
The two-character language codes of ISO 639 are recognized as being inadequate for use as SGML language attributes when tagging text, viz, for use as global 'lang' attributes attached to any element to identify the language of the text element or a language shift. In principle, there should be nothing wrong with tagging language using SGML elements rather than attributes, if the encoder has principled reasons for not using attributes (e.g., indexing engines which read simple tags but not SGML attributes). But the two-character codes of ISO 639 are neither sufficiently mnemonic nor complete for the world's languages: whereas ISO 639 supplies codes for only about 136 languages, the Ethnologue published by the Summer Institute of Linguistics identifies over 6100 languages (see
Abstract: "This part of ISO 639 provides 3-character alphabetic symbols for the (re)presentation of names of languages. The symbols were devised primarily for libraries, information services, and publishers to use to indicate language in the exchange of information, especially in computerized systems. These symbols have been widely used in the library community, however, they may be used for any application requiring the expression of language in coded form, including use by terminologists and lexicographers. The list is considered to be an open list. This part of ISO 639 also includes guidance on the creation of language symbols and on their use in some of these applications. Languages designed exclusively for machine use, such as computer programming languages, are not included in this code list." There are about 404 language names in the list. See, for comparison: the bibliography entry for the ANSI/NISO standard, or NISO 3-character language codes (Z39.53-1994) [unofficial], [mirror copy]. ISO 639-2 codes are supposed to be based upon (?) the ANSI/NISO set.
ISO CD 639/2:12/16/91 culminates more than three years of intense collaboration between the representatives of ISO TC 37/SC2 (Layout of Vocabularies) and ISO TC 46/SC4 (Computer Applications in Information and Documentation). It preserves the principal features of ISO 639-1 (the existing alpha-2 list) while articulating a code that meets the needs of librarians, managers of bibliographic services, and information specialists. The document is out for DIS ballot until April 15, 1992; it is anticipated that executive action will be taken on the DIS following the meeting of ISO TC 46 in London, May 18-22, 1992. Since the list of 3-character language codes is considered to be an open list, the ISO Council has designated a registration authority for 639 part 2. Proposals for allocating new language symbols should be directed to this authority. It is the Library of Congress, c/o Collection Services, Washington, DC 20540. See the list of language codes from a 1992 draft version.
Under: Technical committee / subcommittee: TC 46. Online lists: FTP from the RIPE server: ftp://info.ripe.net/iso3166-countrycodes, [mirror copy], or: Codes for Representation of Names of Countries (ISO 3166-1993 (E)), [mirror copy].
See also: ISO/DIS 3166-1 Codes for the representation of names of countries and their subdivisions -- Part 1: Country codes (Revision of ISO 3166:1993); and: ISO/DIS 3166-2 Codes for the representation of names of countries and their subdivisions -- Part 2: Country subdivision code.
ISO/IEC 8632-1[-4] 1992(E). Second edition. Part 1. Functional specification; Part 2. Character encoding; Part 3. Binary encoding; Part 4. Clear text encoding. This standard supersedes the earlier standard: CGM:1986 (ANSI X3.122-1986). For other information on CGM, see the main database entry for Computer Graphics Metafile.
With Amendment A1 (1988), ISO 8879 constitutes the core specification for SGML. A subset of SGML became a US FIPS (Federal Information Processing Standard) in 1988. The British Standards Institution adopted SGML as a national standard (BS 6868) in 1987, and in 1989 SGML was adopted by the CEN/CENELEC Standards Committees as a European standard, #28879. Australia has dual numbered versions of ISO 8879 SGML and ISO 9069 SDIF (AS 3514 - SGML 1987; AS 3649 - 1990 SDIF). The full text of this ISO standard with Amendment A is incorporated into the text of Charles Goldfarb's commentary (
This amendment is incorporated into the text of Charles Goldfarb's SGML commentary (
Also available as The British Standard Guide to SGML Document Interchange Format (SDIF), BS 7138 1989 (ISO 9069: 1988; see in "Snippets,"
The "public text" envisioned in this standard as applied to SGML might be DTDs (Document Type Definitions), or declaration subsets of DTDs, public entity sets, etc. Names include an owner name and an object identifier. Equivalent encodings for the names in ASN.1 and SGML may be supplied for interchange purposes. Note: "The intention of the amendment that has resulted in a 2nd edition is to extend 9070 beyond the simple boundaries of SGML only. It is now used by 9541 (and 10036) for the definition of 'structured names'. A New Work Item Proposal is being submitted to change the title and scope of 9070 to show its extended usefulness." (note from Paul Ellison, December 1991) [needs update]
[May 1996]: See also the main entry for ISO 9070 with information on the relevant WWW site.
A major revision of this TR underway (as of May 1990) will result in a new TR with (16) parts: (1) SGML Tutorial; (2) Basic Techniques; (3) Advanced Techniques; (4) Using Short References for Identifying Markup; (5) Using non-Latin Alphabets; (6) Referencing and Synchronisation; (7) Mathematics and Chemistry; (8) Tables; (9) Using SGML for Computer-to-Computer Interchange; (10) Designing Applications for Database Interfacing; (11) Application at ISO CS for International Standards and Technical Reports; (12) Public Entity Sets for General and Publishing Symbols; (13) Public Entity Sets for Mathematics and Science; (14) Public Entity Sets for Latin Based Alphabets; (15) Public Entity Sets for non-Latin Based Alphabets; (16) Public Entity Sets for Ideograms (adapted from Ludo Van Vooren, "SGML Standards Committee Update: Activities of ISO SC 18 WG8,"
See further information on this standard in the Related Standards page. A future version is to include an "ISO chemical character set, ISOchem"; see a note by Martin Bryan (September 1995).
See also: SGML Public Entity Sets, Proposals. [relative to: http://www.ornl.gov/sgml/wg8/9573ent/ENTITIES.HTM]. Sample collections of entities and glyphs (proposed) for potential inclusion into ISO 9573. For: Ugaritic, Old Persian, Glagolitic, Croatian, Buginese, Cherokee, and Gothic Uncials. Developed by Anders Berglund and others.
The document supplies technical guidance for the development of context-sensitive SGML editors. See "Guidelines for Syntax-Directed Editing Systems,"
Voting on the current DIS began 1994-08-10 [and was to end mid-December 1994 or early 1995]. A posting to CTS in early 1995 by James Clark confirmed that negative votes had not been received, and that the vote was therefore expected to pass.
SUMMARY: "This International Standard defines the Document Style Semantics and Specification Language (DSSSL) used to specify formatting and other transformations of SGML-encoded documents. The initial focus of DSSSL is on formatting for both paper and electronic media, and on the conversion of SGML documents encoded according to different DTDs.
This International Standard has been structured to permit future sections to be added to this International Standard to cover the other areas of document processing and data management.
The main objective of the DSSSL Standard is to provide a specification language for expressing formatting and other document processing specifications in a formal and rigorous manner so that these specifications may be processed by a broad range of formatters, either natively or using a translation mechanism.
The DSSSL specification language will include tree transformation specifications and formatting specifications and other semantics to allow users to specify the types of formatting to be applied to various objects during composition and layout and pagination.
For formatting, a DSSSL-driven implementation can create a style sheet language that can be mapped into the DSSSL typographic characteristics and other composition and layout semantics.
In addition to the basic formatting semantics, DSSSL includes a language for writing a general transformation specification that provides the capability to transform documents from one SGML application into another.
DSSSL is designed to allow for specifications that apply to a class of documents. These specifications are applicable to all possible document instances in an SGML application as well as to a particular document instance.
The DSSSL specification language is declarative; it is not intended to be a complete programming language, although it contains constructs normally associated with such languages and provides a well-defined interface to a user-selected programming language, if such a capability is required. DSSSL specifications can be unambiguously parsed and interpreted among heterogeneous systems. In addition, DSSSL specifications can be used by existing formatting systems through the use of "front-end" DSSSL processors and translators. DSSSL has no bias toward batch or WYSIWYG formatting systems and does not prescribe any predefined formatting algorithms.
The standardization of formatting semantics is provided in DSSSL through a set of basic structures known as flow objects and the associated set of formatting characteristics that are applied to these objects. DSSSL provides mechanisms for defining and extending the semantic constructs so that a DSSSL application designer can construct a DSSSL application in a manner that best reflects his application environment." [transcription from the Introduction (DIS 1994-08-10)]
For a summary, see: (1)
See now [June 1995] further information in a separate SPDL entry within this database, including pointers to availability of the 1995 draft standard via the Internet (e.g., from the WG8 FTP server and from the SGML Repository).
Description from the 1991 CD version: SMDL "defines a language for the representation of music information, either alone, or in conjunction with text, graphics, or other information needed for publishing or business purposes." Multimedia time sequence information is also supported. SMDL is a HyTime application conforming to ISO/IEC DIS 10744 Hypermedia/Time-based Structuring Language (HyTime), and an SGML application conforming to Standard Generalized Markup Language (ISO 8879:1986). An earlier version was published by ANSI (American National Standards Institute), as ANSI X3V1.8M Journal of Development. ANSI Project X3.542-D. Standard Music Description Language (SMDL). X3V1.8M/SD-8. 60 pages. Sixth Draft. April 15, 1990. See a description of SMDL in an overview article: Steven R. Newcomb, "Standards. Standard Music Description Language Complies with Hypermedia Standard,"
See now [July 1995] further information in a separate SMDL entry within this database, including pointers to availability of the 1995 draft standard (DIS) via the Internet. Or see an overview taken from the DIS.
"HyTime is a standard neutral markup language for representing hypertext, multimedia, hypermedia, and time- and space-based documents in terms of their logical structure. Its purpose is to make hyperdocuments interoperable and maintainable over the long term. HyTime can be used to represent documents containing any combination of digital notations. HyTime is parsable as Standard Generalized Markup Language (ISO 8879:1986). HyTime provides standardized means of expressing (1) intra- and extra-document locations, and arbitrary links between them, (2) the scheduling of multimedia objects in 'finite coordinate spaces,' and (3) rendering instructions for arbitrarily projecting such objects onto other finite coordinate spaces, and other constructs." [taken from an abstract in
For further information on HyTime, see (1) the WWW SGML Page HyTime main entry, (2) the book by Steve DeRose and David Durand, (3) the book by Eliot Kimber, and (4) the
See also Technical Corrigendum 1 to ISO/IEC 10744 [by Charles F. Goldfarb], Draft for ballot: March 27, 1995. The relevant documents are available from the SGML Repository or via this server as three text files: httc1.txt (24K), hi1anarc.txt (46K), and hi1anfsi.txt (22K).
The standard was prepared by Technical Committee ISO/TC 46, Information and documentation, Subcommittee SC4, Computer applications in information and documentation. The title "ISO..." appeared on the print copy distributed in mid-1994 by NISO/EPSIG, despite errors: it was apparently a premature printing. This "ISO" standard supersedes the 1988 (EPSIG/AAP) standard authorized by ANSI/NISO; see the bibliographic reference. The standard included three public DTDs (books, articles, serials) in "final" form and a provisional DTD for mathematics. The ISO 12083 DTDs [though not now in final form (November 1994)] are available on the Exeter SGML Project server and elsewhere; try: Exeter ftp://info.ex.ac.uk/ISO-12083/ or else ftp://actd.saic.com/pub/SGML/ISO-12083/. Although several requests have been made on CTS for release of electronic copies of the DTDs into public space, it remains unclear whether ISO will authorize this form of distribution for the DTDs.
See the EPSIG description "About the Standard"; [mirror copy]
SMSL "Extends HyTime by providing SGML meta-DTD architectural forms for describing the object classes, virtual functions, messages, aggregates and class/data membership used in a multimedia presentation's script. Also contains definitions for a starter-set of functions used by scripting languages." [from: Index of OII Standards Report.]
The SMSL Committee Draft ISO/IEC 13240 is available in Postscript format; [mirror copy, December 22, 1996]. See the main SMSL entry for other details.
Voting ran from 1993-08-12 through 1994-02-12. [Entry needs update. Make links to Conformance Testing (Initiative) on main page.]
Abstract: "This presentation explains the goals for BNA's new publishing system, why BNA chose SGML as an integral part of that system, and provides an overview of how BNA implemented the system. Topics covered include undertaking business process re-engineering, adopting SGML, converting legacy data, and lessons learned during the process. BNA (The Bureau of National Affairs, Inc.) and its subsidiaries provide labor, legal, economic, and regulatory information to business, professional, government, and academic users."
"It really all boils down to the data and the fact that the data is the company's most valuable resource (second only to the people who create it). We used the term 'data repository' to refer to BNA's entire collection of documents and other data, including primary source laws, regulations, opinions, internally created news stories, legal headnotes, and reference materials. BNA has acknowledged that we must manage documents as a corporate asset and we must have the ability to search, retrieve, and update documents throughout the publishing life cycle. SGML was chosen as a way to identify and protect the data. BNA started over 50 years ago using typesetting instructions for Linotype operators. In the 70s we used two digit 'locator codes' to identify typesetting instructions. In 1980 we switched to proprietary (Atex coding) to produce our notification and daily publications. In 1985, with the purchase of a Datalogics system to produce our looseleaf publications, we began using unparsed SGML-like coding. Oh, if we could only recover from the blunder of using unparsed data!"
This paper was delivered as part of the "User" track in the SGML/XML '97 Conference.
Note: The SGML/XML '97 conference proceedings volume is available from the Graphic Communications Association, 100 Daingerfield Road, Alexandria, VA 22314-2888; Tel: +1 (703) 519-8160; FAX: +1 (703) 548-2867. Complete information about the conference (e.g., program listing, tutorials, show guide, DTDs, conference reports) is provided in the dedicated conference section of the SGML/XML Web Page and via the GCA Web server. The electronic proceedings on CDROM was produced courtesy of Jouve Data Management (Jouve PubUser).
Abstract: "Sgrep is a Unix tool for searching the contents of text files. Sgrep implements an algebra of unrestricted text fragments called regions. The algebra allows the retrieval of document components, represented as regions, based on conditions on their relative containment and ordering. This simple yet powerful model is suitable for querying structured document formats like electronic mail, RTF, LaTeX, HTML, or SGML documents. We describe the sgrep query language and give examples of its use. Especially, we explain how sgrep can be used for querying and assembling SGML documents."
Available online in Postscript format: ftp://ftp.cs.helsinki.fi/pub/Reports/by_Project/DocMan/Using_sgrep_for_querying_structured_text_files.ps.gz; [mirror copy]. See also the software main entry: 'sgrep' grep-like searching of structured documents.
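The containment-based retrieval described in the abstract above can be illustrated with a minimal Python sketch. It is not sgrep's query syntax or implementation: the helper names `find_regions` and `containing` and the sample document are hypothetical, and regions are assumed to be non-nested (start, end) character offsets.

```python
from typing import List, Tuple

# A region is a (start, end) pair of character offsets; end is exclusive.
Region = Tuple[int, int]


def find_regions(text: str, start_tag: str, end_tag: str) -> List[Region]:
    """Return non-nested regions running from each start_tag to the next end_tag."""
    regions: List[Region] = []
    pos = 0
    while True:
        start = text.find(start_tag, pos)
        if start == -1:
            break
        end = text.find(end_tag, start + len(start_tag))
        if end == -1:
            break
        end += len(end_tag)
        regions.append((start, end))
        pos = end
    return regions


def containing(outer: List[Region], inner: List[Region]) -> List[Region]:
    """Keep the outer regions that contain at least one inner region."""
    return [
        o for o in outer
        if any(o[0] <= i[0] and i[1] <= o[1] for i in inner)
    ]


# Hypothetical sample: retrieve the sections that contain a title.
doc = "<sec><title>A</title><p>x</p></sec><sec><p>y</p></sec>"
sections = find_regions(doc, "<sec>", "</sec>")
titles = find_regions(doc, "<title>", "</title>")
titled_sections = containing(sections, titles)
```

Treating document components as plain offset pairs is what lets such a model apply to any tagged text format without a parser, at the cost of ignoring nesting in this simplified version.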
Abstract: "We present a powerful document transformation language called TranSID, which is targeted at structured (SGML) documents. The language is based on a powerful model where the entire input document tree may be referenced during the transformation process. The evaluation is performed in a bottom-up manner. A language evaluator has been implemented which runs in Unix environments."
Note also the longer work by Greger Lindén: Structured Document Transformations, PhD Thesis, Report A-1997-2, Department of Computer Science, University of Helsinki, June 1997. 122 pages. Available online in Postscript format, via FTP.
The document is available online in Postscript format: via FTP; [local archive copy].
The paper was also published in the Proceedings of The Fifth Symposium on Programming Languages and Software Tools, Jyväskylä, Finland, June 7-8, 1997, ed. Jukka Paakki, pages 72-83, Technical Report C-1997-37, University of Helsinki, Department of Computer Science, June 1997.
"Abstract: Patent and Trademark Office (PTO) Commissioner Bruce A. Lehman reported to a House subcommittee that the development of an electronic filing system is 'critical' to the Office's efforts to reduce patent filing time. The PTO, which recently unveiled the Automated Patent System, hopes to reduce patent processing time to 12 months, down from the high of about 3 years in the mid-1980s. The electronic filing system is necessary to reach this goal while supporting a workload that grows 6 percent annually. The PTO will choose between two off-the-shelf applications based on SGML, one developed by InContext, the other by Microstar Software. The candidates will be tested at small companies starting in August 1996, and Lehman hopes electronic filing to be available within three years."
Abstract: "The Railroad Industry Forum (RIF) is a team of the National Association of Purchasing Managers who were tasked to develop a standard for the exchange of electronic parts catalog data within the North American railroad industry. The RIF members are comprised of major railroads and railroad manufacturers. Mary McCarthy and Betty Harvey, Electronic Commerce Connection, Inc. developed the EPCES DTD. EPCES - Electronic Parts Catalog Exchange Standard, is a standard that was developed by the RIF for interchange and presentation of illustrated parts catalogs. The presentation of EPCES information has been designed to facilitate point and click capability. LinkOne is an electronic parts catalog and service manual delivery system. It has been developed to enable electronic viewing of parts and service information for manufactured equipment and processes. LinkOne provides point and click functionality between graphics and textual information. ISOGEN International Corporation has developed an EPCES filter for LinkOne to support importing and/or exporting parts catalog information from the manufacturers or railroads in SGML compliant to the EPCES standard."
This paper was delivered as part of the "Case Studies" track in the SGML/XML '97 Conference.
For more information on RIF, see the dedicated database entry Railroad Industry Forum: Electronic Parts Catalog Exchange Standard (EPCES), or the description provided by Betty Harvey via the Electronic Commerce Connection web server.
Letter to the editor ("Inbox" department), suggesting that some of the problems identified in Byte's article "Work Flow Without Fear" can be addressed through SGML: "SGML can be used to define the interface format of the documents through the work flow."
The article is a tour of East Asia focused upon SGML issues. Central issues include the character repertoire, character encodings, and native-language document markup. The sidebar article by Gavin Nicol ("Postcard from Tokyo on HTML") discusses ERCS and ISO-2022-IPEUC in relation to Asian language support in SGML/HTML.
See the provisional volume description in a separate document. See also provisionally the Amazon.com entry: "Synopsis: SGML experts are in short supply and in high demand. This book will help jump start SGML users by providing 'cookbook recipes' for the most common SGML document type definitions (DTDs). The CD-ROM contains hundreds of sample DTDs that users can cut and paste from to create their own DTD." [amazon.com]
Abstract: "The paper presents experiences based on the study of a pilot project integrating an SGML-based document processing system at the University of Oslo, Norway. The experiences are examined from three perspectives in order to discuss them in relation to different aspects of the system; the use situation, the organizational benefits and challenges, and the technological requirements. Improving the system based on experiences within one perspective may lead to conflicts to consider when improving the system based on experiences found within other perspectives. The paper states and discusses some of the conflicts in SGML-based document systems. The paper concludes with challenges in development and use of SGML-based document systems, and states some issues for further research."
The document is available online in HTML format or PDF; [local archive copy].
"Abstract: An electronic dictionary system (EDS) is developed with object-oriented database techniques based on ObjectStore. The EDS is composed of two parts: the Database Building Program (DBP), and the Database Querying Program (DQP). DBP reads in a dictionary encoded in SGML tags, and builds a database composed of a collection of trees which holds dictionary entries, and several lists which contain items of various lexical categories. With text exchangeability introduced by the SGML, DBP is able to accommodate dictionaries of different languages with different structures, after easy modification of a configuration file. The tree model, the Category Lists, and an optimization procedure enables DQP to quickly accomplish complicated queries, including context requirements, via simple SQL-like syntax and straightforward search methods. Results show that compared with relational database, DQP enjoys much higher speed and flexibility. With EDS this paper demonstrates how to apply OODBMS's to systems that handle text information with strong yet varied intrinsic hierarchies."
Abstract: "There is a great deal of variation in the encoding of spoken texts in electronic form, both with respect to the types of features represented and the way particular features are rendered. This paper surveys problems in the electronic representation of speech and presents the solutions proposed by the Text Encoding Initiative. The special tags needed for the encoding of spoken texts are discussed, including a mechanism for temporal alignment. Further work is needed on phonological aspects, parallel representation, and on the development of software which connects the systematic underlying representation with a workable format for input and display."
The manual describes the TEI/SGML encoding scheme used to mark up text samples used in the parallel text project. Available on the Internet in HTML format: http://www.hd.uib.no/doc.html: ENPC Documentation [mirror copy].
Available on the Internet in Postscript format: ftp://ftp.hd.uib.no/pub/corpora/enpc.poznan.ps [mirror copy].
The document contains a print version of the (TEI/SGML) DTD used in the parallel text corpus, and examples. Available on the Internet in Postscript format: ftp://ftp.hd.uib.no/pub/corpora/enpc.lund.ps [mirror copy]. For further details, see the main entry for the English-Norwegian Parallel Corpus.
Based upon a paper from the Fourteenth International Conference on English Language Research on Computerized Corpora, Zürich, May 19-23, 1993.
Republished in version 2.0 as "Electronic Texts and their Use for Literary Research". See Electronic Texts and Computer Research by Eric Johnson.
The article is available online via Eric Johnson's WWW server: 'Electronic Texts' [mirror copy here, text only].
"Scholars in the humanities today are routinely doing textual and linguistic research that a generation ago would have been impossible or would have required the dedication of a lifetime. Such research is now feasible because humanists use computers and because texts of major writers are available in electronic form.
The Oxford Electronic Text Library edition of The Complete Works of Jane Austen (OETL Austen) is exactly the kind of electronic text that modern scholars need. It is an accurate rendering of R. W. Chapman's Oxford Illustrated Jane Austen, the standard scholarly edition of Austen, and it contains a wealth of useful information encoded in Standard Generalized Markup Language (SGML). The OETL Austen is distributed in both MS-DOS and Macintosh formats, and a site license is available. It will be used in a multitude of ways by students of Austen for years to come." [from the Introduction]
Johnson favorably reviews the OETL Austen, which uses SGML to structure the electronic text. A copy of the document is available online in HTML format. [Pages under construction: try simply "http://www.dsu.edu/~johnsone/" if the previous link fails.] Full information for ordering the electronic text edition is given in the review. See also a summary of the review by Mary Mallery.
Abstract: A logical history of document editing mechanisms is presented. The design space for document style mechanisms is analyzed. Six primary design issues and the subsidiary issues they raise are discussed. Some major style issues that are seen as the subject of future research are identified.
The author writes a positive review, delineating improvements in the second edition.
Abstract: "As Internet tools become more sophisticated, many scientists are abandoning conventional methods of communication, such as the journal and the scientific conference, in favour of electronic means. The obvious benefit of Internet-based communication is the ability to share and discuss data, analysis techniques and conclusions without leaving the laboratory. More importantly, however, the Internet is also inspiring the creation of completely new ways of communication that may have a profound effect on how science is done. The paper discusses the Chemical Markup Language (CML) which facilitates the exchange of chemical information on the Internet. The CML project aims to ensure that chemical software and databases are compatible for use with CML, by means of collaboration with their creators." [CML is an experimental application of XML, Extensible Markup Language.]
This article examines the problem of document representation in computer systems for printing, editing or interchange among heterogeneous systems. After a discussion of the various possibilities for defining document representation formalisms, it considers a number of standard representations typical of their class: page description languages, SGML, Interscript, ODA. Several other articles in the volume are of direct or marginal relevance to SGML as a metalanguage for document-structuring.
Abstract: "This paper starts by tracing the architecture of document preparation systems. Two basic types of representations appear: at the page level or at logical level. The paper then focuses on logical level representation and tries to survey three existing formalisms: SGML, Interscript, and ODA."
"Abstract: Two technologies have come together to make online technical publishing begin to work. The first and foremost of these technologies is the Internet. Without this massive network of computers and communication equipment, putting a digital library icon on a lab workstation and on an office desktop would have been both problem-plagued and expensive. The second of these facilitating technologies is Standard Generalized Markup Language (ISO 8879: SGML). SGML is, by one definition, a meta-language with which one can capture the structure and semantics of a class of documents. It is internationally recognized as a standard for document representation. Although SGML products have been available for years, the past two years have seen a real growth in interest and use of this technology. AIP has adopted ISO 12083 as the basis for its SGML documents. As a standard, ISO 12083 is overseen by an international working group but not owned by any one organization."
Available from UMI: University Microfilms International, Inc., Number 8804059.
Summary: Several improvements are suggested to the syntax of SGML, the recent international standard for the description of electronic document types. These improvements ease processing by existing tools, remove ambiguity cleanly, and increase human usability. They also indicate some guidelines that should be followed in the design and specification of computer-software standards. By following accepted computer-science conventions for the description of languages the design of a standard may be improved, and the subsequent implementation task simplified.
Draft version 18-October-1988, "accepted for publication in
See also the response of Ron Hayter, "Comments on 'On Improving SGML'," Technical Bulletin 4. Software Exoterica Corporation [OmniMark], 1988. Ron Hayter argues that Kaelbling's "improvements" to SGML are based upon a misunderstanding of the intent of the standard. Kaelbling's original draft known to Hayter was apparently 16-March-1988; Kaelbling's revised draft of 18-October-1988 responds to Hayter's comments.
Abstract: "Several improvements are suggested to the syntax of SGML, the recent international standard for the description of electronic document types. These improvements ease processing by existing tools, remove ambiguity cleanly, and increase human usability. They also indicate some guidelines that should be followed in the design and specification of computer-software standards. By following accepted computer-science conventions for the description of languages the design of a standard may be improved, and the subsequent implementation task simplified."
Received 16-March-1988, Revised 18-May-1990. Another version of the paper is found in OSU-CIRSC-7/88-TR22. Author affiliation: Siemens AG, ZFE IS EA 11; Corporate Applied Computer Sciences; Otto-Hahn-Ring 6; 8000 Munich 83, FRG.
Abstract: "The advantage of structured markup in SGML (Standard Generalized Markup Language) has recently become clear. This technology is being used to automatically convert documents into accessible forms for blind people. In Germany one of the first sets of documents available in SGML is the scientific journal article headers from the "Springer Verlag Journal Preview Service". This article gives a description of the "Journal Header Reader" application. We developed this application to make scientific documents in several formats accessible to blind people. The following chapter gives an overview of the SGML facilities used in our project." [from the document introduction]
Available on the Internet in HTML format: A Journal Header Reader program for the blind, [mirror copy, November 1995].
Abstract: "Structured reporting systems allow health-care workers to record observations using predetermined data elements and formats. The author developed the Data-entry and Reporting Markup Language (DRML) to provide a generalized representational language for describing concepts to be included in structured reporting applications. DRML is based on the Standard Generalized Markup Language (SGML), an internationally accepted standard for document interchange. The use of DRML is demonstrated with the SPIDER system, which uses public-domain internet technology for structured data entry and reporting. SPIDER uses DRML documents to create structured data-entry forms, outline-format textual reports, and datasets for analysis of aggregate results. Applications of DRML include its use in radiology results reporting and a health status questionnaire. DRML allows system designers to create a wide variety of clinical reporting applications and survey instruments, and helps overcome some of the limitations seen in earlier structured reporting systems."
See the main database entry for SPIDER - Structured Platform-Independent Data Entry and Reporting, or the web site for SPIDER. An online document (Postscript) is available which describes DRML: http://www.mcw.edu/midas/papers/AMIA96-DRML.ps; local archive copy.
Abstract: "Structured reporting systems allow health care providers to record observations using predetermined data elements and formats. We present a generalized language, based on the Standard Generalized Markup Language (SGML), for platform-independent structured reporting. DRML (Data-entry and Report Markup Language) specifies hierarchically organized concepts to be included in data-entry forms and reports. DRML documents serve as the knowledge base for SPIDER, a reporting system that uses the World Wide Web as its data-entry medium. SPIDER generates platform-independent documents that incorporate familiar data-entry objects such as text windows, checkboxes, and radio buttons. From the data entered on these forms, SPIDER uses its knowledge base to generate outline-format textual reports, and creates datasets for analysis of aggregate results. DRML allows knowledge engineers to design a wide variety of clinical reports and survey instruments."
See the main database entry for SPIDER - Structured Platform-Independent Data Entry and Reporting, or the web site for SPIDER - Structured Platform-Independent Data Entry and Reporting. An online version of the document is available in Postscript format: http://www.mcw.edu/midas/papers/AMIA96-DRML.ps; local archive copy.
Abstract: "Structured reporting systems allow physicians to record findings by using predefined vocabularies and data-entry formats. The data-entry and reporting markup language (DRML) is used to define structured reporting applications for the SPIDER (structured platform-independent data entry and reporting) system. World Wide Web technology can be used to implement systems for structured entry and retrieval of medical data. The SPIDER system and its DRML report-definition language provide simple, platform-independent tools for structured reporting that conform to internationally recognized standards. The article guides readers through the use of DRML and SPIDER, and allows readers to interactively create structured reporting applications."
"DRML is a generalized report-specification language that simplifies the creation and maintenance of structured reporting applications. The specification of DRML as an SGML document type definition provides standardization that allows DRML documents to be used and exchanged across various computing platforms. Systems for publishing and on-screen editing of SGML documents are available commercially [. . .] . Such programs allow interactive, on-screen editing of DRML documents. Software is also available for validating the syntax of SGML documents [...] By including the DRML document type definition within a document (either explicitly or by reference), such software can be used to check the syntax of a DRML report definition. World Wide Web technology can be used to implement systems for structured entry and retrieval of medical data. The SPIDER system and its DRML report-definition language provide simple, platform-independent tools for structured reporting that conform to internationally recognized standards. This article has demonstrated their use for interactively creating structured reporting applications." [from the conclusion]
The document is available online in HTML format; see also [target URL, registration may be requested].
[Received January 22, 1997; revision requested February 26; revision received and accepted March 3; posted March 10. Supported in part by The Whitaker Foundation (Biomedical Engineering Research Grant to C.E.K.) and the National Library of Medicine (USPHS grant G08 LM05705). Presented in part as infoRAD exhibit 9111WKS at the 82nd Scientific Assembly and Annual Meeting of the Radiological Society of North America, Chicago, December 1.]
Abstract: "The content-reuse system of The Wall Street Journal Interactive Edition makes extensive use of SGML and XML to reorganize and reformat the content presented in the main wsj.com website. This paper discusses how the structures that define an Interactive Journal edition and its component articles are queried, processed, and converted by automatically triggered content-processors, allowing us to quickly fill requests by potential publishing partners to feature our branded content in their contexts."
[Conclusion:] '. . . All of our content-reuse processes owe their flexibility and ease of implementation to our use of SGML and XML. Articles created in SGML have been translated and served out in all sorts of flavors of HTML and other plain text formats. Edition structures and configuration files specified in XML are processed and tailored by custom software that allows our editors to specify what constitutes a mini-edition. And when our automatically generated content falls short of serving their audiences completely, an editor can step in and finish the job. . . . Our editors and designers are charged with constantly improving how our news can be accessed, navigated through, presented, and used. And our business-development staff is constantly seeking new ways to raise the visibility of our brand, which often means spreading excerpts from our trove of content out to places and platforms that our primary web site would not otherwise reach. Having our news, and the processes that direct where that news belongs, in an extensible format has proved to be the key to fulfilling their requirements.'
The document is available online in PDF format - "News you can reuse." [local archive copy] For other articles in this issue of MLTP, see the annotated Table of Contents.
Revision: Received 7 July 1998, Revised 12 August 1998.
Abstract: "Using SGML within our Web publishing system not only allows us to create better-looking and more complicated HTML than editors could otherwise have authored using a native formatting language, but it also allows our editors and designers to massage the look of the edition as often as desired, and to produce spin-off products without additional editorial effort. To be presented will be an architectural overview describing how our publishing system offers editors a tremendous menu of publish-time choices."
"At The Wall Street Journal Interactive Edition, we have been using SGML to mark up news articles since our launch in April, 1996. The elements and attributes we use in our authoring system attempt to answer the question 'What is this content, and what makes it different?' as opposed to 'How do we want this to look in a Web browser?' Even though we may want a byline to wind up looking bold, we mark it up with a <byline> tag, not a <b> tag. Only later in the publishing process do we translate our documents into HTML and its variants. This paper will outline the benefits of this approach, and then describe in some detail how we create our SGML, and how we format it."
This paper was delivered as part of the "Case Studies" track in the SGML/XML '97 Conference.
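The descriptive-markup approach quoted above (a <byline> element in authoring, translated to bold only at publish time) can be sketched in Python as follows. This is an illustrative sketch only, not the Journal's actual publishing software: the mapping table, the function name `to_html`, and every element name other than `byline` are assumptions, and the input is assumed to be well-formed XML so that the standard library parser can read it.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical mapping from descriptive element names to HTML tags.
# Only "byline rendered bold" comes from the quoted text; the rest is assumed.
HTML_FOR = {
    "article": "div",
    "headline": "h1",
    "byline": "b",
    "para": "p",
}


def to_html(elem: ET.Element) -> str:
    """Recursively translate a descriptive element tree into HTML text."""
    tag = HTML_FOR.get(elem.tag, "span")  # unknown elements fall back to <span>
    parts = [f"<{tag}>", escape(elem.text or "")]
    for child in elem:
        parts.append(to_html(child))
        parts.append(escape(child.tail or ""))  # text following the child
    parts.append(f"</{tag}>")
    return "".join(parts)


# Hypothetical sample article marked up by meaning, not appearance.
source = (
    "<article><headline>Markets Rise</headline>"
    "<byline>By A. Reporter</byline>"
    "<para>Stocks &amp; bonds.</para></article>"
)
html_output = to_html(ET.fromstring(source))
```

Because presentation lives only in the mapping table, changing how a byline looks means editing one entry rather than re-authoring every article.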
Federal Aviation Administration guidelines prescribe compliance with SGML for specified information deliverables, and USAir Group Inc.'s maintenance division selects software products that conform to the SGML standard. The article describes how the workflow accounts for non-SGML data in the USAir information system as well.
Abstract: "By Fall 1986 the Oxford English Dictionary will have been completely entered into machine-readable form as a first step toward creating an integrated version of the Dictionary and its Supplement. The ability to update and revise the OED requires the addition of a considerable amount of structure to the keyboarded text. Various software approaches to transducing the text of the OED in order to add this structure were evaluated, and eventually INR and lsim were chosen. The use of INR, a program for computing finite automata, necessitated that the structure of the OED be described as a regular language. The methods used to describe the OED, resolve ambiguities and deal with space limitations are detailed. These methods are not limited to the OED, but may be applied to any text in which one wishes to augment the structural information."
The document was also submitted as a master's thesis (Master of Mathematics in Computer Science) to the University of Waterloo. See further on researches related to the production of NOED2 in the main entry for NOED.

