[This local archive copy mirrored from the canonical site: http://www.fstc.org/projects/sdml/Sd-xml.htm; links may not have complete integrity, so use the canonical document at this URL if possible.]

SDML & XML



Prepared by

Michael Lu, Citicorp

From E-mails of

Jim Flynn, @Work Technology

Frank Jaffe, BankBoston

Jeff Kravitz, IBM Research

Stuart Marks, Sun

First Written: April 14, 1998

Last Revision: April 20, 1998

Copyright © 1998




Table of Contents

Foreword

Issues

1. Application

1.1. Limited Memory Devices

1.1.1. End Tags

1.1.2. Use of "&" to Specify Order in DTD

1.2. Comments

1.3. Modifiable DTD

1.4. Attribute Vs. Element

2. Transport

2.1. Document Formatting Rules

2.1.1. Character Encoding

2.1.2. Line Formatting

2.1.3. Space Handling

2.2. MIME

3. Encoding

3.1. Character Encoding

4. Syntax

4.1. Well-Formed

4.1.1. End Tags

4.1.2. Empty Tags

4.1.3. Special Character Usage for Ampersands (&)

4.2. XML Identifier

Glossary



Foreword

In February 1998, W3C announced the release of the XML 1.0 specification as a W3C Recommendation. XML 1.0 is the W3C's first Recommendation for the Extensible Markup Language, a system for defining, validating, and sharing document formats on the Web. XML is a subset of an existing, widely used international text processing standard (Standard Generalized Markup Language) intended for use on the World Wide Web. XML retains SGML's basic features - vendor independence, user extensibility, complex structures, validation, and human readability. XML also offers several important advantages such as:

The Signed Document Mark-up Language (SDML) was developed as an outgrowth of the work of the Financial Services Technology Consortium (FSTC) Electronic Check project, and has been created, fine-tuned, and tested by multiple banks and hardware/software companies over a period of about 2 years. SDML may be used for a wide variety of purposes, such as electronic funds transfer, electronic commerce, or any form of signed contract or agreement.

SDML documents may be digitally signed using public key cryptographic signature and hash algorithms to provide a method of ensuring that documents have the following attributes:

The SDML signature mechanism also allows documents to be combined, or added to, without loss of these features.

Since SDML currently defines its structure largely through SGML, considerable interest has been generated in adopting XML as the specification for defining SDML.

Based on recent discussions and inputs from several key individuals associated with SDML, this paper attempts to identify various issues involved in making SDML XML-compliant. This document does not propose solutions to those issues. Instead, the purpose of this paper is to serve as a companion to the SDML 2.0 specification and provide insight into the logic behind certain design decisions that led to conflicts with XML. By publishing this document, the FSTC project team seeks to generate further discussions and input from external groups that are interested in signed documents. It is hoped that SDML and XML can evolve together and eventually resolve the differences between the two specifications to produce a W3C-approved Signed Document Mark-up Language Standard. While reading this document, keep in mind that SDML has its roots in FSML, the Financial Services Markup Language, and XML was not released to the public until two years after FSTC had begun its work on electronic check and SDML.

The issues associated with XML compliance are separated into four categories - Application, Transport, Encoding, and Syntax. Some of the issues have impacts that cross several categories and are often interrelated. The document is based on XML 1.0 and SDML 2.0 and assumes you have some familiarity with these two specifications. If you are unfamiliar with these topics, you can get more information from the following URLs:

http://www.fstc.org

http://www.w3c.org/XML


Issues

1. Application

1.1. Limited Memory Devices - The FSTC project team made several design decisions to allow SDML documents to be easily processed by limited-memory devices such as smart cards and PDAs. Some elements of the XML design specification would make this much more difficult. Specifically, to make SDML XML-compliant, the following issues need to be addressed:

1.1.1. End Tags - XML requires end tags on all elements, or else the element is considered empty (without content). The concern is that an XML-compliant version of SDML would make all SDML documents, such as E-Check, much larger, increasing their memory requirements and making SDML much more difficult to implement in limited-memory devices.

1.1.2. Use of "&" to Specify Order in DTD - XML forbids the use of the AND connector "&" in DTD element content models. In SGML and SDML, this symbol allows you to declare the content of an element and specify that all elements must appear, but leave the order of appearance open. Without the "&" connector, sub-tags must appear in one specific order. Some members of the project team feel that this presents a problem, since SDML's signature algorithms require some manipulation of the content of several element blocks. Forcing a pre-determined order, they believe, can lead to severe memory allocation problems in small-memory devices. Others, however, disagree. This topic is still under discussion.
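
The difference between the two kinds of content model can be illustrated with a minimal Python sketch. It is not taken from the SDML specification: the element names "payee", "amount", and "date" are hypothetical, and the functions only model the rule "each required element exactly once", in any order for an SGML-style "&" group and in fixed order for an XML-style "," sequence.

```python
def matches_and_group(children, required):
    """SGML-style (a & b & c): each required element once, in any order."""
    return sorted(children) == sorted(required)


def matches_sequence(children, required):
    """XML-style (a, b, c): each required element once, in the declared order."""
    return list(children) == list(required)


# Hypothetical element names, used only for illustration.
REQUIRED = ["payee", "amount", "date"]
reordered = ["date", "payee", "amount"]

print(matches_and_group(reordered, REQUIRED))  # accepted by the "&" model
print(matches_sequence(reordered, REQUIRED))   # rejected by the "," model
```

A producer that can only emit elements in the order it computes them passes the first check but may fail the second, which is the concern raised for small-memory devices.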

1.2. Comments - XML permits comments virtually anywhere in a document and its associated DTD. Some members of the working group felt that the ability to include comments might make cryptographic attacks on a signed document simpler, or might make parsing/processing software more complex. Two proposals were made regarding this issue:

  1. Applications need not preserve comments if they produce an output document that is a transformation of the input; or
  2. Disallow comments in SDML entirely.

However, a strong consensus has not been reached.

1.3. Modifiable DTD - XML allows a document to specify markup declarations within the "internal subset" of the DTD. The DTD in use for the processing of a document instance is the combination of declarations in the internal and external subsets. Markup declarations specified in the internal subset take precedence over those in the external subset. The external subset will presumably be specified by a standard; the processing of a document with an internal subset might thus be performed with a DTD that deviates from the standard DTD. The declarations allowed to be present in the internal subset must be carefully restricted so that they cannot affect the semantics of the document. One possibility is to disallow the use of the internal subset entirely.
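
As a rough illustration of the last option, the Python sketch below uses the standard library's expat binding to detect whether a document carries an internal subset, so that an application could refuse it. The function name, the sample documents, and the file name "sdml.dtd" are assumptions made for this example; nothing here is prescribed by SDML or XML.

```python
import xml.parsers.expat


def has_internal_subset(document: str) -> bool:
    """Return True if the document's DOCTYPE contains an internal subset."""
    found = False

    def on_doctype(name, system_id, public_id, has_internal):
        # expat reports a true value when "[ ... ]" appears in the DOCTYPE.
        nonlocal found
        found = bool(has_internal)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(document, True)
    return found


# Hypothetical documents: the second one redefines content via an entity.
standard = '<!DOCTYPE doc SYSTEM "sdml.dtd"><doc>1000</doc>'
modified = '<!DOCTYPE doc [<!ENTITY amt "1000">]><doc>&amt;</doc>'

print(has_internal_subset(standard))
print(has_internal_subset(modified))
```

An application following the "disallow entirely" approach would reject any document for which this check returns True before verifying signatures.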

1.4. Attribute Vs. Element - The XML specification allows for more efficient implementation of some of the SDML requirements. As an example, consider the following: Typically, the data specified for the "blkname" element must be unique within an SDML document. Consequently, a customized parser or application program would have to check this in order to validate the document. In XML, this could be accomplished by defining blkname as an attribute of an element instead of a child element. In this case, the blkname attribute could be defined in the document's DTD as an "ID" attribute, which must be unique. Consequently, an off-the-shelf validating XML parser would check for uniqueness when it validates the document. This difference has a major implication for the FSTC project team. Changing to an ID attribute can seriously affect the signature algorithm and therefore a major portion of the code already written for FSML. Although the project team is willing to change, the benefits of any changes must be balanced against the effort involved.
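
The kind of custom check an application has to perform when blkname is a child element can be sketched in Python as follows. The surrounding structure ("doc" and "block" elements) and the sample values are hypothetical and chosen only to make the example self-contained; the sketch also assumes the document is already well-formed XML.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def duplicate_blknames(xml_text):
    """Return the sorted list of blkname values that occur more than once."""
    root = ET.fromstring(xml_text)
    counts = Counter((el.text or "").strip() for el in root.iter("blkname"))
    return sorted(name for name, n in counts.items() if n > 1)


# Hypothetical document layout, for illustration only.
sample = (
    "<doc>"
    "<block><blkname>sig1</blkname></block>"
    "<block><blkname>cert1</blkname></block>"
    "<block><blkname>sig1</blkname></block>"
    "</doc>"
)

print(duplicate_blknames(sample))
```

With an ID attribute, a validating parser would perform the equivalent check itself, which is the efficiency gain described above.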

2. Transport

2.1. Document Formatting Rules - To ensure safe transportation through legacy systems, SDML requires all information be transported within the body of an e-mail message as printable ASCII text. Since e-mail systems often tamper with the text of a message, document formatting rules were created to safeguard the validity of digital signatures. Unfortunately, these rules create additional issues when it comes to XML compliance.

2.1.1. Character Encoding - To ensure compatibility with legacy e-mail systems, SDML requires that all characters be encoded in a subset of 7-bit ASCII (Tab is not allowed). XML, however, was never designed for such a purpose and requires only that all XML processors be able to read entities in both UTF-8 and UTF-16.

2.1.2. Line Formatting - All lines in an SDML document must be at most 76 characters long, to protect the message from e-mail systems that break up lines longer than 76 characters. SDML also specifies how a line-end sequence must be treated, to further protect against unintended alterations by e-mail systems. XML does not have these requirements and, given the current language definition, has no way of specifying these rules.

Comment: Some members of the project team have indicated that this is not necessarily an issue with XML compliance per se. Rather, this could be considered as an "application convention," to use the SGML terminology. SGML and XML allow the application to specify all manner of rules about the content and storage of the document, and this is one of them.
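
Treated as an application convention, the character and line-length rules of 2.1.1 and 2.1.2 could be enforced by a small check outside the parser. The Python sketch below is one possible reading, under stated assumptions: "printable ASCII" is taken to mean the code range 0x20 to 0x7E (which excludes Tab), and lines are assumed to end in LF or CR LF. The exact SDML subset and line-end treatment are defined by the SDML specification, not by this example.

```python
MAX_LINE_LENGTH = 76  # limit stated in section 2.1.2


def line_problems(line):
    """Return a list of rule violations for one line (without its line end)."""
    problems = []
    if len(line) > MAX_LINE_LENGTH:
        problems.append("too long")
    # Assumption: printable ASCII means 0x20..0x7E, so Tab (0x09) is rejected.
    if any(not (0x20 <= ord(ch) <= 0x7E) for ch in line):
        problems.append("non-printable or non-ASCII character")
    return problems


def check_document(text):
    """Map 1-based line numbers to their violations; empty dict if none."""
    report = {}
    for number, line in enumerate(text.split("\n"), start=1):
        problems = line_problems(line.rstrip("\r"))
        if problems:
            report[number] = problems
    return report


sample = "short line\n" + "x" * 77 + "\n" + "tab\there"
print(check_document(sample))
```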

2.1.3. Space Handling - Besides character encoding and line formatting, the SDML specification also includes a section on space handling. Some of these specifications, however, conflict with XML requirements and need to be addressed.

a) Trailing Space - SDML requires that any spaces between the last non-space character on a line and the line-end sequence for that line be excluded from the hash calculations. Furthermore, the SDML specification dictates that these characters are not to be passed on to applications as part of the field value. An XML processor, however, is expected to pass all characters in a document that are not part of a markup tag through to the application.

Comment: Again, some members of the project feel that this is not necessarily an issue with XML compliance per se. See the Comment for 2.1.2.
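
The trailing-space rule can be sketched as a normalization step applied before hashing. In the Python example below, the choice of SHA-1 and the normalization of CR LF to LF are assumptions made only so the example runs; this document does not state which hash algorithm or line-end treatment SDML prescribes.

```python
import hashlib


def hash_input(text):
    """Drop trailing spaces on every line before the text is hashed."""
    # Assumption: CR LF and LF are treated as the same line end.
    lines = text.replace("\r\n", "\n").split("\n")
    return "\n".join(line.rstrip(" ") for line in lines)


def document_digest(text):
    """Hex digest of the normalized text (SHA-1 chosen for illustration)."""
    return hashlib.sha1(hash_input(text).encode("ascii")).hexdigest()


padded = "amount: 100   \npayee: Bob \n"
clean = "amount: 100\npayee: Bob\n"

# Trailing spaces added in transit do not change the digest.
print(document_digest(padded) == document_digest(clean))
```

Because the normalization runs before hashing, a mail system that pads or trims line ends would not invalidate the signature, whereas a generic XML processor would hand the padded text to the application unchanged.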

b) White Space - In SDML, white space (spaces, tabs, line feeds, and carriage returns) may not be inserted into tags, either between the < and the tag name, or between the tag name and >, or inside the tag name. If a tag has an attribute, white space may not be inserted between the attribute name and the "=", between the "=" and the attribute value, inside the attribute name, or between the last character of the attribute value and the >. For example, this is not valid SDML:

< name id = "x2" >

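XML is more permissive: it allows white space around the "=" and before the closing ">". Like SDML, however, it does not allow white space between the "<" and the tag name, so an XML processor would reject only the leading space in the example above.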
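
The two sets of rules can be compared with a short Python sketch. The regular expression is a hypothetical approximation of the SDML rule for a tag with at most one attribute (it assumes a single space separates the tag name from the attribute name); it is not the grammar from the SDML specification. The XML side uses the standard library parser. Note that a standard XML parser tolerates the spaces around "=" and before ">", but not a space directly after "<".

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical strict pattern: <name> or <name attr="value">, no extra spaces.
STRICT_TAG = re.compile(r'<[A-Za-z][A-Za-z0-9]*( [A-Za-z][A-Za-z0-9]*="[^"<>]*")?>')


def is_strict_tag(tag):
    """True if the start tag contains no optional white space."""
    return STRICT_TAG.fullmatch(tag) is not None


def is_well_formed_xml(text):
    """True if the standard library XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(is_strict_tag('<name id="x2">'))                       # strict form
print(is_strict_tag('<name id = "x2" >'))                    # extra spaces
print(is_well_formed_xml('<name id = "x2" >text</name>'))    # XML accepts
print(is_well_formed_xml('< name id = "x2" >text</name>'))   # XML rejects
```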

2.2. MIME - To circumvent the problem of e-mail systems altering the text of messages, a proposal was made to enclose SDML messages in a MIME (Multipurpose Internet Mail Extensions) content type that preserves binary compatibility. Aside from addressing the issue of e-mail system compatibility, MIME encapsulation in a binary format also allows for easier implementation of confidentiality via encryption. However, at the time SDML was being designed, MIME was not universally accepted. Even today, many legacy e-mail systems still have difficulty handling MIME attachments. As MIME becomes more widely used, the document formatting rules may become obsolete and could be made optional.
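
A minimal Python sketch of such an encapsulation, using the standard library's email package, is shown below. The content type "application/octet-stream", the file name "document.sdml", and the helper names are assumptions for illustration; the proposal described above does not name a specific content type.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def wrap_in_mime(sdml_bytes: bytes) -> bytes:
    """Attach the document as binary content (base64 transfer encoding)."""
    msg = EmailMessage()
    msg["Subject"] = "Signed document"
    msg.set_content("Signed document attached.")
    # Hypothetical content type and file name, for illustration only.
    msg.add_attachment(sdml_bytes, maintype="application",
                       subtype="octet-stream", filename="document.sdml")
    return msg.as_bytes()


def unwrap_from_mime(raw: bytes) -> bytes:
    """Return the bytes of the first attachment in the message."""
    msg = message_from_bytes(raw, policy=policy.default)
    for part in msg.iter_attachments():
        return part.get_content()
    raise ValueError("no attachment found")


# Trailing spaces, a tab, and mixed line ends survive the round trip.
original = b"<name>PP&L</name>   \r\nline two\t\n"
print(unwrap_from_mime(wrap_in_mime(original)) == original)
```

Because the attachment travels base64-encoded, the bytes that were signed are the bytes that arrive, which is why the formatting rules of section 2.1 become less necessary under this approach.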

3. Encoding

3.1. Character Encoding - SDML restricts the character set used in documents to printable ASCII. This restriction was introduced for the same reason as the document formatting rules (see above).

4. Syntax

4.1. Well-Formed - All XML documents must be well-formed. This requirement introduces several issues that must be addressed:

4.1.1. End Tags - XML requires that all tags be balanced: that is, all elements that may contain character data must have both start- and end-tags present (omission is not allowed except for empty elements, see below). The current SDML specification conflicts with this rule by not requiring end tags for certain elements, in order to reduce memory requirements (see Limited Memory Devices).

4.1.2. Empty Tags - XML introduces a special type of element called an "empty" element, which does not contain any character data or child elements. Empty-element tags require special formatting. XML requires that any empty-element tag, e.g., one with no end-tag like HTML's <IMG>, <HR>, and <BR>, must either end with '/>' or be made non-empty by adding a real end-tag. SDML does not address the issue of empty tags. However, a proposal was made not to define any empty elements within SDML documents.
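
The end-tag and empty-tag rules of 4.1.1 and 4.1.2 can be observed with any conforming XML parser. The Python sketch below uses the standard library parser; the element names "doc" and "flag" are hypothetical.

```python
import xml.etree.ElementTree as ET


def parse_or_none(text):
    """Return the parsed root element, or None if the text is not well-formed."""
    try:
        return ET.fromstring(text)
    except ET.ParseError:
        return None


# An omitted end tag is rejected outright.
unbalanced = parse_or_none("<doc><flag></doc>")

# The two legal spellings of an empty element parse to the same structure.
short_form = parse_or_none("<doc><flag/></doc>")
long_form = parse_or_none("<doc><flag></flag></doc>")

print(unbalanced is None)
print(short_form[0].tag == long_form[0].tag)
```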

4.1.3. Special Character Usage for Ampersands (&) - XML allows the ampersand character (&) to appear in its literal form only when used as a markup delimiter, or within a comment, a processing instruction, or a CDATA section. It is also legal within the literal entity value of an internal entity declaration. If the & character is needed elsewhere, it must be escaped using either a numeric character reference or the string &amp;.

SDML, on the other hand, allows ampersand characters (&) to appear within any part of the content. For example, the following code is acceptable to SDML:

<name>PP&L</name>

In XML, the same content must be encoded as

<name>PP&amp;L</name>

This is not a major concern for the FSTC project team and the specification can be modified accordingly if required.
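
The conversion is mechanical, as the following Python sketch suggests. It uses the standard library's escape helper; the function name to_xml_element is an assumption for this example and is not part of SDML.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def to_xml_element(tag, content):
    """Build an element whose content has &, < and > escaped for XML."""
    return f"<{tag}>{escape(content)}</{tag}>"


encoded = to_xml_element("name", "PP&L")
print(encoded)

# An XML parser restores the literal ampersand for the application.
print(ET.fromstring(encoded).text)
```

The unescaped SDML form, by contrast, is rejected by an XML parser as not well-formed.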

4.2. XML Identifier - XML requires that all documents start with an XML identifier. This must be added to the SDML specification if SDML is to become XML-compliant. Adding such a requirement to SDML would mean changes to code written for FSML. However, such changes can be made without too much difficulty. Along with the XML identifier, the document type declaration "<!DOCTYPE ...>" must also be present, or the behavior of a conforming system when it is absent must be specified.


Glossary

Application - Rules for element definitions, syntax, semantics, and processing of the content of the document.

Attribute - A source of additional information about an element. Attribute values may be fixed in the DTD or listed as name-value pairs (name = "value") in the start-tag of an element.

Character Data (CDATA) - Information in an XML document that should not be parsed at all. This allows the use of the markup characters &, <, and > within the text, even though no elements or entities may appear in the section. CDATA declarations may appear in XML attributes, and CDATA marked sections may appear in documents.

DTD - Document Type Definition. A set of rules governing the element types that are allowed within an XML document and rules specifying the allowed content and attributes of each element type. The DTD also declares all the external entities referenced within the document and the notations that can be used. A DTD cannot define what the tags mean or how they are processed, and it cannot redefine many "global" XML rules for handling documents, e.g., comments.

Element - A logical unit of information within an XML or SDML document.

Empty Element - An element that has no textual content.

FSML - Financial Services Markup Language. FSML is an SGML-like markup language designed to allow the creation of electronic financial documents. FSML was developed by the FSTC (Financial Services Technology Consortium) Electronic Check Project.

MIME - Multipurpose Internet Mail Extensions. MIME is a set of specifications that support the structuring of the message body in terms of body parts. Body parts can be of various types, such as text, image, audio, or complete encapsulated messages. It also provides for the encoding of messages in character sets other than 7-bit ASCII.

Parser - A program that converts a serial stream of markup (an XML or SDML file, for example) into an output structure accessible by another higher-level program. XML parsers may perform validation or check to see if a markup is well-formed as they process it.

SDML - Signed Document Markup Language. SDML is an SGML-derived markup language designed to allow the creation of digitally signed electronic documents. It is a generalized version of FSML.

SGML - Standard Generalized Markup Language. An International Standard (ISO 8879:1986) that describes a generalized markup scheme for representing the logical structure of documents in a system-independent and platform-independent manner.

Unicode - A standard for international character encoding. Unicode supports characters that are 2 bytes wide rather than the 1 byte currently supported by most systems, allowing it to include 65,536 characters rather than the 256 available to 1-byte systems.

Well-Formed - A well-formed document may or may not have a DTD. Well-formed documents must begin with an XML declaration and contain properly nested and marked-up elements.

XML - Extensible Markup Language. A profile, or simplified subset, of SGML. Supports generalized markup on the World Wide Web.