Building XML Parsers for Microsoft's IE4
Jean Paoli, David Schach, Chris Lovett, Andrew Layman, Istvan Cseri
Microsoft cofounded the XML working group at the W3C in July 1996 and actively participated in the definition of the standard. This article describes why Microsoft implemented its first XML application and how it led to the development of two XML parsers shipping in Internet Explorer 4.0, one written in C++ and the other in Java. We describe the importance of designing an object model API and our vision of XML as a universal, open data format for the Internet.
Our First Application: Active Channels for Internet Explorer 4.0
Conventional Web use waits for a user to request a page before sending it. That is known as the "pull" mode. A powerful alternative exists, however, called "push" or "webcasting," in which pages are sent to a user in advance, based on automatic matching of pages to the user's interests. Webcasting provides each user with automatic delivery and offline access to the information and Web sites that he uses most often.
To bring this idea to reality, in February 1997 the Internet Explorer team needed a standard way of describing sites and pages. The first broadly popular form of Web "metadata" (so called because it is data that describes other data) is the Channel Definition Format, or CDF. This allows a Web site to post a description of itself in a standard form. Having done so, it is no longer just a site; it is also an "Active Channel."
A channel is a set of related Web pages. Channel Definition Format files include the following characteristics:
- A minimal CDF file contains a list of URLs pointing to the pages that make up the content of the channel.
- A more advanced CDF file can include title and abstract information describing individual items, a schedule for updates, and a hierarchical organization of the channel's offerings.
- A CDF file must be easy to create and not require changes to existing HTML pages.
In looking for a suitable technology on which to build channels, the Internet Explorer team found that XML and Active Channels are a perfect fit. XML is excellent for metadata, since many of its rules are similar to the widely known HTML language rules; yet it has more facilities for structure and extensibility. This gave the IE team the assurance that parsers would be easy to implement and the format would be broadly usable.
CDF is an application of XML that deals with the particulars of Web metadata. CDF consists of a vocabulary of terms that are related to Web sites and their Active Channel content. Technically, the terms are used as "Elements" and "attributes," and CDF defines how they can be used together to expand a Web site into a webcasting channel (see Example 1).
<?XML version="1.0" RMD="NONE" ?>
<!DOCTYPE Channel SYSTEM
<TITLE>Internet Explorer News</TITLE>
<ABSTRACT> The latest news on Internet Explorer. </ABSTRACT>
<TITLE> Latest support for CDF </TITLE>
A Universal, Open Data Format for the Internet
At the same time as the metadata CDF work was proceeding, members of the Internet Explorer team and others in Microsoft started to understand the broad need for a universal, open data format for the Internet. The opportunities are very exciting.
Although visual and user interface standards are a necessary layer, they are not sufficient: on their own they make the Internet little more than an access medium for text and pictures. There are no standards for intelligent search, data exchange, adaptive presentation, and personalization. The Internet must go beyond setting an information access and display standard; it must set an information understanding standard--a standard way of representing data so that software can better search, move, display, and otherwise manipulate information currently hidden in contextual obscurity. HTML cannot fulfill these needs because it is a format that describes how a Web page should look, rather than one that represents data. For example:
- HTML does not provide a standard way for a doctor to send a prescription to a pharmacist.
- HTML does not enable a medical laboratory to publish statistical information in a format that any receiver can analyze.
- HTML does not describe an electronic payment in a form that any recipient can decode and process.
- HTML does not provide a standard way to search legal libraries to find, for example, all litigation documents about a certain topic.
- HTML does not specify how information in a company catalog can be transmitted, such that a salesman can work offline, show the catalog to clients, take orders, then upload those orders in a standard format.
In short, while HTML provides rich facilities for display, it does not provide any standards-based way to manage information as data.
A standard for data representation will expand the Internet in much
the same way that the HTML standard did for display a few years
ago. The data standard will be the vehicle for business transactions,
publication of personal preference profiles, automated collaboration,
and database sharing. Payments, medical histories, pharmaceutical
research data, semiconductor part sheets, and purchase orders will
all be written in this format. It will open up a wide variety of new
uses, all based on a standard representation for moving structured
data around the Web as easily as we move HTML pages today. That data
standard is XML.
XML: A Standard Format for Data
XML provides a data standard that can encode the content, semantics, and schemata for a range of cases, from simple to complex. XML can encode the representation for the following:
- An ordinary document
- A structured record, such as an appointment record or purchase order
- An object with data and methods (for example, the persistent form of a Java object or ActiveX control)
- A data record, such as the result set of a query
- Meta-content about a Web site (such as CDF)
- Graphical presentation (such as an application's user interface)
- Standard schema entities and types
- All the links between information and people on the Web
Benefits of XML
As a universal standard for the expression of data, XML offers many advantages to organizations, software developers, Web sites, and ultimately to end-users.
For software developers building Web applications and line-of-business Intranet software, XML provides a powerful, flexible format for expressing data--whether as a wire format for sending data between client and server, a transfer format for sharing data between applications, or a persistent storage format on disk. Because structured data in XML can include a self-describing schema, XML promises interoperability between applications that manipulate structured data independent of the underlying semantics.
For example, because XML enables publishers to supplement their Web sites with metadata such as CDF, users can receive "pushed" content as structured channels. XML can also provide a means for embedding arbitrary data and annotations within HTML, extending the possibilities for Web-based applications based on HTML and scripts.
For end-users, XML promises to provide a much richer set of Web applications for browsing, communication, and collaboration. The growing use of XML will improve Web-browsing applications for viewing, filtering, and manipulating information on the Internet.
As collaboration on the Web spreads to more businesses, customer services will eventually migrate from phone lines and storefronts to Web sites. The majority of these Intranet and Internet business applications will involve manipulation or transfer of data and database records, such as purchase orders, invoices, customer information, appointments, maps, and so forth. XML promises a revolution in the richness of end-user possibilities on the Web because it enables such a wide array of business applications to be implemented on the Internet.
Microsoft XML Parsers
Our long-term goal for XML is that it function as a data format that anyone can use to build a range of Web applications. To achieve this goal, we decided to write an XML parser and make it freely available. The result of these efforts was two XML parsers--one in C++ and the other in Java--both of which are included as part of Microsoft Internet Explorer 4.0. The parsers were written in parallel, but with somewhat different design goals.
The Microsoft XML parser in C++ (MSXML in C++) was written to perform as an integral part of Internet Explorer 4.0. Consequently, its design was oriented toward the following:
- Fast parsing speed
- Low memory usage
- Asynchronous parsing during download
- Strong international support
In other words, this is a performance parser. Much effort was spent on wringing the most efficiency from the code, and all non-essential features were eliminated. For example, MSXML in C++ is a non-validating parser.
In contrast to the XML parser in C++, the goals of the Microsoft XML parser in Java (MSXML in Java) included the following:
- To be a reference implementation
- To be a full validating parser
- To be cross-platform
- To promote widespread acceptance of the XML standard
- To experiment with leading edge XML standards efforts, like DOM and namespaces
For these reasons, the Java parser is fully validating and implements the latest proposed features (such as namespaces), and its source code is freely available.
With some minor exceptions (such as no current support for conditional sections), Microsoft's XML parsers completely implement the W3C Working Draft of the XML specification dated June 30, 1997.[a]
MSXML in Java shipped in the spring of 1997 and is available from http://www.microsoft.com/standards/xml/xmlparse.htm. Both MSXML in C++ and MSXML in Java are shipping with IE 4.0.
Once parsed, an XML document is manipulated through an object model (or API). To really help make XML the standard format for data over the Web, we felt that a standard object model was crucial: one that was simple, scriptable, minimal, and consistent with the work of the Document Object Model (DOM) Working Group.[b] We are currently working with the W3C to standardize the XML object model. The object model is language neutral, which means it is equally accessible from all programming languages. To keep the object model independent of the parsers, it was designed prior to implementing them. The idea was to completely separate the parser implementation from the XML data structures. Having the parser use the object model ensured that problems with the object model would be flushed out during development.
The object model is very simple. It models the XML document as a tree structure using only three classes of objects:
- A Document
- An Element
- A Collection
The Document object represents an entire XML document. This object holds the Element tree and document information such as the document type, version and character encoding. The Element object is used for representing the nodes in the tree, and the Collection object is used to represent the child Elements of a given node.
All XML data is stored in a tree of Element objects. Container Elements are non-leaf nodes. Empty Elements, text, comments, and processing instructions are stored as leaf nodes in the tree. An Element's type is revealed by the type property. Currently, the following types are returned:
- For container and empty XML Elements
- For PCDATA and CDATA
- For comments
- For processing instructions
We considered using a different object in the object model for each of these types rather than a single object with a type property, but decided that multiple objects complicated the object model. This was particularly the case when navigating the Element tree and for untyped
The other important properties of the Element object are:
- The name (or GI) for objects of type ELEMENT (otherwise an empty string)
- The parent Element of this object in the tree
- The text for objects of type TEXT or COMMENT (otherwise an empty string)
- A collection of the objects contained by this object. This collection is empty for all types other than ELEMENT
Finally, the Element class provides a basic set of methods for setting, getting, and removing attribute values as well as adding and removing child Elements.
Element collections are used to walk the XML tree. An Element collection has one property, the length, which is the number of Elements in the collection. Child Elements are fetched via the item method, which returns an Element either by index or by name. When more than one Element has the same name, the item method returns a new collection containing all of the child Elements with that name.
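The Element and Collection behavior described above can be sketched in a few lines of script. This is only an illustration, not Microsoft's implementation: the class names XmlElement and XmlCollection, the property name tagName, and the methods addChild and removeChild are hypothetical stand-ins chosen for this sketch (only type, children, length, item, and getAttribute are named in this article).

```javascript
// Hypothetical type constants; the article names ELEMENT, TEXT and COMMENT.
const ELEMENT = "ELEMENT", TEXT = "TEXT", COMMENT = "COMMENT";

// Hypothetical collection: one property (length) and one method (item).
class XmlCollection {
  constructor(elements) { this.elements = elements; }
  get length() { return this.elements.length; }
  // item(0) fetches by index; item("ITEM") fetches by name.
  item(key) {
    if (typeof key === "number") {
      return key >= 0 && key < this.elements.length ? this.elements[key] : null;
    }
    const matches = this.elements.filter((e) => e.tagName === key);
    if (matches.length === 0) return null;
    // Several children with the same name come back as a new collection.
    return matches.length === 1 ? matches[0] : new XmlCollection(matches);
  }
}

// Hypothetical single node class with a type property.
class XmlElement {
  constructor(type, tagName = "", text = "") {
    this.type = type;
    this.tagName = type === ELEMENT ? tagName : ""; // empty unless ELEMENT
    this.text = type === ELEMENT ? "" : text;       // empty for ELEMENT
    this.parent = null;
    this.attributes = new Map();
    this.childList = [];
  }
  get children() { return new XmlCollection(this.childList); }
  setAttribute(name, value) { this.attributes.set(name, value); }
  getAttribute(name) {
    return this.attributes.has(name) ? this.attributes.get(name) : null;
  }
  removeAttribute(name) { this.attributes.delete(name); }
  addChild(child) {
    child.parent = this;
    this.childList.push(child);
    return child;
  }
  removeChild(child) {
    const i = this.childList.indexOf(child);
    if (i >= 0) { this.childList.splice(i, 1); child.parent = null; }
  }
}

// Build a small CDF-like tree by hand and walk it.
const channel = new XmlElement(ELEMENT, "CHANNEL");
const schedule = channel.addChild(new XmlElement(ELEMENT, "SCHEDULE"));
const interval = schedule.addChild(new XmlElement(ELEMENT, "INTERVALTIME"));
interval.setAttribute("HOUR", "1");
channel.addChild(new XmlElement(ELEMENT, "ITEM"));
channel.addChild(new XmlElement(ELEMENT, "ITEM"));
const hour = channel.children.item("SCHEDULE")
  .children.item("INTERVALTIME").getAttribute("HOUR");
```

Keeping a single node class with a type property means the walking code never has to test which class it is holding, which is the trade-off the text describes.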
Using the C++ parser from script, you can load an XML document as follows:
var myXMLDoc = new ActiveXObject("msxml");
myXMLDoc.URL = "http://www.somecompany.com/somedata.xml";
Using the Java parser and the XML DSO applet that is shipping with IE 4, you can load an XML document as follows:
<APPLET class=com.ms.xml.dso.XMLDSO.class width=0 height=0 id=xmldso>
<PARAM NAME=URL VALUE="http://www.somecompany.com/somedata.xml">
</APPLET>
Then you can access the Document object via script as follows:
var doc = xmldso.getDocument();
While the object model is minimal, it is functionally complete. We expect that it will evolve over time.
Simplicity of design
The Microsoft XML parsers are simple. This is by design. They are implemented as hand-coded, recursive-descent parsers. This has a couple of benefits:
- First, the minimal syntax of XML makes a parser generator unnecessary: a hand-coded parser works just fine.
- Second, recursive-descent parsers are both easy to write and easy to understand.
This latter point is especially important since the source code for MSXML in Java is available to the public on the Microsoft Web site. We want it to be a reference implementation that can be understood by any Java programmer. (Another reason parser generating tools are not used is that the language has many lexical Elements that are unlimited in length; we do not want to test a parser generator's buffer size limits.)
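To give a feel for how little machinery a hand-coded recursive-descent parser needs, here is a minimal sketch for a small XML subset (nested elements, double-quoted attributes, and text). It is not the MSXML code: the function name parseXmlSubset and the shape of the returned nodes are assumptions made for this illustration, and comments, processing instructions, entities, and DTDs are deliberately left out.

```javascript
// Hypothetical recursive-descent parser for a tiny XML subset.
function parseXmlSubset(input) {
  let pos = 0;

  function fail(message) {
    throw new Error(message + " at offset " + pos);
  }

  function skipWhitespace() {
    while (pos < input.length && /\s/.test(input[pos])) pos++;
  }

  // Names are scanned in place, so their length is not limited by any buffer.
  function parseName() {
    const start = pos;
    while (pos < input.length && /[A-Za-z0-9_.:-]/.test(input[pos])) pos++;
    if (pos === start) fail("Expected a name");
    return input.slice(start, pos);
  }

  // attributes := (Name '=' '"' value '"')*
  function parseAttributes() {
    const attributes = {};
    for (;;) {
      skipWhitespace();
      if (pos >= input.length || input[pos] === ">" || input[pos] === "/") {
        return attributes;
      }
      const name = parseName();
      skipWhitespace();
      if (input[pos] !== "=") fail("Expected '='");
      pos++;
      skipWhitespace();
      if (input[pos] !== '"') fail("Expected a double quote");
      const end = input.indexOf('"', pos + 1);
      if (end < 0) fail("Unterminated attribute value");
      attributes[name] = input.slice(pos + 1, end);
      pos = end + 1;
    }
  }

  // element := '<' Name attributes ('/>' | '>' content '</' Name '>')
  function parseElement() {
    if (input[pos] !== "<") fail("Expected '<'");
    pos++;
    const name = parseName();
    const attributes = parseAttributes();
    const node = { name, attributes, children: [] };
    if (input.startsWith("/>", pos)) { pos += 2; return node; }
    if (input[pos] !== ">") fail("Expected '>'");
    pos++;
    parseContent(node);
    pos += 2; // parseContent only returns when positioned at "</"
    const closing = parseName();
    if (closing !== name) fail("Mismatched end tag </" + closing + ">");
    skipWhitespace();
    if (input[pos] !== ">") fail("Expected '>'");
    pos++;
    return node;
  }

  // content := (text | element)*
  function parseContent(parent) {
    while (pos < input.length) {
      if (input.startsWith("</", pos)) return;
      if (input[pos] === "<") {
        parent.children.push(parseElement());
      } else {
        const next = input.indexOf("<", pos);
        const end = next < 0 ? input.length : next;
        const text = input.slice(pos, end);
        if (text.trim() !== "") parent.children.push({ text });
        pos = end;
      }
    }
    fail("Unexpected end of input");
  }

  skipWhitespace();
  const root = parseElement();
  skipWhitespace();
  if (pos !== input.length) fail("Unexpected content after root element");
  return root;
}
```

Each grammar rule maps to one function, and nesting in the document maps to recursion between parseElement and parseContent, which is why this style is easy to read alongside the specification.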
Although XML parsers are required only to read UTF-8 and UCS-2 encodings, Microsoft's XML parsers handle many more encodings, such as shift-jis, euc-jp, and big5. In fact, the C++ parser supports the same set of character encodings as IE 4.0, and the Java parser supports all the encodings supported by the Java VM. The recursive-descent parsers are isolated from these different encodings by input readers that convert everything to Unicode. While this increases memory usage for European languages, it simplifies string processing overall.
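The idea of an input reader that hides the source encoding from the parser can be sketched as follows. This is an assumption-laden illustration rather than the MSXML design: the function name readAsUnicode is hypothetical, it relies on the standard TextDecoder API, and which encoding labels are accepted depends on the runtime.

```javascript
// Hypothetical input reader: turns raw bytes into a JavaScript (Unicode)
// string so that the parser proper never sees the original encoding.
function readAsUnicode(bytes, declaredEncoding) {
  let encoding = declaredEncoding || "utf-8";
  // A byte order mark, when present, overrides the declared encoding.
  if (bytes.length >= 2 && bytes[0] === 0xff && bytes[1] === 0xfe) {
    encoding = "utf-16le";
  } else if (bytes.length >= 2 && bytes[0] === 0xfe && bytes[1] === 0xff) {
    encoding = "utf-16be";
  }
  // TextDecoder throws a RangeError for labels the runtime does not support.
  return new TextDecoder(encoding).decode(bytes);
}

// The same markup arrives in two encodings and leaves as ordinary strings.
const fromUtf8 = readAsUnicode(new TextEncoder().encode("<A/>"));
const fromUtf16 = readAsUnicode(
  new Uint8Array([0xff, 0xfe, 0x3c, 0x00, 0x41, 0x00, 0x2f, 0x00, 0x3e, 0x00])
);
```

Because JavaScript strings are sequences of 16-bit code units, the sketch shows the same trade-off the text mentions: single-byte input takes more memory once decoded, but all later string handling deals with one representation.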
Storage of Element and Attribute names
Because Element and Attribute names tend to repeat, they are stored as atoms so that only one copy of each string is stored. This also speeds up string comparisons because atom objects can be compared for equality very quickly, without comparing the characters in the strings. This technique amortizes some of the cost of checking for NameChar characters and converting Unicode characters to uppercase.
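A minimal sketch of the atom technique, assuming (as the mention of uppercase conversion suggests) that names are folded to uppercase. The class name AtomTable and its fields are hypothetical. A spelling that has been seen before is resolved without repeating the case conversion, which is where the amortization comes from.

```javascript
// Hypothetical atom table: one shared, immutable object per distinct name.
class AtomTable {
  constructor() {
    this.bySpelling = new Map();  // exact spelling seen in the input -> atom
    this.byCanonical = new Map(); // uppercased name -> atom
  }

  intern(spelling) {
    let atom = this.bySpelling.get(spelling);
    if (atom !== undefined) return atom; // fast path: no case folding needed
    const canonical = spelling.toUpperCase();
    atom = this.byCanonical.get(canonical);
    if (atom === undefined) {
      atom = Object.freeze({ name: canonical });
      this.byCanonical.set(canonical, atom);
    }
    this.bySpelling.set(spelling, atom);
    return atom;
  }
}

const atoms = new AtomTable();
const title1 = atoms.intern("TITLE");
const title2 = atoms.intern("title");
const abstractAtom = atoms.intern("ABSTRACT");
// Equality of names is now a reference comparison: title1 === title2.
```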
Object model implementation
The Java parser builds the Element tree using the object model. When it creates new Elements it uses an Element class factory that is passed in by the creator of the parser. The parsers come with a default object model implementation that is fully functional; however, clients with special needs can write their own class factory that creates custom objects. This helps programs that want to use XML but still need to process legacy data structures.
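The class-factory arrangement can be illustrated with a small sketch. Everything here is hypothetical (the names defaultFactory, buildTree, CatalogItem, and catalogFactory, and the pre-tokenized input that stands in for a real parser); the point is only that the tree builder asks the factory for every node and never names a concrete node class.

```javascript
// Default factory: plain generic nodes.
const defaultFactory = {
  createElement(name) {
    return { name, children: [], text: "" };
  },
};

// Hypothetical stand-in for the parser. It consumes already-tokenized input
// and obtains every node from the factory it was given.
function buildTree(tokens, factory = defaultFactory) {
  const root = factory.createElement("#document");
  const stack = [root];
  for (const token of tokens) {
    const current = stack[stack.length - 1];
    if (token.open) {
      const node = factory.createElement(token.open);
      current.children.push(node);
      stack.push(node);
    } else if (token.close) {
      stack.pop();
    } else if (token.text !== undefined) {
      current.text += token.text;
    }
  }
  return root;
}

// A client-supplied factory that maps ITEM elements onto an existing
// application class instead of a generic node.
class CatalogItem {
  constructor() {
    this.name = "ITEM";
    this.children = [];
    this.text = "";
  }
  describe() {
    return "catalog item: " + this.text;
  }
}

const catalogFactory = {
  createElement(name) {
    return name === "ITEM" ? new CatalogItem() : defaultFactory.createElement(name);
  },
};

const tokens = [
  { open: "CHANNEL" },
  { open: "ITEM" },
  { text: "Widget" },
  { close: "ITEM" },
  { close: "CHANNEL" },
];
const customTree = buildTree(tokens, catalogFactory);
const genericTree = buildTree(tokens);
```

With the custom factory, customTree.children[0].children[0] is a CatalogItem that the rest of an application can use directly; with the default factory the same input yields plain nodes.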
Although the Java parser does not parse asynchronously, it could be run on a separate thread. The C++ parser parses asynchronously by running on a fiber. The object model was designed so that asynchronous parsing can be implemented transparently to the programmer. Because all properties and methods are function calls, the object model can block the caller when attempting to access a node in the tree that isn't completely downloaded.
Entities and other language features
The Java parser also implements DTD validation, full Entity handling, and the namespace proposal. We found that DTD validation was relatively easy. The XML spec was clear and pointers to algorithms for implementing validation were helpful, but we found that supporting validation does seem to impact the overall performance of the parser.
Namespaces were relatively simple since we already had an atomized Name object in the Java parser to represent all tag and attribute names in the document. We simply added a namespace field to these Name objects and support for parsing the namespace declarations, and we were done.
The parsers are small and fast. MSXML in C++ with full international character support is less than 100K and the MSXML Java Parser is 127K.
Using the Object Model to Process XML Data
To illustrate how the Object Model can be used to do interesting things, we will show you a small example based on the CDF data we saw earlier in Example 1. Example 2 shows how to walk the XML Object Model to find out the INTERVALTIME of the scheduled event.
// Fetch the CDF file and extract the INTERVALTIME element
function GetInterval() {
  var doc = new ActiveXObject("msxml");
  doc.URL = Resolve("cdf.xml");
  // First extract the SCHEDULE node
  var s = doc.root.children.item("SCHEDULE");
  // Then the INTERVALTIME
  var t = s.children.item("INTERVALTIME");
  // Then the HOUR attribute
  var h = t.getAttribute("HOUR");
  // Display this with an appropriate message in a popup window
  var w = window.open("", "NextShow");
  w.document.write("<h2>The next show is in " + h + " hours !</h2>");
}
// This is a useful little function that I use to resolve a URL relative to the
// current document location
function Resolve(relurl) {
  var url = document.location.toString();
  var base = url.substring(0, url.lastIndexOf("/"));
  var href = base + "/" + relurl;
  return href;
}
// A button that invokes the above scripts
<input type=button value="When ?" onclick="GetInterval()">
Notice that the GetInterval() method uses a small fixed set of objects and methods to manipulate the XML data that is independent of display-oriented things like HTML. As long as the CDF DTD (or schema) stays relatively fixed, this script code will work on any CDF file. In other words, this is robust enough to build Web-based business applications.
When we chose XML to encode CDF files, we were a little anxious. XML had only just been created, and even though Microsoft co-created the W3C XML Working Group in July 1996, it was as new to us as to anyone else. In addition, launching "channels," the first broad, public application of metadata, on an untried standard was risky. A few months later (as of this writing in August 1997), we know that we made the right choice.
The flexibility and ease of use of a text format for representing and exchanging structured information have been demonstrated. CDF is now widely used by the industry's leading content providers, Web and Java authoring tool vendors, and "push" developers (such as PointCast, AirMedia, and BackWeb). Multiple tools have been developed to produce CDF files. Because it is a simple text-based format, tools are easily developed to generate and process it. XML helped make CDF successful.
Now a set of XML enabling technologies, including C++ and Java parsers with their Object Models, are shipping in Internet Explorer 4.0. Because IE 4.0 will be integrated into Windows 98, there will be an XML parser on each desktop--another step toward the vision of making structured data an integral part of the Web.
At Microsoft, we strongly believe that XML is the standard, extensible, universal data format for the Internet. It is simple and easily authored. It is based on international standards that have been tested for many years. It is enormously extensible. It is flexible enough to allow representation of an incredibly wide range of information, and it also allows this information to be self-describing, so that structured data expressed in XML may be manipulated by software that doesn't have previous knowledge of the underlying meaning behind the data. XML provides a file format for representing data and can be extended to contain a description of its own structure. It is a means of formatting data and also a mechanism for extending and annotating standard HTML.
With its powerful expressiveness and flexibility, XML promises to add structure to data on the Internet, bringing the Web one step closer to realizing the potential for universal communication with anyone, anywhere.
About the Authors
Jean Paoli is a Product Manager on the Internet Explorer 4.0 team, where he manages the XML and databinding effort. Prior to joining Microsoft in May 1996, he was the technical director of GRIF S.A., a leader in the creation of SGML authoring tools. Jean has a strong background in SGML and has designed many systems for major corporations in which SGML's approach to structuring and storing information ensured the long life and easy exchangeability of the data. Jean is a co-editor of the XML standard and co-created the W3C XML working group with Jon Bosak (and others) in July 1996.
- Jean Paoli
- 1 Microsoft Way
- Redmond, WA 98052-6399
Andrew Layman is a Senior Program Manager at Microsoft where he works on Internet and database technologies. Prior to joining Microsoft in 1992, he was a Vice President of Symantec Corporation and original author of the Time Line project management program.
- Andrew Layman
- 1 Microsoft Way
- Redmond, WA 98052-6399
Istvan Cseri is the technical architect of the XML project at Microsoft. Istvan designed the Java XML parser and is one of the co-authors of the Proposal for Extensible Style Language (XSL), which was recently submitted to the W3C. Istvan has a strong background in object-oriented frameworks and user interfaces. Prior to joining Microsoft, Istvan was at Borland, where he was one of the designers and developers of Quattro Pro for Windows.
- Istvan Cseri
- 1 Microsoft Way
- Redmond, WA 98052-6399
Chris Lovett is one of the developer leads on the XML project at Microsoft. He has been working mostly on the Java XML parser reference release. He joined Microsoft in May of this year from a Silicon Valley startup company where he was working on CD-ROM-quality multimedia delivery over the Web. Chris has a strong background in networking, communications, and user interface work from his former work at Taligent and IBM's Santa Teresa Labs.
- Chris Lovett
- 1 Microsoft Way
- Redmond, WA 98052-6399
David Schach is a developer lead on XML in the Internet Explorer Group. He collaborated on the XML Object Model design and wrote the Microsoft XML Parser in C++. Currently, he is working on using XML as a style sheet language and is a co-author of the Proposal for Extensible Style Language (XSL), which was recently submitted to the W3C. He has a master's degree in computer science from the University of Pennsylvania and joined Microsoft in 1994.
- David Schach
- 1 Microsoft Way
- Redmond, WA 98052-6399
[a] The latest version of this draft was in fact dated August 7, 1997, and is published as the "Extensible Markup Language (XML)" specification in the "W3C Reports" section of this issue.
[b] The "Document Object Model (DOM)" specification is in this issue's "W3C Reports" section.
W3J copyright © 1997 O'Reilly & Associates