[This local archive copy is from the official and canonical URL, http://www.asis.org/Bulletin/Oct-98/shobowal.html, 1999-02-05; please refer to the canonical source document if possible.]

Special Section

SGML, XML and the Document-Centered Approach to Electronic Medical Records

by Gloria Shobowale

The health care industry has been slow to adopt new technologies for the exchange of medical information. A Healthcare Quality Commission report shows that health care lags far behind other information-intensive industries in information technology investment. Traditionally, the provider community has used computerized systems for administration and financial management. Most clinical information is still stored in paper records and even financial information sent to payers is usually paper-based, not electronic.

Things are beginning to change, however. Government regulation will define how data are exchanged electronically between provider and payer. As in many industries, the Internet and Web technology are seen as enablers for the exchange of information. There has been significant growth in the application of telemedicine, which is defined by the American Telemedicine Association as “the use of medical information exchanged from one site to another via electronic communications for the health and education of the patient or health care provider and for the purpose of improving care.” A survey by the Association of Telemedicine Service Providers found that telemedicine consultations over two-way video links increased 300% in 1996 over the previous year. In 1997, according to the Center for Telemedicine Law, close to 200 bills related to telemedicine were introduced in state legislatures. Also, the National Library of Medicine is sponsoring telemedicine initiatives across the country, which is sure to encourage further growth.

Telemedicine requires robust telecommunication lines with sufficient bandwidth and sophisticated desktop systems or clinical workstations that can perform multi-tasking operations, but central to the success of telemedicine is the electronic medical record. Clinical information about a patient must be in electronic format to support the increasing need for electronic exchange of information. Not only will an electronic medical record facilitate the exchange of information, it is also seen as the foundation for other computerized applications that will improve the quality of patient care. Customized advice to patients can be given when patient records systems are integrated with point-of-care decision support systems that pull in electronic medical information from the more than two million journal articles available each year to medical professionals.

Clearly, the benefits of telemedicine -- improved access to health care, enhanced quality of care and cost control -- make it worthwhile to pursue and advance the exchange of electronic health care information. Issues such as start-up costs, legal roadblocks, privacy concerns and resistance to change must be addressed and resolved. In addition, adoption of standards is critical for expansion of telemedicine initiatives.

Standards: A Requirement for Interoperability

A multitude of standards activities covers the array of issues in health care. Groups are looking at standard coding structures, standard vocabularies and standard message formats. Work is underway on standards for patient identifiers, provider identifiers, employer identifiers and payer identifiers. To address privacy and security issues, several groups are looking at data encryption and digital signature standards. Table 1 shows the key organizations involved in developing standards for health care data interchange.

The U.S. Department of Health and Human Services (HHS) is driving the development of standards for administrative simplification, an aspect of the Health Insurance Portability and Accountability Act (HIPAA). The National Committee on Vital and Health Statistics (NCVHS) will be recommending standards for the exchange of medical information and will include existing standards when available. Health Level Seven’s focus in the past has been on medical informatics and messaging standards, but this is evolving to encompass other types of standards such as those related to object technology, image transfer and document markup. The Insurance Committee (N) within X12 has business transactions as its focus. HHS has adopted EDI standards from X12N for the exchange of data between providers and payers and will be determining the format for identifiers that all players in the health care arena will be required to use.

Much of the standards development activity to date has focused on standards for exchange of administrative and financial data. Clinical information exchange is not as straightforward. Current standards activity must address not only how to exchange coded or structured data, but also how to exchange full-text or narrative information and images. Certain organizations, such as the Institute of Electrical and Electronics Engineers (IEEE), are focused on telecommunication protocols to support exchange of data. The Object Management Group (OMG) is developing a Common Object Request Broker Architecture (CORBA) – middleware – specifically for the medical domain. ASTM is focusing on security issues and the format for medical records. The Digital Imaging and Communications in Medicine (DICOM) standard supports the exchange of clinical data, not just radiological images. Some of these activities are software solutions and some are highly structured definitions for documenting and transmitting information. Within the Health Level Seven (HL7) organization, the SGML Special Interest Group is looking at how the Standard Generalized Markup Language, an ISO standard, may be used to facilitate exchange of clinical information.

Document-Centered Approach

There are two schools of thought related to the format of an electronic medical record. The informatics community advocates the use of coding and translation of a physician’s notes into controlled vocabulary. On the other hand, the document-centered approach values the capture of full-text information in the medical record, with narrative that retains the physician’s own words as originally written. Standards that define controlled vocabulary or machine-readable formats make it easy for computers to talk to one another and manipulate data. However, when narrative is translated into controlled vocabulary it loses its richness. Maintaining the original text of a physician’s notes preserves the total record for future use.

Unlike other standards, the Standard Generalized Markup Language (SGML) supports the document-centered approach to medical records. It is software and platform independent and, as an international standard, has global reach. SGML is a mechanism for defining the structure of a document and the meaning of its components. As a “meta-language,” SGML provides the methodology for encoding text and specifies the markup to be used through a Document Type Definition (DTD). Once the encoding rules for a document type have been defined, a parser can be used to process a document and to ensure all required components are included and in the correct order.
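To make the parser's role concrete, the sketch below checks that a small marked-up note contains its required components in the required order. It is an illustration only: the element names (note, patient, date, narrative) are hypothetical and are not taken from any HL7 or Kona definition, and a hand-written Python check on an XML-style document stands in for a real validating SGML parser driven by a DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical content model: each of these children must appear once, in this order.
REQUIRED_ORDER = ["patient", "date", "narrative"]

def check_structure(document):
    """Return a list of problems; an empty list means the structure is acceptable."""
    root = ET.fromstring(document)
    children = [child.tag for child in root]
    problems = []
    for name in REQUIRED_ORDER:
        if name not in children:
            problems.append(f"missing required component: {name}")
    # Compare the required components actually present with their expected sequence.
    present = [tag for tag in children if tag in REQUIRED_ORDER]
    expected = [name for name in REQUIRED_ORDER if name in present]
    if present != expected:
        problems.append("components are repeated or out of order")
    return problems

GOOD = ("<note><patient>Jane Doe</patient><date>1998-10-01</date>"
        "<narrative>Patient reports mild cough.</narrative></note>")
BAD = "<note><date>1998-10-01</date><patient>Jane Doe</patient></note>"

print(check_structure(GOOD))
print(check_structure(BAD))
```

The first document passes with no problems reported; the second is flagged both for the missing narrative and for placing the date before the patient.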

Document Type Definitions are usually developed within an industry, and work is underway within the HL7 SGML Special Interest Group to develop definitions for various medical record document types. This activity is part of the design work on a new, proposed SGML-based architecture called Kona.

The Kona proposal defines four levels of specificity for the exchange of medical record documents. Level 1, ProseDoc, uses minimal markup and allows exchange of imaged or unstructured documents among a wide community of users. Level 2, ClinicalContent, uses minimal markup of documents with loosely defined specifications for data requirements, considered detailed enough for the exchange of information among providers, payers and regulatory agencies. Level 3, EHR, uses the extensive markup and definitions required for full exchange of a patient’s records between providers. Level 4, Enterprise, uses very specific data definitions and markup such as might be found in an integrated delivery system. A higher level of specificity allows more detailed or accurate searching. Documents that are marked up with highly descriptive tags can be broken into components, allowing users to access the exact information they are seeking.

Information Retrieval

SGML applications use standard Internet protocols for the exchange of information. Marked-up documents, encoded for easy retrieval, are housed in distributed systems or data repositories. Internet or Web-based technology provides the mechanism to find and retrieve these documents. Standard Web browsers will be used as the graphical user interface (GUI) tool for displaying information at the desktop. Search engines will be used to access a patient’s medical record, even when individual documents are located in different systems.

Current problems with Web search engines must be solved, however, before effective retrieval of medical records will be possible. Today’s search engines retrieve many documents that are not relevant to a query. Health care requires precision in the records retrieved, in terms of capturing specific documents for the right patient and for finding all of the records for a patient. Use of a Master Patient Index to register patient identifiers is one way to improve retrieval results. Another method for retrieval of documents across distributed systems may be the use of a standard retrieval protocol. The use of metadata also facilitates retrieval of information pertinent to a particular query.
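As a rough illustration of how a Master Patient Index could improve precision, the sketch below links one enterprise-wide identifier to the local identifiers used by separate systems, then gathers only that patient's documents from each. Everything in it is an assumption made for the example: the identifiers, the repository names and the in-memory dictionaries that stand in for distributed systems.

```python
# Hypothetical Master Patient Index: one enterprise-wide ID mapped to the
# local identifier that each repository uses for the same person.
MASTER_PATIENT_INDEX = {
    "MPI-001": {"clinic": "C-17", "hospital": "H-9042"},
    "MPI-002": {"clinic": "C-18"},
}

# Hypothetical repositories, each keyed by its own local patient identifier.
REPOSITORIES = {
    "clinic": {
        "C-17": ["progress note 1998-03-02"],
        "C-18": ["progress note 1998-04-11"],
    },
    "hospital": {
        "H-9042": ["discharge summary 1998-05-20", "lab results 1998-05-19"],
    },
}

def find_records(master_id):
    """Collect (repository, document) pairs for one patient across all systems."""
    results = []
    local_ids = MASTER_PATIENT_INDEX.get(master_id, {})
    for repo_name, local_id in sorted(local_ids.items()):
        for document in REPOSITORIES[repo_name].get(local_id, []):
            results.append((repo_name, document))
    return results

print(find_records("MPI-001"))
```

Because the lookup goes through registered identifiers rather than free-text matching, the result contains every document for the requested patient and none belonging to anyone else.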

The Need for Metadata

Within the information industry, metadata has long been used to facilitate access to information. For electronic information exchange, there have been multiple independent initiatives to develop metadata within various fields of study. As the amount of information on the Internet grew, it was recognized that a “core” set of metadata used by all communities was desirable and would improve access to relevant, networked information. A series of five workshops taking place between March 1995 and October 1997 resulted in the formation of the Dublin Core Metadata Element Set, which has recently been submitted to the Internet Engineering Task Force (IETF) as an Internet draft.

The Dublin Core consists of 15 elements that describe a resource or document. It is designed to be extensible, meaning that it allows the use of descriptors or qualifiers. (For more on the Dublin Core, see “The Dublin Core: A Simple Content Description Model for Electronic Resources,” Bulletin of the American Society for Information Science, October/November 1997.) Within the group of Dublin Core designers, two views may be found. The minimalists advocate using only the basic Dublin Core in order to keep things simple both for metadata creators and for retrieval systems. The structuralists believe that more formalized data element qualifiers may better meet the needs of certain user communities.

Obviously, the medical community is a highly specialized community that should develop its own detailed metadata scheme. However, the semantics of the Dublin Core are “stable” and, because it “has achieved wide international recognition as the primary candidate for interdisciplinary resource description,” according to Weibel and Hakala, it is a good starting place for the development of medical record metadata. A basic metadata set for a medical record document might be developed as shown in Table 2.
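One way to picture such a metadata set is as a simple relabeling of local record fields with Dublin Core element names. The sketch below uses a handful of the pairings listed in Table 2; the Python field names and the sample values are hypothetical, and fields without a listed pairing are dropped rather than guessed at.

```python
# A subset of the pairings listed in Table 2: local field -> Dublin Core element.
FIELD_TO_DUBLIN_CORE = {
    "patient_name": "Title",
    "provider_of_service": "Author or Creator",
    "dates_of_service": "Date",
    "category_of_document": "Resource Type",
    "medical_record_number": "Source",
    "language": "Language",
    "security_access_terms": "Rights Management",
}

def to_dublin_core(record):
    """Relabel the fields of a record with Dublin Core element names.

    Fields with no known pairing are left out rather than guessed at.
    """
    return {
        FIELD_TO_DUBLIN_CORE[field]: value
        for field, value in record.items()
        if field in FIELD_TO_DUBLIN_CORE
    }

# Hypothetical sample record; the values are invented for illustration.
sample = {
    "patient_name": "Jane Doe",
    "category_of_document": "Progress notes",
    "dates_of_service": "1998-03-02",
    "referring_clinic": "not part of the mapping",
}
print(to_dublin_core(sample))
```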

Use of SGML/XML in Health Care

An exciting new standard from the World Wide Web Consortium, the organization that develops standards specifically for the Web, is XML or the eXtensible Markup Language. Like HTML, the current markup language of the Web, XML is derived from SGML; more precisely, it is a simplified subset of SGML designed specifically for use on the Web and much easier to use than full SGML. However, unlike HTML, which is limited to presentational markup, XML has many of the features of SGML that allow management of complex documents. Also like SGML, XML uses tags to label object data, making navigation to and retrieval of specific information easier. XML users can create Document Type Definitions as needed, invoke an existing DTD or allow a default definition to be used. An XML document that specifies a DTD can be validated by an XML parser, meaning that its structure will be reviewed to see if it follows the rules laid out in the DTD. If no DTD is specified, an XML document can still be “well-formed” if all XML rules are followed. XML is, or soon will be, a standard feature of Web browsers.
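The difference between a well-formed document and a valid one can be seen with a few lines of Python. The sketch below tests only well-formedness; the parser in Python's standard library (xml.etree.ElementTree) is non-validating, so it does not check a document against a DTD. The sample documents are invented for the example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Tags are properly nested.
print(is_well_formed("<record><patient>Jane Doe</patient></record>"))  # True
# Tags overlap, which breaks a basic XML rule.
print(is_well_formed("<record><patient>Jane Doe</record></patient>"))  # False
```

Checking validity would additionally require a validating parser and a DTD that states which elements are allowed and in what order.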

The Kona proposal provides the framework for definition of common data elements in an electronic medical record that can be encoded with standard tags. Because XML tags define objects or parts of a document, they facilitate the transfer of component parts of a document to another computer system. This functionality supports the four levels of the Kona architecture. A recent article by Radosevich asserts that, “if the Kona proposal takes off, the portability of XML documents combined with the Web’s broad reach could be a boon to the health-care industry.”
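A short sketch may help show what transferring a component part means in practice. Here a single tagged element is pulled out of a larger record and serialized on its own, ready to be sent to another system. The element names and the record content are hypothetical and do not come from the Kona proposal.

```python
import xml.etree.ElementTree as ET

# Hypothetical encounter record; tag names and values are illustrative only.
RECORD = """<encounter>
  <patient>Jane Doe</patient>
  <diagnosis code="486">Pneumonia, organism unspecified</diagnosis>
  <narrative>Patient reports three days of fever and cough.</narrative>
</encounter>"""

def extract_component(document, tag):
    """Return the first child element with the given tag, serialized on its own."""
    element = ET.fromstring(document).find(tag)
    if element is None:
        return None
    element.tail = None  # drop the whitespace that followed the element in its parent
    return ET.tostring(element, encoding="unicode")

print(extract_component(RECORD, "diagnosis"))
```

The receiving system gets only the diagnosis element with its code attribute, not the whole encounter, which is the kind of component-level exchange that descriptive tags make possible.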

The Virtual Patient Record

The computer-based medical record, according to the Computer-based Patient Record Institute (CPRI), is a “virtual compilation of non-redundant health data about a person across a lifetime.” As patients become more mobile and as telemedicine activity grows there is an increase in the distribution of patient information among multiple sites or systems. With client/server computing, patient data can be retrieved when needed, viewed at the local desktop (simultaneously with other users), while remaining stored in the original repository. Virtual patient records should not be housed in massive databases; CPRI asserts they should instead reside in “independent computer systems at individual care sites with minimum connectivity requirements and appropriate security.”

Some virtual patient record applications use object-oriented technology to pull together information. Middleware such as CORBA provides access to non-standardized data in legacy systems, but software solutions do not allow true interoperability. On the other hand, XML uses tried-and-true Internet protocols and international standards to define and access tagged objects. Both methods can access information across wide area networks, but a software approach will not stand the test of time. Drafters of the Kona architecture, as supporters of the document-centered approach, believe that “regardless of how well conceived a relational or object-oriented schema, no one can foresee the questions and relationships that will take on significance over time.”

A growing trend is the use of document management systems in health care. The use of distributed systems allows retrieval of patient information directly from the source, rather than through a central repository. Historically, electronic document management systems have solved document management problems within an organization. Another new standard with application to XML documents, the Document Management Alliance (DMA) specification, “provides a rich set of capabilities” that includes “the ability to search across multiple repositories simultaneously, and merge the search results.”

To have true cross-repository interoperability, and to make the vision of a complete virtual patient record a possibility, organizations must move toward the use of standards that are all-encompassing and broad in scope. SGML and XML are standards that fit this description. XML-structured documents with metadata facilitate access to all clinical data for a patient. When combined with Internet protocols and Web browsers, SGML/XML is the enabler for retrieval of patient information across organizational and geographic boundaries. Internet protocols provide the standard methods for exchange of data and Web browsers pull together the information into an easy-to-use, graphical format. But it is the flexibility of SGML/XML that provides the power to this solution.


With increasing globalization, people are recognizing that solutions to problems within an organization are not enough. To realize the full potential the Internet has to offer, it is necessary to have a big picture view. We have the technological capability to access information from anywhere, but the adoption of standards is necessary for expansion of telemedicine and health care information exchange. Middleware solutions have a place in today’s world because they provide a way to retrieve information from legacy systems. As long as organizations continue to create information in proprietary formats, we will always need a “middle” solution. If we move beyond this, however, and use open standards starting from the point of information creation, we minimize the need for a translator.

SGML-based standards such as XML are open standards solutions that help us progress to this next level. SGML is a future-oriented, open standard with a framework that allows detailed definitions by user communities. The future of the Web includes use of XML. These standards work in conjunction with other global standards – Internet protocols for communication and exchange of data, along with common Web browsers and search engines for finding and presenting information. Marked-up documents stored on Web-accessible servers are poised for use by multiple communities, today and tomorrow.

For More Information

Computer-based Patient Record Institute


Document Management Alliance


DC-5: The Helsinki Metadata Workshop


Healthcare Quality Commission


Health Level 7 SGML SIG


Table 1. Standards Development Organizations for Health Care Data Interchange
Health Level Seven (HL7)
  • Health Level Seven Standard Version 2.3, Application Protocol for Electronic Data Exchange
Institute of Electrical and Electronics Engineers (IEEE)
  • P1073.3.1-1994, IEEE Standard for Medical Device Communications - Transport Profile - Connection Mode (MIB)
  • P1157, Standard for Healthcare Data Interchange (MEDIX)
Data Interchange Standards Association (DISA)
  • Electronic Data Interchange (EDI), message format standards
  • X12N - Insurance; eligibility and claim applications
American College of Radiology/National Electrical Manufacturers' Association (ACR/NEMA)
  • PS 3.1-3.13, Digital Imaging and Communications in Medicine (DICOM), radiology applications
National Council for Prescription Drug Programs (NCPDP)
  • Telecommunication Standard Format Version 3.2; transmission of drug claims
ASTM
  • E31.11, Electronic Health Record Portability
  • E31.12, Standards for Electronic Patient Records
Healthcare Informatics Standards Board (HISB)
  • Joint Working Groups for voluntary coordination among U.S. standards developing organizations
National Committee on Vital and Health Statistics (NCVHS), aspe.os.dhhs.gov/ncvhs/
  • Public advisory body to HHS charged with recommending standards for medical record information and its electronic exchange (HIPAA)
Object Management Group (OMG)
  • Software consortium creating standards for data interchange, CORBAMed

Table 2. Metadata Core Set for Medical Records
Medical Record Metadata Dublin Core Equivalent
Patient Name Title
Provider of Service
  • Professional Provider Name
  • Facility Provider Name
  • Professional Provider Address
  • Facility Provider Address
Author or Creator
  • Personal Name
  • Corporate Name
  • Personal Name Address
  • Corporate Name Address
Patient Identification Number
  • Schema=Social Security Number

Discharge Diagnosis Code
  • Schema=ICD-9-CM
Subject and Keywords
Discharge Diagnosis Description (textual)
Provider Organization (billing entity)
  • Name of clinic, facility, group
  • Address of clinic, facility, group
Location of Services (used when a professional renders services at a facility) Other Contributor
Date(s) of Service
  • Date patient seen
  • Date requisitioned
Date (range of dates)
Category of Document
  • Outpatient encounter
  • Inpatient encounter
  • Progress notes
  • X-ray
  • Lab results
Resource Type
  • UB92
  • HCFA 1500
  • ANSI X12N 837
Resource Identifier (could be combination of Patient ID, Provider ID and Date) Resource Identifier
Medical Record Number (local) Source
Language Language
Collection (e.g., medical record for patient x) Relation (to other resources)
may not be applicable Coverage (spatial and temporal)
Security/Access Terms Rights Management

Gloria Shobowale is vice president, Statewide Operations, for Blue Cross Blue Shield of Texas' HMO product. She can be reached by phone at 972/766-8828 or by e-mail at Gloria_Shobowale@bcbstx.com. Her paper on health care information standards was written as part of an independent study project at the University of North Texas, where she is a student. Special thanks go from Gloria to Sam Hastings, faculty advisor for the project.

Bulletin of the American Society for Information Science