XML is being put to novel use in IBM's Genomic Messaging System (GMS) research, part of the Integrated Medical Records (IMR) middleware project.
The focus of the GMS design is the "representation, transmission, and storage of patient genomic information, particularly in the construction of the unified clinical and genomic record, and exploring the standards required. GMS is a proposed specification for an approach with an emphasis on a specific language for embedding supporting information and management functions in streams of DNA data."
According to a project description from the IBM Haifa Labs web site, "the core function of the GMS software is to prepare the genomic information, compress and encrypt it, transmit (or store) it, and decompress and decrypt on receipt (or recovery from storage). This core function, however, is merely the underlying data-representation structure of a larger system, which has the potential to cover many features of clinical bioinformatics."
The Genomic Messaging System Language (GMSL) defines a data stream for information storage and transmission. This language "is highly condensed using Shannon-information-theoretic principles: each command and data element is represented by an 8-bit byte, including bytes that represent the bases of the DNA itself, at various optional levels of compression, down to four base pairs per byte. The language provides basic support features for annotation of the DNA by the clinical genomicist."
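The densest level mentioned above, four bases per byte, corresponds to two bits per base. The following is a minimal sketch of that packing, not the GMSL encoding itself: the bit assignments in `BASE_TO_BITS` and the function names are assumptions chosen for illustration, and the sequence length is passed separately so that padding in the last byte can be discarded.

```python
# Hypothetical 2-bit codes; the actual GMSL byte assignments are not given in the text.
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def pack_bases(seq: str) -> bytes:
    """Pack a DNA string into bytes, four bases per byte, first base in the high bits."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | BASE_TO_BITS[base]
        # Left-align a short final chunk; the unused low bits are padding.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack_bases(data: bytes, length: int) -> str:
    """Reverse pack_bases; `length` is the original number of bases."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BITS_TO_BASE[(byte >> shift) & 0b11])
    return "".join(bases[:length])


if __name__ == "__main__":
    packed = pack_bases("ACGTA")
    print(packed.hex())             # 1b00
    print(unpack_bases(packed, 5))  # ACGTA
```

A real stream would also have to reserve byte values for the embedded commands, which this sketch leaves out.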
The primary functions of the Genomic Messaging System Language (GMSL), as discussed in a recent issue of the Journal of Proteome Research, include: (1) retaining content of the source clinical documents as required, and combining patient DNA sequences or fragments; (2) allowing the expert to add annotation to the DNA and clinical data prior to its storage or transmission; (3) enabling addition of passwords and file protections; (4) providing tools for levels of reversible and irreversible scrubbing (anonymization) of the patient ID; (5) preventing the addition of erroneous DNA and other lab data to the wrong patient record; (6) enabling several forms of compression and encryption at various levels, which can be supplemented by standard methods applied to the final files; (7) selecting methods of portrayal of the final information by the receiver, including choice of what can be seen; (8) allowing a special form of XML-compliant staggered bracketing to encode DNA and protein features which, unlike valid XML tags, can overlap.
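Item (8) addresses a real constraint: ordinary XML elements must nest, so two features that partly overlap on a sequence cannot both be wrapped in paired tags. One generic way around this is to mark feature boundaries with empty "milestone" elements. The sketch below illustrates that general technique only; it is not the GMS staggered-bracketing syntax, and the element names `start` and `end`, the `id` attribute, and the sample sequence are all assumptions.

```python
import xml.etree.ElementTree as ET

# Two features overlap: exon1 covers GCCTA, site1 covers TAGG.
# Empty start/end markers keep the document well-formed XML.
DOC = (
    '<dna>AT<start id="exon1"/>GCC<start id="site1"/>TA'
    '<end id="exon1"/>GG<end id="site1"/>C</dna>'
)


def feature_spans(xml_text: str):
    """Return the plain sequence and a {feature id: (begin, end)} map of offsets."""
    root = ET.fromstring(xml_text)
    sequence = root.text or ""
    open_at = {}
    spans = {}
    for marker in root:
        fid = marker.get("id")
        if marker.tag == "start":
            open_at[fid] = len(sequence)
        elif marker.tag == "end":
            spans[fid] = (open_at.pop(fid), len(sequence))
        # Text following a marker belongs to the sequence.
        sequence += marker.tail or ""
    return sequence, spans


if __name__ == "__main__":
    seq, spans = feature_spans(DOC)
    for fid, (begin, end) in spans.items():
        print(fid, begin, end, seq[begin:end])
```

Because each feature is identified by its `id` rather than by nesting, any number of features can share or straddle the same bases.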
GMS's functionality is "extended by plug-in packages or 'cartridges' both at the input and output ends of the messaging. These enable conversion between GMSL and other XML representations, including the HL7 Clinical Document Architecture. They also include miniature expert systems, which will add automatic annotations at both the DNA and protein sequence levels, merging them with any annotations added interactively by the user. They also include specialized display and interaction cartridges."
The IBM developers report that "the most sophisticated cartridge is a basic automated protein modeling suite, which will model the patient's polymorphic protein from the transmitted gene. Prominent among the applications linked are protein science applications, including the rapid automated modeling of patient proteins with their individual structural polymorphisms. In an initial study, GMS formed the basis of a fully automated system for modeling patient proteins with structural polymorphisms as a basis for drug selection and ultimately design on an individual patient basis."
GMS Description in Journal of Proteome Research
"Genomic Messaging System and DNA Mark-Up Language for Information-Based Personalized Medicine with Clinical and Proteome Research Applications." By Barry Robson and Richard Mushlin. In Journal of Proteome Research Volume 3, Issue 5 (October 11, 2004), pages 930-948 (with 16 references). American Chemical Society (ACS) Publications. DOI: 10.1021/pr0341336.
XML representations are provided in the appendices:
- Appendix 1: GMS Language Commands; a .gms file comprises a field of DNA base pair characters AGCT in which commands are embedded
- Appendix 2: Sample Extract CDA .gmi File. This is the patient record data feed (XML format)
- Appendix 3: Sample DNA .gmd File (XML format)
- Appendix 4: Sample .gms Extract of File from CDA Cartridge (header and clinical sections removed)
- Appendix 5: Sample .xml File after Automatic Annotation (header and clinical sections removed)
- Appendix 6: Sample Screen Shots of Display of .html File after Automatic Annotation
CDA, or Clinical Document Architecture, is a specific embodiment of XML proposed by Health Level Seven Inc. for medical applications.
See the online preprint/proof version of "Genomic Messaging System and DNA Mark-Up Language" from the Computational Biology web site at the IBM Thomas J. Watson Research Center.
About IBM Integrated Medical Records (IMR)
"Integrated Medical Records (IMR) is middleware being developed at IBM Haifa that can be used to integrate and correlate medical records from diverse sources and transform data into knowledge. Today's medical arena is faced with the challenge of providing patients with improved care, reduced costs, and more efficient use of medical records. The only way to meet this challenge is to create a technology that allows patient information from different sources to be conveniently accessed and shared by different organizations, while maintaining patient privacy and information security. IMR is part of the SHAMAN system which is developed by the Haifa and Watson research labs.
Electronic health records (EHR) are defined as digitally stored healthcare information about an individual's lifetime, with the purpose of supporting patient care, education, and research. These records include data on observations, laboratory tests, diagnostic imaging reports, treatments, therapies, drugs administered, patient identification information, legal permissions, and so on. The IMR middleware transforms medical records from human-readable to machine-processable and facilitates the extraction of electronic health records, or parts of it according to some classification...
Using innovative technology, the data is first annotated to create XML documents that are human readable, as well as machine processable. Next, the XML documents are fully indexed (structure and free text) using an indexing system specialized for XML documents. The middleware API then provides a unified and secured access to electronic health records (EHRs) that are compiled from those documents..."
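Indexing both "structure and free text" can be pictured as two inverted indexes: one keyed by element path, one keyed by word. The sketch below is an illustration under assumptions, not the indexing system IMR uses; the function name, the sample element names, and the tokenization rule are hypothetical, and text that follows a child element (tail text) is ignored for brevity.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict


def index_document(doc_id, xml_text, structure_index, text_index):
    """Add one XML document to a path index and a word index."""
    root = ET.fromstring(xml_text)

    def walk(elem, parent_path):
        path = f"{parent_path}/{elem.tag}"
        structure_index[path].add(doc_id)
        # Lowercase alphanumeric tokens from the element's own text.
        for word in re.findall(r"[a-z0-9]+", (elem.text or "").lower()):
            text_index[word].add(doc_id)
        for child in elem:
            walk(child, path)

    walk(root, "")


if __name__ == "__main__":
    structure_index = defaultdict(set)
    text_index = defaultdict(set)
    index_document("d1", "<record><allergy>penicillin</allergy></record>",
                   structure_index, text_index)
    index_document("d2", "<record><finding>Penicillin rash</finding></record>",
                   structure_index, text_index)
    # Documents mentioning the word anywhere versus documents with an allergy element.
    print(sorted(text_index["penicillin"]))
    print(sorted(structure_index["/record/allergy"]))
```

Keeping the two indexes separate lets a query distinguish a word that appears anywhere from a word that appears under a particular element.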
Features. The IMR project leverages several technologies that exist within the Services and CRM Technology department and the Knowledge Management department in Haifa. A powerful engine combines several technologies and enables:
- Correlation of information from different sources within the hospital or from other healthcare centers
- Correlation of anonymous patient information with articles, research papers, and journals that are publicly available
- Correlation of information based on classifications, such as demographic information (genetics, age, sex, etc.), allergies, diseases, findings, and medical history
- Query of the data repository for information extraction, such as 'find all patients whose age is between 30 and 40, female, had over five pregnancies, and have cancer in their family history'
- Correlation of information from drug catalogs with information about allergies, demographic information, and physical characteristics such as weight
- Hiding of the patient's identification to provide anonymous patient records for research and education [adapted from the overview]
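The example query in the list above can be written as a simple predicate over extracted records. This is only a sketch of the query's logic: the `PatientRecord` type and its field names are hypothetical, the age range is treated as inclusive, and the actual IMR query interface is not described in the overview.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatientRecord:
    # Hypothetical fields; record_id stands in for an anonymized identifier.
    record_id: str
    age: int
    sex: str
    pregnancies: int
    family_history: frozenset = frozenset()


def matches_example_query(rec: PatientRecord) -> bool:
    """Age 30 to 40 (inclusive), female, more than five pregnancies, family history of cancer."""
    return (
        30 <= rec.age <= 40
        and rec.sex == "female"
        and rec.pregnancies > 5
        and "cancer" in rec.family_history
    )


if __name__ == "__main__":
    records = [
        PatientRecord("r1", 35, "female", 6, frozenset({"cancer"})),
        PatientRecord("r2", 35, "female", 5, frozenset({"cancer"})),
        PatientRecord("r3", 45, "female", 7, frozenset({"cancer", "diabetes"})),
    ]
    print([r.record_id for r in records if matches_example_query(r)])
```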
Integrated Medical Records (IMR) Components: The major components include:
- a transformation engine: "Based on the transformations written in the registry, the transformation engine reviews the contents for each data source and annotates it to create XML documents. Currently, the structure of these documents is based on the CDA (HL7 Clinical Document Architecture) format."
- a transformation registry: "This component provides the engine with a list of the transformations it should use on the current data source."
- a document repository: "All the CDAs produced by the engine are stored in the document repository. IMR uses XMLFS to perform the indexing on these documents."
- an EHR extractor: "The electronic health record (EHR) extractor contains the APIs that enable external applications to retrieve data from the document repository."
- an authentication and authorization component: "Using this component, users can be specified and given access privileges to specific clinical categories."
About Secure Health and Medical Access Network (SHAMAN)
"Sequencing the human genome and recent advances in the Bioinformatics domain, suggest that medicine of the future will take advantage of genomic data. However, personal genomic and expression data cannot be used to their full potential unless there is tight linkage to patient records in addition to the ability to access these records. The Secure Health and Medical Access Network (SHAMAN) provides the infrastructure to obtain this goal. SHAMAN envisages a unified treatment of three classes of applications and markets: telemedicine (patient-to-professional and professional-to-professional, including management of clinical trials), general public self-health management via the Internet, and anonymous mining of medical data for drug-development, World healthcare, the FDA, and insurance and government bodies. Ultimately, SHAMAN will also include plugging into a potential pathogen data base and IT infrastructure, human gene repositories for research, and large scale research genomics projects for use of animal models as models of human disease...
First steps in the development of SHAMAN are the encoding of the patient record as an Electronic Health Record (EHR). The following data will ultimately be attached to the record: genomic and expression data, X-ray, cardiographic, histological and other graphic and annotation data. This development is carried out by the IMR project..." [from the SHAMAN overview]
Principal references:
- "Genomic Messaging System and DNA Mark-Up Language for Information-Based Personalized Medicine with Clinical and Proteome Research Applications." Preprint from Journal of Proteome Research.
- Genomics Messaging System (GMS). Reference web site.
- IBM Integrated Medical Records (IMR)
- IMR Components
- IMR Proof of Concept
- Secure Health and Medical Access Network (SHAMAN)
- IBM Haifa Labs
- IBM Research in Computational Biology
- Contact: Barry Robson (IBM T. J. Watson Research Center, Yorktown Heights, NY).
- "XML in Clinical Research and Healthcare Industries" - General references.