A posting from H. Kaiser Yang reports on the release of thirty-one (31) draft XML Schema files and six corresponding sample XML bio-sequence files from the NCBI's data modeling research. The US National Center for Biotechnology Information supports a "multi-disciplinary research group comprised of computer scientists, molecular biologists, mathematicians, biochemists, research physicians, and structural biologists concentrating on basic and applied research in computational molecular biology." NCBI has used ASN.1 [Abstract Syntax Notation One] "for the storage and retrieval of data such as nucleotide and protein sequences, structures, genomes, and MEDLINE records; it permits computers and software systems of all types to reliably exchange both the data structure and content." The draft XML schemas are parallel to the DTDs in current use, and will replace the DTDs in the next version of the database toolkit. NCBI earlier "added support for XML output to its ASN.1 toolkit such that an ASN.1 specification could be automatically rendered into an XML DTD; data encoded in ASN.1 can then be output automatically in XML which will validate against the DTD using standard XML tools."
From the March 16, 2001 document "NCBI Data in XML" (not yet updated to reflect use of XML Schemas):
Roughly ten years ago, NCBI chose a language called Abstract Syntax Notation 1 (ASN.1) for describing and exchanging information in a manner similar to the ways XML is now used. ASN.1 came out of the telecommunications industry and is a compact binary encoding intended for both human readable text as well as integers, floating point numbers, and so on. While this is "software friendly" it is less accessible to users familiar with HTML and other text based languages. Tools for ASN.1 have largely stayed within the commercial telecommunications industry while a host of public domain tools of varying character have arisen for XML and HTML.
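To make the contrast between a compact binary encoding and text markup concrete, here is a minimal sketch of how ASN.1's Basic Encoding Rules (BER) lay out an integer and a short string as tag, length, and value bytes. It is an illustration only, not code from the NCBI toolkit: the function names and the `<id>` element are assumptions made for the example, and only non-negative integers and short ASCII strings are handled.

```python
def ber_integer(value: int) -> bytes:
    """Encode a non-negative integer as a BER INTEGER (tag 0x02)."""
    if value < 0:
        raise ValueError("this sketch handles non-negative integers only")
    # Two's complement needs a spare sign bit, so one extra byte is
    # required whenever the bit length is a multiple of 8.
    body = value.to_bytes(value.bit_length() // 8 + 1, "big", signed=True)
    if len(body) > 127:
        raise ValueError("this sketch supports short-form lengths only")
    return bytes([0x02, len(body)]) + body


def ber_visible_string(text: str) -> bytes:
    """Encode short ASCII text as a BER VisibleString (tag 0x1A)."""
    body = text.encode("ascii")
    if len(body) > 127:
        raise ValueError("this sketch supports short-form lengths only")
    return bytes([0x1A, len(body)]) + body


# Compare the binary form with a hypothetical XML rendering of the same value.
binary_form = ber_integer(12345)
xml_form = "<id>12345</id>".encode("ascii")  # element name is an assumption
print(binary_form.hex(), len(binary_form), len(xml_form))
```

The binary form is shorter but unreadable without a decoder, which is the trade-off the passage describes.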
NCBI has recently added support for XML output to its ASN.1 toolkit. An ASN.1 specification can be automatically rendered into an XML DTD. Data encoded in ASN.1 can automatically be output in XML which will validate against the DTD using standard XML tools. We hope this will make the structured sequence, map, and structure data, as well as the output of tools like BLAST, more accessible to those who wish to work in XML. We are providing XML in two basic modes. Full Data Conversion is the direct mapping of every data field used within NCBI to XML...
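A direct field-by-field mapping of this kind can be sketched as follows. This is a hedged illustration, not the NCBI toolkit's converter: it assumes a naming scheme in which each field element is prefixed with the name of its enclosing type, and the record, the type names, and the `to_xml` helper are all hypothetical.

```python
import xml.etree.ElementTree as ET


def to_xml(type_name, fields):
    """Render a (type_name, fields) record as an XML element.

    Each field becomes a child element named "<type_name>_<field>";
    this naming scheme is an assumption made for the sketch.
    """
    elem = ET.Element(type_name)
    for field, content in fields.items():
        child = ET.SubElement(elem, f"{type_name}_{field}")
        if isinstance(content, tuple):
            # Nested record, given as (type_name, fields).
            child.append(to_xml(*content))
        else:
            child.text = str(content)
    return elem


# Hypothetical record: a sequence whose identifier is an integer.
record = ("Bioseq", {"id": ("Seq-id", {"gi": 12345})})
xml_text = ET.tostring(to_xml(*record), encoding="unicode")
print(xml_text)
```

Because every field is emitted, nothing is lost in the conversion, at the cost of verbose element names.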
While the effect of Roles, Scope, and Alternate Forms results in extensive tags in the XML, it does accurately reflect the structure and use of the data. It allows XML programs to capture as little or as much of the full data structure as they wish. And once converted back from XML to structures or classes in a variety of programming languages there is minimal overhead once again. The full NCBI DTD reflects this structure. What is called the NCBI DTD actually only specifies the basic data structures for publications, sequences, maps, alignments, and structures. These same elements are reused in different roles in many services as well, such as BLAST which produces alignments (defined in the NCBI DTD) as well as other elements specific to BLAST. We have not copied all the referenced modules into a DTD for every service as a practical matter, although we can produce XML output from any ASN.1 interface.
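The point that a program can capture as little or as much of the structure as it wishes can be illustrated with a short sketch using Python's standard `xml.etree.ElementTree`. The sample document and its element names are assumptions made for the example, not taken from the NCBI DTD; the code reads only the identifier elements and ignores everything else.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the element names are illustrative assumptions.
SAMPLE = """
<Bioseq>
  <Bioseq_id>
    <Seq-id><Seq-id_gi>12345</Seq-id_gi></Seq-id>
  </Bioseq_id>
  <Bioseq_descr>
    <Seqdesc><Seqdesc_title>example sequence</Seqdesc_title></Seqdesc>
  </Bioseq_descr>
</Bioseq>
"""

root = ET.fromstring(SAMPLE)

# Collect only the identifier values; the rest of the tree is never inspected.
gi_values = [elem.text for elem in root.iter("Seq-id_gi")]
print(gi_values)
```

Since the element name encodes both the type and the role of the field, a single name lookup is enough to find the value without walking the whole hierarchy.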
- Announcement 2002-05-21: "XML Schema for NCBI Data Model."
- NCBI ASN.1 Summary
- NCBI Data in XML
- NCBI XML DTDs. See the ZIP archive or tarball.
- NCBI ASN.1 - XML resources
- [US] National Center for Biotechnology Information
- Genomics and proteomics databases (H. Kaiser Yang)
- XML Schemas
- Contact: H. Kaiser Yang
- XML for Molecular Biology. Compiled by Paul Gordon.
- "NCBI Molecular Biology Data Model" - Main reference page.
- Related topics:
  - Molecular Dynamics [Markup] Language (MoDL)
  - StarDOM - Transforming Scientific Data into XML
  - Bioinformatic Sequence Markup Language (BSML)
  - BIOpolymer Markup Language (BIOML)
  - Gene Expression Markup Language (GEML)
  - GeneX Gene Expression Markup Language (GeneXML)
  - Genome Annotation Markup Elements (GAME)
  - MicroArray and Gene Expression Markup Language (MAGE-ML)
  - Microarray Markup Language (MAML)
  - XML for Multiple Sequence Alignments (MSAML)
  - Systems Biology Markup Language (SBML)
  - OMG Gene Expression RFP
  - Protein Extensible Markup Language (PROXIML)
  - Taxonomic Markup Language
  - XDELTA: XML Format for Taxonomic Information
  - The Species Analyst Project