Cover Pages: NCBI Molecular Biology Data Model

Last modified: May 24, 2002

NCBI Molecular Biology Data Model

[May 24, 2002] From the March 16, 2001 document "NCBI Data in XML" (not yet updated to reflect use of XML Schemas):

Roughly ten years ago, NCBI chose a language called Abstract Syntax Notation 1 (ASN.1) for describing and exchanging information in a manner similar to the way XML is now used. ASN.1 came out of the telecommunications industry and has a compact binary encoding intended for human-readable text as well as integers, floating point numbers, and so on. While this is "software friendly," it is less accessible to users familiar with HTML and other text-based languages. Tools for ASN.1 have largely stayed within the commercial telecommunications industry, while a host of public domain tools of varying character have arisen for XML and HTML.

NCBI has recently added support for XML output to its ASN.1 toolkit. An ASN.1 specification can be automatically rendered into an XML DTD. Data encoded in ASN.1 can automatically be output in XML which will validate against the DTD using standard XML tools. We hope this will make the structured sequence, map, and structure data, as well as the output of tools like BLAST, more accessible to those who wish to work in XML. We are providing XML in two basic modes. Full Data Conversion is the direct mapping of every data field used within NCBI to XML...
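As a rough illustration of what rendering a specification into a DTD can look like, the following Python sketch turns a simplified record description into DTD element declarations. The input format, the `render_dtd` helper, and the `Type_field` naming rule are assumptions made for this example; they are not the NCBI toolkit's actual algorithm or API.

```python
# Toy illustration only: the spec format and the naming rule below are
# assumptions for this sketch, not the NCBI toolkit's real conversion logic.

def render_dtd(type_name, fields):
    """Render one SEQUENCE-like type as DTD element declarations.

    fields is a list of (field_name, optional) pairs. Every field is
    treated as plain text (#PCDATA) to keep the sketch small.
    """
    child_names = [f"{type_name}_{name}" for name, _ in fields]
    # Optional fields get a '?' occurrence marker in the content model.
    content = ", ".join(
        child + ("?" if optional else "")
        for child, (_, optional) in zip(child_names, fields)
    )
    lines = [f"<!ELEMENT {type_name} ({content})>"]
    lines += [f"<!ELEMENT {child} (#PCDATA)>" for child in child_names]
    return "\n".join(lines)


# Hypothetical date record with one required and two optional fields.
dtd_text = render_dtd("Date-std", [("year", False), ("month", True), ("day", True)])
print(dtd_text)
```

A real converter would also have to handle nested types, choices, repeated fields, and typed values, all of which this sketch leaves out.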

While Roles, Scope, and Alternate Forms result in extensive tags in the XML, the tagging accurately reflects the structure and use of the data. It allows XML programs to capture as little or as much of the full data structure as they wish. Once the data is converted back from XML into structures or classes in a variety of programming languages, there is minimal overhead once again. The full NCBI DTD reflects this structure. What is called the NCBI DTD actually specifies only the basic data structures for publications, sequences, maps, alignments, and structures. These same elements are reused in different roles in many services as well, such as BLAST, which produces alignments (defined in the NCBI DTD) as well as other elements specific to BLAST. As a practical matter, we have not copied all the referenced modules into a DTD for every service, although we can produce XML output from any ASN.1 interface.
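The point about capturing "as little or as much" of the structure can be sketched with Python's standard `xml.etree.ElementTree` module. The XML fragment below is a hand-written, hypothetical sample; its element names and values are assumptions for illustration and are not taken from an actual NCBI record. The program reads two fields and ignores everything else.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily tagged fragment written for this example only.
SAMPLE = """
<Bioseq>
  <Bioseq_id>
    <Seq-id><Seq-id_gi>12345</Seq-id_gi></Seq-id>
  </Bioseq_id>
  <Bioseq_inst>
    <Seq-inst>
      <Seq-inst_length>8</Seq-inst_length>
      <Seq-inst_seq-data>ACGTACGT</Seq-inst_seq-data>
    </Seq-inst>
  </Bioseq_inst>
</Bioseq>
"""


def summarize(xml_text):
    """Pull out only an identifier and a length, skipping all other tags."""
    root = ET.fromstring(xml_text)
    # './/name' searches all descendants, so intermediate wrapper
    # elements do not need to be spelled out.
    gi = root.findtext(".//Seq-id_gi")
    length = root.findtext(".//Seq-inst_length")
    return {"gi": gi, "length": int(length) if length is not None else None}


summary = summarize(SAMPLE)
print(summary)
```

Because the lookups name only the fields of interest, the same function keeps working if additional elements appear elsewhere in the record.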

[May 24, 2002] XML Schemas for the NCBI Molecular Biology Data Model. A posting from H. Kaiser Yang reports on the release of thirty-one (31) draft XML Schema files and six corresponding sample XML bio-sequence files from the NCBI's data modeling research. The US National Center for Biotechnology Information supports a "multi-disciplinary research group comprised of computer scientists, molecular biologists, mathematicians, biochemists, research physicians, and structural biologists concentrating on basic and applied research in computational molecular biology." NCBI has used ASN.1 [Abstract Syntax Notation One] "for the storage and retrieval of data such as nucleotide and protein sequences, structures, genomes, and MEDLINE records; it permits computers and software systems of all types to reliably exchange both the data structure and content." The draft XML schemas are orthogonal to the DTDs in current use, and will replace the DTDs in the next version of the database toolkit. NCBI earlier "added support for XML output to its ASN.1 toolkit such that an ASN.1 specification could be automatically rendered into an XML DTD; data encoded in ASN.1 can then be output automatically in XML which will validate against the DTD using standard XML tools." [Full context]

References:

[US] National Center for Biotechnology Information
Announcement 2002-05-21: "XML Schema for NCBI Data Model."
NCBI ASN.1 Summary
NCBI Data in XML [cache, alt URL]
NCBI XML DTDs. See the ZIP archive or tarball.
NCBI ASN.1 - XML resources
Genomics and proteomics databases (H. Kaiser Yang)
XML Schemas [cache]
Contact: H. Kaiser Yang [alt email handle]
XML for Molecular Biology Compiled by Paul Gordon.

Historic - NCBI Records

[From 1997-11-18]

The National Center for Biotechnology Information (NCBI), organized under the National Library of Medicine and National Institutes of Health, uses SGML encoding in the structuring and interchange of bibliographic information and associated metadata for PubMed. "PubMed is NLM's search service to access the 9 million citations in MEDLINE and Pre-MEDLINE (with links to participating on-line journals), and other related databases." The service was inaugurated in June, 1997, "officially announced at a Capitol Hill press conference by Vice President Gore, Senators Harkin and Specter and the Directors of NIH, NLM and NCBI."

The use of SGML encoding in the communication format from publishers (while simple) is significant in light of the magnitude of the database: the delivery of MEDLINE access through PubMed "makes available the world's largest medical database -- more than nine million medical articles from 70 different countries [...] -- to consumers, health professionals, medical librarians, and scientists around the globe. Over 30,000 people a day are using MEDLINE, and the database is growing at a rate of 1,000 citations a day. Presently, full text of about 100 scientific journals is linked to PubMed, and hundreds more journals are expected to be online in the coming months."

Within PubMed, SGML encoding is the "standard data format for publishers to use in submitting citation data to NCBI for processing into the MEDLINE or PubMed databases."

Links:

Home Page
NCBI standard publisher data format, with Version 1.5 of the NCBI DTD. [local archive copy]
NCBI SGML entity list; [local archive copy]
PubMed Overview
NLM Home Page
National Library of Medicine (NLM) - main database entry

Addresses:
National Center for Biotechnology Information
National Library of Medicine
Building 38A, Room 8N805
Bethesda, MD 20894
Tel: 301-496-2475
Fax: 301-480-9241
Email: pubmed@ncbi.nlm.nih.gov

