[April 11, 2000] A communiqué from Michael Hoffman of Rosetta Inpharmatics reports on the development of GEML - an open-standard XML format for DNA microarray and gene expression data. The Gene Expression Markup Language (GEML) "is a file format for storing DNA microarray and gene expression data for chip patterns and chip scans (profiles). GEML is an open-standard XML format which enables exchanging data between a variety of gene expression systems including web-based genome databases. GEML stores which data collection methodology was used, without making assumptions about the meaning of a measurement. This enables possible normalization, integration, and comparison of data across methodologies. GEML handles profile data whether or not a specific pattern is referenced and whether or not the scans include raw image data or refer to image files. This format also handles absolute or relative intensity measurements. GEML is independent of any particular database schema. This data standard is designed to separate data reporting and collection from methodology used. The profile_type attribute enables keeping track, in the data file, of which methodology was used, enabling normalization across methodologies and thus comparison of data which was collected by a variety of methodologies. Expression analysis data is generated by a variety of sources and is usually stored in the form of disk files. Industry-wide standards for the formats of such files do not exist, so there are now many different formats, with different representations of data and different types of information stored. For example, some data file formats for chip scans reference a specific pattern, while others do not. Similarly, some data file formats for chip scans can contain the raw image data or can reference one or more image files, while others cannot. The GEML format is designed to be broadly applicable and to support easy exchange of data among these various other formats. 
GEML is a published, documented standard offered expressly for the purpose of data interchange among gene expression systems and tools. This data file format has the following advantages: (1) Independent of any particular database schema. (2) Keeps track of which data collection methodology was used, enabling normalization, integration, and comparison of data across methodologies. (3) Extensible through the ability to specify additional name/value pairs. (4) Is XML-based. GEML supports translating from file formats for various genomics systems and related data collection methodologies. GEML is a free, public-domain, open-standard XML DTD. GEML was created and is licensed in order to define a single, distinct GEML format and avoid proliferation of incompatible variations."
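The description above can be made concrete with a small sketch of how a consumer might group profiles by collection methodology and read the extensible name/value pairs. Only the profile_type attribute is named in the announcement. The element names (geml_sketch, pattern, profile, other, measurement) and all other attributes below are hypothetical stand-ins, not taken from the GEML DTD.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical GEML-like fragment. Only "profile_type" comes from the
# announcement; every element name and other attribute is an assumption
# made for illustration and does not reflect the actual GEML DTD.
SAMPLE = """\
<geml_sketch>
  <pattern name="chip_A"/>
  <profile name="scan_1" profile_type="two_color_ratio" pattern="chip_A">
    <other name="operator" value="lab_1"/>
    <measurement reporter="r1" value="1.8"/>
    <measurement reporter="r2" value="0.6"/>
  </profile>
  <profile name="scan_2" profile_type="absolute_intensity">
    <measurement reporter="r1" value="1520"/>
  </profile>
</geml_sketch>
"""


def group_by_methodology(xml_text):
    """Map each profile_type to a list of (profile name, measurements)."""
    root = ET.fromstring(xml_text)
    groups = defaultdict(list)
    for profile in root.iter("profile"):
        # Values are stored as reported; no meaning is assumed here.
        values = {
            m.get("reporter"): float(m.get("value"))
            for m in profile.iter("measurement")
        }
        groups[profile.get("profile_type")].append((profile.get("name"), values))
    return dict(groups)


def extra_pairs(xml_text, profile_name):
    """Return the additional name/value pairs attached to one profile."""
    root = ET.fromstring(xml_text)
    for profile in root.iter("profile"):
        if profile.get("name") == profile_name:
            return {o.get("name"): o.get("value") for o in profile.findall("other")}
    return {}


if __name__ == "__main__":
    for methodology, profiles in group_by_methodology(SAMPLE).items():
        print(methodology, profiles)
    print(extra_pairs(SAMPLE, "scan_1"))
```

In this sketch the second profile carries no pattern reference, mirroring the statement that profile data is handled whether or not a specific pattern is referenced. Keeping the methodology label next to the raw values is what would let a downstream tool choose a normalization per group before comparing across methodologies.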
[November 29, 2000] "On behalf of the GEML Community, Rosetta Inpharmatics has submitted to the Object Management Group (OMG) a proposed DTD based on the new version of Gene Expression Markup Language - GEML 2.0." Rosetta Inpharmatics' Initial Submission regarding the Gene Expression RFP describes work in connection with the GEML DTD: "Rosetta Inpharmatics and Agilent Technologies have been using the GEML 1.0 format as part of internal pipelines for the past year. Rosetta has been continuously loading XML files on the order of thirteen megabytes into the Rosetta Resolver system, an enterprise expression data analysis product. We recently used internal tools to export the more than one thousand profiles, assigned annotations, and supporting patterns that constituted the data for the article, Functional Discovery via a Compendium of Expression Profiles, that appeared in the July 7, 2000 issue of Cell. The total size of the export, when compressed, was a little over a half of gigabyte of data. That data was then imported by Harvard into their Rosetta Resolver system. We have not, as of yet, implemented the interfaces contained in this proposal but given that the size of the compressed XML files has proven no technical obstacle, we see no technical problems in implementing the interfaces. Rosetta has developed the freeware GEML Conductor tools for visualization of GEML formatted data and for conversion of gene expression data in other formats into GEML." See the XML DTD and IDL file.
References:
[November 29, 2000] Rosetta Inpharmatics' Initial Submission regarding the OMG's Gene Expression RFP -- XML DTD for GEML 2.0.
OMG Collaborative Revised Submission for Gene Expression: "The Life Science Research Technical Committee (LSRTC) issued a Request For Proposal (RFP) that asked for '...proposals which define interfaces and services in support of array based gene expression data collection, management, retrieval, and analysis.' The progress of this RFP can be found at the Gene Expression RFP section of the OMG Web site. Although the scope of the proposals is narrowed to a specific area of biotechnology, the amount of data generated by these experiments can be enormous (i.e., in the range of gigabytes). There is also the need to capture experimental annotation, such as the purpose for a particular set of experiments, the technology used to print the array, the hardware used to scan the arrays, the method by which biological samples were treated, quality control, etc. Three proposals were received by the OMG. The submitters were NetGenics, European Bioinformatics Institute (EBI) for Microarray Gene Expression Database Group (MGED), and Rosetta Inpharmatics for Gene Expression Markup Language (GEML) format. The three submitters have agreed to work together on a combined revised submittal..."
[January 12, 2001] "Living Language." By Mark Pesce. From Feedmag.com (January 2001). ['HTML revolutionized the way information is shared worldwide. Can a new language do the same for the human genome? Mark Pesce reports'] "Everything's coming up roses! Or more precisely, Arabidopsis thaliana, a somewhat nondescript white flower selected by biologists as the model in the plant kingdom for genetic research. Related to broccoli and cauliflower, A. thaliana is the most studied plant in human history; every week new papers are published about its properties. In an astounding leap forward, the journal Nature just announced that the entire genome for A. thaliana has been sequenced, a first for the vegetal world. As plants go, it was a relatively easy task; the complete genome runs to 120 megabytes of information, compared to 1.6 gigabytes for wheat and a hefty 3 gigabytes for humanity. What makes this discovery just a bit different from the ever-increasing flow of genetic revelations is that, in another first, Nature has announced that all genomic information presented in its pages -- and on its Web site -- will be published in GEML, or Gene Expression Markup Language, a lingua franca defining a common standard for the bits of life... WHAT IS GEML exactly? It's a DTD (Document Type Definition) for the common expression of genetic information. Those of you who have done any Web design are likely familiar with another DTD -- HTML -- and its 'tags,' those little bits of formatting information enclosed by the '<>' symbols. In HTML there are tags such as 'TITLE' (which gives a page its title), 'B' (for bold), 'IMG' (for images) and so forth. GEML has its own tags, which define the kinds of data that interest geneticists...
GEML, together with some clever computer programs, could help scientists greatly accelerate the process of winnowing the chaff from the grain of our genetics, allowing them to share their complementary (and often conflicting) databases of identified gene sequences to produce a more accurate map of ourselves. Craig Venter, CEO of Celera Genomics, has openly speculated that mapping the human genome onto gene sequences could take the next fifty years; with GEML, this estimate could easily be cut in half, provided that geneticists in competitive commercial organizations find it more profitable to share what they've learned than to keep it hoarded away and hidden from view. GEML ISN'T alone. It has a competitor, another DTD known as CellML, used to define the complex interactions that take place within cells. CellML takes an integrated approach to describing all of the processes within a living cell -- its genes, proteins, enzymes, and chemical reactions, the pathways and connections between each part of the whole. CellML seems well suited to the kinds of work that supercomputers do -- creating simulations of incredibly complex systems -- while GEML only defines the genetics that create the cell. Neither GEML nor CellML may be the final word in this convergence between biology and information. And, despite Metcalfe's Law -- which states that the value of a thing increases as more and more people use it -- the CEOs of the genomics companies are at least a little afraid that if knowledge advances too widely, their hard-earned advantages will slip away like water through their fingers. The year 2001 is to the genomics industry what the year 1991 was to informatics. The pieces are all in place for an incredible explosion in discovery, creativity, and wealth. But they're locked behind the prison walls of fear..."
OMG RFI. Responses to the gene expression request for information (RFI)
Rosetta Inpharmatics response to RFI.
Contact the developer: Michael Miller
See also: "GeneX Gene Expression Markup Language (GeneXML)." Formerly also called 'GEML'.