[This local archive copy (text only) mirrored from the canonical site: http://www.topogen.com/sbir/rfc.html; links may not have complete integrity, so use the canonical document at this URL if possible.]
Bioinformatic Sequence Markup Language (BSML)
Request for Comments: 971201 Obsoletes: 970901
TopoGEN, Inc. (www.topogen.com)
1275 Kinnear Road
Columbus, OH 43212
USA
Direct email responses to: joe.topogen@iwaynet.net
RFC current version (this document): www.topogen.com/sbir/rfc.html
Definition of standard: BSML.DTD (latest version)
Bioinformatic Sequence Markup Language (BSML):
A Public Domain Protocol for Graphic Genomic Displays
Status of this Document
This document specifies a public domain standard for the encoding and display of DNA, RNA and protein sequence information (the project is funded by a grant from the National Human Genome Research Institute). The document requests discussion and suggestions for improvement, and distribution of the document is unlimited. Responses to this document may be posted for public discussion, under the respondent's name or anonymously, unless such posting is explicitly prohibited in the response (include "Do not post" or "Post anonymously.").
1. Introduction
1.1 Need for a Standard
1.2 Goals and Criteria
1.3 Implementation Requirements
1.3.1 Development and Formalization of the Standard
1.3.2 Stages of Implementation
1.3.3 Implementation Criteria
1.3.4 Language Standards: Encoding Semantic Content
1.3.5 Software Criteria: Creating and Interpreting Semantic Content
2. BSML Language Overview
2.1 Background
2.2 Source Standards
2.2.1 Standards for Genetic Sequence Information (NCBI's ASN.1 and NCGR's GSDB)
2.2.2 Standards for Encoding Semantic Content (SGML)
2.2.3 Standards for Encoding and Enabling Network Communication (XML)
2.2.4 Standards for Encoding and Managing Display Properties (HTML,CSS,DSSSL)
2.3 Semantic Encoding in SGML
2.3.1 Notes on Conventions and Symbols - A Brief SGML/XML/BSML Tutorial
2.3.2 XML Element Identification
2.3.3 XML Element References
2.4 Representing Sequences and Their Features
2.4.1 Separating Content from Display: BSML Document Sections
2.4.2 Using BSML Semantic Content
2.4.3 Definitions: Representing Sequence Information
2.4.4 Displaying Sequence Information
2.5 Representing Sequence and Feature Sets
2.6 Representing Sequence and Feature Data
2.7 Displaying Sequences, Features, and Sets
2.7.1 Representing Display Objects
2.7.2 Graphic Display Objects
2.7.3 Textual Display Objects
2.7.4 Representing Sizes, Positions and Dimensions
2.8 Representing Links
2.8.1 Internal Links
2.8.2 External Links
2.8.3 Implicit Links
2.8.4 Explicit Links
2.8.5 One-to-One Links
2.8.6 One-to-Many and Many-to-One Links
2.8.7 Link Actuation and Behavior
2.9 Navigation and Selection
2.9.1 Graphic
2.9.2 Element Hierarchy
2.9.3 Linked
2.9.4 Query
2.10 Controlling Display Style
2.10.1 Style Sheets
2.10.2 Applying Style to Elements
2.11 Structure of a BSML Document
3. BSML Examples
3.1 Basic Sequence Display
3.2 Sequence with Default Interval Feature Display
3.3 Sequence with Point Feature Display
4. BSML Formal Language Specification
4.1 Document Type Definitions (DTDs)
4.2 Basic Elements and Entities
4.3 XML and SGML Compliance
4.4 BSML Document Type Definition (DTD)
5. BSML Browser Overview
5.1 Overall Browser Capabilities
5.2 Communication Criteria
5.3 Navigation Criteria
5.4 Visualization Criteria
5.5 Textual Browsers
5.6 Conversion Utilities
6. BSML Software Specifications
6.1 Manual Creation and Editing Requirements
6.2 Automatic Creation Requirements
6.3 BSML Processing Requirements
6.4 Entity Manager Requirements
7. Relationship of BSML to Other Initiatives
7.1 Biowidget Consortium
7.2 CORBA
8. BSML 2.0: Advanced Topics to be Implemented
8.1 Password Protection
8.2 Encryption
8.3 DSSSL Support
1. Introduction
The primary purpose of this project, funded by an SBIR from the National Human Genome Research Institute, is to develop a public domain protocol for graphic genomic displays (background published earlier is available at www.topogen.com/sbir/pubgraph.html). This section provides an overview of the rationale for the standard.
1.1 Need for a Standard
There are currently many sources for graphic displays of sequences (chromosome, genetic, and physical maps of a variety of types). These include displays produced by:
A public domain standard is needed for sequence representation and display because currently there are:
1.2 Goals and Criteria
A standard for representing sequences and their graphic display properties should:
The storage and representation of sequence information should be:
Although the standard need not explicitly define software requirements for its implementation, some requirements are strongly implied by the formulation. Software for document creation should hide the underlying data representation system from users and provide:
Software for the display of the encoded information (a "BSML sequence browser") should provide:
1.3 Implementation Requirements
1.3.1 Development and Formalization of the Standard
This document proposes a general approach for the standard, but does not include all implementation details. Publication of the initial working version of the standard (BSML 1.0) will occur around Jan. 31, 1998, after incorporating revisions to the current specification (BSML 0.1). Publication of BSML 1.0 will be accompanied by a number of formalization actions, including establishment of:
1.3.2 Stages of Implementation
It seems certain that a useful standard must evolve over time. Consequently, we propose to begin with a limited standard that accomplishes some fundamental tasks, but which does not attempt to accomplish all tasks. Our approach defines three points in time that are associated with versions of the standard ("implementable" indicates that software implementation is feasible):
Date | Version | Specification includes | Implementable |
Dec. 1, 1997 | 0.1 | General approach | No (demo only) |
Jan. 31, 1998 | 1.0 | Basic features | Yes |
Dec. 31, 1998 | 2.0 | Advanced features | Yes |
Topics that are not to be explicitly covered in Version 1.0 are listed in Section 8.
1.3.3 Implementation Criteria
To accomplish the goals of the standard, what is needed is:
1.3.4 Language Standards: Encoding Semantic Content
The standard must encode three general types of semantic content:
1.3.5 Software Criteria: Creating and Interpreting Semantic Content
Criteria must be developed that relate four classes of software to the standard:
The first two types of software are responsible for creating documents using the proposed standard. The third type of software has the responsibility of interpreting the semantic content encoded by the standard. The fourth type of software provides file and data management services for the display software.
2. BSML Language Overview
The proposed standard is named "Bioinformatic Sequence Markup Language" (BSML), and each part of this name merits attention:
Bioinformatic Sequence Markup Language encodes descriptions of:
Before describing how these descriptions are encoded, some background on the origins of BSML is presented.
2.1 Background
There are good reasons for basing BSML on a number of public standards:
2.2 Source Standards
Four general groups of standards serve as the sources for BSML, including standards for encoding, enabling, and managing:
2.2.1 Standards for Genetic Sequence Information (NCBI's ASN.1 and NCGR's GSDB)
NCBI - the National Center for Biotechnology Information (part of the National Library of Medicine, United States National Institutes of Health) - provides a public domain representation of biological sequences that uses Abstract Syntax Notation One (ASN.1). Although NCBI sequence information may be output in a variety of formats (e.g., by using NCBI's Entrez to export a sequence description as a GenBank flat file), the NCBI databases represent sequences in the ASN.1 format. The NCBI data model provides an excellent basis for the representation of sequences and sequence interrelationships. For more information, see the NCBI website at ncbi.nlm.nih.gov.
A related representation of sequence information has been developed by the National Center for Genome Resources. This representation - Genome Sequence Database (GSDB) Version 1.0 - provides a relational database (Sybase) model for sequence information. For more information see www.ncgr.org. The GSDB schema provides useful representations for a number of sequence features and sequence interrelationships.
2.2.2 Standards for Encoding Semantic Content (SGML)
The NCBI and NCGR standards provide semantic structures that accomplish many of the goals of this project. They do not, however, provide the following (nor were they designed to):
In developing the standard for this project, one choice was to extend the NCBI and/or NCGR standards to accommodate these needs. A second choice was to find existing standards that incorporate solutions for some of these problems and also allow direct incorporation of the NCBI and NCGR sequence representation schemata. We chose the second alternative and selected Standard Generalized Markup Language (SGML) as the framework for representing sequence information (see the World Wide Web Consortium - W3C - at www.w3.org/MarkUp/SGML). There were several reasons for this choice:
In making the decision to use SGML rather than the NCBI or NCGR sequence representations, we decided that the standard should be compatible with both of these encodings as well as with other popular models of sequence representation (e.g., the European Molecular Biology Laboratory's EMBL sequence format). Thus the BSML standard permits bidirectional, automated conversion between BSML and other widely used formats.
2.2.3 Standards for Encoding and Enabling Network Communication (XML)
While SGML provides methods for encoding semantic content, it does not provide directly for the transmission of documents over networks (specifically, transfer over the Internet using http - the hypertext transfer protocol). In 1996, a World Wide Web Consortium SGML working group was formed to develop a simplified version of SGML that could be used on the World Wide Web. The result of this effort was a new standard termed "eXtensible Markup Language" (XML). The development of XML (as of July 1, 1997) is now proceeding under the auspices of the W3C (World Wide Web Consortium).
XML is termed an SGML profile. In contrast to HTML, XML provides standards for semantic encoding and for linking documents over the World Wide Web. (For more information on XML, see www.w3.org/XML/). XML bases its document linking strategies on aspects of the HyTime (ISO/IEC 10744 Hypermedia/Time-based Structuring Language) standard. XML also includes many features of the Text Encoding Initiative (TEI) in its representation of links between document elements.
Using XML, a model for representing information is completely specified by defining a Document Type Definition (DTD). Whereas the DTD specifies how to encode information (e.g., how to represent a sequence and its features), the DTD does not specify how to interpret the semantic content. This job is left to the software that processes a document that uses the DTD. For this reason, our description of the standard includes a discussion of the requirements imposed on document processing software.
Note: XML is also closely related to the DOM (Document Object Model) specification (see www.w3.org/DOM/), which defines standard interfaces for the manipulation of document content.
2.2.4 Standards for Encoding and Managing Display Properties (HTML,CSS,DSSSL)
XML provides a framework for semantic encoding that allows for a one-to-one translation from ASN.1 syntax or from GSDB table schemata. Two ingredients are still required:
The HTML DTD provides a number of tools for representing display properties. For this reason, it was decided to base parts of BSML on the relevant display properties defined in the newest version of HTML, 4.0 (see www.w3.org/TR/PR-html40/).
XML supports two methods for controlling display style:
DSSSL is not used to a great extent yet, although it offers a full set of facilities for controlling formatting. We decided to implement DSSSL (see www.w3.org/Style/#dsssl) support as an advanced feature in BSML 2.0. For now (BSML 1.0), only CSS is supported.
BSML is based in part on the CSS, Level 2 specification (see www.w3.org/TR/WD-CSS2/). In particular, CSS2 defines "paged media," which may include paper, transparencies, or computer screens. For the purpose of presenting sequence maps and displays, this model is more appropriate than the traditional HTML "scrolled media" representation of a document as one (possibly very long) page.
2.3 Semantic Encoding in SGML
The SGML approach to encoding semantic content is through an element-attribute-value data model. Semantic content of a particular type (e.g., a DNA sequence) is termed an element, which is defined in two ways:
2.3.1 Notes on Conventions and Symbols - A Brief SGML/XML/BSML Tutorial
Naming conventions in BSML (XML is case sensitive):
The occurrence of an element in a content model is specified by adding one of three characters to its name:
The relationship between successive elements in a content model is indicated by separators:
Examples:
Attributes are of three general types:
When an attribute is defined, it is assigned one of three types of default value:
2.3.2 XML Element Identification
XML allows any element to carry a unique identifier as one of its attributes. In BSML, every element has such an attribute; it is always named id and is a token of type ID. This model provides a way to refer uniquely to every element (sequence, feature, etc.) defined in a BSML document. Every element also has a title, which is a displayable identifier.
2.3.3 XML Element References
Element references (e.g., a set of sequences referring to each sequence in the set) use attributes of token type IDREF (a reference to one ID) or type IDREFS (a reference to any number of IDs). Validating XML processors automatically ensure that IDs are unique within a document and that references to IDs point to existing elements.
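The two checks on identifiers and references can be illustrated with a short Python sketch. Python's xml.etree.ElementTree parser does not validate against a DTD, so the checks are done by hand here. The Set element, its members attribute, and the function name check_references are hypothetical names chosen for illustration; they are not taken from the BSML DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: "Set" and "members" are illustrative names only.
DOC = """
<Definitions>
  <Sequence id="seq1" title="Sequence one"/>
  <Sequence id="seq2" title="Sequence two"/>
  <Set id="set1" members="seq1 seq2 seq9"/>
</Definitions>
"""

def check_references(root, idrefs_attr="members"):
    """Return (duplicate ids, dangling references) found under root."""
    seen, duplicates = set(), []
    for elem in root.iter():
        ident = elem.get("id")
        if ident is None:
            continue
        if ident in seen:
            duplicates.append(ident)
        seen.add(ident)
    # An IDREFS-style value is a whitespace-separated list of ids.
    dangling = [
        ref
        for elem in root.iter()
        for ref in elem.get(idrefs_attr, "").split()
        if ref not in seen
    ]
    return duplicates, dangling

duplicates, dangling = check_references(ET.fromstring(DOC))
print("duplicate ids:", duplicates)
print("dangling references:", dangling)
```

In this fragment the reference to seq9 is reported as dangling because no element declares that id.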
2.4 Representing Sequences and Their Features
The general approach in BSML is to represent relations among objects of interest in one of two ways:
2.4.1 Separating Content from Display: BSML Document Sections
A BSML document is divided into two main sections:
Note: A BSML document need not contain a Display section if it is used purely to store and transmit sequence information.
The elements comprising the definitions section are discussed in 2.4, 2.5, and 2.6. The elements comprising the display section (including links among elements) are discussed in 2.7, 2.8, and 2.9. The overall structure of a BSML document, combining both sections, is discussed in 2.11.
The most fundamental BSML object is the genetic sequence, which may be a DNA, RNA, or protein sequence. The representation of individual sequences follows the NCBI ASN.1 and NCGR GSDB data models. Additional data structures are defined for dealing with relationships among sequences (2.5) and with sequence data (2.6).
BSML represents a DNA sequence by an element named Sequence. This element is itself composed (in part) of elements defining:
In simplified SGML terminology, the Sequence element is defined as:
ELEMENT Sequence (Source*,Seq-data?,Feature-tables*)
Each Sequence is characterized by a number of required and optional attributes, such as the sequence name, sequence length, shape, number of strands, etc. In SGML terminology, this information is represented as an attribute list (ATTLIST), with each attribute defined by its name, possible values, and default value (this list provides illustrations and is not complete):
ATTLIST Sequence
  name      values             default
  id        ID                 #IMPLIED
  title     CDATA              #IMPLIED
  length    CDATA              #REQUIRED
  shape     circular,linear    #IMPLIED
  strands   1,2                "2"
The SGML model is hierarchical in that higher level elements are composed of one or more lower level objects. Thus, for example, the Feature-tables element defined as part of a Sequence element consists of a number of Feature-table elements, each of which is defined as a set of Feature elements:
ELEMENT Feature-tables (Feature-table*)
ELEMENT Feature-table (Feature*)
Similarly, each Feature may have any number of Locations and Qualifiers associated with it:
ELEMENT Feature (Location|Qualifier)*
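To make the meaning of these simplified content models concrete, the following Python sketch translates a content model into a regular expression and tests whether a list of child element names satisfies it. This is a minimal illustration that handles only the simplified notation used in this section (names, the separators "," and "|", and the occurrence characters "?", "*", and "+"); the function names are hypothetical, and a real SGML or XML parser performs this check as part of validation.

```python
import re

def model_to_regex(model):
    """Translate a simplified content model into a regex over 'Name ' tokens."""
    model = re.sub(r"\s+", "", model)
    # Each element name becomes a group matching that name plus one space.
    # Parentheses, '|', '?', '*' and '+' mean the same thing in a regex, and
    # the sequence separator ',' is expressed by simple concatenation.
    pattern = re.sub(
        r"[A-Za-z][\w.-]*",
        lambda m: "(?:%s )" % re.escape(m.group()),
        model,
    )
    return re.compile(pattern.replace(",", "") + r"\Z")

def children_match(model, child_names):
    """Return True if the ordered child names satisfy the content model."""
    text = "".join(name + " " for name in child_names)
    return model_to_regex(model).match(text) is not None

SEQUENCE_MODEL = "(Source*,Seq-data?,Feature-tables*)"
print(children_match(SEQUENCE_MODEL, ["Source", "Seq-data", "Feature-tables"]))
print(children_match(SEQUENCE_MODEL, ["Seq-data", "Source"]))
print(children_match("(Location|Qualifier)*", ["Qualifier", "Location"]))
```

The second call fails the model because a Source element may not follow Seq-data in the sequence model.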
Note: The representation of information in BSML will normally be transparent to users, just as HTML encoding of web pages is transparent to users. Users will interact with the BSML representation through graphical interfaces that conceal the details of the implementation.
2.4.2 Using BSML Semantic Content
Because XML documents (including BSML) encode the semantic properties of their subject matter, these representations make it relatively straightforward to query the contents of a document. This means that many functions may be developed in software implementations without being explicitly represented in the BSML document. For example, one feature may be said to occur before (5' of), within, or after (3' of) another. Such spatial relations may be extracted from the encoding of the feature table and displayed graphically as the result of ad hoc queries (e.g., "Show all sequences in the set with promoters occurring before CDS features.").
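A query of this kind can be sketched in a few lines of Python. The sketch assumes that each feature has already been reduced to a kind and a start/end coordinate pair on the same strand, so that "before" simply means "ends at a lower coordinate"; the Feature class, the function names, and the sample coordinates are hypothetical and are not part of BSML.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Feature:
    kind: str    # e.g. "promoter" or "CDS"
    start: int   # first position covered by the feature
    end: int     # last position covered by the feature

def relation(a, b):
    """Classify where feature a lies relative to feature b."""
    if a.end < b.start:
        return "before"
    if a.start > b.end:
        return "after"
    if b.start <= a.start and a.end <= b.end:
        return "within"
    return "overlaps"

def sequences_with(sequences, first_kind, second_kind):
    """Names of sequences in which some first_kind lies before some second_kind."""
    return [
        name
        for name, feats in sequences.items()
        if any(
            relation(a, b) == "before"
            for a in feats if a.kind == first_kind
            for b in feats if b.kind == second_kind
        )
    ]

SEQUENCES = {
    "seq1": [Feature("promoter", 10, 60), Feature("CDS", 86, 1276)],
    "seq2": [Feature("CDS", 5, 900), Feature("promoter", 950, 1000)],
}
print(sequences_with(SEQUENCES, "promoter", "CDS"))
```

Nothing in the document itself records the "before" relation; it is derived from the feature locations when the query is run.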
2.4.3 Definitions: Representing Sequence Information
One subsection of the Definitions section is named Sequences, and this element contains the definition of each Sequence included in the document. The hierarchical nature of the sequence organization is clearly revealed by inspection of the (simplified) element definitions shown below:
ELEMENT Sequences (Sequence*)
ELEMENT Sequence (Source*,Seq-data?,Feature-tables*)
ELEMENT Source
ELEMENT Seq-data
ELEMENT Feature-tables (Feature-table*)
ELEMENT Feature-table (Feature*)
ELEMENT Feature (Location|Qualifier)*
ELEMENT Location
ELEMENT Qualifier
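As an illustration of how software might walk this hierarchy, the following Python sketch parses a small instance document and lists every feature location. The element names follow the simplified definitions above; the start and end attributes on Location, the sample values, and the function name list_features are assumptions made for this example and may differ from the BSML DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance; Location's start/end attributes are assumed.
DOC = """
<Sequences>
  <Sequence id="seq1" title="Demo plasmid" length="4361">
    <Feature-tables>
      <Feature-table>
        <Feature id="f1" title="promoter">
          <Location start="10" end="60"/>
        </Feature>
        <Feature id="f2" title="CDS">
          <Location start="86" end="1276"/>
        </Feature>
      </Feature-table>
    </Feature-tables>
  </Sequence>
</Sequences>
"""

def list_features(root):
    """Return (sequence id, feature title, start, end) for every Location."""
    rows = []
    for seq in root.findall("Sequence"):
        for feat in seq.findall("Feature-tables/Feature-table/Feature"):
            for loc in feat.findall("Location"):
                rows.append(
                    (seq.get("id"), feat.get("title"),
                     int(loc.get("start")), int(loc.get("end")))
                )
    return rows

for row in list_features(ET.fromstring(DOC)):
    print(row)
```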
2.4.4 Displaying Sequence Information
Views are the actual display elements that control the visualization of sequences and their features. The display of BSML content is directed to paged media, including computer screens and printed pages. The Display section includes any number of Page elements as its primary units of organization. Each Page may contain any number of View elements, where each View corresponds to the representation of a Sequence.
Each View uses an IDREF attribute to refer to a Sequence by its unique ID attribute value, and the View inherits all characteristics of its reference Sequence. The View may be customized to display a subrange of the complete sequence or to limit the display to selected Features.
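The inheritance described here might be implemented roughly as in the following Python sketch, in which the attributes of the referenced Sequence are copied and then overridden by the attributes set on the View itself. The root element name Bsml, the attribute names seqref, start, and end, and the function name resolve_view are hypothetical placeholders; the actual names are defined by the BSML DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; "Bsml", "seqref", "start" and "end" are assumed names.
DOC = """
<Bsml>
  <Definitions>
    <Sequences>
      <Sequence id="seq1" title="Demo plasmid" length="4361" shape="circular"/>
    </Sequences>
  </Definitions>
  <Display>
    <Page>
      <View id="v1" seqref="seq1" start="1000" end="2000"/>
    </Page>
  </Display>
</Bsml>
"""

def resolve_view(root, view_id):
    """Merge a View's attributes over those of the Sequence it refers to."""
    by_id = {e.get("id"): e for e in root.iter() if e.get("id")}
    view = by_id[view_id]
    sequence = by_id[view.get("seqref")]
    merged = dict(sequence.attrib)   # characteristics inherited from the Sequence
    merged.update(view.attrib)       # the View's own settings take priority
    return merged

print(resolve_view(ET.fromstring(DOC), "v1"))
```

Here the View keeps its own id and subrange while picking up the title, length, and shape of the Sequence.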
2.5 Representing Sequence and Feature Sets
For both display purposes and for the purpose of capturing semantic content, it is often necessary to group sequences and features. Using the id/idref(s) reference system described above, BSML defines a number of types of Set element (included in the Definitions in a subsection called Sets). Through the various types of set elements, BSML provides data structures for representing any of the following:
A set of related features (e.g., a set of restriction sites for a particular restriction enzyme) may be assigned a variety of attribute values and may be organized hierarchically. In this manner, a Set may represent a number of relationships among sequence features:
2.6 Representing Sequence and Feature Data
The Definitions section of a BSML document contains an optional Tables element that includes any number of Table-import or Table elements. Each Table-import and Table allows access to numeric data which may be directly encoded in the document using tabular or hierarchical data structures or which may be accessed from external files. Both summary and detailed data may be accessed and associated with sequences, features, or sets. These associations allow the data to be displayed in a variety of ways. Table-import and Table elements have optional attributes by which they may be associated with reading frames and strands.
2.7 Displaying Sequences, Features, and Sets
BSML sets values for a number of display factors in order to visualize sequence variation (qualities, quantities, and relations):
BSML provides a number of ways for controlling the display of sequences and their features. The following example illustrates the control of basic sequence display.
The next graphic illustrates how sites may be displayed.
The following display illustrates methods for showing sequence feature alignments.
BSML displays link sequences and sequence listings graphically and semantically. The following graphic cannot indicate clearly how these links are activated, but the general idea is conveyed.
2.7.1 Representing Display Objects
The selection of an approach for graphically representing sequence objects was guided by competing requirements for:
There are three general ways to specify how to depict a display object corresponding to a sequence or a sequence feature:
The first option - conceptual description - offers the advantage of simplicity of understanding and use. Often, users will be quite satisfied to let the software decide how to represent features (e.g., big green arrows) so long as the information (gene locations and reading strands) is suitably captured by the display. One problem with this approach is that the display will certainly be different in different vendors' software implementations. This method is best suited to the need for a simple output format to be used by sequence analysis software.
The second option - explicit drawing specification - has the advantage of being self-contained and providing exact instructions. If reasonable default conditions are available (e.g., fonts, line dimensions, and colors), it is not too burdensome to use this method (i.e., every drawing parameter need not be specified). The disadvantage of this method is that different software implementations on different platforms using different output media might have trouble producing the same display. This method is best suited for the need to customize the display using manual editing.
The third option - using an external helper - is attractive in that it permits software implementers to customize the display in any manner they see fit. The problem with this approach is that the helpers must be available and that methods must be defined for passing parameters and for displaying objects in the event that the helper is not available. This method is best suited to the needs of software implementers who wish to use particular display technologies not explicitly defined in this standard.
We decided to support all three approaches, so the graphic specification model allows all three types of description. The three approaches are treated in a hierarchy ranging from lower to higher levels of specification: If an explicit specification is present, it takes precedence over a conceptual instruction. If an external specification is present, it takes precedence over either a conceptual or explicit specification.
Consider, for example, the representation of a gene. This feature will be represented by a Feature element under one of the Feature-table elements of a Sequence. The Feature element is associated with a display object element (Interval-object, in this case). In simplified form, the following examples illustrate how each of the three representation methods might be employed (assuming "genedraw" is an external application that draws genes):
Conceptual: <Interval-object direction="5to3">
Explicit: <Interval-object shape="arrow" color="blue" width="0.04cm">
External: <Interval-object use="genedraw" object="gene" parameters="plus,100,200">
(Technical note: The external reference is presented as an illustration; in fact, BSML does not access external objects in this way.)
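The precedence rule (external over explicit, explicit over conceptual) can be expressed as a small Python sketch. The attribute names are those of the simplified illustrations above, which, as the technical note says, do not reflect the actual BSML mechanism for external objects; the function name choose_rendering is hypothetical.

```python
EXTERNAL_KEYS = ("use", "object", "parameters")
EXPLICIT_KEYS = ("shape", "color", "width")
CONCEPTUAL_KEYS = ("direction",)

def choose_rendering(attrs):
    """Pick the highest-precedence description present on a display object."""
    def pick(keys):
        return {k: attrs[k] for k in keys if k in attrs}

    if "use" in attrs:          # external helper wins over everything else
        return "external", pick(EXTERNAL_KEYS)
    explicit = pick(EXPLICIT_KEYS)
    if explicit:                # explicit drawing wins over conceptual
        return "explicit", explicit
    return "conceptual", pick(CONCEPTUAL_KEYS)

print(choose_rendering({"direction": "5to3"}))
print(choose_rendering({"direction": "5to3", "shape": "arrow", "color": "blue"}))
print(choose_rendering({"shape": "arrow", "use": "genedraw", "object": "gene"}))
```

A display object that carries several kinds of description is therefore drawn from the most specific one, with the others available as fallbacks.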
2.7.2 Graphic Display Objects
BSML supports the display of a variety of specific graphic structures, but also allows a great deal of freedom on the part of software implementers. The display structures are defined by general properties as well as specific attributes. The purpose of these structures is to provide graphic objects to reflect a variety of underlying structures:
In addition to its unique identifier (id) and name (title), each displayable element has attributes that may be set to control its display:
The fundamental graphic representations include single sequence, sequence-pair, and sequence-set view structures. Single sequence display structures include the display of the sequence itself and the data and features associated with the sequence. There are representations for all of the following:
Sequence data may be represented as:
Sequence features may be represented by:
Numerical data associated with individual sequences may be represented on a map as:
Sequence displays may be annotated through the use of a number of display object types:
Multi-sequence representations include:
Sequence-pair representations include:
2.7.3 Textual Display Objects
Most objects that can be displayed graphically can also be displayed textually as hierarchically arranged lists, tables, etc. Most of the implementation of this type of representation is left to the display software, although BSML does provide a few relevant attributes and elements. Another type of textual listing is of sequence data. BSML provides structures to present such listings in separate windows or as components of maps, including:
2.7.4 Representing Sizes, Positions and Dimensions
There are several issues relating to the description of the locations and sizes of display objects:
BSML permits display objects to be located either relative to a sequence or at an absolute location on the page. This distinction is primarily relevant when sequences are moved or their shape is changed (e.g., from linear to circular).
BSML supports both relative and absolute size representations, although relative representations are encouraged (e.g., expressing a font size as 120% of another font size).
A variety of units is supported for absolute (cm, inches) and relative (pixels, em, en, ex, percentage) specification of lengths and other dimensions. The resolution of page coordinates and other dimensions follows the CSS2 guidelines.
BSML allows many location and size specifications to be set at either general or specific levels. General specifications indicate a rough location on a page (e.g., "top") or a general size description (e.g., "large"). Specific levels indicate precise quantities (e.g., 20 pixels).
2.8 Representing Links
Interactive map display requires the ability to link displayed objects to other displayed objects, to underlying sequences and features, and to source documents containing cross-reference information. Fortunately, XML provides a rich set of linking features:
Every element contains an optional set of Link elements, each of which allows the specification of any of the link types indicated above. To accomplish this, each Link element defines the attribute xml-link and assigns it one of several enumerated values (simple, extended, locator, group, or document).
XML also supports "out-of-line" links. This means that the specification of the links between elements is made in a separate element, where each element in the linking set is identified by a locator, e.g.:
<Extended-link>
  <Locator href="#seq1">
  <Locator href="#seq2">
</Extended-link>
This example creates a link structure that can be traversed easily in either direction between the two sequences. In BSML documents, a subsection named Links contains all out-of-line definitions.
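One way software might turn such an out-of-line definition into a structure that can be traversed in either direction is sketched below in Python. The Locator elements are written here as empty-element tags so that the fragment parses as well-formed XML; the function name traversal_table is hypothetical, and the sketch handles only same-document references of the form #id.

```python
import xml.etree.ElementTree as ET

LINKS = """
<Links>
  <Extended-link>
    <Locator href="#seq1"/>
    <Locator href="#seq2"/>
  </Extended-link>
</Links>
"""

def traversal_table(links_root):
    """Map each linked id to the other ids that share an Extended-link with it."""
    table = {}
    for link in links_root.iter("Extended-link"):
        targets = [loc.get("href").lstrip("#") for loc in link.findall("Locator")]
        for target in targets:
            others = [t for t in targets if t != target]
            table.setdefault(target, []).extend(others)
    return table

print(traversal_table(ET.fromstring(LINKS)))
```

Because every participant is recorded against every other participant, the same table serves traversal from seq1 to seq2 and from seq2 to seq1.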
2.8.1 Internal Links
The simplest type of link is to another element in the current document. For example, simple HTML-like links are allowed, such as (# indicates a reference to an id):
<Link href="#seq1">
This link points to the element in the current document with id="seq1".
2.8.2 External Links
External links are to other BSML documents and to non-BSML files (e.g., a graphic image stored in a GIF file). XML external links use URLs (Uniform Resource Locators) of the same type supported in HTML (including the query identifier ? and the fragment identifier # defined in HTML 4.0). Most types of file may be transported across the Internet using the hypertext transfer protocol (http), as required by BSML software.
Any BSML element may use an external link, e.g.:
<Link href="http://www.topogen.com/sbir/rfc.html">
In the case of another BSML (or any XML/SGML) document, a specific element within the document may be selected by adding the fragment identifier # followed by the id of the element:
<Link href="http://www.topogen.com/sbir/demo.bsml#seq1">
This link points to the element with id="seq1" that is contained in document demo.bsml at the www.topogen.com site.
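The separation of a link into a document address and an element id can be sketched with Python's standard urllib.parse.urldefrag function. The helper name split_link is hypothetical; it returns None for whichever part is absent, so that internal and external links can be handled by the same code.

```python
from urllib.parse import urldefrag

def split_link(href):
    """Split an href into (document URL, element id); missing parts are None."""
    url, fragment = urldefrag(href)
    return (url or None, fragment or None)

# Internal link: no document part, only an element id.
print(split_link("#seq1"))
# External link to a whole document.
print(split_link("http://www.topogen.com/sbir/rfc.html"))
# External link to one element inside another document.
print(split_link("http://www.topogen.com/sbir/demo.bsml#seq1"))
```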
2.8.3 Implicit Links
Some links may be "inferred" by the display software from the content of the BSML document. For example, suppose that a document includes several homologous sequences, each of which contains a gene with the same name. A linkage set consisting of the features containing the same-named genes may be constructed by the software without explicit instructions. Links of this sort are called implicit links. Such links are often useful for responding to ad hoc queries and commands ("Highlight all same-named genes.").
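A possible way for display software to construct such a linkage set is sketched below in Python. It assumes the features have already been extracted as (sequence id, feature id, gene name) triples; that representation, the sample gene names, and the function name implicit_link_sets are hypothetical.

```python
from collections import defaultdict

def implicit_link_sets(features):
    """Group features by gene name, keeping names found on more than one sequence."""
    groups = defaultdict(list)
    for seq_id, feat_id, gene in features:
        groups[gene].append((seq_id, feat_id))
    return {
        gene: members
        for gene, members in groups.items()
        if len({seq_id for seq_id, _ in members}) > 1
    }

FEATURES = [
    ("seq1", "f1", "hbb"),
    ("seq2", "f7", "hbb"),
    ("seq2", "f8", "hba"),
]
print(implicit_link_sets(FEATURES))
```

Only the gene name shared by both sequences yields a linkage set; the gene that occurs on a single sequence is ignored.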
Some implicit links may also be inferred from the hierarchical structure of BSML. For example, the features in a feature table are obviously linked implicitly to the sequence containing the feature table.
Another type of implicit link is created by tracking the display history created by user actions (zoom, pan, select, etc.). Implicit links of this type are maintained entirely by the software and are not dealt with further here.
2.8.4 Explicit Links
Explicit links are those created using the Link, Extended-link, Group-link, Document-link and Locator elements. In contrast to HTML documents, explicit links need not be displayed and do not require user selection for their actuation. This makes the XML linking mechanism much more flexible than the HTML linking mechanism.
2.8.5 One-to-One Links
A simple link may be used to represent a one-to-one link between two elements. This may be accomplished in the traditional HTML way by imbedding a link element in each element, or by using a group definition that includes only the two elements.
2.8.6 One-to-Many and Many-to-One Links
One-to-many and many-to-one links are easily enabled in XML documents by using Extended-link elements. For example, selecting a gene on one sequence might result in a pop-up list of related genes on other sequences (which need not be contained in the same document).
2.8.7 Link Actuation and Behavior
In HTML browsers, links are actuated when the user clicks on them, causing the link to be traversed. XML provides a much richer environment for specifying the actions to be taken when a link is actuated. This is accomplished by providing defined attributes that XML software can interpret appropriately. In addition to specifying a resource (href=URL), links may have the following attributes:
By defining suitable behaviors for the link, it is possible to relate particular actions to particular selection events (e.g., show sequence data if the user double-clicks on the sequence).
2.9 Navigation and Selection
The BSML standard allows elements (sequences, etc.) to be selected, although the standard does not specify how selection should be represented (graying, highlighting, etc.). "Navigation" refers to changing the focus from one selected element (or set) to another, and changing selected objects may be tied to changing displayed pages, etc. The concept of element selection is defined in part by the software implementation and in part by the linking behaviors imbedded in the BSML document. BSML supports four general modes of navigation.
2.9.1 Graphic
Graphic navigation refers to selection by pointing to, or clicking on, a graphically displayed object. Each graphic representation may point to an underlying element in the BSML hierarchy (e.g., a sequence). Graphic selection may be accompanied by other actions, such as the display of a popup menu showing navigational options.
2.9.2 Element Hierarchy
The currently selected element (e.g., currently highlighted sequence) points to a location in the BSML hierarchy. Element hierarchy navigation changes the focus by selecting a related element (next sibling, child, parent, etc.) in the hierarchy.
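As an illustration only, element hierarchy navigation might be sketched in Python as follows. The helper names (build_parent_map, parent_of, first_child, next_sibling) and the sample document are assumptions made for this sketch and are not defined by BSML. The sample is written as well-formed XML so that the standard library parser accepts it.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: two sequences, the first with one feature.
SAMPLE = (
    '<Sequences>'
    '<Sequence id="a"><Feature id="f1"/></Sequence>'
    '<Sequence id="b"/>'
    '</Sequences>'
)


def build_parent_map(root):
    """ElementTree keeps no parent pointers, so record them once."""
    return {child: parent for parent in root.iter() for child in parent}


def parent_of(elem, parents):
    """Return the parent element, or None for the root."""
    return parents.get(elem)


def first_child(elem):
    """Return the first child element, or None if there are no children."""
    return elem[0] if len(elem) else None


def next_sibling(elem, parents):
    """Return the following sibling, or None if elem is the last child."""
    parent = parents.get(elem)
    if parent is None:
        return None
    siblings = list(parent)
    position = siblings.index(elem)
    return siblings[position + 1] if position + 1 < len(siblings) else None


root = ET.fromstring(SAMPLE)
parents = build_parent_map(root)
selected = root[0]                       # the currently selected element
selected = next_sibling(selected, parents)
print(selected.get("id"))
```

A browser would typically bind moves of this kind to keystrokes or menu commands and then update whatever visual representation of selection it has chosen.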
2.9.3 Linked
Linked navigation is based upon the explicitly defined links described in 2.8. User selections may involve individual elements or sets of elements.
2.9.4 Query
Query navigation is based upon processing a query of the current set of elements (e.g., "Show me all sequences with globin genes."). A query returns a list of candidates. When used for navigation, the query list display permits selection of one or more candidates.
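A query of this kind might be sketched as follows. This is a minimal illustration, not a defined BSML query mechanism: the function name sequences_with_gene, the case-insensitive substring match, and the sample data are all assumptions. The element and attribute names (Sequence, Feature, Qualifier, type, qual-value) follow the examples in section 3.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, written as well-formed XML.
SAMPLE = """
<Sequences>
  <Sequence id="SEQ1">
    <Feature id="FTR1"><Qualifier type="gene" qual-value="beta-globin"/></Feature>
  </Sequence>
  <Sequence id="SEQ2">
    <Feature id="FTR2"><Qualifier type="gene" qual-value="rpoB"/></Feature>
  </Sequence>
</Sequences>
"""


def sequences_with_gene(root, needle):
    """Return the ids of sequences having a gene qualifier that contains needle."""
    needle = needle.lower()
    candidates = []
    for sequence in root.iter("Sequence"):
        for qualifier in sequence.iter("Qualifier"):
            value = qualifier.get("qual-value", "").lower()
            if qualifier.get("type") == "gene" and needle in value:
                candidates.append(sequence.get("id"))
                break  # one match is enough to list this sequence
    return candidates


root = ET.fromstring(SAMPLE)
print(sequences_with_gene(root, "globin"))
```

The returned list corresponds to the candidate list from which the user then selects one or more elements.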
2.10 Controlling Display Style
BSML follows the general approach of HTML and XML in using style sheets. The class property is used to provide formatting instructions for particular groups of elements, and formatting may be applied through style attributes at any level.
BSML requires the processing software to supply default values for all display attributes. Some of these values provide base level definitions that are used in resolving relative attribute values. For example, if the base font has size=12pt, another font may be specified as:
<Font size=125%>
In this example, the resulting font would have size=15pt. Virtually all dimensions may be specified relative to a base level.
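The arithmetic can be sketched in a few lines of Python. The function name resolve_size and the restriction to the pt and % units are assumptions made for this illustration; only the percentage-of-base rule comes from the text above.

```python
def resolve_size(value: str, base_pt: float) -> float:
    """Resolve a size such as '125%' or '10pt' to a value in points.

    Hypothetical helper: BSML does not define this function.
    """
    value = value.strip()
    if value.endswith("%"):
        # Relative value: scale the base-level size.
        return base_pt * float(value[:-1]) / 100.0
    if value.endswith("pt"):
        # Absolute value: use as given.
        return float(value[:-2])
    raise ValueError(f"unsupported size: {value!r}")


print(resolve_size("125%", 12.0))  # 15.0
```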
2.10.1 Style Sheets
A style sheet provides values for display attributes. A default style sheet is required for all documents, and a BSML browser is expected to provide suitable values for all unspecified attributes. (Note: This discussion is based on using CSS for style sheets, not the to-be-developed XS style mechanism derived from DSSSL.)
The simplest type of CSS specification names an element (the selector) and specifies a value for one of its attributes (the declaration), e.g., the following specification states that feature elements should use a font with its size set to 10 pt:
Feature {font-size : 10pt;}
Any number of declarations may be made inside a block, and declarations may apply to more than one element, e.g.:
Feature,Qualifier {font-size : 10pt; font-color : blue;}
When two element names are separated by a space rather than by a comma, the declaration has a very different meaning: the declaration applies only to elements of the second type that are included within elements of the first type. For example, the following declaration applies to the line-color attribute of an Interval element that is contained in a Feature element:
Feature Interval {line-color : red;}
A selector may use attribute values to limit its applicability. The following example applies a line color declaration only to Feature elements contained within Sequence elements for which the strands attribute is set to "2":
Sequence(strands="2") Feature {line-color : black;}
All displayable elements in BSML have an attribute named class. The purpose of this attribute is to allow style declarations to be applied selectively to elements that are members of a class. CSS provides a short-hand method for supplying class values, e.g., the following two statements are equivalent:
Interval.heavy {width : 20px;}
Interval(class="heavy") {width : 20px;}
To apply style to all elements of a class, the class name itself is provided after a period, e.g.:
.heavy {width : 20px;}
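To make the selector rules concrete, the following Python sketch tests whether an element matches a selector written in the notation used in this section: element names, comma-separated groups, space-separated containment, (attribute="value") conditions, and the .class shorthand. It is an illustration under stated assumptions, not a CSS implementation: the function names are hypothetical, and attribute values containing spaces or commas are not handled.

```python
import re
import xml.etree.ElementTree as ET

# One simple selector: optional name, optional .class, optional (attr="value").
SIMPLE = re.compile(
    r'^(?P<name>[\w-]+)?'
    r'(?:\.(?P<cls>[\w-]+))?'
    r'(?:\((?P<attr>[\w-]+)="(?P<val>[^"\s,]*)"\))?$'
)


def matches_simple(elem, simple):
    """Test one element against one simple selector."""
    m = SIMPLE.match(simple)
    if m is None:
        raise ValueError(f"unsupported selector: {simple!r}")
    if m["name"] and elem.tag != m["name"]:
        return False
    if m["cls"] and elem.get("class") != m["cls"]:
        return False
    if m["attr"] and elem.get(m["attr"]) != m["val"]:
        return False
    return True


def matches(elem, selector, parents):
    """True if elem matches any comma-separated alternative in selector."""
    for alternative in selector.split(","):
        parts = alternative.split()
        if not parts or not matches_simple(elem, parts[-1]):
            continue
        # The remaining parts must match enclosing elements, innermost first.
        remaining = parts[:-1]
        ancestor = parents.get(elem)
        while remaining and ancestor is not None:
            if matches_simple(ancestor, remaining[-1]):
                remaining.pop()
            ancestor = parents.get(ancestor)
        if not remaining:
            return True
    return False


doc = ET.fromstring(
    '<Sequence strands="2"><Feature><Interval class="heavy"/></Feature></Sequence>'
)
parents = {child: parent for parent in doc.iter() for child in parent}
feature = doc.find("./Feature")
interval = doc.find("./Feature/Interval")
print(matches(interval, "Feature Interval", parents))
print(matches(feature, 'Sequence(strands="1") Feature', parents))
```

The first call reports a match because the Interval is contained in a Feature; the second does not, because the enclosing Sequence has strands set to "2".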
2.10.2 Applying Style to Elements
Display attributes may be set at all levels, from the complete document to the individual element. CSS uses an inheritance mechanism to determine the attributes for a particular element:
For example, consider the formatting of text associated with the display of a gene. Suppose that this element is identified as a Feature with the attribute type="gene". The following rules would determine the font to be used in displaying the name of this gene:
For more information on style sheets, see www.w3.org/TR/WD-CSS2/.
2.11 Structure of a BSML Document
SGML documents (including HTML, XML, and BSML) mark their contents using tags. Tags usually are paired, with an opening tag and a closing tag enclosing the content to which they refer. For example, in an HTML document, the <p> and </p> tags enclose a paragraph of text. In the same way that a web page document begins with the <html> tag and ends with the </html> tag, a BSML document begins with <Bsml> and ends with </Bsml>. Between these tags are two major sections:
Thus the overall structure of a BSML document looks like this (note that XML is case-sensitive, so bsml is not the same as Bsml):
<Bsml>
  <Definitions> ... </Definitions>
  <Display> ... </Display>
</Bsml>
Within the Definitions section, several subsections may be included:
Within the Display section, the display of information is organized by Page elements, each of which may contain any number of View elements. Each View may contain a reference to one of the Sequence definitions in the Sequences subsection of the Definitions section. Thus a complete BSML document might have the following sections (indentation is used for clarity):
<Bsml>
  <Definitions>
    <Sequences>
      <Sequence id="sv40"> ... </Sequence>
    </Sequences>
    <Sets>
      <Set> ... </Set>
    </Sets>
    <Tables>
      <Table> ... </Table>
    </Tables>
  </Definitions>
  <Display>
    <Page>
      <View seqref="sv40"> ... </View>
    </Page>
  </Display>
</Bsml>
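The relationship between a View and the Sequence it references can be sketched as follows. This is a minimal illustration, assuming a well-formed XML rendering of the skeleton above. The function name resolve_views and the decision to raise an error on a dangling reference are assumptions, not requirements of the standard.

```python
import xml.etree.ElementTree as ET

# Reduced, well-formed version of the document skeleton.
BSML = """
<Bsml>
  <Definitions>
    <Sequences>
      <Sequence id="sv40"/>
    </Sequences>
  </Definitions>
  <Display>
    <Page>
      <View seqref="sv40"/>
    </Page>
  </Display>
</Bsml>
"""


def resolve_views(root):
    """Pair each View with the Sequence definition named by its seqref."""
    by_id = {
        seq.get("id"): seq
        for seq in root.findall("./Definitions/Sequences/Sequence")
    }
    pairs = []
    for view in root.findall("./Display/Page/View"):
        ref = view.get("seqref")
        if ref is not None and ref not in by_id:
            raise KeyError(f"View refers to undefined sequence {ref!r}")
        # A View without seqref is paired with None.
        pairs.append((view, by_id.get(ref)))
    return pairs


for view, sequence in resolve_views(ET.fromstring(BSML)):
    print(view.get("seqref"), "->", sequence.tag, sequence.get("id"))
```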
Note: Other examples will be placed on our website as BSML evolves and implementable versions of the DTD are released. (See also the output samples shown in 2.7.) The examples included here are intended to illustrate very basic properties that reveal the structure of BSML documents.
<!DOCTYPE Bsml SYSTEM "bsml.dtd">
<Bsml>
  <Definitions>
    <Sequences>
      <Sequence id="SEQ1" title="ECRPOBC" seq-type="dna" units="bp" length=12337 shape="linear" strands=2>
      </Sequence>
    </Sequences>
  </Definitions>
  <Display>
    <Page>
      <View id="VEW1" seqref="SEQ1">
      </View>
    </Page>
  </Display>
</Bsml>
Depending upon default conditions, this document might produce a display such as:
3.2 Sequence with Default Interval Feature Display
This example adds two features that represent genes. These features are given the attribute display-auto="1", which instructs the software to create a display object using default conditions.
<!DOCTYPE Bsml SYSTEM "bsml.dtd">
<Bsml>
  <Definitions>
    <Sequences>
      <Sequence id="SEQ1" title="ECRPOBC" seq-type="dna" units="bp" length=12337 shape="linear" strands=2>
        <Feature-tables id="FTS1">
          <Feature-table id="FTB1">
            <Feature id="FTR1" title="rpoB" type="CDS" direction="right" display-auto="1">
              <Interval-loc start-pos="2969" end-pos="6994">
              <Qualifier type="gene" qual-value="rpoB">
            </Feature>
            <Feature id="FTR2" title="rpoC" type="CDS" direction="right" display-auto="1">
              <Interval-loc start-pos="7074" end-pos="11294">
              <Qualifier type="gene" qual-value="rpoC">
            </Feature>
          </Feature-table>
        </Feature-tables>
      </Sequence>
    </Sequences>
  </Definitions>
  <Display>
    <Page>
      <View id="VEW1" seqref="SEQ1">
      </View>
    </Page>
  </Display>
</Bsml>
A default display of this document might look like this (note that the direction attribute is set to "right" for both genes, indicating the direction for the arrow):
3.3 Sequence with Point Feature Display
The next example illustrates the explicit definition of a display object matched to a feature. In this case, a restriction site on the sequence is shown. Note also that the View object now overrides the default number of strands and produces a single-stranded display.
<!DOCTYPE Bsml SYSTEM "bsml.dtd">
<Bsml>
  <Definitions>
    <Sequences>
      <Sequence id="SEQ1" title="ECRPOBC" seq-type="dna" units="bp" length=12337 shape="linear" strands=2>
        <Feature-tables id="FTS1">
          <Feature-table id="FTB1">
            <Feature id="FTR1" title="EcoRI" type="user-defined">
              <Site-loc site-pos="5000">
            </Feature>
          </Feature-table>
        </Feature-tables>
      </Sequence>
    </Sequences>
  </Definitions>
  <Display>
    <Page>
      <View id="VEW1" seqref="SEQ1" strands="1">
        <Point-object featureref="FTR1" on-strand="plus" caption="EcoRI restriction site">
      </View>
    </Page>
  </Display>
</Bsml>
This document might produce a display such as the following:
4. BSML Formal Language Specification
This section introduces Document Type Definitions (DTDs) and describes some of the basic features of the BSML specification. The latest version of the DTD is bsml.dtd.
4.1 Introduction to Document Type Definitions (DTDs)
The complete DTD for BSML specifies the format and content of a BSML document. A DTD defines a model of semantic content by using elements and their attributes. Elements define the basic objects represented by the DTD and are specified using the format:
<!ELEMENT element-name start-tag-omission end-tag-omission element-contents>
For example, the BSML element Feature-table is defined as follows:
<!ELEMENT Feature-table - - (Feature*,Display?)>
The content model may indicate the occurrence and order of elements by using connector (,|) and occurrence (?+*) indicators. Attributes are described for each element in any number of declarations of the following type:
<!ATTLIST element-name (attribute-name declared-value default-value)*>
For example, the BSML element Sequence has a number of attributes, such as:
<!ATTLIST Sequence shape (linear,circular) "linear" strands (1,2) "2">
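The effect of such a declaration on a processor can be sketched as follows. A validating parser would normally apply declared defaults itself, so this Python fragment only illustrates the outcome. The table SEQUENCE_ATTLIST is a hand-written mirror of the declaration above, and the function name apply_defaults is hypothetical.

```python
# Assumed mirror of the ATTLIST above: (allowed values, default) per attribute.
SEQUENCE_ATTLIST = {
    "shape": (("linear", "circular"), "linear"),
    "strands": (("1", "2"), "2"),
}


def apply_defaults(given):
    """Return the effective attributes of a Sequence element.

    Declared attributes receive their default when omitted and are checked
    against the allowed values; other attributes pass through unchanged.
    """
    effective = {}
    for name, (allowed, default) in SEQUENCE_ATTLIST.items():
        value = given.get(name, default)
        if value not in allowed:
            raise ValueError(f"{name}={value!r} is not one of {allowed}")
        effective[name] = value
    for name, value in given.items():
        effective.setdefault(name, value)
    return effective


print(apply_defaults({"id": "SEQ1", "shape": "circular"}))
```

Here the omitted strands attribute takes its declared default of "2", while the supplied shape value is kept.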
4.2 Basic Elements and Entities
For now, BSML documents use only the XML default character set (ISO10646 UTF-8). Because XML may change its standards for attribute value typing, BSML uses parameter entities to represent fundamental numeric types:
<!ENTITY % integer "CDATA">
<!ENTITY % real "CDATA">
For now, strong typing is not supported in XML, so specifying that an attribute is an integer or real is the same as specifying that it is character data (CDATA). Even so, this usage makes it clearer how the attribute value is to be treated.
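Because the parser delivers every attribute value as character data, any numeric interpretation falls to the processing software. A minimal sketch of that step follows. The table ATTRIBUTE_TYPES and the function typed_attributes are hypothetical, and the types assigned to length, start-pos and end-pos are assumptions based on how those attributes are used in the examples of section 3.

```python
# Assumed mapping from (element, attribute) to the intended numeric type.
ATTRIBUTE_TYPES = {
    ("Sequence", "length"): int,
    ("Interval-loc", "start-pos"): int,
    ("Interval-loc", "end-pos"): int,
}


def typed_attributes(tag, attrs):
    """Convert CDATA attribute strings to the types the DTD intends.

    Attributes without an entry in ATTRIBUTE_TYPES stay as strings.
    """
    converted = {}
    for name, raw in attrs.items():
        convert = ATTRIBUTE_TYPES.get((tag, name), str)
        converted[name] = convert(raw)
    return converted


print(typed_attributes("Sequence", {"title": "ECRPOBC", "length": "12337"}))
```

A value that cannot be converted (for example length="long") raises ValueError, which gives the postprocessor a natural place to report a typing error that the parser itself cannot detect.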
BSML uses some HTML 4.0 definitions to represent basic display properties (colors and fonts). These are specified in the DTD.
XML is defined as an SGML profile. While BSML is written to conform with the XML standard, it must be realized that XML itself is not complete. Some issues that need to be resolved include:
Despite these problems, XML appears destined to play a major role in Internet communication and semantic encoding of information. For this reason, basing BSML on XML appears to be a sound decision, even if later modifications to XML (and/or SGML) require some changes to the BSML standard. The basic semantic structure represented by BSML is unlikely to be affected by these changes.
For now, XML and SGML compliance is probably best assured by using a validating SGML parser such as NSGMLS (see 6.4) as the front end for a BSML-specific postprocessor.
The BSML DTD will be created in stages. The first stage (Version 0.1) sketches the basic outline of a BSML document. An implementable version of the DTD will be gradually created for release as Version 1.0 about Jan. 31, 1998. The current version of the DTD is located in file bsml.dtd. To download the latest version of the DTD, see downloads.
Note: The subsections of this section will be extended in response to feedback on the RFC.
The internal representation of the BSML standard is likely to be quite similar to the external representation, using a tree structure to represent the hierarchy among sets, sequences and features. The specific implementation of a BSML browser will depend upon the purposes for which the browser is used.
5.1 Overall Browser Capabilities
The fundamental task of a BSML browser is to display sequences and their features and annotation.
For strictly local applications, the browser need not be Internet-aware. For many linking purposes, however, the browser must be able to process requests for documents located anywhere on the Internet. This requires the ability to process and resolve URLs and to use the Hypertext Transfer Protocol (HTTP) to transfer information to the client.
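URL resolution of this kind can be sketched with the Python standard library. The function name resolve_link and the example.org addresses are hypothetical, and treating the fragment as the id of an element within the target document is an assumption rather than something the standard specifies. No network transfer is performed here; the sketch covers only the resolution step.

```python
from urllib.parse import urldefrag, urljoin


def resolve_link(base_url, href):
    """Resolve href against the URL of the current document.

    Returns the absolute URL of the target document and the fragment
    (assumed here to name an element id), or '' if there is none.
    """
    absolute = urljoin(base_url, href)
    url, fragment = urldefrag(absolute)
    return url, fragment


base = "http://example.org/data/sv40.bsml"
print(resolve_link(base, "genes/globin.bsml#FTR1"))
print(resolve_link(base, "#SEQ1"))
```

The second call shows a link whose target lies in the current document: the URL is unchanged and only the fragment differs.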
A browser must be able to support the four navigation modes described in 2.9: graphic, element hierarchy, linked, and query.
A graphic browser needs basic graphic capabilities and, obviously, must support drawing of the basic display objects (lines, arcs, fills, symbols, etc.). More advanced browsers will let users select dimensions to be varied for the purpose of visualizing sequence and feature properties (e.g., varying color saturation or line length to reflect quantitative or qualitative variation in properties of interest).
Although the primary purpose for developing this standard is to facilitate graphic sequence display, there is no requirement that a browser display information graphically. The information contained in a BSML document may easily be viewed in purely textual formats. This approach may be adequate when using the BSML encoding to select, transfer and convert sequence information.
The functionality of the BSML standard should benefit greatly from the development of a number of conversion utilities. In general, this means developing applications that read files in one format (e.g., EMBL sequence files) and output files in another format (e.g., BSML). The following are likely candidates for conversion utilities:
Another area in which conversion utilities may be useful is in incorporating data in standard interchange formats such as the DIF format for data exchange or the RTF format for text exchange. Formats of this type are often available as export formats from many applications.
A somewhat different type of conversion may also be useful - converting information from BSML documents to HTML documents, for publication on the web and on intranets. There are two general ways in which to accomplish this, both of which should be relatively simple to implement:
6. BSML Software Specifications
-- to be written --
6.1 Manual Creation and Editing Requirements
-- to be written --
6.2 Automatic Creation Requirements
-- to be written --
6.3 BSML Processing Requirements
Software (e.g., a "browser") processing BSML documents will generally not work directly with the source BSML file. Rather, such software will usually be written as a back-end postprocessor that uses as its input the output from a front end that performs preprocessing on the document. For now, the most likely front end will be NSGMLS, an SGML parser written by James Clark and released to the public domain. The output of NSGMLS is in the format known as ESIS (Element Structure Information Set). This output is easier to work with and will usually serve as the input to a BSML postprocessor.
6.4 Entity Manager Requirements
BSML documents may contain a variety of references to external documents, including linked resources, external parameter entities, and data entities defined through notation syntax. For example, the following definitions might be included in a DTD:
<!ENTITY testfile SYSTEM "test.bmp" NDATA BMP>
<!NOTATION BMP SYSTEM "BMP Format">
<!ELEMENT Picture - - (#PCDATA)>
<!ATTLIST Picture img ENTITY #REQUIRED>
In the document itself, the following element might be included:
<Picture img = testfile></Picture>
The ESIS output from this line using NSGMLS (see 6.3) will be:
sBMP Format
NBMP
stest.bmp
f<OSFILE FIND>test.bmp
Etestfile NDATA BMP
AIMG ENTITY testfile
(PICTURE
)PICTURE
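ESIS output is line-oriented: each record occupies one line, and its first character identifies the record type. A BSML postprocessor might begin with a reader along the following lines. This is a sketch only. It interprets just the attribute (A), start-tag ( ( ) and end-tag ( ) ) records from the sample and passes every other record through untouched. The function name parse_esis and the event tuples are assumptions.

```python
SAMPLE_ESIS = """sBMP Format
NBMP
stest.bmp
f<OSFILE FIND>test.bmp
Etestfile NDATA BMP
AIMG ENTITY testfile
(PICTURE
)PICTURE
"""


def parse_esis(text):
    """Turn ESIS records into (kind, name_or_line, attributes) events."""
    events = []
    pending = {}  # attribute records precede the start tag they belong to
    for line in text.splitlines():
        if not line:
            continue
        code, rest = line[0], line[1:]
        if code == "A":
            # Form: name, declared type, optional value.
            name, kind, *value = rest.split(" ", 2)
            pending[name] = (kind, value[0] if value else None)
        elif code == "(":
            events.append(("start", rest, pending))
            pending = {}
        elif code == ")":
            events.append(("end", rest, {}))
        else:
            # Notation, entity, and other records: kept verbatim.
            events.append(("other", line, {}))
    return events


for event in parse_esis(SAMPLE_ESIS):
    print(event)
```

The records before the attribute line describe the notation and the external entity. An entity manager would use them to locate test.bmp when the IMG attribute is processed.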
The role of the entity manager is to resolve file references and supply appropriate files to the processor. Some references are resolved through the use of a catalog that maps SGML identifiers to physical file names. Further discussion of this topic is beyond the scope of this document.
7. Relationship of BSML to Other Initiatives
This section is to be completed as needed in response to feedback on the RFC.
To be written.
To be written.
8. BSML 2.0 - Advanced Topics to be Implemented
This section will be supplemented by topics developed through responses to the RFC.
To be written.
To be written.
To be written.