[This local archive copy (text only) mirrored from the canonical site: http://www.topogen.com/sbir/rfc.html; links may not have complete integrity, so use the canonical document at this URL if possible.]

Bioinformatic Sequence Markup Language (BSML)
Request for Comments: 971201
Obsoletes: 970901

TopoGEN, Inc. (www.topogen.com)
1275 Kinnear Road
Columbus, OH 43212
USA

Direct email responses to: joe.topogen@iwaynet.net
RFC current version (this document): www.topogen.com/sbir/rfc.html
Definition of standard: BSML.DTD (latest version)

Bioinformatic Sequence Markup Language (BSML):
A Public Domain Protocol for Graphic Genomic Displays

Status of this Document

This document specifies a public domain standard for the encoding and display of DNA, RNA and protein sequence information (the project is funded by a grant from the National Human Genome Research Institute). The document requests discussion and suggestions for improvement, and distribution of the document is unlimited. Responses to this document may be posted for public discussion, under the respondent's name or anonymously, unless such posting is explicitly prohibited in the response (include "Do not post" or "Post anonymously.").

Table of Contents

1. Introduction
1.1 Need for a Standard
1.2 Goals and Criteria
1.3 Implementation Requirements
1.3.1 Development and Formalization of the Standard
1.3.2 Stages of Implementation
1.3.3 Implementation Criteria
1.3.4 Language Standards: Encoding Semantic Content
1.3.5 Software Criteria: Creating and Interpreting Semantic Content

2. BSML Language Overview
2.1 Background
2.2 Source Standards
2.2.1 Standards for Genetic Sequence Information (NCBI's ASN.1 and NCGR's GSDB)
2.2.2 Standards for Encoding Semantic Content (SGML)
2.2.3 Standards for Encoding and Enabling Network Communication (XML)
2.2.4 Standards for Encoding and Managing Display Properties (HTML,CSS,DSSSL)
2.3 Semantic Encoding in SGML
2.3.1 Notes on Conventions and Symbols - A Brief SGML/XML/BSML Tutorial
2.3.2 XML Element Identification
2.3.3 XML Element References
2.4 Representing Sequences and Their Features
2.4.1 Separating Content from Display: BSML Document Sections
2.4.2 Using BSML Semantic Content
2.4.3 Definitions: Representing Sequence Information
2.4.4 Displaying Sequence Information
2.5 Representing Sequence and Feature Sets
2.6 Representing Sequence and Feature Data
2.7 Displaying Sequences, Features, and Sets
2.7.1 Representing Display Objects
2.7.2 Graphic Display Objects
2.7.3 Textual Display Objects
2.7.4 Representing Sizes, Positions and Dimensions
2.8 Representing Links
2.8.1 Internal Links
2.8.2 External Links
2.8.3 Implicit Links
2.8.4 Explicit Links
2.8.5 One-to-One Links
2.8.6 One-to-Many and Many-to-One Links
2.8.7 Link Actuation and Behavior
2.9 Navigation and Selection
2.9.1 Graphic
2.9.2 Element Hierarchy
2.9.3 Linked
2.9.4 Query
2.10 Controlling Display Style
2.10.1 Style Sheets
2.10.2 Applying Style to Elements
2.11 Structure of a BSML Document

3. BSML Examples
3.1 Basic Sequence Display
3.2 Sequence with Default Interval Feature Display
3.3 Sequence with Point Feature Display

4. BSML Formal Language Specification
4.1 Document Type Definitions (DTDs)
4.2 Basic Elements and Entities
4.3 XML and SGML Compliance
4.4 BSML Document Type Definition (DTD)

5. BSML Browser Overview
5.1 Overall Browser Capabilities
5.2 Communication Criteria
5.3 Navigation Criteria
5.4 Visualization Criteria
5.5 Textual Browsers
5.6 Conversion Utilities

6. BSML Software Specifications
6.1 Manual Creation and Editing Requirements
6.2 Automatic Creation Requirements
6.3 BSML Processing Requirements
6.4 Entity Manager Requirements

7. Relationship of BSML to Other Initiatives
7.1 Biowidget Consortium
7.2 CORBA

8. BSML 2.0: Advanced Topics to be Implemented
8.1 Password Protection
8.2 Encryption
8.3 DSSSL Support


1. Introduction

The primary purpose of this project, funded by an SBIR grant from the National Human Genome Research Institute, is to develop a public domain protocol for graphic genomic displays (previously published background is available at www.topogen.com/sbir/pubgraph.html). This section provides an overview of the rationale for the standard.


1.1 Need for a Standard

There are currently many sources for graphic displays of sequences (chromosome, genetic, and physical maps of a variety of types). These include displays produced by:

  1. Commercial software (e.g., PCGene or MapKit).
  2. Software associated with public domain databases (e.g., NCBI's Sequin).
  3. Software that is not specialized for sequence maps (e.g., Microsoft's PowerPoint).

A public domain standard is needed for sequence representation and display because currently there are:

  1. No generally accepted methods for associating sequence features with display features.
  2. No simple ways to move graphic display information from one platform to another.
  3. No standards that allow sequence analysis software (i.e., non-map-displaying software) to export maps as analysis products; each software system, if it provides graphic output, does so in its own proprietary manner.


1.2 Goals and Criteria

A standard for representing sequences and their graphic display properties should:

  1. Describe the features of genetic sequences.
  2. Represent relationships among sequences and their features.
  3. Define graphic objects that represent sequence features and relationships.
  4. Provide representation of the relationships between sequences and source documents such as sequence and genetic marker databases.
  5. Define methods for storing and transmitting encoded sequence and graphic information.

The storage and representation of sequence information should be:

  1. Platform and programming language independent.
  2. Human-legible, for ease of understanding and tutorial purposes.
  3. Simple to create so that it may be implemented easily by non-display applications such as sequence analysis software.
  4. Easily transmitted over the Internet and other networks.
  5. Secure, providing password protection and encryption as required.

Although the standard need not explicitly define software requirements for its implementation, some requirements are strongly implied by the formulation. Software for document creation should hide the underlying data representation system from users and provide:

  1. A graphic interface for the creation and editing of display objects.
  2. Server functions for client applications that need to create sequence displays.

Software for the display of the encoded information (a "BSML sequence browser") should provide:

  1. A graphic interface for sequence manipulation.
  2. Visualization of sequence properties and relationships.
  3. Navigation within and between sequences.
  4. Communication and information sharing within research groups and broadly among members of the scientific community.
  5. Support for maintaining data security (e.g., password validation services).


1.3 Implementation Requirements

1.3.1 Development and Formalization of the Standard

This document proposes a general approach for the standard, but does not include all implementation details. Publication of the initial working version of the standard (BSML 1.0) will occur around Jan. 31, 1998, after incorporating revisions to the current specification (BSML 0.1). Publication of BSML 1.0 will be accompanied by a number of formalization actions, including establishment of:

  1. A "governing body" to control revisions of the standard.
  2. A review and revision procedure.
  3. Methods for dealing with problems and for publishing interim solutions to these problems pending revisions of the standard.

1.3.2 Stages of Implementation

It seems certain that a useful standard must evolve over time. Consequently, we propose to begin with a limited standard that accomplishes some fundamental tasks but does not attempt to accomplish all tasks. Our approach defines three points in time that are associated with versions of the standard ("implementable" indicates that software implementation is feasible):

    Date           Version  Specification includes  Implementable
    Dec. 1, 1997   0.1      General approach        No (demo only)
    Jan. 31, 1998  1.0      Basic features          Yes
    Dec. 31, 1998  2.0      Advanced features       Yes

Topics that are not to be explicitly covered in Version 1.0 are listed in Section 8.

1.3.3 Implementation Criteria

Accomplishing the goals of the standard requires:

  1. Rules for defining the features of sequences and their attributes, which may be accomplished in part by identifying acceptable source file formats such as the GenBank, EMBL, and DDBJ formats.
  2. Rules for defining display objects (e.g., arrows) and their properties.
  3. Rules for assigning features and their properties to display objects and their properties.
  4. Rules for defining general attributes of map display (units of measurement, page dimensions, margins, etc.).
  5. Rules for accessing source files needed for interactive functions.
  6. Encoding methods that permit password protected encryption and decryption.
  7. Representation methods that permit information to be easily transmitted over networks, including the Internet.
  8. Definitions of default conditions that allow for automatic map creation and map display with minimal user input.
  9. Provision for customization of default map creation and map display conditions so that users can control the appearance of maps.
  10. Software for the creation and display of the encoded information.

1.3.4 Language Standards: Encoding Semantic Content

The standard must encode three general types of semantic content:

  1. Information associated with sequences themselves (e.g., sequence name, size, and features) and sequence interrelationships (e.g., alignments).
  2. Information associated with sequence display (e.g., a gene displayed as an arrow).
  3. Information associated with links among sequences and between sequences and source documents (database records, publications, etc.).

1.3.5 Software Criteria: Creating and Interpreting Semantic Content

Criteria must be developed that relate four classes of software to the standard:

  1. Software for manually creating and editing map displays.
  2. Software for automatically creating map displays as products of sequence analyses.
  3. Software for displaying maps (and other sequence representations).
  4. Software for resolving network references to documents and databases.

The first two types of software are responsible for creating documents using the proposed standard. The third type of software has the responsibility of interpreting the semantic content encoded by the standard. The fourth type of software provides file and data management services for the display software.


2. BSML Language Overview

2.1 Background

The proposed standard is named "Bioinformatic Sequence Markup Language" (BSML), and each part of this name merits attention:

  1. Bioinformatic: the standard is concerned with the representation of biological information.
  2. Sequence: the standard is specifically concerned with information about DNA, RNA and protein sequences.
  3. Markup Language: the standard is expressed as a markup language that uses "tags" to encode the semantic aspects of sequences and their display. More specifically, BSML is compliant with several international standards (see 2.2).

Bioinformatic Sequence Markup Language encodes descriptions of:

  1. Sequences and their features (e.g., genes, promoters, and restriction sites).
  2. Sets and relationships among sequences and their features (e.g., aligned sequences and mutation sets).
  3. Display objects (e.g., arrows, axes, and sequence lines).
  4. Relationships between display objects and sets, sequences, and features.

Before describing how these descriptions are encoded, some background on the origins of BSML is presented.


2.2 Source Standards

There are good reasons for basing BSML on a number of public standards:

  1. Many of the requirements for the BSML standard are already expressed in existing standards.
  2. Governing bodies (e.g., ISO, the International Organization for Standardization) are already in place for maintaining and improving these standards.
  3. As new hardware, software, and network developments occur, we may expect these changes to be reflected in the broader standards upon which BSML is based, providing guidelines for updating the standard.

Four general groups of standards serve as the sources for BSML, including standards for encoding, enabling, and managing:

  1. Genetic (nucleotide and protein) sequence information.
  2. General semantic content.
  3. Network communication (e.g., over the Internet).
  4. Display properties (e.g., fonts and colors).

2.2.1 Standards for Genetic Sequence Information (NCBI's ASN.1 and NCGR's GSDB)

NCBI - the National Center for Biotechnology Information (part of the National Library of Medicine, United States National Institutes of Health) - provides a public domain representation of biological sequences that uses Abstract Syntax Notation One (ASN.1). Although NCBI sequence information may be output in a variety of formats (e.g., by using NCBI's Entrez to export a sequence description as a GenBank flat file), the NCBI databases represent sequences in the ASN.1 format. The NCBI data model provides an excellent basis for the representation of sequences and sequence interrelationships. For more information, see the NCBI website at ncbi.nlm.nih.gov.

A related representation of sequence information has been developed by the National Center for Genome Resources (NCGR). This representation - Genome Sequence Database (GSDB) Version 1.0 - provides a relational database (Sybase) model for sequence information. The GSDB schema provides useful representations for a number of sequence features and sequence interrelationships. For more information, see www.ncgr.org.

2.2.2 Standards for Encoding Semantic Content (SGML)

The NCBI and NCGR standards provide semantic structures that accomplish many of the goals of this project. They do not, however, provide the following (nor were they designed to):

  1. Standards for representing graphic display properties of sequence objects.
  2. Mechanisms for controlling graphic display parameters (e.g., style sheets).
  3. Standard representations of structures relating display properties to sequence features and associated data.
  4. Standards for the transmission of information over networks, including the Internet.

In developing the standard for this project, one choice was to extend the NCBI and/or NCGR standards to accommodate these needs. A second choice was to find existing standards that incorporate solutions for some of these problems and also allow direct incorporation of the NCBI and NCGR sequence representation schemata. We chose the second alternative and selected Standard Generalized Markup Language (SGML) as the framework for representing sequence information (see the World Wide Web Consortium - W3C - at www.w3.org/MarkUp/SGML). There were several reasons for this choice:

  1. SGML is an internationally maintained standard (ISO 8879) that has been tested thoroughly since its publication in 1986. SGML provides hierarchical semantic encoding capabilities similar to those provided by ASN.1 and can also encode relational database table structures.
  2. SGML allows for applications and profiles that are customized for particular purposes.
  3. A recently developed SGML profile, eXtensible Markup Language (XML), provides many of the information structures required for this project.
  4. Other SGML applications (e.g., HyperText Markup Language - HTML) and related standards (e.g., style languages such as CSS and DSSSL - see below) provide methods for encoding and manipulating display information.

In making the decision to use SGML rather than the NCBI or NCGR sequence representations, we decided that the standard should be compatible with both of these encodings as well as with other popular models of sequence representation (e.g., the European Molecular Biology Laboratory's EMBL sequence format). Thus the BSML standard permits bidirectional, automated conversion between BSML and other widely used formats.

2.2.3 Standards for Encoding and Enabling Network Communication (XML)

While SGML provides methods for encoding semantic content, it does not provide directly for the transmission of documents over networks (specifically, transfer over the Internet using HTTP, the Hypertext Transfer Protocol). In 1996, a World Wide Web Consortium (W3C) SGML working group was formed to develop a simplified version of SGML that could be used on the World Wide Web. The result of this effort was a new standard termed "eXtensible Markup Language" (XML). As of July 1, 1997, the development of XML is proceeding under the auspices of the W3C.

XML is termed an SGML profile. In contrast to HTML, XML provides standards for semantic encoding and for linking documents over the World Wide Web. (For more information on XML, see www.w3.org/XML/). XML bases its document linking strategies on aspects of the HyTime (ISO/IEC 10744 Hypermedia/Time-based Structuring Language) standard. XML also includes many features of the Text Encoding Initiative (TEI) in its representation of links between document elements.

Using XML, a model for representing information is completely specified by defining a Document Type Definition (DTD). Whereas the DTD specifies how to encode information (e.g., how to represent a sequence and its features), the DTD does not specify how to interpret the semantic content. This job is left to the software that processes a document that uses the DTD. For this reason, our description of the standard includes a discussion of the requirements imposed on document processing software.

Note: XML is also closely related to the DOM (Document Object Model) specification (see www.w3.org/DOM/), which defines standard interfaces for the manipulation of document content.

2.2.4 Standards for Encoding and Managing Display Properties (HTML,CSS,DSSSL)

XML provides a framework for semantic encoding that allows for a one-to-one translation from ASN.1 syntax or from GSDB table schemata. Two ingredients are still required:

  1. A method for attaching display properties (e.g., fonts, colors, line sizes) to sequence objects.
  2. A method for controlling the properties of these display objects (e.g., setting default fonts).

The HTML DTD provides a number of tools for representing display properties. For this reason, it was decided to base parts of BSML on the relevant display properties defined in the newest version of HTML, 4.0 (see www.w3.org/TR/PR-html40/).

XML supports two methods for controlling display style:

  1. Cascading Style Sheets (CSS)
  2. Document Style Semantics and Specification Language (DSSSL)

DSSSL is not used to a great extent yet, although it offers a full set of facilities for controlling formatting. We decided to implement DSSSL (see www.w3.org/Style/#dsssl) support as an advanced feature in BSML 2.0. For now (BSML 1.0), only CSS is supported.

BSML is based in part on the CSS, Level 2 specification (see www.w3.org/TR/WD-CSS2/). In particular, CSS2 defines "paged media," which may include paper, transparencies, or computer screens. For the purpose of presenting sequence maps and displays, this model is more appropriate than the traditional HTML "scrolled media" representation of a document as one (possibly very long) page.


2.3 Semantic Encoding in SGML

The SGML approach to encoding semantic content is through an element-attribute-value data model. Semantic content of a particular type (e.g., a DNA sequence) is termed an element, which is defined in two ways:

  1. A content model that specifies other elements, including text, hierarchically contained within the element (e.g., a Sequence contains Features).
  2. A set of attributes, each of which has a value (e.g., a Sequence has a length).

2.3.1 Notes on Conventions and Symbols - A Brief SGML/XML/BSML Tutorial

Naming conventions in BSML (XML is case sensitive):

  1. Elements have names beginning with uppercase letters (e.g., Sequence).
  2. Attributes have names beginning with lowercase letters (e.g., shape).
  3. Attribute values are shown in quotes (e.g., shape="linear").

The occurrence of an element in a content model is specified by adding one of three characters to its name:

  1. ? indicates that an element is optional.
  2. + indicates one or more occurrences of the element.
  3. * indicates zero or more occurrences of the element.

The relationship between successive elements in a content model is indicated by separators:

  1. , indicates that the elements occur in sequence (the mathematical sense of "sequence").
  2. | indicates that either one element or the other occurs.

Examples:

  1. X (Y,Z) means that X is composed of elements Y and Z, each of which occurs once.
  2. X (Y*,Z+) means that X is composed of zero or more instances of Y followed by one or more instances of Z.
  3. X (Y|Z) means that X is composed of Y or Z (but not both).
  4. X (Y|Z)* means that X is composed of any number of instances of Y or Z, in any order.
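The occurrence indicators and separators above map closely onto regular-expression operators. The following Python sketch is an illustration only and is not part of the BSML specification: the helper names content_model_to_regex and children_match are hypothetical, and the sketch assumes simple models in which a single group does not mix the "," and "|" separators.

```python
import re

def content_model_to_regex(model):
    """Translate a content model such as "(Y*,Z+)" into a compiled regex.

    Each element name becomes a group that matches the name followed by
    one space. The "," separator becomes plain concatenation; "|", "?",
    "+", "*" and parentheses keep the same meaning they have in a regex.
    """
    model = re.sub(r"\s+", "", model)  # ignore whitespace in the model
    pattern = re.sub(
        r"[A-Za-z][\w-]*",
        lambda m: "(?:" + re.escape(m.group()) + " )",
        model,
    )
    return re.compile(pattern.replace(",", ""))

def children_match(model, children):
    """Return True if the ordered list of child names satisfies the model."""
    text = "".join(name + " " for name in children)
    return content_model_to_regex(model).fullmatch(text) is not None

print(children_match("(Y*,Z+)", ["Y", "Y", "Z"]))  # True
print(children_match("(Y|Z)", ["Y", "Z"]))         # False
```

A real SGML or XML parser performs this check as part of validation; the sketch only shows how the notation is read.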

Attributes are of three general types:

  1. Textual: strings consisting of any character data (CDATA), e.g., title="sv40".
  2. Tokenized: one or more tokens significant to XML, e.g., ID, which indicates a unique identifier.
  3. Enumerated: a list of possible values, such as shape=(linear,circular).

When an attribute is defined, it is assigned one of three types of default value:

  1. Specific values are shown in quotation marks, e.g., shape (linear,circular) "linear" indicates that a sequence is treated as linear if no value is specified.
  2. #REQUIRED indicates that a value must be supplied, e.g., seqid IDREF #REQUIRED.
  3. #IMPLIED indicates that the attribute is optional and its value will be provided as needed by the software if it is not specified, e.g., seqid IDREF #IMPLIED.

2.3.2 XML Element Identification

Every element in a BSML document has a unique identifier as one of its attributes. This attribute is always named id and is a token of type ID. This model provides a way to refer uniquely to every element (sequence, feature, etc.) defined in a BSML document. Every element also has a title, which is a displayable identifier.

2.3.3 XML Element References

Element references (e.g., a set of sequences referring to each sequence in the set) use attributes of token type IDREF (a reference to one ID) or type IDREFS (a reference to any number of IDs). Validating XML processors ensure that IDs are unique and that references to IDs point to existing elements.
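The two checks described above (unique IDs, resolvable references) can be sketched in Python with the standard xml.etree.ElementTree module. This is an illustration under stated assumptions, not the behavior of any particular processor: the root element Bsml, the function name check_ids, and the choice of seqid as the only reference attribute are assumptions made for the example, and ElementTree itself does not validate against a DTD.

```python
import xml.etree.ElementTree as ET

def check_ids(xml_text, idref_attrs=("seqid",)):
    """Return a list of ID problems found in the document.

    idref_attrs names the attributes treated as IDREF or IDREFS; a real
    processor would learn this from the DTD rather than from a parameter.
    """
    root = ET.fromstring(xml_text.strip())
    seen, errors = set(), []
    # First pass: every id value must be unique.
    for element in root.iter():
        ident = element.get("id")
        if ident is None:
            continue
        if ident in seen:
            errors.append("duplicate id: " + ident)
        seen.add(ident)
    # Second pass: every reference must point at a declared id.
    for element in root.iter():
        for attr in idref_attrs:
            for ref in element.get(attr, "").split():
                if ref not in seen:
                    errors.append("unresolved reference: " + ref)
    return errors

example = """
<Bsml>
  <Sequence id="S1" title="sv40"/>
  <View id="V1" seqid="S1"/>
  <View id="V2" seqid="S9"/>
</Bsml>
"""
print(check_ids(example))  # ['unresolved reference: S9']
```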


2.4 Representing Sequences and Their Features

The general approach in BSML is to represent relations among objects of interest in one of two ways:

  1. Hierarchically, through the use of the content model (e.g., the Features of a Sequence are contained within the definition of that sequence).
  2. Through the use of references to the unique identifier (ID) of each element.

2.4.1 Separating Content from Display: BSML Document Sections

A BSML document is divided into two main sections:

  1. The definitions section contains descriptions of sequences, features, data associated with sequences, and sets of sequences and features.
  2. The display section contains descriptions of pages on which sequences and their features are depicted graphically through association with various display objects. For example, a sequence is displayed by associating a View element with it.

Note: A BSML document need not contain a Display section if it is used purely to store and transmit sequence information.

The elements comprising the definitions section are discussed in 2.4, 2.5, and 2.6. The elements comprising the display section (including links among elements) are discussed in 2.7, 2.8, and 2.9. The overall structure of a BSML document, combining both sections, is discussed in 2.11.

The most fundamental BSML object is the genetic sequence, which may be a DNA, RNA, or protein sequence. The representation of individual sequences follows the NCBI ASN.1 and NCGR GSDB data models. Additional data structures are defined for dealing with relationships among sequences (2.5) and with sequence data (2.6).

BSML represents a genetic sequence by an element named Sequence. This element is itself composed (in part) of elements defining:

  1. Source information about the sequence.
  2. Sequence data (the series of bases or residues).
  3. Feature tables indicating the locations of genes, promoters, etc.

In simplified SGML terminology, the Sequence element is defined as:

ELEMENT Sequence (Source*,Seq-data?,Feature-tables*)

Each Sequence is characterized by a number of required and optional attributes, such as the sequence name, sequence length, shape, number of strands, etc. In SGML terminology, this information is represented as an attribute list (ATTLIST), with each attribute defined by its name, possible values, and default value (this list provides illustrations and is not complete):

   ATTLIST Sequence
     name    values          default
     id      ID              #IMPLIED
     title   CDATA           #IMPLIED
     length  CDATA           #REQUIRED
     shape   circular,linear #IMPLIED
     strands 1,2             "2"
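To make the three kinds of default concrete, the following Python sketch shows one way software might apply the illustrative attribute list above to the attributes actually supplied for a Sequence. The table SEQUENCE_ATTLIST and the function resolve_attributes are hypothetical names introduced for this example; a real implementation would read the declarations from the DTD, and might supply its own values for #IMPLIED attributes rather than leaving them out.

```python
REQUIRED, IMPLIED = "#REQUIRED", "#IMPLIED"

# attribute name -> (allowed values, or None for free text; default)
# Mirrors the illustrative (incomplete) ATTLIST for Sequence shown above.
SEQUENCE_ATTLIST = {
    "id": (None, IMPLIED),
    "title": (None, IMPLIED),
    "length": (None, REQUIRED),
    "shape": (("circular", "linear"), IMPLIED),
    "strands": (("1", "2"), "2"),
}

def resolve_attributes(given, attlist=SEQUENCE_ATTLIST):
    """Check supplied attributes and fill in specific default values."""
    for name in given:
        if name not in attlist:
            raise ValueError("undeclared attribute: " + name)
    resolved = {}
    for name, (allowed, default) in attlist.items():
        if name in given:
            value = given[name]
            if allowed is not None and value not in allowed:
                raise ValueError("invalid value for " + name + ": " + value)
            resolved[name] = value
        elif default == REQUIRED:
            raise ValueError("missing required attribute: " + name)
        elif default != IMPLIED:
            resolved[name] = default  # a specific default such as "2"
        # #IMPLIED and absent: left for the software to decide
    return resolved

print(resolve_attributes({"length": "5243", "shape": "circular"}))
# {'length': '5243', 'shape': 'circular', 'strands': '2'}
```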

The SGML model is hierarchical in that higher level elements are composed of one or more lower level objects. Thus, for example, the Feature-tables element defined as part of a Sequence element consists of a number of Feature-table elements, each of which is defined as a set of Feature elements:

ELEMENT Feature-tables (Feature-table*)

ELEMENT Feature-table (Feature*)

Similarly, each Feature may have any number of Locations and Qualifiers associated with it:

ELEMENT Feature (Location|Qualifier)*

Note: The representation of information in BSML will normally be transparent to users, just as HTML encoding of web pages is transparent to users. Users will interact with the BSML representation through graphical interfaces that conceal the details of the implementation.

2.4.2 Using BSML Semantic Content

Because XML documents (including BSML) encode the semantic properties of their subject matter, these representations make it relatively straightforward to query the contents of a document. This means that many functions may be developed in software implementations without being explicitly represented in the BSML document. For example, one feature may be said to occur before (5' of), within, or after (3' of) another. Such spatial relations may be extracted from the encoding of the feature table and displayed graphically as the result of ad hoc queries (e.g., "Show all sequences in the set with promoters occurring before CDS features.").
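As an illustration of such a query, the following Python sketch classifies the position of one feature relative to another and then answers the example question for a single sequence. It is a sketch under stated assumptions: features are reduced to dictionaries with hypothetical key, start, and end fields, coordinates are taken to be inclusive positions on the same strand, and the function names are invented for the example.

```python
def relation(a, b):
    """Describe where interval a lies relative to interval b.

    Intervals are (start, end) pairs with inclusive coordinates that
    increase in the 5' to 3' direction.
    """
    a_start, a_end = a
    b_start, b_end = b
    if a_end < b_start:
        return "before"    # a is 5' of b
    if a_start > b_end:
        return "after"     # a is 3' of b
    if b_start <= a_start and a_end <= b_end:
        return "within"
    return "overlaps"      # partial overlap, or a contains b

def promoter_before_cds(features):
    """True if any promoter feature lies before any CDS feature."""
    promoters = [f for f in features if f["key"] == "promoter"]
    coding = [f for f in features if f["key"] == "CDS"]
    return any(
        relation((p["start"], p["end"]), (c["start"], c["end"])) == "before"
        for p in promoters
        for c in coding
    )

# Made-up coordinates, for illustration only.
features = [
    {"key": "promoter", "start": 10, "end": 40},
    {"key": "CDS", "start": 60, "end": 300},
]
print(promoter_before_cds(features))  # True
```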

2.4.3 Definitions: Representing Sequence Information

One subsection of the Definitions section is named Sequences, and this element contains the definition of each Sequence included in the document. The hierarchical nature of the sequence organization is clearly revealed by inspection of the (simplified) element definitions shown below:

ELEMENT Sequences (Sequence*)
  ELEMENT Sequence (Source*,Seq-data?,Feature-tables*)
    ELEMENT Source
    ELEMENT Seq-data
    ELEMENT Feature-tables (Feature-table*)
      ELEMENT Feature-table (Feature*)
        ELEMENT Feature (Location|Qualifier)*
          ELEMENT Location
          ELEMENT Qualifier
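The nesting can be illustrated by building a small instance with Python's standard xml.etree.ElementTree module. This is a sketch only: the element names follow the simplified definitions above, but the attributes given to Feature, Location, and Qualifier (start, end, name, value) are assumptions, since this document does not list them, and the feature itself is made up.

```python
import xml.etree.ElementTree as ET

# Build Sequences > Sequence > Feature-tables > Feature-table > Feature.
sequences = ET.Element("Sequences")
sequence = ET.SubElement(
    sequences, "Sequence",
    id="S1", title="sv40", length="5243", shape="circular",
)
tables = ET.SubElement(sequence, "Feature-tables")
table = ET.SubElement(tables, "Feature-table")
feature = ET.SubElement(table, "Feature", id="F1", title="example gene")
# Hypothetical attributes; the coordinates are placeholders.
ET.SubElement(feature, "Location", start="100", end="400")
ET.SubElement(feature, "Qualifier", name="note", value="illustrative only")

def outline(element, depth=0):
    """Return the element names as indented lines, one per element."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

print("\n".join(outline(sequences)))
print(ET.tostring(sequences, encoding="unicode"))
```

The outline printed by the first call reproduces the containment hierarchy of the element definitions.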

2.4.4 Displaying Sequence Information

Views are the actual display elements that control the visualization of sequences and their features. The display of BSML content is directed to paged media, including computer screens and printed pages. The Display section includes any number of Page elements as its primary units of organization. Each Page may contain any number of View elements, where each View corresponds to the representation of a Sequence.

Each View uses an IDREF attribute to refer to a Sequence by its unique ID attribute value, and the View inherits all characteristics of its reference Sequence. The View may be customized to display a subrange of the complete sequence or to limit the display to selected Features.
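A minimal Python sketch of this inheritance follows. It assumes, for illustration only, that a View carries a seqid reference plus optional start and end attributes for the subrange, and that sequences and features are held as plain dictionaries; none of these names are fixed by this document.

```python
def resolve_view(view, sequences):
    """Combine a View with the Sequence it references.

    Values set on the View take precedence; anything the View leaves
    out is inherited from the referenced Sequence.
    """
    sequence = sequences[view["seqid"]]
    start = int(view.get("start", 1))
    end = int(view.get("end", sequence["length"]))
    # Keep only the features that fall entirely inside the subrange.
    visible = [
        f for f in sequence["features"]
        if start <= f["start"] and f["end"] <= end
    ]
    return {
        "title": view.get("title", sequence["title"]),
        "shape": sequence["shape"],
        "start": start,
        "end": end,
        "features": visible,
    }

# Made-up feature coordinates, for illustration only.
sequences = {
    "S1": {
        "title": "sv40", "length": 5243, "shape": "circular",
        "features": [
            {"id": "F1", "start": 100, "end": 400},
            {"id": "F2", "start": 3000, "end": 3500},
        ],
    }
}
view = {"id": "V1", "seqid": "S1", "start": 1, "end": 1000}
print([f["id"] for f in resolve_view(view, sequences)["features"]])  # ['F1']
```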


2.5 Representing Sequence and Feature Sets

Both for display and for capturing semantic content, it is often necessary to group sequences and features. Using the ID/IDREF(S) reference system described above, BSML defines a number of types of Set element (included in the Definitions section in a subsection called Sets). Through the various types of set elements, BSML provides data structures for representing any of the following:

  1. Arbitrary collections of elements (Sequences, Features and Sets).
  2. Sets of aligned sequences.
  3. Sets of contigs.
  4. BLAST search result sets.
  5. Sets of variants or mutations.
  6. Sets of related features (within one sequence or among sequences).

A set of related features (e.g., a set of restriction sites for a particular restriction enzyme) may be assigned a variety of attribute values and may be organized hierarchically. In this manner, a Set may represent a number of relationships among sequence features:

  1. Simple site lists.
  2. Simple sequence interval lists.
  3. Nested motifs composed of heterogeneous groups of features (e.g., a promoter site associated with a coding region and a repressor site).
  4. Lists of sites, intervals, or motifs weighted by numerical values (e.g., a match score).
  5. Lists of pairs of sites and intervals (e.g., primers and repeats).
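Because a Set may contain other Sets, software that needs the individual members must expand the hierarchy. The Python sketch below shows one way to do this, assuming each set is stored as a whitespace-separated list of member ids (in the style of an IDREFS value). The mapping layout and the function name expand_set are assumptions made for this example.

```python
def expand_set(set_id, sets, _seen=None):
    """Return the ids of all non-set members reachable from set_id.

    sets maps a set id to a whitespace-separated string of member ids;
    a member that is itself a key of sets is expanded recursively.
    """
    seen = set() if _seen is None else _seen
    if set_id in seen:
        return []          # guard against sets that refer to each other
    seen.add(set_id)
    members = []
    for ref in sets[set_id].split():
        if ref in sets:
            members.extend(expand_set(ref, sets, seen))
        else:
            members.append(ref)
    return members

# A nested motif (promoter, coding region, repressor site) inside a larger set.
sets = {
    "motif1": "P1 CDS1 R1",
    "all": "motif1 F9",
}
print(expand_set("all", sets))  # ['P1', 'CDS1', 'R1', 'F9']
```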


2.6 Representing Sequence and Feature Data

The Definitions section of a BSML document contains an optional Tables element that includes any number of Table-import or Table elements. Each Table-import and Table allows access to numeric data which may be directly encoded in the document using tabular or hierarchical data structures or which may be accessed from external files. Both summary and detailed data may be accessed and associated with sequences, features, or sets. These associations allow the data to be displayed in a variety of ways. Table-import and Table elements have optional attributes by which they may be associated with reading frames and strands.


2.7 Displaying Sequences, Features, and Sets

BSML sets values for a number of display factors in order to visualize sequence variation (qualities, quantities, and relations):

  1. location
  2. size
  3. value (e.g., color saturation, line thickness)
  4. texture
  5. color
  6. shape
  7. orientation

BSML provides a number of ways for controlling the display of sequences and their features. The following example illustrates the control of basic sequence display.

The next graphic illustrates how sites may be displayed.

The following display illustrates methods for showing sequence feature alignments.

BSML displays link sequences and sequence listings graphically and semantically. The following graphic cannot indicate clearly how these links are activated, but the general idea is conveyed.

2.7.1 Representing Display Objects

The selection of an approach for graphically representing sequence objects was guided by competing requirements for:

  1. A simple system that sequence analysis applications can use for automatic graphic formatting of their output.
  2. A rich system that lets users customize the display manually to suit their display requirements.
  3. A flexible system that software developers can adapt to specialized (perhaps proprietary) methods for displaying information graphically.

There are three general ways to specify how to depict a display object corresponding to a sequence or a sequence feature:

  1. Indicate conceptually the aspects of the object to be represented (e.g., represent a gene by an object indicating the 5' to 3' coding direction along the sequence line).
  2. Indicate explicitly how to draw a display object by specifying primitive drawing elements (e.g., draw an arrow using a blue line that is 2 mm thick, etc.).
  3. Pass the task of creating the display object to an external helper application.

The first option - conceptual description - offers the advantage of simplicity of understanding and use. Often, users will be quite satisfied to let the software decide how to represent features (e.g., big green arrows) so long as the information (gene locations and reading strands) is suitably captured by the display. One problem with this approach is that the display will certainly be different in different vendors' software implementations. This method is best suited to the need for a simple output format to be used by sequence analysis software.

The second option - explicit drawing specification - has the advantage of being self-contained and providing exact instructions. If reasonable default conditions are available (e.g., fonts, line dimensions, and colors), it is not too burdensome to use this method (i.e., every drawing parameter need not be specified). The disadvantage of this method is that different software implementations on different platforms using different output media might have trouble producing the same display. This method is best suited for the need to customize the display using manual editing.

The third option - using an external helper - is attractive in that it permits software implementers to customize the display in any manner they see fit. The problem with this approach is that the helpers must be available and that methods must be defined for passing parameters and for displaying objects in the event that the helper is not available. This method is best suited to the needs of software implementers who wish to use particular display technologies not explicitly defined in this standard.

We decided to support all three approaches, so the graphic specification model allows all three types of description. The three approaches are treated in a hierarchy ranging from lower to higher levels of specification: If an explicit specification is present, it takes precedence over a conceptual instruction. If an external specification is present, it takes precedence over either a conceptual or explicit specification.

Consider, for example, the representation of a gene. This feature will be represented by a Feature element under one of the Feature-table elements of a Sequence. The Feature element is associated with a display object element (Interval-object, in this case). In simplified form, the following examples illustrate how each of the three representation methods might be employed (assuming "genedraw" is an external application that draws genes):

Conceptual: <Interval-object direction="5to3">
Explicit: <Interval-object shape="arrow" color="blue" width="0.04cm">
External: <Interval-object use="genedraw" object="gene" parameters="plus,100,200">

(Technical note: The external reference is presented as an illustration; in fact, BSML does not access external objects in this way.)
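
The precedence rule described above can be sketched in Python. This is an illustration only: the attribute names (use, shape, color, width, direction) are taken from the simplified examples, the function name choose_representation is hypothetical, and, as the technical note says, BSML does not actually access external objects in this way.

```python
# Hypothetical sketch of the precedence rule: external > explicit > conceptual.
# Attribute names come from the simplified examples, not from the BSML DTD.
EXPLICIT_ATTRS = ("shape", "color", "width")


def choose_representation(attrs):
    """Return which kind of specification governs a display object."""
    if "use" in attrs:                      # external helper named
        return "external"
    if any(name in attrs for name in EXPLICIT_ATTRS):
        return "explicit"                   # primitive drawing instructions
    if "direction" in attrs:
        return "conceptual"                 # software decides how to draw
    return "default"                        # nothing specified at all
```

For example, an object carrying both direction="5to3" and shape="arrow" would be drawn from the explicit instructions, because the explicit level outranks the conceptual one.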

2.7.2 Graphic Display Objects

BSML supports the display of a variety of specific graphic structures, but also allows a great deal of freedom on the part of software implementers. The display structures are defined by general properties as well as specific attributes. The purpose of these structures is to provide graphic objects to reflect a variety of underlying structures:

  1. Simple individual sequences and their features.
  2. Complexly organized features on a single sequence (e.g., multi-element motifs, variation/mutation sets).
  3. Relationships among sequences (e.g., motifs shared by sequence sets, cloning histories, and homology and alignment sets).

In addition to its unique identifier (id) and name (title), each displayable element has attributes that may be set to control its display:

  1. On/off (display) and selection (selected) status.
  2. Read-only status to control editing.
  3. Display-auto status to control whether a feature is shown without needing an associated display object.
  4. Class membership that may be used in assigning display properties.

The fundamental graphic representations include single sequence, sequence-pair, and sequence-set view structures. Single sequence display structures include the display of the sequence itself and the data and features associated with the sequence. There are representations for all of the following:

  1. Chromosome maps.
  2. Genetic (linkage) maps.
  3. Physical maps, which may be linear or circular.
  4. Gel plot simulations for sequences represented as digest products.

Sequence data may be represented as:

  1. Single- or double-stranded DNA sequences (or translated sequences) shown as an inset on a map.
  2. Text in a sequence viewer, optionally accompanied by site listings.
  3. Boxed text pointing to a specific region on a sequence (sequence blowup).

Sequence features may be represented by:

  1. Point objects (lines indicating specific site locations on a sequence).
  2. Interval objects (boxes and arrows indicating regions on the sequence).
  3. Set objects that control the display of sets of sites or intervals.

Numerical data associated with individual sequences may be represented on a map as:

  1. A chart (histogram or frequency polygon).
  2. A table of values.
  3. An icon providing access to tabular or other presentations.

Sequence displays may be annotated through the use of a number of display object types:

  1. Line-pointers (lines using various patterns and optionally ending in arrowheads).
  2. Captions and text files (free text).
  3. Graphic figures (e.g., GIF images). (Note: BSML uses HTML client-side image mapping to create hot spots on graphics for linking.)

Multi-sequence representations include:

  1. Sets for associating sites (e.g., mutation locations) on a set of sequences.
  2. Sets providing tree representations for displaying homologies and similarities among sequences.
  3. Sets for presenting multiple alignment listings, including consensus sequences.

Sequence-pair representations include:

  1. Dot-matrix plots of regions of alignment, based upon data structures storing significant diagonals that are linked to aligned data.
  2. Sequence alignment listings.

2.7.3 Textual Display Objects

Most objects that can be displayed graphically can also be displayed textually as hierarchically arranged lists, tables, etc. Most of the implementation of this type of representation is left to the display software, although BSML does provide a few relevant attributes and elements. Another type of textual listing is of sequence data. BSML provides structures to present such listings in separate windows or as components of maps, including:

  1. Simple formatted sequence data display, with control over base numbering, font, bases per line, etc.
  2. Double-stranded listings.
  3. Conceptual translation listings for either strand.
  4. Feature/sequence listings (e.g., showing the locations of restriction sites).

2.7.4 Representing Sizes, Positions and Dimensions

There are several issues relating to the description of the locations and sizes of display objects:

  1. Using absolute page coordinates versus sequence relative coordinates.
  2. Using absolute versus relative size representation.
  3. Allowing relative and absolute units of measurement.

BSML permits display objects to be located either relative to a sequence or at an absolute location on the page. This distinction is primarily relevant when sequences are moved or their shape is changed (e.g., from linear to circular).
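
The value of sequence-relative location can be illustrated with a minimal Python sketch. It assumes a linear map drawn along a horizontal line and a circular map drawn clockwise from the 12 o'clock position; the function names and drawing conventions are illustrative assumptions, not part of BSML.

```python
import math


def linear_position(pos, length, x0, width):
    """Page x coordinate of sequence position `pos` on a horizontal map
    that starts at x0 and spans `width` page units."""
    return x0 + width * pos / length


def circular_position(pos, length, cx, cy, radius):
    """Page (x, y) of sequence position `pos` on a circular map centered at
    (cx, cy); angles run clockwise from 12 o'clock, with y increasing downward."""
    angle = 2 * math.pi * pos / length
    return (cx + radius * math.sin(angle), cy - radius * math.cos(angle))
```

An object stored with a sequence-relative position can be redrawn with either function when the sequence shape changes, whereas an object stored at an absolute page location stays where it is.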

BSML supports both relative and absolute size representations, although relative representations are encouraged (e.g., expressing a font size as 120% of another font size).

A variety of units is supported for absolute (cm, inches) and relative (pixels, em, en, ex, percentage) specification of lengths and other dimensions. The resolution of page coordinates and other dimensions follows the CSS2 guidelines.

BSML allows many location and size specifications to be set at either general or specific levels. General specifications indicate a rough location on a page (e.g., "top") or a general size description (e.g., "large"). Specific levels indicate precise quantities (e.g., 20 pixels).


2.8 Representing Links

Interactive map display requires the ability to link displayed objects to other displayed objects, to underlying sequences and features, and to source documents containing cross-reference information. Fortunately, XML provides a rich set of linking features:

  1. As in HTML documents, links use the href=URL (Uniform Resource Locators) format, allowing local or network access.
  2. Links may access any particular element in the current BSML document or in an external document by using the id attribute.
  3. Links may be set to any number of elements by using references to a number of ids.
  4. By using MIME technology and references to registered helper applications, BSML documents may be linked to files of virtually any type.

Every element contains an optional set of Link elements, each of which allows the specification of any of the link types indicated above. To accomplish this, the Link elements define the attribute xml-link and assign it one of several enumerated values (simple, extended, locator, group, or document).

XML also supports "out-of-line" links. This means that the specification of the links between elements is made in a separate element, where each element in the linking set is identified by a locator, e.g.:

<Extended-link>
 <Locator href="#seq1">
 <Locator href="#seq2">
</Extended-link>

This example creates a link structure that can be traversed easily in either direction between the two sequences. In BSML documents, a subsection named Links contains all out-of-line definitions.
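
How software might make such a link traversable in either direction can be sketched in Python. This is a minimal illustration under stated assumptions: the sample is wrapped in a Links element and uses XML empty-element syntax (/>) so that a standard XML parser accepts it, only same-document references of the form #id are handled, and the function name build_link_index is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Sample Links subsection (well-formed XML variant of the example above).
LINKS = """
<Links>
 <Extended-link>
  <Locator href="#seq1"/>
  <Locator href="#seq2"/>
 </Extended-link>
</Links>
"""


def build_link_index(links_xml):
    """Map each linked id to the set of ids it can reach."""
    index = defaultdict(set)
    root = ET.fromstring(links_xml)
    for ext in root.iter("Extended-link"):
        # Strip the leading '#' of same-document references.
        ids = [loc.get("href").lstrip("#") for loc in ext.iter("Locator")]
        for source in ids:
            index[source].update(i for i in ids if i != source)
    return index
```

With this index, selecting seq1 yields seq2 and selecting seq2 yields seq1, without either Sequence element containing a link of its own.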

2.8.1 Internal Links

The simplest type of link is to another element in the current document. For example, simple HTML-like links are allowed, such as (# indicates a reference to an id):

<Link href="#seq1">

This link points to the element in the current document with id="seq1".

2.8.2 External Links

External links are to other BSML documents and to non-BSML files (e.g., a graphic image stored in a GIF file). XML external links use URLs (Uniform Resource Locators) of the same type supported in HTML (including the query identifier ? and the fragment identifier # defined in HTML 4.0). Most types of file may be transported across the Internet using the hypertext transfer protocol (http), as required by BSML software.

Any BSML element may use an external link, e.g.:

<Link href="http://www.topogen.com/sbir/rfc.html">

In the case of another BSML (or any XML/SGML) document, a specific element within the document may be selected by adding the fragment identifier # followed by the id of the element:

<Link href="http://www.topogen.com/sbir/demo.bsml#seq1">

This link points to the element with id="seq1" that is contained in document demo.bsml at the www.topogen.com site.
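
Separating the document address from the element id is straightforward with a standard URL library. The following Python sketch uses urldefrag from the standard library; the helper name split_link is an assumption made for illustration.

```python
from urllib.parse import urldefrag


def split_link(href):
    """Split an href into (document URL, element id or None)."""
    url, fragment = urldefrag(href)
    return url, (fragment or None)
```

An empty document URL, as produced by an href such as "#seq1", indicates an internal link to the current document.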

2.8.3 Implicit Links

Some links may be "inferred" by the display software from the content of the BSML document. For example, suppose that a document includes several homologous sequences, each of which contains a gene with the same name. A linkage set consisting of the features containing the same-named genes may be constructed by the software without explicit instructions. Links of this sort are called implicit links. Such links are often useful for responding to ad hoc queries and commands ("Highlight all same-named genes.").
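
The same-named gene case can be sketched in Python as follows. The sample document is a simplified, hypothetical fragment: Feature elements are placed directly under Sequence, empty elements use XML /> syntax, and the Qualifier attributes (type, qual-value) follow the examples in section 3. The function name implicit_gene_links is an assumption.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical, simplified fragment with one gene name shared by two sequences.
DOC = """
<Sequences>
 <Sequence id="SEQ1">
  <Feature id="FTR1"><Qualifier type="gene" qual-value="rpoB"/></Feature>
  <Feature id="FTR2"><Qualifier type="gene" qual-value="rpoC"/></Feature>
 </Sequence>
 <Sequence id="SEQ2">
  <Feature id="FTR3"><Qualifier type="gene" qual-value="rpoB"/></Feature>
 </Sequence>
</Sequences>
"""


def implicit_gene_links(xml_text):
    """Group Feature ids by gene name; keep names shared by 2+ features."""
    groups = defaultdict(list)
    root = ET.fromstring(xml_text)
    for feature in root.iter("Feature"):
        for qual in feature.iter("Qualifier"):
            if qual.get("type") == "gene":
                groups[qual.get("qual-value")].append(feature.get("id"))
    return {gene: ids for gene, ids in groups.items() if len(ids) > 1}
```

The resulting groups could back a command such as "Highlight all same-named genes" without any Link elements appearing in the document.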

Some implicit links may also be inferred from the hierarchical structure of BSML. For example, the features in a feature table are obviously linked implicitly to the sequence containing the feature table.

Another type of implicit link is created by tracking the display history created by user actions (zoom, pan, select, etc.). Implicit links of this type are maintained entirely by the software and are not dealt with further here.

2.8.4 Explicit Links

Explicit links are those created using the Link, Extended-link, Group-link, Document-link and Locator elements. In contrast to HTML documents, explicit links need not be displayed and do not require user selection for their actuation. This makes the XML linking mechanism much more flexible than the HTML linking mechanism.

2.8.5 One-to-One Links

A simple link may be used to represent a one-to-one link between two elements. This may be accomplished in the traditional HTML way by imbedding a link element in each element, or by using a group definition that includes only the two elements.

2.8.6 One-to-Many and Many-to-One Links

One-to-many and many-to-one links are easily enabled in XML documents by using Extended-link elements. For example, selecting a gene on one sequence might result in a pop-up list of related genes on other sequences (which need not be contained in the same document).

2.8.7 Link Actuation and Behavior

In HTML browsers, links are actuated when the user clicks on them, causing the link to be traversed. XML provides a much richer environment for specifying the actions to be taken when a link is actuated. This is accomplished by providing defined attributes that XML software can interpret appropriately. In addition to specifying a resource (href=URL), links may have the following attributes:

  1. Rel: The relationship of this resource to the destination of the link.
  2. Rev: The relationship of the link destination to this resource.
  3. Title: A short description of the nature of the link that is meant to be seen by the user.
  4. Role: An attribute designed to help the application software process the link.
  5. Actuate: An attribute indicating whether the link should be traversed when the document is loaded (auto) or only when selected by the user (user).
  6. Show: An attribute indicating how the linked resource should be displayed and processed (include, replace, or new).
  7. Behavior: Detailed instructions to the processing software.

By defining suitable behaviors for the link, it is possible to relate particular actions to particular selection events (e.g., show sequence data if the user double-clicks on the sequence).


2.9 Navigation and Selection

The BSML standard allows elements (sequences, etc.) to be selected, although the standard does not specify how selection should be represented (graying, highlighting, etc.). "Navigation" refers to changing the focus from one selected element (or set) to another, and changing selected objects may be tied to changing displayed pages, etc. The concept of element selection is defined in part by the software implementation and in part by the linking behaviors imbedded in the BSML document. BSML supports four general modes of navigation.

2.9.1 Graphic

Graphic navigation refers to selection by pointing to, or clicking on, a graphically displayed object. Each graphic representation may point to an underlying element in the BSML hierarchy (e.g., a sequence). Graphic selection may be accompanied by other actions, such as the display of a popup menu showing navigational options.

2.9.2 Element Hierarchy

The currently selected element (e.g., currently highlighted sequence) points to a location in the BSML hierarchy. Element hierarchy navigation changes the focus by selecting a related element (next sibling, child, parent, etc.) in the hierarchy.
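
Element hierarchy navigation might be implemented along the following lines. This Python sketch assumes the document has been parsed with the standard xml.etree.ElementTree module, which does not store parent pointers, so a parent map is built first. The sample tree and all function names are illustrative.

```python
import xml.etree.ElementTree as ET

# Small illustrative tree; empty elements use XML '/>' syntax.
ROOT = ET.fromstring(
    '<Sequences><Sequence id="a"><Feature id="f1"/></Sequence>'
    '<Sequence id="b"/></Sequences>'
)


def build_parent_map(root):
    """Map every element to its parent (the root has no entry)."""
    return {child: parent for parent in root.iter() for child in parent}


def parent_of(elem, parents):
    return parents.get(elem)


def first_child(elem):
    return elem[0] if len(elem) else None


def next_sibling(elem, parents):
    """Return the element following `elem` under the same parent, or None."""
    parent = parents.get(elem)
    if parent is None:
        return None
    siblings = list(parent)
    i = siblings.index(elem)
    return siblings[i + 1] if i + 1 < len(siblings) else None


PARENTS = build_parent_map(ROOT)
```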

2.9.3 Linked

Linked navigation is based upon the explicitly defined links described in 2.8. User selections may involve individual elements or sets of elements.

2.9.4 Query

Query navigation is based upon processing a query of the current set of elements (e.g., "Show me all sequences with globin genes."). A query returns a list of candidates. When used for navigation, the query list display permits selection of one or more candidates.


2.10 Controlling Display Style

BSML follows the general approach of HTML and XML in using style sheets. The class property is used to provide formatting instructions for particular groups of elements, and formatting may be applied through style attributes at any level.

BSML requires the processing software to supply default values for all display attributes. Some of these values provide base level definitions that are used in resolving relative attribute values. For example, if the base font has size=12pt, another font may be specified as:

<Font size="125%">

In this example, the resulting font would have size=15pt. Virtually all dimensions may be specified relative to a base level.
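
The arithmetic can be sketched in Python. The sketch handles only percentages and point sizes, and the function name resolve_size is an assumption; a real browser would need to handle the other units as well.

```python
def resolve_size(value, base_pt):
    """Resolve a size string against a base size given in points."""
    value = value.strip()
    if value.endswith("%"):
        # Relative size: a percentage of the base level.
        return base_pt * float(value[:-1]) / 100.0
    if value.endswith("pt"):
        # Absolute size: already in points.
        return float(value[:-2])
    raise ValueError("unsupported size: %r" % value)
```

Here resolve_size("125%", 12.0) gives 15.0, matching the example above.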

2.10.1 Style Sheets

A style sheet provides values for display attributes. A default style sheet is required for all documents, and a BSML browser is expected to provide suitable values for all unspecified attributes. (Note: This discussion is based on using CSS for style sheets, not the to-be-developed XS style mechanism derived from DSSSL.)

The simplest type of CSS specification names an element (the selector) and specifies a value for one of its attributes (the declaration), e.g., the following specification states that feature elements should use a font with its size set to 10 pt:

Feature {font-size : 10pt;}

Any number of declarations may be made inside a block, and declarations may apply to more than one element, e.g.:

Feature,Qualifier {font-size : 10pt; font-color : blue;}

When two element names are separated by a space rather than by a comma, the declaration has a very different meaning: the declaration applies only to elements of the second type that are included within elements of the first type. For example, the following declaration applies to the line-color attribute of an Interval element that is contained in a Feature element:

Feature Interval {line-color : red;}

A selector may use attribute values to limit its applicability. The following example applies a line color declaration only to Feature elements contained within Sequence elements for which the strands attribute is set to "2":

Sequence(strands="2") Feature {line-color : black;}

All displayable elements in BSML have an attribute named class. The purpose of this attribute is to allow style declarations to be applied selectively to elements that are members of a class. CSS provides a short-hand method for supplying class values, e.g., the following two statements are equivalent:

Interval.heavy {width : 20px;}
Interval(class="heavy") {width : 20px;}

To apply style to all elements of a class, the class name itself is provided after a period, e.g.:

.heavy {width : 20px;}

2.10.2 Applying Style to Elements

Display attributes may be set at all levels, from the complete document to the individual element. CSS uses an inheritance mechanism to determine the attributes for a particular element:

  1. The default style sheet is first resolved to provide a default set of attribute values for all elements to which the attributes apply.
  2. Style sheets provided in the Styles subsection of the Display section of a BSML document are then resolved. These are applied to individual elements that meet the selection criteria described above.
  3. Style attributes set at the level of the individual element take precedence over other settings.

For example, consider the formatting of text associated with the display of a gene. Suppose that this element is identified as a Feature with the attribute type="gene". The following rules would determine the font to be used in displaying the name of this gene:

  1. If the base font specified in the default style sheet is the only specification, this font is used.
  2. If a style sheet specifies a font that applies to all display elements, this font is used in place of the base font.
  3. If a style sheet specifies a font that applies to Feature elements with type="gene", this font is used rather than any more general font specification.
  4. If a particular instance of a Feature with type="gene" has its font attribute set, that font is used rather than any more generally specified font.
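
These four rules amount to choosing the most specific setting that is present. A minimal Python sketch, in which the function name resolve_font and its parameter names are hypothetical and each level is passed in as an already-resolved value:

```python
def resolve_font(default_font, general_style=None, type_style=None,
                 element_attr=None):
    """Return the font chosen by the most specific level that sets one."""
    # Most specific first: element attribute, then the type="gene" rule,
    # then the rule for all display elements.
    for candidate in (element_attr, type_style, general_style):
        if candidate is not None:
            return candidate
    # Nothing more specific was given: fall back to the default style sheet.
    return default_font
```

A full implementation would derive the intermediate levels by matching style sheet selectors against the element rather than receiving them as arguments.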

For more information on style sheets, see www.w3.org/TR/WD-CSS2/.


2.11 Structure of a BSML Document

SGML documents (including HTML, XML, and BSML) mark their contents using tags. Tags usually are paired, with an opening tag and a closing tag enclosing the content to which they refer. For example, in an HTML document, the <p> and </p> tags enclose a paragraph of text. In the same way that a web page document begins with the <html> tag and ends with the </html> tag, a BSML document begins with <Bsml> and ends with </Bsml>. Between these tags are two major sections:

  1. The Definitions section defines the sequences and data used by the document.
  2. The Display section defines the display properties associated with the sequences and data.

Thus the overall structure of a BSML document looks like this (note that XML is case-sensitive, so bsml is not the same as Bsml):

<Bsml>
<Definitions>
...
</Definitions>
<Display>
...
</Display>
</Bsml>

Within the Definitions section, several subsections may be included:

  1. The Sequences subsection defines any number of Sequence elements, which may be included directly in the BSML document or by reference (e.g., to a database or external file).
  2. The Sets subsection defines any number of Set elements, which are groups of sequences or features that may be related in a variety of ways (e.g., by a homology tree or as a set of mutations).
  3. The Tables subsection defines any number of Table-import or Table elements that contain data associated with sequences or sequence sets.

Within the Display section, the display of information is organized by Page elements, each of which may contain any number of View elements. Each View may contain a reference to one of the Sequence definitions in the Sequences subsection of the Definitions section. Thus a complete BSML document might have the following sections (indentation is used for clarity):

<Bsml>
 <Definitions>
  <Sequences>
   <Sequence id="sv40">
    ...
   </Sequence>
  </Sequences>
  <Sets>
   <Set>
    ...
   </Set>
  </Sets>
  <Tables>
   <Table>
    ...
   </Table>
  </Tables>
 </Definitions>
 <Display>
  <Page>
   <View seqref="sv40">
    ...
   </View>
  </Page>
 </Display>
</Bsml>
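
The relationship between the Display and Definitions sections can be checked mechanically. The following Python sketch resolves each View to the Sequence it references. It assumes a trimmed, well-formed XML variant of the document above (empty elements written with />), and the function name resolve_views is illustrative.

```python
import xml.etree.ElementTree as ET

# Trimmed variant of the document structure shown above.
BSML = """
<Bsml>
 <Definitions>
  <Sequences>
   <Sequence id="sv40"/>
  </Sequences>
 </Definitions>
 <Display>
  <Page>
   <View seqref="sv40"/>
  </Page>
 </Display>
</Bsml>
"""


def resolve_views(xml_text):
    """Return (View, Sequence) pairs; fail on a dangling seqref."""
    root = ET.fromstring(xml_text)
    sequences = {s.get("id"): s
                 for s in root.iterfind("Definitions/Sequences/Sequence")}
    resolved = []
    for view in root.iterfind("Display/Page/View"):
        ref = view.get("seqref")
        if ref not in sequences:
            raise KeyError("View refers to unknown sequence %r" % ref)
        resolved.append((view, sequences[ref]))
    return resolved
```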


3. BSML Examples

Note: Other examples will be placed on our website as BSML evolves and implementable versions of the DTD are released. (See also the output samples shown in 2.7.) The examples included here are intended to illustrate very basic properties that reveal the structure of BSML documents.


3.1 Basic Sequence Display

<!DOCTYPE Bsml SYSTEM "bsml.dtd">
<Bsml>
 <Definitions>
  <Sequences>
   <Sequence id="SEQ1" title="ECRPOBC" seq-type="dna" units="bp"
             length="12337" shape="linear" strands="2">
   </Sequence>
  </Sequences>
 </Definitions>
 <Display>
  <Page>
   <View id="VEW1" seqref="SEQ1">
   </View>
  </Page>
 </Display>
</Bsml>

Depending upon default conditions, this document might produce a display such as:


3.2 Sequence with Default Interval Feature Display

This example adds two features that represent genes. These features are given the attribute display-auto="1", which instructs the software to create a display object using default conditions.

<!DOCTYPE Bsml SYSTEM "bsml.dtd">
<Bsml>
 <Definitions>
  <Sequences>
   <Sequence id="SEQ1" title="ECRPOBC" seq-type="dna" units="bp"
             length="12337" shape="linear" strands="2">
    <Feature-tables id="FTS1">
     <Feature-table id="FTB1">
      <Feature id="FTR1" title="rpoB" type="CDS"
                direction="right" display-auto="1">
       <Interval-loc start-pos="2969" end-pos="6994">
       <Qualifier type="gene" qual-value="rpoB">
      </Feature>
      <Feature id="FTR2" title="rpoC" type="CDS"
                direction="right" display-auto="1">
       <Interval-loc start-pos="7074" end-pos="11294">
       <Qualifier type="gene" qual-value="rpoC">
      </Feature>
     </Feature-table>
    </Feature-tables>
   </Sequence>
  </Sequences>
 </Definitions>
 <Display>
  <Page>
   <View id="VEW1" seqref="SEQ1">
   </View>
  </Page>
 </Display>
</Bsml>

A default display of this document might look like this (note that the direction attribute is set to "right" for both genes, indicating the direction for the arrow):


3.3 Sequence with Point Feature Display

The next example illustrates the explicit definition of a display object matched to a feature. In this case, a restriction site on the sequence is shown. Note also that the View object now overrides the default number of strands and produces a single-stranded display.

<!DOCTYPE Bsml SYSTEM "bsml.dtd">
<Bsml>
 <Definitions>
  <Sequences>
   <Sequence id="SEQ1" title="ECRPOBC" seq-type="dna" units="bp"
              length="12337" shape="linear" strands="2">
    <Feature-tables id="FTS1">
     <Feature-table id="FTB1">
      <Feature id="FTR1" title="EcoRI" type="user-defined">
       <Site-loc site-pos="5000">
      </Feature>
     </Feature-table>
    </Feature-tables>
   </Sequence>
  </Sequences>
 </Definitions>
 <Display>
  <Page>
   <View id="VEW1" seqref="SEQ1" strands="1">
    <Point-object featureref="FTR1" on-strand="plus"
          caption="EcoRI restriction site">
   </View>
  </Page>
 </Display>
</Bsml>

This document might produce a display such as the following:


4. BSML Formal Language Specification

This section introduces Document Type Definitions (DTDs) and describes some of the basic features of the BSML specification. The latest version of the DTD is bsml.dtd.


4.1 Introduction to Document Type Definitions (DTDs)

The complete DTD for BSML specifies the format and content of a BSML document. A DTD defines a model of semantic content by using elements and their attributes. Elements define the basic objects represented by the DTD and are specified using the format:

<!ELEMENT element-name start-tag-omission end-tag-omission element-contents>

For example, the BSML element Feature-table is defined as follows:

<!ELEMENT Feature-table - - (Feature*,Display?)>

The content model may indicate the occurrence and order of elements by using connector (,|) and occurrence (?+*) indicators. Attributes are described for each element in any number of declarations of the following type:

<!ATTLIST element-name (attribute-name declared-value default-value)*>

For example, the BSML element Sequence has a number of attributes, such as:

<!ATTLIST Sequence
        shape (linear,circular) "linear"
        strands (1,2) "2">
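
The effect of such a declaration on a processor can be sketched in Python: attributes missing from a start tag receive their declared defaults, and supplied values are checked against the enumerated list. The table below transcribes only the two attributes shown above, and the function name apply_defaults is an assumption.

```python
# Declared values and defaults transcribed from the ATTLIST excerpt above.
SEQUENCE_ATTLIST = {
    "shape": (("linear", "circular"), "linear"),
    "strands": (("1", "2"), "2"),
}


def apply_defaults(attrs, attlist=SEQUENCE_ATTLIST):
    """Fill in declared defaults and reject values outside the enumeration."""
    result = dict(attrs)
    for name, (allowed, default) in attlist.items():
        value = result.setdefault(name, default)
        if value not in allowed:
            raise ValueError("%s=%r is not one of %r" % (name, value, allowed))
    return result
```

A validating parser performs this step itself; the sketch only shows what a postprocessor can rely on afterwards.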


4.2 Basic Elements and Entities

For now, BSML documents use only the XML default character set (ISO10646 UTF-8). Because XML may change its standards for attribute value typing, BSML uses parameter entities to represent fundamental numeric types:

<!ENTITY % integer "CDATA">
<!ENTITY % real "CDATA">

For now, strong typing is not supported in XML, so specifying an attribute as an integer or real is the same as specifying that it is character data (CDATA). Even so, this usage makes it clearer how the attribute value is to be treated.

BSML uses some HTML 4.0 definitions to represent basic display properties (colors and fonts). These are specified in the DTD.


4.3 XML and SGML Compliance

XML is defined as an SGML profile. While BSML is written to conform with the XML standard, it must be realized that XML itself is not complete. Some issues that need to be resolved include:

  1. Style specification (the XS style mechanism remains to be defined).
  2. Data typing specification (various proposals, including XML-data, have been made to handle problems in XML and SGML resulting from the lack of strong data typing).

Despite these problems, XML appears destined to play a major role in Internet communication and semantic encoding of information. For this reason, basing BSML on XML appears to be a sound decision, even if later modifications to XML (and/or SGML) require some changes to the BSML standard. The basic semantic structure represented by BSML is unlikely to be affected by these changes.

For now, XML and SGML compliance is probably best assured by using a validating SGML parser such as NSGMLS (see 6.3) as the front end for a BSML-specific postprocessor.


4.4 BSML DTD

The BSML DTD will be created in stages. The first stage (Version 0.1) sketches the basic outline of a BSML document. An implementable version of the DTD will be gradually created for release as Version 1.0 about Jan. 31, 1998. The current version of the DTD is located in file bsml.dtd. To download the latest version of the DTD, see downloads.


5. BSML Browser Overview

Note: The subsections of this section will be extended in response to feedback on the RFC.

The internal representation of the BSML standard is likely to be quite similar to the external representation, using a tree structure to represent the hierarchy among sets, sequences and features. The specific implementation of a BSML browser will depend upon the purposes for which the browser is used.

5.1 Overall Browser Capabilities

The fundamental task of a BSML browser is to display sequences and their features and annotation.


5.2 Communication Criteria

For strictly local applications, the browser need not be Internet aware. For many linking purposes, however, the browser must be able to process requests for documents located anywhere on the Internet. This requires the ability to process and resolve URLs and to use the hypertext transfer protocol (http) to transfer information to a client.


5.3 Navigation Criteria

A browser must be able to support the four navigation modes described in 2.9:

  1. Graphic navigation requires the ability to select displayed objects with a pointing device (mouse).
  2. Element hierarchy navigation requires a tree representation of the elements in a document (sequences, etc.) and the ability to select elements by moving through the nodes and branches of the tree.
  3. Linked navigation requires the ability to respond to user selections by displaying individual and group links to the selected element.
  4. Query navigation requires the ability to accept and process a query and to return a list of candidates as either a textual list or a graphically highlighted set, as appropriate.


5.4 Visualization Criteria

A graphic browser needs basic graphic capabilities and, obviously, must support drawing of the basic display objects (lines, arcs, fills, symbols, etc.). More advanced browsers will let users select dimensions to be varied for the purpose of visualizing sequence and feature properties (e.g., varying color saturation or line length to reflect quantitative or qualitative variation in properties of interest).


5.5 Non-graphic Browsers

Although the primary purpose for developing this standard is to facilitate graphic sequence display, there is no requirement that a browser display information graphically. The information contained in a BSML document may easily be viewed in purely textual formats. This approach may be adequate when using the BSML encoding to select, transfer and convert sequence information.


5.6 Conversion Utilities

The functionality of the BSML standard should benefit greatly from the development of a number of conversion utilities. In general, this means developing applications that read files in one format (e.g., EMBL sequence files) and output files in another format (e.g., BSML). The following are likely candidates for conversion utilities:

  1. Public domain physical sequence database formats (GenBank, EMBL, DDBJ, GSDB, etc.).
  2. Public domain genetic database formats (FlyBase, MGDB, etc.).
  3. Relational database formats (e.g., convert a Sybase table).
  4. Spreadsheet formats (e.g., convert an Excel spreadsheet table).

Conversion utilities may also be useful for incorporating data held in standard interchange formats, such as DIF for data exchange or RTF for text exchange. Many applications can export data in formats of this type.

A somewhat different type of conversion may also be useful - converting information from BSML documents to HTML documents, for publication on the web and on intranets. There are two general ways in which to accomplish this, both of which should be relatively simple to implement:

  1. Create graphic image files from sequence browser displays in a format that HTML browsers can display (e.g., GIF files).
  2. Convert BSML element contents to text fields in HTML documents.
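
The second approach can be sketched as follows. This is only an illustration: the function name and the input, a list of (element name, text content) pairs already extracted from a BSML document, are assumptions, and the element names used with it would be whatever the BSML DTD defines.

```python
from html import escape


def elements_to_html(title, fields):
    """Render (element name, text content) pairs as a simple HTML table.

    The pairs are assumed to have been extracted from a BSML document
    by some earlier processing step that is not shown here.
    """
    rows = "\n".join(
        # escape() protects characters such as < and & in the content.
        f"<tr><th>{escape(name)}</th><td>{escape(text)}</td></tr>"
        for name, text in fields
    )
    return (
        f"<html><head><title>{escape(title)}</title></head>\n"
        f"<body><table>\n{rows}\n</table></body></html>"
    )
```

Escaping the element contents matters because sequence annotations may contain characters that HTML treats as markup.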



6. BSML Software Specifications

-- to be written --



6.1 Manual Creation and Editing Requirements

-- to be written --



6.2 Automatic Creation Requirements

-- to be written --



6.3 BSML Processing Requirements

Software that processes BSML documents (e.g., a "browser") will generally not work directly with the source BSML file. Instead, it will usually be written as a back end whose input is the output of a front end that parses the document. For now, the most likely front end is NSGMLS, a freely available SGML parser written by James Clark. NSGMLS writes its output in a text format based on ESIS (Element Structure Information Set), in which each line begins with a one-character command code. This output is easier to process than the SGML source and will usually serve as the input to a BSML postprocessor.



6.4 Entity Manager Requirements

BSML documents may contain a variety of references to external documents, including linked resources, external parameter entities, and data entities defined through notation syntax. For example, the following definitions might be included in a DTD:

<!ENTITY testfile SYSTEM "test.bmp" NDATA BMP>
<!NOTATION BMP SYSTEM "BMP Format">
<!ELEMENT Picture - - (#PCDATA)>
<!ATTLIST Picture img ENTITY #REQUIRED>

In the document itself, the following element might be included:

<Picture img = testfile></Picture>

The ESIS output from this line using NSGMLS (see 6.3) will be:

sBMP Format
NBMP
stest.bmp
f<OSFILE FIND>test.bmp
Etestfile NDATA BMP
AIMG ENTITY testfile
(PICTURE
)PICTURE

The role of the entity manager is to resolve file references and supply appropriate files to the processor. Some references are resolved through the use of a catalog that maps SGML identifiers to physical file names. Further discussion of this topic is beyond the scope of this document.
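
For illustration only, the catalog lookup mentioned above might be sketched as follows in Python. The sketch assumes a catalog with one entry per line of the form PUBLIC "public identifier" "file name", both parts quoted; the function names are hypothetical, and all other entry types are ignored.

```python
import re

# Matches one entry of the assumed form: PUBLIC "public id" "file name"
_PUBLIC_ENTRY = re.compile(r'^\s*PUBLIC\s+"([^"]*)"\s+"([^"]*)"\s*$', re.IGNORECASE)


def load_catalog(text):
    """Build a public-identifier-to-file-name mapping from catalog text."""
    mapping = {}
    for line in text.splitlines():
        match = _PUBLIC_ENTRY.match(line)
        if match:
            mapping[match.group(1)] = match.group(2)
    return mapping


def resolve(public_id, system_id, catalog):
    """Prefer the catalog entry for public_id; otherwise use system_id."""
    return catalog.get(public_id, system_id)
```

A reference with no matching catalog entry, such as the system identifier test.bmp in the example above, is returned unchanged.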



7. Relationship of BSML to Other Initiatives

This section is to be completed as needed in response to feedback on the RFC.



7.1 Biowidget Consortium

To be written.



7.2 CORBA

To be written.



8. BSML 2.0 - Advanced Topics to be Implemented

This section will be supplemented by topics developed through responses to the RFC.



8.1 Password Protection

To be written.



8.2 Encryption

To be written.



8.3 DSSSL Support

To be written.
