MSAML - A Set of XML Compliant Markup Components for Describing Multiple Sequence Alignments


Impetus

MSAML was formulated to make manipulation and extraction of multiple sequence alignment information easier by logically defining the parts of an alignment for use in an XML conformant application.

MSAML is not an attempt to make an error-proof MSA format. An MSAML document may pass a validating parser test and still contain some errors (e.g. miscalled consensus). It is also not an attempt to capture all facets of an MSA since this would require two-dimensional data formatting. This definition should be considered "lightweight".

Document Type Definition (DTD) Formulation

There are two rules which governed the formulation of the MSAML DTD, as well as all my other DTDs:

  1. Elements with mixed content may have their character data internalized in an attribute called value.
  2. Elements with mixed content child nodes that occur zero or once may internalize their children's character contents with an attribute of the same name as the child element.

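The net effect of these rules is that most data that is stored as element character content can be stored in attributes. Another rule resolves ambiguous cases: when the same datum appears both in an attribute and as character content, the attribute value takes precedence.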

The reason for giving attribute data precedence is that the character content version may be modified by another program unaware of MSAML.
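
As a rough illustration of attribute data taking precedence, the sketch below reads an element's data from its value attribute when one is present and falls back to the character content otherwise. The helper name resolve_value is hypothetical and is not part of MSAML; the snippet assumes Python's standard xml.etree.ElementTree parser.

```python
import xml.etree.ElementTree as ET


def resolve_value(elem, attr="value"):
    """Return the attribute if it is present, else the character content.

    Hypothetical helper: the name and signature are assumptions made for
    this sketch, not something defined by MSAML.
    """
    if attr in elem.attrib:
        return elem.attrib[attr]
    # Fall back to the text stored inside the element.
    return "".join(elem.itertext()).strip()


# The attribute and the character content disagree; the attribute wins.
seqname = ET.fromstring('<seqname value="AE000925_15">stale label</seqname>')
print(resolve_value(seqname))  # prints AE000925_15
```

An application that rewrites only the character content therefore cannot silently change what an MSAML-aware reader sees.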

Document Formulation

An MSAML document consists of one or more multiple sequence alignments.

<!ELEMENT MSAML (alignment)*>

An alignment consists of blocks of two or more sequences, each block possibly preceded or followed by a conservation string. Scoring information blocks may flank the set of sequence/conservation blocks.

<!ELEMENT alignment (scoring?,((conservation*,sequence,sequence+)+| (sequence,sequence+,conservation*)+)+,scoring?)>
An alignment element can optionally have an attribute describing the algorithm used to create the alignment, and an attribute giving the conservation data for the alignment (whose value would override that of any conservation child element of the same alignment).
<!ATTLIST alignment conservation CDATA #IMPLIED
                    algorithm (Smith-Waterman|weighted-average|HMM|consensus-word|%algnames;) #IMPLIED>

A sequence consists of sequence characters (i.e. one letter protein or DNA abbreviations) and gap characters. Applications should ignore whitespace. The sequence may also have seqname child elements flanking the sequence.

<!ELEMENT sequence (%string;?,seqname?,%string;?,seq,%string;?)>
A sequence may have a value attribute containing the sequence itself, overriding any character content for the element. A sequence may also have an attribute seqname which must be a unique identifier for the sequence in the alignment, and supersedes any value found in the seqname child element.
<!ATTLIST sequence seqname CDATA #IMPLIED>

A seqname is a text label to identify a sequence.
<!ELEMENT seqname (%string;)>
The identifier can be stored internally as well, and would override any character content if present.
<!ATTLIST seqname value CDATA #IMPLIED>
Note: These seqnames are important, especially if the sequences are segmented into multiple fragments. The seqname is the only way to tell that several sequence elements should be concatenated to form a single sequence in an alignment. Any sequence without a seqname is considered to be a standalone sequence. For example:
<alignment>
<sequence><seqname>AE000925_15</seqname> CGATGCCATAAAGGAATCTGAGAAAAGGGAAAAGCCATCTTCCAGCTCAA<gap>----</gap>TGAATT</sequence>
<sequence><seqname>PW56247_2</seqname>   TGTCGTTGGAAGAGTCTCCGAAGAGGAAAATAAGGTGCCCTGAAGTTCAAGTTCCTCAGT</sequence>
<sequence><seqname>MJU67510_9</seqname>  AAATGGGAAATGATTTATAAATTCAACATTAGGAATTTAA<gap>-</gap>GGATGATAATTATGAAACT</sequence>
<sequence><seqname>MJU67490_4</seqname>  AAACGTTTTA<gap>-</gap>GAGGAAGGGAGTAAATTAGAGGAGTTTAATGAATTAGAATTGT<gap>----</gap>CT</sequence>

<sequence><seqname>AE000925_15</seqname> C---AGAAAT----TGACAGGGAACTCCGGGA---AATAATTGAAAAT---AGC------</sequence>
<sequence><seqname>PW56247_2</seqname>   C---AAACAT----AGATGAGGAGTTGGAGAA---AATAGTAGAGCAG---ATA------</sequence>
<sequence><seqname>MJU67510_9</seqname>  CGTAAAAGATGCCTTATCAAGAAGTGATACAACCAGATATTTAAAAGATGAGTTTGG--A</sequence>
<sequence><seqname>MJU67490_4</seqname>  C-CAGAGGAT--------AAGGAATTATTGGA----ATATTTGCAACA--AACT------</sequence>
</alignment>

The same alignment without the labels must be seen as 8 different sequences, which would be biologically wrong.

<alignment>
<sequence>CGATGCCATAAAGGAATCTGAGAAAAGGGAAAAGCCATCTTCCAGCTCAA----TGAATT</sequence>
<sequence>TGTCGTTGGAAGAGTCTCCGAAGAGGAAAATAAGGTGCCCTGAAGTTCAAGTTCCTCAGT</sequence>
<sequence>AAATGGGAAATGATTTATAAATTCAACATTAGGAATTTAA-GGATGATAATTATGAAACT</sequence>
<sequence>AAACGTTTTA-GAGGAAGGGAGTAAATTAGAGGAGTTTAATGAATTAGAATTGT----CT</sequence>

<sequence>C---AGAAAT----TGACAGGGAACTCCGGGA---AATAATTGAAAAT---AGC------</sequence>
<sequence>C---AAACAT----AGATGAGGAGTTGGAGAA---AATAGTAGAGCAG---ATA------</sequence>
<sequence>CGTAAAAGATGCCTTATCAAGAAGTGATACAACCAGATATTTAAAAGATGAGTTTGG--A</sequence>
<sequence>C-CAGAGGAT--------AAGGAATTATTGGA----ATATTTGCAACA--AACT------</sequence>
</alignment>

This type of segmentation is usually for display purposes, and so this situation may not occur in all applications, but the use of seqnames is suggested whenever possible as it makes it easier to refer to sequences in an alignment in a practical manner.
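
The concatenation of segmented sequences by seqname can be sketched as follows. This is a minimal sketch rather than a reference implementation: the function names and the unnamed_N labels for standalone sequences are hypothetical, the sequence fragments are shortened stand-ins for the example above, and value attribute overrides on sequence and seq elements are left out for brevity.

```python
import xml.etree.ElementTree as ET


def sequence_name(seq_elem):
    """Name of a sequence: the seqname attribute first, then the child element."""
    if "seqname" in seq_elem.attrib:
        return seq_elem.attrib["seqname"]
    child = seq_elem.find("seqname")
    if child is None:
        return None
    if "value" in child.attrib:
        return child.attrib["value"]
    return (child.text or "").strip() or None


def sequence_residues(seq_elem):
    """Collect residue and gap characters, skipping the seqname label text."""
    parts = [seq_elem.text or ""]
    for child in seq_elem:
        if child.tag != "seqname":
            parts.append("".join(child.itertext()))  # seq or gap content
        parts.append(child.tail or "")
    return "".join("".join(parts).split())  # whitespace is ignored


def join_fragments(alignment):
    """Concatenate fragments that share a seqname, in document order."""
    joined = {}
    anonymous = 0
    for seq in alignment.findall("sequence"):
        name = sequence_name(seq)
        if name is None:
            # No seqname: treated as a standalone sequence (label is made up).
            anonymous += 1
            name = f"unnamed_{anonymous}"
        joined[name] = joined.get(name, "") + sequence_residues(seq)
    return joined


doc = """<alignment>
<sequence><seqname>AE000925_15</seqname> CGAT<gap>--</gap>TG</sequence>
<sequence><seqname>PW56247_2</seqname>   TGTCGTTG</sequence>
<sequence><seqname>AE000925_15</seqname> C--AGA</sequence>
<sequence><seqname>PW56247_2</seqname>   C--AAA</sequence>
</alignment>"""
print(join_fragments(ET.fromstring(doc)))
# prints {'AE000925_15': 'CGAT--TGC--AGA', 'PW56247_2': 'TGTCGTTGC--AAA'}
```

Run on the same fragments without seqname elements, the function returns four separate entries, which mirrors the unlabeled case described above.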

The seq element contains the sequence data itself: character data interspersed with optional gap elements.

<!ELEMENT seq (%string;|gap)+>
The seq may internalize its value, overriding any character content.
<!ATTLIST seq value CDATA #IMPLIED>

The conservation element contains the consensus string. Since an alignment can only have one consensus, any conservation values within an alignment must be concatenated.

<!ELEMENT conservation (%string;|gap)+>
<!ATTLIST conservation value CDATA #IMPLIED>
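
One way an application might assemble the single consensus string of an alignment is sketched below. The function name consensus is hypothetical. The sketch assumes that a conservation attribute on the alignment overrides all conservation child elements, that each child's value attribute overrides its character content, and that the children are concatenated in document order. Whitespace is deliberately left untouched here, since this definition does not say whether it is significant in a conservation string.

```python
import xml.etree.ElementTree as ET


def consensus(alignment):
    """Return the consensus string for one alignment element."""
    # An attribute on the alignment overrides any conservation children.
    if "conservation" in alignment.attrib:
        return alignment.attrib["conservation"]
    parts = []
    for cons in alignment.findall("conservation"):
        if "value" in cons.attrib:
            parts.append(cons.attrib["value"])  # attribute takes precedence
        else:
            parts.append("".join(cons.itertext()))  # includes gap children
    return "".join(parts)


doc = (
    "<alignment>"
    "<conservation>C**A</conservation>"
    "<sequence>CGTA</sequence><sequence>CAAA</sequence>"
    '<conservation value="**G">ignored</conservation>'
    "<sequence>TTG</sequence><sequence>AAG</sequence>"
    "</alignment>"
)
print(consensus(ET.fromstring(doc)))  # prints C**A**G
```
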
<!ELEMENT gap (%string;)>
<!ATTLIST gap length CDATA #IMPLIED
              at CDATA #IMPLIED>
<!ELEMENT scoring (param)*>
<!ELEMENT param ((%string;)?,pname,(%string;)?,pvalue)>
<!ATTLIST param pname (GapCreationPenalty|GapExtensionPenalty|%paramnames;) #IMPLIED
                pvalue CDATA #IMPLIED>
<!ELEMENT pname (%string;)>
<!ELEMENT pvalue (%string;)>
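
The param declarations follow the second formulation rule: pname and pvalue may be given either as child elements or as attributes of the same name. A minimal sketch of reading them, with the attribute taking precedence, might look like this. The function names and the penalty values 10 and 0.5 are made up for illustration.

```python
import xml.etree.ElementTree as ET


def attr_or_child(elem, name):
    """Read a datum from the attribute `name`, else from the child `name`."""
    if name in elem.attrib:
        return elem.attrib[name]
    child = elem.find(name)
    if child is None:
        return None
    return (child.text or "").strip()


def scoring_params(alignment):
    """Collect pname -> pvalue pairs from every scoring block of an alignment."""
    params = {}
    for param in alignment.iterfind("scoring/param"):
        name = attr_or_child(param, "pname")
        if name is not None:
            params[name] = attr_or_child(param, "pvalue")
    return params


doc = """<alignment>
<scoring>
<param pname="GapCreationPenalty"><pname>stale</pname><pvalue>10</pvalue></param>
<param><pname>GapExtensionPenalty</pname><pvalue>0.5</pvalue></param>
</scoring>
<sequence>CGTA</sequence>
<sequence>CGAA</sequence>
</alignment>"""
print(scoring_params(ET.fromstring(doc)))
# prints {'GapCreationPenalty': '10', 'GapExtensionPenalty': '0.5'}
```

Values are returned as strings because the DTD declares them as CDATA; converting them to numbers is left to the application.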

Other considerations

For compatibility between applications, nucleotide sequences must use the single letter representations from the IUPAC nucleotide nomenclature. Amino acid residues must use the single letter representations from the IUPAC Nomenclature and Symbolism for Amino Acids and Peptides.
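
An application could check the nucleotide requirement with something like the sketch below. The symbol set is the standard IUPAC single-letter nucleotide code; treating "-" as the gap character follows the examples in this document, and accepting lower-case letters is an assumption of the sketch. The function name is hypothetical.

```python
# IUPAC single-letter nucleotide codes, including the ambiguity symbols.
IUPAC_NUCLEOTIDES = set("ACGTURYSWKMBDHVN")
# Gap character as used in the examples of this document (an assumption).
GAP_CHARS = set("-")


def invalid_nucleotide_symbols(sequence):
    """Return the sorted symbols that are neither IUPAC codes nor gaps."""
    symbols = {ch for ch in sequence.upper() if not ch.isspace()}
    return sorted(symbols - IUPAC_NUCLEOTIDES - GAP_CHARS)


print(invalid_nucleotide_symbols("CGAT--TGN"))  # prints []
print(invalid_nucleotide_symbols("CGXT-J"))     # prints ['J', 'X']
```

A corresponding check for amino acid sequences would use the IUPAC amino acid symbol set instead.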

Possible Uses