MSAML was formulated to make it easier to manipulate and extract multiple sequence alignment (MSA) information by logically defining the parts of an alignment for use in an XML-conformant application.
MSAML is not an attempt to make an error-proof MSA format. An MSAML document may pass a validating parser test and still contain some errors (e.g. miscalled consensus). It is also not an attempt to capture all facets of an MSA since this would require two-dimensional data formatting. This definition should be considered "lightweight".
There are two rules which governed the formulation of the MSAML DTD, as well as all my other DTDs:
value
The net effect of these rules is that most data that is stored as element character content can be stored in attributes. Another rule resolves ambiguous cases:
When both an attribute and character content supply the same data, the attribute takes precedence. The reason for giving attribute data precedence is that the character content version may be modified by another program unaware of MSAML.
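The precedence rule can be illustrated with a short sketch. It assumes Python's standard xml.etree.ElementTree module; the helper name resolved_value is hypothetical and not part of MSAML.

```python
import xml.etree.ElementTree as ET

def resolved_value(elem):
    """Return the data an MSAML-aware reader would use for this element.

    Assumption: the attribute is named 'value', as in the seqname, seq and
    conservation declarations of the DTD.
    """
    attr = elem.get("value")
    if attr is not None:
        # Attribute data takes precedence over character content.
        return attr
    # Otherwise fall back to the character content (including child text).
    return "".join(elem.itertext()).strip()

# The character content was changed by some other tool; the attribute wins.
edited = ET.fromstring('<seqname value="AE000925_15">renamed elsewhere</seqname>')
print(resolved_value(edited))

# No attribute present, so the character content is used.
plain = ET.fromstring("<seqname>PW56247_2</seqname>")
print(resolved_value(plain))
```

An explicit None check is used rather than a truthiness test so that an empty value attribute still overrides the character content; whether an empty attribute should count is a design choice the text does not settle.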
An MSAML document consists of one or more multiple sequence alignments.
<!ELEMENT MSAML (alignment)*>
An alignment consists of blocks of two or more sequences, possibly preceded or followed by a conservation string. Scoring information blocks may flank the set of sequence/conservation blocks.
<!ELEMENT alignment (scoring?,((conservation*,sequence,sequence+)+|
(sequence,sequence+,conservation*)+)+,scoring?)>
An alignment element can optionally have an attribute describing the algorithm
used to create the alignment, and an attribute giving the conservation data
for the alignment (whose value would override that of any conservation child
element of the same alignment).
<!ATTLIST alignment conservation CDATA #IMPLIED
algorithm (Smith-Waterman|weighted-average|HMM|consensus-word|%algnames;) #IMPLIED>
A sequence consists of sequence characters (i.e. one-letter protein or DNA abbreviations) and gap characters. Applications should ignore whitespace. The sequence may also have a seqname child element alongside the sequence data.
<!ELEMENT sequence (%string;?,seqname?,%string;?,seq,%string;?)>
A sequence may have a value attribute containing the sequence itself, overriding any character content for the element.
A sequence may also have an attribute seqname, which must be a unique identifier for the sequence in the alignment and supersedes any value found in the seqname child element.
<!ATTLIST sequence seqname CDATA #IMPLIED>
A seqname is a text label to identify a sequence.
<!ELEMENT seqname (%string;)>
The identifier can also be stored in a value attribute, which overrides any character content if present.
<!ATTLIST seqname value CDATA #IMPLIED>
Note: These seqnames are important, especially if the sequences are segmented into multiple fragments. The seqname is the only way to tell that several sequence elements should be concatenated to form a single sequence in an alignment. Any sequence without a seqname is considered to be a standalone sequence. For example:
<alignment>
<sequence><seqname>AE000925_15</seqname> CGATGCCATAAAGGAATCTGAGAAAAGGGAAAAGCCATCTTCCAGCTCAA<gap>----</gap> TGAATT </sequence>
<sequence><seqname>PW56247_2</seqname> TGTCGTTGGAAGAGTCTCCGAAGAGGAAAATAAGGTGCCCTGAAGTTCAAGTTCCTCAGT </sequence>
<sequence><seqname>MJU67510_9</seqname> AAATGGGAAATGATTTATAAATTCAACATTAGGAATTTAA<gap>-</gap> GGATGATAATTATGAAACT </sequence>
<sequence><seqname>MJU67490_4</seqname> AAACGTTTTA<gap>-</gap> GAGGAAGGGAGTAAATTAGAGGAGTTTAATGAATTAGAATTGT<gap>----</gap> CT </sequence>
<sequence><seqname>AE000925_15</seqname> C---AGAAAT----TGACAGGGAACTCCGGGA---AATAATTGAAAAT---AGC------ </sequence>
<sequence><seqname>PW56247_2</seqname> C---AAACAT----AGATGAGGAGTTGGAGAA---AATAGTAGAGCAG---ATA------ </sequence>
<sequence><seqname>MJU67510_9</seqname> CGTAAAAGATGCCTTATCAAGAAGTGATACAACCAGATATTTAAAAGATGAGTTTGG--A </sequence>
<sequence><seqname>MJU67490_4</seqname> C-CAGAGGAT--------AAGGAATTATTGGA----ATATTTGCAACA--AACT------ </sequence>
</alignment>
The same alignment without the labels would have to be read as eight separate sequences, which would be biologically wrong.
<alignment>
<sequence>CGATGCCATAAAGGAATCTGAGAAAAGGGAAAAGCCATCTTCCAGCTCAA----TGAATT </sequence>
<sequence>TGTCGTTGGAAGAGTCTCCGAAGAGGAAAATAAGGTGCCCTGAAGTTCAAGTTCCTCAGT </sequence>
<sequence>AAATGGGAAATGATTTATAAATTCAACATTAGGAATTTAA-GGATGATAATTATGAAACT </sequence>
<sequence>AAACGTTTTA-GAGGAAGGGAGTAAATTAGAGGAGTTTAATGAATTAGAATTGT----CT </sequence>
<sequence>C---AGAAAT----TGACAGGGAACTCCGGGA---AATAATTGAAAAT---AGC------ </sequence>
<sequence>C---AAACAT----AGATGAGGAGTTGGAGAA---AATAGTAGAGCAG---ATA------ </sequence>
<sequence>CGTAAAAGATGCCTTATCAAGAAGTGATACAACCAGATATTTAAAAGATGAGTTTGG--A </sequence>
<sequence>C-CAGAGGAT--------AAGGAATTATTGGA----ATATTTGCAACA--AACT------ </sequence>
</alignment>
This type of segmentation is usually for display purposes, so the situation may not occur in all applications. The use of seqnames is nevertheless recommended whenever possible, as it makes it easier to refer to sequences in an alignment in a practical manner.
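As a rough sketch of how a reader might reassemble segmented sequences, the following groups sequence elements by their resolved seqname and concatenates the fragments in document order. It assumes Python's xml.etree.ElementTree; the function names (sequence_name, sequence_residues, merge_segments) and the short test alignment are hypothetical, chosen only for illustration.

```python
import re
import xml.etree.ElementTree as ET

def sequence_name(seq_elem):
    """Resolve a sequence's label: the seqname attribute supersedes the child."""
    name = seq_elem.get("seqname")
    if name is not None:
        return name
    child = seq_elem.find("seqname")
    if child is None:
        return None  # unlabeled: treated as a standalone sequence
    value = child.get("value")
    return value if value is not None else "".join(child.itertext()).strip()

def sequence_residues(seq_elem):
    """Collect residues and gap characters, ignoring whitespace and the label."""
    value = seq_elem.get("value")
    if value is not None:
        text = value  # the value attribute overrides character content
    else:
        parts = [seq_elem.text or ""]
        for child in seq_elem:
            if child.tag != "seqname":
                parts.append("".join(child.itertext()))  # e.g. gap or seq content
            parts.append(child.tail or "")  # text following the child element
        text = "".join(parts)
    return re.sub(r"\s+", "", text)

def merge_segments(alignment):
    """Return (labeled sequences concatenated by name, standalone sequences)."""
    merged = {}      # dicts keep insertion order in Python 3.7+
    standalone = []
    for seq in alignment.findall("sequence"):
        name = sequence_name(seq)
        residues = sequence_residues(seq)
        if name is None:
            standalone.append(residues)
        else:
            merged[name] = merged.get(name, "") + residues
    return merged, standalone

doc = """<alignment>
<sequence><seqname>s1</seqname> ACGT<gap>--</gap> AC </sequence>
<sequence><seqname>s2</seqname> ACGTTTAC </sequence>
<sequence><seqname>s1</seqname> GG-A </sequence>
<sequence seqname="s2"><seqname>ignored</seqname> GGTA </sequence>
</alignment>"""
merged, standalone = merge_segments(ET.fromstring(doc))
print(merged)  # {'s1': 'ACGT--ACGG-A', 's2': 'ACGTTTACGGTA'}
```

In the last sequence element the seqname attribute wins over the child element, so the fragment is appended to s2 rather than to a sequence called "ignored".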
The seq element holds the sequence data itself: character content interspersed with gap elements.
<!ELEMENT seq (%string;|gap)+>
The seq may store its value in a value attribute, which overrides any character content.
<!ATTLIST seq value CDATA #IMPLIED>
The conservation element contains the consensus string. Since an alignment can only have one consensus, any conservation values within an alignment must be concatenated.
<!ELEMENT conservation (%string;|gap)+>
<!ATTLIST conservation value CDATA #IMPLIED>
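A minimal sketch of how an application might derive the single consensus for an alignment follows. It assumes Python's xml.etree.ElementTree and a hypothetical helper named consensus. The conservation attribute on the alignment is checked first, then the conservation children are concatenated in document order, with each child's value attribute overriding its character content.

```python
import xml.etree.ElementTree as ET

def consensus(alignment):
    """Return the one consensus string for an alignment element."""
    override = alignment.get("conservation")
    if override is not None:
        # The alignment-level attribute overrides all conservation children.
        return override
    parts = []
    for cons in alignment.findall("conservation"):
        value = cons.get("value")
        if value is not None:
            parts.append(value)  # attribute overrides character content
        else:
            parts.append("".join(cons.itertext()))  # text plus gap children
    # Whitespace is left untouched here; whether it is significant in a
    # conservation string is an application decision.
    return "".join(parts)

doc = ET.fromstring(
    "<alignment>"
    "<conservation>AC<gap>-</gap>GT</conservation>"
    "<sequence>ACAGT</sequence><sequence>AC-GT</sequence>"
    "<conservation value='TTGA'>stale text</conservation>"
    "<sequence>TTGA</sequence><sequence>TTGA</sequence>"
    "</alignment>"
)
print(consensus(doc))  # AC-GTTTGA
```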
<!ELEMENT gap (%string;)>
<!ATTLIST gap length CDATA #IMPLIED
at CDATA #IMPLIED>
<!ELEMENT scoring (param)*>
<!ELEMENT param ((%string;)?,pname,(%string;)?,pvalue)>
<!ATTLIST param pname (GapCreationPenalty|GapExtensionPenalty|%paramnames;) #IMPLIED
pvalue CDATA #IMPLIED>
<!ELEMENT pname (%string;)>
<!ELEMENT pvalue (%string;)>
For compatibility between applications, nucleotide sequences must use the single letter representations from the IUPAC nucleotide nomenclature. Amino acid residues must use the single letter representations from the IUPAC Nomenclature and Symbolism for Amino Acids and Peptides.
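The alphabet requirement lends itself to a simple check. The sketch below is illustrative only. The symbol sets are the IUPAC single-letter nucleotide codes and the twenty standard amino acid codes plus B, Z and X. Treating "-" as the only gap character and accepting lowercase input by upper-casing it are assumptions, as is the helper name invalid_symbols.

```python
# IUPAC nucleotide codes, including ambiguity codes.
IUPAC_NUCLEOTIDE = set("ACGTURYSWKMBDHVN")
# Twenty standard amino acids plus B (Asx), Z (Glx) and X (unknown).
IUPAC_AMINO_ACID = set("ACDEFGHIKLMNPQRSTVWY" "BZX")
# Assumption: '-' is the only gap character in use.
GAP_CHARS = set("-")

def invalid_symbols(residues, alphabet):
    """Return the sorted list of characters outside the given alphabet.

    Whitespace is skipped, since applications should ignore it; case is
    folded to upper case, which is an assumption of this sketch.
    """
    allowed = alphabet | GAP_CHARS
    return sorted({ch for ch in residues.upper()
                   if not ch.isspace() and ch not in allowed})

print(invalid_symbols("ACGT--NNRY", IUPAC_NUCLEOTIDE))  # []
print(invalid_symbols("ACGTJ?", IUPAC_NUCLEOTIDE))      # ['?', 'J']
print(invalid_symbols("MKV-LLO", IUPAC_AMINO_ACID))     # ['O']
```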