[Mirrored from: http://www.cstr.ed.ac.uk/projects/ssml_details.html, incompletely linked]
The concept behind SSML was developed by Paul Taylor in 1992. After many delays, a first prototype was developed by Amy Isard and Paul in the summer of 1995. SSML version 1.0 is now fully incorporated into the Festival speech synthesis system, which is available for general use. This page describes this version; a more detailed description can be found in P. A. Taylor and A. Isard, "SSML: A speech synthesis markup language", Speech Communication. We would like SSML to develop into a standard, and that can only be achieved by hearing the opinions of fellow developers and users in the speech synthesis community. CSTR is setting up an informal international forum for SSML discussions. If you'd like to co-operate on the development of SSML, please email Paul Taylor (Paul.Taylor@ed.ac.uk).
Below is the text of a paper by Paul and Amy on SSML.
A common solution to this problem is to allow for annotations in the text. Many commercial systems, such as Entropic's TrueTalk and DEC's Dectalk, allow users of the systems to place annotations (often as escape sequences) in the text, which, for example, instruct the synthesizer to place emphasis on particular words or adopt a particular pronunciation for a word. Although this form of control will not help in situations where pure text can be the only input, in many applications it is feasible for a user of the system to be able to add annotations that will help to improve the quality.
While the annotation approach goes some way towards solving the problem, it suffers from lack of standardisation in that a set of annotations for one system will not work in another. Often the differences in annotations are due to historical quirks and are unnecessary. To counter this problem, we have developed SSML, a speech synthesis markup language. SSML is designed to be a standard annotation language, being sufficiently general to be used by any speech synthesizer.
In fact SGML is a meta-markup language, in that it is a framework within which actual generalised markup languages are formulated, and as such SGML is not strictly limited to document processing in the normal sense. The best known example of a language written using SGML is the Hyper-Text Markup Language (HTML) which is the language used to create multimedia documents on the World Wide Web.
In the development of SGML, it became apparent that it would be impossible to create a standard based on most text-formatting languages around at the time, as these languages often incorporated commands which were specific to the system that the text-formatter was being used on, and would be inappropriate on another system. To counter this problem, a division was made between the internal structure of a document on the one hand, and its physical appearance on the other. It was seen that authors should concentrate only on the structure of the document, and not worry (initially) about how the document would actually appear: that task would be carried out by a special text-formatter using a set of style instructions.
In this way it is possible for authors to generate documents using high-level formatting commands (such as "emphasis") which can be processed into their final form on a variety of systems. HTML is a very successful application of SGML in that there are millions of pages of HTML now available world-wide, and they can all be viewed on a variety of browsers, such as Netscape, Arena, Mosaic and Emacs.
There are strong parallels between the situation in document processing and the annotation concept in speech synthesizers. We have developed SSML with a similar methodology to HTML and other SGML applications in that we have tried to provide the basis for a standard by separating the structural and physical aspects of the input. This allows users to annotate text for speech synthesizers without having to be familiar with the internal details of the synthesizer itself.
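The separation described above can be illustrated with a minimal Python sketch. It assumes two fictional synthesizers, here called `system_a` and `system_b`, whose annotation styles are invented purely for illustration and do not correspond to any real product. The same structural <emph> markup is rendered into each system's own annotation, so the author of the text never needs to know either format.

```python
import re

# Hypothetical annotation styles for two fictional synthesizers.
# These formats are invented for this sketch and match no real system.
STYLES = {
    "system_a": ("[+emph]", "[-emph]"),
    "system_b": ("{E:", "}"),
}

# Matches one <emph> ... </emph> pair and captures the enclosed words,
# ignoring the whitespace just inside the tags.
EMPH = re.compile(r"<emph>\s*(.*?)\s*</emph>", re.DOTALL)


def render(text, system):
    """Replace structural <emph> tags with one system's own annotation."""
    start, end = STYLES[system]
    return EMPH.sub(lambda m: start + m.group(1) + end, text)


source = "there is an emphasised word <emph> here </emph>"
print(render(source, "system_a"))
print(render(source, "system_b"))
```

Only the `STYLES` table changes from one synthesizer to the next; the annotated text itself stays the same.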
<ssml> .... document .... </ssml> This is used to start and end documents. This facility allows SSML documents to be included inside HTML and other types of document.
<language="your-language"> SSML is not tied to any language, and assumes the default of the system. For multi-lingual synthesis, this command can be used to change language in the middle of a document. Examples:
<emph> word </emph> Words can be emphasised by surrounding them with the <emph> tags. In English one word per phrase is always emphasised, and therefore if no <emph> tags exist within a phrase, default rules are used for emphasis. Example: there is an emphasised word <emph> here </emph>
<define word=(identifier) pro=(pronunciation) standard=(lexicon standard)> This command is used to define or redefine the pronunciation of a word. The word field serves as an identifier and the pro field defines the pronunciation. Different synthesis systems have different ways of specifying the pronunciation of entries in a lexicon, so care was taken to make these definitions flexible. The "standard" field gives the name of a pre-defined lexicon standard, allowing different phonemic alphabets to be used. Example: <define word="Edinburgh" phonemes="E * d . i n . b r @@" standard="cstr">.
<sound src=(sound file)> This tag instructs the system to play a sound file.
<ssml><phrase> The train now standing on platform <emph> A </emph> <phrase> is the B to <emph> C </emph> </ssml>. In this way, the correct prosody would be ensured without the programmer needing to know anything about the synthesizer itself. SSML thus protects the programmer from the specifics of the synthesis system, in much the same way as a C compiler protects a programmer from the specifics of machine code.
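As a minimal sketch of how an application program might generate such markup, the Python function below fills in the platform, train and destination of the announcement above. The function name `announcement` and its parameters are assumptions made for illustration; they are not part of SSML or Festival.

```python
def announcement(platform, train, destination):
    """Build the SSML train announcement, emphasising platform and destination."""
    # The tag layout copies the example in the text; only the three
    # variable slots are filled in by the program.
    return (
        "<ssml><phrase> The train now standing on platform "
        f"<emph> {platform} </emph> <phrase> is the {train} to "
        f"<emph> {destination} </emph> </ssml>"
    )


# Using the placeholders A, B and C reproduces the example from the text.
print(announcement("A", "B", "C"))
```

The program decides only which words carry emphasis and where the phrases begin; how that emphasis is realised is left to the synthesizer.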
As SSML is simple and doesn't necessarily need much linguistic knowledge to operate, it should be usable by most people.