[Mirrored from: http://www.cstr.ed.ac.uk/projects/ssml_details.html, incompletely linked]

SSML: A Speech Synthesis Markup Language

Introduction

SSML is a speech synthesis markup language. What we mean by "speech synthesis markup language" is described in some detail below, but in short SSML is a way of adding annotations to text for use in guiding pronunciation when those texts are being spoken.

The concept behind SSML was developed by Paul Taylor in 1992. After many delays, a first prototype was developed by Amy Isard and Paul in the summer of 1995. SSML version 1.0 is now fully incorporated into the Festival speech synthesis system, which is available for general use. This page describes that version; a more detailed description can be found in P. A. Taylor and A. Isard, "SSML: A speech synthesis markup language", Speech Communication. We would like SSML to develop into a standard, and that can only be achieved by hearing the opinions of fellow developers and users in the speech synthesis community. CSTR is setting up an informal international forum for SSML discussions. If you'd like to co-operate on the development of SSML, please email Paul Taylor (Paul.Taylor@ed.ac.uk).

Below is the text of a paper by Paul and Amy on SSML.

Enriched Text

It is well acknowledged that speech synthesis systems which use text as their only form of input have great difficulty in always generating high quality pronunciation and prosody. This is largely due to the fact that at present, text analysis components are only capable of a superficial analysis of the text, and have no ability to understand the text being spoken.

A common solution to this problem is to allow for annotations in the text. Many commercial systems, such as Entropic's TrueTalk and DEC's Dectalk, allow users of the systems to place annotations (often as escape sequences) in the text, which, for example, instruct the synthesizer to place emphasis on particular words or adopt a particular pronunciation for a word. Although this form of control will not help in situations where pure text can be the only input, in many applications it is feasible for a user of the system to be able to add annotations that will help to improve the quality.

While the annotation approach goes some way towards solving the problem, it suffers from lack of standardisation in that a set of annotations for one system will not work in another. Often the differences in annotations are due to historical quirks and are unnecessary. To counter this problem, we have developed SSML, a speech synthesis markup language. SSML is designed to be a standard annotation language, being sufficiently general to be used by any speech synthesizer.

Generalised Markup

It is useful to draw a parallel between this situation and that which exists in printing and document processing. In years gone by, a typical computer environment might have consisted of a single mainframe which had its own programming language and text formatter. Although every writer of a document needed to specify things such as headings, italicised words and paragraph breaks, such effects were handled differently on different systems. To combat this incompatibility, first the Generalised Markup Language and then the Standard Generalised Markup Language (SGML) were developed, with the intention of providing a framework for ``machine-independent'' document processing.

In fact SGML is a meta-markup language, in that it is a framework within which actual generalised markup languages are formulated, and as such SGML is not strictly limited to document processing in the normal sense. The best known example of a language written using SGML is the Hyper-Text Markup Language (HTML) which is the language used to create multimedia documents on the World Wide Web.

In the development of SGML, it became apparent that it would be impossible to create a standard based on most text-formatting languages around at the time, as these languages often incorporated commands which were specific to the system that the text-formatter was being used on, and would be inappropriate on another system. To counter this problem, a division was made between the internal structure of a document on the one hand, and its physical appearance on the other. It was seen that authors should concentrate only on the structure of the document, and not worry (initially) about how the document would actually appear: that task would be carried out by a special text-formatter using a set of style instructions.

In this way it is possible for authors to generate documents using high-level formatting commands (such as ``emphasis'') which can be processed into their final form on a variety of systems. HTML is a very successful application of SGML: millions of pages of HTML are now available world-wide, and they can all be viewed on a variety of browsers, such as Netscape, Arena, Mosaic, Emacs etc.

There are strong parallels between the situation in document processing and the annotation concept in speech synthesizers. We have developed SSML with a similar methodology to HTML and other SGML applications in that we have tried to provide the basis for a standard by separating the structural and physical aspects of the input. This allows users to annotate text for speech synthesizers without having to be familiar with the internal details of the synthesizer itself.

SSML

To date, we have developed a first version of SSML and a ``markup-to-speech'' interpreting system which takes an SSML document as input and produces synthetic speech. Below we give a brief description of some of the markup tags used in SSML. (In SGML, start tags look like <tag> and end tags look like </tag>. Sometimes no end tag is necessary.)

<ssml> .... document .... </ssml>
This is used to start and end documents. This facility allows SSML documents to be included inside HTML and other types of document.

<language="your-language">
SSML is not tied to any language, and assumes the default of the system. For multi-lingual synthesis, this command can be used to change language in the middle of a document. Examples:

<phrase>
Major (intonation) phrases are used as the primary unit of prosodic structure in SSML. Phrases can take attributes which specify the speech act of the phrase, for example "yn-question" (yes/no question), "wh-question" or "statement". If no speech act type is given, "statement" is assumed as the default.

<emph> word </emph>
Words can be emphasised by surrounding them with the <emph> tags. In English one word per phrase is always emphasised, and therefore if no <emph> tags exist within a phrase, default rules are used for emphasis. Example: there is an emphasised word <emph> here </emph>

<define word=(identifier) pro=(pronunciation) standard=(lexicon standard)>
This command is used to define or redefine the pronunciation of a word. The word field serves as an identifier and the pro field defines the pronunciation. Different synthesis systems will have different ways of specifying the pronunciation of entries in a lexicon, and so care was taken to make these definitions flexible. The "standard" field gives the name of a pre-defined lexicon standard, allowing different phonemic alphabets to be used. Example: <define word="Edinburgh" phonemes="E * d . i n . b r @@" standard="cstr">

<sound src=(sound file)>
This tag instructs the system to play a sound file.
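
As a rough illustration of how a program might begin to separate the tags described above from the text to be spoken, the following Python sketch splits a marked-up string into tag and text tokens. It is not the CSTR interpreter: the names TAG and tokenize are hypothetical, and the single regular expression is a simplifying assumption that ignores most of what a real SGML parser has to handle.

```python
import re

# Assumption: a tag is anything of the form <...> with no nested angle
# brackets. Real SGML parsing is considerably more involved.
TAG = re.compile(r"<[^<>]+>")


def tokenize(markup):
    """Split marked-up text into ("tag", ...) and ("text", ...) tokens."""
    tokens = []
    pos = 0
    for match in TAG.finditer(markup):
        # Any non-blank text before this tag becomes a text token.
        text = markup[pos:match.start()].strip()
        if text:
            tokens.append(("text", text))
        tokens.append(("tag", match.group()))
        pos = match.end()
    # Text left over after the last tag.
    tail = markup[pos:].strip()
    if tail:
        tokens.append(("text", tail))
    return tokens


if __name__ == "__main__":
    print(tokenize("there is an emphasised word <emph> here </emph>"))
```

Applied to the <emph> example above, this yields the text "there is an emphasised word", the tag <emph>, the text "here" and the tag </emph>, in that order. An interpreter would then decide what each tag means for its own synthesizer.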

Interpreting and Using SSML documents

SSML does not contain explicit instructions as to how documents should be processed, as this is left to the individual synthesis interpreters. At present only one such interpreter exists, but we are confident that SSML has been designed carefully enough to allow use on most synthesis systems. The current synthesis interpreter processes the tags in the way one would expect: short pauses are inserted at

It is not envisaged that SSML will be used in quite the same way as HTML or other document markup languages. We don't expect users to sit and write lengthy SSML documents with all the desired markup for use with the interpreter. The envisaged applications lie more in the area of language generation/synthesis interfaces. If a railway timetable announcement system is being designed, the programmer may write a slot and filler synthesis program which prints out the complete text of an announcement and passes it to a synthesizer. For example the carrier sentence could be ``The train now standing on platform A is the B to C'' and the slots A, B and C could be filled as appropriate. Using SSML, additional prosodic information could be added in, for example:

<ssml><phrase> The train now standing on platform <emph> A </emph>
<phrase> is the B to <emph> C </emph> </ssml>.
In this way, the correct prosody would be ensured without the programmer needing to know anything about the synthesizer itself. SSML thus protects the programmer from the specifics of the synthesis system, in much the same way as a C compiler protects a programmer from the specifics of machine code.
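
The slot and filler idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names CARRIER and announcement and the sample slot values are hypothetical, and escaping the slot values with xml.sax.saxutils.escape is our own precaution rather than anything SSML prescribes. The resulting string would be handed to whatever SSML interpreter is available.

```python
from xml.sax.saxutils import escape

# Carrier sentence from the railway example, with the slots A, B and C
# renamed to platform, service and destination (hypothetical names).
CARRIER = (
    "<ssml><phrase> The train now standing on platform <emph> {platform} </emph>\n"
    "<phrase> is the {service} to <emph> {destination} </emph> </ssml>"
)


def announcement(platform, service, destination):
    """Fill the carrier sentence's slots and return SSML-annotated text."""
    # Assumption: slot values are escaped so that a stray "<" or "&" in
    # the data cannot be mistaken for markup.
    return CARRIER.format(
        platform=escape(platform),
        service=escape(service),
        destination=escape(destination),
    )


if __name__ == "__main__":
    print(announcement("4", "10:15", "Glasgow"))
```

The program that fills the slots never needs to know how the synthesizer realises emphasis or phrasing; it only decides which slots are marked.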

As SSML is simple and doesn't necessarily need much linguistic knowledge to operate, it should be usable by most people.

Future

At present SSML and the interpreter are at a prototype stage. There are many future developments planned, but apart from the addition of further obvious tags (for instance, more levels in phrasing), the design of SSML can only progress if the language is tried with different synthesis systems and in different applications.