SSML: A Speech Synthesis Markup Language
From: http://www.cstr.ed.ac.uk/publications/1995/Taylor_1995_c.ascii
Paul Taylor and Amy Isard
Centre for Speech Technology Research,
University of Edinburgh, UK.
1. Enriched Text
It is widely acknowledged that speech synthesis systems which use text
as their only form of input have great difficulty in consistently
generating high quality pronunciation and prosody. This is largely due
to the fact that, at present, text analysis components are only capable
of a superficial analysis of the text, and have no ability to
understand the text being spoken.
A common solution to this problem is to allow for annotations in the
text. Many commercial systems, such as Entropic's TrueTalk and DEC's
Dectalk, allow users of the systems to place annotations (often as
escape sequences) in the text, which, for example, instruct the
synthesizer to place emphasis on particular words or adopt a
particular pronunciation for a word. Although this form of control
will not help in situations where pure text can be the only input, in
many applications it is feasible for a user of the system to be able
to add annotations that will help to improve the quality.
While the annotation approach goes some way towards solving the
problem, it suffers from lack of standardisation in that a set of
annotations for one system will not work in another. Often the
differences in annotations are due to historical quirks and are
unnecessary. To counter this problem, we have developed SSML, a
speech synthesis markup language. SSML is designed to be a standard
annotation language, being sufficiently general to be used by any
speech synthesizer.
2. Generalised Markup
It is useful to draw a parallel between this situation and that which
exists in printing and document processing. In years gone by, a
typical computer environment might have consisted of a single
mainframe which had its own programming language and text
formatter. Although every writer of a document needed to specify
things such as headings, italicised words and paragraph breaks, such
effects were handled differently on different systems. To combat this
incompatibility, first the Generalised Markup Language and then the
Standard Generalised Markup Language (SGML) were developed, with the
intention of providing a framework for ``machine-independent''
document processing.
In fact SGML is a meta-markup language, in that it is a framework
within which actual generalised markup languages are formulated, and
as such SGML is not strictly limited to document processing in the
normal sense. The best known example of a language written using SGML
is the Hyper-Text Markup Language (HTML) which is the language used to
create multimedia documents on the World Wide Web.
In the development of SGML, it became apparent that it would be
impossible to create a standard based on most text-formatting
languages around at the time, as these languages often incorporated
commands which were specific to the system that the text-formatter was
being used on, and would be inappropriate on another system. To
counter this problem, a division was made between the internal
structure of a document on the one hand, and its physical appearance
on the other. It was seen that authors should concentrate only on the
structure of the document, and not worry (initially) about how the
document would actually appear: that task would be carried out by a
special text-formatter using a set of style instructions.
In this way it is possible for authors to generate documents using
high-level formatting commands (such as ``emphasis'') which can be
processed into their final form on a variety of systems. HTML is a
very successful application of SGML in that there are millions of
pages of HTML now available world-wide, and they can all be viewed on
a variety of browsers, such as Netscape, Arena, Mosaic, Emacs, etc.
There are strong parallels between the situation in document
processing and the annotation concept in speech synthesizers. We have
developed SSML with a similar methodology to HTML and other SGML
applications in that we have tried to provide the basis for a
standard by separating the structural and physical aspects of the
input. This allows users to annotate text for speech synthesizers
without having to be familiar with the internal details of the
synthesizer itself.
3. SSML
To date, we have developed a first version of SSML and a
``markup-to-speech'' interpreting system which takes an SSML document
as input and produces synthetic speech. Below we give a brief
description of some of the markup tags used in SSML. (In SGML,
start tags look like <tag> and end tags look like </tag>. Sometimes
no end tag is necessary.)
.... document ....
This is used to start and end documents. This facility allows SSML
documents to be included inside HTML and other types of document.
SSML is not tied to any language, and
assumes the default of the system. For multi-lingual synthesis, this
command can be used to change language in the middle of a
document. Examples: ,
Major (intonation) phrases are used as the primary unit of prosodic
structure in SSML. Phrases can take attributes which specify the
speech act of the phrase, for example "yn-question" (yes/no
question), "wh-question" or "statement". If no speech act type is
given, "statement" is assumed as the default.
word
Words can be emphasised by surrounding them with the tags. In
English one word per phrase is always emphasised, and therefore if no
tags exist within a phrase, default rules are used for
emphasis. Example: there is an emphasised word here
This command is used to define or redefine the pronunciation of a
word. The word field serves as an identifier and the pro field defines
the pronunciation. Different synthesis systems will have different
ways of specifying the pronunciation of entries in a lexicon and so
care was taken to make these definitions flexible. The "standard"
field gives the name of a pre-defined lexicon standard, allowing
different phonemic alphabets to be used. Example .
This tag instructs the system to play a sound file.
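
As a rough illustration of the information these tags carry, the
following Python sketch models a phrase with its speech act and a
lexicon of pronunciation definitions. Only the field names (word, pro,
standard) and the three speech act values come from the description
above. The class names, the validation step and the case-insensitive
lookup are assumptions made for illustration, not part of SSML.

```python
from dataclasses import dataclass
from typing import Optional

# Speech act values named in the text; a real system may accept more.
SPEECH_ACTS = ("statement", "yn-question", "wh-question")


@dataclass
class Phrase:
    """A major (intonation) phrase with an optional speech act."""
    text: str
    speech_act: Optional[str] = None

    def resolved_speech_act(self) -> str:
        # "statement" is the default when no speech act type is given.
        act = self.speech_act or "statement"
        if act not in SPEECH_ACTS:
            raise ValueError(f"unknown speech act: {act}")
        return act


@dataclass(frozen=True)
class Definition:
    """One pronunciation definition: identifier, pronunciation, alphabet."""
    word: str       # identifier of the entry
    pro: str        # pronunciation, written in the named standard
    standard: str   # name of a pre-defined lexicon standard


class Lexicon:
    """Holds definitions; a later definition redefines an earlier one."""

    def __init__(self) -> None:
        self._entries = {}

    def define(self, definition: Definition) -> None:
        # Assumption: entries are keyed case-insensitively by word.
        self._entries[definition.word.lower()] = definition

    def lookup(self, word: str) -> Optional[Definition]:
        return self._entries.get(word.lower())
```

Keeping the standard alongside each pronunciation, rather than fixing
one phonemic alphabet for the whole lexicon, mirrors the flexibility
described above.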
4. Interpreting and Using SSML Documents
SSML does not contain explicit instructions as to how documents should
be processed as this is left to the individual synthesis
interpreters. At present only one such interpreter exists, but we are
confident that SSML has been designed carefully enough to allow use
on most synthesis systems. The current synthesis interpreter
processes the tags in the way one would expect: short pauses are
inserted at phrase commands and pitch accents are used to produce
emphasis effects.
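
The kind of processing described above can be sketched in Python as
follows. This is not the actual interpreter. The input representation
(a list of phrases, each a list of (word, emphasised) pairs), the
action names, the placement of a pause between consecutive phrases, and
the fallback rule that accents the last word of a phrase when no word
is marked are all assumptions made for illustration.

```python
def plan_actions(phrases):
    """Turn parsed phrases into a flat list of synthesizer actions.

    `phrases` is a list of phrases; each phrase is a list of
    (word, emphasised) pairs. This representation is hypothetical.
    """
    actions = []
    for phrase in phrases:
        if not phrase:
            continue  # nothing to say in an empty phrase
        if actions:
            # Assumed: a short pause separates consecutive phrases.
            actions.append(("pause", "short"))
        marked = any(emphasised for _, emphasised in phrase)
        for position, (word, emphasised) in enumerate(phrase):
            # Assumed default rule: accent the last word if none is marked.
            is_last = position == len(phrase) - 1
            accent = emphasised or (not marked and is_last)
            actions.append(("word", word, "accent" if accent else "plain"))
    return actions
```

A real interpreter would replace the fallback with the default emphasis
rules of its own synthesizer.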
It is not envisaged that SSML will be used in quite the same way as
HTML or other document markup languages. We don't expect users to sit
and write lengthy SSML documents with all the desired markup for use
with the interpreter. The envisaged applications are deemed to be more
in the area of language generation/synthesis interfaces. If a railway
timetable announcement system is being designed, the programmer
may write a slot and filler synthesis program which prints out the
complete text of an announcement and passes it to a synthesizer. For
example the carrier sentence could be ``The train now standing on
platform A is the B to C'' and the slots A, B and C could be filled as
appropriate. Using SSML, additional prosodic information could be
added in, for example
The train now standing on platform A
is the B to C .
In this way, the correct prosody would be ensured without the
programmer needing to know anything about the synthesizer itself.
SSML thus protects the programmer from the specifics of the
synthesis system, in much the same way as a C compiler protects a
programmer from the specifics of machine code.
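
The slot and filler approach described above might be sketched in
Python as follows. The tag names "phrase" and "emph" are hypothetical
placeholders and may not match the actual SSML tag set. The choice of
which words to emphasise, the split into two phrases, and the function
names are likewise assumptions made for illustration.

```python
from xml.sax.saxutils import escape

# Hypothetical tag names; the real SSML tag set may differ.
PHRASE_TAG = "phrase"
EMPH_TAG = "emph"


def emph(text):
    """Wrap a slot value in emphasis tags, escaping markup characters."""
    return f"<{EMPH_TAG}>{escape(text)}</{EMPH_TAG}>"


def phrase(body):
    """Wrap already marked-up text in phrase tags."""
    return f"<{PHRASE_TAG}>{body}</{PHRASE_TAG}>"


def announcement(platform, service, destination):
    """Fill the carrier sentence and add prosodic markup to the slots."""
    first = phrase(f"The train now standing on platform {emph(platform)}")
    second = phrase(f"is the {emph(service)} to {emph(destination)}.")
    return f"{first} {second}"


if __name__ == "__main__":
    # The resulting string would be passed to a markup-to-speech interpreter.
    print(announcement("4", "10:15", "Glasgow"))
```

The program deals only in structural markup, so the same output could
in principle be sent to any synthesizer that interprets the markup.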
As SSML is simple and doesn't necessarily need much linguistic
knowledge to operate, it should be usable by most people.
5. Future
At present SSML and the interpreter are at a prototype stage. There are
many future developments planned, but apart from the addition of
further obvious tags (for instance, more levels in phrasing), the
design of SSML can only progress if the language is tried with
different synthesis systems and in different applications.