[This local archive copy is from the official and canonical URL, http://www.bell-labs.com/project/tts/csssable.html; please refer to the canonical source document if possible.]

SABLE: an XML-based Aural Display List For The WWW

Richard Sproat
Bell Labs, Lucent Technologies
T. V. Raman
Advanced Technology Group, Adobe Systems

Introduction

This paper identifies how SABLE, an SGML/XML-based markup language for describing speech streams, fits into the overall architecture of the WWW.

Aural Cascading Style Sheets (ACSS) are designed to separate presentation rules from content: as their name implies, they are the aural analogue of CSS. Given structured content (which is presentation independent) and a set of presentation rules (which are modality specific), producing visible or audible output requires formatting the content according to the specified presentation rules. In visual formatting (typesetting), the result of this process is a visual display list, typically encoded as PostScript, PDF or some other set of page-marking operators. In fact, the visual side of the WWW is currently in the process of standardizing on an XML dialect for encoding such visual display lists: see the work on Scalable Vector Graphics (SVG).

SABLE can rightly be viewed as an aural display list. That is, in the overall architecture, SABLE is to auditory displays what SVG is to visual displays. Notice that both of these display list forms exhibit desirable characteristics, such as scaling, that are not present in the result of playing such display lists: GIF or JPEG images in the visual context, or WAV files in the auditory context. Given this analogy, we need SABLE in order to capture and communicate aural display lists. The alternative, in the absence of SABLE, would be for audio formatters to make calls directly into a vendor-specific speech API to produce the spoken output, something that is clearly undesirable.

What are Aural Cascading Style Sheets?

Cascading style sheets provide a mechanism to dissociate the document structure explicit in an HTML or XML document from the way in which that document is rendered in a particular medium. Traditionally, browsers such as Netscape have interpreted HTML tags such as H1 in a particular fashion, e.g. by using a bold font of a certain size. Cascading style sheets let one separate the markup of a particular piece of text as a level-one header using the tag H1 from the way that it is rendered, by allowing one to write a separate set of specifications, the style sheet, that defines how text enclosed in this particular tag should appear.

HTML documents may be rendered in a variety of media, including aurally. When rendering a document aurally, one can specify how the rendering is to be done using Aural Cascading Style Sheets. Thus one might wish to indicate that H1 elements are rendered using a particular voice. As an example of an aural style specification, consider Figure 1, from Appendix A of the CSS2 Specification.


@media speech {
       H1, H2, H3, 
       H4, H5, H6    { voice-family: paul, male; stress: 20; richness: 90 }
       H1            { pitch: x-low; pitch-range: 90 }
       H2            { pitch: x-low; pitch-range: 80 }
       H3            { pitch: low; pitch-range: 70 }
       H4            { pitch: medium; pitch-range: 60 }
       H5            { pitch: medium; pitch-range: 50 }
       H6            { pitch: medium; pitch-range: 40 }
       LI, DT, DD    { pitch: medium; richness: 60 }
       DT            { stress: 80 }
       PRE, CODE, TT { pitch: medium; pitch-range: 0; stress: 0; richness: 80 }
       EM            { pitch: medium; pitch-range: 60; stress: 60; richness: 50 }
       STRONG        { pitch: medium; pitch-range: 60; stress: 90; richness: 90 }
       DFN           { pitch: high; pitch-range: 60; stress: 60 }
       S, STRIKE     { richness: 0 }
       I             { pitch: medium; pitch-range: 60; stress: 60; richness: 50 }
       B             { pitch: medium; pitch-range: 60; stress: 90; richness: 90 }
       U             { richness: 0 }
       A:link        { voice-family: harry, male }
       A:visited     { voice-family: betty, female }
       A:active      { voice-family: betty, female; pitch-range: 80; pitch: x-high }
}

Figure 1: An ACSS Specification

The sample specification in Figure 1 includes, for instance:

Setting of particular voice characteristics for various levels of headers (H1-H6), along with more particular pitch specifications for the individual levels.
An overall stress setting for DT elements.
Different voices for unvisited, visited and active links.

See the section on aural style sheets in the CSS2 Specification for definitions of the various constructs used.

What is SABLE?

SABLE is an XML/SGML-based markup language for Text-to-Speech synthesis. A more thorough description can be found in a paper, from which this description was excerpted; the current set of specifications for SABLE can be found at http://www.bell-labs.com/project/tts/sable.html.

SABLE is being developed because there is an ever-increasing demand for speech synthesis (TTS) technology in various applications, including e-mail reading, information access over the web, tutorial and language-teaching applications, and assistive technology for users with various handicaps. Because the control sequences of the various TTS systems are incompatible, an application that was developed with a particular TTS system A cannot be ported to a new TTS system B without a fair amount of additional work, for the simple reason that the tag set used to control system A is completely different from that used to control system B. The large variety of tag sets used by TTS systems is thus a problem for the expanded use of this technology, since developers are often unwilling to expend effort porting their applications to a new TTS system, even if the new system in question is of demonstrably higher quality than the one they are currently using. SABLE attempts to address this problem by proposing a standard markup scheme intended to be synthesizer independent.

More specifically, SABLE is being developed with the following goals in mind:

Synthesizer control: SABLE enables markup of TTS text input, for improving the quality and appropriateness of speech output.
Multilinguality: the tagset should be appropriate for any language.
Ease of use: specialized knowledge of TTS or linguistics should not be required, though users with such experience should be able to apply their knowledge.
Portability: SABLE provides application developers with a consistent mechanism for controlling synthesizers from different companies and on different computing platforms.
Extensibility: SABLE includes a mechanism for non-standard extensions, so it can evolve to support new features in future releases. To encourage research, SABLE allows individual synthesizers to support enhanced features without compromising the portability of SABLE text.

SABLE is based in part on two previous proposals:

The Spoken Text Markup Language (STML; and see the Speech Synthesis Markup Language (SSML) for an even earlier proposal); and
the Java Speech Markup Language.

SABLE, like its predecessors, supports two kinds of markup. The first kind, termed text description in STML and structural elements in JSML, marks properties of the text structure that are relevant for rendering a document in speech. In the current version of SABLE, text description is handled by the DIV tag, whose attribute TYPE may be set to such values as sentence, paragraph or even stanza; and by SAYAS, which marks the function of the contained region (e.g. as a date, an e-mail address, a mathematical expression, etc.), and thereby gives hints on how to pronounce the contained region. The second kind of markup, STML's speaker directives or JSML's production elements, controls various aspects of how the speech is to be produced. Falling into this latter category are tags such as: EMPH (marks levels of emphasis); PITCH (sets intonational properties); RATE (sets speech rate); and PRON (provides pronunciations as phonemic strings).
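
The distinction between the two kinds of markup can be illustrated with a minimal Python sketch. It assumes only the six tag names listed in the paragraph above; the lookup tables, the function name classify_tags and the sample string are hypothetical and are not part of any SABLE tool. Because SABLE text may use unquoted attributes, the sketch scans for opening tags with a regular expression rather than an XML parser.

```python
import re

# Tag names taken from the paragraph above. The grouping is the paper's,
# but these lookup tables are only an illustration (assumption).
TEXT_DESCRIPTION = {"DIV", "SAYAS"}
SPEAKER_DIRECTIVES = {"EMPH", "PITCH", "RATE", "PRON"}

# Matches opening tags only; closing tags start with "</" and are skipped.
OPEN_TAG = re.compile(r"<([A-Za-z]+)\b[^>]*>")


def classify_tags(sable_text):
    """Group the opening tags found in sable_text by kind of markup."""
    kinds = {"text description": [], "speaker directive": [], "other": []}
    for name in OPEN_TAG.findall(sable_text):
        name = name.upper()
        if name in TEXT_DESCRIPTION:
            kinds["text description"].append(name)
        elif name in SPEAKER_DIRECTIVES:
            kinds["speaker directive"].append(name)
        else:
            kinds["other"].append(name)
    return kinds


# Hypothetical sample text, written in the style of the SABLE examples.
sample = (
    "<DIV TYPE=paragraph> We met on "
    "<SAYAS MODE=date MODETYPE=dmy> 18/11/1960 </SAYAS> , "
    "which was <EMPH LEVEL=2.0> long </EMPH> ago. </DIV>"
)
print(classify_tags(sample))
```

Tags that the paragraph does not name, such as SPEAKER or LANGUAGE, fall into the "other" bucket here only because the sketch does not attempt to classify them.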

In both its generality and its coverage, SABLE has many advantages over existing markups such as Microsoft's SAPI, or Apple's Speech Manager control set:

Whereas the syntax of other schemes is completely ad hoc, SABLE's is based on XML/SGML, a widely-used standard.
SAPI and other markup schemes provide tags only for speaker directives, not for text description. Text-description information, for example, that a particular boundary in a text corresponds to the end of a line in a table (e.g., <DIV TYPE="x-tl">), can in principle be used by a TTS system to advantage to produce reasonable speech output that marks auditorily the presence of that boundary. One does not necessarily want to have to instruct the synthesizer to use a particular intonation pattern, or to implement the break in a particular fashion: one might prefer simply to mark the presence of the boundary in an abstract way, and assume that the system will do something reasonable with that information. Text-description is explicitly designed to allow that kind of abstract specification.
SAPI and other such markup schemes are predefined. In contrast SABLE is intended to be extensible, making it ideal as a tool for TTS research.

A Sample of SABLE Text

As an illustration of SABLE markup, consider the text in Figure 2. This text is available as part of a demonstration of the SABLE markup language using the Bell Labs Multilingual Text-to-Speech System. See the SABLE specification for definitions of the various tags.

<SABLE>
Welcome to the demonstration of the Sable markup language using 
the Bell Labs TTS system.
<SPEAKER gender=female age=younger>
This system allows you to play around with a subset of the
functionality of Sable. For example, you can switch languages between
</SPEAKER>
<LANGUAGE ID=FRA>
français
</LANGUAGE>
<LANGUAGE ID=ESL-MEXICAN>
español
</LANGUAGE>
<LANGUAGE ID=ESL-CASTILIAN>
español
</LANGUAGE>
<LANGUAGE ID=ITA>
italiano
</LANGUAGE>
<LANGUAGE ID=RON>
<SPEAKER AGE=middle GENDER=female>

</SPEAKER>
</LANGUAGE>
<LANGUAGE ID=DEU>
Deutsch
</LANGUAGE>
<LANGUAGE ID=ZHO CODE=BIG5>

</LANGUAGE>
You can also <EMPH LEVEL=2.0> emphasize </EMPH> words, or
put a break 
<BREAK LEVEL=large MSEC=200 TYPE="?"> between them.
<PITCH RANGE=HIGHEST>
You can set properties of the pitch range, <RATE SPEED=fastest> or the 
speech rate </RATE> .
</PITCH>
As we saw above, <SPEAKER age=child> you can set the speaker
</SPEAKER> .
You can insert an audio file, in this case an example of our Russian
TTS system:
<AUDIO SRC=http://www.bell-labs.com/project/tts/russian6.wav>
And you can override the text by a call to the engine tag:
<ENGINE ID=BLTTS DATA="This is the Bell Labs TTS System.">
You won't hear this.
</ENGINE>
You can set the <PRON SUB="pronounciation"> pronunciation
</PRON> of a 
word, though we do not currently support IPA.

Finally you can control aspects of how some kinds of strings are said
using the &quot;say as&quot; tag. For example, a date in English:

<SAYAS MODE=date MODETYPE=dmy>
18/11/1960
</SAYAS>
or in French:
<LANGUAGE ID=fra>
<SAYAS MODE=date MODETYPE=dmy>
18.11.1960
</SAYAS>
</LANGUAGE>

Just be careful to keep some whitespace between tags and surrounding
material. The parser is not very smart.  See the file Read me dot text
for further information.

</SABLE>

Figure 2: A Sample SABLE Document

The Proper Relation between ACSS and SABLE

For universal access, HTML and other web-based documents clearly need to allow for aural style specifications, and ACSS is clearly a good start towards that goal. On the other hand, SABLE is also needed for the reasons outlined above. What then is the proper relation between ACSS and SABLE?

The answer is straightforward: ACSS should be used to provide aural style sheets; the ACSS specifications should be implemented by a voice browser in a synthesizer-independent fashion using SABLE; the resulting SABLE text would then be read to users by their favorite TTS system. The architecture would thus be as in Figure 3.

[Image: a simple flowchart with three boxes, each feeding into the next. The first box has the text 'HTML Document + ACSS'. The second box has the text 'Audio Browser: Converts HTML text into SABLE using ACSS'. The third has the text 'TTS System: Interprets SABLE text provided by Audio Browser'.]

Figure 3: A Model for Using ACSS and SABLE. Note that the "HTML Document" specified in the top box could well be any XML document.

Audio Formatters

The model of the relation between ACSS and SABLE discussed in the preceding section and diagrammed in Figure 3 presumes the existence of automatic audio formatters that consume XML content and ACSS style sheets to produce a SABLE stream suitable for sending to a TTS-engine. It is our intention that such tools will be provided in the future for at least that subset of ACSS that is practically implementable in current TTS systems.
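
One step that such a formatter would need is to collect, for a given element, every property that the style sheet assigns to it. The following Python sketch illustrates this with a few rules copied from Figure 1. It is a deliberate simplification: the name effective_style and the rule representation are hypothetical, and the sketch ignores inheritance, specificity and the other details of the real CSS2 cascade, letting later rules simply override earlier ones.

```python
# A few rules copied from Figure 1, as (selectors, declarations) pairs.
# This data layout is an assumption made for illustration only.
RULES = [
    (("H1", "H2", "H3", "H4", "H5", "H6"),
     {"voice-family": "paul, male", "stress": 20, "richness": 90}),
    (("H1",), {"pitch": "x-low", "pitch-range": 90}),
    (("H2",), {"pitch": "x-low", "pitch-range": 80}),
    (("DT",), {"stress": 80}),
]


def effective_style(element, rules=RULES):
    """Merge every rule that names `element`; later rules win on conflict."""
    style = {}
    for selectors, declarations in rules:
        if element.upper() in selectors:
            style.update(declarations)
    return style


print(effective_style("H1"))
```

For H1 this yields the shared header properties (voice family, stress, richness) together with the H1-specific pitch and pitch-range values; an element with no matching rule yields an empty dictionary.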

The model of the interaction would be as follows. Consider the implementation of an H1 tag using the ACSS from Figure 1. Say the audio browser encounters the text

<H1> Introduction </H1>

in an HTML document using this ACSS. The specifications for all tags of the header (H) family are that the voice is "male" and that the stress is "20". For H1, the pitch is "x-low" and the pitch-range is "90" (relatively high). Putting this all together, we have a male voice, speaking in the low part of its range, with a high pitch range, and lower than normal ("20") stress. Note that the "90" and the "20" are on a 0-100 scale, where "50" is considered "normal". Thus "90" could be interpreted as 80% above normal, since (90 - 50) / 50 = 0.8. In SABLE terms this might translate into:

<PITCH BASE=lowest RANGE=80%>
<EMPH LEVEL=0.5>
Introduction
</EMPH>
</PITCH>

Note that we translate stress=20 using the EMPH tag, with a low setting for the LEVEL attribute; the single tag PITCH implements the two properties pitch=x-low (BASE=lowest) and pitch-range=90 (RANGE=80%, or in other words 80% above the setting currently in place, assuming that the current setting is the normal range). So the audio browser should use the ACSS specification to convert the H1 element into the equivalent SABLE elements shown above.
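
The translation just described can be sketched in Python as follows. Several parts are assumptions rather than anything defined by ACSS or SABLE: only the pair x-low/lowest comes from the example above, so the rest of the keyword table is a guess; the paper gives LEVEL=0.5 for stress=20 without stating a formula, so the rule used here (stress 50 maps to LEVEL 1.0, quantized to steps of 0.5) is merely one rule that reproduces that value; and the function names are hypothetical. Like the example above, the sketch does not translate the voice-family property.

```python
# Assumed keyword mapping: only x-low -> lowest appears in the example above.
PITCH_BASE = {
    "x-low": "lowest", "low": "low", "medium": "medium",
    "high": "high", "x-high": "highest",
}


def range_percent(pitch_range):
    """Express a 0-100 ACSS value relative to the 'normal' value 50."""
    return round((pitch_range - 50) / 50 * 100)


def emph_level(stress):
    """Assumed rule: stress 50 -> LEVEL 1.0, quantized to steps of 0.5."""
    return round(stress / 50 * 2) / 2


def to_sable(text, style):
    """Wrap `text` in PITCH and EMPH tags derived from ACSS properties."""
    lines = [
        f"<PITCH BASE={PITCH_BASE[style['pitch']]} "
        f"RANGE={range_percent(style['pitch-range'])}%>",
        f"<EMPH LEVEL={emph_level(style['stress'])}>",
        text,
        "</EMPH>",
        "</PITCH>",
    ]
    return "\n".join(lines)


# Effective ACSS properties for H1, taken from Figure 1.
h1 = {"pitch": "x-low", "pitch-range": 90, "stress": 20}
print(to_sable("Introduction", h1))
```

With the H1 properties from Figure 1, this prints the same five lines of SABLE as the fragment above.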

Next Steps

As we move towards fitting SABLE into the overall WWW architecture, there may be a need to factor SABLE into meaningful components. As an example, SABLE presently includes a language tag, and though the language of a phrase is important for speech synthesis, it is even more important for search engines and other Web robots. For this reason, it would make sense for the language tag to be bubbled up to cover all HTML and XML (HXML of the future) documents, rather than being specific to SABLE.

In a similar vein, some of the SABLE tags, which have heretofore seemed useful in the design of SABLE purely as a markup language for TTS, may need to be reconsidered in the context of the WWW. For instance, SABLE provides a SAYAS tag, which is used to specify various modes of rendering specific stretches of text: for instance one can specify that a string "2/3" is to be read as a date in the "day-month" format. One problem with this approach is that the set of such potential markup seems open-ended: it may be desirable to also add such things as "speak as an age", "speak as a sports score", and so forth. Whether this kind of knowledge should be handled in the interface between SABLE and TTS, or in the interface between ACSS and SABLE is a question that needs to be thrashed out.
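
To make concrete why such a hint matters, the following Python sketch shows how the same string "2/3" yields different dates under different orderings. The function name interpret_date and the order labels "dm" and "md" are hypothetical, modeled loosely on the dmy value used in Figure 2; the sketch says nothing about how an actual TTS system implements SAYAS.

```python
MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]


def interpret_date(text, order):
    """Read 'a/b' as (day, month name) under order 'dm' or 'md'.

    The order labels are hypothetical stand-ins for a SAYAS mode type.
    """
    first, second = (int(part) for part in text.split("/"))
    if order == "dm":
        day, month = first, second
    elif order == "md":
        day, month = second, first
    else:
        raise ValueError(f"unsupported order: {order!r}")
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range: {month}")
    return day, MONTHS[month - 1]


print(interpret_date("2/3", "dm"))  # day-month reading
print(interpret_date("2/3", "md"))  # month-day reading
```

Each further mode (an age, a sports score, and so on) would need its own interpretation rule of this kind, which is the open-endedness the paragraph above points to.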
