The OASIS Cover Pages: The Online Resource for Markup Language Technologies
Created: July 16, 2004.

Speech Synthesis Markup Language (SSML) Version 1.0 Advances to Proposed Recommendation.

Update 2004-09-18: See the news story "Speech Synthesis Markup Language (SSML) Version 1.0 Advances to W3C Recommendation."

[July 16, 2004] The W3C Voice Browser Working Group has released Speech Synthesis Markup Language (SSML) Version 1.0 as a Proposed Recommendation. Based upon wide review for technical soundness and implementability, the WG believes that SSML 1.0 is now a mature technical report.

The Speech Synthesis Markup Language (SSML) is part of the W3C Speech Interface Framework. The specification is designed "to provide a rich, XML-based markup language for assisting the generation of synthetic speech in Web and other applications. The essential role of the markup language is to provide authors of synthesizable content a standard way to control aspects of speech such as pronunciation, volume, pitch, rate, etc. across different synthesis-capable platforms."
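To make the controls named above concrete, here is a minimal sketch in Python that assembles a small SSML 1.0 document with the standard library. The element and attribute names (speak, prosody, phoneme, rate, pitch, volume, alphabet, ph) and the namespace URI come from the SSML 1.0 specification; the function name build_ssml, its parameters, and the sample sentence are assumptions made only for illustration.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def q(name):
    """Qualify an element name with the SSML namespace."""
    return f"{{{SSML_NS}}}{name}"


def build_ssml(lead, word, ipa, rate="slow", pitch="low", volume="soft",
               lang="en-US"):
    """Return an SSML document as a string (hypothetical helper).

    `lead` is spoken with the given prosody settings; `word` is spoken
    using the IPA pronunciation supplied in `ipa`.
    """
    # Serialize SSML elements in the default namespace (no prefix).
    ET.register_namespace("", SSML_NS)
    speak = ET.Element(q("speak"), {"version": "1.0", XML_LANG: lang})
    # prosody controls rate, pitch and volume of the enclosed text.
    prosody = ET.SubElement(
        speak, q("prosody"), {"rate": rate, "pitch": pitch, "volume": volume}
    )
    prosody.text = lead
    # phoneme supplies an explicit pronunciation for the enclosed word.
    phoneme = ET.SubElement(prosody, q("phoneme"), {"alphabet": "ipa", "ph": ipa})
    phoneme.text = word
    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(build_ssml("You say ", "tomato", "təˈmeɪtoʊ"))
```

How a given synthesis processor renders values such as "slow" or "soft" is platform-dependent; the markup only expresses the author's intent in a portable form.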

Related specifications in the W3C Speech Interface Framework include the Speech Recognition Grammar Specification (SRGS), Call Control (CCXML), VoiceXML 2.0, VoiceXML 2.1, Semantic Interpretation, and Dialog Markup ("V3").

"SSML is part of a larger set of markup specifications for voice browsers developed through the open processes of the W3C. It is based upon the JSGF and/or JSML specifications, which are owned by Sun Microsystems, Inc. A related initiative to establish a standard system for marking up text input is SABLE, which tried to integrate many different XML-based markups for speech synthesis into a new one. The activity carried out in SABLE was also used as the main starting point for defining the Speech Synthesis Markup Requirements for Voice Markup Languages. Since then, SABLE itself has not undergone any further development."

"The intended use of SSML is to improve the quality of synthesized content. Different markup elements impact different stages of the synthesis process. The markup may be produced either automatically, for instance via XSLT or CSS3 from an XHTML document, or by human authoring. Markup may be present within a complete SSML document or as part of a fragment embedded in another language, although no interactions with other languages are specified as part of SSML itself. Most of the markup included in SSML is suitable for use by the majority of content developers. However, some advanced features like phoneme and prosody (e.g., for speech contour design) may require specialized knowledge."
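The automatic production route mentioned above can be sketched as follows. The quoted text names XSLT or CSS3 as the transformation mechanism; this sketch swaps in plain Python instead, and the tag mapping (TAG_MAP), the function names, and the policy of flattening unmapped elements to their text are all assumptions chosen for illustration, not part of SSML or XHTML.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical mapping from XHTML local names to SSML elements.
TAG_MAP = {
    "p": ("p", {}),
    "em": ("emphasis", {}),
    "strong": ("emphasis", {"level": "strong"}),
    "br": ("break", {}),
}


def _append_text(parent, text):
    """Append text at the current end of parent's content."""
    if not text:
        return
    if len(parent):
        last = parent[-1]
        last.tail = (last.tail or "") + text
    else:
        parent.text = (parent.text or "") + text


def _convert_children(src, dst):
    """Copy src's content into dst, translating mapped elements."""
    _append_text(dst, src.text)
    for child in src:
        local = child.tag.rsplit("}", 1)[-1]  # drop any namespace part
        if local in TAG_MAP:
            name, attrs = TAG_MAP[local]
            out = ET.SubElement(dst, f"{{{SSML_NS}}}{name}", dict(attrs))
            _convert_children(child, out)
        else:
            # Unmapped element: keep only its text content.
            _append_text(dst, "".join(child.itertext()))
        _append_text(dst, child.tail)


def xhtml_to_ssml(xhtml, lang="en-US"):
    """Convert a well-formed XHTML fragment to an SSML document string."""
    ET.register_namespace("", SSML_NS)
    source = ET.fromstring(xhtml)
    speak = ET.Element(f"{{{SSML_NS}}}speak", {"version": "1.0", XML_LANG: lang})
    _convert_children(source, speak)
    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    fragment = (
        '<div xmlns="http://www.w3.org/1999/xhtml">'
        "<p>Flight <strong>BA 123</strong> is <span>now</span> boarding.</p>"
        "</div>"
    )
    print(xhtml_to_ssml(fragment))
```

A production pipeline would more likely use a stylesheet, as the specification suggests; the point here is only that visual markup such as strong can be mapped mechanically to speech markup such as emphasis.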

The SSML 1.0 Implementation Report and Disposition of Comments documents accompanying SSML v1.0 PR explain the entrance criteria adopted by the Voice Browser Working Group for the Proposed Recommendation phase in the Request for CR: "sufficient reports of implementation experience have been gathered to demonstrate that synthesis processors based on the specification are implementable and have compatible behavior; specific Implementation Report Requirements have been met; and the Working Group has formally addressed and responded to all public comments."


Integration of Speech Synthesis Markup Language With Other Markup Languages

SMIL. "The Synchronized Multimedia Integration Language (SMIL, pronounced 'smile') enables simple authoring of interactive audiovisual presentations. SMIL is typically used for 'rich media'/multimedia presentations which integrate streaming audio and video with images, text or any other media type. SMIL is an easy-to-learn HTML-like language, and many SMIL presentations are written using a simple text-editor." A SMIL Integration Example is provided in Appendix F: 'Example SSML'.

ACSS. Aural Cascading Style Sheets are employed to augment standard visual forms of documents (like HTML) with additional elements that assist in the synthesis of the text into audio. In comparison to SSML, ACSS-generated documents are capable of more complex specifications of the audio sequence, including the designation of 3D location of the audio source. Many of the other ACSS elements overlap SSML functionality, especially in the specification of voice type/quality. SSML may be viewed as a superset of ACSS capabilities, excepting spatial audio.

VoiceXML. The Voice Extensible Markup Language (VXML) enables Web-based development and content-delivery for interactive voice response applications. VoiceXML supports speech synthesis, recording and playback of digitized audio, speech recognition, DTMF input, telephony call control, and form-driven mixed initiative dialogs. VoiceXML 2.0 extends SSML for the markup of text to be synthesized. An example of the integration between VoiceXML and SSML is provided in Appendix F..." [excerpted from section 2.3]
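As a rough illustration of SSML-style markup inside a VoiceXML prompt, the sketch below generates a one-prompt VoiceXML 2.0 document in Python. It assumes that speech markup such as emphasis appears inside prompt in the VoiceXML namespace, as in VoiceXML 2.0 documents; the function name vxml_prompt and the sample text are hypothetical, and the example in Appendix F of the specification remains the authoritative reference.

```python
import xml.etree.ElementTree as ET

VXML_NS = "http://www.w3.org/2001/vxml"


def vxml_prompt(before, emphasized, after):
    """Return a minimal VoiceXML 2.0 document with one spoken prompt.

    The prompt speaks `before`, then `emphasized` with emphasis,
    then `after` (hypothetical helper for illustration only).
    """
    ET.register_namespace("", VXML_NS)

    def q(name):
        return f"{{{VXML_NS}}}{name}"

    vxml = ET.Element(q("vxml"), {"version": "2.0"})
    form = ET.SubElement(vxml, q("form"))
    block = ET.SubElement(form, q("block"))
    prompt = ET.SubElement(block, q("prompt"))
    prompt.text = before
    # Speech markup nested directly inside the prompt.
    emphasis = ET.SubElement(prompt, q("emphasis"))
    emphasis.text = emphasized
    emphasis.tail = after
    return ET.tostring(vxml, encoding="unicode")


if __name__ == "__main__":
    print(vxml_prompt("Your flight is ", "delayed", " by one hour."))
```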

About the W3C Voice Browser Activity

"W3C is working to expand access to the Web to allow people to interact via key pads, spoken commands, listening to prerecorded speech, synthetic speech and music. This will allow any telephone to be used to access appropriately designed Web-based services, and will be a boon to people with visual impairments or needing Web access while keeping their hands and eyes free for other things. It will also allow effective interaction with display-based Web content in the cases where the mouse and keyboard may be missing or inconvenient.

To fulfill this goal, the W3C Voice Browser Working Group is defining a suite of markup languages covering dialog, speech synthesis, speech recognition, call control and other aspects of interactive voice response applications. Specifications such as the Speech Synthesis Markup Language, Speech Recognition Grammar Specification, and Call Control XML are core technologies for describing speech synthesis, recognition grammars, and call control constructs respectively. VoiceXML is a dialog markup language that leverages the other specifications for creating dialogs that feature synthesized speech, digitized audio, recognition of spoken and DTMF key (touch tone) input, recording of spoken input, telephony, and mixed initiative conversations.

These specifications bring the advantages of Web-based development and content delivery to interactive voice response applications. Further work is anticipated on enabling their use with other W3C markup languages such as XHTML, XForms and SMIL. This will be done in conjunction with W3C work in other areas, including Multimodal Interaction.

Some possible applications include:

  • Accessing business information, including the corporate "front desk" asking callers who or what they want, automated telephone ordering services, support desks, order tracking, airline arrival and departure information, cinema and theater booking services, and home banking services.
  • Accessing public information, including community information such as weather, traffic conditions, school closures, directions and events; local, national and international news; national and international stock market information; and business and e-commerce transactions.
  • Accessing personal information, including calendars, address and telephone lists, to-do lists, shopping lists, and calorie counters.
  • Assisting the user to communicate with other people via sending and receiving voice-mail and email messages. ..." [from the W3C Voice Browser Activity page]


Robin Cover, Editor: