Update 2004-09-18: See the news story "Speech Synthesis Markup Language (SSML) Version 1.0 Advances to W3C Recommendation."
[July 16, 2004] The W3C Voice Browser Working Group has released Speech Synthesis Markup Language (SSML) Version 1.0 as a Proposed Recommendation. Based upon wide review for technical soundness and implementability, the WG believes that SSML 1.0 is now a mature technical report.
The Speech Synthesis Markup Language (SSML) is part of the W3C Speech Interface Framework. The specification is designed "to provide a rich, XML-based markup language for assisting the generation of synthetic speech in Web and other applications. The essential role of the markup language is to provide authors of synthesizable content a standard way to control aspects of speech such as pronunciation, volume, pitch, rate, etc. across different synthesis-capable platforms."
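To make this concrete, here is a minimal SSML 1.0 fragment (an illustrative sketch, not an example taken from the specification) that controls rate, pitch, volume, pausing, and emphasis:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <!-- emphasis marks a word or phrase for prominence -->
  <s>The next train departs at
     <emphasis level="strong">nine fifteen</emphasis>.</s>
  <!-- break inserts an explicit pause -->
  <break time="500ms"/>
  <!-- prosody controls rate, pitch, and volume of the contained text -->
  <s><prosody rate="slow" pitch="low" volume="loud">Please have
     your ticket ready.</prosody></s>
</speak>
```

The `speak` root element, its `version` attribute, and the `http://www.w3.org/2001/10/synthesis` namespace are required on every standalone SSML document.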
Related specifications in the W3C Speech Interface Framework include the Speech Recognition Grammar Specification (SRGS), Call Control (CCXML), VoiceXML 2.0, VoiceXML 2.1, Semantic Interpretation, and Dialog Markup ("V3").
"SSML is part of a larger set of markup specifications for voice browsers developed through the open processes of the W3C. It is based upon the JSGF and/or JSML specifications, which are owned by Sun Microsystems, Inc. A related initiative to establish a standard system for marking up text input is SABLE, which tried to integrate many different XML-based markups for speech synthesis into a new one. The activity carried out in SABLE was also used as the main starting point for defining the Speech Synthesis Markup Requirements for Voice Markup Languages. Since then, SABLE itself has not undergone any further development."
"The intended use of SSML is to improve the quality of synthesized content. Different markup elements impact different stages of the synthesis process. The markup may be produced either automatically, for instance via XSLT or CSS3 from an XHTML document, or by human authoring. Markup may be present within a complete SSML document or as part of a fragment embedded in another language, although no interactions with other languages are specified as part of SSML itself. Most of the markup included in SSML is suitable for use by the majority of content developers. However, some advanced features like phoneme and prosody (e.g., for speech contour design) may require specialized knowledge."
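The advanced features mentioned above can be sketched as follows; the IPA transcription and the pitch-contour values are illustrative assumptions, not content from the specification:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <!-- phoneme: exact pronunciation given as an IPA string -->
  I say <phoneme alphabet="ipa"
        ph="t&#x259;&#x02C8;m&#x251;&#x2D0;t&#x259;&#x28A;">tomato</phoneme>.
  <!-- prosody contour: (position, pitch target) pairs across the phrase -->
  <prosody contour="(0%,+20Hz) (50%,+40Hz) (100%,-10Hz)">
    Are you quite sure?
  </prosody>
</speak>
```

Authoring either of these by hand presumes familiarity with phonetic alphabets and pitch-contour design, which is why the specification flags them as features for specialists rather than general content developers.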
The SSML 1.0 Implementation Report and Disposition of Comments documents accompanying SSML v1.0 PR explain the entrance criteria adopted by the Voice Browser Working Group for the Proposed Recommendation phase in the Request for CR: "sufficient reports of implementation experience have been gathered to demonstrate that synthesis processors based on the specification are implementable and have compatible behavior; specific Implementation Report Requirements have been met; and the Working Group has formally addressed and responded to all public comments."
Speech Synthesis Markup Language (SSML) Version 1.0. Edited by Daniel C. Burnett (Nuance Communications), Mark R. Walker (Intel), and Andrew Hunt (ScanSoft). W3C Proposed Recommendation 15-July-2004. Version URL: http://www.w3.org/TR/2004/PR-speech-synthesis-20040715/. Latest version URL: http://www.w3.org/TR/speech-synthesis/. Previous version URL: http://www.w3.org/TR/2003/CR-speech-synthesis-20031218/.
SSML 1.0 Implementation Report. Version: 15-July-2004. Contributors: Laura Ricotti (Loquendo, Chief Editor), Paolo Baggia (Loquendo, co-editor), An Buyle (ScanSoft), Dave Burke (VoxPilot), Daniel Burnett (Nuance), Jerry Carter (Independent Consultant), Sasha Caskey (IBM), William Gardella (SAP), Frederic Gavignet (France Telecom), Edouard Hinard (France Telecom), Jeff Kusnitz (IBM), Paul Lamere (Sun), Rob Marchand (VoiceGenie), Sheyla Militello (Loquendo), Luc Van Tichelen (ScanSoft).
Integration of Speech Synthesis Markup Language With Other Markup Languages
SMIL. "The Synchronized Multimedia Integration Language (SMIL, pronounced 'smile') enables simple authoring of interactive audiovisual presentations. SMIL is typically used for 'rich media'/multimedia presentations which integrate streaming audio and video with images, text or any other media type. SMIL is an easy-to-learn HTML-like language, and many SMIL presentations are written using a simple text-editor." A SMIL Integration Example is provided in Appendix F: 'Example SSML'.
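An integration along these lines can be sketched as follows; this is a hedged illustration, not the Appendix F example, and the resource names (`greeting.ssml`, `logo.png`) are hypothetical:

```xml
<!-- A SMIL 2.0 presentation pairing an image with synthesized speech.
     The SSML document is referenced as a timed media object. -->
<smil xmlns="http://www.w3.org/2001/SMIL20/Language">
  <body>
    <par>
      <!-- show the image for five seconds... -->
      <img src="logo.png" dur="5s"/>
      <!-- ...while a synthesis processor renders the SSML in parallel -->
      <ref src="greeting.ssml" type="application/ssml+xml"/>
    </par>
  </body>
</smil>
```

The `par` container plays its children in parallel, which is the natural way to synchronize synthesized narration with visual media.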
ACSS. Aural Cascading Style Sheets are employed to augment standard visual forms of documents (like HTML) with additional elements that assist in the synthesis of the text into audio. In comparison to SSML, ACSS-generated documents are capable of more complex specifications of the audio sequence, including the designation of 3D location of the audio source. Many of the other ACSS elements overlap SSML functionality, especially in the specification of voice type/quality. SSML may be viewed as a superset of ACSS capabilities, excepting spatial audio.
VoiceXML. The Voice Extensible Markup Language (VoiceXML) enables Web-based development and content-delivery for interactive voice response applications. VoiceXML supports speech synthesis, recording and playback of digitized audio, speech recognition, DTMF input, telephony call control, and form-driven mixed initiative dialogs. VoiceXML 2.0 extends SSML for the markup of text to be synthesized. An example of the integration between VoiceXML and SSML is provided in Appendix F..." [excerpted from section 2.3]
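A prompt along these lines (a minimal sketch, not the Appendix F example) shows SSML elements used directly inside VoiceXML 2.0 markup:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="main">
    <block>
      <!-- VoiceXML's prompt element accepts SSML content directly -->
      <prompt>
        Welcome. <break time="300ms"/>
        <prosody rate="slow">Please say the name of a department.</prosody>
      </prompt>
    </block>
  </form>
</vxml>
```

Because VoiceXML 2.0 incorporates SSML by reference, the same `break` and `prosody` elements work unchanged inside a dialog prompt.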
About the W3C Voice Browser Activity
"W3C is working to expand access to the Web to allow people to interact via key pads, spoken commands, listening to prerecorded speech, synthetic speech and music. This will allow any telephone to be used to access appropriately designed Web-based services, and will be a boon to people with visual impairments or needing Web access while keeping their hands and eyes free for other things. It will also allow effective interaction with display-based Web content in the cases where the mouse and keyboard may be missing or inconvenient.
To fulfill this goal, the W3C Voice Browser Working Group is defining a suite of markup languages covering dialog, speech synthesis, speech recognition, call control and other aspects of interactive voice response applications. Specifications such as the Speech Synthesis Markup Language, Speech Recognition Grammar Specification, and Call Control XML are core technologies for describing speech synthesis, recognition grammars, and call control constructs respectively. VoiceXML is a dialog markup language that leverages the other specifications for creating dialogs that feature synthesized speech, digitized audio, recognition of spoken and DTMF key (touch tone) input, recording of spoken input, telephony, and mixed initiative conversations.
These specifications bring the advantages of Web-based development and content delivery to interactive voice response applications. Further work is anticipated on enabling their use with other W3C markup languages such as XHTML, XForms and SMIL. This will be done in conjunction with W3C work in other areas, including Multimodal Interaction.
Some possible applications include:
- Accessing business information, including the corporate "front desk" asking callers who or what they want, automated telephone ordering services, support desks, order tracking, airline arrival and departure information, cinema and theater booking services, and home banking services.
- Accessing public information, including community information such as weather, traffic conditions, school closures, directions and events; local, national and international news; national and international stock market information; and business and e-commerce transactions.
- Accessing personal information, including calendars, address and telephone lists, to-do lists, shopping lists, and calorie counters.
- Assisting the user to communicate with other people via sending and receiving voice-mail and email messages..." [from the W3C Voice Browser Activity page]
- Speech Synthesis Markup Language (SSML) Version 1.0. W3C Proposed Recommendation.
- Speech Synthesis Markup Language XML Schema [cache]. Normative, as described in Appendix D. Note that the SSML synthesis schema includes a no-namespace core schema... "which may be used as a basis for specifying Speech Synthesis Markup Language Fragments embedded in non-synthesis namespace schemas."
- Speech Synthesis Markup Language XML DTD. Non-normative, as described in Appendix E. [cache]
- SSML 1.0 Implementation Report. Includes a table with test results.
- ZIP file for SSML Implementation Report project. The ZIP archive contains a number of resources: the SSML tests are ordered by test assertion ID and organized into folders, where each folder name corresponds to the assertion ID. In addition, the archive includes: (1) the Manifest, (2) the Report Submission Template, (3) the Stylesheet.
- SSML 1.0: Candidate Recommendation Disposition of Comments. Edited by Daniel C. Burnett (Nuance). This document "details the responses made by the Voice Browser Working Group to issues raised during the Candidate Recommendation (beginning 18 December 2003 and ending 18 February 2004) review of Speech Synthesis Markup Language (SSML) Version 1.0 Candidate Recommendation. Comments were provided by Voice Browser Working Group members, other W3C Working Groups, and the public via the 'email@example.com' mailing list."
- SSML 1.0 Implementation Report Updates
- W3C news item
- W3C 'Voice Browser' Activity — Voice Enabling the Web!
- W3C Speech Interface Framework:
- Related: Synchronized Multimedia Integration Language (SMIL 2.0)
- Related: CSS Aural style sheets
- W3C Technical Report Maturity Levels for Recommendation-Track Specifications
- Mail Archives for W3C public list 'firstname.lastname@example.org'. "This is a list for discussion of the use and design of voice applications on the Web, and more specifically, for feedback on the W3C VoiceXML specifications."
- W3C Contact: Max Froumentin (Voice Browser Working Group)
- Earlier W3C Speech Interface Framework news:
- "W3C Speech Synthesis Markup Language Specification" - Local references.