The OASIS Cover Pages: The Online Resource for Markup Language Technologies
Created: June 28, 2005.

Candidate Recommendation for Voice Extensible Markup Language (VoiceXML) 2.1.


W3C's Voice Browser Working Group has released a new version of the Voice Extensible Markup Language (VoiceXML) 2.1 specification at Candidate Recommendation level. VoiceXML "enables integration of voice services with data services using the familiar client-server paradigm. It is designed for creating audio dialogs that feature synthesized speech, digitized audio, recognition of spoken and DTMF key input, recording of spoken input, telephony, and mixed initiative conversations."

The W3C Voice Browser Working Group, chartered through January 31, 2007, is developing XML-based specifications to "bring the benefits of Web technology to the telephone, enabling Web developers to create applications that can be accessed via any telephone, and allowing people to interact with these applications via speech and telephone keypads. The W3C Speech Interface Framework is a suite of markup specifications aimed at realizing this goal. It covers voice dialogs (VoiceXML), speech synthesis (SSML, PLS), speech recognition (SRGS, SISR), telephony call control for voice browsers (CCXML) and other requirements for interactive voice response applications, including use by people with hearing or speaking impairments."

VoiceXML Version 2.1 defines a set of eight (8) markup elements, either newly introduced or enhanced from VoiceXML 2.0; they represent features now commonly implemented by VoiceXML applications. According to the Candidate Recommendation Introduction, "the popularity of VoiceXML 2.0 spurred the development of numerous voice browser implementations early in the specification process so that VoiceXML 2.0 has been phenomenally successful in enabling the rapid deployment of voice applications that handle millions of phone calls every day. This success has led to the development of additional, innovative features that help developers build even more powerful voice-activated services. While it was too late to incorporate these additional features into version 2.0, the purpose of VoiceXML 2.1 is to formally specify the most common features to ensure their portability between platforms and at the same time maintain complete backwards-compatibility with VoiceXML 2.0."

A new <data> element in VoiceXML 2.1 "allows a VoiceXML application to fetch arbitrary XML data from a document server without transitioning to a new VoiceXML document; the XML data fetched by the <data> element is bound to ECMAScript through the named variable that exposes a read-only subset of the W3C Document Object Model (DOM)." VoiceXML 2.1 also supports dynamic concatenation of prompts using a new <foreach> element, which allows a VoiceXML application to iterate through an ECMAScript array and to execute the content contained within the <foreach> element for each item in the array. A new attribute for <grammar> supports referencing grammars dynamically; <property> now controls platform settings; <script> can reference a document containing client-side ECMAScript; the <transfer> element may support any combination of bridge, blind, or consultation transfer types to transfer the user to another destination. Normative Appendix B provides the VoiceXML Schema, and non-normative Appendix A supplies a VoiceXML Document Type Definition (DTD).
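The fetch-then-iterate pattern behind <data> and <foreach> can be illustrated with a rough analogy in Python. This is a sketch only, not VoiceXML markup and not how any voice browser is implemented: the XML payload, the element names (quotes, quote, ticker, price), and the helper names are all hypothetical, and Python's minidom exposes a full DOM rather than the read-only subset the specification describes.

```python
from xml.dom import minidom

# Hypothetical XML payload, standing in for what a document server might return.
QUOTES_XML = (
    "<quotes>"
    "<quote><ticker>ACME</ticker><price>10.50</price></quote>"
    "<quote><ticker>GLOBEX</ticker><price>22.75</price></quote>"
    "</quotes>"
)


def text_of(element, tag):
    """Return the text content of the first child element named `tag`."""
    return element.getElementsByTagName(tag)[0].firstChild.data


def build_prompt(xml_text):
    """Parse the XML into a DOM, then build one prompt fragment per item.

    Parsing plays the role of <data> (the XML becomes a DOM object);
    the loop plays the role of <foreach> (content runs once per array item).
    """
    dom = minidom.parseString(xml_text)
    parts = []
    for quote in dom.getElementsByTagName("quote"):
        parts.append(
            "%s is trading at %s." % (text_of(quote, "ticker"), text_of(quote, "price"))
        )
    return " ".join(parts)


if __name__ == "__main__":
    print(build_prompt(QUOTES_XML))
```

The point of the pattern is that the dialog stays in the same document while the data changes: only the XML payload is refetched, and the prompt is assembled from whatever items it contains.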

Exit criteria for the VoiceXML 2.1 Candidate Recommendation phase and entrance into the Proposed Recommendation phase "require at least two independently developed interoperable implementations of each required feature, and at least one or two implementations of each optional feature depending on whether the feature's conformance requirements have an impact on interoperability. Detailed implementation requirements and the invitation for participation in the Implementation Report are provided in the VoiceXML 2.1 Implementation Report Plan." The specification already has significant implementation experience that is being incorporated into the Implementation Report; the W3C team expects to meet all implementation and interoperability requirements by July 11, 2005.

The next major release of VoiceXML, VoiceXML 3.0, is expected to "provide powerful dialog capabilities that can be used to build advanced speech applications, and to provide these capabilities in a form that can be easily and cleanly integrated with other W3C languages. It will provide enhancements to existing dialog and media control, as well as major new features (e.g., modularization, a cleaner separation between data/flow/dialog, and asynchronous external eventing) to facilitate interoperation with external applications and media components."

The Voice Browser Working Group is part of the W3C's Interaction Domain, and now operates under the new W3C Patent Policy; it has some 73 participants from 28 organizations, and two Invited Experts. W3C's Interaction Domain "is responsible for developing technologies that shape the Web's user interface. These technologies include (X)HTML, the markup language that started the Web. Participants also work on second-generation Web languages initiated at the W3C: CSS, MathML, SMIL and SVG, [which] have become an integral part of the Web. They develop languages that will determine next generation Web user interfaces with cutting-edge technologies such as VoiceXML, Multimodal Interaction and XForms. A major focus today is how to adapt current technologies to enable Web access for anyone, anywhere, anytime, using any device. This includes Web access from 'smartphones', the new generation of mobile phones which integrate traditional phone functionality with new, powerful computing and Web browsing capabilities."

Bibliographic Information

  • Voice Extensible Markup Language (VoiceXML) 2.1. W3C Candidate Recommendation. 13-June-2005. Edited by Matt Oshry (Tellme Networks - Editor-in-Chief), RJ Auburn (Voxeo Corporation), Paolo Baggia (Loquendo), Michael Bodell (Tellme Networks), David Burke (Voxpilot Ltd), Daniel C. Burnett (Invited Expert), Emily Candell (Comverse), Hakan Kilic (Scansoft), Scott McGlashan (Hewlett-Packard), Alex Lee (VoiceGenie), Brad Porter (Tellme Networks), and Ken Rehor (Vocalocity).

    Acknowledgements: VoiceXML 2.1 was written by the participants in the W3C Voice Browser Working Group. The following have significantly contributed to writing this specification: Matt Oshry (Tellme Networks), RJ Auburn (Voxeo Corporation), Paolo Baggia (Loquendo), Michael Bodell (Tellme Networks), David Burke (Voxpilot Ltd), Daniel C. Burnett (Invited Expert), Emily Candell (Comverse), Ken Davies (HeyAnita), Jim Ferrans (Motorola), Jeff Haynie (Vocalocity), Matt Henry (Voxeo Corporation), Hakan Kilic (Scansoft), Scott McGlashan (Hewlett-Packard), Brien Muschett (IBM), Rob Marchand (VoiceGenie), Jeff Kusnitz (IBM), Brad Porter (Tellme Networks), Ken Rehor (Vocalocity), Laura Ricotti (Loquendo), Davide Tosello (Loquendo), and Milan Young (Nuance).

  • Authorizing Read Access to XML Content Using the <?access-control?> Processing Instruction 1.0. W3C Working Group Note. 13-June-2005. Edited by Matt Oshry (Tellme Networks), Brad Porter (Tellme Networks), and RJ Auburn (Voxeo Corporation).

  • Voice Browser Call Control: CCXML Version 1.0. W3C Working Draft. 11-January-2005. Edited by RJ Auburn (Voxeo Corporation). With XML Schema.

    See also the non-normative diff-marked HTML version based upon the HTML version; it features a stylesheet-driven toggle for "Show Change Marks" vs. "Hide Change Marks".

    This version of CCXML was written with the participation of members of the W3C Voice Browser Working Group, and a special thanks is in order for the following contributors: David Attwater (BTexact Technologies), RJ Auburn (Voxeo Corporation), Eric Burger (Snowshore), David Burke (VoxPilot), Emily Candell (Comverse), Ken Davies (Hey Anita), Rajiv Dharmadhikari (Genesys), Dave Donnelly (BTexact Technologies), Daniel Evans, Mike Galpin (Syntellect), Markus Hahn, Jeff Haynie (Vocalocity), David W Heileman (Unisys), Gadi Inon, Rob Marchand (VoiceGenie), Scott McGlashan (Hewlett-Packard), Rachel Muller (Aspect), Paolo Baggia (Loquendo), Brad Porter (TellMe Networks), Kenneth Rehor (Vocalocity), Dave Renshaw (IBM), Mark Scott (VoiceGenie), Saravanan Shanmugham (Cisco), Scott Slote (Nortel Networks), Govardhan Srikant (Nortel Networks), Jim Trethewey (Intel), and Wai-Yip Tung (Cisco).

The 'access-control' Processing Instruction

A previous version of VoiceXML 2.1 included a description of a mechanism for securing access to XML data via an access-control processing instruction (PI). This content has now been published as a W3C NOTE, "Authorizing Read Access to XML Content Using the <?access-control?> Processing Instruction 1.0."

"XML representations of presentation markup and data are widely available to web browsers over HTTP. Web browsers often run with a higher privilege level than the applications running in those browsers. In order to prevent applications from accessing privileged content, browsers restrict applications to only read XML resources from the application's domain (e.g., LSParser in Document Object Model (DOM) Level 3 Load and Save Specification or the <data> element in VoiceXML 2.1). This limitation restricts the universe of XML content available to an application and precludes the open sharing of public XML data between applications.

This W3C Note describes one mechanism in use by voice browser vendors to allow XML content providers to specify which application domains can access their XML content. For example, the National Oceanic and Atmospheric Administration (NOAA) may declare that their XML weather data can be accessed by any application, while a stock ticker provider can allow access to individual partner applications that have licensed that data...

Before allowing an application executing in the context of a user agent to manipulate external XML content, a user agent validates that the host requesting the content is allowed to access the content. This validation is performed by comparing the hostname and IP Address of the document server from which the requesting application was fetched to the list of hostnames, hostname suffixes, and IP addresses listed in the processing instruction included in the XML content to be fetched..."
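The validation step described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the Note's normative algorithm: the pseudo-attribute name `allow`, the space-separated entry list, the `*.` prefix for hostname suffixes, and the bare `*` wildcard are assumptions made here for the example, and the function names are hypothetical.

```python
import re
from xml.dom import minidom


def read_allow_list(xml_text):
    """Collect entries from any <?access-control allow="..."?> PI in the prolog.

    Assumption: entries are space-separated inside a pseudo-attribute
    named "allow"; consult the W3C Note for the actual syntax.
    """
    doc = minidom.parseString(xml_text)
    entries = []
    for node in doc.childNodes:
        is_pi = node.nodeType == node.PROCESSING_INSTRUCTION_NODE
        if is_pi and node.target == "access-control":
            attrs = dict(re.findall(r'(\w+)="([^"]*)"', node.data))
            entries.extend(attrs.get("allow", "").split())
    return entries


def is_allowed(hostname, ip_address, entries):
    """Compare the requesting application's host and IP against the entries."""
    hostname = hostname.lower()
    for entry in entries:
        entry = entry.lower()
        if entry == "*":
            return True  # assumed wildcard: any application may read
        if entry.startswith("*."):
            # hostname suffix, e.g. "*.example.com" matches "apps.example.com"
            if hostname.endswith(entry[1:]):
                return True
        elif entry in (hostname, ip_address):
            return True  # exact hostname or exact IP address
    return False


if __name__ == "__main__":
    xml = '<?access-control allow="*.example.com 192.0.2.10"?><weather><temp>61</temp></weather>'
    allowed = read_allow_list(xml)
    print(is_allowed("apps.example.com", "198.51.100.7", allowed))
    print(is_allowed("evil.test", "203.0.113.5", allowed))
```

Note that the check uses the host from which the requesting application was fetched, not the end user's address; a document without the processing instruction yields an empty list, so nothing outside the application's own domain is admitted by this sketch.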

Note: This document was published "for information purposes only. The Working Group does not plan to issue updates and therefore has no current plans or process by which to handle feedback. The W3C has not analyzed the security problems which motivated the publication of this NOTE. This NOTE only addresses a subset of the security issues involved in exposing XML data over HTTP. This NOTE documents an existing practice used under certain circumstances but in no way implies that the technique would be appropriate or secure to protect document access under all circumstances. Implementors should perform their own security analysis..."

Related W3C Speech Interface Framework Specifications

  • Voice Browser Call Control: CCXML Version 1.0. W3C Working Draft. 11-January-2005.

    The Call Control Extensible Markup Language (CCXML) is being developed by members of the Voice Browser Working Group within the W3C Voice Browser Activity. CCXML is "designed to provide telephony call control support for dialog systems, such as Voice Extensible Markup Language (VoiceXML). While CCXML can be used with any dialog systems capable of handling media, CCXML has been designed to complement and integrate with a VoiceXML interpreter. Because of this, the CCXML specification contains many references to VoiceXML's capabilities and limitations. The CCXML specification also provides details on how VoiceXML and CCXML can be integrated. However, the two languages are separate, and are not required in an implementation of either language. For example, CCXML could be integrated with a more traditional Interactive Voice Response (IVR) system or a 3GPP Media Resource Function (MRF), and VoiceXML or other dialog systems could be integrated with some other call control systems." [adapted from the 11-January-2005 Working Draft]

  • Pronunciation Lexicon Specification (PLS) Version 1.0. W3C Working Draft. 14-February-2005.

    This Working Draft "defines the syntax for specifying pronunciation lexicons to be used by speech recognition and speech synthesis engines in voice browser applications... The accurate specification of pronunciation is critical to the success of speech applications. Most speech recognition and text-to-speech synthesis engines provide extensive high quality lexicons with pronunciation information for most words or phrases. To ensure a maximum coverage of the words or phrases used by an application, application specific pronunciations may be required. These are most commonly needed for proper nouns such as surnames or business names.

    The Pronunciation Lexicon Specification (PLS) is designed to allow interoperable specification of pronunciation information for both speech recognition and speech synthesis engines within voice browsing applications. The language is intended to be easy to use by developers whilst supporting the accurate specification of pronunciation information for international use.

    The language allows one or more pronunciations for a word or phrase to be specified using a standard pronunciation alphabet or if necessary using vendor specific alphabets. Pronunciations are grouped together into a PLS document which may be referenced from other markup languages, such as Speech Recognition Grammar Specification [SRGS] and Speech Synthesis Markup Language [SSML]."

  • Speech Recognition Grammar Specification Version 1.0. W3C Recommendation. 16-March-2004.

    "The primary use of a speech recognizer grammar is to permit a speech application to indicate to a recognizer what it should listen for, specifically: words that may be spoken, patterns in which those words may occur, and spoken language of each word. Speech recognizers may also support the Stochastic Language Models (N-Gram) Specification. Both specifications define ways to set up a speech recognizer to detect spoken input but define the word and patterns of words by different and complementary means. Some recognizers permit cross-references between grammars in the two formats. The rule reference element of this specification describes how to reference an N-gram document.

    The [SRGS Recommendation] document defines the syntax for grammar representation. The grammars are intended for use by speech recognizers and other grammar processors so that developers can specify the words and patterns of words to be listened for by a speech recognizer. The syntax of the grammar format is presented in two forms, an Augmented BNF (ABNF) Form and an XML Form. The specification ensures that the two representations are semantically mappable to allow automatic transformations between the two forms...

    Augmented BNF syntax (ABNF): this is a plain-text (non-XML) representation which is similar to traditional BNF grammar and to many existing BNF-like representations commonly used in the field of speech recognition including the JSpeech Grammar Format from which this specification is derived. Augmented BNF should not be confused with Extended BNF which is used in DTDs for XML and SGML.

    XML: This syntax uses XML elements to represent the grammar constructs and adapts designs from the PipeBeach grammar, TalkML, and a research XML variant of the JSpeech Grammar Format..."

  • Speech Synthesis Markup Language (SSML) Version 1.0. W3C Recommendation. 07-September-2004.

    SSML 1.0 elevates the role of high-quality synthesized speech in Web interactions and represents a fundamental specification in the W3C Speech Interface Framework.

    SSML Version 1.0 was produced by members of the W3C Voice Browser Working Group as part of the Voice Browser Activity within W3C's Interaction Domain. W3C's Voice Browser WG seeks to "develop standards to enable access to the Web using spoken interaction. The Speech Synthesis Markup Language Specification is one of these standards and is designed to provide a rich, XML-based markup language for assisting the generation of synthetic speech in Web and other applications. The essential role of the markup language is to provide authors of synthesizable content a standard way to control aspects of speech such as pronunciation, volume, pitch, rate, etc. across different synthesis-capable platforms."

  • Semantic Interpretation for Speech Recognition. W3C Working Draft. 8-November-2004.

    The SISR Working Draft document "defines the process of Semantic Interpretation for Speech Recognition and the syntax and semantics of semantic interpretation tags that can be added to speech recognition grammars to compute information to return to an application on the basis of rules and tokens that were matched by the speech recognizer. In particular, it defines the syntax and semantics of the contents of Tags in the Speech Recognition Grammar Specification.

    Semantic Interpretation may be useful in combination with other specifications, such as the Stochastic Language Models (N-Gram) Specification, but their use with N-grams has not yet been studied.

    The results of semantic interpretation describe the meaning of a natural language utterance. The current specification represents this information as an ECMAScript object, and defines a mechanism to serialize the result into XML. The W3C Multimodal Interaction Activity is defining a data format (EMMA) for representing information contained in user utterances. It is believed that semantic interpretation will be able to produce results that can be included in EMMA."

  • CSS3 Speech Module. W3C Working Draft. 16-December-2004.

    "CSS (Cascading Style Sheets) is a language for describing the rendering of HTML and XML documents on screen, on paper, in speech, etc. CSS defines aural properties that give control over rendering XML to speech. This Working Draft describes the text to speech properties proposed for CSS level 3. These are designed to match the model described in the Speech Synthesis Markup Language (SSML) Version 1.0...

    The speech rendering of a document, already commonly used by the blind and print-impaired communities, combines speech synthesis and 'auditory icons'. Often such aural presentation occurs by converting the document to plain text and feeding this to a screen reader — software or hardware that simply reads all the characters on the screen. This results in less effective presentation than would be the case if the document structure were retained. Style sheet properties for text to speech may be used together with visual properties (mixed media) or as an aural alternative to visual presentation.

    Besides the obvious accessibility advantages, there are other large markets for listening to information, including in-car use, industrial and medical documentation systems (intranets), home entertainment, and to help users learning to read or who have difficulty reading.

    When using voice properties, the canvas consists of a two channel stereo space and a temporal space (you can specify audio cues before and after synthetic speech). The CSS properties also allow authors to vary the characteristics of synthetic speech (voice type, frequency, inflection, etc.)..."
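Returning to the SRGS entry in the list above: the claim that the XML Form and the ABNF Form are semantically mappable can be illustrated with a toy Python conversion. This is a sketch under narrow assumptions, not a conforming converter: it handles only rules whose body is a single <one-of> of plain-text items, and the sample grammar, the rule name `answer`, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

SRGS_NS = "{http://www.w3.org/2001/06/grammar}"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical yes/no grammar in the SRGS XML Form.
GRAMMAR_XML = """<grammar xmlns="http://www.w3.org/2001/06/grammar"
         version="1.0" xml:lang="en-US" root="answer">
  <rule id="answer">
    <one-of>
      <item>yes</item>
      <item>no</item>
    </one-of>
  </rule>
</grammar>"""


def rule_to_abnf(rule):
    """Render one <rule> containing a single <one-of> as an ABNF alternation."""
    one_of = rule.find(SRGS_NS + "one-of")
    items = [item.text.strip() for item in one_of.findall(SRGS_NS + "item")]
    return "$%s = %s;" % (rule.get("id"), " | ".join(items))


def grammar_to_abnf(xml_text):
    """Map the toy XML grammar to the equivalent ABNF text, header included."""
    root = ET.fromstring(xml_text)
    lines = [
        "#ABNF 1.0;",
        "language %s;" % root.get(XML_LANG),
        "root $%s;" % root.get("root"),
    ]
    lines += [rule_to_abnf(rule) for rule in root.findall(SRGS_NS + "rule")]
    return "\n".join(lines)


if __name__ == "__main__":
    print(grammar_to_abnf(GRAMMAR_XML))
```

A real grammar also uses rule references, repeats, weights, and tags, all of which this sketch ignores; the example only shows that the same alternation can be written in either notation.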

About the VoiceXML Forum

"The VoiceXML Forum is an industry organization formed to create and promote the Voice Extensible Markup Language (VoiceXML). With the backing and contributions of its diverse membership, including key industry leaders, the VoiceXML Forum has successfully driven market acceptance of VoiceXML through a wide array of speech-enabled applications. The VoiceXML Forum is a program of the IEEE Industry Standards and Technology Organization (IEEE-ISTO)."

Hosted By
OASIS - Organization for the Advancement of Structured Information Standards

Robin Cover, Editor