Cover Pages: VoiceXML 2.0 and Speech Recognition Grammar Published as W3C Recommendations.

The World Wide Web Consortium has released the first two W3C Recommendations in its Speech Interface Framework. "Aimed at the world's estimated two billion fixed line and mobile phones, W3C's Speech Interface Framework will allow an unprecedented number of people to use any telephone to interact with appropriately designed Web-based services via key pads, spoken commands, listening to pre-recorded speech, synthetic speech and music."

The Voice Extensible Markup Language (VoiceXML) Version 2.0 Recommendation defines VoiceXML, designed for "creating audio dialogs that feature synthesized speech, digitized audio, recognition of spoken and DTMF key input, recording of spoken input, telephony, and mixed initiative conversations. Its major goal is to bring the advantages of Web-based development and content delivery to interactive voice response applications."

The second Recommendation, Speech Recognition Grammar Specification Version 1.0, is key to VoiceXML's support for speech recognition, and is used by developers to describe end-users' responses to spoken prompts. It defines syntax for representing grammars for use in speech recognition so that developers can specify the words and patterns of words to be listened for by a speech recognizer. The syntax of the grammar format is presented in two forms, an Augmented BNF Form and an XML Form. The specification makes the two representations mappable to allow automatic transformations between the two forms.

Bibliographic Information

Voice Extensible Markup Language (VoiceXML) Version 2.0. W3C Recommendation 16-March-2004. Edited by Scott McGlashan (Hewlett-Packard, Editor-in-Chief), Daniel C. Burnett (Nuance Communications), Jerry Carter (Invited Expert), Peter Danielsen (Lucent, until October 2002), Jim Ferrans (Motorola), Andrew Hunt (ScanSoft), Bruce Lucas (IBM), Brad Porter (Tellme Networks), Ken Rehor (Vocalocity), and Steph Tryphonas (Tellme Networks). Version URL: http://www.w3.org/TR/2004/REC-voicexml20-20040316/. Latest Version URL: http://www.w3.org/TR/voicexml20/. Previous Version URL: http://www.w3.org/TR/2004/PR-voicexml20-20040203/.

Speech Recognition Grammar Specification Version 1.0. W3C Recommendation 16-March-2004. Edited by Andrew Hunt (ScanSoft) and Scott McGlashan (Hewlett-Packard). Version URL: http://www.w3.org/TR/2004/REC-speech-grammar-20040316/. Latest version URL: http://www.w3.org/TR/speech-grammar/. Previous version URL: http://www.w3.org/TR/2003/PR-speech-grammar-20031218/.

"This document was written with the participation of the members of the W3C Voice Browser Working Group, here listed in alphabetical order: Mike Brown, Lucent Bell Labs; Dan Burnett, Nuance Communications; Emily Candell, Comverse; Jerry Carter, Invited Expert; Debbie Dahl, Invited Expert; Debajit Ghosh, Nuance Communications; Andrew Hunt, ScanSoft; Stefan Krause, ScanSoft; Sol Lerner, ScanSoft; Bruce Lucas, IBM; Jens Marschner, ScanSoft; Scott McGlashan, Hewlett-Packard; Yves Normandin, Locus Dialogue; Brad Porter, Tellme; Dave Raggett, W3C/Canon; David Ramsthaler, Cisco; Luc Van Tichelen, ScanSoft; Kuansan Wang, Microsoft; Laura Werner, BeVocal."

VoiceXML Overview

Goals: VoiceXML's main goal is to bring the full power of Web development and content delivery to voice response applications, and to free the authors of such applications from low-level programming and resource management. It enables integration of voice services with data services using the familiar client-server paradigm. A voice service is viewed as a sequence of interaction dialogs between a user and an implementation platform. The dialogs are provided by document servers, which may be external to the implementation platform. Document servers maintain overall service logic, perform database and legacy system operations, and produce dialogs. A VoiceXML document specifies each interaction dialog to be conducted by a VoiceXML interpreter. User input affects dialog interpretation and is collected into requests submitted to a document server. The document server replies with another VoiceXML document to continue the user's session with other dialogs.

VoiceXML is a markup language that:

Minimizes client/server interactions by specifying multiple interactions per document.

Shields application authors from low-level and platform-specific details.

Separates user interaction code (in VoiceXML) from service logic (e.g., CGI scripts).

Promotes service portability across implementation platforms; VoiceXML is a common language for content providers, tool providers, and platform providers.

Is easy to use for simple interactions, and yet provides language features to support complex dialogs.

While VoiceXML strives to accommodate the requirements of a majority of voice response services, services with stringent requirements may best be served by dedicated applications that employ a finer level of control.
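The client-server flow described above can be made concrete with a small example. The following is a minimal sketch, not taken from the Recommendation: it embeds a hand-written VoiceXML 2.0 document in a Python string and uses the standard library to list the fields it collects and where the results are submitted. The form id, field names, grammar file names, and the `http://example.com/weather` URL are all hypothetical.

```python
import xml.etree.ElementTree as ET

VXML_NS = "http://www.w3.org/2001/vxml"

# Hypothetical dialog: one document carries two interactions (city, day)
# before anything is sent back to the document server.
DOCUMENT = """<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="weather">
    <field name="city">
      <prompt>Which city?</prompt>
      <grammar src="city.grxml" type="application/srgs+xml"/>
    </field>
    <field name="day">
      <prompt>Which day?</prompt>
      <grammar src="day.grxml" type="application/srgs+xml"/>
    </field>
    <block>
      <submit next="http://example.com/weather" namelist="city day"/>
    </block>
  </form>
</vxml>
"""


def summarize(document: str) -> dict:
    """Parse a VoiceXML document and report its fields and submit target."""
    root = ET.fromstring(document.encode("utf-8"))
    fields = [f.get("name") for f in root.iter(f"{{{VXML_NS}}}field")]
    submit = root.find(f".//{{{VXML_NS}}}submit")
    return {
        "fields": fields,
        "next": submit.get("next"),
        "namelist": submit.get("namelist").split(),
    }


if __name__ == "__main__":
    print(summarize(DOCUMENT))
```

The script only inspects the markup; it does not interpret the dialog. In a deployment, a VoiceXML interpreter would play the prompts, fill `city` and `day` from user input, and request the next document from the server named in `submit`.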

Scope: The language describes the human-machine interaction provided by voice response systems, which includes: Output of synthesized speech (text-to-speech); Output of audio files; Recognition of spoken input; Recognition of DTMF input; Recording of spoken input; Control of dialog flow; Telephony features such as call transfer and disconnect. The language provides means for collecting character and/or spoken input, assigning the input results to document-defined request variables, and making decisions that affect the interpretation of documents written in the language. A document may be linked to other documents through Universal Resource Identifiers (URIs).

"The [Recommendation] document defines VoiceXML, the Voice Extensible Markup Language. Its background, basic concepts and use are presented in Section 1. The dialog constructs of form, menu and link, and the mechanism (Form Interpretation Algorithm) by which they are interpreted are then introduced in Section 2. User input using DTMF and speech grammars is covered in Section 3, while Section 4 covers system output using speech synthesis and recorded audio. Mechanisms for manipulating dialog control flow, including variables, events, and executable elements, are explained in Section 5. Environment features such as parameters and properties as well as resource handling are specified in Section 6. The appendices provide additional information including the VoiceXML Schema, a detailed specification of the Form Interpretation Algorithm and timing, audio file formats, and statements relating to conformance, internationalization, accessibility and privacy..." [from the Rec]

Speech Recognition Grammar Overview

"The primary use of a speech recognizer grammar is to permit a speech application to indicate to a recognizer what it should listen for, specifically: words that may be spoken, patterns in which those words may occur, and spoken language of each word. Speech recognizers may also support the Stochastic Language Models (N-Gram) Specification [NGRAM]. Both specifications define ways to set up a speech recognizer to detect spoken input but define the words and patterns of words by different and complementary means. Some recognizers permit cross-references between grammars in the two formats. The rule reference element of this specification describes how to reference an N-gram document.

The [Recommendation] document defines the syntax for grammar representation. The grammars are intended for use by speech recognizers and other grammar processors so that developers can specify the words and patterns of words to be listened for by a speech recognizer. The syntax of the grammar format is presented in two forms, an Augmented BNF (ABNF) Form and an XML Form. The specification ensures that the two representations are semantically mappable to allow automatic transformations between the two forms.

Augmented BNF syntax (ABNF): this is a plain-text (non-XML) representation which is similar to traditional BNF grammar and to many existing BNF-like representations commonly used in the field of speech recognition including the JSpeech Grammar Format from which this specification is derived. Augmented BNF should not be confused with Extended BNF which is used in DTDs for XML and SGML.

XML: This syntax uses XML elements to represent the grammar constructs and adapts designs from the PipeBeach grammar, TalkML, and a research XML variant of the JSpeech Grammar Format.

Both the ABNF Form and XML Form have the expressive power of a Context-Free Grammar (CFG). A grammar processor that does not support recursive grammars has the expressive power of a Finite State Machine (FSM) or regular expression language... This form of language expression is sufficient for the vast majority of speech recognition applications..." [excerpted from the Rec]
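To illustrate the two forms and the remark about non-recursive grammars, here is a minimal sketch in Python. It holds one small, hand-written grammar in both the ABNF Form and the XML Form, then enumerates every phrase the XML Form accepts. The grammar itself (rules `command`, `action`, `object`) is invented for illustration, the ABNF text is shown only for comparison and is not parsed, and the expander handles just a subset of the XML Form (`rule`, `item`, `one-of`, local `ruleref`), ignoring features such as repeats, weights, and tags.

```python
import xml.etree.ElementTree as ET

SRGS_NS = "http://www.w3.org/2001/06/grammar"

# Hypothetical grammar in the ABNF Form (for comparison only, not parsed here).
ABNF_FORM = """#ABNF 1.0;
language en-US;
mode voice;
root $command;
public $command = $action the $object;
$action = open | close;
$object = window | door;
"""

# The same hypothetical grammar in the XML Form.
XML_FORM = """<grammar version="1.0" xmlns="http://www.w3.org/2001/06/grammar"
         xml:lang="en-US" root="command" mode="voice">
  <rule id="command" scope="public">
    <ruleref uri="#action"/> the <ruleref uri="#object"/>
  </rule>
  <rule id="action">
    <one-of><item>open</item><item>close</item></one-of>
  </rule>
  <rule id="object">
    <one-of><item>window</item><item>door</item></one-of>
  </rule>
</grammar>
"""


def expand(elem, rules):
    """Return every word sequence matched by elem (non-recursive grammars only)."""
    tag = elem.tag.rsplit("}", 1)[-1]
    if tag == "ruleref":
        return expand(rules[elem.get("uri").lstrip("#")], rules)
    if tag == "one-of":
        # Alternatives: the union of what each child item matches.
        return [seq for item in elem for seq in expand(item, rules)]
    # rule or item: a sequence of literal words and nested expansions.
    seqs = [(elem.text or "").split()]
    for child in elem:
        seqs = [s + c for s in seqs for c in expand(child, rules)]
        seqs = [s + (child.tail or "").split() for s in seqs]
    return seqs


def phrases(grammar_xml: str):
    """List the phrases accepted by the grammar's root rule."""
    root = ET.fromstring(grammar_xml)
    rules = {r.get("id"): r for r in root.findall(f"{{{SRGS_NS}}}rule")}
    return [" ".join(seq) for seq in expand(rules[root.get("root")], rules)]


if __name__ == "__main__":
    for phrase in phrases(XML_FORM):
        print(phrase)
```

Because this grammar has no recursive rule references, the set of accepted phrases is finite and can be listed exhaustively. A recursive grammar would make `expand` loop without end, which is one way to see the gap between full context-free power and the finite-state case mentioned above.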

From the W3C Announcement

Giving voice to the Web, the World Wide Web Consortium (W3C) has published VoiceXML 2.0 and Speech Recognition Grammar Specification (SRGS) as W3C Recommendations. The goal of VoiceXML 2.0 is to bring the advantages of Web-based development and content delivery to interactive voice response applications. SRGS is key to VoiceXML's support for speech recognition, and is used by developers to describe end-users' responses to spoken prompts.

Today's announcement marks the advancement to Recommendation status of the first two specifications in W3C's Speech Interface Framework. Aimed at the world's estimated two billion fixed line and mobile phones, W3C's Speech Interface Framework will allow an unprecedented number of people to use any telephone to interact with appropriately designed Web-based services via key pads, spoken commands, listening to pre-recorded speech, synthetic speech and music.

"The completion of VoiceXML 2.0 and SRGS marks an exciting milestone in the convergence of telecom technologies and the Web. Historically, there were both technical and cultural gaps between the way voice-based systems have evolved and that of the Internet and Web, leaving the information available only to voice systems or the Web," explained Tim Berners-Lee, W3C Director. "With the development of the W3C Speech Interface Framework, including VoiceXML 2.0 and SRGS, we're now able to integrate and benefit from the strengths of both groups - the power and impact of industrial research and broad product testing and deployment, and the extensibility and openness of technical solutions that are consistent with Web technical principles and can scale accordingly."

A World Wide Web Consortium (W3C) Recommendation is understood by industry and the Web community at large as a Web standard. Each Recommendation is a stable specification developed by a W3C Working Group and reviewed by the W3C Membership. Recommendations promote interoperability of Web technologies by explicitly conveying the industry consensus formed by the Working Group.

In the W3C Speech Interface Framework, VoiceXML controls how the application interacts with the user, while the Speech Synthesis Markup Language (SSML) is used for spoken prompts and the Speech Recognition Grammar Specification (SRGS) for guiding the speech recognizers via grammars that describe the expected user responses. Other specifications in the Framework include Voice Browser Call Control (CCXML), which provides telephony call control support for VoiceXML and other dialog systems, and Semantic Interpretation for Speech Recognition, which defines how speech grammars bind to application semantics.

"VoiceXML 2.0 has the power to change the way phone-based information and customer services are developed. No longer will we have to press 'one' for this or 'two' for that. Instead, we will be able to make selections and provide information by speech," explained Dave Raggett, W3C Voice Browser Activity Lead. "In addition, VoiceXML 2.0 creates opportunities for people with visual impairments or those needing Web access while keeping their hands and eyes free for other things, such as getting directions while driving."

The Speech Recognition Grammar Specification (SRGS) allows applications to specify the words and phrases that users are prompted to speak. This enables robust speaker-independent recognition. SRGS covers both speech and DTMF input. DTMF input is valuable in noisy conditions or when the social context makes it awkward to speak. Speech recognizers are generally able to report the degree of confidence (that is, the likelihood of having correctly recognized the word or phrase) and may provide the most likely alternatives when the recognizer is uncertain as to which of them the user actually said.
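How an application might act on confidence scores and alternatives can be sketched briefly. The following Python fragment is an illustration only: the `Hypothesis` type, the `decide` function, the 0.5 acceptance threshold, and the sample utterances are all assumptions chosen for the example, not part of SRGS or VoiceXML.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """One recognition result (hypothetical structure)."""
    utterance: str
    confidence: float  # assumed to range from 0.0 to 1.0


def decide(nbest, accept_at=0.5, max_alternatives=2):
    """Accept the best result if confident enough, else offer alternatives."""
    ranked = sorted(nbest, key=lambda h: h.confidence, reverse=True)
    if not ranked:
        # Nothing was recognized: ask the user again.
        return ("reprompt", [])
    if ranked[0].confidence >= accept_at:
        return ("accept", [ranked[0].utterance])
    # Recognizer is uncertain: confirm among the most likely candidates.
    return ("confirm", [h.utterance for h in ranked[:max_alternatives]])


if __name__ == "__main__":
    print(decide([Hypothesis("Boston", 0.82), Hypothesis("Austin", 0.41)]))
    print(decide([Hypothesis("Austin", 0.41), Hypothesis("Boston", 0.45)]))
```

The threshold and the number of alternatives would normally be tuned per application; the values here are placeholders.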

In addition to the continued work on the remainder of the Speech Interface Framework, the Voice Browser Working Group is already hard at work designing the requirements for the next major version of the dialog markup language, which will build upon the success of VoiceXML 2.0 and incorporate ideas from SALT, XHTML+Voice, and other W3C Member contributions. The W3C Voice Browser Working Group is among the largest and most active in W3C. Its participants include: Aspect Communications, BeVocal, Canon, Comverse Technology, Convedia, ERCIM, France Telecom, HeyAnita, Hitachi, HP, IBM, Intel, IWA-HWG, Loquendo, Microsoft, MITRE, Mitsubishi Electric, Motorola, Nuance Communications, Openstream, SAP, ScanSoft, Siemens, Snowshore Networks, Sun Microsystems, Telera, Tellme Networks, Verascape, VoiceGenie Technologies, Voxeo, and Voxpilot..." [excerpted]

Principal references:

W3C Announcement: "World Wide Web Consortium Issues VoiceXML 2.0 and Speech Recognition Grammar as W3C Recommendations. Critical Components of the W3C Speech Interface Framework Now Complete." Announcement also in French and Japanese.
VoiceXML Forum Announcement: "VoiceXML Forum Endorses Release of VoiceXML 2.0 Recommendation by W3C. Readies Platform Certification Program. Vendors Encouraged to Submit Products for Certification Now."
Voice Extensible Markup Language (VoiceXML) Version 2.0. W3C Recommendation 16-March-2004.
VoiceXML XML Schema Definition. See the specification's normative Appendix O for other schemas defined in the VoiceXML namespace.
Speech Recognition Grammar Specification Version 1.0. W3C Recommendation 16-March-2004.
Testimonials for W3C's Recommendations: VoiceXML 2.0 and Speech Recognition Grammar Specification (SRGS). From Aspect Communications, Comverse, Genesys Telecommunications Laboratories, HP, IBM, Loquendo, Microsoft Corporation, Motorola, Nuance, Openstream, Inc., ScanSoft, Tellme, Vocalocity, VoiceGenie, Voxeo, and Voxpilot.
W3C Voice Browser home page
Voice Browser Working Group Charter
Mail archive of 'www-voice@w3.org', a list for discussion of the use and design of voice applications on the Web, and more specifically, feedback on the W3C VoiceXML specifications.
Voice Browser Activity Statement
VoiceXML 2.0 Implementation Report
Implementation Report -- Speech Recognition Grammar Specification Version 1.0
Introduction and Overview of W3C Speech Interface Framework. W3C Working Draft 4-December-2000.
W3C's Royalty-Free License. The core VoiceXML 2.0 specification is made available according to W3C Royalty-Free (RF) Licensing Requirements.
Voice Browser Patent Statements
VoiceXML Forum web site
Related W3C Specifications:
- Speech Synthesis Markup Language Version 1.0. W3C Candidate Recommendation 18-December-2003.
- Voice Browser Call Control: CCXML Version 1.0. W3C Working Draft 12-June-2003.
- IBM's Call Control XML (CCXML) Interpreter. Consists of an interpreter for WebSphere Voice Response for AIX, along with example applications and associated user documentation. AIX 5.1 with ML3 and Voice Response Beans Environment 3.1.0.10 are prerequisites for installing and using the CCXML Interpreter.
- Semantic Interpretation for Speech Recognition. W3C Working Draft 16-November-2001.
Earlier news:
- "W3C Voice Extensible Markup Language (VoiceXML) 2.0 a Proposed Recommendation."
- W3C Announcement: "World Wide Web Consortium Issues VoiceXML 2.0 as a W3C Proposed Recommendation. Cornerstone to the W3C Speech Interface Framework is Nearly Complete." Also available in French and Japanese. [source, and news item]
- VoiceXML Forum Announcement: "VoiceXML Forum Endorses Advancement of VoiceXML 2.0 Specification to Proposed Recommendation. Announces Availability of X+V 1.2 as Multimodal Solution of Choice. Forum Acts as Official Steward of X+V Specification."
- "W3C Advances VoiceXML Version 2.0 to Candidate Recommendation Status." News story 2003-01-29.
"VoiceXML Forum" - General references.

