[Cache version from http://www-306.ibm.com/software/pervasive/multimodal/x%2Bv/11/spec.htm; please use the canonical source if possible.]
The XHTML+Voice profile brings spoken interaction to standard web content by integrating the mature XHTML and XML-Events technologies with XML vocabularies developed as part of the W3C Speech Interface Framework. The profile includes voice modules that support speech synthesis, speech dialogs, command and control, and speech grammars. Voice handlers can be attached to XHTML elements and respond to specific DOM events, thereby reusing the event model familiar to web developers. Voice interaction features are integrated with XHTML and CSS and can consequently be used directly within XHTML content.
This section describes the status of this document at the time of its publication. Other documents may supersede this document.
Note that the language profile described in this specification re-uses W3C working drafts that are likely to change. This integration profile will be updated as needed to use the final stable versions of these specifications. This profile is an update to the XHTML+Voice 1.0 profile. XHTML+Voice 1.1 is current with the VoiceXML 2.0 Candidate Recommendation.
The list of known errors in this specification is available at xhtml-voice11-errata.htm. Please report errors in this document to mccobb@us.ibm.com.
1 Introduction
1.1 Motivation And Applications
1.2 Design Principles
1.3 XHTML+Voice Processing Model
1.3.1 Processing within one Document
1.3.1.1 Language and Version
1.3.1.2 VoiceXML Scope within XHTML+Voice
1.3.1.3 Accessing Speech Dialog Results from XHTML
1.3.1.4 Accessing XHTML from a Speech Dialog
1.3.2 Cancel
1.3.3 Declarative Synchronization of Input Modes
1.3.4 Events and Event Handling
1.3.5 Document Linking with Voice
1.3.6 Aural Style Sheets
2 VoiceXML 2.0 Modules
2.1 Modularization Of VoiceXML 2.0
2.2 Speech Dialogs
2.3 Executable Content
2.4 Speech Grammars
2.5 Speech And Non-speech Audio Output
2.6 Event Handling
3 XHTML Modularization
3.1 Document Conformance
3.2 User Agent Conformance
3.3 XHTML Namespace Integration
3.4 XHTML+Voice Profile
3.5 XHTML+Voice Abstract Modules
3.5.1 Abstract Modules
3.5.2 Element content shorthands
3.5.3 Attribute list shorthands
4 XML-Events Module
4.1 Listener
4.2 Event Types
5 XHTML+Voice Extension Module
5.1 Sync
5.2 Cancel
5.3 VoiceXML Field ID Attribute
5.4 VoiceXML Prompt SRC Attribute
A Reusable VoiceXML
B Examples
B.1 What You See Is What You Can Say
B.2 Mixed-initiative Conversational Interface
B.3 Speech-Enabled Mail Interface
B.4 Reusable VoiceXML Subdialogs
C FIA for XHTML+Voice
D DTD
D.1 xhtml+voice11.dtd
E Schema
E.1 xhtml+voice11.xsd
F References
F.1 Normative References
F.2 Informative References
This section is informative.
This document defines version 1.1 of the XHTML+Voice profile. XHTML+Voice 1.1 is a member of the XHTML family of document types, as specified by XHTML Modularization [XHTML Modularization]. XHTML is extended with a modularized subset of VoiceXML 2.0, the XML-Events module, and a module containing a small number of attribute extensions to both XHTML and VoiceXML. The latter module facilitates the sharing of multimodal input data between the VoiceXML dialog and XHTML input and text elements.
The XML-Events module [XML Events] provides XML host languages the ability to uniformly integrate event listeners and associated event handlers with Document Object Model (DOM) Level 2 [DOM2 Events] event interfaces. The result is an event syntax for XHTML-based languages that enables an interoperable way of associating behaviors with document-level markup.
VoiceXML [VoiceXML 2.0] has been designed for creating audio dialogs that feature synthesized speech, digitized audio, recognition of spoken and DTMF key input, recording of spoken input, telephony, and mixed-initiative conversations. In this document, VoiceXML 2.0 is modularized to prepare it for integration into the XHTML family of languages using the XHTML modularization framework. The modules that combine to support speech dialogs for updating XHTML forms and form elements are selected to be added to XHTML. The modules are described as well as the integration issues. The modularization of VoiceXML 2.0 also specifies DOM event types specific to voice interaction for use with the XHTML Events module. Speech dialogs authored in VoiceXML 2.0 can then be treated as event handlers to add voice-interaction specific behaviors to XHTML documents. The language integration supports all of the modules defined in XHTML Modularization, and adds speech interaction functionality to XHTML elements to enable multimodal applications. The document type defined by the XHTML+Voice profile is XHTML Host language document type conformant.
Two mature technologies, XHTML 1.1 [XHTML 1.1] and VoiceXML 2.0 [VoiceXML 2.0], are integrated using [XHTML Modularization] to bring spoken interaction to the visual web. The design leverages open industry APIs like the W3C DOM to create interoperable web content that can be deployed across a variety of end-user devices. Multiple modes of interaction are synchronized and integrated using the DOM 2 Events model [DOM2 Events] and exposed to the content author via XML Events [XML Events].
Today, web applications are authored in XHTML with user interaction created via XHTML form elements. The W3C is presently working on XForms [XForms], the next generation of web forms that bring the power of XML to web application development. The combination of XHTML and Voice described in this document can leverage the semantic richness of web applications created using XForms, while providing a smooth transition for today's developers wishing to deploy multimodal applications by adding spoken interaction to present-day web content. Integrating the work of the W3C voice browser working group into mainstream XHTML content has the advantage of ensuring that future enhancements to the voice browser component such as natural language understanding will be incorporated. Thus, a smooth transition path for web developers wishing to deliver increasingly smart user interaction for their web applications is provided. Building on XHTML Basic [XHTML Basic] and XHTML modularization, content developers will be able to deploy their content to a wide variety of end-user clients ranging from mobile phones and small PDAs to desktop browsers.
XHTML+Voice is an XML application [XML 1.0].
XHTML+Voice is designed for creating multimodal dialogs that combine in a straightforward way the visual input mode represented by XHTML, and speech input and output, as represented by VoiceXML. Here is a "Hello World" example of XHTML+Voice:
<?xml version="1.0"?>
<html
xmlns="http://www.w3.org/1999/xhtml"
xmlns:vxml="http://www.w3.org/2001/vxml"
xmlns:ev="http://www.w3.org/2001/xml-events"
xmlns:xv="http://www.voicexml.org/2002/xhtml+voice"
>
<head>
<title>XHTML+Voice Example</title>
<!-- voice handler -->
<vxml:form id="sayHello">
<vxml:block><vxml:prompt xv:src="#hello"/>
</vxml:block>
</vxml:form>
</head>
<body>
<h1>XHTML+Voice Example</h1>
<p id="hello" ev:event="click" ev:handler="#sayHello">
Hello World!
</p>
</body>
</html>
The speech dialog identified by "sayHello" is activated when the user clicks anywhere on the paragraph identified by "hello." The speech dialog is a VoiceXML form that synthesizes the text obtained from the same paragraph that activated the form. The speech output is "Hello World!"
A speech dialog is defined within XHTML+Voice as a [VoiceXML 2.0] form with a unique ID. The VoiceXML form is activated by an XML-event with an associated handler that references the form's unique ID. The XML-event is generated from a user interaction with an XHTML element, generally a form control, or from a document event such as load or unload. Activating the VoiceXML form sets all form and field item variables to their initial values. This clears the guard conditions on all form items that don't have an initial value set with the expr attribute. The form is run according to the form interpretation algorithm (FIA) specified by VoiceXML.
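As a sketch of activation from a document event rather than a user interaction, the following document attaches a voice handler to the load event on the body element. The form ID "welcome" and the prompt text are illustrative assumptions; the markup pattern follows the "Hello World" example.

```xml
<?xml version="1.0"?>
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:vxml="http://www.w3.org/2001/vxml"
      xmlns:ev="http://www.w3.org/2001/xml-events">
  <head>
    <title>Welcome prompt</title>
    <!-- voice handler; "welcome" is a hypothetical form ID -->
    <vxml:form id="welcome">
      <vxml:block>
        <vxml:prompt>Welcome to the flight booking page.</vxml:prompt>
      </vxml:block>
    </vxml:form>
  </head>
  <!-- the load event on body activates the form and resets its variables -->
  <body ev:event="load" ev:handler="#welcome">
    <h1>Flight booking</h1>
  </body>
</html>
```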
A VoiceXML form requires language and VoiceXML version information. VoiceXML 2.0 includes language and version attributes with its root <vxml> element. XHTML+Voice obtains language and VoiceXML version from XHTML as follows. Language is obtained from the HTML root element's xml:lang attribute, while the VoiceXML version can be derived from the value of the VoiceXML namespace. The language can be overridden by the xml:lang attribute on the VoiceXML grammar and prompt tags.
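A minimal sketch of the language rules just described, assuming an en-US document in which one prompt is spoken in French. The form ID "greet" and the prompt texts are illustrative only.

```xml
<?xml version="1.0"?>
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:vxml="http://www.w3.org/2001/vxml"
      xmlns:ev="http://www.w3.org/2001/xml-events"
      xml:lang="en-US">
  <head>
    <title>Language override</title>
    <vxml:form id="greet">
      <vxml:block>
        <!-- inherits en-US from the html root element -->
        <vxml:prompt>Good morning.</vxml:prompt>
        <!-- xml:lang on the prompt overrides the document language -->
        <vxml:prompt xml:lang="fr-FR">Bonjour.</vxml:prompt>
      </vxml:block>
    </vxml:form>
  </head>
  <body>
    <p ev:event="click" ev:handler="#greet">Click to hear a greeting.</p>
  </body>
</html>
```

No version attribute appears anywhere: the VoiceXML version is implied by the namespace URI bound to the vxml prefix.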
A VoiceXML form within an XHTML+Voice document does not have the session and document scopes defined by VoiceXML. It does not have these scopes for two reasons. First, <form> is the top level VoiceXML element in an XHTML+Voice document. Second, XHTML+Voice does not allow transitions from one voice handler to another. VoiceXML 2.0 allows a form to have either dialog or document scope. If the form's scope is document, as set by the scope attribute, the form is active while another form in the document is running. When the speech input matches the grammar of the form with document scope, there is a transition from the currently running form to the form with the document scope. XHTML+Voice does not allow this transition. Consequently, a form's scope is limited to dialog and the scope attribute is omitted. The grammar scope attribute is also omitted for the same reason. The remaining inner VoiceXML scopes, dialog and anonymous, are the same in XHTML+Voice, as required by the VoiceXML FIA.
XHTML+Voice allows a speech dialog to be referenced as a voice handler in an external file. Because the speech dialog has no scope outside of its enclosing form, only the form in the external file is processed when the form is activated. For example, the script elements in the external file will not be processed. This is because the visual browser only executes script in the current document, and the VoiceXML <script> element is not supported. Consequently, the external reference must contain a fragment identifier specifying the form in addition to an absolute or relative URI. This differs from VoiceXML, which specifies that when the fragment is absent, the form "invoked is the lexically first dialog in the document" [VoiceXML 2.0]. With this restriction, the speech dialog can reside in any external XML document, including VoiceXML. Only the calling document has to be an XHTML+Voice document.
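The following sketch shows a calling document that references an external voice handler. The file name "dialogs.vxml" and the form ID "getCity" are hypothetical; the point is that the handler URI carries both a relative URI and a fragment identifier.

```xml
<?xml version="1.0"?>
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:ev="http://www.w3.org/2001/xml-events">
  <head>
    <title>External voice handler</title>
  </head>
  <body>
    <form name="main" action="">
      <p>
        <!-- "#getCity" is required: a handler URI without a fragment
             would not identify a form in the external file -->
        <input name="from_city" type="text"
               ev:event="focus" ev:handler="dialogs.vxml#getCity"/>
      </p>
    </form>
  </body>
</html>
```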
Because XHTML script placed in an external file is not processed, validation of VoiceXML results cannot be performed within an external subdialog by calling out to some ECMAScript contained within a VoiceXML script tag. ECMAScript validation of subdialog results can only be performed from the calling document. Validation methods must be included in the ECMAScript objects passed as parameters to the subdialog.
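As an illustration of passing validation logic into an external subdialog, the fragment below hands an ECMAScript object to the subdialog as a parameter. Several names are assumptions made for the sketch: the external file "dialogs.vxml", its form "getCity", the object "cityValidator" (assumed to be defined by script in the calling document), and the returned variable "result". The vxml prefix is assumed to be bound as in the "Hello World" example.

```xml
<vxml:form id="askCity">
  <!-- src carries a fragment identifier naming the external form -->
  <vxml:subdialog name="city" src="dialogs.vxml#getCity">
    <!-- cityValidator: hypothetical ECMAScript object from the calling
         document that carries the validation method -->
    <vxml:param name="validator" expr="cityValidator"/>
    <vxml:filled>
      <!-- city.result: hypothetical variable returned by the subdialog -->
      <vxml:assign name="document.main.from_city.value" expr="city.result"/>
    </vxml:filled>
  </vxml:subdialog>
</vxml:form>
```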
VoiceXML <field>, <subdialog>, and <var> elements do not have any visibility to the XHTML namespace as ECMAScript variables. Furthermore, there is no requirement to support the VoiceXML elements as nodes in the DOM object available to JavaScript. There are several problems with supporting the DOM object. Unlike XHTML form control elements, VoiceXML form item elements don't have a value attribute and consequently the DOM node value property is missing. A value attribute is necessary because the VoiceXML form item elements are their own ECMAScript variables, and they have defined values only while the enclosing form is active. At all other times their values are undefined.
Speech dialog results may be accessed from XHTML in one of the following ways:
The global JavaScript scope of an XHTML+Voice document is available to a speech dialog. For example, an XHTML form control element, such as input, can be accessed from within VoiceXML using the DOM object traversal notation available to JavaScript. For example, the value of an input field with name "from_city" is set from the VoiceXML assign tag as follows:
<html xmlns="http://www.w3.org/1999/xhtml"
xmlns:ev="http://www.w3.org/2001/xml-events">
<head>
<form id="form_id" xmlns="http://www.w3.org/2001/vxml">
<field name="from_field">
<filled>
<assign name="document.main.from_city.value"
expr="from_field"/>
</filled>
</field>
</form>
</head>
<body>
<form name="main" action="">
<input name="from_city" type="text"
ev:event="focus" ev:handler="#form_id"/>
</form>
</body>
</html>
The document keyword in XHTML+Voice refers to the JavaScript DOM object. The application.lastresult$ variables are at the same scope as the DOM object, which is effectively the VoiceXML application scope.
XHTML+Voice does not allow multiple speech dialogs to run simultaneously. A speech dialog runs in its own thread and, for many devices, the audio subsystem can be owned by only one thread at one time. Also, other resources that are guaranteed to be thread-safe may cause a voice handler to indefinitely block. Therefore, only one speech dialog can be running at one time per loaded XHTML+Voice document. Consequently, activating a speech dialog must cancel the currently running dialog. This is the default behavior. The running dialog should also be canceled when the current XHTML+Voice document is unloaded.
A speech dialog may be canceled for other reasons, which depend on the multimodal browser implementation and user preference. For example, a speech dialog may be canceled whenever the user leaves the current XHTML form control or clicks on another element, i.e., whenever an HTML 4.01 "onblur" event is generated. An input timeout may also cancel the current speech dialog. However, if a cancel button or some other means to cancel is supplied, then in most cases the default behavior alone will be preferred. It is preferred in cases where the XHTML+Voice application is running in "voice-only" mode while the user is working in another window or application. An e-mail application, for example, should allow a "voice-only" mode to run after losing application focus. The multimodal browser may also have a user preference for cancel which would override the default behavior. A good strategy would be for the multimodal browser to cancel only upon activation of a new speech dialog by default, and provide a user preference for canceling upon an "onblur" event.
The document author can cancel the currently running speech dialog with the <cancel> element that can be specified by an XHTML element as a handler for an XML Event. The XHTML+Voice Extension Module section provides more details.
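A sketch of an author-supplied cancel control follows. It assumes that the cancel element carries an id attribute so that it can be referenced as a handler; the normative attribute list is given in the XHTML+Voice Extension Module section. The IDs "readStory", "story", and "stopVoice" are illustrative.

```xml
<?xml version="1.0"?>
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:vxml="http://www.w3.org/2001/vxml"
      xmlns:ev="http://www.w3.org/2001/xml-events"
      xmlns:xv="http://www.voicexml.org/2002/xhtml+voice">
  <head>
    <title>Cancel sketch</title>
    <vxml:form id="readStory">
      <vxml:block>
        <vxml:prompt xv:src="#story"/>
      </vxml:block>
    </vxml:form>
    <!-- ASSUMPTION: id is how the cancel element is addressed as a handler -->
    <xv:cancel id="stopVoice"/>
  </head>
  <body>
    <!-- clicking the story starts the speech dialog -->
    <p id="story" ev:event="click" ev:handler="#readStory">
      A long passage of text to be read aloud.
    </p>
    <!-- clicking here cancels the running speech dialog -->
    <p ev:event="click" ev:handler="#stopVoice">Stop reading</p>
  </body>
</html>
```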
Cancel is a message from the visual browser that must be handled by the VoiceXML FIA. It is separate from the cancel event supported by VoiceXML that cancels the currently running prompt. The cancel message from the visual browser modifies the FIA in the sense that it must be checked throughout the FIA, and if it is received then the FIA must terminate.
XHTML+Voice 1.1 supports a declarative synchronization of XHTML form control elements and the VoiceXML <field> element. XHTML+Voice 1.1 introduces sync as a standalone element. Sync specifies the following behaviors. First, sync allows input from one speech or visual modality to set the field in the other modality. Second, setting the focus of an <input> element that is synchronized with a VoiceXML field updates the FIA to visit that VoiceXML field. This is useful when there are multiple fields within a VoiceXML form. Sync is both a message to the VoiceXML FIA from the visual browser, like cancel, and a message from the FIA to the visual browser. The XHTML+Voice Extension Module section provides more details.
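The sketch below shows how such a declaration might look. The attribute names xv:input and xv:field on the sync element are assumptions made for illustration and should be checked against the XHTML+Voice Extension Module section; the xv:id attribute on the field is the one listed for the Forms Module. The grammar file "cities.grxml" and all IDs are hypothetical.

```xml
<?xml version="1.0"?>
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:vxml="http://www.w3.org/2001/vxml"
      xmlns:ev="http://www.w3.org/2001/xml-events"
      xmlns:xv="http://www.voicexml.org/2002/xhtml+voice">
  <head>
    <title>Sync sketch</title>
    <vxml:form id="travel">
      <vxml:field name="from_field" xv:id="city_field">
        <vxml:prompt>Which city are you leaving from?</vxml:prompt>
        <vxml:grammar src="cities.grxml"/>
      </vxml:field>
    </vxml:form>
    <!-- ASSUMPTION: xv:input names the XHTML control and xv:field
         references the VoiceXML field by its xv:id -->
    <xv:sync xv:input="from_city" xv:field="#city_field"/>
  </head>
  <body>
    <form name="main" action="">
      <p>
        <input name="from_city" type="text"
               ev:event="focus" ev:handler="#travel"/>
      </p>
    </form>
  </body>
</html>
```

With a declaration of this kind, no explicit assign inside a filled element is needed to copy the recognized value into the input control.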
The nomatch, noinput, help, and error VoiceXML event types are propagated as XML-events to XHTML. They can be linked to a JavaScript handler using the XML-events syntax for specifying target, observer, event, and handler. The events are propagated regardless of whether the event has already been caught and handled properly within the VoiceXML form. Within VoiceXML a chain of events can be created, where one event is caught and another event is thrown, and so on. Because the entire chain of events is propagated to XHTML, the application author should be careful not to chain multiple events of the same type. The VoiceXML error event subtypes error.semantic, error.badfetch, error.unsupported.element, etc., are propagated as the error event type to XHTML. This is in accordance with the VoiceXML specification. This allows the user to define additional error subtypes that can be handled by the visual browser. More general user-defined event types are also supported. If a user-defined event type is defined within the VoiceXML form, such as "foo.bar", then when that event is thrown within the form, it is propagated to XHTML as an XML-event. In the example below, both the noinput and foo.bar events are handled by the visual browser via the XML-events listener tag. Note that the VoiceXML form exits because the foo.bar event is not handled within the form. Throwing an unhandled foo.bar event is like throwing an unhandled exit event, except that the foo.bar event is propagated to XHTML before the form exits.
<vxml:form id="ex1">
<vxml:catch event="noinput">
<vxml:throw event="foo.bar"/>
</vxml:catch>
<vxml:field name="f1">
<vxml:grammar type="boolean"/>
<vxml:prompt>Say yes or no</vxml:prompt>
</vxml:field>
</vxml:form>
<ev:listener ev:target="ex1" ev:event="noinput" ev:handler="#h1"/>
<ev:listener ev:target="ex1" ev:event="foo.bar" ev:handler="#h2"/>
In addition to the VoiceXML event types listed above, XHTML+Voice supports the vxmldone event type. The vxmldone event is generated when the currently running VoiceXML form completes. All the event types that XHTML+Voice supports are listed in the XML-Events Module.
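A possible use of vxmldone is sketched below: a listener observes the form and runs a script when the dialog completes. The listener syntax mirrors the example above. Referencing a script element by an id attribute is an assumption of this sketch, as are the IDs "ex2", "onDone", and "status"; the script guards against the status element being absent.

```xml
<?xml version="1.0"?>
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:vxml="http://www.w3.org/2001/vxml"
      xmlns:ev="http://www.w3.org/2001/xml-events">
  <head>
    <title>vxmldone sketch</title>
    <vxml:form id="ex2">
      <vxml:field name="f1">
        <vxml:grammar type="boolean"/>
        <vxml:prompt>Say yes or no</vxml:prompt>
      </vxml:field>
    </vxml:form>
    <!-- ASSUMPTION: the script is addressed as a handler through its id -->
    <script type="text/javascript" id="onDone">
      var el = document.getElementById("status");
      if (el) { el.firstChild.nodeValue = "Voice dialog finished."; }
    </script>
    <!-- fires when the currently running form "ex2" completes -->
    <ev:listener ev:target="ex2" ev:event="vxmldone" ev:handler="#onDone"/>
  </head>
  <body>
    <p ev:event="click" ev:handler="#ex2">Click to answer by voice.</p>
    <p id="status">Waiting for voice input.</p>
  </body>
</html>
```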
Document linking with voice is available to the author. Given an XHTML+Voice document with the following link and a tags:
<link rel="glossary" title="Glossary" href="glossary.html"/>
<link rel="contents" title="Contents" href="contents.html"/>
<a href="chapter3.html" title="Next Page" rel="next">Next</a>
<a href="chapter1.html" title="Previous Page" rel="previous">Previous</a>
The following grammar can be produced from parsing the document. For this example the grammar is JSGF. The grammar is collected from each link and a element in the document that has a rel attribute. The document author uses the rel attribute to enable document linking for a select set of link and a elements. For each element with a rel attribute, the rel attribute value is added to the grammar. Alternatively, the title attribute can be used in place of rel for international language support:
#JSGF V1.0 iso-8859-1;
grammar document-links;
public <document-links> = Glossary {this.$value="glossary.html"}
| Contents {this.$value="contents.html"}
| Next Page {this.$value="chapter3.html"}
| Previous Page {this.$value="chapter1.html"};
The scope of this grammar is document, so that it is always active. While XHTML+Voice does not support authoring a grammar with document scope within a form, the multimodal browser should support grammars with document scope for document linking and command and control.
With the addition of a src attribute to the VoiceXML <prompt> element, XHTML+Voice 1.1 is able to support Aural style sheets declared according to [CSS2]. Within XHTML, a paragraph with id set to "warnPara" can be styled with the CSS "warn" class:
<p id="warnPara" class="warn">warning</p>
The CSS has visual and aural rules for class "warn." When the VoiceXML <form> processes a prompt with the src attribute set to that paragraph, the aural style rules for "warn" are invoked. The VoiceXML Prompt SRC Attribute section provides a complete example.
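In outline, the pieces fit together as in the following sketch. The stylesheet name "warn.css", the rules shown in the comment, and the form ID "sayWarning" are assumptions for illustration; volume and speech-rate are aural properties defined by [CSS2].

```xml
<?xml version="1.0"?>
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:vxml="http://www.w3.org/2001/vxml"
      xmlns:ev="http://www.w3.org/2001/xml-events"
      xmlns:xv="http://www.voicexml.org/2002/xhtml+voice">
  <head>
    <title>Aural style sketch</title>
    <!-- warn.css (hypothetical) is assumed to contain one rule with
         both visual and aural properties:
           .warn { color: red; volume: loud; speech-rate: slow }
    -->
    <link rel="stylesheet" type="text/css" href="warn.css"/>
    <vxml:form id="sayWarning">
      <vxml:block>
        <!-- the prompt text and its aural style come from the paragraph -->
        <vxml:prompt xv:src="#warnPara"/>
      </vxml:block>
    </vxml:form>
  </head>
  <body>
    <p id="warnPara" class="warn"
       ev:event="click" ev:handler="#sayWarning">warning</p>
  </body>
</html>
```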
This section first modularizes VoiceXML 2.0 and then specifies the various VoiceXML 2.0 modules used in the creation of the XHTML+VoiceXML profile.
The files making up the modularization of the VoiceXML 2.0 SCHEMA are available as voice-xml-modules.zip and have been created to ease the process of integrating VoiceXML 2.0 and XHTML. These modules do not change the VoiceXML 2.0 language as specified by the voice browser working group of the W3C. This section gives a high-level overview of each module.
| Module | Purpose | Elements | XHTML+VoiceXML? |
|---|---|---|---|
| Events | Events thrown by VoiceXML processor | catch help noinput nomatch error throw | Y |
| Executable statements | Statements for use in voice handlers | assign clear var log reprompt | Y |
| Filled | Voice handlers invoked when a slot is filled | filled | Y |
| Flow control | Flow control constructs from VoiceXML | if else elseif return | Y |
| Forms | Encapsulate voice dialogs | form field record subdialog block initial option | Y |
| Miscellaneous | Non-local transfers in VoiceXML | exit goto link script submit | N |
| Menus | VoiceXML menus | menu choice enumerate | N |
| Object | Foreign objects for VoiceXML | object | N |
| Resources | Specifying resources for VoiceXML | param property | Y |
| Root | VoiceXML stand-alone documents | vxml meta | N |
| Output | Speech and audio output | prompt value audio emphasis voice break prosody say-as phoneme paragraph p sentence s mark | Y |
| Telephony | Telephony control | transfer disconnect | N |
| User Input | Speech input constructs from VoiceXML | grammar count example token import item one-of rule ruleref | Y |
| Attributes | Common attributes used in VoiceXML | NA | Y |
| Datatypes | Common datatypes used in VoiceXML | NA | Y |
| Document Model | Defines content model for VoiceXML elements | NA | N |
Modules vxml-exec-1.xsd, vxml-filled-1.xsd, vxml-resource-1.xsd, vxml-flow-1.xsd, and vxml-form-1.xsd support authoring handlers that implement speech dialogs.
Modules vxml-filled-1.xsd, vxml-flow-1.xsd,
vxml-exec-1.xsd, and vxml-resource-1.xsd declare
constructs for use within voice handlers. The semantics of these constructs are
as defined in the VoiceXML 2.0 specification.
The speech grammar modules provide constructs for authoring speech grammars as specified in VoiceXML 2.0. The modules are provided by the normative VoiceXML 2.0 SCHEMA and are unchanged: grammar-core.xsd, grammar.xsd, vxml-grammar-restriction.xsd, and vxml-grammar-extension.xsd. The restriction and extension modules allow the elements and attributes normatively specified by the speech grammar specification [Speech Grammars] to be included within the VoiceXML 2.0 namespace.
The speech and audio output modules define constructs for producing spoken and non-spoken audio output. The modules are provided by the normative VoiceXML SCHEMA and are unchanged: synthesis-core.xsd, synthesis.xsd, vxml-synthesis-restriction.xsd, and vxml-synthesis-extension.xsd. As with the speech grammar modules, the elements and attributes normatively defined in the SSML specification [SSML 1.0] are included within the VoiceXML 2.0 namespace.
This section is normative.
A conforming XHTML+Voice document is a document that requires only the facilities described as mandatory in this specification. Such a document must meet all of the following criteria:
It must validate against the XML Schema provided in this document.
The root element of the document must be html.
The name of the default namespace on the root element must be the XHTML
namespace name: http://www.w3.org/1999/xhtml.
If a DOCTYPE declaration is present and includes a public identifier, the DOCTYPE declaration must reference the DTD provided in this document using its Formal Public Identifier. The system identifier may be modified appropriately.
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+Voice 1.1//EN" "http://www.w3.org/Voice/Group/2003/xhtml+voice11.dtd">
The user agent must conform to the "User Agent Conformance" section of the XHTML specification ([XHTML 1.0], section 3.2) and the conformance requirements detailed in the VoiceXML modules ([VoiceXML 2.0]) supported by the integration profile.
The user agent must conform to the following additional user agent rule:
When the user agent claims to support facilities defined within the VoiceXML 2.0 specifications or facilities required by this specification through normative reference, it must do so in ways consistent with the facilities' definition.
The default XML namespace of an XHTML+Voice document is XHTML. XHTML+Voice extends XHTML with VoiceXML, XML-events, and XHTML+Voice extensions. The VoiceXML, XML-events, and XHTML+Voice extension elements and attributes are included through additional namespace declarations:
<html xmlns="http://www.w3.org/1999/xhtml"
xmlns:vxml="http://www.w3.org/2001/vxml"
xmlns:ev="http://www.w3.org/2001/xml-events"
xmlns:xv="http://www.voicexml.org/2002/xhtml+voice">
The name of the unique prefix identifier for each namespace within the document, for example, vxml for VoiceXML elements, is left to the document author's discretion.
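To illustrate that discretion, the "Hello World" example could be written with shorter prefixes, as sketched below. The prefixes v, e, and x are arbitrary choices; only the namespace URIs are significant.

```xml
<?xml version="1.0"?>
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:v="http://www.w3.org/2001/vxml"
      xmlns:e="http://www.w3.org/2001/xml-events"
      xmlns:x="http://www.voicexml.org/2002/xhtml+voice">
  <head>
    <title>Alternative prefixes</title>
    <!-- same voice handler as before, bound to the prefix "v" -->
    <v:form id="sayHello">
      <v:block><v:prompt x:src="#hello"/></v:block>
    </v:form>
  </head>
  <body>
    <p id="hello" e:event="click" e:handler="#sayHello">Hello World!</p>
  </body>
</html>
```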
The XHTML functionality in the XHTML+Voice document type is based upon the XHTML modules defined in [XHTML Modularization]. The XHTML+Voice profile includes the XHTML modules defined in [XHTML Basic], such as the basic XHTML forms and tables modules. Added to the XHTML Basic modules are the following modules:
The notation, terms and document conventions used here are borrowed from [XHTML 1.1].
The profile includes the XHTML basic module defined in [XHTML Basic], the XHTML scripting module defined in [XHTML 1.1], the XML Event module defined in [XML Events], the XHTML+Voice extension module defined in the XHTML+Voice Extension Module, and the following VoiceXML 2.0 modules:
The namespaces used in these modules are as follows:
| Element | Content | Attributes |
|---|---|---|
| Base Module (XHTML) | ||
| base | EMPTY | href* (URI) |
| Basic Forms Module (XHTML) | ||
| form | Heading | Block - form | Common, action* (URI), method ("get"* | "post"), enctype (ContentType) |
| input | EMPTY | Common, Access, checked ("checked"), maxlength (Number), name (CDATA), size (Number), src (URI), type ("text"* | "password" | "checkbox" | "radio" | "submit" | "reset" | "hidden" ), value (CDATA) |
| label | (PCDATA | Inline - label)* | Common, accesskey (Character), for (IDREF) |
| select | option+ | Common, multiple ("multiple"), name (CDATA), size (Number) |
| option | PCDATA | Common, selected ("selected"), value (CDATA) |
| textarea | PCDATA | Common, Access, cols* (Number), name (CDATA), rows* (Number) |
| Basic Tables Module (XHTML) | ||
| caption | (PCDATA | Inline)* | Common |
| table | caption?, tr+ | Common, summary (Text), width ( Length ) |
| td | (PCDATA | Flow - table)* | Common, Cell, Align |
| th | (PCDATA | Flow - table)* | Common, Cell, Align |
| tr | td+ | Common, Align |
| Events Module (VoiceXML) | ||
| catch | Exec | VoiceHandler, event (NMTOKENS) |
| help | Exec | VoiceHandler |
| noinput | Exec | VoiceHandler |
| nomatch | Exec | VoiceHandler |
| error | Exec | VoiceHandler |
| throw | EMPTY | VoiceHandler, event ( NMTOKEN), eventexpr (Script), message (CDATA), messageexpr (Script) |
| Executable Statements Module (VoiceXML) | ||
| assign | EMPTY | Expr |
| clear | EMPTY | namelist (CDATA) |
| var | EMPTY | Expr |
| log | (PCDATA | value)* | label (CDATA), expr (Script) |
| reprompt | EMPTY | - |
| Filled Module (VoiceXML) | ||
| filled | (Exec)* | mode("any" | "all"*), namelist (CDATA) |
| Flow Control Module (VoiceXML) | ||
| if | (Exec | elseif | else)* | cond (Script) |
| else | EMPTY | - |
| elseif | EMPTY | cond (Script) |
| return | EMPTY | namelist (CDATA), event (NMTOKEN), eventexpr (Script), message (CDATA), messageexpr (Script) |
| Forms Module (VoiceXML) | ||
| form | (Form)* | id (ID) |
| field | ( Audio | EventHandler | filled | grammar | link | vxml:option | prompt | property)* | Item, type (GrammarType), slot ( NMTOKEN ), modal (Boolean), xv:id ( ID) |
| record | ( Audio | EventHandler | filled | grammar | prompt | property)* | Item, type ( ContentType), beep ( Boolean), maxtime ( Duration), modal ( Boolean), dtmfterm ( Boolean), finalsilence ( Duration) |
| subdialog | ( Audio | filled | param | prompt | property)* | Item, Cache, Submit, src (URI), srcexpr (Script), fetchaudio (URI) |
| block | Exec | Item |
| initial | ( Audio | EventHandler | link | prompt | property)* | Item |
| vxml:option | PCDATA | dtmf (CDATA), value (CDATA) |
| Hypertext Module (XHTML) | ||
| a | (PCDATA | Inline - a)* | Common, Access, Linking, hreflang (LanguageCode) |
| Image Module (XHTML) | ||
| img | EMPTY | Common, Dim, alt* (Text), longdesc (URI), src* (URI) |
| Link Module (XHTML) | ||
| link | EMPTY | Linking , media (MediaDesc) |
| List Module (XHTML) | ||
| dl | (dd | dt)+ | Common |
| dt | (PCDATA | Inline)* | Common |
| dd | (PCDATA | Flow)* | Common |
| ol | li+ | Common |
| ul | li+ | Common |
| li | (PCDATA |Flow)* | Common |
| Metainformation Module (XHTML) | ||
| meta | EMPTY | I18N, content* (CDATA), http-equiv (NMTOKEN), name (NMTOKEN), scheme (CDATA) |
| Object Module (XHTML) | ||
| object | (PCDATA | Flow | param)* | Common, Dim, archive (URI), classid (URI), codebase (URI), codetype (ContentType), data (URI), declare ("declare"), name (CDATA), standby (Text), tabindex (Number), type (ContentType) |
| param | EMPTY | id (IDREF), name* (CDATA), type (ContentType), value (CDATA), valuetype ("data"* | "ref" | "object") |
| Output Module (VoiceXML) | ||
| prompt | (Audio | TTS)* | I18N, VoiceHandler, bargein (Boolean), bargeintype ("speech" | "hotword"), timeout (Duration), xv:src (URI) |
| value | EMPTY | expr (Script) |
| audio | ( Audio | TTS)* | Cache, src (URI), expr (Script) |
| emphasis | SentenceContent | level ("strong" | "moderate"* | "none" | "reduced") |
| voice | (SentenceContent | Structure)* | I18N, gender ("male" | "female" | "neutral"), age (Number), variant (Number), name (CDATA) |
| break | EMPTY | size ("large" | "medium"* | "small" | "none"), time (Duration) |
| prosody | (SentenceContent | Structure)* | pitch (CDATA), contour ( CDATA), range ( CDATA), rate ( CDATA), duration (Duration), volume ( CDATA) |
| say-as | ( PCDATA | value)* | type (SayAsType) |
| phoneme | PCDATA | ph (CDATA), alphabet (CDATA) |
| paragraph, p | (SentenceContent | Sentence)* | I18N |
| sentence, s | SentenceContent | I18N |
| mark | (SentenceContent | Sentence)* | name (IDREF) |
| Resources Module (VoiceXML) | ||
| param | EMPTY | Expr, value (CDATA), valuetype ("data"* | "ref"), type (CDATA) |
| property | EMPTY | name (NMTOKEN), value (CDATA) |
| Scripting Module (XHTML) | ||
| script | PCDATA | charset (CharSet), defer ("defer"), src (URI), type* (ContentType), xml:space="preserve" |
| noscript | ( Heading | Block | List)+ | Common |
| Structure Module (XHTML) | ||
| body | (Heading | Block | List)* | Common |
| head | title, ( meta | link | object | script | vxml:form | ev:listener | xv:sync | xv:cancel)* | I18N, profile (URI) |
| html | head, body | I18N, version ( CDATA), xmlns (URI = "http://www.w3.org/1999/xhtml") |
| title | PCDATA | I18N |
| Text Module (XHTML) | ||
| abbr | (PCDATA | Inline)* | Common |
| acronym | (PCDATA | Inline)* | Common |
| address | (PCDATA | Inline)* | Common |
| blockquote | (PCDATA | Heading | Block | List)* | Common, cite (URI) |
| br | EMPTY | Core |
| cite | (PCDATA | Inline)* | Common |
| code | (PCDATA | Inline)* | Common |
| dfn | (PCDATA | Inline)* | Common |
| div | (PCDATA | Flow)* | Common |
| em | (PCDATA | Inline)* | Common |
| h1 | (PCDATA | Inline)* | Common |
| h2 | (PCDATA | Inline)* | Common |
| h3 | (PCDATA | Inline )* | Common |
| h4 | (PCDATA | Inline )* | Common |
| h5 | (PCDATA | Inline )* | Common |
| h6 | (PCDATA | Inline )* | Common |
| kbd | (PCDATA | Inline )* | Common |
| p | (PCDATA | Inline )* | Common |
| pre | (PCDATA | Inline )* | Common, xml:space="preserve" |
| q | (PCDATA | Inline )* | Common, cite (URI) |
| samp | (PCDATA | Inline )* | Common |
| span | (PCDATA | Inline )* | Common |
| strong | (PCDATA | Inline )* | Common |
| var | (PCDATA | Inline )* | Common |
| User Input Module (VoiceXML) | ||
| grammar | (PCDATA | meta | metadata | lexicon | rule)* | Cache, I18N, version (NMTOKEN), root (IDREF), mode ("voice"* | "dtmf"), src (URI), type (ContentType), weight ( CDATA), tag-format (URI) |
| example | PCDATA | |
| lexicon | EMPTY | uri (URI), type (ContentType) |
| tag | PCDATA | |
| token | PCDATA | I18N |
| item | (RuleExpansion)* | I18N, weight (NMTOKEN), repeat (NMTOKEN), repeat-prob (NMTOKEN) |
| meta | EMPTY | name (NMTOKEN), content (CDATA), http-equiv (NMTOKEN) |
| metadata | ANY | |
| one-of | (item)+ | I18N |
| rule | ( RuleExpansion | example)* | id (ID), scope ("private"* | "public") |
| ruleref | EMPTY | I18N, uri (URI), type (ContentType), special ("NULL" | "VOID" | "GARBAGE") |
| XML Events Module (XML Events) | ||
| listener | EMPTY | XEvents |
| XHTML+Voice Extension Module (XHTML+Voice) | ||
| sync | EMPTY | |
| cancel | EMPTY | |
| Elements | Attributes | |
| vxml:field& | id (ID) | |
| vxml:prompt& | src (URI) | |
| Element Entities | Content |
|---|---|
| Audio (VoiceXML) | PCDATA | audio | value |
| Block (XHTML) | address | blockquote | div | p | pre |
| EventHandler (VoiceXML) | catch | help | noinput | nomatch | error |
| Exec (VoiceXML) | Audio | assign | clear | disconnect | exit | goto | if | log | prompt | reprompt | return | script | submit | throw | var |
| Flow (XHTML) | Heading | List | Block | Inline |
| Form (VoiceXML) | EventHandler | grammar | filled | initial | object | link | property | record | subdialog | Variable |
| Heading (XHTML) | h1 | h2 | h3 | h4 | h5 | h6 |
| Inline (XHTML) | a | abbr | acronym | button | br | cite | code | dfn | em | img | input | kbd | label | object | q | samp | select | span | strong | textarea |
| RuleExpansion (VoiceXML) | PCDATA | token | ruleref | item | one-of |
| SentenceContent (VoiceXML) | Audio | SentenceElements |
| SentenceElements (VoiceXML) | break | emphasis | phoneme | mark | prosody | say-as | voice |
| Structure (VoiceXML) | sentence | s | paragraph | p |
| TTS (VoiceXML) | SentenceElements | Structure |
| Variable (VoiceXML) | block | field | var |
| Attribute Entities | Content |
|---|---|
| Access (XHTML) | accesskey (Character), tabindex (Number) |
| Align (XHTML) | align ("left" | "center" | "right"), valign ("top" | "middle" | "bottom") |
| Cache (VoiceXML) | fetchhint ("prefetch" | "safe"), fetchtimeout (Duration), maxage (Number), maxstale (Number) |
| Cell (XHTML) | abbr (Text), axis ( CDATA), colspan (Number), headers (IDREFS), rowspan (Number), scope ("row" | "col") |
| Common (XHTML) | Core, Events, XEvents |
| Core (XHTML) | class (NMTOKENS), id (ID), title ( CDATA ) |
| Dim (XHTML) | height ( Length ), width (Length) |
| Events (XHTML) | MouseEvents , KeyEvents |
| Expr (VoiceXML) | name (VarName), expr (Script ) |
| I18N (XML) | xml:lang (NMTOKEN) |
| Item (VoiceXML) | name (VarName), cond (Script), expr (Script) |
| KeyEvents (XHTML) | onkeypress (Script), onkeydown (Script), onkeyup (Script) |
| Linking (XHTML) | charset (CharSet), href (URI), hreflang (LanguageCode), rel (LinkTypes), rev (LinkTypes), type (ContentType) |
| MouseEvents (XHTML) | onclick (Script), ondblclick (Script), onmousedown (Script), onmouseover (Script), onmousemove (Script), onmouseout (Script) |
| Next (VoiceXML) | next (URI), expr (Script) |
| Style (XHTML) | style ( CDATA ) |
| Submit (VoiceXML) | method ("get"* | "post"), enctype (ContentType), namelist (CDATA) |
| VoiceHandler (VoiceXML) | count (Number), cond (Script) |
| XEvents (XML Events) | event, observer (IDREF), handler (URI), target (IDREF), phase ("capture" | "default"*), propagate ("stop" | "continue"*), defaultAction ("cancel" | "perform"*), id |
| Attribute Type | Description |
|---|---|
| Boolean | "true" | "false" |
| Duration | A positive real number followed by either 's' (seconds) or 'ms' (milliseconds) |
| GrammarType | CDATA |
| SayAsType | "acronym" | "spell-out" | "currency" | "measure" | "name" | "telephone" | "address" | "number" | "number:ordinal" | "number:digits" | "number:cardinal" | "date" | "date:dmy" | "date:mdy" | "date:ymd" | "date:ym" | "date:my" | "date:md" | "date:y" | "date:m" | "date:d" | "time" | "time:hms" | "time:hm" | "time:h" | "duration" | "duration:hms" | "duration:hm" | "duration:ms" | "duration:h" | "duration:m" | "duration:s" | "net" | "net:email" | "net:uri" | "vxml:date" | "vxml:boolean" | "vxml:currency" | "vxml:time" | "vxml:digits" | "vxml:number" | "vxml:phone" |
| VarName | NMTOKEN or NMTOKEN with "$" appended |
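As a rough illustration of how the attribute types above appear in markup, the following sketch sets a Duration-valued property and applies the Cache attributes to an external grammar. The form id, the field name, and the file name drink.gram are placeholders chosen for illustration, and the fragment assumes it sits in the head of an XHTML+Voice document.

```xml
<vxml:form id="pickDrink" xmlns:vxml="http://www.w3.org/2001/vxml">
  <!-- property: name is an NMTOKEN and value is CDATA; "3s" is written as a Duration -->
  <vxml:property name="timeout" value="3s"/>
  <vxml:field name="drink">
    <vxml:prompt>Coffee, tea, or milk?</vxml:prompt>
    <!-- Cache attributes: fetchhint is "prefetch" or "safe", fetchtimeout is a
         Duration, maxage and maxstale are Numbers -->
    <vxml:grammar src="drink.gram" fetchhint="safe" fetchtimeout="500ms"
                  maxage="3600" maxstale="0"/>
  </vxml:field>
</vxml:form>
```

Both "3s" and "500ms" match the Duration definition: a positive real number followed by 's' or 'ms'.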
This section is normative.
XHTML+Voice extends XHTML with the XML-Events <listener> element and its attributes. The <listener> attributes are added to XHTML elements primarily for activating voice handlers. The <listener> element and attributes belong to the XML-Events namespace:
xmlns:ev="http://www.w3.org/2001/xml-events"
For a given XML language extended with XML Events, a set of event types must be specified independently of the [XML Events] module. The XML Event types supported by the XHTML+Voice profile include all event types defined for [HTML 4.01] intrinsic events. VoiceXML handler activation is specified by including with an XHTML element one of these event types as an XML event and an ID reference to the VoiceXML form as an XML event handler.
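The two ways of attaching a voice handler can be sketched as follows. This is an illustrative fragment rather than an example taken from the profile: the ids sayCity, ask, and cityForm and the grammar file city.gram are hypothetical. The unprefixed attributes on the <ev:listener> element follow the usual XML Events convention, which is assumed to apply here.

```xml
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:vxml="http://www.w3.org/2001/vxml"
      xmlns:ev="http://www.w3.org/2001/xml-events">
  <head>
    <title>Handler activation sketch</title>
    <vxml:form id="sayCity">
      <vxml:field name="city">
        <vxml:prompt>Which city?</vxml:prompt>
        <vxml:grammar src="city.gram"/>
      </vxml:field>
    </vxml:form>
    <!-- Stand-alone listener: a click on the element with id "ask" activates the form -->
    <ev:listener event="click" observer="ask" handler="#sayCity"/>
  </head>
  <body>
    <form id="cityForm" action="cgi/submit">
      <p>
        <!-- Listener attributes placed directly on the observing element -->
        <input type="text" name="city" ev:event="focus" ev:handler="#sayCity"/>
        <button id="ask" type="button">Ask by voice</button>
      </p>
    </form>
  </body>
</html>
```

In both cases the event type is an HTML 4.01 intrinsic event written without the "on" prefix, and the handler is an ID reference to the VoiceXML form.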
The XHTML+Voice profile supports the following VoiceXML 2.0 event types: nomatch, noinput, error, and help. The VoiceXML exit and cancel event types are supported within the VoiceXML form but are not propagated to the visual browser. Event types defined by the author within VoiceXML, also known as user-defined event types, are also propagated to the visual browser. However, the VoiceXML <form> element does not support adding <listener> attributes.
An additional XHTML+Voice event type, "vxmldone", is supported. The vxmldone event is generated when the voice handler completes.
The XHTML+Voice profile extends the XHTML <script> element with XML Events. The <script> element doesn't generate any events of its own, so the target attribute is required to specify capturing an XML event. The <script> element can target any XHTML or VoiceXML element and can specify any HTML 4.01 intrinsic event or VoiceXML event. Here is an example of how a <script> element can be a handler for a "vxmldone" event. The value of XHTML input "drink" is updated when the voice handler "fid" completes:
<script type="text/javascript" ev:event="vxmldone" ev:target="fid">
  document.xform.drink.value = application.lastresult$[0].utterance;
</script>
<vxml:form id="fid">
  <vxml:field name="f1">
    <vxml:grammar src="drink.gram"/>
    <vxml:prompt>Coffee, tea, or milk?</vxml:prompt>
  </vxml:field>
</vxml:form>
<body>
  <form id="xform" action="cgi/submit">
    <input type="text" id="drink" ev:event="focus" ev:handler="#fid"/>
  </form>
The following table matches the XHTML+Voice event types with the XHTML or VoiceXML elements that support them. When the <listener> event attribute is added to an XHTML element, it must specify an event type listed in the right-hand column for that element. Because the HTML 4.01 event types have been translated into XML-event types, the "on" prefix for these event types has been removed.
| Elements | Event Type |
|---|---|
| XHTML body | load, unload |
| Most XHTML elements | click, dblclick, mousedown, mouseup, mouseover, mouseout, keypress, keydown, keyup |
| XHTML elements: a, label, input, select, textarea, button | focus, blur |
| XHTML form | submit, reset |
| XHTML elements: input, textarea | select |
| XHTML elements: input, select, textarea | change |
| VoiceXML form | nomatch, noinput, error, help, vxmldone, "user defined" |
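A minimal sketch of two rows of the table in use is shown below, assuming a hypothetical form id askCity, paragraph id status, and grammar file city.gram. The body element activates the voice handler on load, and a script handles the nomatch event raised by the VoiceXML form, following the same ev:target pattern as the vxmldone example above.

```xml
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:vxml="http://www.w3.org/2001/vxml"
      xmlns:ev="http://www.w3.org/2001/xml-events">
  <head>
    <title>Event type sketch</title>
    <vxml:form id="askCity">
      <vxml:field name="city">
        <vxml:prompt>Which city?</vxml:prompt>
        <vxml:grammar src="city.gram"/>
      </vxml:field>
    </vxml:form>
    <!-- Runs when the voice form reports that the utterance matched no grammar -->
    <script type="text/javascript" ev:event="nomatch" ev:target="askCity">
      document.getElementById("status").firstChild.nodeValue =
          "Sorry, that was not understood.";
    </script>
  </head>
  <!-- "load", not "onload": the "on" prefix is dropped for XML event types -->
  <body ev:event="load" ev:handler="#askCity">
    <p id="status">Listening.</p>
  </body>
</html>
```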
This section is normative.
The XHTML+Voice Extension module extends XHTML+Voice 1.0 with the <sync> element, the <cancel> element, the src attribute of the VoiceXML <prompt> element, and the id attribute of the VoiceXML <field> element. The elements and attributes in this module belong to their own namespace:
xmlns:xv="http://www.voicexml.org/2002/xhtml+voice"
The value of the namespace is temporary until the XHTML+Voice specification is taken over by the W3C Voice Browser Working Group. At that time the Voice Browser Working Group will obtain from the W3C an official XHTML+Voice namespace and location for the XHTML+Voice schema.
The XHTML+Voice <sync> element adds support for synchronization of data entered via either speech or visual input. It binds the value property of the input field, or JavaScript variable, to the VoiceXML field with the given id attribute value. This means several things:
Sync does not activate a voice handler. This means that if the <sync> element has specified an XHTML input field but no VoiceXML form is currently active, nothing will happen when a user clicks on the input field unless an XML-event and event handler are also specified for the input field. If an event and event handler are specified, then when the user clicks on the input field the VoiceXML form is activated and the guard conditions of the VoiceXML form items are cleared. The XHTML input field is not cleared if data is already there.
The <sync> element attributes are:
| input | The name of an XHTML input field or JavaScript variable. |
| field | A URI reference to a field ID within a VoiceXML form. |
The type of the input attribute is NMTOKEN. The type of the field attribute is URI. The URI must include a fragment identifier that references a VoiceXML <field> ID. If the <field> element is in an external file, then the fragment identifier is appended to the URI.
The What You See Is What You Can Say and Mixed-initiative Conversational Interface examples both use the <sync> element to synchronize XHTML inputs and VoiceXML fields.
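The binding described above might look like the following sketch. The names drink, drinkField, and order and the file drink.gram are illustrative. Writing the extension attributes with the xv: prefix (xv:input, xv:field, xv:id) is an assumption based on the statement that the module's attributes belong to the XHTML+Voice namespace, so check the schema for the exact form.

```xml
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:vxml="http://www.w3.org/2001/vxml"
      xmlns:ev="http://www.w3.org/2001/xml-events"
      xmlns:xv="http://www.voicexml.org/2002/xhtml+voice">
  <head>
    <title>Sync sketch</title>
    <vxml:form id="order">
      <!-- The extension id attribute gives the field an ID that sync can reference -->
      <vxml:field name="f_drink" xv:id="drinkField">
        <vxml:prompt>Coffee, tea, or milk?</vxml:prompt>
        <vxml:grammar src="drink.gram"/>
      </vxml:field>
    </vxml:form>
    <!-- Bind the XHTML input named "drink" to the VoiceXML field "drinkField" -->
    <xv:sync xv:input="drink" xv:field="#drinkField"/>
  </head>
  <body>
    <form id="xform" action="cgi/submit">
      <p>
        <!-- sync alone does not activate the voice handler; the focus event does -->
        <input type="text" name="drink" ev:event="focus" ev:handler="#order"/>
      </p>
    </form>
  </body>
</html>
```

The event and handler on the input are included because, as noted above, <sync> only keeps the two values aligned and does not start the speech dialog.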
The XHTML+Voice <cancel> element allows a document author to cancel a running speech dialog. It is a stand-alone element with no content that can be referenced as an XML event handler. The <cancel> element has two attributes, id and handler. The id attribute is required. The optional handler attribute references the id attribute of a voice handler form. If the handler attribute is omitted, then the currently running speech dialog is canceled. If handler is specified, then only the specified voice handler is canceled.
The <cancel> element attributes are:
| id | Unique document identifier. |
| handler | A URI reference to a VoiceXML form ID. |
The type of the id attribute is ID. The type of the handler attribute is URI. The URI must include a fragment identifier that references a VoiceXML <form> ID. If the <form> element is in an external file, then the fragment identifier is appended to the URI.
<head>
  <title>Cancel Example</title>
  ...
  <xv:cancel id="cid1" handler="#fid1"/>
  <xv:cancel id="cid2"/>
</head>
<body>
  ...
  <input type="reset" ev:event="click" ev:handler="#cid1"/>
  <button ev:event="click" ev:handler="#cid2">Cancel Voice</button>
The example above shows how <cancel> can be used to cancel either a specific speech dialog or the