CP RSS Channel
VoiceXML Forum Publishes Session Log Annotation Markup Language (SLAML) Spec.
The VoiceXML Forum recently announced the publication of a new draft specification which describes a methodology for collecting, storing, and retrieving runtime data for speech-based services and will help make data-analysis and service-tuning tools platform-independent. The public comment period closes on November 9, 2007.
The Session Log Annotation Markup Language (SLAML) specification was produced by members of the Forum's Data Logging Working Group, itself one of four working groups within the VoiceXML Tools Committee. Other WGs include the Metalanguage Working Group, Advanced Dialogs Working Group, and Open Source Grammars Working Group.
The SLAML specification design recognizes that data generated by a speech-based application during runtime can provide valuable visibility into the performance of the application and the user interaction. However, "data capture has not been adequately addressed by or integrated into any relevant speech industry standards," according to David Thomson, Chair of the VoiceXML Forum Tools Committee. "The SLAML specification will enable service providers to mix-and-match development tools, application servers and VoiceXML browsers, while maintaining a consistent data-logging format. Additionally, industry-wide adoption of the specification will make field data easier to analyze and use, improving speech system performance and usability."
The VoiceXML Forum SLAML specification includes five documents. The VoiceXML Data Logging Overview describes the overall SLAML model, the application server, the VoiceXML browser, and the speech recognizer entities. "A complete VoiceXML system is divided into three parts: an application development environment (offline tools), an application server (online tools, sometimes called a VoiceXML page server or document server), and a VoiceXML browser, sometimes called a speech server or voice gateway. Service creation developers use application development tools to write the application. The application server takes the application specification as input and creates VoiceXML. The VoiceXML browser uses VoiceXML as input to provide the service. During runtime, the application server sends VoiceXML documents to and receives documents from the browser in response to user input and other events. The browser executes VoiceXML code and uses touch-tone, recorded prompts, text-to-speech, and speech recognition software to communicate with callers."
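The document flow described in the overview can be pictured with a minimal VoiceXML page of the kind an application server might send to a browser (the file names here are illustrative only):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <form id="greeting">
    <field name="answer">
      <prompt>Say yes or no.</prompt>
      <grammar src="yesno.grxml" type="application/srgs+xml"/>
      <filled>
        <!-- Submit the recognized value back to the application server,
             which responds with the next VoiceXML document -->
        <submit next="handle-answer.jsp" namelist="answer"/>
      </filled>
    </field>
  </form>
</vxml>
```

The browser plays the prompt, recognizes the caller's speech against the grammar, and posts the result back to the application server, continuing the request/response loop the overview describes.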
The Session Log Annotation Markup Language (SLAML) "presents a format for annotating XML-based data logs so that information from multiple, concurrent processes can be aggregated and correlated. Logs are represented using XML formats and namespaces associated with the type of server that generated the log."
The Application Server Logging Specification (ASLS) document "defines a format for logging in server-side applications. Log information conforming to ASLS should be embedded in a Session Log Annotation Markup Language (SLAML) document: the SLAML document aggregates logs from all the collaborating entities, including browsers and servers. The server-side logging should complement the browser logs and provide visibility to the actions and decisions involved in handling browser requests. The server-side logging reflects the application logic flow rather than the run-time environment details. The model assumes a web-like processing environment that consists of browsers and servers. This is congruent with the VoiceXML and CCXML environments."
The Automatic Speech Recognition Logging Specification describes tags for logging run-time data in an ASR (automatic speech recognition) server. Typically, the ASR server is part of a speech services system that includes an ASR server, a text-to-speech server, a VoiceXML browser, an application server, a database server, interfaces to external servers, and possibly other servers.
The VoiceXML Browser Data Logging Specification describes tags for logging run-time data in VoiceXML browsers.
VoiceXML is one of the key work items under development by the W3C Voice Browser Working Group. W3C's VBWG is producing a suite of specifications known as the W3C Speech Interface Framework. It covers voice dialogs, speech synthesis, speech recognition, telephony call control for voice browsers and other requirements for interactive voice response applications, including use by people with hearing or speaking impairments. VoiceXML 2.1 was recently approved as a W3C Recommendation. It "is designed for creating audio dialogs that feature synthesized speech, digitized audio, recognition of spoken and DTMF key input, recording of spoken input, telephony, and mixed initiative conversations. Its major goal is to bring the advantages of Web-based development and content delivery to interactive voice response applications..."
Publication of the First Public Working Draft of VoiceXML 3.0 is expected by the end of September 2007. Its purpose is "to provide powerful dialog capabilities that can be used to build advanced speech applications, and to provide these capabilities in a form that can be easily and cleanly integrated with other W3C languages. It will provide enhancements to existing dialog and media control, as well as major new features (e.g. modularization, a cleaner separation between data/flow/dialog, and asynchronous external eventing) to facilitate interoperability with external applications and media components."
The VoiceXML Forum was founded in 1999 as an industry organization whose mission is to promote and to accelerate the worldwide adoption of VoiceXML-based applications. To this end, the Forum serves as an educational and technical resource, a certification authority and a contributor and liaison to international standards bodies, such as the World Wide Web Consortium (W3C), IETF, ANSI, and ISO. The VoiceXML Forum is organized as a program of the IEEE Industry Standards and Technology Organization (IEEE-ISTO).
The SLAML specification includes five documents:
VoiceXML Data Logging Overview. Produced by members of the VoiceXML Forum Tools Committee. Primary authors: Bogdan Blaszczak (Intervoice), Kyle Danielson (LumenVox), Chung Lai (LumenVox), David Thomson (SpeechPhone), and Andrew Wahbe (Genesys). Draft 0.3. 20-August-2007. 8 pages. Copyright © 2007 VoiceXML Forum. Also in Word (.doc) format.
"This document provides an overview of the Data Logging Specification. The Data Logging Specification is organized with a separate document for each entity — initially the speech recognition server, the application server, and the VoiceXML browser — plus a set of SLAML documents that describe the format and syntax. In the future, the set of specifications will be expanded to include text-to-speech synthesis, speaker verification, web servers, and possibly database, development and/or analysis systems, and other transaction servers..."
Session Log Annotation Markup Language (SLAML). Edited by Andrew Wahbe (Genesys). Draft. 20-August-2007. The SLAML specification defines the data logging model. See the overview below.
Application Server Logging Specification (ASLS). Edited by Bogdan Blaszczak (Intervoice). Produced by members of the VoiceXML Forum Tools Working Group. Draft 1.5: Internal Working Draft, 31-July-2007. Released for Public Review 20-August-2007. ASLS defines a format for logging in server-side applications.
"This document defines a format for high-level logging on application servers. The log information conforming to the Application Server Logging Specification (ASLS) should be embedded in a Session Log Annotation Markup Language (SLAML) document. The SLAML document aggregates logs from all the collaborating entities, including browsers and servers. The server-side logging should complement the browser logs and provide visibility to the actions and decisions involved in handling browser requests. The server-side logging reflects the application logic flow rather than the run-time environment details.
The Application Server Logging Specification (ASLS) is based on a high-level model of the application processing. The model assumes a web-like processing environment that consists of browsers and servers. This is congruent with the VoiceXML and CCXML environments. The abstractions used in ASLS are independent of any specific implementation of server-side applications. The abstractions capture information sufficient for application-level, business-oriented reporting or other, similar post-processing. ASLS is not a substitute for the run-time trace logs..."
Automatic Speech Recognition Logging Specification. Edited by Kyle Danielson and Chung Pak Lai (LumenVox). Produced by members of the VoiceXML Forum Tools Working Group. Draft 1.8: Internal Working Draft, 31-July-2007. Released for Public Review 20-August-2007. Describes tags for logging run-time data in an automatic speech recognition server.
"This document describes tags for logging run-time data in an ASR (automatic speech recognition) server. Typically, the ASR server is part of a speech services system that includes an ASR server, a text-to-speech server, a VoiceXML browser, an application server, a database server, interfaces to external servers, and possibly other servers. The data logging (DL) specification, defined by the Tools Committee within the VoiceXML Forum, is called SLAML and comprises an SLAML overview plus a specification for the individual servers mentioned above (ASR, browser, etc.)..."
VoiceXML Browser Data Logging Specification. Schema 'vxml-session.xsd' describes tags for logging run-time data in VoiceXML browsers. Hypertext-style documentation is provided for XML elements and types (simple, complex).
Excerpted from the SLAML "Introduction" and "General Model" description:
The SLAML specification "presents a format for annotating XML-based data logs so that information from multiple, concurrent processes can be aggregated and correlated. Logs are represented using XML formats and namespaces associated with the type of server that generated the log. For example, an application server may use one set of tags to log information, and a database server might use another. The Session Log Annotation Markup Language (SLAML) provides a common framework that can be used by these logging formats including attributes for identifying a log's source entity (i.e., the server that produced it) and timestamping logs. It also provides structural rules that all logging formats can follow so that data can be organized in a consistent fashion.
Tracing through logs generated by the interactions of distributed processes is often problematic since it is difficult to identify the logs on one machine that correspond to a request placed by another machine. This problem is often due to the fact that clock synchronization is either missing or not accurate to the level required to make an association between logs. Under load, this task becomes impossible without some sort of mechanism to identify the logs associated with a specific request. SLAML, therefore, provides a common framework for correlating the sending and receipt of messages between processes or entities in a distributed system. It also provides mechanisms for dealing with parallelism within a single entity.
A SLAML document may contain logging information from a single process or server; however, SLAML allows information from multiple sources to be aggregated into a single document. Typically, an aggregate document would contain all of the information associated with a single, system-wide session. A document could, however, contain information related to multiple sessions. Each SLAML document includes a manifest of the sessions described by the document and can capture the association between logs and a session.
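Since the draft itself defines the concrete syntax, the aggregation model can only be sketched here with assumed names; every element, attribute, and namespace below is hypothetical, chosen for illustration rather than taken from the SLAML draft:

```xml
<!-- Hypothetical sketch of the SLAML aggregation model; all names
     are illustrative, not taken from the draft specification. -->
<slaml:document xmlns:slaml="urn:example:slaml">
  <slaml:manifest>
    <!-- One entry per session described by this document -->
    <slaml:session id="session-42"/>
  </slaml:manifest>
  <!-- Logs from two collaborating entities, correlated by session -->
  <slaml:log tag="browser-main" entity-class="vxml-browser"
             session="session-42">
    <!-- browser log records, in the browser's own namespace -->
  </slaml:log>
  <slaml:log tag="appserver-1" entity-class="application-server"
             session="session-42">
    <!-- application server log records (e.g., ASLS) -->
  </slaml:log>
</slaml:document>
```

The point of the sketch is the structure: a manifest of sessions up front, then one log per entity, each carrying enough annotation (source entity, session) to let a tool stitch the distributed picture back together.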
Entities: SLAML can be used to organize and annotate the logs produced by a system of one or more entities. An entity is a thread of control in the system that performs some processing and records logging information about that processing. These entities may be operating system processes, or individual threads within the same physical process...
Logs: Entities record information about the processing they perform and group the information into logs. Each log contains information from a single entity; however, a single entity can store information in multiple logs. Each log is uniquely identified by a tag. Typically, information in a log is semantically related in some way; for example, a log may contain information associated with a specific session...
Log Records and Log Record Formats: The XML contents of a log are referred to as log records. Log records are represented with elements outside of the SLAML namespace. It is expected that each type of entity that produces SLAML logs will define an XML log record format and namespace for its log records. SLAML defines attributes that are used to annotate logs in a consistent way across log record formats. A log record format is typically defined using a schema, DTD or other mechanism... There are three main types of elements in a log record: events, periods, and data elements. Event elements represent an atomic event that occurred at a specific point in time during the processing performed by a system entity. The receipt of a system message is an example of an event. A decision that is reached during processing is often logged as an event, though the processing required to reach the decision lasted for a period of time rather than being atomic...
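Again with assumed, hypothetical names (the actual tags are defined by each log record format and by the SLAML draft), a log record mixing the three element types might look like:

```xml
<!-- Hypothetical log record; the asr: vocabulary and the SLAML
     timing attributes are assumed for illustration only. -->
<slaml:log tag="asr-1" xmlns:slaml="urn:example:slaml"
           xmlns:asr="urn:example:asr-log">
  <!-- An event element: atomic, occurring at one instant -->
  <asr:recognition-started slaml:time="2007-08-20T14:03:07.120Z"/>
  <!-- A period element: processing that spans an interval -->
  <asr:recognition slaml:start="2007-08-20T14:03:07.120Z"
                   slaml:end="2007-08-20T14:03:09.480Z">
    <!-- A data element: a value produced by that processing -->
    <asr:result confidence="0.87">yes</asr:result>
  </asr:recognition>
</slaml:log>
```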
Messages and Interactions: Entities are able to interact with each other using some form of communication or stimulus. We will use the term message to refer to this stimulus. In addition to communicating with each other, entities can send and receive messages from sources that are external to the system. Individual messages are components of higher level interactions that fit within the constraints defined by some protocol. For example, many protocols organize messages into request-response pairs wherein the request message solicits the recipient to send another response message back to the request sender. These request-response pairs are modeled as interactions by SLAML...
Classes: Entities are grouped into sets called classes. Every entity in a class accepts the same set of messages. SLAML does not depend on any formal definition of these messages. It only requires that entities be grouped in this way. Whenever an entity sends a message or request to another, the associated logging information is annotated with the class of the recipient. This is useful information as it provides insight into the role of the message recipient...
Sessions: SLAML can be used to annotate the logs of a system that organizes its processing based on sessions. Informally, a session is some subset of processing that is related in some way; for example, a session may reflect the processing associated with a specific user. A system may at any time be executing one or more sessions...
The Session Manifest: Every SLAML document contains a session manifest listing all of the sessions that are started in logs contained in the document. Each session listing provides a name for the session and may contain a description of the session that is recorded using an XML schema and namespace associated with the entity that received the session start...
Log Aggregation: SLAML provides mechanisms for annotating log records so that all of the logs that are associated with a session can be aggregated into a single SLAML document. The sending of a message (or request) is always annotated with the class of the recipient entity. SLAML also requires that the tag of the log that captures the receipt of the message is recorded along with the class, and allows the address of the message or request to be optionally recorded...
Supplemental Log Annotations: SLAML allows a log processor to annotate a log with additional information. This allows comments on a log to be recorded by a human being who is assessing the information it contains. For example, a support engineer can use this mechanism to indicate where a problem occurred and to provide other supplemental information. Automated processes could also add these annotations, if some log processing yields information that should be added to a SLAML document. As with the logs themselves, these annotations can be in any XML format...
Several specifications from W3C, IETF, and the VoiceXML Forum related to SLAML (in alphabetical order) include:
CCXML (Call Control Extensible Markup Language): The Voice Browser Call Control: CCXML Version 1.0 specification "describes CCXML, or the Call Control eXtensible Markup Language. CCXML is designed to provide telephony call control support for dialog systems, such as VoiceXML. While CCXML can be used with any dialog systems capable of handling media, CCXML has been designed to complement and integrate with a VoiceXML interpreter. Because of this there are many references to VoiceXML's capabilities and limitations. There are also details on how VoiceXML and CCXML can be integrated. However, it should be noted that the two languages are separate and are not required in an implementation of either language. For example, CCXML could be integrated with a more traditional Interactive Voice Response (IVR) system or a 3GPP Media Resource Function (MRF), and VoiceXML or other dialog systems could be integrated with other call control systems." The document was produced as part of the W3C Voice Browser Activity.
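A minimal CCXML 1.0 document illustrates the integration just described: the event processor answers an incoming call and then starts a VoiceXML dialog (the dialog URI is illustrative; note that CCXML attribute values are ECMAScript expressions, hence the nested quotes):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<ccxml version="1.0" xmlns="http://www.w3.org/2002/09/ccxml">
  <eventprocessor>
    <!-- Answer an incoming call... -->
    <transition event="connection.alerting">
      <accept/>
    </transition>
    <!-- ...then hand the connected caller to a VoiceXML dialog -->
    <transition event="connection.connected">
      <dialogstart src="'hello.vxml'"/>
    </transition>
  </eventprocessor>
</ccxml>
```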
MOML (Media Objects Markup Language): IETF Internet Draft Media Objects Markup Language (MOML) was produced for the IETF Session Initiation Proposal Investigation (SIPPING) Working Group. MOML is an XML language for composing complex media objects from a vocabulary of simple media resource objects called primitives. It is primarily a descriptive or declarative language to describe media processing objects. This MOML specification describes a markup language to configure and define media resource objects within a media server. The language allows the definition of sophisticated and complex media processing objects which may be used for application interactions with users, i.e. as part of a user dialog, or as media transformation operations. Media Objects Markup Language (MOML) itself does not specify a language suitable for constructing complete user interfaces as does VoiceXML. Rather, it defines a language from which individual pieces of a dialog may be specified. MOML is not a standalone language but will generally be used in conjunction with other languages such as the Media Sessions Markup Language (MSML) or protocols such as the Session Initiation Protocol (SIP). MSML is used to invoke and control many different services on a media server and to manipulate the flow of media streams within a media server. SIP is used to establish media sessions and there are conventions to use the SIP Request-URI to invoke common media server services. MOML has both a framework, which describes the composition of media resource objects, and the definition of an initial set of primitive media resource objects. The following sections describe the structure and usage of MOML followed by sections defining all of the MOML XML elements..."
MRCP (A Media Resource Control Protocol): Developed by Cisco, Nuance, and Speechworks, the Media Resource Control Protocol (MRCP) (IETF Request for Comments #4463) is designed to provide a mechanism for a client device requiring audio/video stream processing to control processing resources on the network. These media processing resources may be speech recognizers (a.k.a. Automatic-Speech-Recognition (ASR) engines), speech synthesizers (a.k.a. Text-To-Speech (TTS) engines), fax, signal detectors, etc. MRCP allows implementation of distributed Interactive Voice Response platforms, for example VoiceXML interpreters. The MRCP protocol defines the requests, responses, and events needed to control the media processing resources. The MRCP protocol defines the state machine for each resource and the required state transitions for each request and server-generated event. The MRCP protocol does not address how the control session is established with the server and relies on the Real Time Streaming Protocol (RTSP) to establish and maintain the session. The session control protocol is also responsible for establishing the media connection from the client to the network server. The MRCP protocol and its messaging is designed to be carried over RTSP or another protocol as a MIME-type similar to the Session Description Protocol (SDP)...
The Media Resource Control Protocol Version 2 (MRCPv2) allows client hosts to control media service resources such as speech synthesizers, recognizers, verifiers and identifiers residing in servers on the network. MRCPv2 is not a "stand-alone" protocol — it relies on a session management protocol such as the Session Initiation Protocol (SIP) to establish the MRCPv2 control session between the client and the server, and for rendezvous and capability discovery. It also depends on SIP and SDP to establish the media sessions and associated parameters between the media source or sink and the media server. Once this is done, the MRCPv2 protocol exchange operates over the control session established above, allowing the client to control the media processing resources on the speech resource server..." See details in the VoiceXML Forum MRCP Committee page.
MSCML (Media Server Control Markup Language and Protocol): Media Server Control Markup Language (MSCML) (IETF Request for Comments #4722) is a markup language used in conjunction with SIP to provide advanced conferencing and interactive voice response (IVR) functions. MSCML presents an application-level control model, as opposed to device-level control models. One use of this protocol is for communications between a conference focus and mixer in the IETF SIP Conferencing Framework... MSCML fills the need for IVR and conference control with requests and responses over a SIP transport... the goal of MSCML is to provide an application interface that follows the SIP, HTTP, and XML development paradigm to foster easier and more rapid application deployment... Many attributes in the MSCML schema have default values. In order to limit demands on the XML parser, MSCML applies these values at the protocol, not XML, level. The MSCML schema documents these defaults as XML annotations to the appropriate attribute. VoiceXML fills the need for IVR with requests and responses over an HTTP transport. This enables developers to use whatever model fits their needs best. The document defines the IANA Registration of MIME Media Type 'application/mediaservercontrol+xml'.
MSCP (Media Server Control Protocol): IETF Internet Draft Media Server Control Protocol (MSCP) defines a protocol to control interactive dialog and conferencing functions on a media server. The protocol messages — requests, responses and notifications — are modeled on dialog and conferencing elements defined in CCXML (Voice Browser Call Control), and interactive dialogs can be specified in VoiceXML (Voice Extensible Markup Language). MSCP messages have self-contained XML representation and transaction models, so the protocol is independent of the underlying transport channel. Messages may be transported using SIP or, preferably, using a dedicated transport channel... The scope of the protocol is control of media server functions for interactive media (e.g., play a prompt, interpret DTMF, etc) and conferencing functions (e.g., create a conference, join participants to conference, etc) as well as notifications related to these functions. The protocol defines request-response and notification messages in XML for these functions. The protocol also provides extensibility mechanisms allowing messages which are defined outside this document to be passed using the MSCP protocol. The extensibility mechanisms may be used to pass messages describing other functions..."
MSML (Media Server Markup Language): IETF Internet Draft Media Server Markup Language (MSML) is an XML language used to control the flow of media streams and services applied to media streams within a media server. It is used to invoke many different types of services on individual sessions, groups of sessions, and conferences. MSML allows the creation of conferences, bridging different sessions together, and bridging sessions into conferences. MSML may also be used to create user interaction dialogs and allows the application of media transforms to media streams. Media interaction dialogs created using MSML allow construction of IVR dialog sessions to individual users as well as to groups of users participating in a conference. Dialogs may also be specified using other languages, such as VoiceXML, which support complete single-party application logic to be executed on the Media Server. MSML is a transport-independent language: it does not rely on underlying transport mechanisms, and its semantics are independent of transport. However, SIP is a typical and commonly used transport mechanism for MSML, invoked using the SIP URI scheme. This specification defines using MSML Dialogs using SIP as the transport mechanism... MSML has been designed to address the control and manipulation of media processing operations (e.g., announcement, IVR, play and record, ASR/TTS, fax, video), as well as control and relationships of media streams (e.g., simple and advanced conferencing). It provides a general-purpose media server control architecture. MSML can additionally be used to invoke other more complex IVR languages such as VoiceXML...
PLS (Pronunciation Lexicon Specification): The Pronunciation Lexicon Specification (PLS) Version 1.0 specification was published as a W3C Working Draft on 26-October-2006. "The accurate specification of pronunciation is critical to the success of speech applications. Most Automatic Speech Recognition (ASR) and Text-To-Speech (TTS) engines internally provide extensive high quality lexicons with pronunciation information for many words or phrases. To ensure a maximum coverage of the words or phrases used by an application, application-specific pronunciations may be required. For example, these may be needed for proper nouns such as surnames or business names. The Pronunciation Lexicon Specification (PLS) is designed to enable interoperable specification of pronunciation information for both ASR and TTS engines within voice browsing applications. The language is intended to be easy to use by developers while supporting the accurate specification of pronunciation information for international use. The language allows one or more pronunciations for a word or phrase to be specified using a standard pronunciation alphabet or if necessary using vendor specific alphabets. Pronunciations are grouped together into a PLS document which may be referenced from other markup languages, such as the Speech Recognition Grammar Specification (SRGS) and the Speech Synthesis Markup Language (SSML).
In its most general sense, a lexicon is merely a list of words or phrases, possibly containing information associated with and related to the items in the list. This document uses the term lexicon in only one specific way, as 'pronunciation lexicon'. In this particular document, 'lexicon' means a mapping between words (or short phrases), their written representations, and their pronunciations suitable for use by an ASR engine or a TTS engine. However, pronunciation lexicons are not limited to voice browsers, because they have proven effective mechanisms to support accessibility for persons with disabilities as well as greater usability for all users — for instance in screen readers and other user agents, such as multimodal interfaces..."
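A small PLS 1.0 document illustrates the idea: one lexeme maps a written form to two alternative IPA pronunciations (the word chosen here is illustrative):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
         xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
         alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>tomato</grapheme>
    <!-- Alternative pronunciations an ASR engine should accept,
         and a TTS engine may choose between -->
    <phoneme>təˈmeɪtoʊ</phoneme>
    <phoneme>təˈmɑːtoʊ</phoneme>
  </lexeme>
</lexicon>
```

A document like this can then be referenced from an SRGS grammar or an SSML prompt, so that both recognition and synthesis share the same application-specific pronunciations.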
SCXML (State Chart XML): The State Chart XML (SCXML): State Machine Notation for Control Abstraction specification was published as a W3C Working Draft on 21-February-2007. SCXML provides a generic state-machine execution environment based on CCXML and Harel State Tables. It is part of the VBWG VXML 3.0 architecture, as referenced by Michael Bodell. See also State Chart XML (SCXML) - IBM Resources.
The Working Draft document describes SCXML as a general-purpose event-based state machine language that can be used in many ways, including:
- As a high-level dialog language controlling VoiceXML 3.0's encapsulated speech modules — voice form, voice picklist, etc.
- As a voice application metalanguage, where in addition to VoiceXML 3.0 functionality, it may also control database access and business logic modules
- As a multimodal control language in the MultiModal Interaction framework, combining VoiceXML 3.0 dialogs with dialogs in other modalities including keyboard and mouse, ink, vision, haptics, etc; it may also control combined modalities such as lipreading (combined speech recognition and vision) speech input with keyboard as fallback, and multiple keyboards for multi-user editing
- As the state machine framework for a future version of CCXML
- As an extended call center management language, combining CCXML call control functionality with computer-telephony integration for call centers that integrate telephone calls with computer screen pops, as well as other types of message exchange such as chats, instant messaging, etc
- As a general process control language in other contexts not involving speech processing
SCXML combines concepts from CCXML and Harel State Tables. CCXML is an event-based state machine language designed to support call control features in Voice Applications — specifically including VoiceXML but not limited to it. The CCXML 1.0 specification defines both a state machine and event handling syntax and a standardized set of call control elements. Harel State Tables are a state machine notation that was developed by the mathematician David Harel and is included in UML. They offer a clean and well-thought-out semantics for sophisticated constructs such as parallel states. They have been defined as a graphical specification language, however, and hence do not have an XML representation. The goal of this document is to combine Harel semantics with an XML syntax that is a logical extension of CCXML's state and event notation..."
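A minimal SCXML document in the style of the Working Draft shows the basic state/transition notation (the state and event names are illustrative):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<scxml xmlns="http://www.w3.org/2005/07/scxml"
       version="1.0" initial="idle">
  <state id="idle">
    <!-- An external event drives the machine to a new state -->
    <transition event="call.incoming" target="talking"/>
  </state>
  <state id="talking">
    <transition event="call.hangup" target="idle"/>
  </state>
</scxml>
```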
SIP (Session Initiation Protocol): SIP: Session Initiation Protocol defined in IETF RFC 3261 is an application-layer control (signaling) protocol for creating, modifying, and terminating sessions with one or more participants. These sessions include Internet telephone calls, multimedia distribution, and multimedia conferences. SIP invitations used to create sessions carry session descriptions that allow participants to agree on a set of compatible media types. SIP makes use of elements called proxy servers to help route requests to the user's current location, authenticate and authorize users for services, implement provider call-routing policies, and provide features to users. SIP also provides a registration function that allows users to upload their current locations for use by proxy servers. SIP runs on top of several different transport protocols. See the IETF SIP Working Group Charter for a list of relevant RFCs and Internet Drafts.
SIP Media Services: IETF RFC 4240 defines Basic Network Media Services with SIP, aka SIP with VoiceXML 2.0 (Netann). "In SIP-based networks, there is a need to provide basic network media services. Such services include network announcements, user interaction, and conferencing services. These services are basic building blocks, from which one can construct interesting applications. In order to have interoperability between servers offering these building blocks (also known as Media Servers) and application developers, one needs to be able to locate and invoke such services in a well defined manner. This document describes a mechanism for providing an interoperable interface between Application Servers, which provide application services to SIP-based networks, and Media Servers, which provide the basic media processing building blocks."
SISR (Semantic Interpretation for Speech Recognition): Semantic Interpretation for Speech Recognition (SISR) Version 1.0 was published as a W3C Recommendation on 05-April-2007. An SISR Version 1.0 Implementation Report is available.
The Recommendation defines the syntax and the semantics of Semantic Interpretation Tags for use with the Speech Recognition Grammar Specification (SRGS). It is possible that Semantic Interpretation Tags as defined here can be used also with the Stochastic Language Models (N-Gram) Specification, but the current specification does not specifically address such use and does not guarantee that the Semantic Interpretation Tags as defined here are meeting the needs of such use... The basic principles for the Semantic Interpretation mechanism defined in this specification are the following: (1) semantic information is represented as values associated with non-terminals; (2) statements in Semantic Interpretation Tags are either valid ECMAScript code (Compact Profile) or string literals; (3) expression evaluation order is connected to the grammar rule definitions and the sequence of words in the recognized utterance.
Grammar Processors, and in particular speech recognizers, use a grammar that defines the words and sequences of words to define the input language that they can accept. The major task of a grammar processor consists of finding the sequence of words described by the grammar that (best) matches a given utterance, or to report that no such sequence exists.
In an application, knowing the sequence of words that were uttered is sometimes interesting but often not the most practical way of handling the information that is present in the user utterance. What is needed is a computer processable representation of the information, the Semantic Result, more than a natural language transcript. The process of producing a Semantic Result representing the meaning of a natural language utterance is called Semantic Interpretation (SI).
The Semantic Interpretation process described in this specification uses Semantic Interpretation Tags (SI Tags) to provide a means to attach instructions for the computation of such semantic results to a speech recognition grammar. When used with a VoiceXML 2.0 Processor, it is expected that a Semantic Interpretation Grammar Processor will convert the result generated by an SRGS speech grammar processor into an ECMAScript object that can then be processed as specified in section 3.1.6 Mapping Semantic Interpretation Results to VoiceXML Forms in VoiceXML 2.0. The W3C Multimodal Interaction Activity is defining an XML data format (EMMA: Extensible MultiModal Annotation Markup Language) for containing and annotating the information in user utterances. It is expected that the EMMA language will be able to integrate results generated by Semantic Interpretation for Speech Recognition...
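The three principles above can be seen in a small SRGS grammar carrying SI Tags. This is a hypothetical drink-order grammar (rule and property names are invented); each `<tag>` holds ECMAScript that builds up the Semantic Result as the matched words are processed:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical grammar: "small coffee" yields {size: "S", drink: "coffee"} -->
<grammar xmlns="http://www.w3.org/2001/06/grammar" xml:lang="en-US"
         version="1.0" root="order" tag-format="semantics/1.0">
  <rule id="order">
    <!-- rules.size / rules.drink hold the results of the referenced rules -->
    <ruleref uri="#size"/> <tag>out.size = rules.size;</tag>
    <ruleref uri="#drink"/> <tag>out.drink = rules.drink;</tag>
  </rule>
  <rule id="size">
    <one-of>
      <item>small <tag>out = "S";</tag></item>
      <item>large <tag>out = "L";</tag></item>
    </one-of>
  </rule>
  <rule id="drink">
    <one-of>
      <item>coffee <tag>out = "coffee";</tag></item>
      <item>tea <tag>out = "tea";</tag></item>
    </one-of>
  </rule>
</grammar>
```

Semantic values attach to non-terminals (`out` in each rule), the tag contents are ECMAScript Compact Profile, and evaluation order follows the rule structure and word sequence — exactly the three principles the specification states.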
SIVR (Speaker Recognition Format for Raw Data Interchange): The Speaker Recognition Format for Raw Data Interchange (SIVR) specification was produced by members of the VoiceXML Forum Speaker Biometrics Committee, chartered to develop requirements for speaker biometrics capabilities in VoiceXML systems. The Committee collaborates with relevant organizations including W3C, IETF, ANSI, and ISO to identify use cases for voice-only and multimodal applications. The members review existing platform-specific implementations of speech biometrics extensions to VoiceXML, compare existing implementations with requirements defined by the Committee, and are working to develop a standard transaction format based on CBEFF. The group is also chartered to develop best practices for user interface design, application architecture, and the use of speech biometrics in the context of VoiceXML.
Draft Version 1.3 (May 1, 2007) has been submitted to ISO. The VoiceXML Forum Editor is Judith Markowitz, and the ANSI/INCITS/M1 Associate Editor is Guy Cardwell. See also the glossary, requirements, and SIV Introduction and Best Practices documents.
"The Data Exchange File Format for Speaker Identification and Verification document presents the rationale for developing a data-exchange file format (DEFF) for speaker verification and identification (SIV) and the first version of a DEFF that complies with ANSI and ISO's Common Biometric Exchange Formats Framework (CBEFF) standard. The DEFF is designed to be used for transmitting SIV data and information about those data. It can be used as the communication component of a VoiceXML SIV transaction (primarily an authentication). It can be stored and transmitted after the authentication is completed when, for example, a company changes vendors, does an audit, or even does tuning on the application.
The standard specifies a concept and data format for representation of the human voice at the raw-data level, with optional inclusion of non-standardized extended data. It does not address handling of data that has been processed to the feature or model/template levels. As such, it can be incorporated into the formats that will be developed for those levels within ISO SC37. The data format is generic in that it may be applied to and used in a wide range of application areas where automated and human-to-human SIV is performed. No application-specific requirements, equipment, or features are addressed in this standard. Through its XML orientation, this standard does, however, reflect recognition of the overwhelming dominance of the VoiceXML standard in speech processing and associated XML-based standards. The schema and other code examples use a standard form of XML rather than VoiceXML because VoiceXML has not yet completed development of its speaker recognition..."
SRGS (Speech Recognition Grammar Specification): The Speech Recognition Grammar Specification Version 1.0 was released as a W3C Recommendation on 16-March-2004, and was produced by the W3C Voice Browser Working Group. It defines a syntax for representing grammars for use in speech recognition so that developers can specify the words and patterns of words to be listened for by a speech recognizer. The syntax of the grammar format is presented in two forms, an Augmented BNF Form and an XML Form. Both the ABNF Form and XML Form have the expressive power of a Context-Free Grammar (CFG). A grammar processor that does not support recursive grammars has the expressive power of a Finite State Machine (FSM) or regular expression language. The specification makes the two representations mappable to allow automatic transformations between the two forms. This W3C standard known as the Speech Recognition Grammar Specification is modelled on the JSpeech Grammar Format specification (JSGF), which is owned by Sun Microsystems, Inc., California, U.S.A.
The primary use of a speech recognizer grammar is to permit a speech application to indicate to a recognizer what it should listen for, specifically: Words that may be spoken; Patterns in which those words may occur; Spoken language of each word. Speech recognizers may also support the Stochastic Language Models (N-Gram) Specification. Both specifications define ways to set up a speech recognizer to detect spoken input but define the word and patterns of words by different and complementary means. Some recognizers permit cross-references between grammars in the two formats. The rule reference element of this specification describes how to reference an N-gram document.
A grammar processor is any entity that accepts as input grammars as described in the specification. A user agent is a grammar processor that accepts user input and matches that input against a grammar to produce a recognition result that represents the detected input. A speech recognizer is a user agent which accepts as input (a) a grammar or multiple grammars [which] inform the recognizer of the words and patterns of words to listen for; (b) an audio stream that may contain speech content that matches the grammar(s). Outputs of the speech recognizer include (a) descriptions of results that indicate details about the speech content detected by the speech recognizer, where the format and details of the content of the result are outside the scope of this specification; for informative purposes, most practical recognizers will include at least a transcription of any detected words; (b) error and other performance information may be provided to the host environment... A speech recognizer is capable of matching audio input against a grammar to produce a raw text transcription (also known as literal text) of the detected input. A recognizer may be capable of, but is not required to, perform subsequent processing of the raw text to produce a semantic interpretation of the input..."
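The mapping between the two forms can be illustrated with a trivial yes/no grammar. Both renderings below are minimal sketches, not taken from the specification's own examples; the ABNF Form appears as a comment above the equivalent XML Form:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- The equivalent ABNF Form of this grammar would be:
       #ABNF 1.0;
       language en-US;
       root $yesno;
       public $yesno = yes | no;
-->
<grammar xmlns="http://www.w3.org/2001/06/grammar" xml:lang="en-US"
         version="1.0" root="yesno">
  <rule id="yesno" scope="public">
    <one-of>
      <item>yes</item>
      <item>no</item>
    </one-of>
  </rule>
</grammar>
```

Each ABNF construct maps one-to-one onto an XML element (`$yesno` to `<rule id="yesno">`, the `|` alternation to `<one-of>`/`<item>`), which is what makes automatic transformation between the two forms possible.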
SSML (Speech Synthesis Markup Language): The Speech Synthesis Markup Language (SSML) Version 1.1 specification was released as a W3C Working Draft on 10-January-2007. Speech Synthesis Markup Language (SSML) Version 1.0 was published as a W3C Recommendation on 7-September-2004. An accompanying SSML 1.0 Implementation Report is available online.
"The Speech Synthesis Markup Language Specification is one of the standards produced by the W3C Voice Browser Working Group, seeking to develop standards to enable access to the Web using spoken interaction. It is designed to provide a rich, XML-based markup language for assisting the generation of synthetic speech in Web and other applications. The essential role of the markup language is to provide authors of synthesizable content a standard way to control aspects of speech such as pronunciation, volume, pitch, rate, etc. across different synthesis-capable platforms.
SSML is part of a larger set of markup specifications for voice browsers developed through the open processes of the W3C. A related initiative to establish a standard system for marking up text input is SABLE, which tried to integrate many different XML-based markups for speech synthesis into a new one. The activity carried out in SABLE was also used as the main starting point for defining the Speech Synthesis Markup Requirements for Voice Markup Languages. Since then, SABLE itself has not undergone any further development.
The intended use of SSML is to improve the quality of synthesized content. Different markup elements impact different stages of the synthesis process. The markup may be produced either automatically, for instance via XSLT or CSS3 from an XHTML document, or by human authoring. Markup may be present within a complete SSML document or as part of a fragment embedded in another language, although no interactions with other languages are specified as part of SSML itself. Most of the markup included in SSML is suitable for use by the majority of content developers. However, some advanced features like phoneme and prosody (e.g., for speech contour design) may require specialized knowledge..."
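A brief SSML 1.0 sketch showing several of those author-controlled aspects in one document (the text, audio URL, and attribute values are invented for illustration):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <p>
    <!-- say-as controls how the token is rendered, here as a time -->
    <s>Your flight departs at <say-as interpret-as="time">14:30</say-as>.</s>
    <!-- prosody adjusts rate and pitch over its contents -->
    <s><prosody rate="slow" pitch="+5%">Please arrive early.</prosody></s>
  </p>
  <!-- a different voice, then recorded audio with fallback text -->
  <voice gender="female">Thank you for calling.</voice>
  <audio src="http://example.com/chime.wav">chime</audio>
</speak>
```

Each element affects a different stage of the synthesis pipeline: `say-as` shapes text normalization, `prosody` and `voice` shape the acoustic rendering, and `audio` bypasses synthesis entirely with fallback text for platforms that cannot fetch the recording.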
VoiceXML (Voice Extensible Markup Language): VoiceXML is one of the key work items under development by the W3C Voice Browser Working Group. The VBWG is producing a suite of specifications known as the W3C Speech Interface Framework. It covers voice dialogs, speech synthesis, speech recognition, telephony call control for voice browsers and other requirements for interactive voice response applications, including use by people with hearing or speaking impairments.
VoiceXML, referenced below, "is designed for creating audio dialogs that feature synthesized speech, digitized audio, recognition of spoken and DTMF key input, recording of spoken input, telephony, and mixed initiative conversations. Its major goal is to bring the advantages of Web-based development and content delivery to interactive voice response applications..."
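Those ingredients — synthesized prompts, grammar-driven recognition, and Web-style submission — come together in even the smallest VoiceXML document. The following is a hypothetical sketch (the grammar file and submit URL are invented):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="pizza">
    <field name="topping">
      <!-- Synthesized prompt played to the caller -->
      <prompt>What topping would you like?</prompt>
      <!-- SRGS grammar constrains what the recognizer listens for -->
      <grammar src="toppings.grxml" type="application/srgs+xml"/>
      <filled>
        <prompt>You chose <value expr="topping"/>.</prompt>
        <!-- Web-style round trip: post the result back to the server -->
        <submit next="http://example.com/order" namelist="topping"/>
      </filled>
      <noinput>Sorry, I didn't hear you.</noinput>
    </field>
  </form>
</vxml>
```

The document is fetched over HTTP like an HTML page; the browser runs the dialog, and `<submit>` returns the recognized value to the application server, which responds with the next VoiceXML page.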
xHMI (Extensible Human-Machine Interface): "xHMI is an open, XML-based dialog configuration language for advanced speech applications. By defining a common approach to configuring dialog components, the xHMI specification promotes compatibility between the new generation of advanced speech applications, tools, application frameworks and components, and makes it easier to create and deliver advanced dialog-driven applications capable of more complex and natural interactions with callers... xHMI addresses the need for a common approach to control the behavior of dialog nodes in a speech application. xHMI reflects the standard Model View Controller 2 (MVC 2) design pattern used in modern Web applications. MVC 2 separates functionality related to how application data are manipulated from how they are displayed and stored. An xHMI-based dialog framework therefore makes it easier to develop and maintain speech applications using accepted standard practices... Nuance has created the xHMI specification in partnership with a number of leading speech technology providers who are driving this initiative, including: Apptera, Audium, Convergys Corporation, Dimension Data, Edify, Envox, Fluency, Gold Systems, Holly Australia, iTa, LogicaCMG, Interactive Northwest, Primas, Softel, Tuvox, Unisys, Unveil Technologies Inc., Vicorp, Viecore, VoiceGenie, Voxify and West..."
XHTML+Voice (X+V) Markup Language: An XHTML+Voice Profile 1.0 was presented to W3C in a November 2001 submission request. XHTML+Voice Profile 1.2 (16-March-2004) was produced within the VoiceXML Forum: "The XHTML+Voice profile brings spoken interaction to standard web content by integrating the mature XHTML and XML-Events technologies with XML vocabularies developed as part of the W3C Speech Interface Framework. The profile includes voice modules that support speech synthesis, speech dialogs, command and control, and speech grammars. Voice handlers can be attached to XHTML elements and respond to specific DOM events, thereby reusing the event model familiar to web developers. Voice interaction features are integrated with XHTML and CSS and can consequently be used directly within XHTML content..."
Mobile X+V 1.2 (02-September-2005) "brings spoken interaction to standard web content by integrating W3C standards for the visual and voice Web. XHTML is for rendering visual content, VoiceXML for spoken interaction, and the different modalities are integrated using XML Events to author DOM2 Event bindings. Using this integration framework, voice handlers can be attached to XHTML elements and respond to specific DOM events, thereby reusing the event model familiar to web developers. Voice interaction features are integrated with XHTML and CSS and can consequently be used directly within XHTML content. This specification builds on existing mobile profiles of W3C technologies such as XHTML Basic and CSS Mobile to create a multimodal specification suitable for use on mobile devices..."
XHTML+Voice has several implementations, including the Opera Multimodal Desktop Browser. The VoiceXML Forum reference page offers additional historical summary, supplemented by the W3C Multimodal Application Developer Feedback document of 14-April-2006.
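The "voice handlers attached to XHTML elements via DOM events" pattern can be sketched as follows. This is a hypothetical X+V fragment (element content and ids are invented): a VoiceXML form lives in the XHTML head, and XML Events attributes bind it to a visual input field.

```xml
<!-- Hypothetical X+V page: focusing the text field starts the voice dialog -->
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:ev="http://www.w3.org/2001/xml-events"
      xmlns:vxml="http://www.w3.org/2001/vxml">
  <head>
    <title>City lookup</title>
    <!-- Voice handler: a VoiceXML form embedded in the XHTML head -->
    <vxml:form id="sayCity">
      <vxml:field name="city">
        <vxml:prompt>Which city?</vxml:prompt>
        <vxml:grammar src="cities.grxml"/>
      </vxml:field>
    </vxml:form>
  </head>
  <body>
    <!-- XML Events binding: the DOM focus event triggers the handler -->
    <input type="text" name="city" ev:event="focus" ev:handler="#sayCity"/>
  </body>
</html>
```

The `ev:event`/`ev:handler` attributes are the XML Events glue the profile describes: the familiar DOM event model dispatches `focus` to the `#sayCity` voice handler, so the same field can be filled by typing or by speaking.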
VoiceXML 3.0 VoiceXML 3.0, with a First Public Working Draft expected late 2007, "is the next major release of VoiceXML. Its purpose is to provide powerful dialog capabilities that can be used to build advanced speech applications, and to provide these capabilities in a form that can be easily and cleanly integrated with other W3C languages. It will provide enhancements to existing dialog and media control, as well as major new features (e.g. modularization, a cleaner separation between data/flow/dialog, and asynchronous external eventing) to facilitate interoperability with external applications and media components."
VoiceXML 3.0 may be enhanced with capabilities such as: (1) VoiceXML dialogs that are cancellable; (2) VoiceXML dialogs that can receive events from the flow layer during execution; these events are exposed in the presentation markup; (3) VoiceXML dialogs that can send events to the flow layer during execution; these events are specified in the presentation markup...
From "Sneak Preview: VoiceXML 3.0," slides on the DFP framework presented at SpeechTek (August 2006), 98 pages, by Jim Barnett, Emily Candell, Jerry Carter, Rafah Hosn, and Scott McGlashan: "VoiceXML 3.0 Process: VoiceXML 3.0 is being developed by the W3C Voice Browser Working Group (VBWG). The WG has 3-4 F2F meetings per year, and weekly teleconferences. VBWG is one of the largest W3C groups, with some 95 group participants, 38 organizations... VoiceXML 3.0 is being developed under VBWG charter and the W3C Patent Policy. Public information and comments on VBWG activities, including VoiceXML 3.0, are available through the mail list archives... VoiceXML 3.0 design is being driven by external and internal requirements: addressing unresolved issues from VoiceXML 2.0/2.1, and accounting for the fact that VoiceXML is being used in new ways (e.g., multimodal) and in other languages...
See also: (1) The Voice Browser DFP Framework; (2) V3 (VoiceXML V3) White Paper from the VoiceXML Forum Technical Council; (3) Mike Bodell Slide #9 "VXML 3.0 Architecture" involves a three-tiered 'DFP architecture' where 'D' is for data (W3C DOM, XPath, XForms), 'F' is for flow (SCXML), and 'P' is for presentation (VXML 3.0 and 2.0 Dialog Widgets). It is similar to a traditional Model-View-Controller architecture. One can pick and choose different languages to solve each tier, but VBWG has some languages in mind... One of the main goals of VoiceXML 3.0 is to partition the language into independent modules which can be combined into various language profiles. VoiceXML 3.0 itself defines two profiles and specifies how other profiles can be created...
VoiceXML 2.1. Voice Extensible Markup Language (VoiceXML) 2.1. W3C Recommendation. 19-June-2007. Edited by Matt Oshry (Tellme Networks, Editor-in-Chief), RJ Auburn (Voxeo Corporation), Paolo Baggia (Loquendo), Michael Bodell (Tellme Networks), David Burke (Voxpilot Ltd), Daniel C. Burnett (Nuance Communications), Emily Candell (Comverse), Jerry Carter (Nuance Communications), Scott McGlashan (Hewlett-Packard), Alex Lee (Genesys), Brad Porter (Tellme Networks), and Ken Rehor (Invited Expert). See the announcement and testimonials.
"VoiceXML 2.1 defines a small set of widely implemented additional features. These include using computed expressions to reference grammars and scripts, the ability to detect where barge-in occurred within a prompt, greater convenience in prompting for dynamic lists of values, the ability to download data without moving to the next page, the ability to record the user's speech during recognition for later analysis, the ability to pass data with a disconnect, and enhanced control over transfer..."
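Two of those additions, computed grammar references and in-place data fetching, can be sketched in a short fragment. The file names and URL below are hypothetical:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <var name="lang" expr="'en'"/>
  <form>
    <!-- 2.1 <data>: fetch an XML document into a variable without
         transitioning to a new VoiceXML page -->
    <data name="rates" src="http://example.com/rates.xml"/>
    <field name="city">
      <prompt>Which city?</prompt>
      <!-- 2.1 srcexpr: the grammar URI is computed at runtime -->
      <grammar srcexpr="'cities-' + lang + '.grxml'"/>
    </field>
  </form>
</vxml>
```

In VoiceXML 2.0 the `src` attribute had to be a literal URI and any server data required a full page transition; `srcexpr` and `<data>` remove both restrictions while keeping the rest of the document model unchanged.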
VoiceXML 2.0 Voice Extensible Markup Language (VoiceXML) Version 2.0. W3C Recommendation 16-March-2004. Edited by Scott McGlashan (Hewlett-Packard, Editor-in-Chief), Daniel C. Burnett (Nuance Communications), Jerry Carter (Invited Expert), Peter Danielsen (Lucent), Jim Ferrans (Motorola), Andrew Hunt (ScanSoft), Bruce Lucas (IBM), Brad Porter (Tellme Networks), Ken Rehor (Vocalocity), Steph Tryphonas (Tellme Networks).
"VoiceXML 2.0 brings the advantages of Web-based development and content delivery to interactive voice response applications. VoiceXML controls the dialog between the application and the user. It is downloaded from HTTP servers in the same way as HTML. This means that application developers can take full advantage of widely deployed and industry proven Web technologies... VoiceXML 2.0 is being applied to a wide range of applications, for instance, call centers, government offices and agencies, banks and financial services, utilities, healthcare, retail sales, travel and transportation, and many more. VoiceXML has tremendous potential for improved accessibility for a wide range of services for people with visual impairments, and via text phones, for people with speaking and/or hearing impairments...
VoiceXML 2.0 and Speech Recognition Grammar Specification (SRGS) 1.0 are the first two components in the W3C Speech Interface Framework to reach W3C Recommendation status. SRGS allows applications to specify the words and phrases that users are prompted to speak, which enables robust speaker-independent recognition. Next in line is SSML, the speech synthesis markup language that is used with VoiceXML to prompt users and to provide answers to questions. After that come Semantic Interpretation for Speech Recognition, which provides the means for developers to extract application data from the results of speech recognition, and CCXML, an XML telephony call control language for VoiceXML. VoiceXML offers a demonstrable improvement in the user experience when people call up companies on the phone, compared with waiting for ages for a human customer service representative to become free, or having to put up with the "press one for this, press two for that" style of interaction with the previous generation of interactive voice response services..."
VoiceXML 1.0. Voice Extensible Markup Language (VoiceXML) Version 1.0 was released as a W3C Note on 05-May-2000. It was a submission to the World Wide Web Consortium from the VoiceXML Forum, specifically, from AT&T, IBM, Lucent Technologies, and Motorola. The document defined "a new computer language designed to make Internet content and information accessible via voice and phone." See the PDF of 07-March-2000 from the VoiceXML Forum.
"Founded in 1999, the VoiceXML Forum is an industry organization whose mission is to promote and to accelerate the worldwide adoption of VoiceXML-based applications. To this end, the Forum serves as an educational and technical resource, a certification authority, and a contributor and liaison to international standards bodies, such as the World Wide Web Consortium (W3C), IETF, ANSI, and ISO. The VoiceXML Forum is organized as a program of the IEEE Industry Standards and Technology Organization (IEEE-ISTO). Membership in the Forum is open to any interested company...
Tens of thousands of commercial VoiceXML-based speech applications have been deployed across a diverse set of industries, including financial services, government, insurance, retail, telecommunications, transportation, travel and hospitality. Millions of calls are answered by VoiceXML applications every day.
The Forum's primary focus areas include:
- Promoting the adoption of VoiceXML-based technologies
- Cultivating a global VoiceXML ecosystem
- Actively supporting standards bodies and industry consortia, such as the W3C and IETF, as they work on VoiceXML and related standards, such as CCXML, X+V, MRCP, and speech biometrics
VoiceXML is the only XML-based speech language specification to receive the W3C's Final Recommendation. VoiceXML simplifies speech application development by permitting developers to use familiar Web infrastructure, tools and techniques. VoiceXML also enables distributed application design by separating each application's user interaction layer from its service logic. Most developers find VoiceXML application development at least three times faster than development in traditional interactive voice response (IVR) environments. For these reasons, VoiceXML has been widely adopted within the speech industry...
The VoiceXML Forum also supports certification activities, including (1) Platform testing and certification to support application portability and vendor interoperability; (2) Developer certification program to foster growth and maturity of the VoiceXML developer community and to inspire new developers to sharpen their VoiceXML skills...
The VoiceXML Solutions Directory is a comprehensive directory of VoiceXML products and services; it helps companies interested in deploying VoiceXML-based business solutions find what they need quickly and efficiently. As of February 2007, the Directory included more than 160 products and services offered by the VoiceXML Forum's member companies. Product categories include VoiceXML platforms, application development tools and pre-built dialog modules, and applications. Service categories include VUI design and development, systems integration and solution deployment, and hosted services and solutions...
The VoiceXML Forum Technical Council is at the core of the Forum's technical activities and planning. Composed of representatives from the Forum's Sponsor Member companies, the Council provides technical guidance to the Forum with regard to VoiceXML, X+V, and other related languages, and acts as liaison to the W3C, IETF, and other standards bodies with regard to these languages and the Forum in general. The Council also develops new language initiatives where appropriate and provides authoritative technical information about VoiceXML as needed to answer questions that may arise in the general and technical press..."