The SALT Forum has published a draft specification for its "royalty-free, platform-independent standard that will make possible multimodal and telephony-enabled access to information, applications, and Web services from PCs, telephones, tablet PCs, and wireless personal digital assistants (PDAs)." SALT (Speech Application Language Tags) is defined by "a small set of XML elements, with associated attributes and DOM object properties, events and methods, which may be used in conjunction with a source markup document to apply a speech interface to the source page. The SALT formalism and semantics are independent of the nature of the source document, so SALT can be used equally effectively within HTML and all its flavours, or with WML, or with any other SGML-derived markup. SALT is an extension of HTML and other markup languages (cHTML, XHTML, WML) which adds a spoken dialog interface to web applications, for both voice only browsers (e.g., over the telephone) and multimodal browsers. For multimodal applications, SALT can be added to a visual page to support speech input and/or output. This is a way to speech-enable individual HTML controls for 'push-to-talk' form-filling scenarios, or to add more complex mixed initiative capabilities if necessary. For applications without a visual display, SALT manages the interactional flow of the dialog and the extent of user initiative by using the HTML eventing and scripting model. In this way, the full programmatic control of client-side (or server-side) code is available to application authors for the management of prompt playing and grammar activation." Appendix A of the version 0.9 draft specification supplies the SALT XML DTD.
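The "push-to-talk" form-filling scenario described above can be sketched in markup roughly as follows. This is an illustrative fragment, not taken from the specification: the namespace URI reflects the SALT Forum's published namespace, while the grammar file name (city.grxml), the XPath value, and the element ids are assumptions for illustration.

```html
<html xmlns:salt="http://www.saltforum.org/2002/SALT">
<body>
  <!-- visual control to be speech-enabled -->
  <input name="txtBoxCity" type="text" />
  <!-- push-to-talk: clicking the button starts recognition -->
  <input type="button" value="Speak" onclick="listenCity.Start()" />

  <salt:listen id="listenCity">
    <!-- grammar resource (file name illustrative) -->
    <salt:grammar src="./city.grxml" />
    <!-- copy the recognized value into the visual control -->
    <salt:bind targetelement="txtBoxCity" value="//city" />
  </salt:listen>
</body>
</html>
```

The visual page stays an ordinary HTML form; the speech interface is layered on by the SALT elements and driven through the browser's existing scripting model.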
Bibliographic information: SALT: Speech Application Language Tags 0.9 Specification. Draft "work in progress." February 19, 2002. By Cisco Systems Inc., Comverse Inc., Intel Corporation, Microsoft Corporation, Philips Electronics N.V., and SpeechWorks International Inc. 90 pages.
From the Overview: "The main top level elements of SALT are: (1) <prompt> for speech synthesis configuration and prompt playing (2) <listen> for speech recognizer configuration, recognition execution and post-processing, and recording (3) <dtmf> for configuration and control of DTMF collection (4) <smex> for general purpose communication with platform components. The input elements <listen> and <dtmf> also contain grammars and binding controls: [1] <grammar> for specifying input grammar resources, [2] <bind> for processing of recognition results; <listen> also contains <record> for recording audio input. A call control object is also provided for control of telephony functionality. There are several advantages to using SALT with a mature display language such as HTML. Most notably (i) the event and scripting models supported by visual browsers can be used by SALT applications to implement dialog flow and other forms of interaction processing without the need for extra markup, and (ii) the addition of speech capabilities to the visual page provides a simple and intuitive means of creating multimodal applications. In this way, SALT is a lightweight specification which adds a powerful speech interface to web pages, while maintaining and leveraging all the advantages of the web application model..."
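The use of the browser's event model to drive dialog flow might look like the following sketch, which combines <prompt> and <listen>. The event names (oncomplete, onreco, onnoreco) are those defined for these elements; the grammar file, ids, and the text property access are illustrative assumptions.

```html
<!-- when the prompt finishes, activate recognition -->
<salt:prompt id="askCity" oncomplete="recoCity.Start()">
  Which city would you like to fly to?
</salt:prompt>

<salt:listen id="recoCity" onreco="procCity()" onnoreco="askCity.Start()">
  <!-- grammar file name illustrative -->
  <salt:grammar src="./city.grxml" />
</salt:listen>

<script type="text/javascript">
  function procCity() {
    // the listen object exposes the recognized text to script
    var city = recoCity.text;
    // ... continue the dialog with further prompts and listens ...
  }
</script>
```

No dialog-specific markup is needed: retrying on a failed recognition is simply an event handler that restarts the prompt, which is the point made in (i) above.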
Excerpts from the v0.9 spec:
- DTMF input: <dtmf> [Dual Tone Multi Frequency]. The <dtmf> element is used in telephony applications to specify possible DTMF inputs and a means of dealing with the collected results and other DTMF events. Like <listen>, its main elements are <grammar> and <bind>, and it holds resources for configuring the DTMF collection process and handling DTMF events. The <dtmf> element is designed so that type-ahead scenarios are enabled by default. That is, for applications to ignore input entered ahead of time, flushing of the DTMF buffer has to be explicitly authored.
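A hedged sketch of the explicit-flush authoring described above, assuming a flush attribute on <dtmf> as a Boolean that discards digits already in the buffer on activation; the grammar file, ids, and bind target are illustrative:

```html
<!-- flush="true" explicitly discards digits typed ahead;
     omitting it keeps the buffer, enabling type-ahead by default -->
<salt:dtmf id="dtmfPin" flush="true" onreco="pinDone()">
  <!-- grammar constraining acceptable key sequences (file illustrative) -->
  <salt:grammar src="./pin-digits.grxml" />
  <!-- copy the collected digits into a visual control -->
  <salt:bind targetelement="txtPin" value="//pin" />
</salt:dtmf>
<script type="text/javascript">
  function pinDone() { /* proceed once the DTMF input matches the grammar */ }
</script>
```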
- Platform messaging: <smex>. smex, short for Simple Messaging EXtension, is a SALT element that communicates with the external component of the SALT platform. It can be used to implement any application control of platform functionality such as logging and telephony control. As such, smex represents a useful mechanism for extensibility in SALT, since it allows any new functionality to be added through this messaging layer. On its instantiation, the object is directed to establish an asynchronous message exchange channel with a platform component through its configuration parameters (specified in <param> elements) or attributes. The smex object can send or receive messages through this channel. The content of a message to be sent is defined in the sent property. Whenever the value of this property is updated (either on page load or by dynamic assignment through script or binding), the message is sent to the platform. The smex element can also receive XML messages from the platform component in its received property. The onreceive event is fired whenever a platform message is received. Since the smex object's basic operations are asynchronous, it also maintains a built-in timer for the manipulation of timeout settings. ontimeout and onerror events may also be thrown.
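The sent/received mechanism described above might be used for logging as in this sketch. The param name and value, the message payload, and the ids are all illustrative assumptions; only the sent and received properties and the onreceive event come from the excerpt.

```html
<!-- channel configuration via <param>; name and value illustrative -->
<salt:smex id="logger" onreceive="gotReply()">
  <salt:param name="server">logging-host</salt:param>
</salt:smex>
<script type="text/javascript">
  // assigning to the sent property transmits the message to the platform
  logger.sent = "<logRequest>page loaded</logRequest>";
  function gotReply() {
    // a platform message has arrived in the received property
    var replyXml = logger.received;
  }
</script>
```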
- Telephony Call Control: The SALT solution assumes that an HTML document containing SALT markup must "have the ability to provide access to telephony call control related functions, such as answering a call, transferring a call (bridged or unbridged), managing a call or disconnecting a call; the specification must define a means to associate a telephony media stream with SALT media tags, such as tags for Speech Recognition, Recording, Speech Synthesis, and Audio Playback... The call control object will be specified as an intrinsic entity of the SALT-enhanced browser. Various call control interface implementations conformant with this specification may be 'plugged in' by browser-specific configuration procedures and in that way be made accessible to SALT documents. SALT documents can query for the presence of these 'plug-ins' at runtime. The object shall appear in the DOM of the HTML document. The object will have various properties and methods that can be manipulated via ECMAScript code. Some of these methods will create derivative 'child' objects, thereby instantiating an entire call control object hierarchy. The objects should also generate events. Event handlers may be written as ECMAScript functions..."
Note also the W3C Multimodal Interaction Activity. "The W3C Multimodal Interaction Activity is extending the Web user interface to allow multiple modes of interaction, offering users the choice of using their voice, or the use of a key pad, keyboard, mouse, stylus or other input device. For output, users will be able to listen to spoken prompts and audio, and to view information on graphical displays. The Working Group is developing markup specifications for synchronization across multiple modalities and devices. The specifications should be implementable on a royalty-free basis... The Multimodal Interaction Activity was formed in February 2002."
Principal references:
- SALT: Speech Application Language Tags 0.9 Specification [cache]
- SALT v0.9 XML DTD
- "SALT Forum Founded for the Development of Embedded Speech Application Language Tags." News item of 2001-10-24.
- SALT Reference Architecture
- SALT FAQ document
- SALT Forum members
- SALT Forum web site
- See also the VoiceXML Forum ('VoiceXML')
- See also the W3C Multimodal Interaction Activity
- "Speech Application Language Tags (SALT)" - Main reference page.