An announcement from Opera Software at the AVIOS SpeechTEK International Exposition and Educational Conference describes the upcoming release of a multimodal desktop browser based on the XHTML+Voice (X+V) specification.
In 2001, IBM, Motorola, and Opera submitted the XHTML+Voice Profile 1.0 to W3C for standards work. The most recent version, XHTML+Voice Profile 1.2, is managed by the VoiceXML Forum, and "brings spoken interaction to standard web content by integrating the mature XHTML and XML-Events technologies with XML vocabularies developed as part of the W3C Speech Interface Framework. The profile includes voice modules that support speech synthesis, speech dialogs, command and control, and speech grammars. Voice handlers can be attached to XHTML elements and respond to specific DOM events, thereby reusing the event model familiar to web developers. Voice interaction features are integrated with XHTML and CSS and can consequently be used directly within XHTML content."
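The voice-handler model described above can be illustrated with a minimal X+V page sketch. The namespaces and general structure follow the published profile, but the specific ids and prompt text here are invented for illustration: a VoiceXML form is declared in the XHTML head and attached to the body's load event via XML Events, so the prompt is spoken when the page is rendered.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:vxml="http://www.w3.org/2001/vxml"
      xmlns:ev="http://www.w3.org/2001/xml-events">
  <head>
    <title>Hello X+V</title>
    <!-- A VoiceXML dialog declared in the head, acting as an event handler -->
    <vxml:form id="sayHello">
      <vxml:block>
        <vxml:prompt>Hello, multimodal world!</vxml:prompt>
      </vxml:block>
    </vxml:form>
  </head>
  <!-- XML Events attributes bind the handler to the DOM "load" event -->
  <body ev:event="load" ev:handler="#sayHello">
    <p>This prompt is spoken when the page loads.</p>
  </body>
</html>
```

The same ev:event/ev:handler pattern can attach dialogs to other DOM events, such as focus on a form field, which is how the profile reuses the event model already familiar to web developers.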
The Opera multimodal browser project builds upon an ongoing relationship between IBM and Opera: the new release incorporates IBM's Embedded ViaVoice speech technology. IBM's ViaVoice speech technology supports a variety of real-time operating systems (RTOS) and microprocessors, powering mobile devices such as smart phones, handheld personal digital assistants (PDAs), and automobile components. "By leveraging IBM's voice libraries in this version of Opera, users can navigate, request information, and even fill in Web forms using speech and other forms of input in the same interaction."
The new platform allows users to "interact with the content on the Web in a more natural way, combining speech with other forms of input and output; developers can also start to build multimodal content using the open standards-based X+V markup language, which unifies the visual and voice Web by using development skills a large population of programmers already have today."
Opera Multimodal Browser Overview
"The multimodal browser being developed by IBM and Opera is based on the XHTML+Voice (X+V) specification. This project builds upon IBM's and Opera's ongoing relationship. In 2001, IBM, Motorola and Opera submitted the multimodal standard X+V to the standards body W3C. This mark-up language leverages existing standards, already familiar to voice and Web developers, so they can use their skills and resources to extend current applications instead of building new ones from the ground up.
Multimodal technology allows the interchangeable use of multiple forms of input and output, such as voice commands, keypads, or stylus — in the same interaction.
As computing moves away from keyboard-reliant PCs into devices such as handheld computers and cellular phones, multimodal technology becomes increasingly important. This technology will allow end users to interact with technology in ways that are most suitable to the situation. The Multimodal Browser allows viewing of and interacting with multimodal applications that have been built using X+V..." [from the product description]
From the Opera Software Announcement 2004-03-23
"Voice is the most natural and effective way we communicate. In the years to come it will greatly facilitate how we interact with technology," says Christen Krogh, VP Engineering, Opera Software ASA. "By making this technology available today for the wider Web audience, the serious work of voice-enabling the Web can commence."
While traditional HTML pages continue to be the foundation of the Web, advances in the function, speed, and size of both computers and mobile devices, along with today's diversity of users, have increased the demand for more flexible user interfaces. By building on this standardized foundation using XHTML+Voice (X+V), developers can add voice input and output to traditional, graphically based Web pages and achieve natural voice functionality. For example, Opera's presentation tool, Opera Show, can empower users to replace Microsoft PowerPoint, creating light-weight, Internet standards-based presentations that also make post-publishing a breeze. By combining Opera Show with voice, users may in the future be able to give presentations and tell Opera via voice commands to turn to the next slide, without approaching the computer and pressing the 'Page Down' key.
"This new offering can allow us to interact with the content on the Web in a more natural way, combining speech with other forms of input and output — first on PCs, and in the near future, devices such as cellphones and PDAs," said Igor Jablokov, Director, Embedded Speech, IBM, and Chairman of the VoiceXML Forum. "Developers can also start to build multimodal content using the open standards-based X+V markup language, which unifies the visual and voice Web by using development skills a large population of programmers already have today."
Opera will make the IBM integrated voice browser available in English for Windows with initial targets being enterprise customers and developers.
Opera Software ASA is an industry leader in the development of Web browser technology, targeting the desktop, smartphone, PDA, iTV and vertical markets. Partners include companies such as IBM, Nokia, Sony, Motorola, Macromedia, Adobe, Symbian, Canal+ Technologies, Sony Ericsson, Kyocera, Sharp, Motorola Metrowerks, MontaVista Software, BenQ, Sendo and AMD. The Opera browser has received international recognition from users, industry experts and media for being faster, smaller and more standards-compliant than other browsers.
Opera's browser technology is cross-platform and modular, and currently available on the following operating systems: Windows, Linux, Mac OS, Symbian OS, QNX, TRON, FreeBSD, Solaris and Mediahighway.
Opera Software ASA is headquartered in Oslo, Norway, with development centers in Linköping and Gothenburg, Sweden, and a sales representative in Austin, TX. The company is listed on the Oslo Børs under the ticker symbol OPERA..."
Bibliographic Information: XHTML+Voice Profile (Versions 1.0, 1.1, 1.2)
XHTML+Voice Profile 1.2. 16-March-2004. Edited by Jonny Axelsson (Opera Software), Chris Cross (IBM), Jim Ferrans (Motorola), Gerald McCobb (IBM), T. V. Raman (IBM), and Les Wilson (IBM). VoiceXML Forum. Version URL: http://www.voicexml.org/specs/multimodal/x+v/12/spec.html. Previous version URL: http://www.ibm.com/software/pervasive/multimodal/x+v/11/spec.htm.
"The XHTML+Voice profile brings spoken interaction to standard web content by integrating the mature XHTML and XML-Events technologies with XML vocabularies developed as part of the W3C Speech Interface Framework. The profile includes voice modules that support speech synthesis, speech dialogs, command and control, and speech grammars. Voice handlers can be attached to XHTML elements and respond to specific DOM events, thereby reusing the event model familiar to web developers. Voice interaction features are integrated with XHTML and CSS and can consequently be used directly within XHTML content.
XHTML+Voice Profile 1.1. 28-January-2003. From the IBM web site. The XHTML+Voice Profile 1.0 specification contributed to W3C in November/December 2001 (see following reference) was submitted on RAND terms, as clarified by the W3C Staff Comment: "The IPR declarations provided with the submission reveal that both IBM and Motorola may own patents or patent applications that apply to the XHTML+Voice submission. Both companies state that they are prepared to offer a non-exclusive license under such patents on reasonable and non-discriminatory (RAND) terms." The Staff Comment includes an update as follows: "An updated version of XHTML+Voice (v 1.1) was contributed to the Voice Browser and Multimodal Interaction Working Groups on 11th March 2003. Both Motorola and IBM have revised their IPR disclosures, agreeing to provide a nonexclusive royalty-free licence for any related patent claims they may have." [cache]
XHTML+Voice Profile 1.0. W3C Note. 21-December-2001. Edited by Jonny Axelsson (Opera Software), Chris Cross (IBM), Håkon W. Lie (Opera Software), Gerald McCobb (IBM), T. V. Raman (IBM), and Les Wilson (IBM). Version URL: http://www.w3.org/TR/2001/NOTE-xhtml+voice-20011221. Latest Version URL: http://www.w3.org/TR/xhtml+voice.
The XHTML+Voice profile brings spoken interaction to standard WWW content by integrating a set of mature WWW technologies such as XHTML and XML Events with XML vocabularies developed as part of the W3C Speech Interface Framework. The profile includes voice modules that support speech synthesis, speech dialogs, command and control, speech grammars, and the ability to attach voice handlers for responding to specific DOM events, thereby re-using the event model familiar to web developers. Voice interaction features are integrated directly with XHTML and CSS, and can consequently be used directly within XHTML content.
The XHTML+Voice profile is designed for Web clients that support visual and spoken interaction. To this end, this document first re-formulates VoiceXML 2.0 as a collection of modules. These modules, along with Speech Synthesis Markup Language and Speech Recognition Grammar Format, are then integrated with XHTML using XHTML modularization to create the XHTML+Voice profile. Finally, we integrate the result with the XML-Events module so that voice handlers can be invoked through a standard DOM2 EventListener interface.
About the XHTML+Voice Profile Version 1.2
"The XHTML+Voice Profile 1.2 document defines version 1.2 of the XHTML+Voice profile. XHTML+Voice 1.2 is a member of the XHTML family of document types, as specified by XHTML Modularization. XHTML is extended with a modularized subset of VoiceXML 2.0, the XML Events module, and a module containing a small number of attribute extensions to both XHTML and VoiceXML. The latter module facilitates the sharing of multimodal input data between the VoiceXML dialog and XHTML input and text elements.
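The input-sharing extensions mentioned above can be sketched as follows. The xv:sync element and its namespace reflect our reading of the 1.2 profile and should be checked against the specification; the grammar file, ids, and prompt text are invented for illustration. Here a VoiceXML field and an XHTML text input are kept synchronized, so a spoken city name also fills the visual field:

```xml
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:vxml="http://www.w3.org/2001/vxml"
      xmlns:ev="http://www.w3.org/2001/xml-events"
      xmlns:xv="http://www.voicexml.org/2002/xhtml+voice">
  <head>
    <title>City lookup</title>
    <vxml:form id="voice_city">
      <vxml:field id="cityField" name="city">
        <vxml:prompt>Which city?</vxml:prompt>
        <vxml:grammar src="cities.grxml" type="application/srgs+xml"/>
      </vxml:field>
    </vxml:form>
    <!-- Mirror the recognized value into the visual input field -->
    <xv:sync xv:input="#city_in" xv:field="#cityField"/>
  </head>
  <body>
    <form action="lookup">
      <!-- Focusing the input activates the voice dialog via XML Events -->
      <input type="text" id="city_in" name="city"
             ev:event="focus" ev:handler="#voice_city"/>
    </form>
  </body>
</html>
```

This is the "smooth transition" pattern the profile aims at: the visual form works unchanged in a non-voice browser, while a multimodal browser adds the spoken path.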
The XML Events module provides XML host languages the ability to uniformly integrate event listeners and associated event handlers with Document Object Model (DOM) Level 2 event interfaces. The result is an event syntax for XHTML-based languages that enables an interoperable way of associating behaviors with document-level markup.
VoiceXML has been designed for creating audio dialogs that feature synthesized speech, digitized audio, recognition of spoken and DTMF key input, recording of spoken input, telephony, and mixed-initiative conversations. In this document, VoiceXML 2.0 is modularized to prepare it for integration into the XHTML family of languages using the XHTML modularization framework. The modules that combine to support speech dialogs for updating XHTML forms and form elements are selected to be added to XHTML. The modules are described as well as the integration issues. The modularization of VoiceXML 2.0 also specifies DOM event types specific to voice interaction for use with the XML Events module. Speech dialogs authored in VoiceXML 2.0 can then be treated as event handlers to add voice-interaction specific behaviors to XHTML documents. The language integration supports all of the modules defined in XHTML Modularization, and adds speech interaction functionality to XHTML elements to enable multimodal applications. The document type defined by the XHTML+Voice profile is XHTML Host language document type conformant.
Two mature technologies, XHTML 1.1 and VoiceXML 2.0 are integrated using XHTML Modularization to bring spoken interaction to the visual web. The design leverages open industry APIs like the W3C DOM to create interoperable web content that can be deployed across a variety of end-user devices. Multiple modes of interaction are synchronized and integrated using the DOM 2 Events model and exposed to the content author via XML Events.
Today, web applications are authored in XHTML with user interaction created via XHTML form elements. The W3C is presently working on XForms, the next generation of web forms that bring the power of XML to web application development. The combination of XHTML and Voice described in this document can leverage the semantic richness of web applications created using XForms, while providing a smooth transition for today's developers wishing to deploy multimodal applications by adding spoken interaction to present-day web content. Integrating the work of the W3C voice browser working group into mainstream XHTML content has the advantage of ensuring that future enhancements to the voice browser component such as natural language understanding will be incorporated. Thus, a smooth transition path for web developers wishing to deliver increasingly smart user interaction for their web applications is provided. Building on XHTML Basic and XHTML modularization, content developers will be able to deploy their content to a wide variety of end-user clients ranging from mobile phones and small PDAs to desktop browsers..." [from the version 1.2 specification, 16-March-2004]
Related Background Articles
"X+V is a Markup Language, Not a Roman Math Expression. Introducing XHTML + Voice: IBM's Proposal to the W3C on Developing Multimodal UIs." By Les Wilson (Senior Technical Staff Member, IBM Corp., Pervasive Computing Division). From IBM developerWorks. 19-August-2003. "X+V (XHTML plus Voice) is a Web markup language for developing multimodal applications. Like VoiceXML, X+V meets the increasing user demand for voice-based interaction in small and mobile devices. Unlike VoiceXML, X+V uses both voice and visual elements, bringing a world of new potential to the field of wireless user interface development. In this article, IBM Multimodal Architect Les Wilson provides a complete introduction to X+V, including a conceptual overview of multimodal interface development, an architectural view of the three components that comprise X+V's core functionality, and a code example that demonstrates the utility of this promising new markup language." See the sidebar, SALT versus X+V.
"Versatile Multimodal Solutions. The Anatomy of User Interaction." By T.V. Raman, Gerald McCobb, and Rafah A. Hosn. In XML Journal Volume 4 Issue 4 (April 2003). "This article describes X+V 1.1, an update to X+V that integrates the results of more than two years of experience gained by implementing multimodal solutions using this framework. It summarizes the additions to X+V and illustrates their use in creating multimodal interaction that leverages mixed-initiative VoiceXML dialogs. Formal descriptions of these additions can be found in the X+V 1.1 specification; here we'll focus on motivating these additions and explaining their use."
"IBM Delivers Free Speech Tools. WebSphere Everyplace Uses No SALT to Put Linux Where Your Mouth Is." By Edward J. Correia. In Software Development Times (August 15, 2003). "In the shadow of The SCO Group's copyright infringement allegations, IBM Corp. continues to develop and market solutions for Linux-based systems. IBM has released WebSphere Everyplace Multimodal Environment for Embedix, a version of its Eclipse-based IDE that developers can use to target Sharp's Linux-based Zaurus 5600 handheld computer with applications that can be driven visually, by voice or by a combination of the two. The tools include a code editor for modifying existing Web applications with XHTML and VoiceXML (X+V) tagging protocols, predeveloped X+V sample code and a voice-application simulator based on the Opera 7 voice-enabled browser from Opera Software AS. The environment works with IBM's Multimodal Toolkit for WebSphere Studio, also available now. Both tools are free..."
"Motorola Licenses Opera Browser for Phones. Opera Likely to be Incorporated into Models Based on the Linux and Symbian OSes." By Gillian Law. In InfoWorld (February 11, 2004). "Motorola Inc.'s Personal Communications Sector (PCS) division has signed a licensing agreement with Opera Software ASA to use the Oslo company's browser on its phones. The Opera browser's code is small enough to fit on a mobile phone, [Lars] Boilesen said. 'So you can go to any site on the Internet and browse it. Small-screen rendering technology reformats the content on the fly so that it fits. There's no horizontal scrolling, just up and down.' The licensing agreement also allows Motorola to offer the Opera Platform to telecommunications carriers. The platform is designed to integrate online content with a device's own applications, allowing an operator to update the content on the screen of a user's handset. The companies announced last week that they are working together to combine the WAP (Wireless Application Protocol) software stack from a browser developed by Motorola's Global Software Group with Opera's HTML browser software; that will allow phones to access the many WAP-based sites and services that have already been developed..."
Related News: Voice Extensible Markup Language (VoiceXML) 2.1
Following on the W3C announcement for VoiceXML 2.0 and Speech Recognition Grammar as W3C Recommendations, W3C has announced the release of a first public working draft for Voice Extensible Markup Language (VoiceXML) 2.1. VoiceXML 2.1 "specifies a set of eight (8) features commonly implemented by Voice Extensible Markup Language platforms. The specification is designed to be fully backwards-compatible with VoiceXML 2.0. The popularity of VoiceXML 2.0 spurred the development of numerous voice browser implementations early in the specification process. VoiceXML 2.0 has been phenomenally successful in enabling the rapid deployment of voice applications that handle millions of phone calls every day. This success has led to the development of additional, innovative features that help developers build even more powerful voice-activated services. While it was too late to incorporate these additional features into VXML2, the purpose of VoiceXML 2.1 is to formally specify the most common features to ensure their portability between platforms and at the same time maintain complete backwards-compatibility with VXML2..."
According to Dave Raggett, "The new features include using computed expressions for referencing grammars and scripts, the ability to detect where barge-in occurred within a prompt, greater convenience in prompting for dynamic lists of values, to be able to download data without having to move to the next page, to record the user's speech during recognition for later analysis, to pass data with a disconnect, and enhanced control over transfer..."
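Two of the features Raggett lists can be sketched in markup. The element names follow the VoiceXML 2.1 working draft as we understand it, while the grammar expression and fetch URL are invented for illustration: srcexpr computes a grammar reference at run time, and the data element downloads XML without transitioning to a new document.

```xml
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <form id="pickCity">
    <field name="city">
      <!-- Grammar referenced via a computed expression (new in 2.1) -->
      <grammar srcexpr="'grammars/cities-' + region + '.grxml'"
               type="application/srgs+xml"/>
      <prompt>Please say a city name.</prompt>
      <filled>
        <!-- Fetch XML data in place, without moving to the next page (new in 2.1) -->
        <data name="forecast" srcexpr="'forecast.xml?city=' + city"/>
      </filled>
    </field>
  </form>
</vxml>
```

In VoiceXML 2.0, both the grammar URI and any data fetch would have required a static src attribute or a full page transition, which is precisely the portability gap the 2.1 draft is meant to close.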
Bibliographic information: Voice Extensible Markup Language (VoiceXML) 2.1. W3C Working Draft 23-March-2004. Edited by Matt Oshry, Tellme Networks (Editor-in-Chief); Paolo Baggia, Loquendo; Michael Bodell, Tellme Networks; David Burke, Voxpilot Ltd.; Daniel C. Burnett, Nuance Communications; Emily Candell, Comverse; Jim Ferrans, Motorola; Jeff Haynie, Vocalocity; Hakan Kilic, ScanSoft; Jeff Kusnitz, IBM; Scott McGlashan, Hewlett-Packard; Rob Marchand, VoiceGenie; Michael Migdol, BeVocal, Inc.; Brad Porter, Tellme Networks; Ken Rehor, Vocalocity; Laura Ricotti, Loquendo. Version URL: http://www.w3.org/TR/2004/WD-voicexml21-20040323/. Latest version URL: http://www.w3.org/TR/voicexml21/.
Principal references:
Announcement 2004-03-23: "Opera Sings with IBM's Speech Technology. New version of Opera Embeds ViaVoice from IBM."
VoiceXML Forum:
- XHTML+Voice Profile 1.2. VoiceXML Forum.
- VoiceXML Forum Specifications
- VoiceXML Forum web site
- "XHTML+Voice - application/xhtml+voice+xml." By Gerald M. McCobb. IETF Network Working Group, Internet Draft 'draft-mccobb-xplusv-media-type-00'. April 02, 2004.
W3C References:
- W3C Multimodal Interaction Activity
- W3C Multimodal Interaction Working Group Charter. This WG's goal is to extend "the user interface to Web applications to support multimodal interaction by developing markup specifications for synchronization across multiple modalities and devices. Multimodal user interfaces can be used in two ways: to present the same or complementary information on different output modes, and to enable switching between different input modes depending on the current context and physical environment. The Working Group's specifications should be implementable on a royalty-free basis."
- W3C Voice Browser Activity. "W3C is working to expand access to the Web to allow people to interact via key pads, spoken commands, listening to prerecorded speech, synthetic speech and music. This will allow any telephone to be used to access appropriately designed Web-based services, and will be a boon to people with visual impairments or needing Web access while keeping their hands and eyes free for other things. It will also allow effective interaction with display-based Web content in the cases where the mouse and keyboard may be missing or inconvenient."
- Discussion list archives for 'www-voice'
- XHTML+Voice Profile 1.0. W3C Note. 21-December-2001. Submitted by IBM, Motorola, and Opera Software.
- XHTML+Voice Profile Version 1.0 Submission Request. November 30, 2001. The XHTML+Voice Profile was submitted to W3C by IBM, Motorola, and Opera Software under RAND license terms; an updated version was RF. See the W3C Staff Comment.
- "Modularization of XHTML. W3C Recommendation 10-April-2001.
IBM References:
- IBM and Multimodal Access
- Embedded ViaVoice. "The IBM Embedded ViaVoice software family brings the simplicity of a voice interface to mobile devices — enabling them to accept verbal commands and read back messages aloud."
- ViaVoice consumer software. "ViaVoice speech technology delivers a 'multi-modal' environment, freeing users from dependence on the mouse, keyboard and stylus for many applications."
- Access NetFront "supports advanced mobile voice recognition technologies based on the XHTML + Voice (X+V) 1.1 framework. X+V supports voice synthesis and voice recognition of mobile Internet data allowing voice input and output interaction with voice supported Web pages."
- "Multimodal Tools V4.1.1 Frequently Asked Questions." 9 pages. "X+V fits into the Web environment by taking a normal visual Web user-interface, and speech-enabling each part of it. That is, if you take a visual interface and break it up into its basic parts (such as an input field for a time of day, a check box for AM or PM, and so on), you can then simply enable the use of voice by adding voice markup to the visual markup. X+V consists of visual markup, a collection of snippets of voice markup for each element in the user interface, and a specification of which snippets to activate when. For visual markup, X+V uses the familiar XHTML standard. For voice markup, it uses a (simplified) subset of VoiceXML. For associating the snippets of VoiceXML and user-interface elements, X+V uses the XML Events standard. All of these are official standards for the Web as defined by the Internet Engineering Task Force (IETF) that governs web standards..." [source PDF]
- "Developing Multimodal Applications Using XHTML+Voice." IBM Pervasive Computing. January 2003. 15 pages. "This paper illustrates the basic structure and contents of an XHTML+Voice multimodal application, describing its fundamental building blocks. It is intended for those who are familiar with XHTML, VoiceXML, and HTML."
- "XHTML+Voice Programmer's Guide Version 1.0." First Edition. February 2004. Version 1.0. 142 pages.
- Multimodal Tools news group
- IBM and open standards
Press:
- "Opera's Browser Finds Its Voice." By Matt Loney and Paul Festa. From CNET News.com (March 23, 2004). "The next version of the browser is due in 'a couple of months,' according to Opera Chief Executive Jan Tetzchner. Aside from the obvious accessibility benefits, Tetzcher said, there are applications for in-car computing: 'In a car, you would like a combination of screen and voice, but you don't want to be watching a screen while driving. Being able to perform tasks by voice and get voice feedback will be very useful'..."
- "Big Blue Stars in Opera Voice-Recognition Technology." By Jay Lyman. In TechNewsWorld (March 23, 2004). "IBM's Jablokov told TechNewsWorld that thanks to a realization of return on investment and emerging standards, speech-recognition technology is ready for the enterprise, where IBM envisions the back-end suppport required to integrate speech nto the enterprise infrastructure. 'What you see is [that] speech is really mainstreaming,' Jablokov said. Referring to a series of speech-technology announcements from IBM at the AVIOS SpeechTEK 2004 conference, Jablokov downplayed the potential for Microsoft's competing SALT technology, which is proprietary."
- "Opera Gives Voice to Web Browser. Company Embedded Viavoice Into Software." By Gillian Law. In InfoWorld (March 23, 2004). "Opera Software ASA will include voice capabilities in its updated browser software, using IBM Corp.'s embedded ViaVoice technology. The upgraded browser, which will continue to be offered at no cost, will be available later this year. Initially, it will offer support for ViaVoice in English only, but other languages may be developed in due course, Opera Chief Executive Officer (CEO) Jon von Tetzchner said. Voice capabilities could well become the preferred way of interacting with a computer, according to von Tetzchner. While it is obviously useful for people with disabilities, it will also be popular with many other users who prefer voice to using a keyboard and mouse."
Earlier news:
- VoiceXML Forum Announcement: "VoiceXML Forum Endorses Advancement of VoiceXML 2.0 Specification to Proposed Recommendation. Announces Availability of X+V 1.2 as Multimodal Solution of Choice. Forum Acts as Official Steward of X+V Specification."
- "W3C Voice Extensible Markup Language (VoiceXML) 2.0 a Proposed Recommendation." Covers related announcement for VoiceXML Forum.
- "IBM Multimodal Browser Extension for MSIE."
- "IBM and Opera Software Team to Develop Multimodal Browser — XHTML+Voice (X+V) Browser Allows Developers to Extend Voice, Web Applications." Announcement July 24, 2002.
General:
- See also: "Speech Application Language Tags (SALT)."
- VoiceXML - Main reference page.