The OASIS Cover Pages: The Online Resource for Markup Language Technologies
Last modified: June 24, 2005
Voice Extensible Markup Language (VoiceXML)

Overview: This reference document provides information about the Voice Extensible Markup Language (VoiceXML), developed by W3C's Voice Browser Working Group as part of the W3C "Voice Browser" Activity and promoted by the VoiceXML Forum. The 'VoiceXML Forum' was announced originally under the name 'VXML Forum'.

[March 16, 2004]   VoiceXML 2.0 and Speech Recognition Grammar Published as W3C Recommendations.    The World Wide Web Consortium has released the first two W3C Recommendations in its Speech Interface Framework. "Aimed at the world's estimated two billion fixed line and mobile phones, W3C's Speech Interface Framework will allow an unprecedented number of people to use any telephone to interact with appropriately designed Web-based services via key pads, spoken commands, listening to pre-recorded speech, synthetic speech and music." The Voice Extensible Markup Language (VoiceXML) Version 2.0 Recommendation defines VoiceXML, designed for "creating audio dialogs that feature synthesized speech, digitized audio, recognition of spoken and DTMF key input, recording of spoken input, telephony, and mixed initiative conversations. Its major goal is to bring the advantages of Web-based development and content delivery to interactive voice response applications." The second Recommendation, Speech Recognition Grammar Specification Version 1.0, is key to VoiceXML's support for speech recognition, and is used by developers to describe end users' responses to spoken prompts. It "defines syntax for representing grammars for use in speech recognition so that developers can specify the words and patterns of words to be listened for by a speech recognizer. The syntax of the grammar format is presented in two forms, an Augmented BNF Form and an XML Form. The specification makes the two representations mappable to allow automatic transformations between the two forms."
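The relationship between the two grammar forms can be illustrated with a small sketch. The grammar below is a hypothetical example written for this note, not one taken from the Recommendation; `alternatives` and `to_abnf` are illustrative helper names, and the conversion handles only a rule made of a single flat `<one-of>` list. It is a toy mapping under those assumptions, not a general SRGS converter.

```python
import xml.etree.ElementTree as ET

# Namespace of the SRGS XML Form, and the expanded name of xml:lang.
NS = {"g": "http://www.w3.org/2001/06/grammar"}
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical grammar: the caller may say "coffee", "tea", or "milk".
GRAMMAR_XML = """<grammar xmlns="http://www.w3.org/2001/06/grammar"
         version="1.0" xml:lang="en-US" mode="voice" root="drink">
  <rule id="drink" scope="public">
    <one-of>
      <item>coffee</item>
      <item>tea</item>
      <item>milk</item>
    </one-of>
  </rule>
</grammar>"""


def rule_phrases(rule):
    """Collect the text of each <item> in a rule's top-level <one-of>."""
    items = rule.findall("g:one-of/g:item", NS)
    return [(item.text or "").strip() for item in items]


def alternatives(grammar_xml):
    """Map each rule id to the list of phrases it accepts."""
    root = ET.fromstring(grammar_xml)
    return {rule.get("id"): rule_phrases(rule)
            for rule in root.findall("g:rule", NS)}


def to_abnf(grammar_xml):
    """Render the same flat grammar in the Augmented BNF Form (toy mapping)."""
    root = ET.fromstring(grammar_xml)
    lines = [
        "#ABNF 1.0;",
        f"language {root.get(XML_LANG)};",
        f"mode {root.get('mode')};",
        f"root ${root.get('root')};",
    ]
    for rule in root.findall("g:rule", NS):
        scope = "public " if rule.get("scope") == "public" else ""
        phrases = " | ".join(rule_phrases(rule))
        lines.append(f"{scope}${rule.get('id')} = {phrases};")
    return "\n".join(lines)


if __name__ == "__main__":
    print(alternatives(GRAMMAR_XML))
    print(to_abnf(GRAMMAR_XML))
```

Real grammars also use rule references, repeats, weights, and semantic tags, none of which this sketch attempts to translate.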

[February 07, 2001] "The VoiceXML Forum is an industry organization established to promote VoiceXML as the universal standard for speech-enabled Web applications. The Forum, which is composed of over 350 member companies (4 Sponsor Members, 29 Promoter Members, and 320 Supporter Members), supports the work of the VoiceXML community through its conformance testing, marketing, education, and outreach efforts. Bolstered by a membership that has more than tripled in the past year, in 2000 the Forum launched a Technical Council to support its Conformance and Education Committees, and also formed a Marketing Committee. The VoiceXML Forum is a program of the IEEE Industry Standards and Technology Organization (IEEE-ISTO), which manages the day-to-day operations of the Forum."

On March 02, 1999, the formation of a new 'Voice eXtensible Markup Language Forum (VXML Forum)' was announced by AT&T, Lucent Technologies, and Motorola. The VXML Forum "aims to drive the market for voice- and phone-enabled Internet access by promoting a standard specification for VXML, a computer language used to create Web content and services that can be accessed by phone. AT&T, Lucent and Motorola will contribute their markup language technologies to the development of the open VXML specification. Seventeen other leading companies from the speech, Internet and communications markets have agreed to support the VXML Forum and play an active role in reviewing or contributing to the VXML specification. The initial specification will be available for public comment and contribution next month [April 1999], with the goal of submitting a final proposed specification for standardization to the World Wide Web Consortium (W3C) later this year. The initial VXML language specification will be based upon characteristics and functionality that includes Phone Markup Language or PML, an extension of the HTML language from AT&T, Lucent and Motorola's VoxML."

"The VXML Forum has four main objectives: (1) to develop an open VXML specification and then submit it for standardization; (2) to educate the industry about the need for a standard voice markup language; (3) to attract industry support and participation in the VXML Forum; (4) to promote industry-wide use of the resulting standard to create innovative content and service applications."

According to information provided on the VXML Forum's Web site, "VXML has its roots in a research project called PhoneWeb at AT&T Bell Laboratories. After the AT&T/Lucent split, both companies pursued development of independent versions of a phone markup language. Lucent's Bell Labs continued work on the project, now known as TelePortal. The recent research focus has been on service creation and natural language applications. AT&T Labs has built a mature phone markup language and platform that have been used to construct many different types of applications, ranging from call center-style services to consumer telephone services that use a visual Web site for customers to configure and administer their telephone features... As an XML-based definition with an HTML-like appearance, VXML will be easy to learn for experienced Web content programmers and amenable to easy processing by tools to support desktop development of VXML Web applications."

[May 18, 1999] VXML Forum - Update (from VoxML Developer Newsletter, Volume 1, Number 2): "Motorola's VXML representatives have been meeting regularly with the other forum partners to help develop the VXML standard. They are making progress towards standardization. IBM has become a forum contributor since our last newsletter, and is actively working with Motorola, AT&T, and Lucent to develop the new language."

Principal References

Articles, Papers, Reports, News

  • [June 15, 2005] "IBM And Speech Technology: An Interview With Bruce Morse." By Tracey Schelmetic. From TMCnet (June 15, 2005). "Where does IBM stand in the realm of speech technologies? CIS spoke with Bruce Morse, vice president of Contact Center Solutions for the IBM Software Group." [Morse:] "IBM's research organization has over 30 years of experience in speech. It is highly skilled in voice user interface design, persona development and grammar, has more than 250 speech patents and over 100 researchers worldwide in speech labs, including China, Haifa, Tokyo, India and Almaden, working in more than 15 languages. Our work ranges from contact centers to mobile devices to automobiles. IBM is a leader in driving and incorporating speech standards such as VoiceXML, MRCP and W3C. We work with companies of all sizes. IBM was the first to deploy natural language understanding in an automated contact center. For two consecutive years, JD Power and Associates surveys rating customer satisfaction with in-car navigation systems found the top cars were from Honda and Acura, which use IBM's Embedded ViaVoice speech recognition technology. Our contact center customers have found our speech solutions improve call retention rates by six to 10 percent, cutting call times by 10 percent and decreasing costs by up to 90 percent compared to assisted services. IBM is also helping the large community of developers, ISVs and customers deploy and manage speech enablement. We have made significant contributions to the speech industry, through open standards work on VoiceXML, CCXML and MRCP, as well as to the Eclipse Foundation, including our recently announced contributions of VoiceXML and CCXML editors. In addition, we recently announced our contribution to the Apache Foundation of the Reusable Dialog Components (RDC) Framework...
There are two million to three million J2EE developers in the marketplace, and our tooling and open source strategy has been to enable this highly skilled group to expand its reach into speech enablement. By creating plug-ins to the Eclipse framework, we help developers leverage their existing skills in Web development to extend to speech. We are contributing to the speech industry's efforts in order to shorten development time and decrease complexity through our commitment to open standards such as VoiceXML, CCXML, MRCP, xHTML and X+V. In addition, we have donated approximately 20 VoiceXML Reusable Dialog Components (RDCs) to the open-source community through IBM's Alphaworks..."

  • [April 2005] "Opera 8 Ships One Million Browsers with X+V Multimodal Technology." By Igor Jablokov (VoiceXML Forum Director; IBM). From VoiceXML Review Volume 5, Issue 2 (March/April 2005). "Opera Software ASA recently announced that version 8.0 of its browser received over one million downloads within four days of release. The Norwegian software vendor has created a fast and standards compliant Web experience. While this news is certainly commendable for any product introduction, rivaling even Mozilla's Firefox, it is also a milestone for the multimodal and voice standards community. Opera has included a feature that could usher in an age of human-computer interaction predicted long ago by many a science fiction writer. The Windows version of this browser now has an option that enables voice interaction. This functionality is provided by the IBM Multimodal Runtime Environment, which connects the Opera Browser to IBM Embedded ViaVoice (the same technology currently shipping in certain auto navigation systems). Not only does this enable users to interact with the entire browser interface using their voices (e.g., users can say 'browser go home' or 'browser fullscreen'), but they can also execute applications written in the XHTML+Voice (X+V for short) markup language. The X+V language permits developers to write and deploy multimodal Web applications, which allow users to interact through sight, sound and speech. This language was co-authored by IBM, Motorola and Opera and is under consideration by the W3C standards body. While modern day VoiceXML applications require specialized skills, X+V applications are different in that they more closely resemble standard Web applications. This breaks the current speech development paradigm and can allow the large body of Web developers to simply add voice interaction to existing Web applications... 
In any environment where 'hands-free' is not just a buzzword but a necessity, such as in healthcare, warehousing or enterprise applications, the value of this system becomes obvious. Doctors can ask for patient status by name or get alerted to changes in medical conditions using the natural sounding voice output (CTTS) that is included with the browser. In warehouses, companies can increase worker productivity by having the system communicate new orders to employees and leaving their hands free to fulfill the order. Also consider insurance adjusters speaking into complex forms and recording accident information while focused on investigating a scene. Opera 8 for Windows offers users a gateway into the multimodal experience. IBM looks forward to developers' creativity in leveraging these standards-based technologies to augment existing Web applications for increased end user productivity..."

  • [March 23, 2004]   Opera Multimodal Desktop Browser Supports XHTML+Voice (X+V) Specification.    An announcement from Opera Software at the AVIOS SpeechTEK International Exposition and Educational Conference describes the upcoming release of a multimodal desktop browser based on the XHTML+Voice (X+V) specification. In 2001, IBM, Motorola, and Opera submitted the XHTML+Voice Profile 1.0 to W3C for standards work. The most recent version of XHTML+Voice Profile 1.2 is managed by the VoiceXML Forum, and "brings spoken interaction to standard web content by integrating the mature XHTML and XML-Events technologies with XML vocabularies developed as part of the W3C Speech Interface Framework. The profile includes voice modules that support speech synthesis, speech dialogs, command and control, and speech grammars. Voice handlers can be attached to XHTML elements and respond to specific DOM events, thereby reusing the event model familiar to web developers. Voice interaction features are integrated with XHTML and CSS and can consequently be used directly within XHTML content." The Opera multimodal browser project builds upon an ongoing relationship between IBM and Opera: the new release incorporates IBM's Embedded ViaVoice speech technology. IBM's ViaVoice speech technology supports a variety of real-time operating systems (RTOS) and microprocessors, powering mobile devices such as smart phones, handheld personal digital assistants (PDAs), and automobile components. "By leveraging IBM's voice libraries in this version of Opera, users can navigate, request information, and even fill in Web forms using speech and other forms of input in the same interaction." 
The new platform allows users to "interact with the content on the Web in a more natural way, combining speech with other forms of input and output; developers can also start to build multimodal content using the open standards-based X+V markup language, which unifies the visual and voice Web by using development skills a large population of programmers already have today."

  • [January 27, 2004] "Telco Punts $2.5m on Interactive-Voice XML." By Julian Bajkowski. In ComputerWorld Australia (January 20, 2004). "AAPT [Australian telecommunication company, owned by New Zealand's largest telecommunications company, Telecom New Zealand] will invest more than $2.5 million on a new, retail-customer interactive voice project that ports directly back into its mainframe billing and transaction systems in an effort to reduce call centre and administration costs. Based on a VeCommerce natural speech recognition engine and SOAP/XML interface, the solution will allow customers a voice interface directly into the telco's billing system to perform transactions - rather than waiting for a call centre employee to do the same thing. While AAPT says that inbound customers will still be able to speak to staff, under the new system, the voice-driven, self-serve regime is clearly designed to eliminate both customer-service bottlenecks and cut the call centre staff costs that go with them. Analysts say such systems may offer competitive advantages because of operational cost reductions through shifting inbound call centre functions from a human base to robotic base. Gartner's vice president of research for enterprise networks, Geoff Johnson, confirmed that the uptake of voice-driven XML (or VXML) has been swift over the last 18 months, largely led by investments in VoIP infrastructure and voice engine and application improvements. 'It's the diplomacy and sophistication along with fluency and fault tolerance that is driving this. It's the personal productivity [to the customer] that makes this attractive,' Johnson said. Despite obvious payoffs, Johnson warned risks still existed in deploying voice-driven XML systems, noting some cultures (like Japan) simply do not tolerate non-human interfaces. Johnson said rollouts can come unstuck if enterprises attempted to port 'too many complex functions' to such systems..."

  • [September 19, 2003]   SnowShore Networks Develops Royalty-Free Media Server Control Markup Language (MSCML).    SnowShore Networks has officially announced royalty free licensing terms for implementing technology in the Media Server Control Markup Language (MSCML). MSCML is an XML-based protocol used in conjunction with the Session Initiation Protocol (SIP) to enable the delivery of advanced multimedia conferencing services over IP networks. The protocol was "submitted to the IETF as an Internet Draft in 2002 after a rigorous two year test and evaluation process. It is used to drive the delivery of IP enhanced conferencing to wireline, wireless and broadband networks worldwide. SnowShore also announced the successful deployment of MSCML in both trials and live network environments; it is currently being used by a number of application developers, media server manufacturers, equipment vendors and service providers, including Z-Tel, IBM, Broadsoft, Bay Packets, Commetrex, Leapstone and Ubiquity Software. Industry watchers and vendors alike view MSCML as the essential protocol for call control between the media server and application server in the IP services architecture. SnowShore communicated to the IETF that it is offering Royalty Free licenses of its intellectual property necessary for implementing the MSCML standard. This inclusive policy provides IP application developers, infrastructure vendors and service providers with the opportunity to bring to market new IP enhanced conferencing and innovative services within the universal framework of SIP and MSCML."

  • [July 23, 2003] "HP Acquires PipeBeach to Strengthen Leadership in Growing VoiceXML Interactive Voice Market. Standards-based Products from PipeBeach Bolster HP OpenCall Portfolio and Enhance HP's Ability to Deliver Speech-based Solutions." - "HP today announced the acquisition of PipeBeach AB, a Stockholm, Sweden-based provider of speech-based products and technology that enable the delivery of interactive voice solutions. HP plans to integrate PipeBeach's VoiceXML-based products into its OpenCall suite of enhanced telecommunications software. This acquisition significantly enhances HP's ability to help telecom service providers, network equipment providers and independent software vendors simplify the creation and deployment of VoiceXML-based applications. HP will be better able to provide these customers with the flexibility and agility they need to get to market faster, reduce costs and improve customer loyalty. PipeBeach's VoiceXML-based products and technology enable users to speak into their mobile phones and devices to obtain Web-based information such as news, stock prices and e-mail, as well as conduct transactions, such as online banking. The information and options are conveyed to the user through speech - instead of text or images. 'HP is committed to the interactive voice market, and we intend to help our customers grow by providing them with a full spectrum of development tools, products and solutions,' said Jean-Rene Bouvier, vice president and general manager, HP OpenCall Business Unit. 'When we combine the advanced PipeBeach technologies - and track record in VoiceXML -- with HP's OpenCall portfolio and unique combination of telecom and IT expertise, we can establish HP as a global provider of interactive voice solutions.' 
HP intends to build on its strong presence in the interactive voice market by leveraging the PipeBeach acquisition to accelerate the development of open standards for voice and multimodal technologies in three primary ways: (1) By helping to deliver a complete VoiceXML development and deployment environment that will enable service providers and mobile operators to accelerate rollout of new, revenue-generating voice-enabled services. (2) By accelerating the growth of voice portals, which can help reduce customer service costs. As users increasingly opt for voice browsing versus calling a live operator for basic information and assistance, the cost per call has dropped from $5 for human-assisted service to about 50 cents for voice-automated service. (3) By helping increase customer loyalty through improved service. Voice portals can reduce wait times significantly because they are equipped to handle unpredictable spikes in call volume and allow users to interact directly with Web-based information and systems... HP OpenCall speechWeb interacts between the telephone network and standard Web servers that host VoiceXML applications. It enables mobile users to interact with VoiceXML Web pages, using advanced speech recognition and text-to-speech or spoken prompts. Today, speechWeb supports approximately 40 languages..."

  • [March 25, 2003] "VXML and VoIP Boost Customer Satisfaction. Quickly ID Your Callers Via These Two Technologies." By Veronika Megler (Certified Consulting IT Architect, Emerging and Competitive Markets, IBM). From IBM developerWorks, Wireless. March 2003. ['In this article on how to improve your customers' experience with your automated telephone-response system, Veronika Megler demonstrates how to combine VXML and VoIP with the information inherent in a telephone call to identify both the caller and the number being called. Use this lesson to improve telephone system efficiencies and bring back customers.'] "As both a consumer and a technologist, I am continually annoyed by the poor usability of many automated telephone-response systems -- especially because I know how little effort it would take to improve them. In Talk to my VoIP, I described an application that uses a Voice-over IP (VoIP) connection to access VoiceXML- (VXML) fronted back-end applications. I also showed how these technologies can provide flexible access to application information and deliver better telephone-based assistance to the average service-center caller. By using what you already know about your caller the moment you answer the call, you can expand these concepts to take personalization and usability to the next level... by accessing the target number, you can provide different voices for different choices. You can use these basic principles and existing technology to build increasingly usable voice-driven applications..."

  • [January 29, 2003]   W3C Advances VoiceXML Version 2.0 to Candidate Recommendation Status.    The W3C's Voice Extensible Markup Language (VoiceXML) Version 2.0 specification has been released as a Candidate Recommendation, together with an explicit call for implementation. "VoiceXML is designed for creating audio dialogs that feature synthesized speech, digitized audio, recognition of spoken and DTMF key input, recording of spoken input, telephony, and mixed initiative conversations. Its major goal is to bring the advantages of web-based development and content delivery to interactive voice response applications." Comments on the CR specification are invited through 10-April-2003, when the VoiceXML specification is expected to enter the Proposed Recommendation phase. [Note 2003-02-25: Voice Extensible Markup Language (VoiceXML) Version 2.0. W3C Candidate Recommendation 20-February-2003. Updated version provides "a correction to the schemas to fix problems found with some schema tools..."]

  • [January 29, 2003] "Dispute Could Silence VoiceXML." By Paul Festa. In ZDNet News (January 29, 2003). "The Web's leading standards group called on developers to implement its nearly finished specification for bringing voice interaction to Web sites and applications. But the intellectual property claims of a handful of contributors, including Philips Electronics and Rutgers University, threaten to keep the specification tied up in negotiations, the standards body warned. The World Wide Web Consortium (W3C) on Tuesday issued VoiceXML 2.0 as a candidate recommendation, the penultimate stage in the consortium's approval process. The job of VoiceXML -- part of the W3C's Voice Browser Activity -- is to let people interact with Web content and applications using natural and synthetic speech, other kinds of prerecorded audio, and touch-tone keypads. In addition to adding speech as a mode of interaction for everyday Web surfing, the W3C has its eye on other applications. These include the use of speech for the visually impaired and for people accessing the Web while driving. The group called VoiceXML a central part of its work on voice-computer interaction. 'The VoiceXML language is the cornerstone of what we call the W3C speech interface framework--a collection of interrelated languages that are used to create speech applications,' said Jim Larson, co-chair of the W3C's voice browser working group and manager of advanced human I/O (input/output) at Intel. 'Using these types of applications, the computer can ask questions and the user can respond using words and phrases or by touching the buttons on their touch-tone phone'... Other W3C specifications control individual pieces of the voice-browsing puzzle. The Speech Synthesis Markup Language (SSML), for example, describes how the computer pronounces words, with attention to voice inflection, volume and speed. 
The Speech Recognition Grammar Specification (SRGS), establishes what a user must say in response to a computer prompt. And the Semantic Interpretation for Speech Recognition (Semantic Interpretation) strips down text and translates it to a form that the computer can understand..."

  • [December 17, 2002] "Standardizing VoiceXML Generation Tools." By David L. Thomson. In VoiceXML Review (December 2002). "An area where we have an opportunity to make VoiceXML easier to use and more portable is in development and runtime tools. VoiceXML provides two significant advantages in authoring speech-enabled applications, when compared to previous methods. It allows a developer to build speech services with less effort and it allows applications written for one speech platform to run on another speech platform. These advantages are diminished, however, if software tools used to create and support VoiceXML code are inadequate or incompatible. The VoiceXML Tools Committee, under the direction of the VoiceXML Forum, has been working on methods for improving the quality and uniformity of tools as described below. To define a process for improvement, we must first outline an architecture that illustrates how tools are connected. Companies currently building tools include application developers, speech server suppliers, speech engine vendors, speech hosting service bureaus, stand-alone tool developers, and customers... Development tools and runtime software on the VoiceXML page server must use the same meta language. Since the meta language is generally unique to a given tool vendor, runtime software on the VoiceXML page server will only work with development tools from the same vendor... the VoiceXML Tools Committee is studying ways to standardize the meta language. Vendors would then use the standard meta language to represent parameters of the call flow, even if vendor tools otherwise provide different features. Two proposals under consideration are: (1) the XForms standard under development by the W3C and (2) an XML-based standard where style sheets convert between formats used by different vendors. This rather ambitious goal will, if successful, improve the interoperability of development and runtime tools and make applications portable across vendors...
Tools for developing VoiceXML-based speech applications are a critical factor in making VoiceXML easy to use. While VoiceXML itself may be well-defined, industry software for generating VoiceXML code lacks uniformity. We have launched an effort to define two standards that will help VoiceXML systems interoperate across different vendors. The effort will define how applications are represented and how runtime data is transported and stored. We hope that this effort will foster the creation of better tools and make developing VoiceXML services faster and easier..."

  • [December 17, 2002] "Enhancing VoiceXML Application Performance By Caching." By Dave Burke. In VoiceXML Review (December 2002). "The VoiceXML architectural model specifies a partitioning of application hosting, and application rendering. Specifically, the application is served from a Web Server and is typically created dynamically within the framework of an Application Server or equivalent. The VoiceXML Interpreter renders the resultant VoiceXML document, transmitted across a network by HTTP, into a series of instructions interpreted by the Implementation Platform. Implied in this model is a geographical distribution of the application hosting environment and the VoiceXML platform and thus the incursion of network latencies. An application might make many subsequent requests for new VoiceXML documents during its lifetime and thus these latencies may have considerable adverse effects on performance. In this article we will discuss how caching can be used to enhance the performance of VoiceXML applications. Caching is a strategy for storing temporary 'objects' (e.g., VoiceXML resources) local to the VoiceXML Interpreter that can be employed by the application developer for optimising these latencies. In what follows we will use the phrase 'origin server' to denote the application hosting environment, and 'user agent' to refer to the VoiceXML Interpreter and Implementation Platform... HTTP caching provides a powerful mechanism for improving performance of applications. A performant VoiceXML application that yields customer satisfaction will promote customer retention and also save money on deployment costs. Caching is often poorly understood and under-utilised on the Internet, yet can be effectively harnessed by observing some simple practices as outlined in this article..."

  • [December 03, 2002]   IBM WebSphere Voice Application Access Supports VoiceXML.    IBM has announced the WebSphere Voice Application Access middleware product designed to simplify "building and managing voice portals and to more easily extend web-based portals to voice. Leveraging the scalability, personalization, and authentication features of IBM's WebSphere Portal, it enables mobile workers to more easily access information from multiple voice applications -- using a single telephone number. This new offering includes IBM's WebSphere Voice Server as well as ready-to-use email, personal information management (PIM) functions, and sample portlets. It also supports VoiceXML and Java -- including development tools based on Eclipse, the open-source, vendor-neutral platform for writing software -- and uses open-standard programming languages to create voice-enabled applications that will interoperate with a range of web servers and databases. Building on the VoiceXML standards allows IBM WebSphere Voice Application Access to work with third party browsers and their associated underlying speech recognition and text-to-speech technologies. As the VoiceXML 2.0 specification nears final approval, IBM WebSphere Voice Application Access will move quickly to support it."

  • [December 03, 2002] "IBM Advances Pervasive Computing Strategy With New Software. Voice Portal Technology and Tools Extend IBM Momentum With Device Manufacturers, Service Providers and Enterprises." - "Building on a continuing wave of pervasive computing customer deployments and industry alliances, IBM today announced new software products and tools that make it easier for developers to build and manage voice portals -- as well as extend enterprise applications, such as mobile databases, to new devices. Today's announcement underscores IBM's ongoing commitment to help customers extend computing to new devices using an infrastructure built on a foundation of open, integrated and scalable technologies. IBM has built momentum helping enterprises extend capabilities to their mobile workforce, assisting service providers find new ways to decrease costs and increase revenue streams, and enabling device manufacturers to provide intelligent access to the enterprise. 'Pervasive computing plays a significant role in the on-demand era,' said Rodney Adkins, General Manager, IBM Pervasive Computing Division. 'Over the past year, we've been aggressively laying the foundation that gives people the flexibility to access and interact with information when they want it, where they want it and how they want it. Today's announcement adds to what is quickly becoming an extensive portfolio of technology, hardware, software and services that span the pervasive computing ecosystem.' Adding to its portfolio, IBM unveiled the new WebSphere Voice Application Access product: middleware that simplifies building and managing voice portals and more easily extends web-based portals to voice. Leveraging the scalability, personalization and authentication features of IBM's WebSphere Portal, it enables mobile workers to more easily access information from multiple voice applications -- using a single telephone number. 
This new offering includes IBM's WebSphere Voice Server as well as ready-to-use email, personal information management (PIM) functions, and sample portlets. It also supports VoiceXML and Java -- including development tools based on Eclipse, the open-source, vendor-neutral platform for writing software -- and uses open-standard programming languages to create voice-enabled applications that will interoperate with a range of web servers and databases. In keeping with IBM's strategy to provide solutions across multiple platforms, IBM will be working to make WebSphere Voice Application Access interoperable with offerings from third party VoiceXML vendors, such as Nuance and Cisco. In addition, IBM is also working with independent solutions vendors including V-Enable, Voxsurf and Viecore to extend their current solutions..." For a beta version of the WebSphere Voice Application Access product, see alphaWorks Voice Portal.

  • [December 02, 2002] "IBM Adds New WebSphere Tool for Voice Apps." By Stacy Cowley. In InfoWorld (December 02, 2002). "IBM will release later this month WebSphere Voice Application Access (WVAA), a tool for developers seeking to voice-enable corporate applications for mobile access. WVAA supports VoiceXML (Voice Extensible Markup Language) and Java, and includes sample portlets and preconfigured functions to speed development time for customized voice portals. Developers can use the technology to build voice interfaces for retrieving information from corporate databases and systems, such as stock quotes or customer information. In conjunction with other IBM development tools, WVAA can enable information to be requested with one device, such as a mobile phone, but delivered to another, like a handheld computer. The technology is particularly well-suited to mobile workers, said Sunil Soares, director of product management for IBM's Pervasive Computing Division. IBM has been working with one real estate company using the technology to allow agents to call and retrieve listings using keywords, he said... Nuance Communications, which competes with IBM on voice software, will support WVAA, as will infrastructure provider Cisco Systems. Voice software companies V-Enable and Voxsurf will also support the technology, as will services firm Viecore..." See also in eWEEK: "IBM Bolts Voice Support Onto Existing Applications," by Carmen Nobel.

  • [October 15, 2002] "VoiceXML, CCXML, and SALT." By Ian Moraes. In XML Journal Volume 3, Issue 9 (September 2002), pages 30-34. "There's been an industry shift from using proprietary approaches for developing speech-enabled applications to using strategies and architectures based on industry standards. The latter offer developers of speech software a number of advantages, such as application portability and the ability to leverage existing Web infrastructure, promote speech vendor interoperability, increase developer productivity (knowledge of speech vendor's low-level API and resource management is not required), and easily accommodate, for example, multimodal applications. Multimodal applications can overcome some of the limitations of a single mode application (GUI or voice), thereby enhancing a user's experience by allowing the user to interact using multiple modes (speech, pen, keyboard, etc.) in a session, depending on the user's context. VoiceXML, Call Control eXtensible Markup Language (CCXML), and Speech Application Language Tags (SALT) are emerging XML specifications from standards bodies and industry consortia that are directed at supporting telephony and speech-enabled applications. The purpose of this article is to present an overview of VoiceXML, CCXML, and SALT and their architectural roles in developing telephony as well as speech-enabled and multimodal applications... Note that SALT and VoiceXML can be used to develop dialog-based speech applications, but the two specifications have significant differences in how they deliver speech interfaces. Whereas VoiceXML has a built-in control flow algorithm, SALT doesn't. Further, SALT defines a smaller set of elements compared to VoiceXML. While developing and maintaining speech applications in two languages may be feasible, it's preferable for the industry to work toward a single language for developing speech-enabled interfaces as well as multimodal applications. 
This short discussion provides a brief introduction to VoiceXML, CCXML, and SALT for supporting speech-enabled interactive applications, call control, and multimodal applications and their important role in developing flexible and extensible standards-compliant architectures. This presentation of their main capabilities and limitations should help you determine the types of applications for which they could be used. The various languages expose speech application technology to a broader range of developers and foster more rapid development because they allow for the creation of applications without the need for expertise in a specific speech/telephony platform or media server. The three XML specifications offer application developers document portability in the sense that a VoiceXML, CCXML, or SALT document can be run on a different platform as long as the platform supports a compliant browser. These XML specifications are posing an exciting challenge for developers to create useful, usable, and portable speech-enabled applications that leverage the ubiquitous Web infrastructure..." [alt URL]

  • [October 09, 2002] "Progress in the VoiceXML Intellectual Property Licensing Debacle." By Jonathan Eisenzopf (The Ferrum Group). From VoiceXMLPlanet.com. October 2002. "In January of 2002 the World Wide Web Consortium released a rule that requires Web standards to be issued royalty free (RF). Some VoiceXML contributors hold intellectual property related to the VoiceXML standard. Some of those companies have already issued royalty free licenses, while others have agreed to reasonable and non-discriminatory (RAND) licensing terms... The fact that not all contributors have switched to a royalty free licensing model has been a thorn in the progress of the VoiceXML standard. I've voiced my concerns previously on this issue, specifically in SALT submission to W3C could impact the future of VoiceXML... Recently, IBM and Nokia changed their licensing terms from RAND to RF. At the VoiceXML Planet Conference & Expo on September 27 [2002], Ray Ozborne, Vice President of the IBM Pervasive Computing Division assured the audience at the end of his keynote speech that IBM would be releasing all intellectual property that related to the VoiceXML and XHTML+Voice specifications royalty free and encouraged the other participants to do the same... If VoiceXML is going to survive as a Web standard, then all contributors must license their IP royalty free, otherwise, the large investment that's been made will go down the drain. My hope is that the voice browser group at the W3C will either resolve these licensing issues in the next six months or jettison VoiceXML and replace it with SALT. Either way, I believe that it would be prudent for voice gateway vendors to be working on a SALT browser so that customers have the option down the road..."

  • [October 07, 2002] "Qwest Communications Launches Service To Help Customers Design Voice-Enabled Customer Service Applications." - Qwest Communications International Inc. has "launched a new Web portal that provides business customers and systems integrators with the tools to develop customized interactive voice response (IVR) and speech recognition applications for their customer service functions. With these specially tailored applications, businesses and systems integrators can quickly and cost-effectively provide their customers with higher quality service via the telephone, e-mail or the Internet. The new portal -- called the Qwest Development Network -- is based on the voice extensible mark-up language (VXML), which is an open standard design code used to create voice-enabled Web applications. The development network provides tools for VXML application development, testing, online documentation and expert live technical support. Business customers and systems integrators can create an application from concept to prototype to trial to production at a reduced cost. Also, because the development network is based on an open standard concept, Qwest business customers and systems integrators can protect their development investment and easily migrate the application to Qwest Web Contact Center(sm) or other VXML platforms. Used by large and small companies alike, Qwest Web Contact Center is a Web-driven platform that integrates voice and data applications so businesses can provide world-class customer service over the phone or the Web. Qwest Web Contact Center enables businesses to implement a variety of functions including IVR, in-bound customer service, out-bound marketing, Web chats and other help desk applications. Qwest Web Contact Center supports both Speechworks and Nuance speech recognition technologies, and seamlessly integrates with Genesys and Cisco's ICM Computer Telephony Integration platforms for enhanced call management..."

  • [October 05, 2002] "Voice Biometrics and Application Security. Identification, Verification, and Classification." By Moshe Yudkowsky (WWW). In Dr Dobb's Journal [DDJ] (November 2002) #342, pages 16-22. Feature Article. ['Voice-based biometric security must support identification, verification, and classification. Moshe presents a verification system in which users' voice models are stored in a database on a VoiceXML server.'] "Voice biometrics are an excellent option for application security. Voice biometrics, which measure the user's voice, require only a microphone -- a robust piece of equipment as close as the nearest telephone. In this article, I prototype an application that uses a telephone call to verify identity using freely available voice biometric resources that have simple APIs. Furthermore, the prototype can be easily integrated with Internet-capable applications... Voice biometrics provide three different services: identification, verification, and classification. Speaker verification authenticates a claim of identity, similar to matching a person's face to the photo on their badge. Speaker identification selects the identity of a speaker out of a group of possible candidates, similar to finding a person's face in a group photograph. Speaker classification determines age, gender, and other characteristics. Here, I'll focus on speaker verification resources ('verifiers'). Older verifiers used simple voiceprints, which are essentially verbal passwords. During verification, the resource matches a user's current utterance against a stored voiceprint. Modern verifiers create a model of a user's voice and can match against any phrase the user utters. This is a terrific advantage. First, ordinary dialogue can be used for verification, so an explicit verification dialogue may be unnecessary. Second, applications can challenge users to speak random phrases, which make attacks with stolen speech extremely difficult... 
The prototype I present uses a telephony server to connect to the telephone network, a speech-technology server, and an application server to execute my code and control the other two servers... For the telephony server, speech-technology resource server, and application server, I use BeVocal's free developer hosting. BeVocal hosts VoiceXML-based applications. VoiceXML is an open specification from the W3C's 'voice browser' working group. XML-based VoiceXML lets you write scripts with dialogues that use spoken or DTMF input, and text-to-speech or prerecorded audio for output. My scripts reside on the Internet and are fetched by the VoiceXML server via HTTP. Since the VoiceXML specification does not define a voice biometrics API, I used BeVocal's extensions to VoiceXML. Another company that offers voice biometrics hosting is Voxeo; Voxeo uses a different API. Voxeo lets you send tokens through HTTP to initiate calls from the VoiceXML server to users, which is convenient for web-based applications -- not to mention more secure, as the application can easily restrict the calls to predefined telephone numbers. Both BeVocal and Voxeo offer free technical support. If your application is voice-only and over the phone, adding speaker verification is straightforward. But any Internet-capable application can add VoiceXML... Biometrics in general, and speech technologies in particular, are imperfect and have a unique capacity for abuse: Voices, faces, and other characteristics can be scanned without knowledge or consent. Still, knowing 'something you are' is a powerful security tool when coupled with 'something you have' and 'something you know'." Note: The W3C Voice Browser Working Group was recently [25 September 2002] rechartered as a royalty free group operating under W3C's [then] Current Patent Practice.

  • [July 31, 2002] "VoiceXML Making Web Heard In Call Centers." By Ann Bednarz and Phil Hochmuth. In Network World (July 29, 2002). "Aspect Communications this week will announce call center software that essentially will enable users to navigate Web content via voice commands. The Aspect news comes on the heels of Avaya's announcement last week of interactive voice response (IVR) software that will make data contained in corporate directories and databases available to callers via spoken commands. At the heart of both efforts is support for the latest release of VoiceXML (VXML), Version 2.0. An extension to the XML document formatting standard, VXML streamlines development of voice-driven applications for retrieving Web content. While using voice commands to retrieve information is a routine IVR task, emerging tools support more complex, speech-driven activities, such as filling out forms or retrieving product information, all in a standards compliant rather than proprietary environment. In Aspect's case, customers will be able to use the same databases, application servers and business rules to process voice self-service interactions as they do to process Web self-service transactions. The firm is building the voice-activated service features into its existing software suite, Aspect IP Contact Suite. Avaya is adding VXML capabilities to Version 9.0 of its Avaya IVR server. Previous versions offered speech-recognition features, but 9.0 is the first to embed VXML support. Adoption of standards such as VXML is just one contributor to an overall trend to increase the sophistication of IVR products, making them less dependent on menus that bury information several layers deep and better able to handle queries phrased in natural language, says Martin Prunty, president of consulting firm Contact Center Professionals..." See details in the announcements from Aspect Communications and Avaya.

  • [July 31, 2002] "Aspect Communications Announces First IP-based Self-Service with VXML. Aspect Continues to Prove Customer Service Benefits of Converged Network by Enabling New Types of Voice Self-Service Applications." - Aspect Communications Corporation, a "leading provider of business communications solutions that help companies improve customer satisfaction, reduce operating costs, gather market intelligence and increase revenue, today announced the first IP-based self-service offering with VoiceXML (VXML) capabilities. The combination of IP-based self-service with VXML changes how customers use voice self-service. While customers are accustomed to voice self-service for routine tasks like checking balances and paying bills, IP and VXML open a wide range of more complex activities like filling out forms, placing detailed orders or completing any transaction that could also be handled on the Web. Aspect's offering will provide customers tremendous flexibility in how they interact with businesses. Using IP-based self-service that supports VXML, enterprises can simplify their infrastructure and reduce costs by automating many more customer service tasks than are automated today. Aspect's solution has several unique features that will let companies quickly and affordably develop newer, more useful self-service applications for customers. It offers a single development environment for creating rules that handle self- and live-service via voice, the Web and e-mail. Aspect's software is completely standards-based and integrates with VXML 2.0, allowing customers to perform the same self-service functions over the phone that they perform on the Web. Using natural language, customers can request information, fill out forms and place orders, and the enterprise uses the same databases, application servers and business rules as it does for the Web to process the voice self-service interactions... 
The Aspect IP Contact Suite enables fully integrated multichannel communications, including traditional voice technologies (PSTN), VoIP, e-mail and Web collaboration. Voice traffic travels over the same IP network as data and other communications such as e-mail and Web-based communications, versus over a separate circuit-switched network. Aspect's solution merges all communication into a unified queue and delivers it to the integrated desktops of service representatives. One network for voice and data centralizes administration, and the browser-based desktop applications empower the representatives to respond to contacts via all channels -- voice, e-mail, Web chat, assisted browsing and more -- on a single desktop with an easy-to-use interface..."

  • [June 24, 2002] "VoiceXML and the Future of SALT. [Voice Data Convergence.]" By Jonathan Eisenzopf. In Business Communications Review (May 2002), pages 54-59. ['Jonathan Eisenzopf is a member of the Ferrum Group, LLC, which provides consulting and training services to companies that are in the process of evaluating, selecting or implementing VoiceXML speech solutions.'] "The past year has been eventful for VoiceXML, the standard that application developers and service providers have been promoting for the delivery of Web-based content over voice networks. Many recent developments have been positive, as continued improvements in speech-recognition technology make voice-based interfaces more and more appealing. Established vendors are now validating VoiceXML, adding it to their products and creating new products around the technology. For many enterprises, this means that the next time there's a system upgrade, VoiceXML may be an option. For example, InterVoice-Brite customers soon will be able to add VoiceXML access to their IVR platform, which would provide callers with access to Web applications and enterprise databases... The introduction of SALT as an alternative to VoiceXML for multi-modal applications will present alternatives for customers who are not focusing exclusively on the telephone interface. However, VoiceXML is likely to be the dominant standard for next-generation IVR systems, at least until Microsoft and the SALT Forum members begin to offer product visions and complete solution sets..."

  • [April 25, 2002]   W3C Voice Browser Working Group Issues VoiceXML Last Call Working Draft.    W3C has released a Last Call Working Draft for Voice Extensible Markup Language (VoiceXML) Version 2.0. Pending receipt of positive feedback on this draft, the W3C Voice Browser Working Group plans to submit the specification for approval as a W3C Candidate Recommendation; comments may be sent for consideration until May 24, 2002. VoiceXML "is designed for creating audio dialogs that feature synthesized speech, digitized audio, recognition of spoken and DTMF key input, recording of spoken input, telephony, and mixed-initiative conversations. Its major goal is to bring the advantages of web-based development and content delivery to interactive voice response applications. The top-level element is <vxml>, which is mainly a container for dialogs. There are two types of dialogs: forms and menus. Forms present information and gather input; menus offer choices of what to do next... The dialog constructs of form, menu and link, and the mechanism (Form Interpretation Algorithm) by which they are interpreted are then introduced in Section 2. User input using DTMF and speech grammars is covered in Section 3, while Section 4 covers system output using speech synthesis and recorded audio. Mechanisms for manipulating dialog control flow, including variables, events, and executable elements, are explained in Section 5. Environment features such as parameters and properties as well as resource handling are specified in Section 6. The appendices provide additional information including the VoiceXML Schema, a detailed specification of the Form Interpretation Algorithm and timing, audio file formats, and statements relating to conformance, internationalization, accessibility and privacy." [Full context]
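
Editor's illustration (not taken from the W3C draft): the structure described above, a top-level <vxml> element containing form and menu dialogs, can be sketched as a small document. The dialog ids, prompts, and option values below are hypothetical, and the Python wrapper is only an assumed convenience for checking that the markup is well-formed and for listing its dialogs; it is not a VoiceXML interpreter.

```python
import xml.etree.ElementTree as ET

VXML_NS = "http://www.w3.org/2001/vxml"

# Hypothetical VoiceXML 2.0 document: one menu (offers choices of what
# to do next) and two forms (present information and gather input).
VXML_DOC = """<?xml version="1.0"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <menu id="main">
    <prompt>Say weather or news.</prompt>
    <choice next="#weather">weather</choice>
    <choice next="#news">news</choice>
  </menu>
  <form id="weather">
    <field name="city">
      <prompt>Which city?</prompt>
      <option>Boston</option>
      <option>Chicago</option>
      <filled>
        <prompt>You chose <value expr="city"/>.</prompt>
      </filled>
    </field>
  </form>
  <form id="news">
    <block>Here are today's headlines.</block>
  </form>
</vxml>
"""


def list_dialogs(source):
    """Return (kind, id) for each top-level form or menu in a VoiceXML document."""
    root = ET.fromstring(source)
    if root.tag != "{%s}vxml" % VXML_NS:
        raise ValueError("top-level element is not <vxml>")
    dialogs = []
    for child in root:
        kind = child.tag.split("}")[-1]  # strip the namespace prefix
        if kind in ("form", "menu"):
            dialogs.append((kind, child.get("id")))
    return dialogs


if __name__ == "__main__":
    print(list_dialogs(VXML_DOC))
```

A conforming voice browser would fetch such a document over HTTP and run the Form Interpretation Algorithm on each form; the helper here only inspects the document's structure.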

  • [April 05, 2002] "Is Speech Recognition Becoming Mainstream?" By Savitha Srinivasan and Eric Brown (IBM Almaden Research Center). In IEEE Computer Volume 35, Number 4 (April 2002), pages 38-41. IEEE Computer Society. ISSN: 0018-9162. This Guest Editors' Introduction provides an introduction to speech recognition in its two primary modes (using speech as spoken input, or as a data or knowledge source), an introduction to VoiceXML, and an overview of other articles in this IEEE Computer Special Issue on Speech Recognition. ['Combining the Web's connectivity, wireless technology, and handheld devices with grammar-based speech recognition in a VoiceXML infrastructure may finally bring speech recognition to mass-market prominence.'] "... At the simplest level, speech-driven programs are characterized by the words or phrases you can say to a given application and how that application interprets them. An application's active vocabulary -- what it listens for -- determines what it understands. A speech recognition engine is language-independent in that the data it recognizes can include several domains. A domain consists of a vocabulary set, pronunciation models, and word usage models associated with a specific speech application. It also has an acoustic component reflected in the voice models the speech engine uses during recognition. These voice models can be either unique per speaker or speaker-independent. The domain-specific resources, such as the vocabulary, can vary dynamically during a given recognition session. A dictation application can transcribe spoken input directly into the document's text content, a transaction application can facilitate a dialog leading to a transaction, and a multimedia indexing application can generate words as index terms. In terms of application development, speech engines typically offer a combination of programmable APIs and tools to create and define vocabularies and pronunciations for the words they contain. 
A dictation or multimedia indexing application may use a predefined large vocabulary of 100,000 words or so, while a transactional application may use a smaller, task-specific vocabulary of a few hundred words. Although adequate for some applications, smaller vocabularies pose usability limitations by requiring strict enumeration of the phrases the system can recognize at any given state in the application. To overcome this limitation, transactional applications define speech grammars for specific tasks. These grammars provide an extension of the single words or simple phrases a vocabulary supports. They form a structured collection of words and phrases bound together by rules that define the set of speech streams the speech engine can recognize at a given time. For example, developers can define a grammar that permits flexible ways of speaking a date, a dollar amount, or a number. Prompts that cue users on what they can say next are an important aspect of defining and using grammars. It turns out that speech grammars are a critical component of enabling the Voice Web... The Voice Web -- triggered by the connectivity that wireless technology and mobile devices offer -- may be the most significant speech application yet. Developers originally included speech recognition technology in the device, but now they house this technology on the server side. This trend could lead to the development of powerful mass-market speech recognition applications such as (1) voice portals that provide instant voice access to news, traffic, weather, stocks, and other personal information; and (2) corporate information to streamline business processes within the enterprise. With the advent of VoiceXML, the Voice Web has become the newest paradigm for using technology to reinvent e-commerce. VoiceXML lets users make transactions via the telephone using normal speech without any special equipment. 
Thus, combining the Web's connectivity, wireless technology, and handheld devices with effective grammar-based speech recognition in a VoiceXML infrastructure may finally lead to the elusive mass market that speech recognition developers have chased for decades."
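
Editor's illustration (not from the IEEE Computer article): the idea of a grammar as "words and phrases bound together by rules" can be sketched with a toy rule set that accepts a few ways of speaking a date. The rule names, the tiny vocabulary, and the tuple-based notation are assumptions made for this sketch; real platforms would express such rules in a grammar format such as the W3C Speech Recognition Grammar Specification and would match against recognizer hypotheses, not plain text.

```python
# Toy grammar: a literal is a string; composite nodes are (kind, body)
# tuples where kind is "seq", "alt", "opt", or "ref" (rule reference).
GRAMMAR = {
    "date": ("seq", [("opt", "on"),
                     ("alt", [("ref", "month_day"), ("ref", "day_of_month")])]),
    "month_day": ("seq", [("ref", "month"), ("ref", "day")]),  # "march third"
    "day_of_month": ("seq", [("opt", "the"), ("ref", "day"),
                             "of", ("ref", "month")]),          # "the third of march"
    "month": ("alt", ["january", "february", "march"]),
    "day": ("alt", ["first", "second", "third"]),
}


def ends(node, tokens, pos, grammar):
    """Return every token position at which `node` can finish when started at `pos`."""
    if isinstance(node, str):
        matched = pos < len(tokens) and tokens[pos] == node
        return {pos + 1} if matched else set()
    kind, body = node
    if kind == "ref":
        return ends(grammar[body], tokens, pos, grammar)
    if kind == "opt":
        return {pos} | ends(body, tokens, pos, grammar)
    if kind == "alt":
        result = set()
        for item in body:
            result |= ends(item, tokens, pos, grammar)
        return result
    if kind == "seq":
        positions = {pos}
        for item in body:
            following = set()
            for p in positions:
                following |= ends(item, tokens, p, grammar)
            positions = following
        return positions
    raise ValueError("unknown node kind: %r" % (kind,))


def accepts(utterance, rule="date", grammar=GRAMMAR):
    """True if the whole utterance is covered by the named rule."""
    tokens = utterance.lower().split()
    return len(tokens) in ends(("ref", rule), tokens, 0, grammar)


if __name__ == "__main__":
    for phrase in ("March third", "on the third of March", "third March"):
        print(phrase, "->", accepts(phrase))
```

The point of the sketch is the one made in the article: a few rules cover many phrasings ("March third", "the third of March", "on March third") without enumerating every acceptable sentence.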

  • [February 21, 2002]   See: W3C Publishes Specification for Voice Browser Call Control (CCXML).    The W3C Voice Browser Working Group has released a first public Working Draft specification for Voice Browser Call Control: CCXML Version 1.0. The CCXML specification, based upon CCXML 1.0 submitted in April 2001, "describes markup designed to provide telephony call control support for VoiceXML or other dialog systems. CCXML has been designed to complement and integrate with a VoiceXML system." The draft thus contains many references to VoiceXML's capabilities and limitations, together with details on how VoiceXML and CCXML can be integrated. However, the two languages are separate and are not required in an implementation of either language. For example CCXML could be integrated with a more traditional IVR system and VoiceXML could be integrated with some other call control system... Properly adding advanced telephony features to VoiceXML [through CCXML] entails adding not just a new telephone model, but new call management and event processing, as well... events from telephony networks or external networked entities are non-transactional in nature; they can occur at any time, regardless of the current state of VoiceXML interpretation. These events could demand immediate attention. We could either abandon VoiceXML's admirably simple single-threaded programming model, or delay event-servicing until the VoiceXML program explicitly asked to handle such events. Instead of making either of these bad choices, we instead move all call control functions out of VoiceXML into an accompanying CCXML program. VoiceXML can thus focus on being effective for voice dialogs, while CCXML tackles the very different problems..." [Full context]

  • [October 23, 2001]   W3C Working Draft for Voice Extensible Markup Language (VoiceXML) Version 2.0.    W3C has announced the first release of a public working draft for Voice Extensible Markup Language (VoiceXML) Version 2.0, along with a joint statement on collaborative effort between W3C and the VoiceXML Forum. The new draft is part of the W3C Voice Browser Activity and forms part of the proposals for the W3C Speech Interface Framework. The WD "specifies VoiceXML (Voice Extensible Markup Language) which is designed for creating audio dialogs that feature synthesized speech, digitized audio, recognition of spoken and DTMF key input, recording of spoken input, telephony, and mixed-initiative conversations. Its major goal is to bring the advantages of web-based development and content delivery to interactive voice response applications. VoiceXML is a markup language that: (1) Minimizes client/server interactions by specifying multiple interactions per document. (2) Shields application authors from low-level, and platform-specific details. (3) Separates user interaction code [in VoiceXML] from service logic [CGI scripts]. (4) Promotes service portability across implementation platforms. VoiceXML is a common language for content providers, tool providers, and platform providers. (5) Is easy to use for simple interactions, and yet provides language features to support complex dialogs." According to a Memorandum of Understanding describing collaboration between the VoiceXML Forum and W3C, "VoiceXML Forum and the W3C have determined that it is in the best interests of the respective organizations and the public that they work together to further develop a dialog markup language... VoiceXML Forum will file an express abandonment of [certain relevant ] U.S. 
trademark applications, and [during the five-year period] the VoiceXML Forum agrees that the W3C will have sole control of the definition and evolution of the dialog markup language based on the VoiceXML 1.0 that is under development by the W3C Voice Browser Working Group." [Full context]

  • [February 14, 2002] "Updating Your System. Is VoiceXML Right for Your Customer Service Strategy?" [Critical Decisions]. By Jonathan Eisenzopf. In New Architect: Internet Strategies for Technology Leaders Volume 7, Issue 3 (March 2002), pages 20-21. ISSN: 1537-9000. "VoiceXML is based on technology that has been used in IVR systems for years and deployed in many Fortune 500 companies. VoiceXML is simply a thin veneer that abstracts the low-level APIs used to develop IVR applications. Voice dialogs are specified by static (or dynamic) XML documents that contain sets of recorded or synthesized prompts and speech recognition grammars. These XML documents are converted by a VoiceXML gateway into low-level commands that interact with the digital signal processors (DSP) and telephony boards in a VoiceXML gateway. It's unlikely that VoiceXML will bring the Web to the phone, however. Despite the hype, VoiceXML isn't well suited as a general-purpose interface for providing telephone access to the Web. Instead, the two areas where it can provide immediate and compelling benefits are customer service and order entry... The airline industry has used IVR systems to provide flight arrival and departure information for some time. This has dramatically reduced costs by eliminating the need for live operators and shortening the average length of each call. However, customers can be frustrated by touch tone IVR systems, and will often press zero in an attempt to reach a live representative. VoiceXML-based IVRs are a better alternative to such systems because they offer speech recognition and text-to-speech capabilities. For example, Amtrak's IVR application lets callers speak to the system, rather than navigating through multiple menus. Before updating its IVR system to use speech recognition, roughly 70 percent of customers using the system would exit to speak with an operator. 
After the speech recognition technology was added, Amtrak reports that the exit rate was reduced to 30 percent... Although VoiceXML hasn't been widely adopted yet, the fact that technology vendors are taking an interest in the standard is reassuring. With companies like Oracle, HP, Motorola, and IBM jumping on the VoiceXML bandwagon, it's likely that you'll have access to VoiceXML-capable tools the next time you upgrade your application servers and Web development software... Several companies are already working to improve VoiceXML systems to address these issues. As with most technologies, once VoiceXML's appeal broadens and the benefits of deploying IVR solutions as a complement to online e-business applications become more evident, the rate of adoption will increase. If you currently handle order entry and customer support with a combination of online and telephone support, now may be the time to consider VoiceXML as a way to reduce costs and realize greater return on your existing software investments..." Note: New Architect was formerly WebTechniques.

  • [February 02, 2002] "Speech Vendors Shout for Standards." By Ephraim Schwartz. In InfoWorld (February 01, 2002). "The battle for speech technology standards is set to escalate next week when a collection of industry leaders submits to the World Wide Web Consortium (W3C) a proposed framework for delivering combined graphics and speech on handheld devices. The VoiceXML Forum, headed by IBM, Nuance, Oracle, and Lucent will announce a proposal for a multimodal technology standard at the Telephony Voice User Interface Conference, in Scottsdale, Arizona. Meanwhile, Microsoft will counter with its own news, using the same conference to announce the addition of another major speech vendor to its SALT (Speech Application Language Tags) Forum. The as yet unnamed vendor intends to rewrite its components to work with Microsoft's speech platform. The announcement will follow the addition of 18 new members to the SALT Forum, a proposed alternative to VXML's multimodal solution. New members of the SALT Forum include Compaq and Siemens Enterprise Networks. Founding members include Cisco, Comverse, Intel, Microsoft, Philips, and SpeechWorks... Most mainstream speech developers are currently creating Voice XML speech applications built on Java and the J2EE (Java 2 Enterprise Edition) environment, and running on BEA, IBM, Oracle, and Sun application servers. This week General Magic and InterVoice-Brite announced a partnership to develop Interactive Voice Recognition (IVR) enterprise solutions for 'J2EE environments,' using General Magic's VXML technology. Until recently Microsoft offered only a simple set of SAPI (speech APIs). Now through acquisition and internal development it has its own powerful speech engine which it is giving away to developers royalty free, said Peter Mcgregor, an independent software vendor creating speech products. Microsoft redeveloped SAPI in Version 5.1 to run on its new speech engine, while simultaneously proposing SALT as an alternative to VXML. 
Wrapping it all up in a marketing context, Microsoft's Mastan called the company's collection of speech technologies a 'platform,' a term previously not used... The issue over which specification of SALT, not due to be released until sometime later this year, or VXML, whose Version 2 is now out for review, is better is an argument that can only be determined by developers. Each side claims the other's specifications are deficient... IBM's William S. 'Ozzie' Osborne, general manager of IBM Voice Systems in Somers, N.Y.: 'I hope that we get to one standard. Multiple standards fragment the market place and create a diversion. I would like to see us get to a standard that is industry wide and not proprietary. What we are proposing to the W3C, using VXML for speech and x-HTML for graphics in a single program, is cheaper and easier than SALT without having to have the industry redo everything they have done'... Note the 2002-01-31 announcement: "The SALT Forum Welcomes Additional Technology Leaders as Contributors. New Members Add Extensive Expertise in All Aspects of Multimodal and Telephony Application Development and Deployment."

  • [March 08, 2002] "A SIP Interface to VoiceXML Dialog Servers." By Jonathan Rosenberg, Peter Mataga, and David Ladd (dynamicsoft). Internet Engineering Task Force, Internet Draft. Reference: draft-rosenberg-sip-vxml-00.txt. July 13, 2001; expires: February 2002. "VoiceXML is an XML based scripting language for describing voice dialogs. VoiceXML interpreters run within an interpreter context that, among other tasks, provides a call control interface for accessing the interpreter. It is very natural to provide a VoIP-based interpreter context that uses SIP and RTP to communicate with the outside world. In this document, we provide detailed specifications for a SIP/RTP based interpreter context... It is very natural to provide a VoiceXML interpreter context based purely on IP. Specifically, based on VoIP using SIP and RTP, along with HTTP for document access. An incoming VoIP call triggers the execution of the script, fetched from a server using HTTP. The incoming RTP stream for the call is passed to the interpreter for processing, and speech generated by the interpreter is sent over RTP to the called party. We call a pure IP-based VoiceXML system an 'IP dialog server', or just 'dialog server'. Dialog servers are a key part of the application story for SIP-based networks, as described in the SIP application component architecture. That document describes SIP-based dialog servers, and provides a high level overview of how the SIP interface works. This document provides a stand-alone, self-contained, more thorough description of a SIP-based VoIP VoiceXML interpreter context..." [cache]

  • [January 02, 2002] "What's New in VoiceXML 2.0." By Jim A. Larson. In VoiceXML Review Volume 1, Issue 11 (December 2001). "So what's new with VoiceXML 2.0? Plenty. What was a single language, VoiceXML 1.0, has been extended into several related markup languages, each providing a useful facility for developing web-based speech applications. These facilities are organized into the W3C Speech Interface Framework... The VoiceXML 2.0 supports four I/O modes: speech recognition and DTMF as input with synthesized speech and prerecorded speech as output. VoiceXML 2.0 supports system-directed speech dialogs where the system prompts the user for responses, makes sense of the input, and determines what to do next. VoiceXML 2.0 also supports mixed initiative speech dialogs. In addition, VoiceXML 2.0 also supports task switching and the handling of events, such as recognition errors, incomplete information entered by the user, timeouts, barge-in, and developer-defined events. Barge-in allows users to speak while the browser is speaking. The VoiceXML 2.0 is modeled after VoiceXML 1.0 designed by the VoiceXML Forum, whose founding members are AT&T, IBM, Lucent, and Motorola. VoiceXML 2.0 contains clarifications and minor enhancements to VoiceXML 1.0. VoiceXML also contains a new <log> tag for use in debugging and application evaluation... The W3C Voice Browser Working Group has extended VoiceXML 1.0 to form VoiceXML 2.0 plus several new markup languages, including speech recognition grammar, semantic attachment, and speech synthesis. The speech recognition and speech synthesis markup languages were designed to be used in conjunction with VoiceXML 2.0, as well as with non-VoiceXML applications. The speech community is invited to review and comment on working drafts of these languages."
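
The event handling that Larson describes (timeouts, recognition errors, and the new <log> tag) can be pictured with a small sketch. The dialog below is an illustrative assumption, not taken from the article: the form id, field name, and prompt wording are hypothetical, and the markup uses the namespace of the final VoiceXML 2.0 Recommendation rather than the 2001 working draft. Python is used only to confirm that the markup is well formed and to list the handlers it declares.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.w3.org/2001/vxml}"  # VoiceXML 2.0 namespace

# Hypothetical dialog: one field with handlers for silence (noinput),
# unrecognized speech (nomatch), and a successful answer (filled).
DIALOG = """
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="confirm">
    <field name="answer">
      <prompt>Would you like to hear your balance?</prompt>
      <option>yes</option>
      <option>no</option>
      <noinput>Sorry, I did not hear anything. <reprompt/></noinput>
      <nomatch>Sorry, I did not understand. <reprompt/></nomatch>
      <filled>
        <log>Caller answered <value expr="answer"/></log>
      </filled>
    </field>
  </form>
</vxml>
"""

def event_handlers(document: str) -> list:
    """Return the names of the handler elements declared on the first field."""
    root = ET.fromstring(document.strip())
    field = root.find(f".//{NS}field")
    names = [child.tag[len(NS):] for child in field]
    return [name for name in names if name in ("noinput", "nomatch", "filled")]

if __name__ == "__main__":
    print(event_handlers(DIALOG))
```

A real voice browser, not this script, would interpret the dialog; the sketch only shows where the event handlers and the logging element sit relative to the prompt.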

  • [January 02, 2002] "VoiceXML 2.0 from the Inside." By Dr. Scott McGlashan. In VoiceXML Review Volume 1, Issue 11 (December 2001). "With the publication in October 2001 of VoiceXML 2.0 as a W3C Working Draft, VoiceXML is finally on its way to become a W3C standard. VoiceXML 2.0 is based on VoiceXML 1.0, which was submitted to the W3C Voice Browser Working Group by the VoiceXML Forum in May 2000. In this article, we examine some of the key changes in the first public working draft of VoiceXML 2.0 as compared to the VoiceXML 1.0 specification... Since the founding of the Voice Browser Working Group in March 1999, the group had the mission of developing a suite of standards related to speech and dialog. These standards formed the W3C Speech Interface Framework and cover markup languages for speech synthesis, speech recognition, natural language and dialog, amongst others. Since the VoiceXML Forum had made clear its intention to develop VoiceXML 1.0 and submit it to the Voice Browser Working Group, the dialog team focused its efforts on specifying requirements for a W3C dialog markup language and providing detailed technical feedback to the Forum as VoiceXML 1.0 evolved. With the submission of VoiceXML 1.0, the dialog team began its work in earnest of developing VoiceXML into a dialog markup language for the Speech Interface Framework. A change request process was established in order to manage requests for changes in VoiceXML 2.0 from members of the Working Group and other interested parties; changes could include editorial, clarification, functional enhancements, all the way up to complete redesign of the language. Rather than try to incorporate every possible change into VoiceXML 2.0, we decided to limit the scope of changes..."

  • [January 02, 2002] "First Words: So What's New?" By Rob Marchand. In VoiceXML Review Volume 1, Issue 11 (December 2001). ['This month's column touches on some of the things that you can look for in VoiceXML 2.0, and how it impacts some of the VoiceXML tricks and tips he's introduced throughout the year.'] "The VoiceXML Forum founders (AT&T, Motorola, IBM, and Lucent) prepared the original VoiceXML 1.0 Specification. It was then passed over to the W3C Voice Browser Working Group to be evolved into VoiceXML 2.0. It was released as a public working draft on October 23rd of this year, with public comments being accepted until November 23rd . The process moving forward will include (possibly) additional working drafts, followed by a 'Last Call' working draft. Finally, a 'candidate recommendation' will be made available for final comment, followed by the formalization of VoiceXML 2.0 as a W3C Recommendation. There is still substantial work to go through in moving VoiceXML 2.0 through the W3C process, but the specification itself should now include most substantive changes and features that will be considered for the 2.0 recommendation. The current working draft of VoiceXML 2.0 improves on the VoiceXML 1.0 specification in a number of ways. If you're developing on any of the publicly available developer systems, you probably already have access to these features, or at least some of them..."

  • [November 01, 2001] "VoiceXML Developer Series: A Tour Through VoiceXML, Part V." By Jonathan Eisenzopf. From VoiceXMLPlanet. November 01, 2001. ['In the previous edition of the VoiceXML Developer, we created a full VoiceXML application using form fields, a subdialog, and internal grammars. In this edition, we will learn more about one of the most important, but rarely covered components of a VoiceXML application, grammars.'] "Now that we've built a few applications, it's time to talk about grammars. Grammars tell the speech recognition software the combinations of words and DTMF tones that it should be listening for. Grammars intentionally limit what the ASR engine will recognize. The method of recognizing speech without the burden of grammars is called "continuous speech recognition" or CSR. IBM's Via Voice is an example of a product that uses CSR technology to allow a user to dictate text to compose an email or dictate a document. While CSR technologies have improved, they're not accurate enough to use without the user training the system to recognize their voice. Also, the success rate of recognition in noisy environments, such as over a cell phone or in a crowded shopping mall, is reduced greatly. Pre-defining the scope of words and phrases that the ASR engine should be listening for can increase the recognition rate to well over 90%, even in noisy environments. The VoiceXML 1.0 standard uses grammars to recognize spoken and DTMF input. It doesn't, however, define the grammar format. This is changing however with the release of VoiceXML 2, which defines a standard XML-based and alternate BNF notation grammar format. Still, the fact that VoiceXML relies heavily on grammars means that we must create or reuse grammars each time we want to gather input from the user. In fact, the time required to create, maintain, and tune VoiceXML grammars will likely be several magnitudes greater than the time you will take to develop the VoiceXML interfaces. 
Not having high-quality and complete grammars means that the user will spend too much of their time repeating themselves. A system that cannot recognize input the first time, every time, will alienate users and cause them to abandon the system altogether. Therefore, we are going to spend a bit of time talking about grammars for VoiceXML 1.0 (and now VoiceXML 2) in the coming articles so that you will be armed with the knowledge you need to create successful VoiceXML applications. The first grammar format we are going to learn is GSL, which is used by the Nuance line of products... I want to reflect on some of the things that I've learned as I've been developing new VoiceXML applications over the past year as it relates to grammars. First, grammars can be difficult to develop and time consuming to tune. And things don't stop there. You will probably need to tune the dictionary that the system is using to include alternate word pronunciations as you begin to collect data on where the ASR application is failing. It's very important that the application will be able to recognize what the user is saying most of the time. Because DTMF input is almost 100% accurate, it should be preferred over speech for things like phone and credit card numbers. However, some voice interface designers recommend that you don't mix a touch-tone input with speech input. I'd say it's better than the alternative if you are having problems recognizing number sequences. Remember, speech recognition has gotten much better, but it still takes a great deal of work and care to reach the high 90s percentile success rates that vendors often mention. Thanks again for joining us for another edition of the VoiceXML Developer. In the next edition of the VoiceXML Developer, we will continue our exploration into grammars as part of our tour of the VoiceXML 1.0 specification..."

  • [September 10, 2001] "VoiceXML and the Voice/Web Environment. Visual Programming Tools for Telephone Application Development." By Lee Anne Phillips. In Dr Dobb's Journal [DDJ] (October 2001) #329, pages 91-96. Programmer's Toolchest. "While the Internet is making inroads into the public switched-telephone network, XML protocols such as VoiceXML are providing access to a set of tools that address the entire range of web applications..." The article provides an overview of GUI tools for creating VoiceXML applications, and reviews two: Visual Designer 2.0 from Voxeo, and Covigo Studio. [Covigo Studio "provides a visual programming environment that helps you to rapidly develop integrated mobile data and voice applications. Based on a user-centric process modeling approach, Studio separates user-interaction workflow from presentation design and data source integration. It allows you to build mobile applications from the ground-up or as extensions to existing applications, and to constantly optimize their applications to meet changing user, industry and business needs. The visual modeling approach provides multiple ways to integrate with existing enterprise applications at the presentation layer, business logic layer, or data layer levels. The product integrates with existing IT systems - including complex enterprise business processes encapsulated in systems used for customer relationship management (CRM), enterprise resource planning (ERP), and supply chain automation (SCM). This includes integrating with such technologies as HTML, JSPs, EJBs, JDBC, XML, and packaged application APIs..." The Visual Designer 2.0 from Voxeo is available at no cost. One can use the designer "to visually design phone applications and it will automatically generate the VoiceXML or CallXML markup for you. This allows a voice application developer to focus on important issues like usability and functionality, without having to worry about syntax. 
Voxeo Designer 2.0 is the first visual phone markup design tool to fully support round-trip development -- any CallXML or Voice XML application may be opened in the Designer tool, updated graphically (or by editing the XML directly) and re-deployed for use. Features include: Visual application design using flowcharts; Full round-trip, bi-directional development; Element/Attribute syntax validation; FTP and HTTP support for file read and write; Full CallXML Tag Support; Full VoiceXML 1.0 Tag support; 100% Pure-Java IDE, runs on any Java Virtual Machine ..."] Additional resources with Lee Anne's article include listings and source code.

  • [August 24, 2001] "Speech Technology Grows Up. Speech applications can save money and the technology is moving into advanced applications." By Kathleen Ohlson. In Network World Fusion (August 20, 2001). "... In the coming months, voice technology will only get better, observers say. Industry experts and vendors expect support for VoiceXML, a specification that would enable speech-based applications and online information to become phone and voice accessible, and the infusion of speech recognition in wireless devices, such as cell phones and PDAs, to flourish. Thrifty has deployed SpeechWorks' interactive speech recognition software to handle customer requests for car rental quotes. Customers who call Thrifty's reservation number are prompted to give information regarding dates, times, car size, city and airport, and then receive reservation information. When a customer wants to book a reservation, he is transferred to a sales agent. The agent receives the calls and information containing the customer's requests on his computer screen. The car agency has handled more than 200,000 calls so far through the system, and it plans to push over more by summer's end. Thrifty receives 4 million calls per year with 30% to 40% coming from customers checking rates and availability, according to DuPont, staff vice president of reservations... In addition to Thrifty, United Airlines and T. Rowe Price are two companies that have recently implemented interactive speech systems. Speech technology is also expected to penetrate in areas such as inventory tracking and salesforce automation, according to industry experts. For example, salespeople could prompt for information regarding their contacts and calendars through a phone... One of the main drivers of speech technology in the coming months will be the adoption of VoiceXML, which basically outlines a common way for speech applications to be programmed. 
With the adoption of VoiceXML, businesses would only need to build an application once and then could run it on multiple vendor platforms. VoiceXML is the brainchild of IBM, AT&T, Lucent and Motorola, and is currently supported by more than 500 companies, including Nokia, Sprint PCS, Nuance and SpeechWorks. SpeechWorks recently rolled out its VoiceXML-based speech recognition engine OpenSpeech Recognizer 1.0; Nuance, Lucent, IBM and others have implemented VoiceXML into their products..."

  • [August 24, 2001] "Voice XML Version 2 Stalled Over IP Issue." By Ephraim Schwartz. In InfoWorld August 24, 2001. "Version 2 of the Voice XML markup language is all but signed and sealed, but not quite delivered due to a snag in nailing down IP (intellectual property) rights. According to an industry analyst familiar with the issues discussed at the Voice XML Forum, all the specifications have been agreed upon, but there is a concern still that a future developer using VXML could be sued by a member of the Forum for infringement of IP rights... One solution may be that companies [currently 55 Forum members] might choose to provide license-free use or forego patent rights, Meisel added. All sources in the speech technology industry see VXML as a boon to the industry because it uses a standard language already familiar to Web developers. Version 2, expected to ship by the end of the year, is in its final development stages, according to the Forum chairman Bill Dykas... Up until now, developers creating speech applications used proprietary formats for writing speech grammars. A speech grammar is needed to map a wide range of responses into a narrower range, explained Dykas. For example, in a 'yes/no grammar' there may be a dozen ways for a caller to respond in the affirmative to a question including yeah, yes, okay, please, and alright which all can be mapped to Yes. Version 2 of VXML will define a common format so the program has to deal with only a single response. The second major addition to the standard -- the Voice XML Forum is working with the W3C standards body -- is the clarification of the call transfer tags... technology components, as for example in telephony: how to manipulate telephone voice mail and load balancing between mechanisms if a large number of calls come in simultaneously... Other areas include natural language understanding and multimodal interfaces for handhelds and cellular handsets. 
For example, in using a multimodal interface, a mobile worker may make a voice request to a database for customers that match a certain set of parameters, but the results will be displayed rather than spoken."
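
The 'yes/no grammar' idea quoted above, where many spoken variants collapse to a single response, can be sketched in a few lines. This is a minimal illustration in Python rather than in any grammar format: the affirmative phrases come from the quoted example, while the negative phrases and all function and variable names are assumptions added for the sketch.

```python
from typing import Optional

# Affirmative variants are those listed in the quoted example.
AFFIRMATIVE = {"yeah", "yes", "okay", "please", "alright"}
# Negative variants are hypothetical; the article does not list any.
NEGATIVE = {"no", "nope", "no thanks"}

def normalize_response(utterance: str) -> Optional[str]:
    """Map a recognized utterance to "Yes" or "No"; None means no match."""
    # Lowercase and collapse whitespace so "No  thanks" matches "no thanks".
    text = " ".join(utterance.lower().split())
    if text in AFFIRMATIVE:
        return "Yes"
    if text in NEGATIVE:
        return "No"
    return None

if __name__ == "__main__":
    for heard in ["Yeah", "  okay ", "No thanks", "maybe"]:
        print(repr(heard), "->", normalize_response(heard))
```

The application then only has to handle two outcomes plus the no-match case, which is the simplification the common grammar format is meant to provide.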

  • [August 2001] Early Adopter VoiceXML. By Eve Astrid Andersson, Stephen Breitenbach, Tyler Burd, Nirmal Chidambaram, Paul Houle, Daniel Newsome, Xiaofei Tang, and Xiaolan Zhu. Wrox Press. August, 2001. ISBN: 1861005628. The book covers: (1) An overview of the development and deployment environments available; (2) VoiceXML 1.0 syntax tutorial; (3) Grammar use, including JSGF and Nuance GSL syntax; (4) Use of VoiceXML with XSLT, ASP, JSP, and PHP; (5) Nuance Speechobjects; (6) The future of VoiceXML technologies, including VoiceXML 2.0. [...] "VoiceXML brings the power of Voice to the Web - the information we are used to accessing through the visual web interfaces of our PCs and mobile devices can now be accessed through speech alone. Building on the functionality already seen in IVR applications deployed by our banks and utility companies, the tag based syntax of VoiceXML will instantly be familiar to existing web developers, and applications can already be deployed using one of the many voice portals available. With the world's billion plus telephones, from antique black candlestick phones to the latest mobiles, there is a huge ready-made audience crying out for voice applications. The userbase encompasses those on the move who require easy access to information wherever they are, and those who haven't the money or inclination to access the Internet through a PC. The book aims to give the reader an in-depth analysis of the current state of VoiceXML technology. The information will help you develop voice-enabled applications now, and make sure you are ready for future advances of this quickly changing arena." See the online Table of contents.

  • [September 05, 2001] VoiceXML in RELAX NG. 2001-09-05 or later. From Kohsuke KAWAGUCHI. "I have translated the VoiceXML 1.0 DTD into RELAX NG syntax. I originally wrote it in my short-hand syntax and then used my tool to convert it to full RELAX NG syntax. I've never tested it, so it may well contain several translation errors. All the files are available in one zip file..." [cache]

  • [August 01, 2001]   IBM alphaWorks Releases Voice Toolkit.    The XML development team at IBM alphaWorks labs has released a beta version of a 'Voice Toolkit' to assist in the creation of voice applications "in less time, using a VoiceXML application development environment. The Voice Toolkit features grammar and VoiceXML editors so that application developers do not need to know the internals of voice technology. The Voice Toolkit Beta includes: (1) An integrated development environment (IDE) - runs on the desktop and enables the multi-step process of creating speech applications; (2) A VoiceXML editor - provides content assistance and integrated pronunciation development; (3) A Grammar editor - enables syntax-checking and integrated pronunciation development for generating JSGF grammars for VoiceXML applications. The grammar editor includes grammar creation for SRCL/BNF grammars and it provides conversion capability between SRCL/BNF and JSGF; (4) A pronunciation builder - generates a pronunciation from spelling; and it lets you manually create pronunciations; (5) A basic audio recorder - allows the creation of audio files from spoken text and the playing of previously-recorded audio files; (6) VoiceXML Reusable Dialog Components - pre-written VoiceXML code for use as building blocks for application functions." [Full context]

  • [May 23, 2000] "VoiceXML Forum Founders Submit VoiceXML 1.0 Specification to W3C. Submission Marks Milestone on the Path to Voice-Enabled Internet." - "The VoiceXML Forum today announced that the World Wide Web Consortium (W3C) has acknowledged the submission of Version 1.0 of the VoiceXML specification. At its May 10-12 meetings in Paris, the W3C's Voice Browser Working Group agreed to adopt VoiceXML 1.0 as the basis for the development of a W3C dialog markup language. The Forum's founding members, AT&T, IBM, Lucent Technologies, and Motorola made the W3C submission. Acknowledgement by the W3C will help to accelerate and expand the reach of the Internet through voice-enabled Web content and services. The VoiceXML Forum will host the next meeting of the W3C Voice Browser Working Group in September 2000. 'As the W3C Voice Browser Working Group begins to define the speech interface framework that extends the Web to voice-based devices, we will use VoiceXML as a model for our dialog markup language. The W3C speech interface framework will include integrated markup languages for dialog, grammar, speech synthesis, natural language semantics, and multimodal dialogs, as well as a standard list of reusable dialogs,' said Jim Larson of the Intel Architecture Labs, who is Co-chair of the W3C Voice Browser Working Group..."

  • VoiceXML specification DTD. Posted 18-July-2000. Includes corrections. [cache]

  • VoiceXML DTD - From http://www.w3.org/TR/2000/NOTE-voicexml-20000505.

  • [August 13, 2001] "Creating VoiceXML Applications With Perl." By Kip Hampton. From XML.com. August 08, 2001. ['Kip Hampton shows how Perl and VoiceXML can work together.'] "VoiceXML is an XML-based language used to create Web content and services that can be accessed over the phone. Not just those nifty WAP-enabled 'Web phones', mind you, but the plain old clunky home models that you might use to order a pizza or talk to your Aunt Mable. While HTML presumes a graphical user interface to access information, VoiceXML presumes an audio interface where speech and keypad tones take the place of the screen, keyboard, and mouse. This month we will look at a few samples that demonstrate how to create dynamic voice applications using VoiceXML, Perl, and CGI. A rigorous introduction to VoiceXML and how it works is beyond the scope of this tutorial. For more complete introductions to VoiceXML's moving parts see Didier Martin's 'Hello, Voice World' or the VoiceXML Forum's FAQ... VoiceXML is much more than an alternative interface to the Web. It allows developers to extend their existing applications in new and useful ways, and it offers many unique opportunities for new development. As you may have guessed, though, that power and flexibility come with a hefty price tag: VoiceXML gateways (the hardware and software that connect the Web to the phone system, translate text to speech, interpret the VoiceXML markup, etc.) are not cheap. The good news is that many of prominent VoiceXML gateway providers offer free test and deployment environments to curious developers, so you can check out VoiceXML for yourself without breaking the bank."
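
The approach Hampton's article describes, a server-side script that emits VoiceXML on request, can be sketched as follows. This is not the article's Perl code: it is a Python stand-in, the function name and form id are hypothetical, and it targets the VoiceXML 2.0 namespace rather than the 1.0 markup that was current when the article appeared.

```python
import xml.etree.ElementTree as ET

VXML_NS = "http://www.w3.org/2001/vxml"  # VoiceXML 2.0 namespace

def build_greeting(messages):
    """Return a VoiceXML document (as a string) that speaks each message."""
    # Serialize with a default namespace instead of an "ns0:" prefix.
    ET.register_namespace("", VXML_NS)
    root = ET.Element(f"{{{VXML_NS}}}vxml", {"version": "2.0"})
    form = ET.SubElement(root, f"{{{VXML_NS}}}form", {"id": "greeting"})
    block = ET.SubElement(form, f"{{{VXML_NS}}}block")
    for message in messages:
        prompt = ET.SubElement(block, f"{{{VXML_NS}}}prompt")
        prompt.text = message  # ElementTree escapes &, <, > on output
    return ET.tostring(root, encoding="unicode")

if __name__ == "__main__":
    # In a CGI or web handler, this string would form the response body.
    print(build_greeting(["Hello, voice world.", "Goodbye."]))
```

Building the document through an XML library rather than string concatenation keeps dynamic text, such as database values, from breaking the markup.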

  • [July 30, 2001] "XML Gives Voice to New Speech Applications." By Steve Chambers. In Network World [Fusion] Volume 18, Number 31 (July 30, 2001), page 37. "Speech technology is evolving to the point where an exchange of information between a person and a computer is becoming more like a real conversation. Many factors are responsible for this, ranging from an exponential increase in computing power to a general advancement of basic speech technology and user interface design. Speech-based applications deployed to date have been based on code created by a few speech software vendors. VoiceXML will likely change this landscape by virtue of its promised vendor independence in creating speech applications. VoiceXML is the emerging standard for speech-enabled applications. It defines how a dialog is constructed and executed between a caller and a computer running speech recognition and/or text-to-speech software. VoiceXML incorporates the flexibility to create speech-enabled Web-based content or to build telephony-based speech recognition call center applications. . . Vocabularies and grammars are the key components that define the input to a speech-enabled page. The vocabulary consists of the words to be recognized by the speech recognition engine. For example, a vocabulary for a flight information system might consist of city names and travel-related words such as "leaving" and "fly." Grammars provide the structure to identify meaningful phrases. A vocabulary and grammar are combined within a speech-enabled application to define speech recognition within a reasonable range of efficiency for both the caller and the speech recognition processor. Designing a speech application includes presenting data for delivery over the phone, constructing a call flow and enabling prompts and grammars. VoiceXML provides a common set of rules as a flexible foundation, but it's up to the designer to create the appropriate flow and personality for a speech system..."

  • [July 23, 2001] "New technology gives Web a voice." By Wylie Wong. In CNET News.com July 19, 2001. "A budding standard, the brainchild of tech giants AT&T, IBM, Lucent Technologies and Motorola, is fueling new software that allows people to use voice commands via their phones -- either cell or land-based -- to browse the Web. Users of the technology can check e-mail, make reservations and perform other tasks simply by speaking commands. The technology, called VoiceXML, is now winding its way through the World Wide Web Consortium Internet standards body, which is reviewing the specification and could make it a formal standard by year's end. Proponents of VoiceXML say standardization is crucial for the market for Web voice access software and services to take off. The standard gives software and hardware makers, as well as service providers and other companies using the technology, a common way to build software to offer Web information and services over the phone. . . Even though the VoiceXML specification hasn't been finalized, tech companies and telecommunications service providers alike have flocked to support the technology and are already offering new software and services that tie the telephone to the Internet. The technology has gained the support of nearly 500 companies, including IBM, networking giant Cisco Systems, database software maker Oracle and stock brokerage firm Charles Schwab..."

  • [May 23, 2001] "The Power Of Voice." By Ana Orubeondo (Test Center Senior Analyst, Wireless and Mobile Technologies). In InfoWorld (May 18, 2001). ['VoiceXML should connect your existing Web infrastructure, the Internet, and the standard telephone by providing a standard language for building voice applications. E-business managers who plan voice portal strategies will need to decide whether to build the portals themselves or turn to a growing number of voice ASPs. Be careful when selecting rapidly evolving voice portal technologies. Key improvements such as grammar authoring in Version 2.0 should iron out some of the shortcomings VoiceXML exhibits.'] "VoiceXML is a standard language for building interfaces between voice-recognition software and Web content. Just as HTML defines the display and delivery of text and images on the Internet, VoiceXML translates any XML-tagged Web content into a format that speech-recognition software can deliver by phone. VoiceXML 1.0 is a specification of the VoiceXML Forum, an industry organization founded by AT&T, IBM, Lucent Technologies, and Motorola and consisting of more than 300 companies. With the backing and technology contributions of its four world-class founders and the support of leading Internet industry players, the VoiceXML Forum has made speech-enabled applications on the Internet a reality through its mission to develop and promote VoiceXML. With VoiceXML, users can create a new class of Web sites using audio interfaces, which are not really Web sites in the normal sense because they provide Internet access with a standard telephone. These applications make online information available to users who do not have access to a computer but do have access to a telephone. Voice applications are useful for highly mobile users who need hands- and eyes-free interaction with Web applications, possibly while driving or carrying luggage through a busy airport... 
Voice portals such as BeVocal, TellMe, and Shoptalk are already providing voice access to stock quotes, movie and restaurant listings, and daily news. The best-suited applications for VoiceXML are information retrieval, electronic commerce, personal services, and unified messaging. Several companies have already employed VoiceXML in information retrieval applications to great success. Hotels, car rental agencies, and airlines have implemented continuous voice access to allow customers to make or confirm reservations, buy tickets, find rates, get store hours and driving directions, and access loyalty programs. Voice automated services help reduce call-center costs and increase customer satisfaction... As the volume of information published using HTML grows and the range of Web services broadens, VoiceXML will become an increasingly attractive technology. VoiceXML increases the leverage under a company's Web investment by offering voice interpretation of HTML content." [altURL]

  • [March 09, 2001] "VoiceXML and the Voice-driven Internet." By David Houlding (The Technical Resource Connection). In Dr. Dobb's Journal Volume 26, Issue 4 (April 2001), pages 88-94. ['David Houlding examines the concept of voice portals, and shows how simple design patterns -- together with XML and XSL- can be used to deliver Internet content to web browsers and wireless devices.'] "Wireless data services are growing at a phenomenal rate, driven to a large extent by the popularity of the Internet services they are delivering. These wireless-enabled Internet services are generally accessible not only by standard web browsers, but also by some mix of web phones, two-way pagers, and wireless organizers. The adoption of these modes of Internet access is being accelerated by the effects of mainstream Internet usage maturing from an initial novelty/hype phase into a ubiquitous set of services we use as common tools in everyday life. In this mode of use, how information is presented is less important than being able to get to the particular information you require easily, when and where you need it... Voice portals leverage both the most natural form of communication -- speech -- and the most pervasive and familiar communications network -- the global telephone network. This network is accessible by either standard wired or mobile cellphones users already have, together with service plans, so no additional cost needs to be incurred for users to access Internet services via voice portals. This eliminates the expense barriers that are currently limiting the penetration of wireless services into the marketplace. Phones also permit eyes- and hands-free operation, enabling Internet service usage via voice portals in situations where wireless devices will not suffice. In this article, I'll discuss the concept of voice portals and the associated architecture. 
I'll then show how simple design patterns -- together with XML and XSL -- can be used to deliver Internet content and services cost effectively not only to web browsers and various wireless devices, but also to any telephone via VoiceXML (for more information on the VoiceXML Standard, see http://www.voicexml.org/). I'll then present an implementation of this architecture that uses software that is freely available on the Internet. Finally, I'll examine key business and technical issues associated with voice-driven applications. VoiceXML is a new standard with significant industry backing. It promises to create a level playing field on which voice portals may compete for outsourcing the hosting of voice applications. This will drive down cost and improve quality of service for both application providers and their customers. From the application provider's standpoint, creating voice applications using VoiceXML has the advantage that content is portable across different voice portals, delivering flexibility with respect to choosing voice portals to host voice applications. Voice portals driven by VoiceXML provide a powerful complementary new mode of access that empowers users with more options regarding when, where, and how they consume Internet services. Using speech as the most natural form of communication, the existing familiar global telephone network as the most pervasive communications network, and enabling eyes- and hands-free operation, this new mode of access promises to further accelerate the growth and maturity of Internet services into a ubiquitous set of tools we use every day." Additional resources include listings and source code.

  • [April 19, 2001] "Introduction to the W3C Grammar Format." By Andrew Hunt. In VoiceXML Review Volume 1, Issue 4 (April 2001). ['The W3C Voice Browser Working Group has released a draft specification for a standard grammar format that promises to enhance the interoperability of VoiceXML Browsers and drive portability of VoiceXML applications. This article summarizes the key features of the specification and the application of the specification to VoiceXML application development.'] "The W3C Speech Recognition Grammar Format specification embodies two equivalent languages. XML Form of the W3C Speech Recognition Grammar Format: Represents a grammar as an XML document with the logical structure of the grammar captured by XML elements. This format is ideal for computer-to-computer communication of grammars because widely available XML technology (parsers, XSLT, etc.) can be used to produce and accept the grammar format. Augmented BNF (ABNF) Form of the W3C Speech Recognition Grammar Format: The logical structure of the grammar is captured by a combination of traditional BNF (Backus-Naur Form) and a regular expression language. This format is familiar to many current speech application developers, is similar to the proprietary grammar formats of most current speech recognizers and is a more compact representation than XML. However, a special parser is required to accept this format. Grammars written in either format can be converted to the other format without loss of information (except formatting). The two formats co-exist because the Working Group found it important to support both computer-to-computer communication format and a more familiar human-readable format (but, as with all decisions reached by a committee, there is a spectrum of opinion on these matters)... The new W3C Speech Recognition Grammar Format is a powerful language for developing both simple grammars and natural language grammars for use in VoiceXML applications. 
The availability of a standard grammar format will increase the interoperability of VoiceXML applications by allowing each grammar to be authored once and reused across many VoiceXML browsers."
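
The two equivalent forms described in the entry above can be illustrated with a minimal sketch. It is not taken from the article: the element and keyword names follow the later Speech Recognition Grammar Specification 1.0 Recommendation as best understood here (the April 2001 draft may differ in detail), and the function names, the `city` rule, and the city list are hypothetical. The sketch covers only a flat list of alternatives, not the full grammar language.

```python
import xml.etree.ElementTree as ET

# Namespace used by the SRGS 1.0 Recommendation (assumed; the 2001 draft may differ).
SRGS_NS = "http://www.w3.org/2001/06/grammar"


def to_xml_form(rule_id, alternatives, lang="en-US"):
    """Serialize a single rule of alternatives in the XML Form."""
    grammar = ET.Element(
        "grammar",
        {"version": "1.0", "xmlns": SRGS_NS, "xml:lang": lang, "root": rule_id},
    )
    rule = ET.SubElement(grammar, "rule", {"id": rule_id, "scope": "public"})
    one_of = ET.SubElement(rule, "one-of")
    for alternative in alternatives:
        ET.SubElement(one_of, "item").text = alternative
    return ET.tostring(grammar, encoding="unicode")


def from_xml_form(xml_text):
    """Read the rule id and its alternatives back out of the XML Form."""
    grammar = ET.fromstring(xml_text)
    rule = grammar.find(f"{{{SRGS_NS}}}rule")
    items = rule.findall(f"{{{SRGS_NS}}}one-of/{{{SRGS_NS}}}item")
    return rule.get("id"), [item.text for item in items]


def to_abnf_form(rule_id, alternatives, lang="en-US"):
    """Render the same rule in the more compact ABNF Form."""
    return "\n".join(
        [
            "#ABNF 1.0;",
            f"language {lang};",
            f"root ${rule_id};",
            f"public ${rule_id} = {' | '.join(alternatives)};",
        ]
    )


if __name__ == "__main__":
    cities = ["Boston", "Chicago", "Denver"]
    xml_text = to_xml_form("city", cities)
    print(xml_text)
    # Convert XML Form -> logical content -> ABNF Form.
    print(to_abnf_form(*from_xml_form(xml_text)))
```

The XML Form is read back with a stock XML parser, while the ABNF Form is only written here; parsing it would need the special parser the entry mentions.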

  • [April 19, 2001] "The Speech Synthesis Markup Language for the W3C VoiceXML Standard." By Mark R. Walker and Andrew Hunt. In VoiceXML Review Volume 1, Issue 4 (April 2001). ['Among the first in a series of the W3C's soon-to-be-released XML-based markup specifications is the speech synthesis text markup standard. This article summarizes the markup element design philosophy and includes descriptions of each of the speech synthesis markup elements.'] "A new set of XML-based markup standards developed for the purpose of enabling voice browsing of the Internet will begin emerging in 2001 from the Voice Browser Working Group, which was recently organized under the auspices of the W3C. Among the first in this series of soon-to-be-released specifications is the speech synthesis text markup standard. The Speech Synthesis Markup Language (SSML) Specification is largely based on the Java Speech Markup Language (JSML), but also incorporates elements and concepts from SABLE, a previously published text markup standard, and from VoiceXML, which itself is based on JSML and SABLE. SSML also includes new elements designed to optimize the capabilities of contemporary speech synthesis engines in the task of converting text into speech. This article summarizes the markup element design philosophy and includes descriptions of each of the speech synthesis markup elements. The Voice Browser Working Group has utilized the open processes of the W3C for the purpose of developing standards that enable access to the web using spoken interaction. The nearly completed SSML specification is part of a new set of markup specifications for voice browsers, and is designed to provide a rich, XML-based markup language for assisting the generation of synthetic speech in web and other applications. 
The essential role of the markup language is to give authors of synthesizable content a standard way to control aspects of speech output such as pronunciation, volume, pitch and rate across different synthesis-capable platforms. It is anticipated that SSML will enable a large number of new applications simply because XML documents would be able to simultaneously support viewable and audio output forms. Email messages would potentially contain SSML elements automatically inserted by synthesis-enabled, mail editing tools that render the messages into speech when no text display was present. Web sites designed for sight-impaired users would likely acquire a standard form, and would be accessible with a potentially larger variety of Internet access devices. Finally, SSML has been designed to integrate with the Voice Dialogue markup standard in the creation of text-based dialogue prompts. The greatest impact of SSML may be the way it spurs the development of new generations of synthesis-knowledgeable tools for assisting synthesis text authors. It is anticipated that authors of synthesizable documents will initially possess differing amounts of expertise. The effect of such differences may diminish as high-level tools for generating SSML content eventually appear. Some authors with little expertise may rely on choices made by the SSML processor at render time. Authors possessing higher levels of expertise will make considerable effort to mark as many details of the document to ensure consistent speech quality across platforms and to more precisely specify output qualities. Other document authors, those who demand the highest possible control over the rendered speech, may utilize synthesis-knowledgeable tools to produce 'low-level' synthesis markup sequences composed of phoneme, pitch and timing information for segments of documents or for entire documents..."
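
As a rough illustration of the kind of control over volume, pitch and rate that the entry above describes, the following sketch builds a small SSML prompt. It is not drawn from the article: the `speak` and `prosody` element names and their `rate`, `pitch` and `volume` attributes follow the later SSML 1.0 Recommendation as best understood here, the namespace declaration is omitted for brevity, and the `build_prompt` helper and sample sentence are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_prompt(text, rate="medium", pitch="medium", volume="medium", lang="en-US"):
    """Wrap text in a prosody element so a synthesizer can adjust its delivery.

    The attribute values are passed through unchecked; a real SSML processor
    decides which values it accepts and how it renders them.
    """
    speak = ET.Element("speak", {"version": "1.0", "xml:lang": lang})
    prosody = ET.SubElement(
        speak, "prosody", {"rate": rate, "pitch": pitch, "volume": volume}
    )
    prosody.text = text
    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    # A slower, louder rendering of the same sentence.
    print(build_prompt("Your flight departs at noon.", rate="slow", volume="loud"))
```

An author with little synthesis expertise could leave the defaults in place and rely on the processor's own choices, which matches the range of author expertise the entry anticipates.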

  • [April 17, 2001]   W3C Publishes Requirements for Call Control in the Voice Browser Framework.    The W3C Voice Browser Working Group has released an initial working draft specification for "Call Control Requirements in a Voice Browser Framework." The document is presented as "a precursor to work on a specification." It "describes requirements for mechanisms that enable fine-grained control of speech (signal processing) resources and telephony resources in a VoiceXML telephony platform. The scope of these language features is for controlling resources in a platform on the network edge, not for building network-based call processing applications in a telephone switching system, or for controlling an entire telecom network." This W3C activity "focuses on enabling extended call control functionality in a voice browser which supports telephony capabilities. The task is constrained to defining elements and capabilities which either provide augmented functionality to be used in combination with VoiceXML or enhance the existing functionality in VoiceXML. The activities of a Call Control Subgroup will be coordinated with the activities of the Dialog Subgroup, both of which are part of the W3C Voice Browser working group." The requirements specification for call control is set against the backdrop of published goals for richer telephony functionality in VoiceXML, [which is] "designed for creating audio dialogs that feature synthesized speech, digitized audio, recognition of spoken and DTMF key input, recording of spoken input, telephony, and mixed-initiative conversations." W3C work on Voice Browsers is being coordinated under the W3C User Interface Domain. [Full context]

  • [April 06, 2001] "WebSphere Studio Leverages XML to Empower Web Developers." By Amy Wu and Sharon Thompson. In XML Journal Volume 2, Issue 4 (April, 2001), pages 56-60. ['A good Web development tool should be easy to use, yet robust enough to create and edit static and dynamic pages, organize and publish files, and help the developer properly maintain the site. IBM's WebSphere Studio is a total project management workbench with several integrated tools that assist developers in all stages of Web development. This article introduces you to Studio's wizards, editors, and publishing functions and exposes some of Studio's weaknesses as well.'] "Studio may be used in conjunction with some of the more common version control software (VCS). But even without an integrated VCS, Studio allows multiple users to access a project and check files in and out. Throughout the development process, Studio assists with link management that maintains links even while users move the source files around within the project. Studio's various editors and wizards are particularly helpful for nonprogrammers. Wizards can enable even novice users to generate server-side logic and add powerful functions to Web sites. They also leverage XML technology to make it easy for users to create Java servlets or JavaServer Pages that access databases and implement JavaBeans. When the development process is complet