Automating the Web with WIDL

[This local archive copy mirrored from http://www.webmethods.com/technology/Automating.html, text only; see this canonical version of the document or its successor if possible.]

	The problem of direct access to Web data from within business applications has until recently been largely ignored. The Web Interface Definition Language (WIDL) is an application of the eXtensible Markup Language (XML) which allows the resources of the World Wide Web to be described as functional interfaces that can be accessed by remote systems over standard Web protocols. WIDL provides a practical and cost-effective means for diverse systems to be rapidly integrated across corporate intranets, extranets, and the Internet. This article appears in "XML: Principles, Tools, and Techniques", the October print issue of O'Reilly's World Wide Web Journal.

WIDL: Application Integration with XML

The explosive growth of the World Wide Web is providing millions of end-users access to ever-increasing volumes of information. The resources of legacy systems, relational databases and multi-tier applications have all been made available to the Web browser, which has been transformed from an occasionally informative accessory into an essential business tool for organizations large and small.

While the Web has achieved the extraordinary feat of providing ubiquitous accessibility to end-users, it has in many cases reinforced manual inefficiencies in business processes as repetitive tasks are required to transcribe or copy and paste data from browser windows into desktop and corporate applications. This is as true of Web data provided by remote business units and external (i.e. partner or supplier) organizations as it is of Web data accessible from both public and subscription based Web sites. Business units that have previously been unable to agree on middleware and data interchange standards are (by default) agreeing on HTTP and HTML as data communication and presentation standards.

Because of the overwhelming focus on the browser almost all Web applications require interaction with a human user. The problem of direct access to Web data from within business applications has been largely ignored, as has the possibility of using the Web as a platform for automated information exchange between organizations.

The debut of XML is set to change all this, and in the process spark a major Web revolution: Web Automation. XML enables the creation of Web documents that preserve data structure and include "machine-readable" hooks to enable intelligent processing by client applications. It is not necessary, however, for Web content to exist as XML in order for XML to be used today to automate the Web.
The use of XML to deliver metadata about existing Web resources can provide sufficient information to empower non-browser applications to automate interactions with Web servers. XML metadata defining interfaces to Web-enabled applications can provide the basis for a common API across legacy systems, databases, and middleware infrastructures, effectively transforming the Web from an access medium into an integration platform.

Web Automation

Imagine everything a browser can do: sign on to a secure Web site, query that site for data, download the results, upload a response. Now imagine that your business applications can do the same thing, automatically, without human intervention and without using a browser. This is the power of Web Automation.

The benefits of Web Automation are numerous:

    competitive intelligence - aggregate product pricing data, news reports
    application integration - leverage investments in Web data and infrastructure
    implement robust e-commerce solutions without the expense and difficulty of EDI or CORBA
    realize a 100% Web-based alternative to EDI
    put web site functionality in the heart of customers' and suppliers' IT infrastructures

The incredible diversity of Web resources presents significant challenges for the automation of arbitrary tasks on the Web. A robust infrastructure for Web Automation needs to provide:

    full interaction with HTML forms
    an ability to handle both HTTP Authentication and Cookies
    both on-demand and scheduled extraction of targeted Web data
    aggregation of data from a number of Web sources
    chaining of services across multiple Web sites
    an ability to integrate easily with traditional application development languages and environments
    a framework for managing change in both the locations and structures of Web documents

In order to integrate business systems over the Web, it is not sufficient to have only data that describes itself; it is also necessary to have metadata that describes the behavior of services hosted by Web servers.
webMethods has defined the Web Interface Definition Language (WIDL) as an application of XML to lay the foundation for Web Automation.

WIDL

The goal of the Web Interface Definition Language (WIDL) is to enable automation of all interactions with HTML/XML documents and forms, providing a general method of representing request/response interactions over standard web protocols, and allowing the Web to be utilized as a universal integration platform.

Where XML supports the creation of Web content that preserves data structure, and promises Web documents that are "machine-readable", WIDL is an application of XML that defines interfaces and services within and across HTML, XML, and text documents. Services defined by WIDL map existing Web content into program variables, allowing the resources of the Web to be made available, without modification, in formats well-suited to integration with diverse business systems.

WIDL brings to the Web many of the features of IDL concepts that have been implemented in distributed computing and transaction processing platforms including DCE and CORBA. A major part of the value of DCE and CORBA is that they can define services offered by applications in an abstract but highly usable fashion. WIDL describes and automates interactions with services hosted by Web servers on intranets, extranets and the Internet; it provides a standard integration platform and a universal API for all web-enabled systems.

A service defined by WIDL is equivalent to a function call in standard programming languages. At the highest level, WIDL files are collections of services. WIDL defines the location (URL) of each service, the input parameters to be submitted (via GET or POST methods) to each service, and the output parameters to be returned by each service.

WIDL provides the following features:

    A browser is not required to drive Web applications
    Service definitions are dynamically interpreted and can thus be centrally managed
    Client applications are insulated from changes in service locations and data extraction methods
    Developers are insulated from network programming concerns
    Application resources can be integrated across firewalls and proxies

WIDL can be used to describe interfaces and services for:

    Static documents (HTML, XML, and plain text files)
    HTML forms
    URL directory structures

WIDL also has the ability to specify conditions for successful completion of a service, and error messages to be returned from services to calling programs. Conditions further enable services to be defined that span multiple documents.

Applications of WIDL

The success of the Web has exposed the advantages of distributed information systems to a global audience. Around the world IT organizations, regardless of industry, are searching for ways to connect the Internet with new or existing applications, to use Web technology to reduce development, deployment, and maintenance costs.

Using HTML, XML and HTTP as corporate standards glue, application integration requires only that target systems be Web-enabled. There are hundreds of products in the market today which Web-enable existing systems, from mainframes to client/server applications.

The use of standard Web technologies empowers various IT departments to make independent technology selections. This has the effect of lowering both the technical and 'political' barriers that have typically derailed cross-organizational integration projects. The use of proprietary middleware infrastructures to integrate applications requires not only that the same software product be purchased by both organizations and successfully installed in both target hardware environments, but also that both target applications be tailored to support the middleware API.
This type of investment can be disastrous if one company spends six months designing a CORBA-based business system only to discover that one of its business units or business partners is unable to install CORBA because it conflicts with their existing infrastructure. Conflicts can arise because of hardware or software incompatibilities, or simply because of difficulties in acquiring appropriate development resources.

A number of analysts have already warned that proprietary e-commerce platforms could lock suppliers into relationships by forcing them to integrate their systems with one infrastructure for business-to-business integration, making it costly for them to switch to or integrate with other partners who have selected alternate e-commerce platforms. Buyer-supplier integration issues involve many-to-many relationships, and demand a standard platform for functional integration and data exchange.

Here is a brief overview of the types of applications that WIDL enables:

Manufacturers and Distributors
    access supplier and competitor e-commerce systems automatically to check pricing and availability
    load product data (spec. sheets) from supplier Web sites
    place orders automatically (e.g. when inventory drops below predetermined levels)
    integrate package tracking functionality for enhanced customer service

Human Resources
    automated update of new employee information into multiple internal systems
    automated aggregation of benefits information from healthcare and insurance providers

Governments
    Kiosk systems that aggregate data and integrate services across departments or state and local offices

Shipping and Delivery Services
    multi-carrier package tracking and shipment ordering
    access to currency rates, Customs regulations, etc.

Shipping companies were early leaders in bringing widely applicable functionality to the Web. Web-based package tracking services provide important logistics information to organizations large and small.
Many organizations employ people for the sole purpose of manually tracking packages to ensure customer satisfaction and to collect refunds for packages that are delivered late. Integrating package tracking functionality directly into warehouse management and customer service systems is a huge benefit, boosting productivity and enabling more efficient use of resources.

Using WIDL, the web-based package tracking services of numerous shipping companies can be described as common application interfaces, to be integrated with various internal systems. In almost all cases, programmatic interfaces to different package tracking services are identical, which means that WIDL can impose consistency in the representation of functionality across systems.

Example 1 illustrates the use of WIDL to define a package tracking service for Federal Express. Note that the WIDL specifies a 'Shipping' template. This indicates that there is a general class of shipping services and that this particular WIDL is one implementation of the shipping interface.

Example 1. The WIDL Representation of a Package Tracking Service

    <WIDL NAME="FedexShipping" TEMPLATE="Shipping"
          BASEURL="http://www.fedex.com" VERSION="2.0">
      <SERVICE NAME="TrackPackage" METHOD="GET"
               URL="/cgi-bin/track_it"
               INPUT="TrackInput" OUTPUT="TrackOutput" />
      <BINDING NAME="TrackInput" TYPE="INPUT">
        <VARIABLE NAME="TrackingNum" TYPE="String" FORMNAME="trk_num" />
        <VARIABLE NAME="DestCountry" TYPE="String" FORMNAME="dest_cntry" />
        <VARIABLE NAME="ShipDate" TYPE="String" FORMNAME="ship_date" />
      </BINDING>
      <BINDING NAME="TrackOutput" TYPE="OUTPUT">
        <CONDITION TYPE="FAILURE" REFERENCE="doc.title[0].text"
                   MATCH="FedEx Warning Form"
                   REASONREF="doc.p[0].text['&.']" />
        <CONDITION TYPE="SUCCESS" REFERENCE="doc.title[0].text"
                   MATCH="FedEx Airbill:"
                   REASONREF="doc.p[1].value" />
        <VARIABLE NAME="disposition" TYPE="String"
                  REFERENCE="doc.h[3].value" MASK="$" />
        <VARIABLE NAME="deliveredOn" TYPE="String"
                  REFERENCE="doc.h[5].value" MASK="%%%$" />
        <VARIABLE NAME="deliveredTo" TYPE="String"
                  REFERENCE="doc.h[7].value" MASK=":" />
      </BINDING>
    </WIDL>

The FedexShipping interface in Example 1 contains one service (TrackPackage) which takes three input parameters (TrackingNum, DestCountry, ShipDate) and returns three output parameters (disposition, deliveredOn, deliveredTo).

The WIDL definition describing the TrackPackage service is stored in an ASCII file which is utilized by client programs at runtime to determine both the location of the service (URL) and the structure of documents that contain the desired data. Client programs access WIDL definitions from local files, naming services such as LDAP, HTTP servers, or other URL access schemes (see figure 3).

Unlike the way CORBA and DCE IDL are normally used, WIDL is interpreted at runtime. As a result, Service, Condition, and Variable definitions within WIDL files can be administered without requiring modification of client code.
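
To make the runtime-interpretation model concrete, here is a minimal Python sketch; it is an illustration under stated assumptions, not the webMethods implementation. It reads a trimmed copy of the Example 1 definition (the service and its input binding only), looks up the TrackPackage service, and builds the request URL by translating each variable NAME into its FORMNAME. The helper name build_request and the sample parameter values are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode, urljoin

# Trimmed copy of Example 1: the service and its input binding only.
WIDL_TEXT = """
<WIDL NAME="FedexShipping" TEMPLATE="Shipping"
      BASEURL="http://www.fedex.com" VERSION="2.0">
  <SERVICE NAME="TrackPackage" METHOD="GET" URL="/cgi-bin/track_it"
           INPUT="TrackInput" OUTPUT="TrackOutput" />
  <BINDING NAME="TrackInput" TYPE="INPUT">
    <VARIABLE NAME="TrackingNum" TYPE="String" FORMNAME="trk_num" />
    <VARIABLE NAME="DestCountry" TYPE="String" FORMNAME="dest_cntry" />
    <VARIABLE NAME="ShipDate" TYPE="String" FORMNAME="ship_date" />
  </BINDING>
</WIDL>
"""


def build_request(widl_text, service_name, **params):
    """Return (method, url) for a service call (hypothetical helper)."""
    root = ET.fromstring(widl_text)
    service = next(s for s in root.iter("SERVICE")
                   if s.get("NAME") == service_name)
    binding = next(b for b in root.iter("BINDING")
                   if b.get("NAME") == service.get("INPUT"))
    # Translate caller-facing variable names into back-end form field names.
    fields = [(v.get("FORMNAME", v.get("NAME")), params[v.get("NAME")])
              for v in binding.iter("VARIABLE")]
    # A relative service URL is resolved against the interface's BASEURL.
    url = urljoin(root.get("BASEURL", ""), service.get("URL"))
    return service.get("METHOD"), url + "?" + urlencode(fields)


# Sample values are made up for illustration.
method, url = build_request(WIDL_TEXT, "TrackPackage",
                            TrackingNum="123456789", DestCountry="US",
                            ShipDate="120197")
print(method, url)
```

Because the service location and the form field names come from the WIDL text rather than from the calling code, editing them in the definition changes the request without touching the caller.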
This usage model supports application-to-application linkages that are far more robust and maintainable than if they were coded by hand.

One of WIDL's most significant benefits is its ability to insulate client programs from changes in the format and location of Web documents. As long as the parameters of services do not change, Service URLs, object references in variables, regions, and conditions can all be modified without affecting applications that utilize WIDL to access Web resources.

There are three models for WIDL management:

    client side - where WIDL files are co-located with a client program
    naming service - where WIDL definitions are returned from directory services, e.g. LDAP
    server side - where WIDL files are referenced by, co-located with, or embedded within Web documents

WIDL does not require that existing Web resources be modified in any way. Flexible management models allow organizations to describe and integrate Web sites that are beyond their control, as well as to provide their business partners with interfaces to services that they do control. The ability to seamlessly migrate from independent to shared management eases the transition from informal to formal business-to-business integration.

Elements of WIDL

The Web Interface Definition Language (WIDL) consists of six XML tags:

    <WIDL> defines an interface, which can contain multiple services and bindings
    <SERVICE/> defines a service, which consists of input and output bindings
    <BINDING> defines a binding, which specifies input and output variables, as well as conditions for successful completion of a service
    <VARIABLE/> defines input, output and internal variables used by a service to submit HTTP requests, and to extract data from HTML/XML documents
    <CONDITION/> defines success and failure conditions for the binding of output variables; specifies error messages to be returned upon service failure; enables alternate binding attempts and the chaining of services
    <REGION/> defines a region within an HTML/XML document; useful for extracting regular result sets which vary in size, such as the output of a search engine, or news stories

The complete WIDL DTD is included in Appendix A. In the next sections the attributes of each element of WIDL are presented and discussed by way of example.

<WIDL>

<WIDL> is the parent element for the Web Interface Definition Language; it defines an interface. Interfaces are groupings of related services and bindings. The following are attributes of the <WIDL> element:

    NAME - required. Establishes a name for an interface. The interface name is used in conjunction with a service name for naming or directory services.
    VERSION - optional. Specifies the version of WIDL. webMethods first implemented WIDL as HTML extension tags. Experience with customers since late 1996 resulted in WIDL 2.0, an application of XML that is capable of automating complex interactions across multiple Web servers.
    TEMPLATE - optional. WIDL enables common interfaces to services provided by multiple sites. Templates allow the specification of interfaces, implementations of which may be available from multiple sources. A shipping template defines a functional interface for shipping services; various implementations can be provided for Federal Express, UPS, and DHL.
    BASEURL - optional. BASEURL is similar to the <BASE HREF=""> statement in HTML. Some of the services within a given WIDL may be hosted from the same Base URL. If BASEURL is defined, the URL for various services can be defined relative to BASEURL. This feature is useful for replicated sites which can be addressed by changing only the BASEURL, instead of the URL for each service.
    OBJMODEL - optional.
    Specifies an object model to be used for extracting data elements from HTML and XML documents. Object models are the result of parsing HTML or XML documents.

The use of object models is central to the functionality of WIDL. Object References are used in <VARIABLE/>, <CONDITION/> and <REGION/> elements. For this reason, the object model will be briefly discussed before proceeding with the description of the element definitions that constitute WIDL.

Object Model

Many of the features of WIDL require a capability to reliably extract specific data elements from Web documents and map them to output parameters. Two candidate technologies for data extraction are pattern matching and parsing.

Pattern matching extracts data based on regular expressions, and is well suited to raw text files and poorly constructed HTML documents. There is a lot of bad HTML in the world!

Parsing, on the other hand, recovers document structure and exposes relationships between document objects, enabling elements of a document to be accessed with an object model. Using an object model, an absolute reference to an element of an HTML document might be specified:

    doc.p[0].text

This reference would retrieve the text of the first paragraph of a given document.

From both a development and an administrative point of view, pattern matching is more labor intensive for establishing and maintaining relationships between data elements and program variables. Regular expressions are difficult to construct and prone to breakage as document structures change. For instance, the addition of formatting tags around data elements in HTML documents could easily derail the search for a pattern. An object model, on the other hand, can see through many such changes. Patterns must also be carefully constructed to avoid unintentional matching. In complex cases, patterns must be nested within patterns. The process of mapping patterns to a number of output parameters can easily become unmanageable.
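
The brittleness described above can be illustrated with a small Python sketch. It is a toy stand-in, not the webMethods object model: the ParagraphModel class, the first_paragraph_text helper, and the sample markup are all hypothetical, and only the idea behind a reference such as doc.p[0].text is reproduced. A regular expression written against the original markup stops matching once a formatting tag is wrapped around the data, while the parsed paragraph text is unchanged.

```python
import re
from html.parser import HTMLParser


class ParagraphModel(HTMLParser):
    """Toy object model: collects the text of each <p>, ignoring inline tags."""

    def __init__(self):
        super().__init__()
        self.p = []          # text of each paragraph, in document order
        self._open = False   # True while inside a <p> element

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.p.append("")
            self._open = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._open = False

    def handle_data(self, data):
        if self._open:
            self.p[-1] += data


def first_paragraph_text(html):
    """Rough stand-in for the object reference doc.p[0].text."""
    model = ParagraphModel()
    model.feed(html)
    model.close()
    return model.p[0]


# A pattern written against the original markup.
PATTERN = re.compile(r"<p>Rate: ([\d.]+)</p>")

plain = "<html><body><p>Rate: 7.25</p></body></html>"
restyled = "<html><body><p>Rate: <b>7.25</b></p></body></html>"

print(PATTERN.search(plain) is not None)     # pattern finds the original markup
print(PATTERN.search(restyled) is not None)  # pattern misses once <b> is added
print(first_paragraph_text(plain))
print(first_paragraph_text(restyled))
```

Both calls to first_paragraph_text yield the same text, which is the sense in which a parsed reference can see through a purely cosmetic change to the document.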

It is possible to achieve the best of both worlds by using pattern matching when necessary to match against the attributes of elements accessible via an Object Model. Using a hybrid model of pattern matching within parsed objects provides for the extraction of target information from preformatted text regions or text files.

    doc.p['Currency:'].text

This reference would retrieve the text of the first paragraph that contains 'Currency:' within a given document.

Various object models for working with HTML documents have been specified. The W3C has established a working group to define a standard Document Object Model (DOM). The WIDL specification allows for multiple object models. In implementing WIDL, we discovered many functional requirements not currently addressed by existing object models. These requirements will be demonstrated in various examples later in this article.

We now continue with a discussion of the attributes of the elements of WIDL.

<SERVICE/>

The <SERVICE/> element describes a Web service, such as those provided by CGI scripts, or via NSAPI, ISAPI, or other back-end Web server programs. Services take a set of input parameters, perform some processing, then return a dynamically generated HTML, XML or text document. The attributes of the <SERVICE/> element map an abstract service name into a service's actual URL, specify the HTTP method to be used to access the service, and designate 'bindings' for input and output parameters.

    NAME - required. Establishes a name for a service. The service name is used in conjunction with an interface name for naming or directory services.
    URL - required. Specifies the Uniform Resource Locator for the target document. A service URL can be either a fully qualified URL or a partial URL that is relative to the BASEURL provided as an attribute of the <WIDL> element.
    METHOD - required. Specifies the HTTP method ("Get" or "Post") to be used to access the service.
    INPUT - required.
    Designates the <BINDING> to be used to define the input parameters for programs that call the service. The specified name must be that of a <BINDING> contained within the same <WIDL> as the service.
    OUTPUT - required. Designates the <BINDING> to be used to define the output parameters for programs that call the service. The specified name must be that of a <BINDING> contained within the same <WIDL> as the service.
    AUTHUSER - optional. Establishes the username for HTTP authentication.
    AUTHPASS - optional. Establishes the password for HTTP authentication.
    TIMEOUT - optional. Amount of time before service times out.
    RETRIES - optional. Number of times to retry the service before failing.

Typically the username/password combination is set independent of service definitions in WIDL. The AUTHUSER and AUTHPASS attributes allow a username and password to be defined outside of a calling program. This is useful in cases where multiple client programs use the same service.

<BINDING>

The <BINDING> element defines input and output variables for a service. Input bindings describe the data provided to a Web resource, and are analogous to the input fields in an HTML form. For a static HTML document no input variables are required. Output bindings describe which data elements are to be mapped from the output document returned as a result of accessing the Web resource with the given input variables. In most cases an output binding will map only a subset of the available elements in the output document.

    NAME - required. Identifies the binding for reference by service definitions and other binding definitions.
    TYPE - required. Specifies whether a binding defines input or output parameters.

<VARIABLE/>

The <VARIABLE/> element is used to describe both input and output binding parameters; different attributes are used depending on the type of parameter being described. Common attributes are:

    NAME - required. Identifies the variable to calling programs.
    VALUE - optional.
    Designates a value to be assigned to the variable in HTTP transactions. For input variables this has the effect of rendering the variable invisible to calling programs, i.e. the specified value is submitted to the web service without requiring an input from calling programs. For output variables this has the effect of hard-coding the value returned when the service is invoked.
    USAGE - optional. The default usage of variables is for specification of input and output parameters. Variables can also be used internally within WIDL, as well as to pass header information (e.g. USER-AGENT or REFERER) in an HTTP request. The USAGE attribute will be explored in the examples following this overview of the <VARIABLE/> element.
    TYPE - required. Specifies both the data type and dimension of the variable.

The following attributes are specific to input variables:

    FORMNAME - optional. Specifies the variable name to be submitted via Get or Post methods. Obscure back-end variables can be given names that are more meaningful in the context of the service described by WIDL. Used in conjunction with WIDL Templates, FORMNAME permits the mapping of a single variable name across multiple service implementations. In the package tracking service in Example 1 the FORMNAME differs from the variable name. It is also possible to set FORMNAME="" to pass only the variable's value to the back-end program.
    OPTIONS - optional. Captures the options of list boxes, check boxes, and radio buttons. Useful for validating inputs prior to submitting input parameters to a service and for transforming input criteria into formats acceptable to back-end programs. For example, an options list could be used to translate a meaningful input of "full" to the "f" acceptable to a back-end program.

The following attributes are specific to output variables:

    REFERENCE - optional. Specifies an object reference that extracts data from the HTML, XML or text document returned as the result of a service invocation.
    MASK - optional. Masks permit the use of pattern matching and token collecting to easily strip away unwanted labels and other text surrounding target data items.
    NULLOK - optional. Overrides the implicit condition that all output variables return a non-null value.

Apart from the "default" behavior of variables defined in input bindings, there are two other usage models supported by WIDL: "internal" and "header". The USAGE attribute can define service inputs in place of or in addition to those required by a web service's HTML form.

Internal variables enable variable substitution within input and output bindings. For instance, using internal variables a portion of a service's URL or a pattern for matching within an object reference can be specified as a variable that is part of an input binding. Header variables allow HTTP header information to be included as part of a service request. This is useful in many situations, including the passing of referrer information where required by back-end systems.

In Example 2 an auto loan service is defined for a site that uses a directory structure to organize loan information for various states. Rather than using CGI scripts to access a database of high, low, and average loan rates, unique URLs which contain a state abbreviation as part of target document names are linked from a pick list. The use of internal variables enables the parameterization of a portion of the URL. In this fashion, WIDL is able to define an input binding even though no HTML forms are present to query the user for information. The input binding specifies a variable 'state' that is referenced in the URL attribute of the service definition as '%state%'. At runtime the value passed into the 'state' variable is used to complete the service URL.

Example 2. Internal variables can be used to parameterize directory structures

    <WIDL NAME="autoLoan" VERSION="2.0">
      <SERVICE NAME="AutoLoan" METHOD="GET"
               URL="http://www.bankrate.com/autobytel/abt%state%a.htm"
               INPUT="AutoLoanInput" OUTPUT="AutoLoanOutput" />
      <BINDING NAME="AutoLoanInput" TYPE="INPUT">
        <VARIABLE NAME="state" TYPE="String" FORMNAME="state"
                  USAGE="INTERNAL" />
      </BINDING>
      <BINDING NAME="AutoLoanOutput" TYPE="OUTPUT">
        <CONDITION TYPE="Failure" REASONTEXT="State not found" />
        <VARIABLE NAME="state" TYPE="String"
                  REFERENCE="doc.table[4].tr[1].th[0].text" />
        <VARIABLE NAME="avgNew" TYPE="String"
                  REFERENCE="doc.table[4].tr[2].td[1].text" />
        <VARIABLE NAME="highNew" TYPE="String"
                  REFERENCE="doc.table[4].tr[2].td[2].text" />
        <VARIABLE NAME="lowNew" TYPE="String"
                  REFERENCE="doc.table[4].tr[2].td[3].text" />
        <VARIABLE NAME="avgUsed" TYPE="String"
                  REFERENCE="doc.table[4].tr[3].td[1].text" />
        <VARIABLE NAME="highUsed" TYPE="String"
                  REFERENCE="doc.table[4].tr[3].td[2].text" />
        <VARIABLE NAME="lowUsed" TYPE="String"
                  REFERENCE="doc.table[4].tr[3].td[3].text" />
      </BINDING>
    </WIDL>

Because the AutoLoan service uses a variable to complete the URL to access a static document, an invalid input parameter results in an invalid URL. The <CONDITION/> statement in the output binding traps the document-not-found condition and returns a sensible error message to client programs.

Internal variables can also be used within object references that use pattern matching to index into the object tree. Example 3 uses the currency exchange service provided by the Federal Reserve Bank to illustrate the use of internal variables to interactively query a single static document.

Example 3. Internal variables enable input criteria to be used in object references

    <WIDL NAME="FederalReserve" TEMPLATE="Currency"
          BASEURL="http://www.ny.frb.org/" VERSION="2.0">
      <SERVICE NAME="ExchangeRate" METHOD="GET"
               URL="/pihome/mktrates/forex12.shtml"
               INPUT="currencyInput" OUTPUT="currencyOutput" />
      <BINDING NAME="currencyInput" TYPE="INPUT">
        <VARIABLE NAME="Currency" TYPE="String" FORMNAME="CURRENCY"
                  USAGE="INTERNAL" />
      </BINDING>
      <BINDING NAME="currencyOutput" TYPE="OUTPUT">
        <CONDITION TYPE="FAILURE" REASONTEXT="Currency not found" />
        <VARIABLE NAME="rate" TYPE="String"
                  REFERENCE="doc.pre[0].line['*%Currency%'].text[53-65]" />
      </BINDING>
    </WIDL>

In this example currency rates for a number of countries are provided in a single document. The object reference for the 'rate' variable in the output binding uses an internal variable 'Currency' as part of the pattern that is matched to discover the current exchange rate.

The object reference used in this example also demonstrates two additional text manipulation features of the object model developed by webMethods. The .line[] construct allows access to individual lines of both preformatted text and text that has been formatted with the <br> line-break element. This greatly simplifies pattern matching expressions within object references. The Federal Reserve Currency Exchange service returns rate information in a column from character position 53 to character position 65. This range of characters is specified by qualifying the .text[53-65] attribute of the line matching the input criteria.

<CONDITION/>

The <CONDITION/> element is used in output bindings to specify success and failure conditions for the extraction of data to be returned to calling programs. Conditions enable branching logic within service definitions; they are used to attempt alternate bindings when initial bindings fail and to initiate service chains, whereby the output variables from one service are passed into the input bindings of a second service.
Conditions also define error messages returned to calling programs when services fail.

    TYPE - required. Specifies whether a condition is checking for the 'Success' or the 'Failure' of a binding attempt. Any variable that returns a NULL value will cause the entire binding to fail, unless the NULLOK attribute of that variable has been set to true. Conditions can catch the success or failure of either a specific object reference or of an entire binding. In the case where a condition initiates a service chain, it is important that all variables bind properly.
    REFERENCE - optional. Specifies an object reference which extracts data from the HTML or XML document returned as the result of a service invocation. The REFERENCE attribute for conditions is equivalent to the REFERENCE attribute used in variable definitions.
    MATCH - required. Specifies a text pattern that will be compared with the object property referenced by the REFERENCE attribute.
    REBIND - optional. Specifies an alternate output binding. Typically a failure condition indicates that the document returned cannot be bound properly. REBIND redirects the binding attempt. This is useful in situations where the documents returned by a service are dependent upon the input criteria that were submitted. For example, a retail web site may return a different document structure for an SKU depending on whether the item requested is a shirt, a tie, or trousers. The use of REBIND allows a condition to determine the appropriate binding for extracting the desired data.
    SERVICE - optional. Specifies a service to invoke with the results of an output binding. Aside from the obvious benefit of chaining services to further automate the tasks that can be encapsulated for client programs, there are many cases when target documents can only be retrieved after visiting several Web pages in succession.
In some instances cookies are issued by an entry page that must be visited prior to interacting with HTML forms; in others, URLs are dynamically generated from databases for specific user identities.

REASONTEXT - optional. The text to be returned as an error message when a service fails.

REASONREF - optional. Reference to an object's element to be returned as an error message when a service fails.

WAIT - optional. Amount of time to wait before re-trying retrieval of a document after a server has returned a 'service busy' error.

RETRIES - optional. Number of times to retry the service before failing.

Example 4 illustrates the use of conditions to specify alternate bindings. Alternate bindings can be used when documents returned by services are dependent upon the inputs submitted to the service. In some rare cases, such as the StockMarketInfo service defined in this example, a service occasionally returns different document formats for no apparent reason. Conditions and rebinding handle any such situations.

Example 4. Conditions initiate alternate binding attempts for extraction of output values

  <WIDL NAME="Yahoo" VERSION="2.0">
    <SERVICE NAME="StockMarketInfo" METHOD="GET"
             URL="http://quote.yahoo.com/" OUTPUT="marketOut" />
    <BINDING NAME="marketOut" TYPE="Output">
      <CONDITION TYPE="Failure" REBIND="marketOut2" />
      <VARIABLE TYPE="String[][]" NAME="info"
                REFERENCE="doc.table[0].tr[0].td[].text" />
      <VARIABLE TYPE="String[]" NAME="links"
                REFERENCE="doc.table[0].tr[0].a[].href" />
    </BINDING>
    <BINDING NAME="marketOut2" TYPE="Output">
      <VARIABLE TYPE="String[][]" NAME="info"
                REFERENCE="doc.table[1].tr[0].td[].td[].text" />
      <VARIABLE TYPE="String[]" NAME="links"
                REFERENCE="doc.table[1].tr[0].a[].href" />
    </BINDING>
  </WIDL>

Example 5 illustrates the use of conditions to specify a service chain. Service chains pass the name-value pairs of an output binding into the input binding of the service specified by a <CONDITION/> statement.
Any name-value pairs matching the variables of the chained service's input binding will be used as input parameters. In this example the ProductSearch service returns a URL when it successfully finds a product matching the search criteria. The success condition on the productSearchOutput binding causes the ExtractPrices service to be called. Because the productURL variable in the output binding of ProductSearch matches the variable in the input binding of ExtractPrices, its value is passed from one service into the other.

Example 5. Service Chains: the output values of the first service are passed into the second service.

  <WIDL NAME="EddieBauer" VERSION="2.0">
    <SERVICE NAME="ProductSearch" METHOD="GET"
             URL="http://www.ebauer.com/eb/ShopEB/prod_search_results.asp"
             INPUT="productSearchInput" OUTPUT="productSearchOutput" />
    <BINDING NAME="productSearchInput" TYPE="INPUT">
      <VARIABLE NAME="searchstring" FORMNAME="searchstring" />
    </BINDING>
    <BINDING NAME="productSearchOutput" TYPE="OUTPUT">
      <CONDITION TYPE="Failure" REFERENCE="doc.p['Sorry'].text" MATCH="Sorry"
                 REASONREF="doc.p['Sorry'].text" />
      <CONDITION TYPE="Success" SERVICE="ExtractPrices" />
      <VARIABLE NAME="productURL" TYPE="String"
                REFERENCE="doc.table[0].tr[1].td[3].a[0].href" />
    </BINDING>
    <SERVICE NAME="ExtractPrices" METHOD="GET" URL="%productURL%"
             INPUT="ExtractPricesInput" OUTPUT="ExtractPricesOutput" />
    <BINDING NAME="ExtractPricesInput" TYPE="INPUT">
      <VARIABLE NAME="productURL" TYPE="String" USAGE="INTERNAL" />
    </BINDING>
    <BINDING NAME="ExtractPricesOutput" TYPE="OUTPUT">
      <VARIABLE NAME="Price" TYPE="String"
                REFERENCE="doc.table[1].strong[0].value['\$$']" />
    </BINDING>
  </WIDL>

It is important to note that the ExtractPrices service can be called independently of the ProductSearch service, and that the ExtractPrices service specifies productURL as an internal variable. The output variables from the ProductSearch service are not available to the ExtractPrices service except in the case where they have been passed via an input binding.
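The parameter passing behind a service chain can be sketched in a few lines of Java. This is an illustration only, not the webMethods implementation: the class ChainSketch and its methods selectInputs and expand are hypothetical names, the example URL is made up, and the sketch assumes that variable names are matched exactly and that a %name% placeholder in a service URL is replaced by the value of the input variable of that name.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of service chaining; none of these names belong to a WIDL API.
class ChainSketch {

    // Keep only the output name-value pairs whose names appear in the chained
    // service's input binding (assumption: exact, case-sensitive name match).
    static Map<String, String> selectInputs(Map<String, String> outputs,
                                            List<String> inputNames) {
        Map<String, String> inputs = new LinkedHashMap<>();
        for (String name : inputNames) {
            if (outputs.containsKey(name)) {
                inputs.put(name, outputs.get(name));
            }
        }
        return inputs;
    }

    // Replace each %name% placeholder in a URL template with the input value.
    static String expand(String template, Map<String, String> inputs) {
        String result = template;
        for (Map.Entry<String, String> e : inputs.entrySet()) {
            result = result.replace("%" + e.getKey() + "%", e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // Output binding of the first service (values are invented for the demo).
        Map<String, String> outputs = new LinkedHashMap<>();
        outputs.put("productURL", "http://www.example.com/product/42");
        outputs.put("unrelated", "not passed on");

        // Input binding of the chained service declares only productURL.
        Map<String, String> inputs = selectInputs(outputs, Arrays.asList("productURL"));

        // prints http://www.example.com/product/42
        System.out.println(expand("%productURL%", inputs));
    }
}
```

Only the pairs named in the second service's input binding survive the hand-off, which is why a chained service remains callable on its own with the same inputs.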
Service chains make it possible to interact with "shopping cart" services, where multiple service calls are required to add items, followed by a service call to submit an order.

<REGION/>

The <REGION/> element is used in output bindings to define targeted sub-regions of a document. This is useful in services that return variable arrays of information in structures that can be located between well-known elements of a page. Regions are critical for poorly designed documents where it is otherwise impossible to differentiate between desired data elements (for instance story links on a news page) and other elements that also match the search criteria.

NAME - required. Specifies the name for a region. This name can then be used as the root of an object reference. For instance, a region named "foo" can be used in object references such as:

  foo.p[0].text

START - required. An object reference that determines the beginning of a region.

END - required. An object reference that determines the end of a region.

Example 6 demonstrates the use of regions in a news service, where the number of news stories varies day to day. Regions permit the extraction of data elements relative to other features of a document. The "tops" region begins with a text object that matches the pattern 'Last Updated' and ends with an object that matches 'For more'. Variable references into the "tops" region collect arrays of anchors and anchor text, regardless of the fact that the sizes of the arrays change throughout the day. The object references within "tops" are vastly simplified by the processing already provided by the region definition:

  tops.a[].text
  tops.a[].href

It is also worth noting that the news service in Example 6 has no input binding. Input bindings are not required for service definitions.

Example 6. Regions permit the extraction of data elements relative to other features of a document.
  <WIDL NAME="News" VERSION="2.0">
    <SERVICE NAME="Techweb" METHOD="GET"
             URL="http://www.techweb.com/" OUTPUT="techwebOut" />
    <BINDING NAME="techwebOut" TYPE="OUTPUT">
      <REGION NAME="tops" START="doc.font['Last Updated']" END="doc.b['For more']" />
      <VARIABLE NAME="service" TYPE="String" VALUE="TECHWEB Top Stories" />
      <VARIABLE NAME="url" TYPE="String" REFERENCE="doc.url" />
      <VARIABLE NAME="stories" TYPE="String[]" REFERENCE="tops.a[].text" />
      <VARIABLE NAME="links" TYPE="String[]" REFERENCE="tops.a[].href" />
    </BINDING>
  </WIDL>

Object References

The default object model used by WIDL provides object references for accessing elements and properties of HTML and XML documents. This model is based on the JavaScript page object model, but without the JavaScript method definitions. Using the default object model, all elements of HTML and XML documents can be addressed in the following ways:

BY NAME - if the target element has a non-empty name attribute. For example, the value of an HTML element <a name="foo"> can be referenced:

  doc.foo.value

BY ABSOLUTE INDEXING - where each array of elements has a zero-based integer index, for example:

  doc.headings[0].text
  doc.p[1].text

BY RELATIVE INDEXING - which directs the binding algorithm to search the VALUE attributes of each element in the array, until a match is found. The match must be complete, which requires the use of wildcard metacharacters for partial string matches. Note that the search will return the first matching element, if any:

  doc.tr['pattern'].td[1].text

BY REGION INDEXING - which directs the binding algorithm to search only within a region of a document:

  myregion.a[2].href

BY ATTRIBUTE MATCHING - which directs the binding algorithm to search an object's attributes until a match is found.
Attribute matching is done with parentheses instead of square brackets:

  doc.a(name='foo').href

The following properties are available for all objects:

  .text/.txt - returns the text of a container
  .value/.val - returns the value of a container
  .source/.src - returns the source of a container
  .index/.idx - returns the index of a container
  .reference/.ref - returns the fully qualified object reference

Attributes of HTML containers take precedence over properties, which is why each property has an alternate accessor: .text and .txt, like .value and .val, are equivalent except when a document element has an identically named attribute.

Putting WIDL to Work

WIDL files can be hand coded, or developed interactively with command line or graphical tools which provide aids for determining object references used in <VARIABLE/>, <CONDITION/>, and <REGION/> declarations. Once a WIDL file has been created, its use depends upon the implementation of products that can process and understand WIDL services.

A web integration platform based on WIDL needs to provide:

- a mechanism for retrieving WIDL files, either from a local file system, a directory service such as LDAP, or a URL
- an HTML and an XML parser, and text pattern matching capabilities, providing an object model for accessing elements of Web documents
- HTTP and HTTPS support, to initiate requests and receive Web documents

Apart from these requirements a WIDL processor could be delivered as a Java class or a Windows DLL, for integration directly with client applications, or as a standalone server with middleware interfaces, allowing thin-client access to web automation functionality.

Generating Code

The primary purpose of WIDL is integration with corporate business applications. In much the same way that DCE or CORBA IDL is used to generate code fragments, or 'stubs', to be included in development projects, WIDL provides the necessary ingredients for generating Java, JavaScript, C/C++, and even Visual Basic client code.
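Before a stub can be generated, a tool has to read those ingredients (service names and the names of the bound variables) out of the WIDL file. The following is a minimal sketch of that first step. The class WidlIngredients and its methods are hypothetical names chosen for illustration; the sketch relies only on the DOM parser that ships with the JDK, does not validate against the WIDL DTD, and makes no claim about how the webMethods tools work.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch: collects the names a stub generator would need.
class WidlIngredients {

    // Parses WIDL text; parser failures are wrapped in an unchecked exception.
    static Document parse(String widl) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(widl.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed WIDL", e);
        }
    }

    // Names of all <SERVICE/> elements, in document order.
    static List<String> serviceNames(Document doc) {
        List<String> names = new ArrayList<>();
        NodeList services = doc.getElementsByTagName("SERVICE");
        for (int i = 0; i < services.getLength(); i++) {
            names.add(((Element) services.item(i)).getAttribute("NAME"));
        }
        return names;
    }

    // Names of the <VARIABLE/> elements inside the binding with the given name.
    static List<String> variableNames(Document doc, String bindingName) {
        List<String> names = new ArrayList<>();
        NodeList bindings = doc.getElementsByTagName("BINDING");
        for (int i = 0; i < bindings.getLength(); i++) {
            Element binding = (Element) bindings.item(i);
            if (!bindingName.equals(binding.getAttribute("NAME"))) {
                continue;
            }
            NodeList vars = binding.getElementsByTagName("VARIABLE");
            for (int j = 0; j < vars.getLength(); j++) {
                names.add(((Element) vars.item(j)).getAttribute("NAME"));
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // A trimmed copy of the news service from Example 6.
        String widl =
              "<WIDL NAME=\"News\" VERSION=\"2.0\">"
            + "<SERVICE NAME=\"Techweb\" METHOD=\"GET\" URL=\"http://www.techweb.com/\""
            + " OUTPUT=\"techwebOut\" />"
            + "<BINDING NAME=\"techwebOut\" TYPE=\"OUTPUT\">"
            + "<VARIABLE NAME=\"stories\" TYPE=\"String[]\" REFERENCE=\"tops.a[].text\" />"
            + "<VARIABLE NAME=\"links\" TYPE=\"String[]\" REFERENCE=\"tops.a[].href\" />"
            + "</BINDING></WIDL>";
        Document doc = parse(widl);
        System.out.println(serviceNames(doc));                 // prints [Techweb]
        System.out.println(variableNames(doc, "techwebOut"));  // prints [stories, links]
    }
}
```

A generator could turn each service name into a class or function name and each output variable into a field or return value.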
webMethods has developed a suite of Web Automation products for the development and management of WIDL files, as well as the generation of client code from WIDL files. Client 'stubs', which we affectionately call "Weblets", present developers with local function calls, and encapsulate all the methods required to invoke a service that has been defined by a WIDL file.

Example 7 features a Java class generated from the package tracking WIDL presented in Example 1 above. This class demonstrates the loadDocument, invokeService and getVariable methods of the Context class that is part of the webMethods API for processing WIDL.

Example 7. (Generated Java class).

  import java.io.IOException;
  import watt.api.*;

  public class TrackPackage extends Object {
    public String TrackingNum;
    // DestCountry and ShipDate are used below but were not declared in the
    // original listing; they are declared here (an assumption) so that the
    // class compiles. This constructor leaves them unset.
    public String DestCountry;
    public String ShipDate;
    public String disposition;
    public String deliveredOn;
    public String deliveredTo;

    public TrackPackage(String TrackingNum)
        throws IOException, WattException, WattServiceException {
      this.TrackingNum = TrackingNum;

      // Input parameters as name-value pairs
      String args[][] = {
        {"TrackingNum", TrackingNum},
        {"DestCountry", DestCountry},
        {"ShipDate", ShipDate}
      };

      // Create a runtime context and load the interface definition
      Context c = new Context();
      c.loadDocument("Shipping.widl");

      // Invoke the service and copy the bound output variables into fields
      Result r = c.invokeService("FedexShipping", "TrackPackage", args);
      disposition = r.getVariable("disposition");
      deliveredOn = r.getVariable("deliveredOn");
      deliveredTo = r.getVariable("deliveredTo");
    }
  }

After the variables used by the TrackPackage class have been declared, a handle 'c' to a new context of the webMethods web automation runtime is created. All API calls are then made against this handle. loadDocument loads and parses the specified WIDL file, in this case "Shipping.widl". Loading the WIDL defines the services of the Shipping interface to the runtime. invokeService actually submits the input parameters to the "TrackPackage" service, which makes the appropriate HTTP request and returns either a result set which contains the bound output variables or an error message specified by a <CONDITION/> statement in the service's output binding.
getVariable is then used to extract the values of the output variables and to assign them to class variables. Within the Java application, the package tracking service looks like a simple instantiation of the TrackPackage class:

  TrackPackage p = new TrackPackage("12345678");

In short, an application makes a call to a local function that has been generated from WIDL. The local function encapsulates the API calls to the WIDL processor. The WIDL processor:

- loads the WIDL file from a local or remote file system
- submits the function's input parameters in an HTTP request
- parses the retrieved document to extract target data items
- executes any conditional logic for error checking or service chaining
- returns the extracted data into the output parameters of the calling function

Generated Java classes can be incorporated in standalone Java applications, Java Applets, JavaScript routines, or server-side Java 'Servlets'. Generated C/C++ encapsulating Web services can be deployed as DLLs, shared libraries, or standalone executables. The webMethods implementation, the Web Automation Platform, provides Java classes, a shared library, a Windows DLL and an ActiveX control to support Visual Basic modules which can be embedded in spreadsheets and other applications.

Conclusion

Web technology is strong on interactivity, but weak on automation. The primary applications of the Web, including Push and Agent technologies, are almost exclusively focused on end users. Data that is being made available in HTML format is effectively inaccessible to business applications other than the Web browser. On corporate intranets and extranets, the Web browser has enabled access to business systems, but has in many cases reinforced manual inefficiencies as data must be transcribed from browser windows into other application interfaces. Electronic commerce on the Web is typically driven manually via a browser.
In order to achieve business-to-business integration, organizations have resorted to proprietary protocols. The many-to-many nature of Web commerce demands a standard for automated integration. Interactions normally performed manually in a browser, such as entering information into an HTML form, submitting the form, and retrieving HTML documents, can be automated by capturing details such as input parameters, service URLs, and data extraction methods for output parameters. Mechanisms for condition processing can also be provided to enable robust error handling.

The Web Interface Definition Language (WIDL) is an application of the eXtensible Markup Language (XML) which allows the resources of the World Wide Web to be described as functional interfaces that can be accessed by remote systems over standard Web protocols. WIDL transforms the Web into a standards-based integration platform, providing a practical and cost-effective infrastructure for business-to-business electronic commerce over the Web.

About the Author

Charles Allen is Vice President of Product Management at webMethods, Inc.

  charles@webMethods.com
  3975 University Drive
  Suite 360
  Fairfax, VA 22030
  (703) 352-8345

Appendix: WIDL DTD

  <!ELEMENT WIDL ( SERVICE | BINDING )* >
  <!ATTLIST WIDL
      NAME       CDATA #IMPLIED
      VERSION    (1.0 | 2.0 | ...) "2.0"
      TEMPLATE   CDATA #IMPLIED
      BASEURL    CDATA #IMPLIED
      OBJMODEL   (wmdom | ...) "wmdom" >

  <!ELEMENT SERVICE EMPTY>
  <!ATTLIST SERVICE
      NAME       CDATA #REQUIRED
      URL        CDATA #REQUIRED
      METHOD     (Get | Post) "Get"
      INPUT      CDATA #IMPLIED
      OUTPUT     CDATA #IMPLIED
      AUTHUSER   CDATA #IMPLIED
      AUTHPASS   CDATA #IMPLIED
      TIMEOUT    CDATA #IMPLIED
      RETRIES    CDATA #IMPLIED >

  <!ELEMENT BINDING ( VARIABLE | CONDITION | REGION )* >
  <!ATTLIST BINDING
      NAME       CDATA #REQUIRED
      TYPE       (Input | Output) "Output" >

  <!ELEMENT VARIABLE EMPTY>
  <!ATTLIST VARIABLE
      NAME       CDATA #REQUIRED
      FORMNAME   CDATA #IMPLIED
      TYPE       (String | String[] | String[][]) "String"
      USAGE      (Default | Header | Internal) "Function"
      REFERENCE  CDATA #IMPLIED
      VALUE      CDATA #IMPLIED
      MASK       CDATA #IMPLIED
      NULLOK     #BOOLEAN >

  <!ELEMENT CONDITION EMPTY>
  <!ATTLIST CONDITION
      TYPE       (Success | Failure | Retry) "Success"
      REF        CDATA #REQUIRED
      MATCH      CDATA #REQUIRED
      REBIND     CDATA #IMPLIED
      SERVICE    CDATA #IMPLIED
      REASONREF  CDATA #IMPLIED
      REASONTEXT CDATA #IMPLIED
      WAIT       CDATA #IMPLIED
      RETRIES    CDATA #IMPLIED >

  <!ELEMENT REGION EMPTY>
  <!ATTLIST REGION
      NAME       CDATA #REQUIRED
      START      CDATA #REQUIRED
      END        CDATA #REQUIRED >