The W3C has acknowledged receipt of a submission from Republica Corp. for a 'Data Extraction Language' (DEL) intended to provide a basis for furthering work on any-to-XML transformations. As outlined in the Note, DEL is "an XML format for describing data conversion processes from other data formats to XML. A DEL script specifies how to locate and extract fragments from input data and where to insert them in the resulting XML format. The DEL processor executing the DEL script can use the extracted data to either create a new XML document or modify an existing XML document by creating new elements and attributes at locations specified with XPath expressions." Appendix 1 provides the Data Extraction Language DTD. The submission has been referred to the attention of the XSL Working Group, "as the use cases and parse methods could serve as a starting point for the definition of regular expression matching in XPath 2.0."
From the Introduction: "The DEL processor executing the DEL script can use the extracted data to either create a new XML document or modify an existing XML document by creating new elements and attributes at locations specified with XPath expressions. A DEL script along with the source data are given to the DEL processor which performs the actual data conversion according to the script. The output from the DEL processor is a well-formed XML document containing the desired parts of the source data. Locating data fragments in the input data can be done by searching for patterns and the matching regular expressions (REGEX). The extracted data fragments are first temporarily stored to DEL processor's registers (or stack) in order to be refined before outputting and possibly re-used as a search pattern. The data is then read from the registers or stack and placed into its proper position in the DOM tree of the resulting XML document. In placing the data to XML, a cursor function is used to keep track of the current position. The cursor position can be modified using XPath expressions."
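The processing model quoted above (locate fragments with regular expressions, hold them in registers, then place them in the output tree at a cursor position that can be moved with XPath) can be illustrated with a minimal Python sketch. This is not DEL syntax and not Republica's processor: the sample input, the `RECORD` pattern, the `registers` dictionary, and the `convert` and `move_cursor` helpers are all hypothetical names chosen for illustration, and the standard library's `xml.etree.ElementTree` supports only a limited XPath subset.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical plain-text input; DEL itself is format-agnostic.
SOURCE = """\
Name: Ada Lovelace
Born: 1815
Name: Alan Turing
Born: 1912
"""

# Assumed record layout: a "Name:" line followed by a "Born:" line.
RECORD = re.compile(r"Name: (?P<name>.+)\nBorn: (?P<year>\d{4})")


def move_cursor(root, xpath):
    """Return the element selected by xpath, standing in for a cursor move."""
    target = root.find(xpath)
    if target is None:
        raise ValueError(f"no element matches {xpath!r}")
    return target


def convert(text):
    """Extract each record into registers, then write it into the XML tree."""
    root = ET.Element("people")
    cursor = root  # current insertion point in the output tree
    for match in RECORD.finditer(text):
        # Temporary storage, so fragments can be refined before output.
        registers = {
            "name": match.group("name").strip(),
            "year": match.group("year"),
        }
        person = ET.SubElement(cursor, "person")
        person.set("born", registers["year"])
        ET.SubElement(person, "name").text = registers["name"]
    return root


if __name__ == "__main__":
    tree = convert(SOURCE)
    # Modify the existing tree: move the cursor to the second <person>
    # and create a new element there.
    cursor = move_cursor(tree, "./person[2]")
    ET.SubElement(cursor, "note").text = "added at cursor"
    print(ET.tostring(tree, encoding="unicode"))
```

The sketch keeps extraction (the regular expression), intermediate storage (the registers), and placement (the cursor) as separate steps, which mirrors the separation described in the Introduction; a real DEL script would express these steps declaratively in XML rather than in program code.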
From the W3C team comment: "There are many languages and systems designed to assist in extracting information from text, including Perl (the 'e' in the original acronym stood for 'extraction') and the wide field of lexical scanners and parser generators (such as lex and yacc). Also related is XSLT, an XML transformation language which can, to a certain extent, parse text files to generate XML output. Although XSLT 1.0 does not define functions to parse text using regular expressions, it allows the definition of extensions to do so. A few implementations make this mechanism available, making XSLT an appropriate standard to compare DEL to. XSLT 2.0 is expected to support regular expressions through XPath 2.0; see item 3 in the XPath 2.0 requirements. Not all of DEL's features are covered by XSLT, for example, the possibility of generating CDATA sections, or to control the way the result XML file is output with the 'Document Ready' function. On the other hand XSLT provides many useful functions, such as sorting and numbering, that DEL lacks. One particular feature that DEL could have borrowed from XSLT which would have made the language simpler is the way the result tree is built. While XSLT uses namespaces to allow instantiating the result tree directly, DEL went for the more complicated solution of using constructor elements (<map>) as well as a 'cursor' to navigate through the output tree and add new XML constructs."
Bibliographic information: DEL - Data Extraction Language. W3C Note 31-October-2001. Submission date: June 21, 2001. Version URL: http://www.w3.org/TR/2001/NOTE-data-extraction-20011031. Latest version URL: http://www.w3.org/TR/data-extraction. Edited by Eero Lempinen and Harri Saarikoski (Republica Corp.).
Contacts: send inquiries to del@republica.fi. Other contacts: Antti Jokipii and Eetu Ojanen. Republica Corp., Data Extraction Language, Ahlmaninkatu 1, 40100 Jyväskylä, Finland.
Principal references:
- DEL - Data Extraction Language
- Data Extraction Language DTD
- DEL Submission Request
- W3C staff comment from Max Froumentin (Team Contact for the W3C XSL Working Group).
- Announcement 2001-11-06: "W3C Released Republica's Data Extraction Language"
- Republica Corp. web site