Clio is a computer science research project at IBM's Almaden Research Lab. Its developers are designing methods to specify the transformation of legacy data to make it fit for new uses. Clio addresses the challenge of "merging and coalescing data from multiple and diverse sources into different data formats. In particular, it addresses schema matching (the process of matching elements of a source schema with elements of a target schema) and schema mapping (the process of creating a query that maps between two disparate schemas), which lie at the heart of data integration systems. Clio is a tool for generating mappings (queries) between relational and XML Schemas. The user is presented with the structure and constraints of two schemas and is asked to draw correspondences between the parts of the schemas that represent the same real-world entity. Correspondences can also be inferred by Clio and verified by the user. Given the two schemas and the set of correspondences between them, Clio can generate the queries (SQL, XSLT, or XQuery) that drive the translation of data conforming to the first (source) schema to data conforming to the second (target) schema."
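To make the idea of turning correspondences into a query concrete, here is a minimal Python sketch. It is not Clio's algorithm or API: the `Correspondence` class, the `generate_insert_select` function, and the `emp` and `staff` tables are hypothetical names invented for illustration, and the sketch assumes the simplest possible case of a single flat source table mapped column by column onto a single target table. Clio itself also handles nested XML schemas, constraints, and joins, none of which appear here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Correspondence:
    """Hypothetical record: one source column drawn to one target column."""
    source_table: str
    source_column: str
    target_column: str


def generate_insert_select(target_table, correspondences):
    """Build one SQL statement that copies source data into the target shape.

    Assumption: all correspondences come from a single source table, so no
    join logic is needed. This is an illustrative simplification only.
    """
    if not correspondences:
        raise ValueError("at least one correspondence is required")
    source_tables = {c.source_table for c in correspondences}
    if len(source_tables) != 1:
        raise ValueError("this sketch handles a single source table only")
    source_table = next(iter(source_tables))
    # Keep source and target columns in the same order so they line up.
    target_cols = ", ".join(c.target_column for c in correspondences)
    source_cols = ", ".join(c.source_column for c in correspondences)
    return (
        f"INSERT INTO {target_table} ({target_cols}) "
        f"SELECT {source_cols} FROM {source_table}"
    )


# Hypothetical legacy table "emp" mapped onto a new table "staff".
correspondences = [
    Correspondence("emp", "ename", "full_name"),
    Correspondence("emp", "sal", "salary"),
]
sql = generate_insert_select("staff", correspondences)
print(sql)
```

The user supplies only the column-level correspondences; the query text is derived from them, which is the division of labor the description above attributes to Clio.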
Clio problem statement: "The world today is full of information sources, all with their own ways of representing data. One common problem that arises is that data, which exists in one representation in some data source, is needed in a different representation for some other purpose. As a simple example, the owner of a data source may want to publish his data using a specific XML DTD, though it is stored in some different (legacy) format. As another example, data warehouses bring data from one or more sources together, in a new form that allows for efficient decision support queries. Today, such situations are for the most part dealt with manually, by an expert user who has knowledge of both the source and target representations. Converting from one data representation to another is a time-consuming and labor-intensive project, with few tools available to ease the task."
About Garlic:
"Garlic is a project being developed by members of the database group in Computer Science. The goal of Garlic is to enable large-scale multimedia information systems: large scale in that they involve lots of data with multimedia taken as broadly as possible to mean data of many types. We are particularly concerned about situations in which there is enough data of sufficiently specialized types that users have already made decisions about how to manage it, and have stored it in separate repositories that are specifically adapted to data of that type."
Garlic is an IBM prototype that allows integration of diverse sources such as the specialized repositories described above, and allows new sources to be easily added to an existing installation. Garlic lets users interrelate data from multiple sources, with a broad range of querying capabilities, in a single cross-source query. A significant focus of the project is support for data sources that provide type-specific indexing and query capabilities, such as text search or search by molecular structure.
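The notion of a single cross-source query can be illustrated with a small Python sketch. It does not reproduce Garlic's architecture or interfaces: the `TextSource` and `RecordSource` classes, the `cross_source_query` function, and the sample data are all hypothetical, and the sketch assumes two in-memory sources linked by a shared `doc_id` field. The point it tries to show is that each source answers the part of the query it is specialized for, and a thin layer combines the results.

```python
class TextSource:
    """Hypothetical repository whose type-specific capability is keyword search."""

    def __init__(self, documents):
        self._documents = documents  # {doc_id: text}

    def search(self, keyword):
        """Return the ids of documents containing the keyword (case-insensitive)."""
        needle = keyword.lower()
        return sorted(
            doc_id
            for doc_id, text in self._documents.items()
            if needle in text.lower()
        )


class RecordSource:
    """Hypothetical repository of structured rows that supports equality filters."""

    def __init__(self, rows):
        self._rows = rows  # list of dicts

    def select(self, **equals):
        """Return the rows whose fields equal every given value."""
        return [
            row
            for row in self._rows
            if all(row.get(key) == value for key, value in equals.items())
        ]


def cross_source_query(text_source, record_source, keyword, **equals):
    """Combine both sources: text search on one, structured filter on the other.

    Assumption: rows carry a "doc_id" field that links them to documents.
    """
    matching_ids = set(text_source.search(keyword))
    return [
        row
        for row in record_source.select(**equals)
        if row["doc_id"] in matching_ids
    ]


# Hypothetical sample data.
reports = TextSource({
    1: "Binding affinity of compound A",
    2: "Toxicity study of compound B",
    3: "Affinity screening notes",
})
compounds = RecordSource([
    {"doc_id": 1, "compound": "A", "lab": "lab-1"},
    {"doc_id": 2, "compound": "B", "lab": "lab-1"},
    {"doc_id": 3, "compound": "C", "lab": "lab-2"},
])
result = cross_source_query(reports, compounds, "affinity", lab="lab-1")
print(result)
```

Adding another kind of source in this sketch would mean writing one more class with its own specialized method, which loosely mirrors the stated aim of making new sources easy to add to an existing installation.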
Principal references:
- Clio Project - Computer Science Research at IBM's Almaden Research Lab
- Clio website at UToronto
- Mapping XML Schemas - Demo
- Sample XML Schemas. Compare schema integration and schema mapping solutions. Nested schemas are presented in W3C XML-Schema. Relational schemas are given as DB2 DDL statements and/or XML-Schemas for convenience.
- "Mapping XML and Relational Schemas with Clio." By Lucian Popa, Mauricio A. Hernández, Yannis Velegrakis, Renée J. Miller, Felix Naumann, and Howard Ho. In Proceedings 18th International Conference on Data Engineering (ICDE) [San Jose, CA, USA; February 26, 2002 - March 1, 2002].
- "Translating Web Data." By Lucian Popa, Yannis Velegrakis, Renée J. Miller, Mauricio Hernández, and Ronald Fagin. University of Toronto, Technical Report CSRI 441. February 2002. 27 pages.
- "The Clio Project: Managing Heterogeneity." 6 pages.
- "Attribute Classification Using Feature Analysis." [ICDE 2002 Poster Presentation]
- Clio Publications
- IBM Garlic Project. See also the overview.