Cover Pages: The Species Analyst Project

[February 17, 2001] The Species Analyst, based at the University of Kansas Natural History Museum and Biodiversity Research Center, "is a research project developing standards and software tools for access to the world's natural history collection and observation databases. The primary mechanism for accessing these data is currently through this web site at the Species Analyst Web Interface. Alternative tools are being developed that permit direct access to these data. The tools are in active development, but beta versions may be retrieved from the Species Analyst technical development web site. The Species Analyst uses XML for all its native data caching and retrieval from the remote data sources; the remote data sources are still mostly Z39.50 servers."

Project goals: "The Species Analyst is a research project developing standards, agreements and software tools facilitating the discovery, exchange, use, and analysis of natural history specimen records and observations. Natural history collections are a valuable resource that, combined together, provide approximately 3 billion records documenting the global distribution of species over the last 300 years or so. This information is housed in a large number of collections, most of which are using different database management systems, different database schemas, and varying degrees of capture into electronic form. The standards and software developed on the Species Analyst project provide a mechanism for combining most of these collection databases into a single, coherent, and readily searchable virtual database."

The Species Analyst relies heavily upon the fusion of the ANSI/NISO Z39.50 standard for information retrieval (ISO 23950) and XML. Z39.50 provides an excellent framework for distributed query and retrieval of information both within and across information domains. However, its use is restrictive because of the somewhat obscure nature of its implementation. All of the tools used by the Species Analyst transform Z39.50 result sets into an XML format that is convenient to process further, either for viewing or data extraction. This fusion of Z39.50 and XML brings standards-based information retrieval to the desktop by extending the capabilities of existing tools that users are familiar with, such as Microsoft's Internet Explorer and Excel and ESRI's ArcView.
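As a rough illustration of this kind of transformation, the Python sketch below recasts a GRS-1-style record into XML. It relies only on the description in the "Z39.50 URL Notes" cited in the references: GRS-1 fields carry a numeric tag type and a tag value in place of string tags, and they nest much like XML elements. The `_<type>_<value>` element naming, the tuple layout, and the sample values are assumptions made for this example; they are not the actual output format of ZX or any other Species Analyst tool.

```python
import xml.etree.ElementTree as ET


def grs1_to_xml(fields, parent):
    """Append GRS-1-style fields to `parent` and return it.

    Each field is a (tag_type, tag_value, content) tuple. `content` is either
    a leaf value or a list of nested fields (a subtree).
    """
    for tag_type, tag_value, content in fields:
        # XML names cannot start with a digit, so prefix with "_".
        # This naming scheme is an assumption for illustration only.
        child = ET.SubElement(parent, f"_{tag_type}_{tag_value}")
        if isinstance(content, list):
            grs1_to_xml(content, child)  # recurse into the subtree
        else:
            child.text = str(content)
    return parent


if __name__ == "__main__":
    # Placeholder record; the tags and values carry no real meaning.
    record = [
        (2, 1, "Example title"),
        (3, "locality", [(3, "country", "Exampleland")]),
    ]
    root = grs1_to_xml(record, ET.Element("record"))
    print(ET.tostring(root, encoding="unicode"))
```

Once the record is an XML tree, ordinary XML tooling (stylesheets, DOM or SAX style processing) can take over, which is the convenience the paragraph above describes.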

ITIS XML: "The Species Analyst utilizes the information contained in the ITIS*ca database in a variety of ways. The contents of this folder provide information about the XML output option of ITIS and how to extract data from the ITIS XML output. ITIS*ca is the Canadian implementation of the Integrated Taxonomic Information System (ITIS). The goal in this subproject is to define an XML interface to ITIS that may be utilized for programmatic access to the contents of ITIS database implementations." As of June 30, 2000, a new ITIS XML standard was under development.

Mechanism for Searching the ITIS*ca Database: "The ITIS*ca database is accessible through HTTP, hence queries are formatted as URLs. A general query syntax is not yet available for searching the database; however, there are two mechanisms for locating taxon records in XML format from ITIS*ca. The first is by using a taxon name or vernacular name as the search term. The second is by using a Taxonomic Serial Number (TSN). TSN searches may also be used to locate parent and child records of a given TSN, which is convenient for navigating the taxonomic database."

ZX Sample Scripts for loading an XML document from a Z39.50 target using ZX: "Using ZX, one can perform a complete Z39.50 search and retrieval simply by specifying a URL. The resulting document can be easily reformatted using XSL or the data elements may be extracted using XSL or by using the XMLDOM. This simple piece of JavaScript shows how to load an XML document into the XMLDOM for further processing. Note that the URL can be any URL that generates a valid XML document..."
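The original JavaScript sample is not reproduced here. As a loose analogue, the Python sketch below loads an XML document from a URL and pulls out data elements for further processing. It is a minimal sketch under stated assumptions: Python's standard library has no Z39.50 URL handler, so only schemes that `urllib` understands will work, and the `taxon`, `tsn`, and `name` element names are invented for illustration rather than taken from the ITIS*ca DTD. A `data:` URL stands in for a live target so the example runs offline.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET


def load_xml(url):
    """Fetch `url` and parse the response body as XML, returning the root."""
    with urllib.request.urlopen(url) as response:
        return ET.fromstring(response.read())


def element_texts(root, tag):
    """Return the stripped text of every element named `tag` under `root`."""
    return [(el.text or "").strip() for el in root.iter(tag)]


if __name__ == "__main__":
    # Hypothetical document; element names and values are placeholders.
    sample_xml = "<taxon><tsn>12345</tsn><name>Example taxon</name></taxon>"
    url = "data:text/xml," + urllib.parse.quote(sample_xml)
    root = load_xml(url)
    print(root.tag, element_texts(root, "name"))
```

Any URL that returns well-formed XML could be passed to `load_xml` in the same way, which mirrors the point made in the quoted passage.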

Version 1.3 of the Darwin Core profile for natural history collections and observation data sets. The Darwin Core (DwC) is "a profile describing the minimum set of standards for search and retrieval of natural history collections and observation databases. Natural history collections and observation data sets represent sets of observations, with each record detailing the observation of an organism, ideally at a specific geo-temporal location. In the case of collections, the observation is permanent in that the organism was collected from the field and preserved in a curated collection intended to last indefinitely. Collected specimens can be prepared in various ways, and several preparations from a single organism are not unusual (skin, skeleton, and perhaps microscope slides), thus there may be several records for a single organism, each representing the organism prepared using different techniques, but all records referring to a single observation event. Conversely, some collection records may represent a collection object that contains many organisms. For example, in ichthyology, the contents of a trawl may be sorted by taxon and lumped into a single collection container. Observation data sets catalog the observation of an organism, also at a specific geo-temporal location, but in this case the organism observed is not collected, and hence the observation record is the only information recorded about the organism. In both cases a taxonomic identification of the organism is attempted, with obvious consequences for accuracy of identification (a specimen available for identification to several experts compared with a potentially fleeting glimpse of an organism in the field)... Record content is also defined in this profile along with suggested syntaxes for encoding result set records in GRS-1, XML, SUTRS, and MARC. Although the profile was originally intended for use with the Z39.50 protocol through Z39.50 servers such as ZBigServer, it is also applicable for defining searches and XML content generated by databases served using HTTP, such as with ZHTTP."
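The relationship between records and observation events described above can be sketched in Python as follows. This is an illustrative data model only: the field names, and in particular the `event_id` key that ties several preparation records back to one observation event, are hypothetical and are not elements defined by the Darwin Core profile.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CollectionRecord:
    catalog_number: str
    scientific_name: str
    preparation: str  # e.g. "skin", "skeleton", "slide"
    event_id: str     # hypothetical key for the shared observation event


def group_by_event(records):
    """Map each observation event to the list of records derived from it."""
    groups = defaultdict(list)
    for record in records:
        groups[record.event_id].append(record)
    return dict(groups)


if __name__ == "__main__":
    # Placeholder records: two preparations of one organism, plus one other.
    records = [
        CollectionRecord("A-1", "Example species", "skin", "event-1"),
        CollectionRecord("A-2", "Example species", "skeleton", "event-1"),
        CollectionRecord("B-1", "Another species", "slide", "event-2"),
    ]
    for event, recs in group_by_event(records).items():
        print(event, [r.preparation for r in recs])
```

Grouping this way makes it explicit that counting records is not the same as counting organisms or observation events.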

Becoming a Data Provider: "Any institution with specimen or observation collections is welcome to participate in the public distribution of their records using the components available through the Species Analyst. The server software and client tools are available free of charge for non-commercial use. Installing ZBigServer is a fairly straightforward process that typically requires a 'one-two cups of coffee' period of time. If you are familiar with your database, how it works, and the content of the database is in reasonably good shape, then the whole process can take as little as 15 minutes. Since The Species Analyst clients all generate XML output, there is no technical reason you cannot use a web server to provide the same basic functionality as ZBigServer. In fact, a set of active server pages (for Microsoft's IIS or personal web server) are being developed which will let you serve specimen records as XML right from your web server. That is in the future, though; right now the easiest and quickest way to start providing data is with ZBigServer. Z39.50 also appears to be more efficient as a mechanism for search and retrieval of data, so there are some performance reasons for choosing Z39.50 over HTTP. Any Z39.50 server that can generate records in the GRS-1 record syntax can participate; there are no peculiar or proprietary features of ZBigServer that exclude the use of other Z39.50 servers."

References:

Species Analyst web site
Species Analyst Project description
ZBig Client ITIS XML Interface
ITIS*ca XML Interface DTD; also at http://habanero.nhm.ukans.edu/DTD/ITIS.dtd. "An experimental XML stream is available from the ITIS*ca web site. This interface was developed to test the feasibility of such an interface and to identify potential problems, and should be treated as experimental only."
"Z39.50 URL Notes for ZIG 07Dec2000." By Dave Vieglais. Monday, November 27, 2000. "The Z39.50 client application is implemented as an 'asynchronous pluggable protocol handler' (APP) which when installed on any win32 platform, provides Z39.50 support for many existing internet applications that utilize the functionality of URL monikers. The current implementation of the APP is called 'ZX'. For a Z39.50 search (Figure 1), the APP connects to the target (INIT), sends the query (SEARCH), presents up to n records (PRESENT) and closes the connection (CLOSE). A similar interaction is performed for SCAN. The APP transforms the Z39.50 response into an XML document that may be further processed using a variety of XML processing tools. For example, typing a Z39.50 URL into a ZX enabled copy of Internet Explorer would show the results of the operation as an XML document. The XML parser developed by Microsoft can perform a Z39.50 search and parse the results simply by calling the 'load' method with a Z39.50 URL. The SAX API can be used in a similar manner to load and process a Z39.50 URL... [fig.] records presented to the Z39.50 client are transformed into an XML document. The resulting XML document is the only response that the hosting application is aware of. MARC records are encoded in a similar fashion to GRS-1 records where tag codes are turned into XML elements preceded by the _ character and sub-fields are contained within the parent elements... GRS-1 records are convenient to recast into XML as their internal structure is very similar to that exhibited by an XML document except numeric tags consisting of a tag type and tag value are used in place of string tags..."
"The Species Analyst Internet Integration of Natural History Collections." By Dave Vieglais (University of Kansas, Natural History Museum and Biodiversity Research Center). Slide presentation (70 slides).
Species Analyst news
Darwin Core data providers
Contact: SpeciesAnalyst@ukans.edu. Also: Dave Vieglais and Derek Munro

