Overview
The OMG Life Sciences Identifiers specification "addresses the need for a standardized naming schema for biological entities in the Life Sciences domains, the need for a service assigning unique identifiers complying with such naming schema, and the need for a resolving service that specifies how to retrieve the entities identified by such naming schema from repositories. The normative parts of the specification are: (1) Platform independent model expressed in the attached XML file created according XMI format rules, v1.0; (2) Platform specific model for Web Services using one of the proposed bindings (SOAP/HTTP, HTTP GET, FTP) for those who are implementing Web Services model; (3) Platform specific model for Java for those who are implementing Java model."
Life Science Identifiers (LSIDs) are "persistent, location-independent, resource identifiers for uniquely naming biologically significant resources including but not limited to individual genes or proteins, or data objects that encode information about them. LSIDs are intended to be semantically opaque, in that the LSID assigned to a resource should not be counted on to describe the characteristics or attributes of the resource that the LSID refers to. The users of the LSIDs are permitted to use individual components (as specified elsewhere in this document) of LSIDs — although the LSID component parts themselves should be treated as opaque pieces of the identifier. LSIDs are expressed as a URN namespace and share the following functional capabilities of URNs:
- Global scope: A LSID is a name with global scope that does not imply a location. It has the same meaning everywhere.
- Global uniqueness: The same LSID will never be assigned to two different objects.
- Persistence: It is intended that the lifetime of an LSID be permanent. That is, the LSID will be globally unique forever, and may be used as a reference to an object well beyond the lifetime of the object it identifies or of any naming authority involved in the assignment of its name.
- Scalability: LSIDs can be assigned to any data element that might conceivably be available on the network, for hundreds of years.
- Legacy Support: The LSID naming scheme must permit the support of existing legacy naming systems, insofar as they meet the requirements specified...
- Extensibility: Any scheme for LSIDs must permit future extensions to the scheme.
- Independence: It is solely the responsibility of a name issuing authority to determine conditions under which it will issue a name.
- Resolution: A URN will not impede resolution i.e., translation to a URL..." [from the 04-05-01 Final Approved OMG spec]
Syntax: "LSID Syntax uses a 5-part format: URN:LSID:Authority:Namespace:Object:[Revision-ID], where URN:LSID is a mandatory prefix, Authority is the Internet domain of the organization that assigns an LSID to a resource, Namespace constrains the scope of the object; Object is an alphanumeric describing the object; Revision-ID is an optional version of the object. Examples: 'URN:LSID:ncbi.nlm.nih.gov:genbank:AF271072' and 'URN:LSID:chemacx.cambridgesoft.com:ACX:CAS967582:1'." [Andrew Ellicott, "Welcome to the 2004 Life Science ID Symposium"]
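The syntax described above can be illustrated with a minimal Python sketch that splits an LSID into its component fields. This is not taken from the OMG specification or from any LSID toolkit: the names `LSID` and `parse_lsid` are hypothetical, and the sketch assumes that the `URN:LSID` prefix is matched case-insensitively and that an empty trailing revision field means "no revision". Consistent with the opacity requirement quoted earlier, the fields are only separated, never interpreted.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    """Component fields of an LSID (hypothetical container, for illustration)."""
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text: str) -> LSID:
    """Split 'URN:LSID:Authority:Namespace:Object[:Revision-ID]' into fields.

    Assumptions of this sketch: the prefix is compared case-insensitively,
    and an empty revision field is treated as absent.
    """
    parts = text.strip().split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"expected 5 or 6 colon-separated fields: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"missing mandatory URN:LSID prefix: {text!r}")
    authority, namespace, object_id = parts[2], parts[3], parts[4]
    if not (authority and namespace and object_id):
        raise ValueError(f"authority, namespace and object must be non-empty: {text!r}")
    # The revision is optional; keep it as an opaque string when present.
    revision = parts[5] if len(parts) == 6 and parts[5] else None
    return LSID(authority, namespace, object_id, revision)
```

Applied to the two examples quoted above, `parse_lsid` yields the authority `ncbi.nlm.nih.gov`, namespace `genbank`, and object `AF271072` with no revision for the first, and a revision of `1` for the second.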
"Released in Q3 2003, the LSID spec is the industry's first and only consistent way to programmatically access any life science data source through the Internet. Web Services-based LSIDs facilitate the integration of LSID 'sources' by LSID 'consumers.' Data source examples include compound registries, assay results management systems, LIMS, inventory, sequence & protein databases plus public resources such as PDB, NCBI, CAS, PubMed (and others); consumer applications of LSID include E-notebooks, discovery portals, informatics 'pipelines,' and LIMS, to name a few..." [I3C home page]
An initial specification was produced by the Interoperable Informatics Infrastructure Consortium (I3C) Life Science Identifiers (LSID) Technical Group: "In its first year of operation, I3C delivered the Life Science Identifier Resolution System (LSID), a highly relevant open-source solution. LSID provides for scalable, secure, and migration-transparent naming of biologically significant data. It introduces a straightforward approach to identifying data resources stored in multiple, distributed data stores in a manner that overcomes the limitations of naming schemes in use today. LSID is based on a mechanism for federated data and authority identification... Early implementations are at the Protein Data Bank and in Cold Spring Harbor Laboratory's (CSHL) Distributed Annotation System (DAS). The National Human Genome Research Institute's International HapMap Project, a genetic variation mapping that will help identify genetic contributions to common diseases, also makes heavy use of LSIDs. LSID simplifies procedures and ensures uniqueness of identifiers..." [I3C FAQ document]
"Established in 2001, I3C is a discovery informatics consortium, which aims to increase the probability of success in pharmaceuticals and bio-tech R&D by eliminating barriers to application interoperability, data integration and knowledge flow. Huge volumes of multi-format, multi-platform data from disparate sources create many productivity barriers; I3C was borne out of the need for informatics solutions that clear these barriers and allow the acceleration of life science discoveries and new products. I3C members (including large pharmaceuticals, IT companies, not-for-profits and academic institutions) resolutely intend for I3C's work to benefit life science participants, help the industry grow and provide broad societal benefits." [I3C home page]
About the Interoperable Informatics Infrastructure Consortium (I3C)
"Established in 2001 and incorporated in 2002, the Interoperable Informatics Infrastructure Consortium (I3C) is an international consortium of pharmaceutical, biotechnology, academic, and information technology organizations working together to facilitate and enable data exchange and data and knowledge management across the entire life sciences community.
The mission of I3C is to promote and guide the design and development of methodologies and solutions for data and tool interoperability in the life sciences, based on open, public specifications, protocols, and guidelines. Our objective: to enable the widest availability of open, interoperable software to accelerate discovery and solve critical problems in drug development.
I3C's approach is unique and focused on real-world problems and practical solutions. Scientific use-cases which exemplify common interoperability bottlenecks will be used to guide the development of solutions that couple technical recommendations with fully-documented, open-reference implementations.
I3C strives to bring together disparate efforts - it is not trying to duplicate the efforts of existing standards organizations. I3C recommendations can be passed either through a formal standards review process or remain as-is, accessible to the community. To avoid duplication of effort, we follow methods, protocols, policies, etc. of other groups wherever possible, and seek to form alliances with such groups whenever appropriate.
Participation in I3C is open to any non-profit organization, academic or government research institution, commercial organization, or interested individual. I3C is a consensus-based initiative by the community, for the community..." [from About I3C, 2004-07]
I3C Members: A snapshot as of 2004-07
- Avaki Corporation
- BIO
- Frederick DePalm (Chiron Corporation)
- Genaissance Pharmaceuticals
- Hewlett-Packard
- IBM Life Sciences
- Infinity Pharmaceuticals
- Manchester Informatics, Ltd. (U. of Manchester)
- Merck & Co.
- Millennium Pharmaceuticals
- Object Management Group (OMG)
- Optive Research
- Platform Computing
- PDB (Rutgers University)
- The Sanger Institute
- Scimagix
- Synthematix
- TIGR (The Institute for Genomic Research)
- TurboWorx
- The University of California at San Diego
- Whitehead Institute/Center for Genome Research
Related Project: W3C Forum on Semantic Web for the Life Sciences
W3C maintains a 'public-semweb-lifesci' public mailing list as "an open forum for scientists and informaticists to discuss issues and initiatives relevant to the exploration and use of the Semantic Web for the life sciences. Managed as part of the W3C RDF Interest Group, this forum is intended to identify opportunities for applying semantic-based approaches to life sciences, as well as advance the rapid design and development of reference models in several areas of life sciences informatics and knowledge exchange. The public-semweb-lifesci discussion list is a forum for detailed domain-related and technical discussions of all approaches to the use of classical logic and ontologies within the Life Science community for the representation of data, ontologies, interpretations, knowledge, biological systems, mechanisms of action, and inference rules..." A W3C Workshop on Semantic Web for Life Sciences was planned for 27-28 October 2004 in Cambridge, Massachusetts, USA.
Principal URLs
- Related reference document: XML in Clinical Research and Healthcare Industries
- Life Sciences Identifiers. An OMG Final Adopted Specification which has been approved by the OMG board and technical plenaries. Document Reference: dtc/04-05-01. Life Sciences Identifiers, Final Adopted Specification, in the finalization phase as of 2004-07. 40 pages. Copyright (c) 1997-2004 Object Management Group (OMG). Other: Copyright (c) 2003, EMBL-EBI; Copyright (c) 2003, Interoperable Informatics Infrastructure Consortium; Copyright (c) 2003, International Business Machines Corporation; Copyright (c) 2003, Object Management Group. Contact: Ms. Linda Heaton. Document URLs: http://www.omg.org/cgi-bin/apps/doc?dtc/04-05-01.pdf or http://www.omg.org/cgi-bin/apps/doc?@dtc/04-05-01.pdf. [cache]
- Life Sciences Identifiers, Accompanying Files. Final Adopted Specification. OMG Document Reference: dtc/04-05-02. ZIP archive with .wsdl, .mdl, .xml, and .java files. See the file listing.
- Object Management Group web site
- OMG Life Sciences Research (LSR) Group. "The life sciences research (LSR) group was formed on 6th August, 1997 in Philadelphia USA. Its inaugural meeting was attended by around fifty people representing forty organisations. The scope of the group covers any aspects related to life sciences research, including bioinformatics, genetics, cheminformatics, structural biology, computational chemistry, computational molecular biology, and clinical trials." MAGE-ML (Gene Expression) and LSID are two LSR specifications.
- Interoperable Informatics Infrastructure Consortium (I3C) - "Accelerating Drug Discovery Through Software Interoperability"
- I3C FAQ document
- About I3C
- I3C Members
- SourceForge Project: LSID (Life Science Identifier). "Life Science Identifier (LSID) resolution protocol, used to locate biologically significant data over a network, within middle-ware providing a client A.P.I. for Life Science applications, and server software, for Industry data providers."
- IBM LSID Project home page
- IBM mailing list
- IBM LSID Resolution Protocol Project. "This project implements the Life Science Identifier (LSID) resolution protocol, to locate biologically significant data over a network, within middle-ware providing a client API for Life Science applications, and server software, for Industry data providers. Example client applications, using the API are also provided to demonstrate the power of this protocol."
- Tutorial: Setting up your own LSID Authority using Perl. By Stefan Atev. "Demonstrates a step-by-step approach to building an LSID Authority from scratch. Atev demonstrates how he builds this on a minimal data set and on data downloaded from Swiss-Prot on the Linux platform."
- LSID Authority: University of Wisconsin CFL. "The server creates a semantic web of data and metadata information drawn from the NTL-LTER datasets, delivered in an RDF format. Each piece of data or metadata is referenced by a globally unique LSID URN. These identifiers are then referenced by the metadata of other LSIDs, creating relationships between diverse datasets regardless of content management schemas. This allows for a simple integration path among various researchers and institutions to attain wide-area interoperability with multiple, distributed data systems while maintaining their current infrastructure."
- Systinet LSID Package with WASP Server for Java. "This package uses Systinet's WASP Server for Java API in creating two LSID resolution mechanisms. One is a DNS based resolver that can be plugged directly into a client application. The second is an authority lookup Web service, that can be persistently deployed to a WASP Server and accessed using SOAP messaging."
- BioIT World clinics. Class materials from the clinics given at BioIT World in Boston, March 30th to April 1st, 2004.
Articles, Papers, News
[April 05, 2005] "LSID Best Practices: A Guide to Deploying Life Science Identifiers." By Dan Smith (Advanced Internet Technology Group, IBM) and Ben Szekely (Advanced Internet Technology Group, IBM). From IBM developerWorks (April 05, 2005). "Life Science Identifiers (LSIDs) have become the universally accepted identification scheme for the Life Science domain. Many basic concepts of data modeling map well into LSIDs, however there are also some other factors that must be carefully examined. These include naming, caching, and the distinction between data and metadata. This article guides you through these aspects and discusses the current best practices in each topic..."
[March 30, 2004] "Introduction to I3C and LSID." [Welcome to the 2004 Life Science ID Symposium.] By Andrew Ellicott (I3C Executive Director). Presented at the BioIT World Conference (March/April 2004). Covers: "What the I3C is and how we operate; What LSIDs are and how they are applied; How to develop LSID-enabled informatics resources; How LSID technology is evolving; How LSID relates to other industry standards..." See slide #13 for an overview of the LSID standardization process.
[January 12, 2004] "LSID: An Informatics Lifesaver." By Salvatore Salamone. In Bio-ITWorld (January 12, 2004). "The Interoperable Informatics Infrastructure Consortium (I3C) is trying to tackle such issues with a new naming standard and data access protocol called the Life Science Identifier (LSID). At its core, LSID provides a uniform way to name and locate specific pieces of informatics data over the Internet. The I3C, whose members come from life science companies, academic labs, and vendors such as IBM and Oracle, hopes that LSID will help enable interoperability between informatics applications. Work on LSID began in early 2003 when the I3C, in conjunction with vendor members including Sun Microsystems and IBM, developed a specification for naming data resources. Similar to a URL, LSID uses a uniform resource name (URN) to locate data. The URN contains five parameters that uniquely identify the data of interest. During 2003, LSID moved from a concept to a specification. And IBM developed open-source LSID software that some developers are already using and evaluating. But a true test of LSID's acceptance will be the interest I3C can muster from the informatics vendor community. The early signs are encouraging... While waiting for improved Web security, those wishing to use LSID do have ways to safely share data today. For instance, most companies have no problem using the Internet to access PubMed, GenBank, Swiss-Prot, and other public databases. If a company wants to share confidential or proprietary data using LSID, it could set up an internal LSID authority that sits behind the corporate firewall. In this way, a URN lookup would direct the informatics application to an internal server that supports the in-house database. Outside users would not have access to this LSID authority..."
[August 15, 2003] "Build a life sciences collaboration network with LSID. A common protocol provides the foundation." By Ben Szekely (Software Engineer, IBM). From IBM developerWorks. August 15, 2003. "If widely adopted, the Life Sciences Identifier protocol will enable scientists and researchers across multiple organizations to share data and collaborate in ways never before considered. You can build services that implement the LSID protocol using a combination of J2EE components that abstract away the protocol handling itself, leaving only the necessity of writing the service logic. Life Sciences Identifier (LSID) is a new naming standard and data-access protocol being developed in the Interoperable Informatics Infrastructure Consortium (I3C.org) along with help from IBM and other technology organizations such as Oracle, Sun Microsystems, and the Massachusetts Institute of Technology. A client application resolves an LSID against a special server called an authority to discover data and information about the data (metadata). The admittedly idealistic goal of LSID Resolution is for all biotech, pharmaceutical, and other life sciences organizations to build LSID Resolution Services in front of their data. With a common standard for data retrieval, scientists across these organizations may then easily share data, facilitating collaboration on such vital projects as drug discovery and disease research. The LSID Server Framework enables this LSID utopia by allowing organizations to provide their data using a service implementation that best matches their data source. Certain data sources will require only mapping from LSID to URL, if each piece of data has a URL that can retrieve it in a standard format. If the data source is a relational database, a more complex service will need to be written. This article shows how to build Resolution Services using Java 2 Enterprise Edition (J2EE)-based components.
We'll look at the LSID Client Stack, which provides LSID connectivity within applications; the LSID Server Framework, which enables rapid development and deployment of LSID Resolution Services; and select resolution service implementations. Finally, we'll see how enterprises might integrate these components to form an Enterprise LSID Resolution Network. The figures in this article illustrate the architecture of individual components as well as the interaction between them. The red text and arrows show the involvement of key Java classes..."
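As a rough illustration of the simplest case mentioned in the article excerpt above (a data source that needs only a mapping from LSID to URL), the following Python sketch rewrites an LSID into a retrieval URL using a lookup table. It does not reproduce the IBM LSID Server Framework, which is Java-based, nor the LSID resolution protocol itself. The table `URL_TEMPLATES`, the function `lsid_to_url`, and the `example.org` addresses are all hypothetical placeholders, and the optional revision field is ignored here for brevity.

```python
from typing import Dict, Tuple
from urllib.parse import quote

# Hypothetical URL templates keyed by (authority, namespace).
# example.org is a placeholder domain, not a real LSID authority.
URL_TEMPLATES: Dict[Tuple[str, str], str] = {
    ("example.org", "proteins"): "https://data.example.org/proteins/{object_id}",
    ("example.org", "genes"): "https://data.example.org/genes/{object_id}",
}


def lsid_to_url(lsid: str) -> str:
    """Map an LSID to a retrieval URL using a fixed template table.

    Sketch only: assumes a case-insensitive URN:LSID prefix and a
    case-insensitive authority (it is an Internet domain name).
    """
    fields = lsid.strip().split(":")
    if len(fields) not in (5, 6):
        raise ValueError(f"not an LSID: {lsid!r}")
    if fields[0].lower() != "urn" or fields[1].lower() != "lsid":
        raise ValueError(f"missing URN:LSID prefix: {lsid!r}")
    authority, namespace, object_id = fields[2].lower(), fields[3], fields[4]
    template = URL_TEMPLATES.get((authority, namespace))
    if template is None:
        # This toy authority does not serve the requested namespace.
        raise LookupError(f"no mapping for {authority}:{namespace}")
    # Percent-encode the object identifier so it is safe inside a URL path.
    return template.format(object_id=quote(object_id, safe=""))
```

A relational data source, as the excerpt notes, would need more than this: the lookup step would be replaced by service logic that queries the database and serializes the result.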
[May 27, 2003] "Build an LSID Resolution Service using the Java language. A Java-based Life Sciences Identifier authority consolidates biological data resources." By Stefan Atev (Programmer, IBM) and Ben Szekely (Software Engineer, IBM). From IBM developerWorks. May 27, 2003, updated March 03, 2004. "The amount of biological data being created today is mind-boggling. As a biologist or bioinformaticist, you probably know of places around the network that provide very useful resources for your task at hand -- but remembering the different ways to access this information is often a productivity drain. Maybe you write a few Perl scripts or know someone who will provide you with some code for this or a procedure for that. At this point, you may be thinking that coming up with a common way of naming and finding this data is the only way you will be able to remain a biologist and not a programmer. Of course, the value of having a common way to identify data extends beyond bioinformatics, but for this article we will stay within the life sciences. The Life Sciences Identifier (LSID) is an I3C Uniform Resource Name (URN) specification in progress. You can read more about the specification at the I3C (see Resources for a link). Conceptually, LSID is a straightforward approach to naming and identifying data resources stored in multiple, distributed data stores in a manner that overcomes the limitations of the naming schemes that are in use today. An LSID resolver is a software system that implements an agreed-upon LSID resolution protocol to allow higher-level software to locate and access the data uniquely named by any LSID URN. The "server" side of this resolver solution is called an LSID authority. The client stacks and an example client, the LSID LaunchPad, are provided by the LSID Resolution Protocol Project. In this article, you'll see how to create your own LSID Authority using the LSID resolver stack for the Java language..."