The OASIS Cover Pages: The Online Resource for Markup Language Technologies
Last modified: February 12, 2003
Resource Description and Classification

Being a collection of references on matters of Subject Classification, Taxonomies, Ontologies, Indexing, Metadata, Metadata Registries, Controlled Vocabularies, Terminology, Thesauri, and Business Semantics.

A collection of references and a survey based upon links and cribbings from various resources on the Internet. An unfinished and non-authoritative reference document. The references cited in this document are only incidentally related to XML; the survey was conducted in connection with work on the OASIS Registry and Repository Technical Committee (Fall 1999/Spring 2000).


Descriptive Cataloging in Libraries

Library of Congress Classification and Subject Headings (LCC/LCSH)

  • "One of the world's most widely spread classification schemes is the Library of Congress Classification System (LCC). This is largely due to the fact that every exported record from the Library of Congress contains their own classification of the item. Apart from being dominating, it is quite old: LCC will soon celebrate its centenary. In 1899 the Librarian of Congress Dr. Herbert Putnam and his Chief Cataloguer Charles Martel decided to start a new classification system for the collections of the Library of Congress (established 1800). Basic features were taken from Charles Ammi Cutter's Expansive Classification. LCC is an enumerative system built on 21 major classes, each class being given an arbitrary capital letter between A-Z, with 5 exceptions: I, O, W, X, Y (these appear at the second or third level in the notation for various subclasses). After this was decided, Putnam delegated the further development of different parts of the system to subject specialists, cataloguers and classifiers. Initially and intentionally the system was, and has remained, decentralised and the different classes and subclasses were published for the first time between 1899-1940. This has led to the fact that schedules often differ very much in number and the kinds of revisions accomplished." [From the DESIRE Report]

  • LC Classification Outline

  • LC classification resources from CDS. Including LC Classification Outline and Subject Headings [LCSH]

  • Cross-Classification Index from CyberStacks

  • Library of Congress Classification System. An outline of the LC system. "The Library of Congress Classification System (LC System) is used to organize books in many academic and university libraries throughout the United States and world. The LC System organizes material in libraries according to twenty-one branches of knowledge. The 21 categories (labeled A-Z except I, O, W, X, and Y) are further divided by adding one or two additional letters and a set of numbers. ... This outline is part of the Geography site at The Mining Company."

  • Web Resources Arranged by the Library of Congress Classification System

  • Library of Congress Subject Headings/Dewey Decimal Classification
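
The class-letter scheme quoted above (21 main classes, A-Z minus the five reserved letters) can be checked with a few lines of Python. This is an illustrative sketch, not an official LC API; the function name is an assumption for this example.

```python
import string

# The five capital letters the LCC reserves for deeper levels,
# as stated in the DESIRE Report excerpt above.
EXCLUDED = {"I", "O", "W", "X", "Y"}
LCC_MAIN_CLASSES = [c for c in string.ascii_uppercase if c not in EXCLUDED]

# 26 letters minus 5 exceptions leaves the 21 main classes quoted above.
assert len(LCC_MAIN_CLASSES) == 21

def is_lcc_main_class(letter: str) -> bool:
    """Hypothetical helper: is this single capital letter an LCC main class?"""
    return letter in LCC_MAIN_CLASSES
```

For example, `is_lcc_main_class("Q")` (Science) holds, while `is_lcc_main_class("I")` does not.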

Dewey Decimal Classification (DDC)

  • "The Dewey Decimal Classification System (DDC) was first produced by Melvil Dewey in 1876, originally being produced for a small North American college library. It is currently in its 21st edition (Mitchell 1995; Mitchell, et al. 1996) and is published by Forest Press. DDC is distributed in Machine-Readable Cataloguing (MARC) records produced by the Library of Congress (LC) and bibliographic utilities like OCLC and RLIN. DDC is also used in the national bibliographies of the UK, Canada, Australia, Italy and other countries (Comaromi, et al. 1990, p.6). Research carried out by OCLC in the 1980s established that DDC was a suitable tool for browsing, first for library catalogues and then for the Internet." [from the DESIRE Report]

  • "Both UDC and DDC are enumerative schemes and have a hierarchical structure that provide a systematic arrangement of subjects. Both schemes go beyond a strictly enumerative structure by the use of symbols (auxiliary tables) to enable compound subjects to be built (however this facility is not currently used by either of the services). There are many similarities between the notations of the two schemes (the UDC was originally adapted from the DDC scheme)..."

  • DDC from Forest Press

  • CyberDewey - DDC

  • Scorpion. "Scorpion is a project of the OCLC Office of Research exploring the indexing and cataloging of electronic resources. Since subject information is key to advanced retrieval, browsing, and clustering, the primary focus of Scorpion is the building of tools for automatic subject recognition based on well known schemes like the Dewey Decimal System. ... OCLC has initiated the Scorpion research project to address the challenge of applying classification schemes and subject headings cost effectively to electronic information. A thesis of Scorpion is that the Dewey Decimal Classification can be used to perform automatic subject assignment for electronic items... The Dewey Decimal Classification is the most widely used classification scheme in the world. Currently, it is used in 135 countries and has been translated into 30 languages. In the United States it is used in 95% of all public and school libraries, and a large number of college, university, and special libraries. First published in 1876, it has been continuously revised to meet evolving information access needs both in the traditional library and in the electronic environments. It is currently in its 21st edition. Dewey is a hierarchical classification scheme. Each concept is denoted by a number that concisely identifies it and indicates its position in the hierarchy. For readability Dewey numbers are presented with a minimum length of three. Dewey numbers longer than three have a decimal point after the first three digits. The first digit (hundreds) represents the major disciplines; the second digit (tens) represents sub-disciplines of the hundreds; the third digit (ones) represents sub-disciplines of the tens, and so forth. Each number is significant. Generally, a longer number represents a more specific concept, and a shorter number represents a more general concept..."
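
The Dewey number structure described in the Scorpion excerpt (three-digit minimum, decimal point after the third digit, each further digit narrowing the concept) can be sketched as a small helper. The function name and behaviour are assumptions for illustration, not an OCLC or DDC API; it assumes the usual minimum of three digits.

```python
def dewey_hierarchy(number: str) -> list[str]:
    """Return the chain of progressively broader Dewey classes for a
    number, e.g. '025.04' -> ['000', '020', '025', '025.0', '025.04'].
    Illustrative sketch only; assumes at least three digits."""
    digits = number.replace(".", "")
    # hundreds, tens, ones: the three levels encoded in the first digits
    chain = [digits[0] + "00", digits[:2] + "0", digits[:3]]
    # digits past the third each add one level of specificity
    for i in range(4, len(digits) + 1):
        chain.append(digits[:3] + "." + digits[3:i])
    # drop duplicates (e.g. '500' names all three of its own levels)
    seen: set[str] = set()
    return [c for c in chain if not (c in seen or seen.add(c))]
```

For instance, `dewey_hierarchy("025.04")` walks from the broad 000 class down to the specific 025.04 concept.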

Universal Decimal Classification (UDC)

  • "The UDC is a general classification scheme of over 60,000 classes as well as a number of auxiliary tables used to describe countries, areas, etc."

  • "The Universal Decimal Classification (UDC) is an international scheme which endeavours to cover all areas of knowledge. Its origins lie in the Dewey Decimal Classification (DDC), which was adapted towards the end of the nineteenth century by Paul Otlet and Henri LaFontaine in an attempt to create a universal bibliography. Until recently responsibility for the scheme belonged to the FID (Federation Internationale de Documentation); this responsibility was passed to a consortium of publishers (the UDC Consortium) in 1992. Its original purpose, ordering and indexing entries in a printed bibliography, has since been overtaken by its use for indexing and retrieval in computer-based systems. The scheme consists of 60,000 classes (divisions and sub-divisions) as well as a number of auxiliary tables to describe countries, etc." [From the DESIRE Report]

  • Note: "SOSIG uses the UDC (Universal Decimal Classification) scheme to create the browsing sections of the service. It was originally chosen in order to co-operate with other UK based Internet gateways and because there was no suitably broad social science classification scheme available. SOSIG does not use the UDC in its complete form, but has drawn upon a selection of over 160 UDC numbers pertaining to the social sciences from the scheme."

IFLA Section on Classification and Indexing

  • Check: International Federation of Library Associations and Institutions (IFLA), IFLA Section on Classification and Indexing. Principles Underlying Subject Heading Languages (SHLs), being "a culmination of several years' work by the Working Group on Principles Underlying Subject Heading Languages, was published in August 1999. The work was completed under the editorship of Maria Ines Lopes of Portugal and Julianne Beall of the United States...."

  • Section on Classification and Indexing

  • Newsletter 21. June 29, 2000.

  • Newsletter 18. Revised March 1999. [cache]

Standard Industry Taxonomies and Ontologies (Industries, Market Sectors, Products, Services, Functions)

North American Industry Classification System (NAICS)

  • "The North American Industry Classification System (NAICS) was developed by Statistics Canada, the Economic Classification Policy Committee of the US Office of Management and Budget, and Mexico's Instituto Nacional de Estadistica, Geografia e Informatica (INEGI). NAICS provides the structure for which the statistical agencies of Canada, Mexico, and the United States will compile comparable data. The NAICS structure consists of sectors, subsectors, industry groups and industries. This structure reflects the levels at which data comparability was agreed upon by the statistical agencies of the three countries. For some sectors, differences in the economies of the three countries prevent full compatibility at the NAICS industry level. Below the agreed-upon level of compatibility, each country may add additional detailed industries, as necessary to meet national needs, provided that this additional detail aggregates to the NAICS level." [source: NAICS Canada]

  • North American Industry Classification System (NAICS)

  • NAICS Codes and Titles

  • Product Classification System

  • NAICS Canada

  • Codes
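
The NAICS structure quoted above (sectors, subsectors, industry groups and industries) is encoded in the code digits themselves: in the published scheme a sector is 2 digits, a subsector 3, an industry group 4, a NAICS industry 5, and a national industry 6. The sketch below slices a code accordingly; the function and level names are assumptions for illustration, not an official agency API.

```python
# Digit counts per NAICS level, per the published NAICS structure.
NAICS_LEVELS = [
    (2, "sector"),
    (3, "subsector"),
    (4, "industry group"),
    (5, "NAICS industry"),
    (6, "national industry"),
]

def naics_levels(code: str) -> dict[str, str]:
    """Map each NAICS level name to its code prefix (illustrative sketch)."""
    return {name: code[:n] for n, name in NAICS_LEVELS if len(code) >= n}
```

For example, a six-digit code such as "541511" yields sector "54", subsector "541", and so on down to the full national industry code.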

UNSPSC - United Nations Standard Product and Services Classification

  • Browse UNSPSC codes. "We are the United Nations Standard Product and Services Classification (UNSPSC) Code organization. The UNSPSC was created when the United Nations Development Program and Dun & Bradstreet merged their separate commodity classification codes into a single open system. The UNSPSC Code is the first coding system to classify both products and services for use throughout the global marketplace. The Electronic Commerce Code Management Association (ECCMA) is a not for profit, unbiased, membership organization that oversees the management and development of the UNSPSC Code."

  • UNSPSC Web site

  • UN/SPSC GEN Initiative description: "UN/SPSC (Dun & Bradstreet's Standard Product and Service Codes) classifies more than 8,000 products and services around the world. The eight-digit, hierarchical code is the result of a merger of the United Nations' Common Coding System (UNCCS) and Dun & Bradstreet's Standard Product and Services Codes (SPSC). The UN/SPSC is a hierarchical classification, having five levels. Each level contains a two-character numerical value and a textual description..."

  • UNSPSC description from DIFFUSE

  • Using the UNSPSC

  • UNSPSC Codes, XML document format. Posted 2001-07-11 by John Evdemon to the OASIS ebXML list '' (John Evdemon, CTO and Director of Engineering, Vitria Technology).
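
Given the structure described above (a hierarchical code built from two-character levels), an eight-digit UNSPSC code splits into four pairs. The level labels used below (segment, family, class, commodity, with an optional fifth business-function level) are conventional names and an assumption here; the quoted text only says each level is a two-character numeric value.

```python
# Conventional level labels, broadest to most specific (assumed, see lead-in).
UNSPSC_LEVEL_NAMES = ["segment", "family", "class", "commodity", "business function"]

def unspsc_levels(code: str) -> dict[str, str]:
    """Pair each two-digit slice of a UNSPSC code with a level label.
    Illustrative sketch, not an ECCMA/UNSPSC API."""
    pairs = [code[i:i + 2] for i in range(0, len(code), 2)]
    return dict(zip(UNSPSC_LEVEL_NAMES, pairs))
```

An eight-digit code thus decomposes into segment, family, class and commodity; a ten-digit code would add the business-function pair.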

Standard Occupational Classification (SOC)

The Standard Occupational Classification (SOC) system from the U.S. Department of Labor, Bureau of Labor Statistics "will be used by all Federal statistical agencies to classify workers into occupational categories for the purpose of collecting, calculating, or disseminating data. All workers are classified into one of over 820 occupations according to their occupational definition. To facilitate classification, occupations are combined to form 23 major groups, 96 minor groups, and 449 broad occupations. Each broad occupation includes detailed occupation(s) requiring similar job duties, skills, education, or experience." [2003-02 statement]
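
The nesting described above (detailed occupations within broad occupations, within minor groups, within major groups) is visible in a SOC code of the common "XX-XXXX" form. The zero-padding convention used to name the group codes below follows the published SOC structure but is an assumption relative to the text above, and the function name is invented for illustration.

```python
def soc_groups(code: str) -> dict[str, str]:
    """Derive the broader SOC groupings for a detailed occupation code
    like '15-1021'. Illustrative sketch, not a BLS API."""
    prefix, detail = code.split("-")
    return {
        "major group": f"{prefix}-0000",          # e.g. 15-0000
        "minor group": f"{prefix}-{detail[0]}000",  # e.g. 15-1000
        "broad occupation": f"{prefix}-{detail[:3]}0",  # e.g. 15-1020
        "detailed occupation": code,
    }
```

Each level simply zeroes out the trailing digits of the more specific code.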

Central Product Classification (CPC)

  • "A reference classification of products based on the physical characteristics of goods or on the nature of the services rendered. CPC provides a framework for collection and international comparison of the various kinds of statistics dealing with goods and services. CPC covers products that are an output of economic activities, including transportable goods, non-transportable goods and services... CPC Version 1.0 will be updated by 2002. This update will be undertaken by the Technical Subgroup of the United Nations Expert Group on Classifications."

  • Description from UNSD

  • Description from DIFFUSE

  • Sponsored by United Nations Statistics Division

  • Central Product Classification Version 1.0

STEPml product identification and classification

The STEPml specification [Revision 1.0 February 7, 2001] addresses the requirements to identify and classify or categorize products, components, assemblies (ignoring their structure) and/or parts. Identification and classification are concepts assigned to a product by a particular organization. This specification describes the core identification capability upon which additional capabilities, such as product structure, are based. Those capabilities are described in other STEPml specifications and their use is dependent upon use of this specification...

GCL (Government Category List) from UK GovTalk e-Government Metadata Framework

Making it easier to find information is a key aim of Information Age Government, and is addressed in part by the e-Government Metadata Framework (e-GMF). The Government Category List (GCL) is a list of headings for use with the Subject element of the e-Government Metadata Standard (e-GMS). It will be seen in applications such as UK Online. Subject metatags drawn from the GCL will make it straightforward for website managers to present their resources in a directory structure using the GCL headings... The GCL is a living document which must evolve if it is to continue to serve the public in a world of changing technology and changing needs. Suggestions for improving it will be welcomed throughout its lifetime. During 2002, updates will be issued at four-monthly intervals.

NISO Electronic Thesaurus initiative

International Standard Industrial Classification of All Economic Activities (ISIC)

  • "International Standard Industrial Classification of All Economic Activities, Third Revision, (ISIC, Rev.3): A standard classification of economic activities arranged so that entities can be classified according to the activity they carry out. The categories of ISIC at the most detailed level (groups) are delineated according to what is in most countries the customary combination of activities described in statistical units. The major groups and divisions, the successively broader levels of classification, combine the statistical units according to the character, technology, organization and financing of production. Wide use has been made of ISIC, both nationally and internationally, in classifying data according to kind of economic activity in the fields of population, production, employment, gross domestic product and other economic activities."

  • View the scheme

  • ISIC GEN description: "The principle is quite similar to NACE with the same accuracy (4 digits), but it is an initiative of the United Nations and is therefore international."

Standard Industrial Classifications (SIC)

  • SIC (Standard Industrial Classifications) GEN description: "The SIC are used to classify industry categories and is used by the Small Business Administration (SBA) as a guide in defining industries for size classifications. The SICs are also used by the VANs to provide transaction filtering so that you only receive the transaction sets that relate to your business area."

  • Standard Industrial Classifications (SIC) Index. Sponsored by ITA (Information Technology Associates).

Nomenclature générale des Activités Economiques dans les Communautés Européennes (NACE)


  • PRODCOM GEN description: "PRODCOM ('PRODuction COMmunautaire', Community production). PRODCOM is the title of the European Community production statistics for Mining and Quarrying, Manufacturing, and Electricity, Gas and Water Supply, i.e., Sections C, D and E of the Statistical Classification of Economic Activities in the EC (NACE Rev. 1). The principle is quite similar to NACE, but it is far more accurate: 8 digits at the European level, 9 digits at the national level."

ISO BSR - Basic Semantic Repository

  • ISO BSR - Basic Semantic Repository GEN description: "The purpose of the BSR project is to provide an internationally agreed register of multilingual data concepts with its technical infrastructure. This will provide storage, maintenance and distribution facilities for reference data about semantic units (Basic Semantic Units - BSUs and other Semantic Units) and their links (bridges) with operational directories. It should be noted that the term 'directories' also includes repositories, and the single term directories will be used to include both throughout this document. Its principal function is to provide data in multiple languages that has been developed in a consistent, unambiguous manner according to international standards."

  • BSR description from DIFFUSE: "The ISO Basic Semantic Register (BSR) aims to provide a set of standardized semantics that can be "bridged" to the components of different EDI standards. The definitions of the terms in the register have been defined using the rules laid down in ISO 11179. The representation classes involved include: Amount, Code, Date, DateAndTime, Description, Identifier, Indicator, Name, Number, Label, Percent, Quantity, Rate, Text, Time, and Value. Each definition will be defined in three languages: English, French and German."

  • ISO TC 154 WG 1 - Basic Semantics Register (BSR). Including: (1) Semantic components, (2) Semantic Units, (3) Bridges to EDIFACT, and (4) Transport Glossary

Topic Maps Published Subjects

"A published subject is any subject for which a subject indicator has been made available for public use and is accessible online via a URI. [A subject indicator is a resource that is intended by the topic map author to provide a positive, unambiguous indication of the identity of a subject.]... The general intention behind published subjects is that topic maps interoperability needs non-ambiguous definition of subjects (reified by topics), that should be provided by trustable publishers, in resources available through stable URIs. Those addressable resources, called 'subject definition resources' will provide human-understandable and non-ambiguous definition of subjects, whereas their URIs will provide stable identifiers fit for computer processing, topic maps interoperability and merging, and many other foreseeable semantic applications... Since subject identity forms the basis for merging topic maps and interchanging semantics, authors are encouraged to always indicate the subject identity of their topics in the most robust manner possible, in particular through the use of standardized ontologies expressed as published subject indicators... [approximation, from the specs 2001/2002]"
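
The merging rule described above, namely that topics sharing a subject indicator URI reify the same subject and can therefore be merged, can be sketched with a small union-find. The data model and function names here are assumptions for illustration, not the topic maps specification's API.

```python
from collections import defaultdict

def merge_topics(topics: dict[str, set[str]]) -> list[set[str]]:
    """topics maps a topic id to its published subject indicator URIs.
    Topics that share any indicator (transitively) merge into one group.
    Illustrative sketch of the merging rule, not an XTM processor."""
    parent = {t: t for t in topics}

    def find(t: str) -> str:
        # path-halving union-find lookup
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    owner: dict[str, str] = {}  # indicator URI -> a topic that carries it
    for t, uris in topics.items():
        for uri in uris:
            if uri in owner:
                parent[find(t)] = find(owner[uri])  # same subject: merge
            else:
                owner[uri] = t

    groups = defaultdict(set)
    for t in topics:
        groups[find(t)].add(t)
    return list(groups.values())
```

Two topics with disjoint indicator sets stay separate, which is exactly why stable, published URIs matter for interoperable merging.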

BizCodes Initiative

  • The BizCodes initiative [apparently sponsored by the XML/edi Group] "is focused on promoting the concept of a universal reference system for XML based eBusiness that transcends individual schemas and industry specific exchange formats to provide a true 'lingua franca' for global eBusiness interchanges. This initiative is designed to facilitate the current efforts of standards groups like DISA/X12, UN/EDIFACT and working groups like CEFACT/OASIS with and also commercial efforts like RosettaNet, Microsoft and any other XML based eBusiness implementation. The work developed here is intended to be an open source, public domain collaborative project. The objective is to provide all implementors of XML based semantics for eBusiness a simple and concise system that allows global interoperability. The technology approach is to build XML syntax examples that illustrate the core concepts of the Bizcodes method, and to show how, with only minor modifications, by adopting drafts already published to the W3C XML Version 1.0 recommendation, we can quickly and easily implement a robust, scalable and maintainable system today. In addition a key need is to adhere to the tenets of the simplicity of XML V1.0 itself, that ensure consistent software behaviours across the major XML parser implementations themselves..."

  • BizCodes Initiative Web site

  • A multimedia presentation on Bizcodes

  • Working Draft for eDTD. Archive of eDTD specification and XML sample files. December, 1999. ['White Paper on eDTD Schema Proposal to ebXML Technical Architecture Committee, January 2000.']

  • eDTD for the travel document using BizCodes references.

  • XML/edi Group

Universal Data Element Framework (UDEF)

  • The UDEF "Universal Data Element Framework" is described as a 'Dewey Decimal-Like Indexing System' for the Web. . . "The integration and harmonization of data semantics is a recognized challenge that stands as a barrier to enterprise applications integration and seamless electronic business. The Universal Data Element Framework (UDEF) is a proven rules based approach for establishing data semantic context across multiple domains of discourse. If adopted by the XML community at large, the UDEF-based intelligent identifiers would enable businesses to achieve their XML driven internal (application-to-application) and external (business-to-business) integration visions."

  • UDEF Web Site

  • [May 01, 2000] UDEF Codes

  • See "Universal Data Element Framework (UDEF)."

Classifications of the Functions of Government (COFOG)

  • Purpose is to "classify the purpose of transactions such as outlays on final consumption expenditure, intermediate consumption, gross capital formation and capital and current transfers, by general government. Year of last revision: 1999."

  • UNSD Description

  • View the scheme

XBRL/AICPA Taxonomy for Commercial and Industrial Companies

  • Schema for GAAP Taxonomy from XBRL. "In XBRL, a taxonomy is a set of defined item types and relationships between those item types, where the item types describe financial statements and the information they contain. For example, "Current Assets" and "Assets" are item types, and the relation between them is that Current Assets roll up into Assets. The taxonomy provided in the documents below is intended to name the most relevant and commonly used disclosure items required by US GAAP for commercial and industrial companies, and to arrange them into a hierarchy that reflects their logical and arithmetic relationships. There are over 1100 terms, arranged in a hierarchy up to 12 levels deep. The purpose of this and other taxonomies produced within the scope of XBRL is to supply specific accounting terms and relationships that will be needed for exchanging data between software applications used by companies, lenders, investors, auditors, and regulators, in accordance with existing reporting standards. The US GAAP Taxonomy for Commercial and Industrial companies is only the first in a planned series of taxonomies within XBRL. Experts in the reporting standards of jurisdictions and of other types of commercial enterprises are already at work on some of these taxonomies..." The XBRL Web site has exemplars of the 'Taxonomy for Commercial and Industrial Companies' in several formats (all layers, 2,3,4,5 Layers Deep, 8 Layers Deep) in MS Word and PDF format.
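
The "roll up" relation described above (Current Assets roll up into Assets) amounts to summing a hierarchy of items. The toy taxonomy fragment below is invented for illustration; a real XBRL taxonomy defines these item types and relationships in schema documents, and the figures here are made up.

```python
# Hypothetical roll-up hierarchy: parent item -> items that roll up into it.
CHILDREN = {
    "Assets": ["CurrentAssets", "NoncurrentAssets"],
    "CurrentAssets": ["Cash", "Receivables"],
}

def roll_up(item: str, facts: dict[str, float]) -> float:
    """Total for an item: its own reported fact if present, otherwise
    the sum of the items that roll up into it. Illustrative sketch."""
    if item in facts:
        return facts[item]
    return sum(roll_up(child, facts) for child in CHILDREN.get(item, []))

# Invented example figures for the sketch.
facts = {"Cash": 50.0, "Receivables": 30.0, "NoncurrentAssets": 120.0}
```

Here `roll_up("CurrentAssets", facts)` sums Cash and Receivables, and `roll_up("Assets", facts)` adds the noncurrent figure on top, mirroring the arithmetic relationships the taxonomy encodes.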

DAML-ONT Ontology

The "DARPA Agent Markup Language (DAML)" is a new effort to "help bring the 'semantic web' into being, focusing on the eventual creation of a web logic language. DAML is being designed as an XML-based semantic language that ties the information on a page to machine-readable semantics (ontology). DAML represents joint work between DoD, industry and academia in both the US and the European Community and we hope it will lead to the eventual web standard in this area." The W3C mailing list '' hosts a very active discussion on the developing DAML Ontology Language Specification, released in October 2000. Several new resources are available from the project web sites. The DAML Ontology Library provides a summary of submitted ontologies, sortable by URI, Submission Date, Keyword, Open Directory Category, Class, Property, Funding Source, and Submitting Organization.

  • [October 10, 2000] DAML-ONT Initial Release - "the initial version of the DAML Ontology language specification, released in October 2000." From the announcement 10-October-2000: "This first release is a draft language for the 'ontology' core of the language (roughly corresponding to a frame-based or description logic starting place) - this allows the definition of classes and subclasses, their properties, and a set of restrictions thereon. The language does not yet include any specification of explicit inference rules, which we hope will follow. We believe this ontology core will be a useful starting place for extending the language, and for experiments in a web-based semantic language that is accessible to a wide audience. The language is based on RDF (actually RDF Schema) and extends it to include new concepts. A group of researchers from MIT's Laboratory for Computer Science had primary responsibility for developing this language core. This was then extended by a number of people including representatives from the OIL effort, the SHOE project, the KIF work, and DAML contractors."

  • DAML Ontology Library [cache 2000-11-25]

  • "Annotated DAML Ontology Markup - Walkthrough." This is an annotated walk through an example DAML Ontology. The example ontology demonstrates each of the features in DAML-ONT, an initial specification for DAML Ontologies. Superscripted text refers to notes at the end of this document. The original example ontology is available separately. DAML builds on existing Web technologies like Extensible Markup Language [XML] and Uniform Resource Identifiers (URIs). The references and suggested reading section cites introductory materials as well as official specifications for these technologies. [cache]

  • Example ontology for DAML-ONT. [cache]

  • W3C discussion list for DAML-ONT: ''

IEEE Standard Upper Ontology (SUO)

Scope: "This standard will specify the semantics of a general-purpose upper level ontology. An ontology is a set of terms and formal definitions. This will be limited to the upper level, which provides definitions for general-purpose terms and provides a structure for compliant lower level domain ontologies. It is estimated to contain between 1000 and 2500 terms plus roughly ten definitional statements for each term. It is intended to provide the foundation for ontologies of much larger size and more specific scope. (1) The standard will be suitable for automated logical inference to support knowledge-based reasoning applications. (2) This standard will enable a large (20,000+) general-purpose standard ontology of common concepts to be developed, which will provide the basis for middle-level domain ontologies and lower-level application ontologies. (3) The ontology will be suitable for 'compilation' to more restricted forms such as XML or database schema. This will enable database developers to define new data elements in terms of a common ontology, and thereby gain some degree of interoperability with other compliant systems. (4) Owners of existing systems will be able to map existing data elements just once to a common ontology, and thereby gain some degree of interoperability with other representations which are compliant with the SUO. (5) Domain-specific ontologies which are compliant with the SUO will be able to interoperate (to some degree) by virtue of the shared common terms and definitions. (6) Applications of the ontology will include: (a) E-commerce applications from different domains which need to interoperate at both the data and semantic levels. (b) Educational applications in which students learn concepts and relationships directly from, or expressed in terms of, a common ontology. This will also enable a standard record of learning to be kept. (c) Natural language understanding tasks in which a knowledge based reasoning system uses the ontology to disambiguate among likely interpretations of natural language statements."

Upper Cyc Ontology

"Cycorp welcomes you to its first major public release: approximately 3,000 terms capturing the most general concepts of human consensus reality. We refer to this as the "upper Cyc ontology." The full Cyc knowledge base (KB) includes a vast structure of more specific concepts descending below this upper level. Over the past dozen years, we have also entered into Cyc literally millions of logical axioms -- rules and other assertions -- which specify constraints on the individual objects and classes found in the real world. Further specializations have been developed for our customers, especially in recent years, driven by their application needs..."

News Industry and Metadata Initiatives

IPTC [International Press Telecommunications Council] Subject Reference System

  • IPTC Subject Reference System. "The IPTC Subject Reference System has been developed to allow Information Providers access to a universal language-independent coding system for indicating the subject content of news items. The system is explained and documented in the IIM Guideline 3. IPTC has now updated the Subject Reference System to Version 4. This update contains all the Olympic Sports and Qualifiers allowing individual sports events to be precisely coded. The new version is available as an XML file together with a free viewer for Windows. The zip package can be downloaded here. Translations are made available from IPTC members to provide guidance to non-English language providers."

  • IPTC [International Press Telecommunications Council] Subject Reference System

Resource Organisation And Discovery in Subject-based services (ROADS)

  • Resource Organisation And Discovery in Subject-based services (ROADS)

  • "ROADS is a software tool-kit allowing gateway managers to pick and choose what parts of the software they require whilst allowing the integration of other software according to requirement." One of the purposes of ROADS is "to participate in the development of standards for the indexing, cataloguing and searching of subject-specific resources." This open-source software toolkit is being produced by a consortium of developers with expertise in network-based resource identification, indexing and cataloguing. The ROADS project has three partners: (1) the Department of Computer Science at Loughborough University of Technology; (2) the ILRT (Institute for Learning and Research Technology) at Bristol University; (3) UKOLN (the UK Office for Library and Information Networking) at the University of Bath.

  • ROADS Description

Development of a European Service for Information on Research and Education (DESIRE)

  • DESIRE Home Page

  • "The DESIRE project involves collaboration between project partners working at ten institutions from four European countries - the Netherlands, Norway, Sweden and the UK. In DESIRE phase 1, research focused on subject based search services based on selection, description and classification of high quality networked resources ... DESIRE phase 2 began in July 1998 and the 10 partners continue this work, but with a more focussed scope: distributed Web indexing, subject-based Web cataloguing, directory services, and caching."

  • DESIRE Metadata Framework Registry. By Rachel Heery, Tracy Gardner, Michael Day, and Manjula Patel. 2000-03-31. "Metadata registries enable authoritative information about metadata schemes to be declared and thus support the extensibility and evolution of element sets and provide some basis for interoperability. The DESIRE metadata registry demonstrates how a metadata registry might work. Elements from several different metadata element sets, including Dublin Core, have been added. This report gives a detailed technical overview of the DESIRE metadata registry implementation and its data model, additional information on the element sets (namespaces) included in the registry and some comments on metadata mappings and cross-walks. . . The DESIRE registry implementation follows the general principles of the ISO/IEC 11179 standard for the specification and standardisation of data elements. Unlike most ISO/IEC 11179 based registries, however, the DESIRE registry implementation has been designed to present data elements from multiple namespaces in a consistent manner, rather than for the maintenance of authoritative definitions of data elements under a single namespace. This means that in addition to providing basic registry functions, the DESIRE registry implementation can provide mappings between different metadata schemes. Within the registry, data elements are mapped onto a single semantic layer - in this case those defined in the ISO Basic Semantics Register (BSR) - so that the mapping process is simplified if and when new metadata vocabularies are added to the registry." See the HTML version of the document.

  • DESIRE Information Gateways Handbook

  • Prototype metadata registry

  • "DESIRE: Project Deliverable." By Michael Day, Anna Brümmer, Debra Hiom, Marianne Peereboom, Alan Poulter, and Emma Worsfold. A landmark study. "Classification schemes have a role in aiding information retrieval in a network environment, especially for providing browsing structures for subject-based information gateways on the Internet. Advantages of using classification schemes include improved subject browsing facilities, potential multi-lingual access and improved interoperability with other services. Classification schemes vary in scope and methodology, but can be divided into universal, national general, subject-specific and home-grown schemes. What type of scheme is used, however, will depend upon the size and scope of the service being designed. A study is made of classification schemes currently used in Internet search and discovery services, particular reference being given to the following schemes: Dewey Decimal Classification (DDC); Universal Decimal Classification (UDC); Library of Congress Classification (LCC); Nederlandse Basisclassificatie (BC); Sveriges Allmänna Biblioteksförening (SAB); Iconclass; National Library of Medicine (NLM); Engineering Information (Ei); Mathematics Subject Classification (MSC) and the ACM Computing Classification System (CCS). Projects which attempt to apply classification in automated services are also described including the Nordic WAIS/WWW Project, Project GERHARD and Project Scorpion."

  • Mapping Classification Schemes

  • The role of classification schemes in Internet resource description and discovery

  • See: SOSIG [Social Science Information Gateway] Classification Scheme

  • See: Biz/ed [Business Education on the Internet] Classification Scheme

  • See: Biz/ed SOSIG Mapping File

Social Science Information Gateway (SOSIG)

  • The Social Science Information Gateway (SOSIG) "aims to provide a trusted source of selected, high quality Internet information for researchers and practitioners in the social sciences, business and law. It is part of the UK Resource Discovery Network."

  • "In 1996 SOSIG became a key player in DESIRE, a Telematics for Research project funded under the European Commission's Fourth Framework Programme. DESIRE aims to develop and enhance Internet services for researchers in Europe, and one of the many areas of research has been the potential for an international network of subject gateways... SOSIG uses harvester technology, developed by the DESIRE project, to automatically create a database of Web resources. The technology uses the URLs from the manually created SOSIG catalogue as seeds in a process that uses robots to gather further resources from the Internet. This data is available for keyword searching but cannot be browsed unless classified. SOSIG has been looking at ways of using an automatic classification system, also developed through DESIRE, with the harvested database. An automatic classification system requires a list of terms that are associated with the various sections of the classification scheme used. The term/class number associations will also be weighted. To generate such a classification vocabulary, SOSIG has created a program that extracts the manually assigned keywords from the main catalogue and analyses the frequency with which they appear in records classified with particular class numbers - in this case UDC. This vocabulary is then used with the DESIRE autoclassification system to assign one or more UDC class numbers to the harvested records..."

  • SOSIG Thesauri and RDF: We are exploring ways of generalising our approach to the storage, query and interchange of controlled vocabulary structures such as HASSET and the UDC classification headings used in SOSIG. For this we have adopted the W3C's new Resource Description Framework (RDF) as a data modelling formalism, and have developed storage systems and simple query interfaces to allow thesaurus (and many other) data structures to be represented in a very general manner. The use of RDF also provides us with a syntax (XML) for exchanging controlled vocabulary data with other applications and services. The original method of data storage that was used for the SOSIG thesaurus data involved the creation of query, storage and user-interface facilities that were specific to SOSIG's use of HASSET. The new RDF/XML prototype allows us to reproduce all the facilities currently offered by the SOSIG thesaurus but in a more standards-based manner. The use of a generalised RDF system makes software and data re-use much simpler. The RDF system will be available as part of the DESIRE open-source software toolkit. It consists of a set of Perl modules that implement a "triple storage" system that will be conformant with the RDF data model as specified in the W3C RDF Model and Syntax specification."
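
As a rough illustration of the "triple storage" idea described above (not the DESIRE Perl modules themselves), the following Python sketch stores controlled-vocabulary relations as (subject, predicate, object) triples, in the spirit of the RDF data model, and answers simple pattern queries. The property names `narrowerTerm`/`broaderTerm` and the example terms are hypothetical, not taken from HASSET or UDC.

```python
# Minimal "triple storage" sketch for controlled-vocabulary data,
# loosely modelled on the RDF (subject, predicate, object) data model.
# All property names and terms below are illustrative.

class TripleStore:
    def __init__(self):
        self.triples = set()

    def add(self, s, p, o):
        self.triples.add((s, p, o))

    def query(self, s=None, p=None, o=None):
        """Return all triples matching the given pattern (None = wildcard)."""
        return [(ts, tp, to) for (ts, tp, to) in self.triples
                if (s is None or ts == s)
                and (p is None or tp == p)
                and (o is None or to == o)]

store = TripleStore()
store.add("Economics", "narrowerTerm", "Econometrics")
store.add("Economics", "narrowerTerm", "Labour economics")
store.add("Econometrics", "broaderTerm", "Economics")

# All narrower terms of "Economics":
narrower = sorted(o for (_, _, o) in store.query(s="Economics", p="narrowerTerm"))
```

Because every relation is just a triple, the same store can hold thesaurus hierarchies, classification headings, or any other vocabulary structure without schema changes, which is the generality the passage above attributes to the RDF approach.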

Dublin Core Metadata Project

  • "Metadata for Electronic Resources: The Dublin Core is a metadata element set intended to facilitate discovery of electronic resources. Originally conceived for author-generated description of Web resources, it has attracted the attention of formal resource description communities such as museums, libraries, government agencies, and commercial organizations..."

  • DC specification for "Subject": "The topic of the resource. Typically, subject will be expressed as keywords or phrases that describe the subject or content of the resource. The use of controlled vocabularies and formal classification schemes is encouraged. Select subject keywords from either the Title or Description information. If the subject of the item is a person or an organization, use the same form of the name as you would if the person or organization were a Creator, but do not repeat the name in the Creator element. In general, choose the most significant and unique words for keywords, avoiding those too general to describe a particular item. This element might well include classification data (for example, Library of Congress Classification Numbers or Dewey Decimal numbers) or controlled vocabularies (such as Medical Subject Headings or Art and Architecture Thesaurus descriptors) as well."

  • DC Home Page

  • Dublin Core FAQ Document

  • Dublin Core Usage Guidelines

  • User Guide Working Draft 1998-07-3

  • Approval of initial Dublin Core Interoperability Qualifiers "the DC-Usage Committee has completed balloting of the initial round of proposed Dublin Core Interoperability Qualifiers. These qualifiers are intended to promote interoperability among applications that use element refinements and encoding schemes to increase the semantic precision of metadata. The Dublin Core Metadata Initiative will issue recommendations in the near future about syntactic encoding of approved qualified Dublin Core in HTML, XML, and RDF... Approved Qualifiers for DC 'Subject' use the Encoding Schemes 'LCSH, MeSH, DDC, LCC, UDC'..."

  • "The SUBJECT element may contain various identifiers relating to the subject of the resource, such as keywords or classification notations. The default USMARC mapping is to field 653 (Uncontrolled subject access), although specific fields such as 650 for LC Subject Headings or 050 for LC Classification numbers may be used if the metadata include identification of such subject schemes. This element does not involve descriptive cataloging covered by AACR2, but it should be noted that this is not a transcribed element. Therefore, it may be used without further modification. Its usefulness will be determined by the specificity of the scheme identification. In a catalog that uses controlled subject headings and classification, uncontrolled keywords will be less useful than controlled headings and classification." From Dublin Core Metadata and the Cataloging Rules in the ALA Task Force on Metadata and the Cataloging Rules, Final Report
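
The default-and-override mapping described above can be sketched as follows. This minimal Python illustration covers only the three fields named in the text (653 as the uncontrolled default, 650 for LC Subject Headings, 050 for LC Classification numbers); the function name and dictionary are the sketch's own, not a standard API, and a fuller mapping would need the USMARC documentation.

```python
# Sketch of the Dublin Core SUBJECT -> USMARC field mapping described above.
# Field 653 (uncontrolled subject access) is the default; recognized subject
# schemes map to their specific fields. Only the fields named in the text
# are included here.

SCHEME_TO_FIELD = {
    "LCSH": "650",  # Library of Congress Subject Headings
    "LCC": "050",   # Library of Congress Classification numbers
}

def usmarc_field(subject_term, scheme=None):
    """Return a (field, term) pair for a DC Subject value, defaulting to 653."""
    field = SCHEME_TO_FIELD.get(scheme, "653")
    return (field, subject_term)

print(usmarc_field("Information storage and retrieval systems", "LCSH"))
# -> ('650', 'Information storage and retrieval systems')
print(usmarc_field("hypertext"))
# -> ('653', 'hypertext')
```

As the quoted report notes, the usefulness of such a mapping depends entirely on whether the metadata identifies the scheme: without it, everything falls into the uncontrolled 653 bucket.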

  • "Guidance on expressing the Dublin Core within the Resource Description Framework (RDF)." Editors: Eric Miller, Paul Miller and Dan Brickley.

The TEI Header

ACM Computing Classification System (CCS)

  • "ACM's first classification system for the computing field was published in 1964. Then, in 1982, the ACM published an entirely new system. New versions based on the 1982 system followed, in 1983, 1987, 1991, and now 1998."

  • "The ACM (Association for Computing Machinery) Computing Classification System has become a standard for identifying and categorising computing literature, as well as areas of computing interest and/or expertise. The current taxonomy for categorising the computing literature saw its first release in 1982. Until recently, CCS was named the Computing Reviews Classification System (CRCS); it was renamed in recognition of its general use as a standard for classifying the computing literature. The 1991 Classification System is a cumulative revision of the 1982 version of the Computing Reviews Classification System. The 1982 Classification System had in turn superseded the previous CR classification introduced in 1964. The Classification has two main parts: a numbered tree containing unnumbered subject descriptors, and a General Terms list. The unnumbered subject descriptors are essentially fourth level nodes...." [description from DESIRE Report]

  • ACM 1998 list - Said to be 'Valid in 2000'. The full classification scheme involves three concepts: the four-level tree (containing three coded levels and a fourth uncoded level), General Terms, and implicit subject descriptors.
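
The coded-level structure just described (three coded levels, with the fourth level left uncoded as subject descriptors) can be illustrated with a small sketch. The helper function is hypothetical, and "D.3.3" is used only as a sample dotted notation of the kind CCS uses.

```python
# Sketch of the ACM CCS coded-notation structure described above: a node's
# dotted code encodes its position in the tree, so its ancestor chain can
# be recovered by truncating the code one component at a time.
# The function and example code are illustrative only.

def ccs_levels(code):
    """Return the chain of ancestor codes for a coded CCS node, root first."""
    parts = code.split(".")
    return [".".join(parts[:i + 1]) for i in range(len(parts))]

# A third-level node has three coded ancestors, itself included:
print(ccs_levels("D.3.3"))   # -> ['D', 'D.3', 'D.3.3']
```

The fourth level carries no code of its own, which is why the subject descriptors hang off a coded third-level node rather than extending the dotted notation.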

  • Introduction to CCS

  • CCS Committee Update Report

  • How to Classify Works Using ACM's Computing Classification System

Engineering Information Classification Codes (Ei)

  • "The Ei Classification Codes are a classification scheme developed by Engineering Information, Inc. Engineering Information, also known as Ei, was created in 1884 at Washington University, St. Louis. Ei's aim is to identify, organise and facilitate easy access to the published engineering literature of the world. The system has been further subdivided (1993) and now comprises six main categories, subdivided into 38 subject series and over 800 individual classes. Up to four levels of increasing specificity are provided below the main categories. It is a numeric scheme, but not hierarchical in content."

  • Engineering Electronic Library, Sweden (EELS) Ei scheme

  • Browse the subject tree

AGRICOLA Subject Category Codes

  • AGRICOLA Subject Category Codes

  • "The AGRICOLA Subject Category Codes (SCC) are the codes used by [US] National Agricultural Library indexers and catalogers to categorize bibliographic records in the AGRICOLA database. They are also used with AgDB. The schema is a superset of the AGRIS Subject Category Code system. The scope notes presented here represent the rule-set used by NAL staff in deciding which codes to assign to bibliographic records. Users already familiar with the SCCs and having browser table capability may wish to use the by-code table access provided below. Those unfamiliar with the codes and/or lacking table functionality might better use the alphabetical title rendering that follows the table. For a hierarchical arrangement of the subject structure, see the hierarchical view. The full text of the scope notes may also be searched..."

  • AGRICOLA Subject Category Codes Hierarchical View

Internet Portals and their Search Interfaces

The subject taxonomies/hierarchies now used in the Yahoo, AltaVista, and Open Directory Project indexes appear to have been created ad hoc, and they appear to change considerably over time. Presumably a large staff is needed to evolve these classification schemes as new categories become relevant to users.

  • Yahoo. 14 top-level subject categories. [Somewhat dated description:] "Some Web sites have tried to organise knowledge on the Internet by devising their own classification scheme. Yahoo!, created in 1994, lists Web sites using their own universal classification scheme or 'ontology', which contains 14 main categories. Each Web site collected for Yahoo! is listed under one of 20,000 categories or sub-categories (Steinberg 1996), the scheme being developed over time by the 20 people doing the classification work..." [From the DESIRE Report]

  • "AltaVista Directory: The Web's Largest" - 16 top-level categories; background? history?

  • DMOZ - Open Directory Project. 15 top-level categories, 257,124 categories total. Apparently "Netscape, Lycos, HotBot" use this system.

General/Miscellaneous References and Authority Lists

I have not had time to properly organize these references, though some of them link to excellent resources.

  • "Standards and Specifications List." From DIFFUSE.ORG (Martin Bryan). An excellent resource.

  • The GEN Initiative documents a number of systems in its 'Domain-Independent Nomenclatures': ISO BSR, NACE, PRODCOM, UN/SPSC, ISIC, SIC. See Global Engineering Networking Initiative (GEN).

  • Business Semantics From DIFFUSE.ORG. See also Electronic Commerce Architecture Standards.

  • [April 10, 2001] "Automated Name Authority Control and Enhanced Searching in the Levy Collection." By Tim DiLauro, G. Sayeed Choudhury, Mark Patton, and James W. Warner (Digital Knowledge Center Milton S. Eisenhower Library Johns Hopkins University) and Elizabeth W. Brown (Cataloging Department Milton S. Eisenhower Library Johns Hopkins University). In D-Lib Magazine [ISSN: 1082-9873] Volume 7, Number 4 (April, 2001). "This paper is the second in a series in D-Lib Magazine and describes a workflow management system being developed by the Digital Knowledge Center (DKC) at the Milton S. Eisenhower Library (MSEL) of The Johns Hopkins University. Based on experience from digitizing the Lester S. Levy Collection of Sheet Music, it was apparent that large-scale digitization efforts require a significant amount of human labor that is both time-consuming and costly. Consequently, this workflow management system aims to reduce the amount of human labor and time for large-scale digitization projects..."

  • CENDI Conference on "Controlled Vocabulary and the Internet" - September 1999. Links to some of the presentations.

  • Beyond Bookmarks: Schemes for Organizing the Web "Beyond Bookmarks: Schemes for Organizing the Web is a clearinghouse of World Wide Web sites that have applied or adopted standard classification schemes or controlled vocabularies to organize or provide enhanced access to Internet resources. Beyond Bookmarks is compiled and maintained by Gerry McKiernan, Science and Technology Librarian and Bibliographer, Science and Technology Services Department, Iowa State University Library, and Curator, CyberStacks(sm), Iowa State University, Ames, IA 50011."

  • Cooperative Online Resource Catalog "CORC is a state of the art, Web based system that helps libraries provide well-guided access to Web resources using new, automated tools and library cooperation. CORC empowers librarians with automated tools for the cooperative creation, selection, organization, and maintenance of web-based resources. CORC provides libraries with access to a large, growing database of high-quality, library-selected Web-based electronic resource descriptions. It allows libraries to target electronic resources available on the Web and collect only the information that is worth having. In addition, CORC succeeds at the difficult tasks of maintaining currency of records and providing relevant Web-based electronic resources through the collaborative effort of its participating libraries. CORC does all of this by: (1) harvesting and formatting basic information about Web-based electronic content (automating the creation of metadata), which reduces typing and cut-and-paste operations. It also assigns appropriate classification numbers and subject headings (2) applying authority control to electronic resources with automatic DDC number assignment, Library of Congress Subject Headings and access to name authority databases (3) providing libraries with access to unique, advanced tools for building and automating the maintenance of single- or multi-library 'pathfinders' (4) using advancing new standards (e.g., Dublin Core, XML, RDF) adding value to established standards (e.g., MARC, DDC) and cooperatively developing best practices for managing library access to electronic resources available over the Web."

  • "Content Enhancement." Discussion of subject classification and controlled vocabularies. From Metacode Technologies, Inc.

  • Historical 'Metadata' Reference Collection [but not current]

  • LBNL EPA Scientific Metadata Standards Project

  • "Provide browsing using classification schemes." Handbook chapter from DESIRE.

  • Using Library Classification Schemes for Internet Resources

  • Global Information Locator Service (GILS)

  • GTAP economic sectors "The Version 4 GTAP data base contains detailed bilateral trade, transport and protection data characterizing economic linkages among regions, linked together with individual country input-output data bases which account for intersectoral linkages among the 50 sectors within each of 45 Regions."

  • Codes from the Harmonized Tariff System (US) 8-digit

  • Some electronic classification schemes From OCLC.

  • Controlled vocabularies reference guide

  • Syllabus and reading list for subject analysis and classification

  • The American Society of Indexers

  • NLM Classified Subject Index

  • Mathematics Subject Classification 1995

  • "Components and Structure of a VHG [Virtual Hyperglossary]." "The components of a termEntry are described formally in the DTD, but the current section is a general introduction for non-terminologists... The VHG relies heavily on the emerging ISO FDIS 12620 standard for data categories in terminology, which describes about 300 categories used by terminologists. Terminology requires great precision in the use of words and phrases and for industrial-strength applications you should be careful to use words in a way that is consistent with FDIS 12620... Curators will often wish to add structure to their VHGs. Thus gas in the scientific sense will often be linked to other terms. These might include latent heat, vapour pressure, critical temperature and many more. To systematise this, the curator of a VHG might create a parent termEntry such as vaporisation. For melting phenomena she might create fusion. To unite both of these she might create an even higher level termEntry, phase change... The creation of such classifications requires a great deal of work, technical, organisational and usually political. Ontologies are highly personal, and there are frequently battles over classifications, taxonomies and related approaches. Often they are dynamic, and may have poorly developed terms." See "Virtual Hyperglossary (VHG)."

  • The SCHEMAS Project "SCHEMAS provides a forum for metadata schema designers involved in projects under the IST Programme and national initiatives in Europe. SCHEMAS will inform schema implementers about the status and proper use of new and emerging metadata standards. The project will support development of good-practice guidelines for the use of standards in local implementations. It will investigate how metadata registries can support these aims. The SCHEMAS project is funded as part of the Information Society Technologies (IST) Programme, a theme of the European Union's Fifth Framework Programme managed by the Information Society Directorate-General of the European Commission. Work commenced on the project at the beginning of February 2000 and is scheduled to run until December 2001... Registry: One important focus of this Web site will be a registry of metadata schemas. This registry will hold links to elements and definitions in whatever form they may be available, whether as formal standards or as Web pages for specific projects. It will cover schemas across a broad range of functional requirements, from resource discovery to rights management and digital preservation. The registry itself will serve as a good-practice example of registry use and benefits; details of its configuration will be made available as a technology baseline for other registry implementers. One part of this registry will focus specifically on the schema format of the Resource Description Framework (RDF), a standard for supporting the exchange of metadata on the Web that is up for recommendation by the World-Wide Web Consortium (W3C). The RDF Schema format encodes metadata schemas with explicit Web links to related parent and child schemas -- a capability that is particularly useful for linking translations of schemas in multiple languages. The SCHEMAS project will produce documentation to help implementers use RDF schemas to link their local schemas to the parent schemas or standards on which they are based."

  • myRDF - An RDF Toolkit for Librarians (and other Knowledge Managers). "The myRDF Toolkit is a collection of application modules designed to be combined to support different applications focused on the management, integration and navigation of metadata... Test the Dublin Core Data Model... black box applications, etc... test the applicability of RDF with off-the-shelf SQL databases using open technologies (Apache web server, JDBC, Servlets, Java Server Pages, etc.)"

  • The Open Metadata Registry The Dublin Core Metadata Initiative's Open Metadata Registry is a database of RDF schemas that provides registration, navigation and reuse of semantics defined by various resource description communities. Status: (2000-05-08) This system is a prototype that has been seeded with vocabularies that have not been endorsed by any metadata initiative. This system is designed to illustrate the functionality of RDF as a means for effectively managing semantics and is currently under development.

  • RSLP Collection Description "This project will work with other RSLP projects, enabling them to describe their collections in a consistent and machine readable way. Based on a thorough modelling of collections and their catalogues, the project will develop a collection description metadata schema and associated syntax using the Resource Description Framework (RDF). We will develop a simple Web-based tool in order that projects can describe their collections and prototype a search service based on a database of such descriptions."

  • Cataloging and Classification Quarterly

  • IFLANet Digital Libraries: Metadata Resources

  • Multilingual Subject Entry (Muse) - an attempt to link and harmonize LCSH, RAMEAU (French), and Schlagwortnormdatei (German)

  • Country Names

  • Countries

  • Flags of Countries

  • Currencies

  • Dictionary of Occupational Titles (DOT) Index


[1] This reference document has been created by someone with minimal experience (and no formal training) in the science of ontology, classification, etc. For this reason, among several, it should not be trusted. There are undoubtedly large gaps in coverage, misunderstandings of concepts, etc. Use it with appropriate caution.

Robin Cover, Editor: