[Archive copy mirrored from: http://www.ariadne.ac.uk/issue10/dublin/]
Andy Powell presents three models for the way in which metadata can be managed across a Web-site and describes some of the tools that are beginning to be used at UKOLN to embed Dublin Core metadata into Web pages. This article appears in the Web version of Ariadne only.
The Dublin Core Metadata Element Set (the Dublin Core) [1] is a 15-element metadata set that is primarily intended to aid resource discovery on the Web. The elements in the Dublin Core are TITLE, SUBJECT, DESCRIPTION, CREATOR, PUBLISHER, CONTRIBUTOR, DATE, TYPE, FORMAT, IDENTIFIER, SOURCE, LANGUAGE, RELATION, COVERAGE and RIGHTS. As we begin to consider some initial implementations using the Dublin Core we need to consider how best to manage large amounts of metadata across a Web-site. The ways in which we manage Dublin Core metadata need to be able to cope with potential changes in the syntax used to embed elements into HTML, and to allow for the migration of metadata to other formats [2], for example future versions of PICS labels.
Summaries of the current state of the Dublin Core are available elsewhere [3]. In short, the element set is now stable and the ways in which Dublin Core records can be embedded into HTML Web pages are fairly widely agreed. There is still some discussion about the use of some of the elements; however, Dublin Core is beginning to be used in several projects [4].
This article is concerned primarily with the practical issues of using Dublin Core metadata to describe Internet resources. It will concentrate on embedding Dublin Core into HTML Web pages with a view to what can be done now - and how it can be done. Three areas need to be considered:
Firstly, here is a simple example:
<HTML>
<HEAD>
<TITLE>UKOLN: UK Office for Library and Information Networking </TITLE>
<META NAME="DC.title" CONTENT="UKOLN: UK Office for Library and Information Networking">
<META NAME="DC.subject" CONTENT="national centre, network information support, library community, awareness, research, information services, public library networking, bibliographic management, distributed library systems, metadata, resource discovery, conferences, lectures, workshops">
<META NAME="DC.description" CONTENT="UKOLN is a national centre for support in network information management in the library and information communities. It provides awareness, research and information services">
<META NAME="DC.creator" CONTENT="UKOLN Information Services Group">
</HEAD>
<BODY>
...
</BODY>
</HTML>
(The examples given here are largely taken from the UKOLN home page [6]. If you want to see some embedded Dublin Core metadata for real, browse to the page and use your browser's 'View source' option to look at the HTML source of the page).
Note that the META tags are placed within the HEAD section of the page and that the Dublin Core element names are preceded by 'DC.' to form the META tag name. By convention the 'DC' is uppercase and the element name is lowercase.
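The naming convention is mechanical enough to generate. The following is a minimal sketch in Python, not a tool used at UKOLN: the function name dc_meta_tags and the list-of-pairs input are assumptions made purely for illustration.

```python
from html import escape


def dc_meta_tags(pairs):
    """Render (element, value) pairs as Dublin Core META tags.

    Element names are lowercased and prefixed with 'DC.', following the
    convention described above. Values are escaped so that quotes and
    ampersands cannot break the CONTENT attribute.
    """
    lines = []
    for element, value in pairs:
        name = "DC." + element.lower()
        lines.append('<META NAME="%s" CONTENT="%s">' % (name, escape(value, quote=True)))
    return "\n".join(lines)


record = [
    ("TITLE", "UKOLN: UK Office for Library and Information Networking"),
    ("CREATOR", "UKOLN Information Services Group"),
]
print(dc_meta_tags(record))
```

A list of pairs is used rather than a dictionary so that the same element can appear more than once.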
Only 4 of the Dublin Core elements are shown in the above example. That's fine, by the way: with the Dublin Core all elements are optional and, as we'll see in a while, all elements are repeatable. If we consider the Dublin Core elements in this example:
The LANGUAGE qualifier specifies the language used in the element value (not the language of the resource itself, that's given in the LANGUAGE element!). The LANGUAGE qualifier will not be described in any detail here.
The SCHEME qualifier specifies a context for the interpretation of a given element. Typically this will be a reference to an externally-defined scheme or accepted standard. For example, if we were to allocate Library of Congress Subject Headings to the UKOLN Web-site we might add:
<META NAME="DC.subject" CONTENT="(SCHEME=LCSH) Library information networks -- Great Britain">
<META NAME="DC.subject" CONTENT="(SCHEME=LCSH) Information technology -- higher education">
to the META tags above. Note that the SCHEME qualifier is currently embedded into the META tag CONTENT. In HTML 4.0 the META tag will have a separate SCHEME attribute and it will be possible to write:
<META NAME="DC.subject" SCHEME="LCSH" CONTENT="Library information networks -- Great Britain">
<META NAME="DC.subject" SCHEME="LCSH" CONTENT="Information technology -- higher education">
However, this syntax is illegal in HTML 3.2 (or older) and although it is unlikely to cause any serious problems for current Web browsers it would cause the page to fail to validate using an HTML 3.2 based validation service.
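If the metadata is held separately from its HTML rendering, switching between the two syntaxes becomes a one-line change. The sketch below assumes a hypothetical helper, qualified_tag, with an html4 flag; neither name comes from any UKOLN software.

```python
from html import escape


def qualified_tag(element, value, scheme=None, html4=False):
    """Render one Dublin Core META tag, optionally qualified by a scheme.

    html4=False embeds the scheme in CONTENT as '(SCHEME=...) value',
    the form that validates against HTML 3.2.
    html4=True uses the separate SCHEME attribute from the HTML 4.0 draft.
    """
    name = "DC." + element.lower()
    content = escape(value, quote=True)
    if scheme is None:
        return '<META NAME="%s" CONTENT="%s">' % (name, content)
    scheme = escape(scheme, quote=True)
    if html4:
        return '<META NAME="%s" SCHEME="%s" CONTENT="%s">' % (name, scheme, content)
    return '<META NAME="%s" CONTENT="(SCHEME=%s) %s">' % (name, scheme, content)


heading = "Library information networks -- Great Britain"
print(qualified_tag("subject", heading, scheme="LCSH"))
print(qualified_tag("subject", heading, scheme="LCSH", html4=True))
```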
Finally, the TYPE qualifier modifies the element name so as to narrow its semantics. For example, an author's email address can be thought of as a sub-element of the CREATOR element. To embed the author's email address into an HTML page we can write:
<META NAME="DC.creator.email" CONTENT="isg@ukoln.ac.uk">
Repeated elements
Some elements may need to be given several times, in a Web page with more than one author for example. Remember that Dublin Core allows elements to be repeated, so simply repeat the DC.creator META tag several times:
<META name="DC.creator" content="Powell, Andy">
<META name="DC.creator" content="Stark, Isobel">
Note that it is not possible to group Dublin Core elements embedded in HTML in any formal way. So there is no mechanism for grouping pairs of DC.creator and DC.creator.email META tags.
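The lack of grouping is easy to see from the consuming side. The sketch below uses Python's standard html.parser module to collect the DC META tags from a page; the class name DublinCoreCollector is an assumption for illustration, and the sample page simply combines the tags shown above.

```python
from html.parser import HTMLParser


class DublinCoreCollector(HTMLParser):
    """Collect DC.* META tags from a page into {name: [values, ...]}."""

    def __init__(self):
        super().__init__()
        self.elements = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        content = attributes.get("content")
        if name.upper().startswith("DC.") and content is not None:
            key = "DC." + name[3:].lower()   # normalise to the usual casing
            self.elements.setdefault(key, []).append(content)


page = """<HTML><HEAD>
<META name="DC.creator" content="Powell, Andy">
<META name="DC.creator" content="Stark, Isobel">
<META NAME="DC.creator.email" CONTENT="isg@ukoln.ac.uk">
</HEAD><BODY>...</BODY></HTML>"""

collector = DublinCoreCollector()
collector.feed(page)
print(collector.elements)
```

The two creators come back as a flat list and the email address as a separate entry; nothing in the markup says which creator, if either, the address belongs to.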
HTML Authoring tools
Metadata embedded when resource is created
The first model is to embed the metadata directly into HTML Web pages by hand using whatever HTML editing tools are already in use. In some ways this is a nice simple solution but...
Web-site management tools
Metadata embedded when resource is published
The second model is to make use of Web-site management tools to manage metadata. These tools have only become available relatively recently and aim to aid the management of whole Web-sites rather than of individual documents. They usually combine editors, for creating Web pages, with other tools for managing those pages across a site. Typically they work by holding all the data for a site in a database. A 'publish' button is used to create HTML pages based on the information held in the database. These tools are unlikely to be Dublin Core aware but they are likely to support macros which may allow for the creation of embedded Dublin Core META tags as part of the publishing procedure.
However, it is worth bearing in mind that the formats used to hold data and metadata in the database are likely to be proprietary, and there are unlikely to be interchange formats that allow the data to be moved easily into other formats. You need to beware of becoming locked into a single system with this model. Nevertheless, in the longer term this looks likely to be the sensible way to go, not least because of the general advantages for Web-site management that these tools offer. For the moment though it is probably too early to make recommendations about their use, particularly as far as metadata management is concerned.
Embedding On-the-fly
Metadata embedded when resource is served
The third model is to hold the metadata in a separate neutral format and to embed it on-the-fly using Server Side Include (SSI) [7] scripts.
SSIs are a simple mechanism for creating all or part of a Web page dynamically. The Web used to consist of two kinds of pages: static pages maintained using HTML editors of some kind, and dynamic pages generated by CGI scripts. More recently SSIs have allowed static pages to embed other pages or call external scripts to form a part of their content. SSIs are typically used to embed a standard copyright notice into all the pages on a site, or to wrap standard headers and footers around them.
The third model makes use of an SSI script to embed Dublin Core metadata into the page on-the-fly. A potential problem may be performance because, for each page that is served, the Web server has to check the HTML file for an SSI and, if necessary, run a script. However, given that SSIs are increasingly being used for other purposes, this may be a problem that has to be addressed anyway.
One other potential problem is that dynamically generated pages tend to be marked with an expiry date of now, which means that Web caches do not cache them! However, some Web servers, for example Apache, can be configured to give pages containing an SSI a sensible expiry date.
With this model there needs to be a tool for creating metadata in the chosen format. In some cases it may be possible to use a commercially available tool. Alternatively it is possible to use a Web based tool like DC-dot (see below).
Finally, this model needs a mechanism for associating the resource with its metadata. There are two possibilities here. The two could be tied together using a simple filename based mapping from the HTML file to the metadata file. Alternatively, one could assign some sort of unique identifier, for example a PURL or a DOI, to each page and then use that to identify the metadata file.
DC-dot works by first prompting for the URL of the resource that you want to describe. It will then retrieve the page from the Web and automatically generate some Dublin Core META tags based either on existing metadata in the page (title, keywords, description and existing Dublin Core) or on the contents of the page. It should be noted that the methods used to automatically generate metadata are nothing to write home about and typically generate tags that need further modification by the user. DC-dot also looks at the domain name of the Web-site and uses that to determine the publisher - sometimes this produces sensible results, sometimes not!
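To give a feel for this kind of automatic generation, here is a rough Python sketch. It is not DC-dot's code: the names ExistingMetadata and suggest_dublin_core are invented for illustration, only the page's title, keywords and description are consulted, and using the bare host name as the publisher is an assumed stand-in for whatever DC-dot actually does with the domain name.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class ExistingMetadata(HTMLParser):
    """Pick up the <TITLE> text and any keywords/description META tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attributes = dict(attrs)
            name = (attributes.get("name") or "").lower()
            if name in ("keywords", "description") and attributes.get("content"):
                self.meta[name] = attributes["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def suggest_dublin_core(url, html):
    """Return suggested (name, value) pairs for a page; a human should review them."""
    parser = ExistingMetadata()
    parser.feed(html)
    parser.close()
    suggestions = []
    if parser.title.strip():
        suggestions.append(("DC.title", parser.title.strip()))
    if "keywords" in parser.meta:
        suggestions.append(("DC.subject", parser.meta["keywords"]))
    if "description" in parser.meta:
        suggestions.append(("DC.description", parser.meta["description"]))
    host = urlparse(url).hostname
    if host:
        # Crude guess: the host name stands in for the publisher.
        suggestions.append(("DC.publisher", host))
    return suggestions
```

For example, calling suggest_dublin_core("http://www.ukoln.ac.uk/", page_source) on a page with a TITLE and a keywords META tag yields DC.title, DC.subject and DC.publisher suggestions, with the publisher being a value that clearly needs editing by hand.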
DC-dot allows you to edit and extend the automatically generated tags. Having done so, you can either cut-and-paste the resulting tags into your HTML page source or save the metadata in a variety of alternate formats including USMARC records, SOIF records, ROADS/WHOIS++ templates and GILS records.
DC-dot can either be accessed from the UKOLN Web-site or downloaded and run locally - see the UKOLN metadata software tools page for details [9].
Currently our plans are to:
SOIF records could be created using the DC-dot Web based tool described above. At UKOLN, after some experimentation with various tools, we have decided to use an MS-Access database to create our Dublin Core records. This integrates quite nicely with the other tools in use on UKOLN staff PCs. Finally we have decided to associate a resource with its metadata by placing them in separate files in the same directory on the Web server, with the name of the SOIF file being derived from the name of the HTML file.
The diagram should give you an idea of the way things work. Consider a UKOLN author creating a Web page. Having edited the page they then use the MS-Access database to create a SOIF record describing it. The SOIF record is placed in the same directory as the HTML file, using the filename with a '.soif' suffix. For example, the description for intro.html is put into intro.html.soif.
Each Web page for which metadata is created must have a single line added to it. This is the line that calls the SSI script. The example in the diagram above shows the syntax for calling SSIs used by Apache.
Initially the person creating the metadata browses to the file we are describing. By using an ActiveX Control the browsing can be done using a Web browser embedded into MS-Access. Having found the required page, the person enters various metadata items - title, keywords, description, etc. As the record is saved a small Visual Basic program writes out a SOIF record as well. It is important to remember that with this system, MS-Access is simply being used as a front-end tool.
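The step of writing out a SOIF record is small. The UKOLN tool does this in Visual Basic; the Python sketch below is only an illustration of the idea, and it assumes the basic SOIF layout as used by Harvest (a '@TYPE { URL' header, then one 'name{size}:' line per attribute with a tab before the value, then a closing brace). The function name write_soif and the attribute names in the example are hypothetical, so check the SOIF documentation [12] before relying on the details.

```python
def write_soif(url, attributes, template="FILE"):
    """Serialise (name, value) pairs as a single SOIF object."""
    lines = ["@%s { %s" % (template, url)]
    for name, value in attributes:
        size = len(value.encode("utf-8"))   # SOIF sizes are counted in bytes
        lines.append("%s{%d}:\t%s" % (name, size, value))
    lines.append("}")
    return "\n".join(lines) + "\n"


record = write_soif(
    "http://www.ukoln.ac.uk/",
    [
        ("title", "UKOLN: UK Office for Library and Information Networking"),
        ("creator", "UKOLN Information Services Group"),
    ],
)
print(record)
```

Because the record is plain text, any front-end can be swapped for another without touching the files already on the server.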
Note that this system allows us to create some NewsAgent specific metadata which will be harvested by the NewsAgent robot and some UKOLN specific metadata which will be used for Web-site management purposes. For example a group ownership is assigned to each page which will allow us to locate all the pages owned by a particular UKOLN group in the future. Currently it is envisaged that the UKOLN specific metadata will be stored in the SOIF records but will not be embedded into Web pages.
Now, let's look at how things work from the point of view of a Web robot.
Imagine a robot collecting a page from the UKOLN Web site. It sends a request for the page to the UKOLN Web server (1). Normally all the server has to do is read the file from disk and send it back to the robot. In our case however, the server must also parse the file looking for SSIs (2). If it finds one, it calls the SSI script (3). One of the pieces of information that the Web server passes to the script is the name of the file it is currently reading. The script appends '.soif' to the filename. If the resulting filename exists, it reads the SOIF record (4), converts it to HTML META tags and passes them back to the server (5). The Web server adds the META tags to the page and returns the whole thing to the robot (6).
Remember that, as far as the UKOLN Web server is concerned, a robot is no different to a person browsing the Web so this procedure is followed for each and every access to the page. However, in theory, it might be possible for the script to check who is accessing the page and only generate Dublin Core META tags for those robots known to make use of them.
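Steps (3) to (5) can be sketched in a few lines. The Python below is not the soif2metadc Perl script [11]; it is a simplified illustration that makes several assumptions: the SOIF attribute names are taken to be Dublin Core element names, anything else (such as site-specific attributes) is skipped, and the path of the HTML file is passed in as a parameter because the way a server hands it to an SSI script is server-specific. The names parse_soif and meta_tags_for are invented for this example.

```python
import os
import re
from html import escape

# The 15 Dublin Core element names listed earlier in this article.
DC_ELEMENTS = {
    "title", "subject", "description", "creator", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}

# One SOIF attribute header: name{size-in-bytes}:<TAB>
ATTRIBUTE = re.compile(rb"([^\s{}]+)\{(\d+)\}:\t")


def parse_soif(data):
    """Return (name, value) pairs from the first SOIF object in `data` (bytes)."""
    pairs = []
    pos = data.find(b"\n") + 1            # skip the '@TYPE { URL' header line
    while pos > 0:
        match = ATTRIBUTE.match(data, pos)
        if match is None:                 # closing '}' or anything unexpected
            break
        size = int(match.group(2))
        start = match.end()
        value = data[start:start + size]  # values may span lines, so trust the size
        pairs.append((match.group(1).decode("utf-8"), value.decode("utf-8")))
        pos = start + size + 1            # step over the newline after the value
    return pairs


def meta_tags_for(html_path):
    """Return META tags for html_path, or '' if it has no .soif companion file."""
    soif_path = html_path + ".soif"
    if not os.path.exists(soif_path):
        return ""
    with open(soif_path, "rb") as handle:
        pairs = parse_soif(handle.read())
    tags = []
    for name, value in pairs:
        name = name.lower()
        if name.split(".")[0] not in DC_ELEMENTS:
            continue                      # e.g. site-specific attributes stay private
        content = escape(" ".join(value.split()), quote=True)
        tags.append('<META NAME="DC.%s" CONTENT="%s">' % (name, content))
    return "\n".join(tags)
```

Returning an empty string when no record exists means pages without metadata are served unchanged.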
Given that we have a script generating our DC.subject and DC.description META tags it seems sensible to let it generate keywords and description META tags containing the same values. So we might end up with:
<META NAME="DC.subject" CONTENT="national centre, network information support, library community, awareness, research, information services, public library networking, bibliographic management, distributed library systems, metadata, resource discovery, conferences, lectures, workshops">
<META NAME="DC.description" CONTENT="UKOLN is a national centre for support in network information management in the library and information communities. It provides awareness, research and information services">
<META NAME="keywords" CONTENT="national centre, network information support, library community, awareness, research, information services, public library networking, bibliographic management, distributed library systems, metadata, resource discovery, conferences, lectures, workshops">
<META NAME="description" CONTENT="UKOLN is a national centre for support in network information management in the library and information communities. It provides awareness, research and information services">
It is not so clear whether the DC.title META tag and <TITLE> tag should be the same. Currently at UKOLN, we expect that the <TITLE> tag will continue to be embedded into the Web page by the person creating the page and that the DC.title META tag will be held in the SOIF record and embedded on-the-fly by the SSI script.
Conclusions
This article proposed three areas that should be considered by those thinking about using the Dublin Core to describe the resources on their Web-site. It concentrated primarily on the issues surrounding how best to manage such metadata. By beginning to implement systems for managing metadata we can get some experience of real use and build up a body of resources with embedded Dublin Core. It looked at three models for the way in which metadata can be managed, highlighting the key issues of each. The issues of long term maintenance and transition to other formats should not be underestimated. It also described in some detail one particular implementation of one of these models that is beginning to be used at UKOLN. It is acknowledged that the design of this implementation is not perfect and may well change as we begin to work with significant amounts of metadata.
References
[1] The Dublin Core Metadata Element Set, http://purl.org/metadata/dublin_core
[2] Web Developments Related to Metadata, http://www.ukoln.ac.uk/groups/web-focus/events/seminars/metadata-june1997/iap-html/
[3] The 4th Dublin Core Metadata Workshop Report, http://hosted.ukoln.ac.uk/mirrored/lis-journals/dlib/dlib/dlib/june97/metadata/06weibel.html
[4] UKOLN Metadata Resources - Dublin Core, http://www.ukoln.ac.uk/metadata/resources/dc.html
[5] HTML 4.0 W3C Working Draft, http://www.w3.org/TR/WD-html40-970708/
[6] UKOLN Home page, http://www.ukoln.ac.uk/
[7] Using Server Side Includes, Apache Week issue 27, http://www.apacheweek.com/features/ssi
[8] DC-dot - a Dublin Core META tag creator, http://www.ukoln.ac.uk/metadata/dcdot/
[9] UKOLN metadata software tools, http://www.ukoln.ac.uk/metadata/software-tools/
[10] NewsAgent for Libraries, http://www.sbu.ac.uk/~litc/newsagent/
[11] soif2metadc Perl script, http://www.ukoln.ac.uk/metadata/software-tools/#soif2metadc/
[12] Summary Object Interchange Format (SOIF) - A review of metadata: a survey of current resource description formats, http://www.ukoln.ac.uk/metadata/DESIRE/overview/rev_20.htm
[13] Harvest Web Indexing, http://www.tardis.ed.ac.uk/harvest/
Author details
Andy Powell,
Technical Development and Research,
UKOLN
Email: a.powell@ukoln.ac.uk
Material on this page is copyright Ariadne/original authors.
This article last updated/links checked on 16-July-1997