ATLAS, Industrial Strength XML

Date:     Thu, 7 Oct 1999 09:23:56 -0400 (EDT)
From:     John Robert Gardner <jrgardn@emory.edu>
To:       "'xml-mailinglist'" <xml-dev@ic.ac.uk>
cc:       "John Robert Gardner -Ph.D." <jrgardn@emory.edu>
Subject:  Industrial Strength XML Serving

I'm venturing this question as a general call for input -- and pitches -- with regard to the following project we're undertaking:

750,000 pages of journals, in both text form and gif images for "canonical preservation" and cross-check
Typed text version, in XML (using TEI largely) yielding ~400,000,000 words (our initial estimates suggest something in the range of 30-50 gigs of total content including gifs), averaged to ~60,000,000 tag nodes, searchable based on content of tags (word strings), element hierarchy, and attribute values, with final form changing infrequently (archival/institutional memory)
Primary access point being MARC records we're rendering into highly granular XML, for crosswalking to DC/RDF/GILS (we're starting with some 200 megs of MARC records alone)
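
The three kinds of search listed above (tag content, element hierarchy, attribute values) can be illustrated at toy scale. This is only a minimal sketch using Python's standard library, not a proposal for serving 60,000,000 nodes; the sample fragment, its element and attribute names, and the helper functions by_content, by_hierarchy, and by_attribute are all hypothetical and not taken from the project.

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI-like fragment; names and values are illustrative only.
SAMPLE = """
<text>
  <div type="article" n="1">
    <head>On Markup</head>
    <p>Archival journals need durable markup.</p>
  </div>
  <div type="review" n="2">
    <head>Short Notice</head>
    <p>A brief review of a new edition.</p>
  </div>
</text>
"""

root = ET.fromstring(SAMPLE)


def by_content(root, word):
    """Elements whose own text contains the word (case-insensitive)."""
    needle = word.lower()
    return [el for el in root.iter() if el.text and needle in el.text.lower()]


def by_hierarchy(root, path):
    """Elements matching a parent/child path relative to the root."""
    return root.findall(path)


def by_attribute(root, tag, name, value):
    """Elements of a given tag, anywhere in the tree, with name == value."""
    return root.findall(f".//{tag}[@{name}='{value}']")


if __name__ == "__main__":
    print([el.tag for el in by_content(root, "markup")])
    print([el.text for el in by_hierarchy(root, "div/head")])
    print([el.get("n") for el in by_attribute(root, "div", "type", "review")])
```

An in-memory tree like this would not hold at the volumes described; the sketch only makes concrete the three query shapes that any candidate system, relational or SGML/XML-native, would need to answer.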

I've been asking offlist for possible consultants as our systems staff has a strong inclination to Oracle 8i and I'm hardly fluent enough on such software to argue based upon what I know. Based on Oracle's white paper, it sounds viable . . . however:

In some of my offlist correspondence, I've detected a dichotomy. One view holds that "it doesn't matter if it's XML, pizzas, or washing machines you're storing, it's the size that counts (no pun intended)" -- so Oracle's great. On the other side is a sense that 8i's newness makes it a potential unknown for XML at this scale (we'll also likely be subcontracting the serving of the gifs, likely out-of-state). The implication was that there are more SGML/XML-native packages out there if we have the budget (we do, within the limits that, say, commissioning a whole new software package is out of the question). :)

Our project is perhaps one of the best funded efforts in the humanities in markup for some time, and surely in a class by itself viz. XML. As it's likely to be a model in various senses/case study, I really want to be sure we commit down the "right" road on this, and be sure of our options along that road. The vision I'm implementing from the XML side is meant to go beyond another research resource to a full-scale research environment which exploits XSLT to make our material -- e.g., the MARC records -- accessible in multiple tag vocabularies (DC, RDF, GILS, etc.), as well as very sophisticated construction of the resources found through the search (e.g., with DOM, etc.).

At any rate, this question is in no way an obviation either of my offlist inquiries for a consultant, or of their input thus far. Instead, since the vichyssoise is not yet ready to be stirred, nor even on the stove, all chefs are needed -- if there is a better mousetrap to be made without a reinvention of the wheel, now's the time to know.

TYIA,

John Robert Gardner, Ph.D.
XML Engineer
ATLA-CERTR
http://vedavid.org/

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev@ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1

Prepared by Robin Cover for the SGML/XML Web Page archive. For references, see "ATLA Serials Project (ATLAS)."
