XDF, the eXtensible Data Format for Scientific Data.
Ed Shaya
This document describes an XML mark-up language for
documents containing major classes of scientific data.
This allows the essential and common data components to be
represented in a consistent manner independent of the
scientific specialty involved. The language makes use of
the object models found in modern programming languages.
Such data representations would benefit from the widespread
acceptance that XML has, and could bring about greater
interdisciplinary information transfer. It is reasonable
to expect that this approach would lead to a greater
amount of clear public dissemination of scientific and
technical explorations.
A large fraction of scientific numeric data in existence
can be classified into the following three categories:
1) A simple parameter set to a single value, possibly
infinite, or a range of values, possibly of infinite extent
(e.g., x = 3.1 or 0 < y < 180).
2) Gridded samples of scalar or vector fields embedded in
a continuous N-dimensional space. We shall refer to this
class as field arrays.
3) Lists of items and values for a selection of their
properties. Commonly, this class of data is called tables.
An important complication is that the values of properties
in tables may themselves be field arrays (e.g., an atlas).
Examples of N-dimensional spaces include physical space,
projected space, time, wavelength, frequency, energy
scales, or some other parameter space. Field arrays include
tightly sampled grids such as spectra, images, animations,
and time-series measurements, as well as sparsely sampled
data such as event detections or pointed single-aperture
measurements. The embedded space is almost always
continuous; the sampling often is not.
Because data always has some finite resolution, it is
always gridded (sometimes variably) and both field arrays and
tables are therefore represented similarly as ordered
lists of values. Software can take advantage of this
similarity by sharing input/output methods between the
two. In fact, much of the data handling can be similar:
subsetting, taking cross-sections, etc. However,
the fundamental difference between the two categories is
that tabulated properties rarely form a continuous
space, and therefore interpolation, and any analysis that
depends on interpolation, is not sensible.
For XDF, field arrays and tables are considered to be a
single class of N-dimensional objects that can contain
any combination of four distinct types of dimensions:
continuous coordinates, discontinuous item and field
spaces, and discontinuous vector components that usually
refer to continuous coordinates.
A point of departure for the XDF format is a standardized
format for associating the embedded space with the data and
for keeping data associated with the same space together
in a consistent and organized manner. This has been a
failing of most existing scientific data formats,
probably because they are too flexible in their associated
scales and axis descriptions. It is often extremely
difficult to create an application that can display axes
along with all of the data without some user intervention.
And too often the axes information is simply not present
in the data file.
XDF is not intended to redefine all of the header
data (metadata) associated with data, since that is mostly
discipline specific. Each discipline should include
in the XDF document its own elements for describing
the circumstances of the data collection and the relevant
information for understanding the details. The advantages
of XDF come primarily from having a standard core that
directs the data reading and the relative positioning
of multiple arrays. However, it should also be helpful
to express the metadata of the older file formats in XML
because it can then reside in well organized structures
and benefit from standard interfaces for parsing and
transformation.
The objects mentioned in the category 3 data class (tables)
are objects in a broad sense. Sometimes they are simply
things like stars, people, particles, etc. But, sometimes
they are subsets of things like locations on a thing, for
instance, shells in the interior of a star at particular
depths, or locations on the surface of the Earth within
a scan from a down-looking satellite. Traditionally,
the objects of tables are placed in the first column.
For the human reader, this is fine because the first column
can be easily scanned by eye. For machine readable forms,
it is preferable for the list of objects to become part of
the metadata along with the field types. The reason
for this is that queries are most often formulated for
these quantities, and it takes much longer to read all of
the data than to read just the metadata. XDF allows for,
but does not insist on, extracting object lists from the
data files and placing them instead in the XDF document
as axes. The document can link to the rest of the table,
which need not be read unless the query result indicates
that the data there is needed.
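The payoff of moving object lists into the metadata can be sketched as follows: a query is answered against the (small) object axis, and the (large) linked data file is read only when there is a hit. The names `object_axis` and `fetch_table_rows` are illustrative only, not part of any real XDF interface.

```python
# A sketch: the object list is held as an axis in the metadata, so a
# query decides whether the linked data file needs to be read at all.
# `fetch_table_rows` is a hypothetical reader, shown only in a comment.

object_axis = ["Fred", "Mary", "Ned"]   # object list kept as an axis
query = "Mary"

if query in object_axis:
    row = object_axis.index(query)      # row of the linked table to fetch
    # Only now would the (possibly remote) table data be read, e.g.:
    # record = fetch_table_rows("table.dat", rows=[row])
    print(f"{query}: fetch row {row} of the data file")
else:
    print(f"{query}: not present, the data file is never read")
```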
One of the key concepts in object oriented methodology is
that data should be wrapped with the information necessary
to read it and to make it useful. The XML language allows
for this by including in data documents either references
to applications or code in the form of ECMAScript or Java.
An XML document can have references to files containing
data and different types of data files can be handled by
different applications. This not only allows input and
preliminary processing to be self-directed, but it also
allows some of the data to be generated on the end users'
machines. XDF includes functions for calculating values
along axes, and if necessary, calculating positional
information for every grid point. When applications
are referenced or embedded within the data documents,
it greatly reduces the learning curve needed to begin
working with scientific data.
Specifics:
The XDF is a container for parameters, field arrays, and tables.
Field arrays and tables are described by array elements. These
contain axes information and data elements. Data elements
are ordered lists of values of numeric or string types.
The array and parameter elements can be grouped into
structure elements. A simplified structure with an image
looks something like this (schematically, with attribute
details omitted):

    <structure>
      <array>
        <axis>
          <values> a list of values along one dimension </values>
          <axis>
            <values> a list of values along the other dimension </values>
          </axis>
        </axis>
        <dataFormat> info on the ordering of the data values
                     and record format </dataFormat>
        ...
        <data> The data goes here </data>
      </array>
      <array> Some other array of data... </array>
    </structure>
The structure element is not necessary for this example,
but it becomes useful when more complicated sets of arrays
are involved.
A simplified table would look like this (schematically):

    <array>
      <axis>
        <values> Fred Mary Ned </values>
        <axis>
          <values> address
                   phone number
                   birthday
                   ... </values>
        </axis>
      </axis>
      <data>
        1212 Sycamore Rd/Gaithersburg MD 20934
        301.123.456
        9/23
        721 Rose Ave/Richmond VA 20712
        etc.
      </data>
    </array>
Why are the axis elements nested? This clearly shows the
order of the data in the data element. The most nested
axis has the fastest moving index when being read in.
This form makes it possible for the XDF document to be
transformed in a straightforward way (by XSLT, CSS, or
Perl scripts) into an input program. The nesting of the
ends of the loops would reflect the positions of the end
tags of the axis elements. It should be possible to write a
transformation script for any programming language that
would work on any XDF document.
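The correspondence between nested axis elements and nested read loops can be sketched as follows; the axis sizes and values are illustrative. The innermost loop, like the innermost axis element, has the fastest-moving index through the ordered list of data values.

```python
# Sketch: reading an ordered list of data values into a grid whose loop
# nesting mirrors the nesting of the axis elements.  The inner loop
# (innermost axis) advances fastest through the list.

outer = [10, 20, 30]      # values along the outer axis
inner = [1, 2, 3]         # values along the nested (innermost) axis

flat = list(range(9))     # the data element: an ordered list of values

grid = {}
pos = 0
for x in outer:           # corresponds to the outer <axis> element ...
    for y in inner:       # ... and the <axis> nested inside it
        grid[(x, y)] = flat[pos]
        pos += 1          # the inner loop consumes values fastest

# grid[(10, 1)] holds the first value, grid[(10, 2)] the second, etc.
```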
In more detail, an axis element is of the form
(attribute names are schematic):

    <axis axisId="x">
      <values> 10 20 30 40 50 60 70 80 90 </values>
      ...
    </axis>

The id is useful if the same axis description is used
again anywhere in the document. Then, one can simply use:

    <axis axisIdRef="x"/>

Or values elements can use a built-in indexer, via start
and step attributes. The default values are start="1" and
step="1". Thus, the following form puts the index numbers
into the axes:

    <axis>
      <values size="3"/>
      <axis>
        <values size="3"/>
      </axis>
    </axis>
    ...
    <data>
      7.5 42.2 33.4
      34.6 22.5 12.1
      1.4 1.1 22.2
    </data>

Or values can take a script, or can point to a URL that
will perform the calculation.
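The built-in indexer amounts to computing each grid value from the start and step attributes. A minimal sketch (the function name and signature are assumptions, not part of XDF):

```python
# Sketch of the built-in axis indexer: with the defaults start=1 and
# step=1, a values element with no explicit list yields plain index
# numbers; other start/step choices yield a uniform coordinate scale.

def axis_values(size, start=1.0, step=1.0):
    """Value at each of `size` grid points along a uniformly sampled axis."""
    return [start + i * step for i in range(size)]

# Defaults reproduce simple index numbers:
#   axis_values(3)                      -> [1.0, 2.0, 3.0]
# A 9-point axis running 10..90 in steps of 10:
#   axis_values(9, start=10.0, step=10.0)
```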
The data elements can also be filled by these various
methods. In addition, the application is pointed to the
proper read module through the Notation Entity mechanism
of XML. The data can be held in files of assorted
binary formats or fixed-width-record text formats as long as
there is reader code for that particular format. It is
not yet clear what the best way is to handle rare datatypes.
A reasonable approach is to have an easy way to add new
datatype readers to the application on the client side
whenever a needed reader is not yet in the library. The
application uses its local code in normal operation, but
goes back to a central library, indicated by the notation,
when it is missing the needed code.
To do this, one includes in the internal document type
definition declarations of the form:

    <!NOTATION binaryFile SYSTEM "binaryReader">
    <!ENTITY datafile1 SYSTEM "http://machine.org/datafile1.dat"
             NDATA binaryFile>

This means that datafile1.dat on machine.org is
non-parsable, is in the binaryFile notation, and appears in
attributes of elements as datafile1. The Java class
binaryReader is to be used to read it. So, the line
in the XDF document to include this data file is of the
form:

    <data entity="datafile1"/>
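One way to mimic this notation-to-reader dispatch on the client side is a table mapping each notation name to reader code, with a fallback when the reader is missing. This is a sketch under assumed formats: the notation names and the big-endian 32-bit float layout for binaryFile are illustrative.

```python
# Sketch: dispatch on an entity's notation to select reader code,
# falling back when no local reader exists (a real client would fetch
# the missing reader from the central library named by the notation).
import struct

def read_binary(raw: bytes):
    # assumed layout: big-endian 32-bit floats
    n = len(raw) // 4
    return list(struct.unpack(f">{n}f", raw))

def read_ascii(raw: bytes):
    return [float(tok) for tok in raw.decode("ascii").split()]

READERS = {"binaryFile": read_binary, "asciiFile": read_ascii}

def read_entity(raw: bytes, notation: str):
    try:
        return READERS[notation](raw)
    except KeyError:
        raise NotImplementedError(f"no reader for notation {notation!r}")
```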
On the binary, ASCII, mark-up debate.
There is an ongoing debate about how best to use XML when
faced with large data files. First of all, most existing
large data files are in binary format, and one simply cannot
mark these up with XML tags. Should these be converted
to ASCII just to add XML tags? One argument against this
is that one may lose precision on floating point numbers
by doing so. This is not so strong an argument, because
one rarely needs full machine precision, and an extra
character or two can be used to maintain full precision.
Another argument is that adding tags adds considerable
overhead and will require larger storage space and longer
transmission times. This argument would equally apply to
ASCII data that is not marked up, but in any case it no
longer holds. Recently, an application for
compression of XML documents has been developed, XMIL, that
does a good job of compressing XML by taking full advantage
of the fact that each element type is likely to contain
a different datatype, and each datatype compresses well
when the correct compression algorithm is applied.
However, there is still a problem with marking up all
of the data and that is the capabilities of present
day parsers to hold the many millions of separate nodes
that are needed to contain and operate on these data.
It would appear possible to solve this problem also, if a
parser could be written to work directly with XMIL format
and its container systems.
It is because of these problems with large data files and
because of the huge legacy of data files in binary and
ASCII format that are not marked up, that XDF is careful
to allow both varieties of data to be included. One can
fully mark up the data and leave out formatting information,
one can use old-fashioned fixed-width formats for records,
or one can express the binary formats used.
Direct Access
For working with large datasets it is useful to read
in only specific records or possibly go directly to the
specific bytes in the files. With the axis information
in XDF it becomes possible to directly access the values
for the most common queries. For field arrays, a section of
the space can be specified, and one can calculate the indices
of the data values where that section lies. For non-marked-up
data, one can go directly to the sections of the files
where the data is found. For fully marked-up data there
is no format information for direct file access. But,
the entire data set can be read in once and placed into
a DOM which can be written out to disk. The DOM provides
indexing of the elements and thus efficient direct access
is then possible.
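For the non-marked-up case, the direct access described above reduces to computing a byte offset from the axis sizes. A sketch, assuming a C-ordered array (last index fastest) and an 8-byte element width; the sizes are illustrative.

```python
# Sketch: with the axis lengths and ordering known from the XDF axis
# elements, the byte position of any grid point in a flat binary file
# can be computed and seeked to directly, without reading the whole file.

def byte_offset(indices, shape, itemsize=8):
    """Offset of element `indices` in a C-ordered (last index fastest) array."""
    offset = 0
    for idx, dim in zip(indices, shape):
        offset = offset * dim + idx
    return offset * itemsize

# e.g. element (2, 5) of a 100 x 50 array of 8-byte values:
#   with open("data.bin", "rb") as f:
#       f.seek(byte_offset((2, 5), (100, 50)))  # then read one value
```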
Another tactic is to section the data into many smaller
files. This can be reflected in the XDF axis elements
with multiple values elements, schematically:

    <axis>
      <values valueId="section1"> ... </values>
      <values valueId="section2"> ... </values>
      etc...
    </axis>

Here the IDRefs in the data elements for each file specify
the appropriate sections of the axes for the data in that
file.