triple-s XML Standard

triple-s XML

The Survey Interchange Standard

A standard for moving surveys between survey packages on various hardware and software platforms

Version 1.1 February 2000

www.triple-s.org

Introduction

This document describes the triple-s XML format for survey data and variables.

Background

The aim of the triple-s standard is to define a means of transferring the key elements of entire surveys between different survey software packages across various hardware and software platforms.

The initial version of the triple-s standard (version 1.0) was devised by Keith Hughes, Stephen Jenkins and Geoff Wright, and published in 1994. The impetus was a paper by Peter Wills. During 1996 the same group met to enhance and extend the standard, based on comments from implementers and users. An interim result of these meetings was presented as a paper to the ASC (Association for Survey Computing) International Conference in 1996. The preliminary specification for version 1.1 of the triple-s standard was agreed in December 1996 and published in March 1998.

Subsequently, a proposal for an XML translation of the standard was put forward in 1998 and triple-s XML was presented to the ASC Millennium Conference in 1999.

Summary

A triple-s survey is described in two text files. The first, the Definition File, contains version and general information about the survey together with definitions of the survey variables; it is used to interpret the contents of the second, the Data File. By default the Definition File has a file extension of 'SSS' and the corresponding Data File has the same name but with the extension 'ASC'.

The format of each of the files has been designed to enable software read/write routines to be easy to implement. To further aid the development process the files are relatively simple to read by eye.

Compatibility

triple-s XML is a translation of triple-s 1.1 into an XML format as specified by the triple-s XML DTD (Document Type Definition).

The Definition File

Outline

The definition file is coded in XML syntax according to rules given by the associated triple-s XML DTD. The definition file contents describe two aspects:

a. the file itself in terms of version number, date and time of creation etc.

b. the survey in terms of the survey variables.

The following shows an outline of the contents of the definition file.

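<sss version="1.1" [ options="option_setting" ] >
    <date>date_text</date>
    <time>time_text</time>
    <origin>origin_text</origin>
    <user>user_text</user>
    <survey>
        <title>survey_title_text</title>
        <record ident="record_id">
            <variable ident="variable_id" type="variable_type">
                . . .
                variable_details
                . . .
            </variable>
            . . .
            <variable ident="variable_id" type="variable_type">
                . . .
                variable_details
                . . .
            </variable>
        </record>
    </survey>
</sss>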

The file is specified in terms of elements such as <date> and <time>, some of which (such as <survey> ... </survey>) also encapsulate other elements and some of which (such as <record ident="record_id">) also include attributes.

Formatting

1. Recommendations

In order to improve the readability of the definition file, it is recommended that:

a. The file is organised into lines using CR, LF (decimal 13, decimal 10) combinations.

b. At most one element, or element with associated attributes, appears on one line.

c. Lines are indented with space or tab characters to reflect the structure inherent in the file. An indent is applied after every element that contains other elements.

2. Comments

Comments may be used to annotate contents or to temporarily hide sections of the file from the XML reading mechanism. Comments use the conventional XML construct: they begin with <!-- and end with -->.

a. <!-- comment_text --> can be used anywhere to indicate parts of the definition file that are to be ignored.

b. A comment_text may include any text except two successive dash characters, --.

Definition File Elements and Attributes

This section describes the syntax and function of each of the keywords and keyword phrases used to form a triple-s XML definition file. The keywords are shown in the order they are expected in the file.

<sss version="1.1" [ options="option_setting" ] >

The sss element is used to encapsulate the entire specification document. It contains a mandatory attribute version and an optional options attribute. The options attribute is used to indicate that some aspect of the triple-s definition meets a defined standard that may reduce the amount of checking required by an import program.

The only value that option_setting may have in version 1.1 is standardnames. Variable names that meet the triple-s standardnames definition:

consist of 1-8 characters

consist of only the characters A-Z (and a-z) and 0-9

start with a letter (A-Z or a-z)

are case insensitive (i.e. upper and lower case are equivalent)

are unique within the survey (e.g. Q1 and Q001 are different, q1 and Q1 are the same)

appear as the "name_text" without leading, trailing, or embedded blanks

If all variable names in the definition file conform to this definition then the standardnames option is set.

For example: <sss version="1.1" options="standardnames">
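The standardnames rules above can be checked mechanically before the option is set. The following Python sketch is one possible reading of those rules, not part of the standard; the function names is_standard_name and all_standard_names are hypothetical.

```python
import re

# Assumed reading of the rules: 1-8 characters, only A-Z, a-z and 0-9,
# and the first character must be a letter.
_STANDARD_NAME = re.compile(r"[A-Za-z][A-Za-z0-9]{0,7}")


def is_standard_name(name: str) -> bool:
    """Return True if a single name satisfies the character rules."""
    return _STANDARD_NAME.fullmatch(name) is not None


def all_standard_names(names: list[str]) -> bool:
    """Return True if every name is standard and names are unique ignoring case."""
    folded = [name.upper() for name in names]
    return all(is_standard_name(n) for n in names) and len(set(folded)) == len(folded)
```

An exporting program could call all_standard_names on the full list of variable names and write options="standardnames" only when it returns True.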

<!-- comment_text -->

Optional and can appear any number of times in the definition file anywhere where an element might be expected (including between values specified in a <values> element, described later). The comment_text may include any text except two successive dash characters, --.

<date>date_text</date>

Optional. The date_text should represent the date the file was created.

For example: <date>20 September 1999</date>

<time>time_text</time>

Optional. The time_text should represent the time the file was created.

For example: <time>18:32</time>

<origin>origin_text</origin>

Optional. The origin_text should describe the originating system (program and operating system).

For example: <origin>MyProg v3, Windows 2000</origin>

<user>user_text</user>

Optional. The user_text should indicate the name of the user who created the file.

For example: <user>A Smith</user>

<survey>

Mandatory. Introduces details of the survey.

<title>survey_title_text</title>

Optional, but if present appears between <survey> and <record>. The survey_title_text should represent the survey title.

For example: <title>The Fitness Centre Survey</title>

<record ident="record_id">

Mandatory. One record element starts after <survey> (or after <title> if present). It is used to introduce the definition of the variables. The record_id is any single character A to Z or a to z.

For example: <record ident="A">

The record_id can be used in conjunction with the variable_id (see the <variable> element later) to generate a unique variable name on import.

For each variable being described there should be a block comprising:

<variable ident="variable_id" type="variable_type">

Mandatory. The variable_id is an integer number of up to four digits, in the range 1 to 9999, with or without leading zeroes. Each variable_id must be unique within a <record> ... </record> block.

The variable_type must be one of:

single - categorical with one response allowed

multiple - categorical with any number of responses

quantity - numeric value (integer or real)

character - character value

logical - Yes/No or True/False value

For example: <variable ident="10" type="single">

<name>name_text</name>

Mandatory. The name_text should represent the name the variable had in the original survey.

For example: <name>Q1a</name>

Note that if the names conform to the definition of triple-s standardnames then a standardnames directive should appear in the options attribute of the <sss> element.

<label>label_text</label>

Mandatory. The label_text should represent the label or question text of the original variable.

For example: <label>First visited</label>

<position start="start_location" [ finish="finish_location" ] />

Mandatory. Describes the location of the data values within the data record. The start_location and finish_location are positive integers, which represent the character positions, with the first position in the data record being 1.

For example: <position start="21" finish="24"/>

The finish attribute may be omitted if the finish_location is the same as the start_location. If specified then the finish_location must be greater than or equal to the start_location.

The <position> element defines the part of the data record that is allocated to holding the value of the variable. The <size>, <values> and <spread> elements describe which parts of the data record are to be interpreted as the value, and what are the legal values of the variable. As a consequence the <position> element must define an area that is at least as large as that implied by the <size>, <values> and <spread> elements.

The parts of the data record defined by the <position> element may appear in any order, may overlap each other, and do not have to describe the entire data record. It is recommended that import programs ignore all parts of the data record not defined by <position> elements, including those beyond the highest location defined by a <position> element.
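Because positions are 1-based and inclusive while most programming languages index from 0, an import routine needs a small conversion. The sketch below shows one way to do it in Python; extract_field is a hypothetical helper name, and the error handling is an assumption rather than something the standard prescribes.

```python
from typing import Optional


def extract_field(record: str, start: int, finish: Optional[int] = None) -> str:
    """Return the characters of `record` covered by a <position> element.

    `start` and `finish` are 1-based and inclusive, as in the definition
    file; an omitted `finish` means a one-character field.
    """
    if finish is None:
        finish = start
    if start < 1 or finish < start:
        raise ValueError("need 1 <= start <= finish")
    if finish > len(record):
        raise ValueError("record is shorter than the position requires")
    # Convert the 1-based inclusive range to a 0-based half-open slice.
    return record[start - 1:finish]
```

Since each field is cut independently from the record, overlapping positions and unused parts of the record need no special treatment.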

The elements that follow the <position> element vary according to the variable_type:-

Variable type	Elements that follow
single	Mandatory <values> element
multiple	Optional <spread> element, then mandatory <values> element
quantity	Mandatory <values> element
character	Mandatory <size> element
logical	Nothing extra

<spread subfields="num_subfields" [ width="subfield_width" ] />

Optional and only used with multiple type variables. The <spread> element indicates that the data values are coded as a series of category values in consecutive subfields (rather than as a series of 0/1 characters).

The num_subfields attribute must be a positive integer, and denotes the number of subfields within the overall field that is defined by the <position> element. The subfield_width is also a positive integer and denotes the width of each subfield. Therefore the <position> element must define a width of at least (num_subfields * subfield_width).

For example: <spread subfields="5" width="3"/> …5 subfields of width 3

The width attribute may be omitted if the num_subfields exactly fill the area defined by the <position> element. In this case the subfield_width is determined by dividing the width derived from the <position> element by num_subfields.
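The relationship between the <position> width, num_subfields and subfield_width can be written as a short calculation. This Python sketch assumes that an omitted width is only acceptable when the position width divides exactly by the number of subfields; subfield_width is a hypothetical function name.

```python
from typing import Optional


def subfield_width(start: int, finish: int, subfields: int,
                   width: Optional[int] = None) -> int:
    """Return the width of each subfield of a spread-format multiple."""
    field_width = finish - start + 1
    if width is None:
        # Width omitted: the subfields must exactly fill the position.
        if field_width % subfields != 0:
            raise ValueError("subfields do not exactly fill the position")
        return field_width // subfields
    if subfields * width > field_width:
        raise ValueError("position is narrower than subfields * width")
    return width
```

For instance, two subfields in positions 21 to 26 with no width attribute give a derived width of 3.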

<values> … </values>

Mandatory for single, multiple and quantity types. The <values> element is used to define the range of legal values and optional text labels for values (e.g. categorical codes).

A <values> element contains the following elements:

<range from="start_value" to="finish_value" />

Optional first or only element that indicates an overall range of legal values for the variable. The finish_value must be equal to or greater than the start_value. There may also be any number of <value> elements, each defining a particular value, that follow within the outer <values> element (see below).

<value code="code_value">value_text</value>

Any number of optional elements that are used to give labels to specific values. If no <range> element has been specified then there must be at least one <value> element. If a <range> element has been specified then the code_value may lie within or outside the defined start_value and finish_value.

The details of the start_value, finish_value and code_value depend on the type specification:-

For single and multiple variables:

The start_value, finish_value and code_value must all be positive integers. The <value> elements do not need to be in any order, nor be complete. There is no upper limit to the number of <value> elements which may be specified within the corresponding variable definition.

For example: <values>

<value code="9">Refused</value>

</values>

or: <values>

<value code="99">Refused</value>

</values>

For quantity variables:

The start_value, finish_value and code_value explicitly define the valid range, and implicitly define the format and physical size of data for the variable. The valid range for a variable of type quantity can include positive or negative values. Negative values are identified by a single leading minus sign, '-'. Positive values are identified by the absence of a sign.

For example: <values>

</values>

or: <values>

<!-- 0 to 500 with 2 dp, plus 1 explicit value -->

The number of decimal places must be the same for all values used in the values block. The number of decimal places must be identical to the number of decimal places used to represent the data in the corresponding data file.

Values in the definition file must contain at least one digit. The use of a decimal point is optional for integer values. The following table gives examples of correct and incorrect representations:

Value	Status
1.0	Correct
+1.0	Incorrect - 'plus' sign not allowed
-1.0	Correct
- 1.0	Incorrect - contains embedded spaces
1.	Correct
.1	Correct
-.1	Correct
-.	Incorrect - no numeric digits present

There is no upper or lower limit to the magnitude of the values that may be assigned to a quantity variable.
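The rules for writing quantity values in the definition file (optional leading minus, no plus sign, no embedded spaces, at least one digit) fit a single regular expression. The following Python sketch is an interpretation of the table above, with hypothetical function names.

```python
import re

# Optional '-', then either digits with an optional decimal part ("1", "1.",
# "1.0") or a decimal point followed by digits (".1").
_QUANTITY = re.compile(r"-?(\d+\.?\d*|\.\d+)")


def is_valid_quantity_text(text: str) -> bool:
    """Return True if `text` is an acceptable quantity value."""
    return _QUANTITY.fullmatch(text) is not None


def decimal_places(text: str) -> int:
    """Return the number of digits written after the decimal point."""
    return len(text.partition(".")[2])
```

Comparing decimal_places across the from, to and code values of one <values> element is a simple way to check that they all use the same number of decimal places.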

<size>size_specification</size>

Mandatory for character type variables. Defines the maximum number of characters in the data for the variable. The size_specification must be a positive integer; there is no defined upper limit to the size_specification.

For example: <size>100</size>

Finally, for all variable types:

</variable>

Mandatory. Completes definition of the variable.

Then either the definition of another variable (introduced by another variable element), or:

</record>

Mandatory. Finishes the definition for the set of variables.

</survey>

Mandatory. Finishes the definition for the survey.

</sss>

Mandatory. Finishes the definition file.
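As an illustration of how the elements described above fit together, the sketch below reads a small definition document with Python's standard xml.etree.ElementTree module. The sample document, its names and labels, and the read_variables function are all hypothetical and exist only for this illustration; the parser checks well-formedness only and does not validate against the DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical definition file content, made up for this sketch.
SAMPLE = """<sss version="1.1">
  <survey>
    <title>Hypothetical survey</title>
    <record ident="A">
      <variable ident="1" type="single">
        <name>Q1</name>
        <label>First visited</label>
        <position start="21" finish="22"/>
        <values>
          <range from="1" to="20"/>
          <value code="99">Refused</value>
        </values>
      </variable>
      <variable ident="2" type="character">
        <name>Q2</name>
        <label>Comments</label>
        <position start="23"/>
        <size>1</size>
      </variable>
    </record>
  </survey>
</sss>"""


def read_variables(xml_text: str) -> list[dict]:
    """Return one dictionary per <variable>, in document order."""
    root = ET.fromstring(xml_text)
    record = root.find("survey/record")
    variables = []
    for var in record.findall("variable"):
        pos = var.find("position")
        start = int(pos.get("start"))
        # An omitted finish attribute means finish equals start.
        finish = int(pos.get("finish", start))
        variables.append({
            # record_id plus variable_id gives a unique identifier.
            "ident": record.get("ident") + var.get("ident"),
            "type": var.get("type"),
            "name": var.findtext("name"),
            "start": start,
            "finish": finish,
        })
    return variables
```

Calling read_variables(SAMPLE) yields two entries, identified as A1 and A2.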

The Data File

Overview

The data file is comprised of fixed-length records. Each record contains the responses for each of the variables in the corresponding definition file given by one respondent. All records must be of the same length and must be at least as long as the highest location defined in a <position> element.

Data is recorded in fields of fixed length and arranged in the manner defined by the <position> elements of the variables in the definition file. The type and other definitions for the corresponding variable determine the interpretation of each field.

Basic Formatting Rules

1. Other than the record terminator (see below), only characters in the range decimal 32 to decimal 255 are considered valid - any others are considered an error when read.

2. The corresponding definition file determines the minimum length of each record. This minimum length is taken from the highest location defined in a <position> element. There is no maximum record length.

3. Each record is terminated by either CR/LF, LF/CR, CR or LF, where CR is the carriage return character (decimal 13) and LF is the line feed character (decimal 10). Whichever terminator is used must be employed consistently - that is the same terminator must be used throughout the file.

4. The number of records in the file determines the number of respondents. There is no maximum number of records (and hence respondents) in the file.

5. There is no specific end-of-file character. The end of the file is determined by its physical size.
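The formatting rules above translate into a few checks when a data file is read. The Python sketch below works on raw bytes and takes the terminator from the first CR or LF it finds; split_records is a hypothetical name, and treating any later mismatch as an error is one possible interpretation of rule 3.

```python
def split_records(data: bytes) -> list[bytes]:
    """Split data-file bytes into records using one consistent terminator."""
    first = next((i for i, b in enumerate(data) if b in (10, 13)), None)
    if first is None:
        records = [data] if data else []
    else:
        pair = data[first:first + 2]
        # Two-character terminators are CR/LF or LF/CR; otherwise CR or LF.
        terminator = pair if pair in (b"\r\n", b"\n\r") else data[first:first + 1]
        records = data.split(terminator)
        if records[-1] == b"":
            records.pop()  # the file ended with a terminator
    for record in records:
        # Bytes below 32 are invalid; a stray CR or LF left inside a record
        # also signals that a different terminator was used.
        if any(b < 32 for b in record):
            raise ValueError("invalid character or inconsistent terminator")
    if len({len(r) for r in records}) > 1:
        raise ValueError("records differ in length")
    return records
```

The number of respondents is then simply the length of the returned list.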

Individual Data Items

The following sections describe the methods used to represent data for each type of variable. In all cases, a field comprised entirely of space characters represents missing data for that variable.

Variables of type single

Data is recorded as an integer number as described by the <values> element. The number 0 can be used to represent missing data.

The data field length is derived from the <value> and <range> elements in the <values> element, and is the minimum number of characters required to represent the largest value. Thus, variables with values up to 9 have a data field one character long; variables with values up to 99 have a data field length of 2, and so on. If a particular data value requires less than the maximum for the field, it should be right justified using leading space or zero characters as padding.

If the data field length from each <value> or <range> element is less than that defined in the corresponding <position> element then it is assumed to be right justified within the locations defined in the <position> element. Import programs should then ignore any extra parts of the position field.

For example:-

Data value	Maximum in <values> element	<position> element	Data record b=space, x=ignore
7	9	start="21" finish="21"	7
7	9	start="21" finish="22"	x7
7	20	start="21" finish="22"	07 or b7
7	20	start="21"	illegal
7	20	start="21" finish="24"	xx07 or xxb7
17	20	start="21" finish="22"	17
17	20	start="21" finish="24"	xx17
142	9999	start="21" finish="24"	0142 or b142
missing	9999	start="21" finish="24"	bbbb
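The rows of the table above can be reproduced with a short decoding routine. This Python sketch assumes the caller has already cut the <position> field out of the record and knows the largest value in the <values> element; decode_single is a hypothetical name, and returning None for both blank and zero data is an assumption based on the text above.

```python
from typing import Optional


def decode_single(field: str, max_value: int) -> Optional[int]:
    """Decode the position field of a single variable; None means missing."""
    digits = len(str(max_value))
    if len(field) < digits:
        raise ValueError("position is narrower than the largest value")
    data = field[-digits:]  # value is right justified; ignore the rest
    if data.strip(" ") == "":
        return None
    text = data.lstrip(" ")  # leading spaces are padding
    if not (text.isascii() and text.isdigit()):
        raise ValueError("field does not contain an integer")
    value = int(text)  # leading zeros are padding too
    return None if value == 0 else value
```

Only the rightmost characters needed for the largest value are inspected, so characters marked x in the table never affect the result.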

Variables of type multiple

Data for a multiple variable may be recorded either as one character per category (bitstring format) or as a list of values (spread format).

Bitstring format

Data is recorded with one character per category of the corresponding variable. A character ‘1’ is used to signify that a category has been selected, a character ‘0’ signifies that a category is not selected. The category value refers to the relative position of the 0/1 code in the data field: thus a category value of 9 will always refer to the code in the 9th location of the data field even if some lower category values have not been defined. An import program should ignore the locations of undefined category values.

The data field length is the highest category value in the associated value or range elements. If the data field length is less than the position element then it is assumed to be left justified within the locations defined by the position. Import programs should then ignore any extra parts of the position field.

For example:-

Data value	Maximum in <values> element	<position> element	Data record b=space, x=ignore
1	1 to 9	start="21" finish="29"	100000000
1	1, 2, 3 and 9	start="21" finish="29"	100xxxxx0
1, 3	1 to 9	start="21" finish="29"	101000000
none	1 to 9	start="21" finish="29"	000000000
2, 8	1 to 9	start="21" finish="30"	010000010x
2	1, 2, 3 and 9	start="21" finish="24"	illegal
missing	1 to 9	start="21" finish="29"	bbbbbbbbb
missing	1, 2, 3 and 9	start="21" finish="29"	bbbxxxxxb
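A bitstring field can be decoded by looking only at the locations of defined categories, as the table above illustrates. In this Python sketch decode_bitstring is a hypothetical name; rejecting a field in which only some defined locations are blank is an assumption, since the standard describes only the fully blank case.

```python
from typing import Optional


def decode_bitstring(field: str, categories: set[int]) -> Optional[set[int]]:
    """Return the selected categories, or None if the data is missing."""
    if len(field) < max(categories):
        raise ValueError("position is narrower than the highest category")
    selected = set()
    blanks = 0
    for code in sorted(categories):
        ch = field[code - 1]  # category n lives in the nth location
        if ch == "1":
            selected.add(code)
        elif ch == " ":
            blanks += 1
        elif ch != "0":
            raise ValueError("expected '0', '1' or space")
    if blanks == len(categories):
        return None  # every defined location is blank: missing data
    if blanks:
        raise ValueError("field is only partly blank")
    return selected
```

Locations of undefined categories, and anything beyond the highest category, are never read.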

Spread format

Data is recorded as a series of subfields each containing one category value of the variable. The category value is recorded as an integer number as described in the values element. The number 0 should be used to represent subfields that are not needed.

The data subfield length is the minimum number of characters required to represent the largest value in the values block. Thus variables with values up to 9 have a data subfield one character long, variables with values up to 99 have a data subfield length of 2, and so on. If any particular data requires less than the maximum for the subfield, it should be right justified using leading space or zero characters as padding. Data values may be stored in any or all subfields.

If the data subfield length is less than the subfield defined in the spread element then it is assumed to be right justified within the width defined in the spread. Import programs should then ignore any extra parts of the subfields.

If the total width of the subfields is less than that defined in the position element then the subfields are stored consecutively left justified within the locations defined by the position. Import programs should then ignore any extra parts of the position field.

For example:-

Data value	Maximum in <values> element	<spread> element	<position> element	Data record b=space, x=ignore
1	1 to 9	subfields="2" width="1"	start="21" finish="22"	10 or 01
1	1, 2, 3 and 9	subfields="2" width="1"	start="21" finish="22"	10 or 01
1, 3	1 to 9	subfields="2" width="1"	start="21" finish="22"	13
1	1 to 9	subfields="2" width="2"	start="21" finish="24"	x1x0 or x0x1
none	1 to 9	subfields="2" width="1"	start="21" finish="22"	00
2	1, 2, 3 and 9	subfields="2" width="1"	start="21" finish="24"	20xx or 02xx
1, 42	1 to 999	subfields="2" width="3"	start="21" finish="26"	001042
missing	1 to 999	subfields="2" width="3"	start="21" finish="26"	bbbbbb
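The spread-format rows above follow from two justification rules: subfields are packed from the left of the position, and each value sits at the right of its subfield. The Python sketch below applies them; decode_spread is a hypothetical name, and treating an individual blank subfield as unused is an assumption.

```python
from typing import Optional


def decode_spread(field: str, subfields: int, width: int,
                  max_value: int) -> Optional[list[int]]:
    """Return the category values in a spread field, or None if missing."""
    digits = len(str(max_value))
    if digits > width or subfields * width > len(field):
        raise ValueError("field is too narrow for the spread definition")
    values = []
    blank = 0
    for i in range(subfields):
        sub = field[i * width:(i + 1) * width]  # subfields are left justified
        data = sub[-digits:]                    # value is right justified
        if data.strip(" ") == "":
            blank += 1
            continue
        text = data.lstrip(" ")
        if not (text.isascii() and text.isdigit()):
            raise ValueError("subfield does not contain an integer")
        value = int(text)
        if value != 0:  # 0 marks a subfield that is not needed
            values.append(value)
    return None if blank == subfields else values
```

Values are returned in the order they appear in the record, so "10" and "01" both decode to a single response of 1.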

Variables of type quantity

Data is recorded as a number with the same number of decimal places as were used in the values element specification of the corresponding variable. A decimal point should always appear if one was used in the values element specification.

The data field length should just accommodate the longest allowable value defined by the values element specification. When calculating the physical size of data for the variable, an allowance should be made for the sign of negative values. Negative numbers are represented with a leading minus sign, '-'. No such allowance should be made for (the sign of) positive values. If a particular value can be represented in a smaller length then it is right justified in the data field and leading spaces or zeros are used as padding. For negative values the spaces should appear to the left of the '-', but leading zeros should appear to the right of the '-'.

If the data field length from the values element is less than that defined in the position element then it is assumed to be right justified within the locations defined in the position. Import programs should then ignore any extra parts of the position field.

For example:-

Data value	<range> element	<position> element	Data record b=space, x=ignore
7	from="0" to="99"	start="21" finish="22"	b7 or 07
7.00	from="0.00" to="99.99"	start="21" finish="25"	b7.00 or 07.00
-7	from="-99" to="99"	start="21" finish="23"	b-7 or -07
7	from="-1" to="99"	start="21" finish="22"	b7 or 07
7	from="-1" to="99"	start="21" finish="23"	xb7 or x07
-1.00	from="-1.00" to="99.99"	start="21" finish="26"	x-1.00
17	from="0" to="999"	start="21" finish="22"	illegal
99	from="0" to="50" with additional <value code="99">	start="21" finish="22"	99
missing	from="0" to="999"	start="21" finish="23"	bbb
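For quantity variables the data field length comes from the longest text among the from, to and code values, including any minus sign. The Python sketch below uses that rule to reproduce the table rows; decode_quantity is a hypothetical name, and the caller is assumed to pass the value texts exactly as written in the definition file.

```python
import re
from decimal import Decimal
from typing import Optional

_NUMBER = re.compile(r"-?(\d+\.?\d*|\.\d+)")


def decode_quantity(field: str, allowed_texts: list[str]) -> Optional[Decimal]:
    """Decode the position field of a quantity variable; None means missing.

    `allowed_texts` holds the from, to and code values as written in the
    <values> element, e.g. ["-1.00", "99.99"].
    """
    width = max(len(text) for text in allowed_texts)
    if len(field) < width:
        raise ValueError("position is narrower than the longest value")
    data = field[-width:]  # value is right justified; ignore the rest
    if data.strip(" ") == "":
        return None
    text = data.lstrip(" ")  # padding spaces sit to the left of any '-'
    if _NUMBER.fullmatch(text) is None:
        raise ValueError("field does not contain a number")
    return Decimal(text)  # leading zeros after the '-' are accepted
```

Decimal is used rather than float so that the number of decimal places recorded in the data file is preserved.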

Variables of type character

Data is recorded as the original character string.

The length of the field is simply the value defined by the size element of the corresponding variable. If the data field length from the size element is less than that defined in the position element then it is assumed to be left justified within the locations defined in the position. Import programs should then ignore any extra parts of the position field.

For example a character variable of: <size>10</size> and data as the word character would be recorded as: "character "

Variables of type logical

Data is recorded such that character ‘0’ represents FALSE and character ‘1’ represents TRUE.

The length of the field is always one character. If the <position> element defines a width of more than one character then the rightmost character is used and all others should be ignored.

For example, a value of true would be represented as: 1
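The character and logical rules differ mainly in which end of the position field holds the data: character data is left justified, while a logical value is the rightmost character. The Python sketch below illustrates both; the function names are hypothetical, and treating a space in the rightmost location of a logical field as missing is an assumption.

```python
from typing import Optional


def decode_character(field: str, size: int) -> Optional[str]:
    """Return the first `size` characters, or None if they are all spaces."""
    if len(field) < size:
        raise ValueError("position is narrower than the size")
    data = field[:size]  # character data is left justified
    return None if data.strip(" ") == "" else data


def decode_logical(field: str) -> Optional[bool]:
    """Decode a logical field from its rightmost character."""
    ch = field[-1]
    if ch == " ":
        return None
    if ch not in "01":
        raise ValueError("expected '0', '1' or space")
    return ch == "1"
```

Trailing padding in character data is returned unchanged, because it cannot be told apart from trailing spaces in the original string; callers that do not need it can strip it.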

Examples

Example triple-s Definition File

The example defines a survey with six variables (one each of types single, multiple bitstring and spread format, character, quantity and logical).

<?xml version="1.0"?>

<!DOCTYPE sss SYSTEM "triple-s.dtd">

<date>21 September 1999</date>

<origin>Export 1.42</origin>

<title>Historic House Exit Survey</title>

<label>Number of visits</label>

<value code="1">First visit</value>

<value code="2">Visited before within the year</value>

<value code="3">Visited before that</value>

</values>

</variable>

<label>Attractions visited</label>

<value code="1">Sherwood Forest</value>

<value code="2">Nottingham Castle</value>

<value code="3">"Friar Tuck" Restaurant</value>

<value code="4">"Maid Marion" Cafe</value>

<value code="5">Mining museum</value>

<value code="9">Other</value>

</values>

</variable>

<label>Other attractions visited</label>

</variable>

<label>Two favourite attractions visited</label>

<value code="1">Sherwood Forest</value>

<value code="2">Nottingham Castle</value>

<value code="3">"Friar Tuck" Restaurant</value>

<value code="4">"Maid Marion" Cafe</value>

<value code="5">Mining museum</value>

<value code="9">Other</value>

</values>

</variable>

<label>Miles travelled</label>

<value code="999">Not stated</value>

</values>

</variable>

<label>Enjoyed visit</label>

</variable>

</record>

</survey>

</sss>

Example triple-s Data File

2101000001Amusement Park 190121

3010000000 209991

2100100001"Marco's" Restaurant 940580

Interpretation

The previous data corresponds to respondent data as follows:

Respondent 1:

Number of visits:	Visited before within the year	[2]
Attractions visited:	Sherwood Forest	[1]
	"Friar Tuck" Restaurant	[3]
	Other	[9]
Other attractions:	Amusement Park
Favourite attractions:	Sherwood Forest	[1]
	Amusement Park	[9]
Miles travelled:	12
Enjoyed visit:	TRUE

Respondent 2:

Number of visits:	Visited before that	[3]
Attractions visited:	Nottingham Castle	[2]
Other attractions:
Favourite attractions:	Nottingham Castle	[2]
Miles travelled:	Not stated	[999]
Enjoyed visit:	TRUE

Respondent 3:

Number of visits:	Visited before within the year	[2]
Attractions visited:	Sherwood Forest	[1]
	"Maid Marion" Cafe	[4]
	Other	[9]
Other attractions:	"Marco's" Restaurant
Favourite attractions:	"Marco's" Restaurant	[9]
	"Maid Marion" Cafe	[4]
Miles travelled:	58
Enjoyed visit:	FALSE

The triple-s XML DTD

Summary

The triple-s XML DTD is given below.

As with any XML document type, the DTD is required if a triple-s XML Definition File is to be verified as 'valid' rather than merely 'well formed'.

<!-- temporary parameter entities -->

<!ENTITY % vartype "single |

multiple |

quantity |

character |

logical" >

<!ELEMENT sss (date?, time?, origin?, user?, survey)>

<!ATTLIST sss version CDATA #REQUIRED

options CDATA #IMPLIED>

<!ELEMENT date (#PCDATA)>

<!ELEMENT time (#PCDATA)>

<!ELEMENT origin (#PCDATA)>

<!ELEMENT user (#PCDATA)>

<!ELEMENT survey (title?, record)>

<!ELEMENT title (#PCDATA)>

<!ELEMENT record (variable+)>

<!ATTLIST record ident CDATA #REQUIRED>

<!ELEMENT variable (name, label, position,

((spread?, values) | size)?)>

<!ATTLIST variable ident CDATA #REQUIRED

type (%vartype;) #REQUIRED>

<!ELEMENT name (#PCDATA)>

<!ELEMENT label (#PCDATA)>

<!ELEMENT position EMPTY>

<!ATTLIST position start CDATA #REQUIRED

finish CDATA #IMPLIED>

<!ELEMENT spread EMPTY>

<!ATTLIST spread subfields CDATA #REQUIRED

width CDATA #IMPLIED>

<!ELEMENT values (value+ | (range, value*))>

<!ELEMENT value (#PCDATA)>

<!ATTLIST value code CDATA #IMPLIED>

<!ELEMENT range EMPTY>

<!ATTLIST range from CDATA #REQUIRED

to CDATA #REQUIRED>

<!ELEMENT size (#PCDATA)>

Changes from triple-s 1.1 to triple-s XML 1.1

Summary

The definition file is now expressed in XML syntax according to rules expressed in the triple-s XML 1.1 DTD. triple-s XML version 1.1 implements the same feature set as triple-s version 1.1, with the exception of the SPECIAL directive, which is now obsolete.

Changes from triple-s 1.0 to triple-s 1.1

Summary

The triple-s version 1.1 standard is based on the version 1.0 standard, but is not a true superset. The following sections provide a summary of the main changes from version 1.0.

New statements

The <!-- comment_text --> method of identifying comments allows parts of the definition file to be skipped.

The standardnames option will assist importing programs in generating the names of variables.

The specifications for the <position> element mean that the location of the data values in the data file are explicit. Parts of the data record may be skipped or used for more than one variable.

The <spread> statement allows the data for multiple type variables to be stored as actual category values in the data file.

Changed statements

The <values> element can (now) define both a range of legal data values and explicitly named codes. In this new specification it is now the only way to define the data values for single, multiple and quantity variable types.

The <size> element is (now) only used for character variable types.

Obsolete statements

The size statement for single, multiple and quantity variable types has been replaced by the value range syntax in the values block.

A values block with a list of unnumbered categories is no longer supported.

triple-s and triple-s XML are trademarks of The triple-s Group.
All other trademarks are properties of their respective companies.

© 2000 The triple-s Group. All rights reserved.