dirfile-format

Language: en

Version: 22 December 2008 (fedora - 06/07/09)

Section: 5 (File formats)

NAME

dirfile-format - the dirfile database format specification file

DESCRIPTION

The dirfile format file fully specifies the raw and derived time streams and auxiliary information for a dirfile(5) database.

The format file is a case sensitive text file called format located in the dirfile directory. The explicit text encoding of the file is not specified by these standards, but must be 7-bit ASCII compatible. Examples of acceptable character encodings include all the ISO 8859 character sets (i.e. Latin-1 through Latin-10, among others), as well as the UTF-8 encoding of Unicode and UCS.

SYNTAX

The format file is composed of field specification lines and directive lines, optionally separated by blank lines or lines containing only whitespace. Lines are separated by the line-feed character (0x0A). Unless escaped (see below), the hash mark (#) is the comment delimiter; the comment delimiter, and any text following it to the end of the line, is ignored.

Tokens

Both field specification lines and directive lines consist of several tokens separated by whitespace. Whitespace consists of one or more whitespace characters. These are: space (0x20), horizontal tab (0x09), vertical tab (0x0B), form-feed (0x0C), and carriage return (0x0D). The first token of a directive line is always a reserved word. The first token of a field specification line is never a reserved word.

Since tokens are separated by whitespace, to include a whitespace character in a token, it must either be escaped by preceding it with a backslash character (\), or be replaced by a character escape sequence (see below), or else the token must be enclosed in quotation marks ("). The quotation marks themselves will be stripped from the token. The null-token (that is, the token consisting of zero characters) may be specified by a pair of quotation marks with nothing between them (""). To include a literal quotation mark in a token, it must be escaped (\"). Similarly, a hash mark may be included in a token by including it in a quoted token or else by escaping it (\#), otherwise the hash mark will be understood as the comment delimiter.

It is a syntax error to have a line which contains unmatched quotation marks, or in which the last character is the backslash character.

Several characters when escaped by a preceding backslash character are interpreted as special characters in tokens. The character escape sequences are:

\a
an alert (bell) character (ASCII 0x07 / U+0007)
\b
a backspace character (ASCII 0x08 / U+0008)
\e
an escape character (ASCII 0x1B / U+001B)
\f
a form-feed character (ASCII 0x0C / U+000C)
\n
a line-feed character (ASCII 0x0A / U+000A)
\r
a carriage return character (ASCII 0x0D / U+000D)
\t
a horizontal tab character (ASCII 0x09 / U+0009)
\v
a vertical tab character (ASCII 0x0B / U+000B)
\\
a backslash character (ASCII 0x5C / U+005C)
\ooo
the single byte given by the octal number ooo.
\xhh
the single byte given by the hexadecimal number hh.
\uhhhhhhh
the UTF-8 byte sequence encoding the Unicode code point given by the hexadecimal number hhhhhhh.

Any other character which is escaped is interpreted as the character itself (for example, \c is interpreted as c).

No token may contain the NULL character (ASCII 0x00 / U+0000). Furthermore, although support is present to create UTF-8 byte sequences, tokens are not required to be valid UTF-8 sequences. Any byte sequence not containing the NULL character forms a valid token. However, there may be further restrictions on allowed characters for a token in a particular situation (for example, when used as a field name).
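The tokenisation rules above can be sketched in Python as follows. This is an illustrative reading of the rules, not the reference parser: the function names are hypothetical, and several details the text leaves open are assumptions. In particular, the sketch processes escape sequences inside quoted tokens, treats \x or \u with no hexadecimal digits after it as the literal letter, rejects octal escapes larger than one byte, and relies on Python's own UTF-8 encoder for \u (which refuses surrogates and code points above U+10FFFF).

```python
WHITESPACE = b" \t\v\f\r"   # line-feed separates lines, so it is not listed here
OCT_DIGITS = b"01234567"
HEX_DIGITS = b"0123456789abcdefABCDEF"
SIMPLE_ESCAPES = {
    ord("a"): 0x07, ord("b"): 0x08, ord("e"): 0x1B, ord("f"): 0x0C,
    ord("n"): 0x0A, ord("r"): 0x0D, ord("t"): 0x09, ord("v"): 0x0B,
}


def _digit_run(line, start, allowed, limit):
    """Return the index just past a run of at most `limit` digits from `allowed`."""
    end = start
    while end < len(line) and end - start < limit and line[end] in allowed:
        end += 1
    return end


def _decode_escape(line, i):
    """Decode the escape whose backslash is at line[i]; return (bytes, next index)."""
    if i + 1 >= len(line):
        raise SyntaxError("line ends with a backslash")
    c = line[i + 1]
    if c in OCT_DIGITS:                        # \ooo: a single byte, in octal
        end = _digit_run(line, i + 1, OCT_DIGITS, 3)
        value = int(line[i + 1:end], 8)
        if value > 0xFF:                       # assumption: reject, do not wrap
            raise SyntaxError("octal escape does not fit in one byte")
        return bytes([value]), end
    if c in b"xu":                             # \xhh: one byte; \uhhhhhhh: UTF-8
        limit = 2 if c == ord("x") else 7
        end = _digit_run(line, i + 2, HEX_DIGITS, limit)
        if end > i + 2:
            value = int(line[i + 2:end], 16)
            if c == ord("x"):
                return bytes([value]), end
            return chr(value).encode("utf-8"), end
    # Named escapes map to control characters; any other character is itself.
    return bytes([SIMPLE_ESCAPES.get(c, c)]), i + 2


def tokenize_line(line: bytes) -> list:
    """Split one format-file line (without its trailing line-feed) into tokens."""
    tokens, current = [], bytearray()
    in_token = quoted = False
    i = 0
    while i < len(line):
        c = line[i]
        if c == 0x5C:                          # backslash escape
            decoded, i = _decode_escape(line, i)
            current += decoded
            in_token = True
            continue
        if c == 0x22:                          # quotation marks are stripped
            quoted = not quoted
            in_token = True
        elif c == 0x23 and not quoted:         # unescaped, unquoted '#': comment
            break
        elif c in WHITESPACE and not quoted:   # token separator
            if in_token:
                tokens.append(bytes(current))
                current.clear()
                in_token = False
        else:
            current.append(c)
            in_token = True
        i += 1
    if quoted:
        raise SyntaxError("unmatched quotation mark")
    if in_token:
        tokens.append(bytes(current))
    if any(0 in token for token in tokens):
        raise SyntaxError("tokens may not contain the NULL character")
    return tokens
```

Tokens are kept as bytes rather than text because any byte sequence without the NULL character is a valid token, whether or not it is valid UTF-8.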

Directives

There are eight reserved words, which cannot be used as field names in the dirfile. Instead, these specify directives. Any reserved word may omit its initial forward slash (/), without change in meaning. Future versions of the Standards may require the slash to distinguish a reserved word from a field name. Like the rest of the format file, directives are case sensitive.

A number of the directives have fragment scope. A directive with fragment scope only applies to the fragment in which it is present, plus any sub-fragments indicated by the /INCLUDE directive, but only if those sub-fragments don't have their own corresponding directive. Directives which have fragment scope are: /ENCODING, /ENDIAN, /FRAMEOFFSET, and /PROTECT. Because of these scoping rules, different portions of the dirfile may have different encodings, endiannesses, frame offsets, or protection levels.

If a directive with fragment scope appears more than once in a fragment, only the last such directive will be honoured, with the exception that the effect of a directive will not be propagated to sub-fragments if the directive line appears after the sub-fragment is included. The scoping rules of the remaining directives are discussed below.
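One way to read the fragment scope rules is sketched below in Python. The data model is hypothetical: each fragment is a list of (kind, key, value) events in file order, where kind is either "directive" or "include". A sub-fragment inherits whatever value is in effect at the point of its /INCLUDE line, and its own directive, if any, overrides that.

```python
def resolve_scope(fragments, name, directive, inherited=None, out=None):
    """Map each fragment name to the effective value of one fragment-scope directive.

    `fragments` maps a fragment name to a list of (kind, key, value) events:
    ("directive", "ENDIAN", "big") or ("include", "sub.format", None).
    """
    out = {} if out is None else out
    current = inherited
    for kind, key, value in fragments[name]:
        if kind == "include":
            # The sub-fragment sees only directives that precede its inclusion.
            resolve_scope(fragments, key, directive, current, out)
        elif kind == "directive" and key == directive:
            # A later directive in the same fragment replaces an earlier one.
            current = value
    out[name] = current
    return out
```

For example, with a root fragment that sets ENDIAN to big, includes fragment a, sets ENDIAN to little, and then includes fragment c, this sketch reports little for the root and for c, but big for a.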

/ENCODING
The ENCODING directive specifies the encoding scheme used to encode binary files in the dirfile. The encoding scheme may be one of the predefined names listed below, which are described in more detail in dirfile-encoding(5), or any other site-specific encoding scheme. The predefined scheme names are:
none
The dirfile is unencoded.
bzip2
The dirfile is compressed using the bzip2 compression scheme.
gzip
The dirfile is compressed using the gzip compression scheme.
slim
The dirfile is compressed using the slim compression scheme.
text
The dirfile is text encoded.

Implementations should fail gracefully when encountering an unknown encoding scheme. If no encoding scheme is specified, behaviour is implementation dependent. Syntax is:

/ENCODING <scheme>

The ENCODING directive has fragment scope.

/ENDIAN
The ENDIAN directive specifies the endianness of the raw data in the database. In previous versions of the Dirfile Standard, raw data was always assumed to be little-endian. This assumption has been removed. The assumed endianness of raw data in dirfiles which omit this directive is implementation dependent. Syntax is:
/ENDIAN ( big | little )

The ENDIAN directive has fragment scope.

/FRAMEOFFSET
The FRAMEOFFSET directive specifies the frame number of the first frame for which data exists in binary files associated with RAW fields. Syntax is:
/FRAMEOFFSET <integer>

The FRAMEOFFSET directive has fragment scope.
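As a rough illustration of what the frame offset means for locating data, the Python sketch below computes the byte position of a frame in the binary file of a RAW field. It assumes an unencoded file with no header whose first sample belongs to frame number frame-offset. The function name and parameters are hypothetical; sample_rate is the number of samples per frame and sample_size is the size of one sample in bytes.

```python
def raw_byte_offset(frame: int, frame_offset: int, sample_rate: int, sample_size: int) -> int:
    """Return the byte offset of the first sample of `frame` in an unencoded RAW file."""
    if frame < frame_offset:
        raise ValueError("no data is stored before the frame offset")
    # Frames before frame_offset are simply absent from the file.
    return (frame - frame_offset) * sample_rate * sample_size
```

With a frame offset of 100, 20 samples per frame and 2-byte samples, frame 105 would start 200 bytes into the file under these assumptions.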

/INCLUDE
The INCLUDE directive specifies another file (called a format file fragment) to parse for additional format specification for the dirfile. The inclusion is treated as if the lines of the fragment were pasted verbatim in place of the INCLUDE directive line. The exception to this is that RAW fields specified in the fragment are located in the directory containing the fragment and not in the directory containing the parent format file, and the binary file encoding may be different for each fragment. The fragment may be specified either with an absolute path, or else a relative path from the current file. Syntax is:
/INCLUDE <file>

The INCLUDE directive has no scope: it is processed immediately and has no long-term effect.

/META
The META directive specifies a metafield attached to a particular parent field. The field metadata may be of any allowed type except RAW. Metafields are retrieved in exactly the same way as regular field data, but the field code specified consists of the parent and metafield names joined with a forward slash:
<parent-field>/<meta-field>

META fields may not be specified before their parent field has been. Syntax is:

/META <parent-field> {field specification line}

As an illustration of this concept,

/META pfield meta CONST FLOAT64 3.291882

provides a scalar metadatum called meta with value 3.291882 attached to the field pfield. This particular metafield may be referred to by the field code "pfield/meta". Note that different parent fields may have metafields with the same name, since all references to metafields must include the parent field name. Metafields may not themselves have further sub-metafields.

The META directive has no scope: it is processed immediately and has no long-term effect.

/PROTECT
The PROTECT directive specifies the advisory protection level of the current fragment and of the RAW fields defined therein. The protection level indicates whether writing to the format file fragment or to the binary data on disk is permitted. Syntax is:
/PROTECT <level>

Four advisory protection levels are defined:

none
No protection at all: data and metadata may be freely changed. This is the default, if no PROTECT directive is present.
format
The dirfile metadata is protected from change, but RAW data on disk may be modified.
data
The RAW data on disk is protected from change, but metadata may be modified.
all
Both metadata and data on disk are protected from change.

The PROTECT directive has fragment scope.
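The four levels can be summarised as a small lookup table. The Python sketch below is only an illustration; the names write_permitted, "metadata" and "data" are hypothetical labels, not part of the Standards, and the levels remain advisory.

```python
# Which kinds of content each advisory level protects from change.
PROTECTED_TARGETS = {
    "none": frozenset(),
    "format": frozenset({"metadata"}),
    "data": frozenset({"data"}),
    "all": frozenset({"metadata", "data"}),
}


def write_permitted(level: str, target: str) -> bool:
    """Return True if `level` allows writing to `target` ("metadata" or "data")."""
    if target not in ("metadata", "data"):
        raise ValueError("target must be 'metadata' or 'data'")
    return target not in PROTECTED_TARGETS[level]
```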

/REFERENCE
The REFERENCE directive specifies the name of the field to use as the dirfile's reference field (see dirfile(5)). If no REFERENCE directive is specified, the first RAW field encountered is used as the reference field. The REFERENCE directive must specify a RAW field. Syntax is:
/REFERENCE <field-code>

The REFERENCE directive has global scope: if multiple REFERENCE directives appear in the dirfile metadata, only the last such will be honoured.

/VERSION
The VERSION directive specifies the particular version of the Dirfile Standards to which the dirfile format file conforms. This directive should occur before any version dependent syntax is encountered. As of Standards Version 6, no such syntax exists, and this directive is provided primarily to ease forward compatibility. Syntax is:
/VERSION <integer>

The VERSION directive has immediate scope: its effect is immediate, and it applies only to metadata below it, including and propagating downwards to sub-fragments after the directive. Its effect will also propagate upwards back to the parent fragment, and affect subsequent metadata.

Field Specification Lines

Any line which does not start with a reserved word is assumed to be a field specification line. The first token in a field specification line is the field name. The field name consists of one or more characters, excluding both ASCII control characters (the bytes 0x01 through 0x1F), and the characters

& / ; < > | .

which are reserved. The field name may not be INDEX, which is a special, implicit field which contains the integer frame index. Field names are case sensitive. The second token in the field specification line is the field type. The meaning of subsequent tokens depends on the field type.
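The field name restrictions can be checked with a few lines of Python. This is a sketch with a hypothetical function name. It operates on the token as bytes, and it also rejects the eight reserved words, since a line starting with one of them would be read as a directive.

```python
RESERVED_WORDS = {
    b"ENCODING", b"ENDIAN", b"FRAMEOFFSET", b"INCLUDE",
    b"META", b"PROTECT", b"REFERENCE", b"VERSION",
}
RESERVED_CHARS = b"&/;<>|."


def is_valid_field_name(name: bytes) -> bool:
    """Return True if `name` satisfies the field name rules described above."""
    if not name or name == b"INDEX" or name in RESERVED_WORDS:
        return False
    # Bytes below 0x20 are ASCII control characters (NULL is excluded from all tokens).
    return not any(byte < 0x20 or byte in RESERVED_CHARS for byte in name)
```

Because field names are case sensitive, a name such as endian passes this check while ENDIAN does not.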

Some of the parameters in a field specification line may be either literal numbers or else the field code of a CONST field containing the number. Such parameters are indicated below. Since it is possible to create a field code which is identical to a literal number, a parameter is assumed to be the field code of a CONST field only if the entire token cannot be parsed as a literal number using the rules outlined in strtod(3). (So, for example, a CONST field whose field code consists solely of digits can never be used as a parameter in a field specification line.)
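The literal-first rule can be sketched in Python as below. This only approximates strtod(3): Python's float() and float.fromhex() differ from it in corner cases (for example, the nan(...) form is not handled here), so the extra checks are assumptions made to stay close to the C behaviour. The names parse_literal, resolve_parameter and const_fields are hypothetical.

```python
def parse_literal(token: str):
    """Return the numeric value if the whole token parses as a number, else None."""
    # float() tolerates underscores, surrounding whitespace and non-ASCII digits,
    # none of which strtod accepts as part of a number.
    if not token.isascii() or "_" in token or token != token.strip():
        return None
    try:
        return float(token)
    except ValueError:
        pass
    # strtod also reads hexadecimal floats, but only with the 0x prefix.
    if token.lstrip("+-").lower().startswith("0x"):
        try:
            return float.fromhex(token)
        except (ValueError, OverflowError):
            return None
    return None


def resolve_parameter(token: str, const_fields: dict) -> float:
    """Resolve a parameter: a literal number wins over a CONST field of the same name."""
    value = parse_literal(token)
    if value is not None:
        return value
    return const_fields[token]  # KeyError if no such CONST field exists
```

With this reading, a CONST field named 123 is shadowed by the literal 123, which matches the parenthetical remark above.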

There are eight field types. Of these, six are of vector type (BIT, LINCOM, LINTERP, MULTIPLY, PHASE, and RAW) and two are of scalar type (CONST and STRING). The possible field types are:

BIT
The BIT vector field type extracts one or more bits out of an input vector field. Syntax is:
<field-name> BIT <input> <first-bit> [<bits>]

which specifies field-name to be the value of bits first-bit through first-bit+bits-1 of the input vector field input, when input is converted from its native type to an (endianness corrected) unsigned 64-bit integer. If bits is omitted, it is assumed to be 1. Both first-bit and bits may be either literal numbers, or else the field code of a CONST field type containing their values.
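The extraction amounts to a shift and a mask. The Python sketch below assumes that bit 0 is the least significant bit and that negative integers are converted to unsigned 64-bit by two's complement wrap-around; the text above does not spell out either point. The function name is hypothetical, and only integer inputs are handled.

```python
def extract_bits(value: int, first_bit: int, bits: int = 1) -> int:
    """Return bits first_bit .. first_bit+bits-1 of `value` viewed as unsigned 64-bit."""
    if first_bit < 0 or bits < 1 or first_bit + bits > 64:
        raise ValueError("bit range must lie within 64 bits")
    unsigned = value & 0xFFFFFFFFFFFFFFFF      # assumed two's complement conversion
    return (unsigned >> first_bit) & ((1 << bits) - 1)
```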

CONST
The CONST scalar field type is a constant fully specified in the format file metadata. Syntax is:
<field-name> CONST <type> <value>

where type may be any supported native data type (see the description of the RAW field type below), and value is the numerical value of the constant interpreted as indicated by type.

LINCOM
The LINCOM vector field type is the linear combination of one, two or three input vector fields. Syntax is:
<field-name> LINCOM <n> <field1> <a1> <b1> [<field2> <a2> <b2> [<field3> <a3> <b3>]]

where n indicates the number of input vector fields (1, 2, or 3). The derived field will be computed as:

field-name[n] = (a1 * field1[n] + b1) + (a2 * field2[n2] + b2) + (a3 * field3[n3] + b3)

with the field2 and field3 terms included only if specified and the indices n2 and n3 computed appropriately for the (potentially differing) sample rates of the input fields. The resultant field will have the same sample rate as field1. Each supplied coefficient (a1, b1, a2, etc.) may be either a literal number, or else the field code of a CONST field type containing its value.
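A minimal Python sketch of the LINCOM computation follows. The Standards say only that the indices of the other fields are computed appropriately; this sketch assumes the index of a field with sample rate spf_i is floor(n * spf_i / spf_1), and that all inputs cover the same number of frames. The function name and the tuple layout are hypothetical.

```python
def lincom(fields):
    """Compute a LINCOM field.

    `fields` is a list of one to three tuples (samples, samples_per_frame, a, b).
    The output has one value per sample of the first field.
    """
    if not 1 <= len(fields) <= 3:
        raise ValueError("LINCOM takes one, two or three input fields")
    first_samples, first_spf = fields[0][0], fields[0][1]
    out = []
    for n in range(len(first_samples)):
        total = 0.0
        for samples, spf, a, b in fields:
            index = n * spf // first_spf       # assumed sample-rate mapping
            total += a * samples[index] + b
        out.append(total)
    return out
```

For instance, combining a field sampled twice per frame with one sampled once per frame reuses each sample of the slower field for two consecutive output samples.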

LINTERP
The LINTERP vector field type specifies a table look up based on another vector field. Syntax is:
<field-name> LINTERP <input> <table>

where input is the input vector field for the table lookup, and table is the path to the lookup table file for the field. If this path is relative, it is assumed to be relative to the directory containing the format file fragment defining this field. The lookup table file is an ASCII text file with two whitespace separated columns of x and y values. Values are linearly interpolated between the points specified in the lookup table.
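The lookup can be sketched in Python as below. The sketch assumes the x column is strictly increasing and contains at least two points, skips blank lines, and extrapolates from the nearest segment for inputs outside the table; none of these points is fixed by the text above. The function names are hypothetical.

```python
from bisect import bisect_right


def parse_table(text: str):
    """Parse a two-column lookup table into parallel lists of x and y values."""
    xs, ys = [], []
    for line in text.splitlines():
        columns = line.split()
        if not columns:
            continue                      # skip blank lines (assumption)
        xs.append(float(columns[0]))
        ys.append(float(columns[1]))
    return xs, ys


def linterp(value: float, xs, ys) -> float:
    """Linearly interpolate `value` between the table points (xs, ys)."""
    # Find the segment [xs[i], xs[i + 1]] containing value, clamped to the table.
    i = max(0, min(bisect_right(xs, value) - 1, len(xs) - 2))
    slope = (ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i])
    return ys[i] + slope * (value - xs[i])
```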

MULTIPLY
The MULTIPLY vector field type is the product of two vector fields. Syntax is:
<field-name> MULTIPLY <field1> <field2>

The derived field will be computed as:

field-name[n] = field1[n] * field2[n2]

with the index n2 computed appropriately for the (potentially differing) sample rates of the input fields. The resultant field will have the same sample rate as field1.

PHASE
The PHASE vector field type shifts an input vector field by the specified number of samples. Syntax is:
<field-name> PHASE <input> <shift>

which specifies field-name to be the input vector field, input, shifted by shift samples. A positive shift indicates a shift forward in time. The result of shifting past the beginning- or end-of-file is implementation dependent. The shift parameter may be either a literal number, or else the field code of a CONST field type containing its value.
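As an illustration, the Python sketch below takes one possible reading of the shift direction, namely that output sample n is input sample n + shift. Since shifting past either end of the data is implementation dependent, the sketch simply fills those positions with a caller-supplied placeholder. The function name and the fill parameter are hypothetical.

```python
def phase(samples, shift: int, fill=None):
    """Return `samples` shifted so that output[n] = samples[n + shift]."""
    out = []
    for n in range(len(samples)):
        source = n + shift
        # Positions that would read outside the data get the placeholder.
        out.append(samples[source] if 0 <= source < len(samples) else fill)
    return out
```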

RAW
The RAW vector field type specifies raw time streams on disk. In this case, the field name should correspond to the name of the file containing the time stream. Syntax is:
<field-name> RAW <type> <sample-rate>

where sample-rate is the number of samples per dirfile frame for the time stream and type is a token specifying the native data format type:

UINT8
unsigned 8-bit integer
INT8
signed 8-bit integer
UINT16
unsigned 16-bit integer
INT16
signed 16-bit integer
UINT32
unsigned 32-bit integer
INT32
signed 32-bit integer
UINT64
unsigned 64-bit integer
INT64
signed 64-bit integer
FLOAT32 or FLOAT
IEEE-754 standard 32-bit single precision floating point number
FLOAT64 or DOUBLE
IEEE-754 standard 64-bit double precision floating point number

For backwards compatibility, implementations should also recognise the following single character type aliases in use prior to Standards Version 5:

c
UINT8
u
UINT16
s
INT16
U
UINT32
i, S
INT32
f
FLOAT32
d
FLOAT64

Types INT8, UINT64, and INT64 are not supported before Standards Version 5, so no single character type aliases exist for these types.

The sample-rate parameter may be either a literal number, or else the field code of a CONST field type containing its value.
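The type names map naturally onto fixed-size binary formats. The Python sketch below decodes the bytes of an unencoded RAW file using the standard struct module; it assumes the data has no header and that the endianness is already known (for example, from an /ENDIAN directive). The names TYPE_CODES, LEGACY_ALIASES and decode_raw are hypothetical.

```python
import struct

# struct format characters for each native type name.
TYPE_CODES = {
    "UINT8": "B", "INT8": "b", "UINT16": "H", "INT16": "h",
    "UINT32": "I", "INT32": "i", "UINT64": "Q", "INT64": "q",
    "FLOAT32": "f", "FLOAT": "f", "FLOAT64": "d", "DOUBLE": "d",
}
# Single character aliases in use prior to Standards Version 5.
LEGACY_ALIASES = {
    "c": "UINT8", "u": "UINT16", "s": "INT16", "U": "UINT32",
    "i": "INT32", "S": "INT32", "f": "FLOAT32", "d": "FLOAT64",
}


def decode_raw(data: bytes, type_name: str, endian: str = "little") -> list:
    """Decode the contents of an unencoded RAW file into a list of samples."""
    code = TYPE_CODES[LEGACY_ALIASES.get(type_name, type_name)]
    prefix = "<" if endian == "little" else ">"   # standard sizes, no padding
    size = struct.calcsize(prefix + code)
    count = len(data) // size                     # ignore a trailing partial sample
    return list(struct.unpack(f"{prefix}{count}{code}", data[:count * size]))
```

Dividing the number of decoded samples by the sample rate of the field gives the number of frames stored in the file.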

STRING
The STRING scalar field type is a character string fully specified in the format file metadata. Syntax is:
<field-name> STRING <value>

where value is the string value of the field. Note that value is a single token. To include whitespace in the string, enclose value in quotation marks ("), or else escape the whitespace with the backslash character (\).

STANDARDS VERSIONS

This document describes Version 6 of the Dirfile Standards.

Version 6 of the Standards (October 2008) added the /ENCODING, /META, /PROTECT, and /REFERENCE directives, and the CONST and STRING field types. It permitted whitespace in tokens and introduced the character escape sequences. It allowed CONST fields to be used as parameters in field specification lines. It also removed FILEFRAM as an alias for INDEX, and allowed '#' and '\' in field codes.

Version 5 of the Standards (August 2008) added /VERSION and /ENDIAN, slash demarcation of reserved words, and removed the restriction on field name length. It introduced the data types INT8, INT64, and UINT64, the new-style type specifiers, and increased the range of the BIT field type from 32 to 64 bits. It also prohibited the characters #&/;<>\.| in field codes.

Version 4 of the Standards (October 2006) added the PHASE field type.

Version 3 of the Standards (January 2006) added INCLUDE and increased the allowed length of a field name from 16 to 50 characters.

Version 2 of the Standards (September 2005) added the MULTIPLY field type.

Version 1 of the Standards (November 2004) added FRAMEOFFSET and the optional fourth argument to the BIT field type.

Version 0 of the Standards (before March 2003) refers to the dirfile standards supported by the getdata(3) library originally introduced into the kst(1) sources, which contained support for all other features covered by this document.

AUTHORS

The dirfile specification was developed by C. B. Netterfield <netterfield@astro.utoronto.ca>

Since Standards Version 3, the dirfile specification has been maintained by D. V. Wiebe <dwiebe@physics.utoronto.ca>

SEE ALSO

dirfile(5), dirfile-encoding(5)