fastrdata

Language: en

Version: fastr data 2.04 (mandriva - 01/05/08)

Section: 1 (User commands)

NAME

fastrdata - the syntax of data structures

DESCRIPTION

For automatic indexing, fastr compiles three databases:
1. A list of single-word rules. At a minimum, this list must include all the single words found in the terms, their morphological and semantic links, and the words in their morphological and semantic families.
2. A list of term rules: the controlled vocabulary.
3. A list of meta-rules: the linguistic transformations that generate variants of the controlled vocabulary.

The single-word list and the term list are independent files. They are compiled with the option -c (see the fastr section). The meta-rule list is part of the Language file (see the fastrlang section) and is automatically compiled upon startup. All the single words composing a term rule must be compiled before the term itself is compiled. Compilation is faster if the single words are sorted in alphabetical order.

DATA STRUCTURES

Each description (single-word rule, term rule, meta-rule, or tagged word) is composed of two parts: a structural part and an informational part.
Structural part
This part is trivially reduced to an atom for single-word rules and tagged words; and is, therefore, not described.
For term rules, the structural part is a Context-free skeleton; for meta-rules, it is a pair of context-free skeletons.
For example:
 
     N1 -> N2 P3 Dd4 N5
 
 
is a context-free skeleton of depth 1 with one root node N1 and four daughter nodes N2, P3, Dd4, and N5.
Each node receives a unique identifier composed of a part of speech category (see the Categories in the fastrlang section) followed by an index: a digit in the range 0-9. The reserved category X is a wild-card category denoting any category or no category.
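As an illustration of the node naming scheme, here is a minimal Python sketch that splits a skeleton into its root and daughter nodes. It is not part of fastr. The function name parse_skeleton and the assumption that a category consists only of letters are hypothetical; the actual category names are defined in the language file.

```python
import re

# Assumption: a node identifier is a category name made of letters,
# followed by a single index digit in the range 0-9.
NODE = re.compile(r"([A-Za-z]+)([0-9])")

def parse_skeleton(text):
    """Split 'N1 -> N2 P3 Dd4 N5' into a root node and its daughter nodes."""
    left, right = text.split("->")
    nodes = []
    for ident in [left.strip()] + right.split():
        match = NODE.fullmatch(ident)
        if match is None:
            raise ValueError(f"not a node identifier: {ident!r}")
        nodes.append((match.group(1), int(match.group(2))))
    # Each node identifier must be unique within the skeleton.
    if len(set(nodes)) != len(nodes):
        raise ValueError("duplicate node identifier")
    return nodes[0], nodes[1:]

root, daughters = parse_skeleton("N1 -> N2 P3 Dd4 N5")
print(root)       # ('N', 1)
print(daughters)  # [('N', 2), ('P', 3), ('Dd', 4), ('N', 5)]
```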
Informational part
This part is a set of constraints (equalities or inequalities) denoting Feature structures.
The feature structures are automatically anchored to the current atom for single-word rules and tagged words.
The paths used in the constraints of term rules or meta-rules must begin with a node of the context-free skeleton(s) in order to stipulate the anchoring of feature structures to the structural part.


A Path is a sequence of features between angle brackets. The sequence of features is preceded by a node identifier in the case of term rules or meta-rules. For example, <N1 head> and <cat> are two paths.


A Value is either a Feature (see the Features in the fastrlang section), or an Integer, or a String between single quotes (single quotes within the string must be back-slashed), or a List of values between parentheses and separated by commas.
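The four kinds of value can be illustrated with a small Python sketch that writes values in the notation described above. The Feature class and the format_value function are hypothetical helpers introduced only for this illustration; they are not part of fastr.

```python
class Feature(str):
    """Hypothetical marker type for bare feature names such as N or plural."""

def format_value(value):
    """Render a Python object in the value notation described above."""
    if isinstance(value, Feature):
        return str(value)                      # features are written bare
    if isinstance(value, int):
        return str(value)                      # integers are written as is
    if isinstance(value, str):
        # Strings go between single quotes; inner quotes are back-slashed.
        return "'" + value.replace("'", "\\'") + "'"
    if isinstance(value, (list, tuple)):
        # Lists go between parentheses, separated by commas.
        return "(" + ",".join(format_value(v) for v in value) + ")"
    raise TypeError(f"unsupported value: {value!r}")

print(format_value("acce'le'ration"))              # 'acce\'le\'ration'
print(format_value(("categor", Feature("N"), 2)))  # ('categor',N,2)
```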


A Constraint is an equality (=) or an inequality (!) between a path and another path or between a path and a value:

1. A constraint equating two paths stipulates that these two paths share the same value.
For example:
 
         <N1 head> = <N2 head>
 
 
2. A constraint equating a path and a value assigns a value to the final node of the path. (A node has at most one value.)
For example:
 
         <P3 lemma> = 'of'
 
 
3. A constraint stating an inequality between a path and a value stipulates that the final node cannot be assigned this value.
For example:
 
         <V3 agreement tense> ! infinitive
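
The three forms of constraint above can be told apart mechanically. The following Python sketch is only an illustration, not fastr's own parser. The names parse_constraint, "equal" and "not_equal" are hypothetical, and values are kept as raw text rather than decoded.

```python
import re

# A path between angle brackets, an operator (= or !), then the right-hand side.
CONSTRAINT = re.compile(r"<([^<>]*)>\s*([=!])\s*(.+)")

def parse_constraint(line):
    """Return (path, operator, right-hand side) for one constraint line."""
    text = line.strip()
    if text.endswith("."):          # the last constraint of a rule ends with a period
        text = text[:-1].rstrip()
    match = CONSTRAINT.fullmatch(text)
    if match is None:
        raise ValueError(f"not a constraint: {line!r}")
    path = tuple(match.group(1).split())
    operator = "equal" if match.group(2) == "=" else "not_equal"
    rhs = match.group(3)
    if rhs.startswith("<") and rhs.endswith(">"):
        return path, operator, ("path", tuple(rhs[1:-1].split()))
    return path, operator, ("value", rhs)

print(parse_constraint("<N1 head> = <N2 head>"))
print(parse_constraint("<P3 lemma> = 'of'"))
print(parse_constraint("<V3 agreement tense> ! infinitive"))
```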
 
 

PRE-TAGGING

There are three modes of indexing: from a tagged, a partially tagged, or an untagged corpus. Each mode of indexing entails different constraints on the features expected for representing the single-word and term rules.
Untagged corpus
Each word of the corpus is morphologically analyzed by fastr using the morphological information contained in the language file (see the fastrlang section). Unknown words or unknown inflections of known words receive an arbitrary part of speech category (see the Unknown category in the fastrlang section).
Partially tagged corpus
Each word of the corpus is morphologically analyzed by fastr using the morphological information contained in the language file (see the fastrlang section). Among the different morphological analyses produced by fastr, only the ones that are compatible with the partial tagging are reported. From here on, partial tagging is treated as a specific case of the untagged corpus.
Tagged corpus
A tagged corpus is expected to contain all the features associated with each word. We do not recommend this mode unless you are very familiar with fastr. In this case, fastr collects the information linked to the inflected words and the derivational and semantic families directly from the corpus. No morphological analysis is performed by the application.

The last section of this page (TAGGED CORPUS) describes the syntax of a tagged or partially tagged corpus.

SINGLE-WORD RULES

Each single word occurring in a term rule must be compiled before the term rule itself can be compiled.
The first line of a single-word rule is the string of the lemma. It is followed by the Informational part of the rule.
The value of the Category path is mandatory whatever the mode (see the Paths in the fastrlang section).
Tagged mode
In the tagged mode, the value of the Reference path is mandatory. It is a key for accessing the lemma.
For example:
 
     Word 'acce\'le\'ration':
         <cat> = N
         <reference> =  1558.
 
 
is the rule of the French noun "accélération".
Untagged mode
In the untagged mode, if the value of the Inflection path is not provided, it automatically receives the value 1. The key for accessing a lemma is the triple composed of its string, its category, and its inflection number.
For example:
 
     Word 'write':
         <cat> =  V
         <inflection> = 10
         <auxStem> = ('wrote','written').
 
 
is the rule of the English verb "to write" with two auxiliary stems.
In the untagged mode, the single-word list can include any single word that is likely to be found within the corpus under study. The more single words are known, the better the recognition of the variants.
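The untagged-mode key and the rule that single words must be known before a term is compiled can be sketched in Python as follows. This is an illustration under stated assumptions: the dictionary lexicon is a hypothetical in-memory stand-in for the compiled single-word list, and lemma_key and missing_words are not fastr functions.

```python
def lemma_key(string, category, inflection=None):
    """Return the (string, category, inflection) triple; inflection defaults to 1."""
    return (string, category, 1 if inflection is None else inflection)

# Hypothetical in-memory stand-in for the compiled single-word list.
lexicon = {
    lemma_key("write", "V", 10): {"auxStem": ("wrote", "written")},
    lemma_key("fatty", "A"): {},
    lemma_key("acid", "N"): {},
}

def missing_words(leaves, known):
    """List the lexical leaves of a term rule that have no single-word rule yet."""
    return [leaf for leaf in leaves if lemma_key(*leaf) not in known]

print(missing_words([("fatty", "A", 1), ("acid", "N", 1)], lexicon))  # []
print(missing_words([("amino", "A"), ("acid", "N")], lexicon))        # [('amino', 'A')]
```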
Morphological and semantic links
A morphological link denotes a relation between a derived lemma and its root. The feature structure of the root lemma is attached to the Root path. It is a triple composed of a string, a part-of-speech category, and an inflection number. Similarly, semantic links are attached to the Semantic paths. For example:
 
     Word 'categorization':
         <cat> =  N
         <inflection> = 1
         <syn> = ('categor',N,2) | ('label',N,1) | ('tag',N,1)
         <root> = ('categor',N,2).
 
 
indicates a morphological link between the lemma "categorization" and its root "category", and a semantic link between the lemma "categorization" and the 3 semantically related words "category", "label", and "tag". (Vertical bars | denote disjunctions of values.)

TERM RULES

The first line of a term rule is the Context-free skeleton denoting the syntactic structure of the term. It is followed by the Informational part of the rule.
The features expected for the lexical leaves are the same as the features expected for Single-word rules.
The value of the Lexicalization path is mandatory. It is a link for accessing the rule in bottom-up filtering. The value of the Lexicalization path is the identifier of one of the lexical leaves of the Context-free skeleton.
Tagged mode
For example:
 
     Rule N1 -> N2 A3:
         <N1 lexicalization> = 'N2'
         <N1 label> = '202907'
         <N2 lemma> = 'dispositif'
         <N2 reference> = 25956
         <A3 lemma> = 'expe\'rimental'
         <A3 reference> = 32154
         <N2 head agreement> = <A3 head agreement>
         <N1 head> = <N2 head>.
 
 
is the rule of the French term "dispositif expérimental". The noun N2 and the adjective A3 must agree in gender and number.
Untagged mode
For example:
 
     Rule N1 -> A2 N3:
         <N1 lexicalization> = 'N3'
         <N1 label> = '001258' 
         <A2 lemma> = 'fatty'
         <A2 inflection> = 1 
         <N3 lemma> = 'acid'
         <N3 inflection> = 1
         <N3 number> = plural
         <N1 head> = <N3 head>.
 
 
is the rule of the English term "fatty acids". The head noun is expected to be plural.

Additionally, the root node may have two optional features (see the Paths in the fastrlang section):

1. The value of the Label path is used for recursively calling rules.
2. The value of the Meta-label path is used for selecting meta-rules.

META-RULES

The first line of a meta-rule is a pair of Context-free skeletons.
The left-hand side skeleton is paired with the syntactic structure of a term rule through unification.
The right-hand side skeleton denotes the syntactic structure of the variant after unification of the left-hand side skeleton.
The pair of Context-free skeletons is followed by the Informational part of the meta-rule.

Except for the left-most one, the daughter nodes of the right-hand side skeleton can be preceded by a Regular expression. The vocabulary of these expressions is the set of part of speech categories (see the Categories in the fastrlang section). Each string accepted by such an expression is bounded by the preceding node on its left and by the current node on its right.

For example:

 
     Metarule NtoV( N1 -> N2 A3 ) 
                  = X1 -> V4 <{P? {Dd|Di} | P} A? N> A3:
         <V4 root reference> = <N2 root reference>
         <N1 metaLabel> = 'XX'.
 
 
is a meta-rule that transforms a French noun-adjective term such as "analyse cytologique" into a variant rule that accepts, for example, "analyser les désordres cytologiques". The verb V4 of the variant, here "analyser", must be morphologically related to the noun N2 of the term, here "analyse" (they must have a common root).
The regular expression <{P? {Dd|Di} | P} A? N> matches the sequence Dd N (the categories of "les désordres"). The language accepted by this expression is composed of the following ten strings: P Dd A N, Dd A N, P Di A N, Di A N, P A N, P Dd N, Dd N, P Di N, Di N, P N.
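
The count of ten strings can be checked by enumerating the language by hand. The Python sketch below builds it from the three ingredients of the expression (optionality, alternation, concatenation). The helper names concat and optional are hypothetical and belong to this illustration only.

```python
from itertools import product

def concat(*parts):
    """Concatenate every combination of one sequence taken from each part."""
    return [sum(combo, ()) for combo in product(*parts)]

def optional(alternatives):
    """Add the empty sequence to a list of alternatives (the ? operator)."""
    return [()] + list(alternatives)

P, A, N, Dd, Di = [("P",)], [("A",)], [("N",)], [("Dd",)], [("Di",)]

# {P? {Dd|Di} | P}: an optional P followed by Dd or Di, or else a single P.
prefix = concat(optional(P), Dd + Di) + P
# <{...} A? N>: the prefix, an optional A, then N.
language = concat(prefix, optional(A), N)

for sequence in language:
    print(" ".join(sequence))
print(len(language))  # 10
```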
 
 
A Starred metarule is a metarule whose identifier is followed by a star (*). In such metarules, in addition to the normal unification, an external process is called for validating the unification (see the External unifier in the fastr section). The features of the current rule are written to a file so as to be read by the external unifier (see the Unification file in the fastr section).

REGULAR EXPRESSIONS

A regular expression matching a category may be followed by one of several repetition operators:
?
The preceding item is optional and matched at most once.
*
The preceding item will be matched zero or more times.
+
The preceding item will be matched one or more times.
n-m
The preceding item is matched at least n times, but not more than m times.

Two regular expressions may be concatenated; the resulting regular expression matches any string formed by concatenating two substrings that respectively match the concatenated subexpressions.

Two regular expressions may be joined by the infix operator |; the resulting regular expression matches any string matching either subexpression.

Repetition takes precedence over concatenation, which in turn takes precedence over alternation. An alternation may be enclosed in curly braces and a concatenation may be enclosed in angle brackets to override these precedence rules.
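
Because the precedence rules above are the same as those of common regular expression engines, a category expression can be mapped token by token onto a Python pattern. The sketch below is an assumption-laden illustration, not fastr's matcher. The names to_python_regex and accepts are hypothetical, categories are assumed to be made of letters, and the n-m repetition operator is left out because its exact written form is not shown here.

```python
import re

TOKEN = re.compile(r"\s*([A-Za-z]+|[{}<>|?*+])")

def to_python_regex(expr):
    """Translate a category expression into a compiled Python pattern."""
    expr = expr.strip()
    out, pos = [], 0
    while pos < len(expr):
        match = TOKEN.match(expr, pos)
        if match is None:
            raise ValueError(f"unexpected text at position {pos}")
        token = match.group(1)
        if token in ("{", "<"):
            out.append("(?:")        # both bracket kinds only group
        elif token in ("}", ">"):
            out.append(")")
        elif token in ("|", "?", "*", "+"):
            out.append(token)        # same meaning and precedence in Python
        else:
            out.append("(?:" + token + r"\s)")  # one category plus a delimiter
        pos = match.end()
    return re.compile("".join(out))

def accepts(expr, categories):
    """True if the whole sequence of categories is matched by the expression."""
    text = "".join(category + " " for category in categories)
    return to_python_regex(expr).fullmatch(text) is not None

print(accepts("<{P? {Dd|Di} | P} A? N>", ["Dd", "N"]))  # True
print(accepts("A N | P", ["A", "P"]))                   # False: read as <A N> | P
print(accepts("A {N | P}", ["A", "P"]))                 # True: braces override
```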

TAGGED CORPUS

The syntax of the tagged corpus is very similar to the syntax of single-word rules.
The first line of a tagged word is the string of the inflected word. It is followed by the Informational part of the tagged word.
The value of the Reference path is mandatory in case of a fully tagged corpus (see the Paths in the fastrlang section).


Each inflected word is optionally followed by a root enclosed in a pair of curly brackets {{ }}, a list of derivational relatives enclosed in a pair of square brackets [[ ]] and separated by + signs, and a list of semantic relatives enclosed in a pair of angle brackets << >> and separated by + signs.
Each derivational relative has only an Informational part.
The value of the Derivational reference path is mandatory in case of a fully tagged corpus (see the Paths in the fastrlang section).

For example:

 
    'category'
       <cat> = N
       <agr num> = sin
       <infl> = 2
       <syn list> = 1002
       - - <> = 'tag'
       - - <> = N
       - - <> = 1
       <lem> = 'categor'
       <ref> = 1005
       <self(1*) list> = 1005
       - - <> = 'categor'
       - - <> = N
       - - <> = 2
       <root> = (1*)
       <hyp> = (1*)
       <seeAlso> = (1*)
       <form> = 'N'.
         [[
           <cat> = A
           <infl> = 1
           <root list> = 1005
           - - <> = 'categor'
           - - <> = N
           - - <> = 2
           <his> = '>AL>'
           <lem> = 'categorial'
           <ref> = 1003
           <self(1*) list> = 1003
           - - <> = 'categorial'
           - - <> = A
           - - <> = 1
           <syn> = (1*)
           <hyp> = (1*)
           <seeAlso> = (1*).
         ]]
         <<
           <cat> = N
           <infl> = 1
           <syn list+> = 1005
           - - <> = 'categor'
           - - <> = N
           - - <> = 2
           -  list+> = 1001
           - - <> = 'label'
           - - <> = N
           - - <> = 1
           -  list+> = 1002
           - - <> = 'tag'
           - - <> = N
           - - <> = 1
           <lem> = 'tag'
           <ref> = 1002
           <self(1*) list> = 1002
           - - <> = 'tag'
           - - <> = N
           - - <> = 1
           <root> = (1*)
           <hyp> = (1*)
           <seeAlso> = (1*).
         >>
 
 
is the representation of the inflected English word "category". This word is followed by one morphological relative, the adjective "categorial", and one semantic relative "tag".