mmorph

Version: Version 2.3, October 1995 (debian - 07/07/09)

Section: 5 (File formats)

NAME

mmorph - MULTEXT morphology tool formalism syntax

DESCRIPTION

A mmorph morphology description file is divided into declaration sections. Each section starts with a section header (`@ Alphabets', `@ Attributes', etc.) followed by a sequence of declarations. Each declaration starts with a name, followed by a colon (`:') and the definition associated with the name. Here is a brief description of each section:

@ Alphabets

In this section the lexical and surface alphabets are declared. All symbols forming each alphabet have to be listed. A symbol may appear in both the lexical and the surface alphabet definition, in which case it is considered a bi-level symbol; otherwise it is a lexical only or surface only symbol. Symbols are usually letters (e.g. a, b, c), but may also consist of longer names (beta, schwa). Symbol names consisting of one special character (`:' or `(') may be specified by enclosing them in double quotes (`":"' or `"("').
Example:
Lexical : a b c d e f g h i j k l m n o p q r s t u v w x y z "-" "." "," "?" "!" "\"" "'" ":" ";" "(" ")" strong_e
Surface : a b c d e f g h i j k l m n o p q r s t u v w x y z "-" "." "," "?" "!" "\"" "'" ":" ";" "(" ")" " "

In this example, the symbol strong_e is lexical only, the symbol " " (space) is surface only. All the other symbols are bi-level.

All the strings appearing in the rest of the grammar will be made exclusively of symbols declared in this section.  
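
The classification of symbols can be pictured with ordinary set operations. The following Python sketch is illustrative only: it is not part of mmorph, and the variable names are invented for this example. It reuses the two alphabets of the example above.

```python
# Illustrative sketch only; mmorph does not expose such an interface.
letters = list("abcdefghijklmnopqrstuvwxyz")
punctuation = ["-", ".", ",", "?", "!", '"', "'", ":", ";", "(", ")"]

lexical = set(letters + punctuation + ["strong_e"])   # the Lexical declaration
surface = set(letters + punctuation + [" "])          # the Surface declaration

bi_level = lexical & surface       # declared in both alphabets
lexical_only = lexical - surface   # declared in Lexical only
surface_only = surface - lexical   # declared in Surface only

print(sorted(lexical_only))   # ['strong_e']
print(sorted(surface_only))   # [' ']
```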

@ Attributes

In this section, the names of attributes (sometimes called features) and their associated value sets are declared. At most 32 different values may be declared for an attribute.
Examples:
 
 Gender : feminine masculine neuter
 Number : singular plural
 Person : 1st 2nd 3rd
 Transitive : yes no
 Inflection : base intermediate final
 

In the current version of the implementation, value sets of different attributes are incompatible, even if they are defined identically. To overcome this restriction, a future version will split this section into two: declaration of value sets and declaration of attributes.

@ Types

In this section, the different types of feature structures are declared. The attributes allowed for each type are listed. Attributes that are only used within the scope of the tool and have no meaning outside can be listed after a bar (`|'). The values of these local attributes are not stored in the database or written on the final output of the program.
Examples:
 
 Noun : Gender Number
 Verb : Tense Person Gender Number Transitive | Inflection
 
 

Typed feature structures

Typed feature structures are used in the grammar and spelling rules. A typed feature structure is the specification of a type and of the values of some of its attributes. The list of attribute specifications is enclosed in square brackets (`[' and `]').
Example:
 
 Noun[ Gender=feminine Number=singular ]
 
 

It is possible to specify a set of values for an attribute by listing the possible values separated by a bar (`|'), or the complement of a set (with respect to all possible values of that attribute) by using `!=' instead of `='.
Example: Assuming the declaration of Gender as above, the following two typed feature structures are equivalent:

 
 Noun[ Gender=masculine|neuter ]
 Noun[ Gender!=feminine ]
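
The equivalence of the two notations can be checked with a small Python sketch. It is only a model of the semantics described above, not mmorph code; the function `value_set` and its parameters are hypothetical names chosen for this illustration.

```python
# Declared values of the attribute Gender (from the @ Attributes example).
GENDER = ("feminine", "masculine", "neuter")

def value_set(declared, values, negated=False):
    """Return the set of values denoted by `=v1|v2` or, if negated, `!=v1|v2`."""
    chosen = set(values)
    unknown = chosen - set(declared)
    if unknown:
        raise ValueError(f"undeclared values: {sorted(unknown)}")
    # `!=` denotes the complement with respect to all declared values.
    return set(declared) - chosen if negated else chosen

a = value_set(GENDER, ["masculine", "neuter"])      # Gender=masculine|neuter
b = value_set(GENDER, ["feminine"], negated=True)   # Gender!=feminine
print(a == b)   # True
```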
 
 

@ Grammar

This section contains the rules that specify the structure of words. It has the general shape of a context free grammar over typed feature structures. There are three basic types of rules: binary rules, goal rules and affix rules.

Binary rules specify the result of the concatenation of two elements. This is written as:

 
 Rule_name : Lhs <- Rhs1 Rhs2
 

where Lhs is called the left hand side, and Rhs1 and Rhs2 the first and second part of the right hand side. Lhs, Rhs1 and Rhs2 are specified as typed feature structures.
Example:

 
 
 Rule_1  : Noun[ Gender=feminine Number=singular ]
         <- Noun[ Gender=feminine Number=singular ]
            NounSuffix[ Gender=feminine ]
 
 

Variables can be used to indicate that some attributes have the same value. A variable is a name starting with a dollar (`$').
Example:

 
 Rule_2  : Noun[ Gender=$A Number=$number ]
         <- Noun[ Gender=$A Number=$number ]
            NounSuffix[ Gender=$A ]

 
 
If needed, both a variable and a value specification can be given for an attribute (only once per attribute).
Example:
 
 
 Rule_3  : Noun[ Gender=$A Number=$number ]
         <- Noun[ Gender=$A Number=$number ]
            NounSuffix[ Gender=$A=masculine|neuter ]
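
One way to read Rule_3 is sketched below in Python. The sketch assumes that a shared variable behaves like the intersection of the value sets it is attached to; this is a common reading of such formalisms and is not taken from the mmorph implementation. The function `apply_rule_3` and the dictionary representation are hypothetical.

```python
ALL_GENDERS = {"feminine", "masculine", "neuter"}

def apply_rule_3(stem, suffix):
    """Sketch of Rule_3: return the resulting Noun, or None if the rule fails.

    `stem` and `suffix` map attribute names to sets of values; a missing
    Gender is treated as unspecified (all values allowed).
    """
    # $A is shared by the stem, the suffix and the result, and the suffix
    # additionally restricts it to masculine|neuter.
    a = (stem.get("Gender", ALL_GENDERS)
         & suffix.get("Gender", ALL_GENDERS)
         & {"masculine", "neuter"})
    if not a:
        return None   # no common value: the rule is not applicable
    # $number is simply copied from the stem to the result.
    return {"Gender": a, "Number": stem["Number"]}

masculine_stem = {"Gender": {"masculine"}, "Number": {"singular"}}
feminine_stem = {"Gender": {"feminine"}, "Number": {"singular"}}
suffix = {"Gender": {"masculine", "neuter"}}

print(apply_rule_3(masculine_stem, suffix))
print(apply_rule_3(feminine_stem, suffix))   # None
```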
 

 
 
Affix rules define basic elements of the concatenations specified by binary rules (together with lexical entries, see the section @ Lexicon below). An affix rule consists of a lexical string associated with a typed feature structure.
Examples:

 
 Plural_s   : "s" NounSuffix[ Number=plural ]
 Feminine_e : "e" NounSuffix[ Gender=feminine ]
 ing        : "ing" VerbSuffix[ Tense=present_participle ]

 
 
Goal rules specify the valid results constructed by the grammar. They consist of just a typed feature structure.
Examples:

 
 Goal_1 : Noun[]
 Goal_2 : Verb[ inflection=final ]

 
 
In addition to these three basic rule types, there are prefix or suffix composite rules and unary rules. A unary rule consists of a left hand side and a right hand side made of a single element.
Example:
 
 Rule_4  : Noun[ gender=$G number=plural ]
         <- Noun[ gender=$G number=singular invariant=yes]
 

 
 
Prefix and suffix composite rules have the same shape as binary rules except that one part of the right hand side is an affix (i.e. has an associated string).
Examples:

 
 Append_e : Noun[ Gender=feminine Number=$number ]
          <- Noun[ Gender=feminine Number=$number ]
             "e" NounSuffix[ Gender=feminine ]
 anti     : Noun[ Gender=$gender Number=$number ]
          <- "anti" NounPrefix[]
             Noun[ Gender=$gender Number=$number ]

 
 

@ Classes


 
 This optional section contains the definition of symbol classes. Each class is
 defined as a set of symbols, or other classes. If the class contains
 only bi-level elements it is a bi-level class, otherwise it is a lexical
 or surface class.
 
Examples:
 
 Dental : d t
 Vowel : a e i o u
 Vowel_y : Vowel y
 Consonant: b c d f g h j k l m n p q r s t v w x z
 
 

 
 

@ Pairs


 
 This optional section contains the definition of pair disjunctions.  Each
 disjunction is defined as a set of pairs.  Explicit pairs specify a
 sequence of surface symbols and a sequence of zero or one lexical symbol,
 one of them possibly empty.  A sequence is enclosed between angle brackets
 `<' and `>'.  The empty sequence is indicated with `<>'.  In
 the current implementation only the surface part of a pair can be a
 sequence of more than one element.  The special symbol `?' stands for
 the class of all possible symbols, including the morpheme and word
 boundary.
 
Examples:
 
 s_x_z_1 : s/s x/x z/z
 VowelPair1: a/a e/e i/i o/o u/u 
 VowelPair2: Vowel/Vowel
 ie.y: <i e>/y
 Delete_e: <>/e
 Insert_d: d/<>
 Surface_Vowel: Vowel/?
 Lexical_s:  ?/s
 
DoubleConsonant: <b b>/b <d d>/d <f f>/f <g g>/g <k k>/k <m m>/m <p p>/p <s s>/s <t t>/t <v v>/v <z z>/z

 
 
Note that VowelPair1 and VowelPair2 don't specify the same thing: VowelPair2 would match a/o but VowelPair1 would not.
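
The difference between VowelPair1 and VowelPair2 becomes visible when both are expanded into explicit surface/lexical pairs. The Python sketch below is illustrative only; representing a pair as a `(surface, lexical)` tuple is an assumption of this example, not something mmorph prescribes.

```python
VOWEL = {"a", "e", "i", "o", "u"}

# VowelPair1: a/a e/e i/i o/o u/u  (five explicit surface/lexical pairs)
vowel_pair_1 = {(v, v) for v in VOWEL}

# VowelPair2: Vowel/Vowel  (any surface vowel paired with any lexical vowel)
vowel_pair_2 = {(s, l) for s in VOWEL for l in VOWEL}

print(("a", "o") in vowel_pair_1)   # False
print(("a", "o") in vowel_pair_2)   # True
```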
Implicit pairs are specified by the name of a bi-level symbol or a bi-level class.
Examples: the following s_x_z_2 and VowelPair3 are equivalent to the above s_x_z_1 and VowelPair2 (assuming that s, x, z and Vowel are bi-level symbols and classes).
 
 s_x_z_2 : s x z
 VowelPair3 : Vowel
 

 
 
In a pair disjunction all lexical parts should be disjoint. This means you cannot specify for the same pair disjunction a/a and o/a or a/a and Vowel/Vowel.
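
The disjointness requirement can be modelled as follows. This Python sketch is a simplification made for illustration: it only handles single symbols and class names on the lexical side (no `<...>' sequences and no `?'), and the helper names are hypothetical.

```python
from itertools import combinations

VOWEL = {"a", "e", "i", "o", "u"}
CLASSES = {"Vowel": VOWEL}   # class name -> symbols it contains

def lexical_symbols(pair):
    """Return the set of lexical symbols covered by a `surface/lexical` pair."""
    _surface, lexical = pair.split("/")
    # A class name stands for all its members, a plain symbol for itself.
    return CLASSES.get(lexical, {lexical})

def lexical_parts_disjoint(pairs):
    """True if no two pairs of the disjunction share a lexical symbol."""
    return all(lexical_symbols(p).isdisjoint(lexical_symbols(q))
               for p, q in combinations(pairs, 2))

print(lexical_parts_disjoint(["s/s", "x/x", "z/z"]))    # True
print(lexical_parts_disjoint(["a/a", "o/a"]))           # False
print(lexical_parts_disjoint(["a/a", "Vowel/Vowel"]))   # False
```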
In a future version this section will be split in two: simple pair disjunctions and pair sequences.

@ Spelling


 
 The two-level spelling rules are declared in this section.  A spelling rule
 consists of a kind indicator followed by a left context, a focus and a right
 context.  The kind indicator is `=>' if the rule is optional,
 `<=>' if it is obligatory and `<=' if it is a surface coercion
 rule.  The contexts may be empty.  The focus is surrounded by two
 `-'.  The contexts and the focus consist of a sequence of pairs or
 pair disjunctions declared in the `@ Pairs' section.  A morpheme
 boundary is indicated by a `+' or a `*', a word boundary is
 indicated by a `~'.
 
Examples:
 
 Sibilant_s: <=> s_x_z_1 * - e/<> - s
 Gemination: <=>
         Consonant Vowel - DoubleConsonant - * Vowel
 i_y_optionnel: => a - i/y - * ?/e
 

 
 
Constraints may be specified in the form of a list of typed feature structures. They are affix-driven: the rule is licensed if at least one of them subsumes the closest corresponding affix. The morpheme boundary indicated by a star (`*') will be used to determine which affix it is. If there is no such indication, then the affix adjacent to the morpheme where the first character of the focus occurs is used. In case there is no affix, the typed feature structure of the lexical stem is used.
Example:
 
 Sibilant_s: <=>
     s_x_z_1 * - e/<> - s NounSuffix[ Number=plural ]
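
The licensing test can be sketched in Python under one assumption: "subsumes" is taken in its usual sense, meaning that the constraint has the same type as the affix and that every attribute it mentions allows at least the values the affix specifies. The representation and the function names below are hypothetical and are not taken from the mmorph sources.

```python
def subsumes(constraint, affix):
    """True if `constraint` is at least as general as `affix`.

    Both are (type_name, {attribute: set_of_values}) tuples.  An attribute
    that the constraint does not mention is unconstrained.
    """
    c_type, c_atts = constraint
    a_type, a_atts = affix
    if c_type != a_type:
        return False
    for name, allowed in c_atts.items():
        # An attribute missing from the affix is unspecified, so the
        # constraint cannot be guaranteed to hold for it.
        if name not in a_atts or not a_atts[name] <= allowed:
            return False
    return True

def licensed(constraints, affix):
    """A rule without constraints always applies; otherwise one must subsume."""
    return not constraints or any(subsumes(c, affix) for c in constraints)

constraint = ("NounSuffix", {"Number": {"plural"}})
plural_s = ("NounSuffix", {"Number": {"plural"}})
feminine_e = ("NounSuffix", {"Gender": {"feminine"}})

print(licensed([constraint], plural_s))     # True
print(licensed([constraint], feminine_e))   # False
```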
 

 
 

@ Lexicon


 
 This section is optional and can also be repeated.  This section lists all
 the lexical entries of the morphological description.  Unlike the other
 sections, definitions do not have a name.  A definition consists of a typed
 feature structure followed by a list of lexical stems that share that
 feature structure.  A lexical stem consists of the string used in the
 concatenation specified by the grammar rules followed by `=' and a
 reference string.  The reference string can be anything and usually is used
 to indicate the canonical form of the word or an identifier of an external
 database entry.
 
Examples:

 
 Noun[ Number=singular ] "table" = "table" "chair" = "chair"
 Verb[ Transitive=yes|no Inflection=base ] "bow" = "bow1"
 Noun[ Number=singular ] "bow" = "bow2"
 
 
 
If the stem string and the reference string are identical, only one needs to be specified.
Example:

 
Noun[ Number=singular ] "table" "chair"
 
 
 

FORMAL SYNTAX


 
 The formal syntax description below is in Backus Naur Form (BNF).
 The following conventions apply:
 

 
 
 <id>      is a non-terminal symbol (within angle brackets).
 ID        is a token (terminal symbol, all uppercase).
 <id>?     means zero or one occurrence of <id> (i.e. <id> is optional).
 <id>*     is zero or more occurrences of <id>.
 <id>+     is one or more occurrences of <id>.
 ::=       separates a non-terminal symbol and its expansion.
 |         indicates an alternative expansion.
 ;         starts a comment (not part of the definition).
 
 

 
 The start symbol corresponding to a complete description is named
 <Start>.
 Symbols that parse but do nothing are marked with
 `; not operational'.
 

 
 
 <Start>           ::= <AlphabetDecl> <AttDecl> <TypeDecl> <GramDecl>
                       <ClassDecl>? <PairDecl>? <SpellDecl>? <LexDecl>*
 
 <AlphabetDecl>    ::= ALPHABETS <LexicalDef> <SurfaceDef>
 
 <LexicalDef>      ::= <LexicalName> COLON <LexicalSymbol>+
     
 <SurfaceDef>      ::= <SurfaceName> COLON <SurfaceSymbol>+
 
 <LexicalSymbol>   ::= <LexicalSymbolName>    ; lexical only
                   |   <BiLevelSymbolName>    ; both lexical and surface 
 
 <SurfaceSymbol>   ::= <SurfaceSymbolName>    ; surface only
                   |   <BiLevelSymbolName>    ; both lexical and surface
 
 <AttDecl>         ::= ATTRIBUTES <AttDef>+
 
 <AttDef>          ::= <AttName> COLON <ValName>+
 
 <TypeDecl>        ::= TYPES <TypeDef>+
 
 <TypeDef>         ::= <TypeName> COLON <AttName>+ <NoProjAtt>?
 
 <NoProjAtt>       ::= BAR <AttName>+
 
 <LexDecl>         ::= LEXICON <LexDef>+
 
 <LexDef>          ::= <Tfs> <Lexical>+
 
 <Lexical>         ::= LEXICALSTRING <BaseForm>?
 
 <BaseForm>        ::= EQUAL LEXICALSTRING 
 
 <Tfs>             ::= <TypeName> <AttSpec>? 
 
 <VarTfs>          ::= <TypeName> <VarAttSpec>? 
 
 <AttSpec>         ::= LBRA <AttVal>* RBRA 
 
 <VarAttSpec>      ::= LBRA <VarAttVal>* RBRA 
 
 <AttVal>          ::= <AttName> <ValSpec> 
 
 <VarAttVal>       ::= <AttName> <VarValSpec> 
 
 <ValSpec>         ::= EQUAL <ValSet>
                   |   NOTEQUAL <ValSet>
 
 <VarValSpec>      ::= <ValSpec>
                   |   EQUAL DOLLAR <VarName>
                   |   EQUAL DOLLAR <VarName> <ValSpec>
 
 <ValSet>          ::= <ValName> <ValSetRest>* 
 
 <ValSetRest>      ::= BAR <ValName> 
 
 <GramDecl>        ::= GRAMMAR <RuleDef>+
 
 <RuleDef>         ::= <RuleName> COLON <RuleBody>
 
 <RuleBody>        ::= <VarTfs> LARROW <Rhs>
                   |   <Tfs>    ; goal rule
                   |   LEXICALSTRING <Tfs>    ; lexical affix
 
 <Rhs>             ::= <VarTfs>    ; unary rule
                   |   <VarTfs> <VarTfs>    ; binary rule
                   |   LEXICALSTRING <Tfs> <VarTfs>   ; prefix rule
                   |   <VarTfs> LEXICALSTRING <Tfs>    ; suffix rule
 
 <ClassDecl>       ::= CLASSES <ClassDef>+
 
 <ClassDef>        ::= <LexicalClassName> COLON <LexicalClass>+
                   |   <SurfaceClassName> COLON <SurfaceClass>+
                   |   <BiLevelClassName> COLON <BiLevelClass>+
 
 <LexicalClass>    ::= <LexicalSymbol>
                   |   <LexicalClassName>
                   |   <BiLevelClassName>
 
 <SurfaceClass>    ::= <SurfaceSymbol>
                   |   <SurfaceClassName>
                   |   <BiLevelClassName>
 
 <BiLevelClass>    ::= <BiLevelSymbolName>
                   |   <BiLevelClassName>
 
 <PairDecl>        ::= PAIRS <PairDef>+
 
 <PairDef>         ::= <PairName> COLON <Pair>+
 
 <Pair>            ::= <SurfaceSequence> SLASH <LexicalSequence>
                   |   <PairName>
                   |   <BiLevelClassName>
                   |   <BiLevelSymbolName>
 
 <SurfaceSequence> ::= LANGLE <SurfaceSymbol>* RANGLE
                   |   SURFACESTRING
                   |   <SurfaceClass>
                   |   ANY
 
 <LexicalSequence> ::= LANGLE <LexicalSymbol>* RANGLE
                   |   LEXICALSTRING
                   |   <LexicalClass>
                   |   ANY
 
 <SpellDecl>       ::= SPELLING <SpellDef>+
 
 <SpellDef>        ::= <SpellName> COLON <Arrow> <LeftContext> <Focus>
                           <RightContext> <Constraint>*
 
 <LeftContext>     ::= <Pattern>*
 
 <RightContext>    ::= <Pattern>*
 
 <Focus>           ::= CONTEXTBOUNDARY <Pattern>+ CONTEXTBOUNDARY
 
 <Pattern>         ::= <Pair>
                   |   MORPHEMEBOUNDARY
                   |   WORDBOUNDARY
                   |   CONCATBOUNDARY
 
 <Constraint>      ::= <Tfs>
 
 <Arrow>           ::= RARROW 
                   |   BIARROW 
                   |   COERCEARROW
 
 
 <AttName>           ::= NAME
 <BiLevelClassName>  ::= NAME
 <BiLevelSymbolName> ::= NAME  | SYMBOLSTRING
 <LexicalClassName>  ::= NAME
 <LexicalName>       ::= NAME
 <LexicalSymbolName> ::= NAME  | SYMBOLSTRING
 <PairName>          ::= NAME
 <RuleName>          ::= NAME
 <SpellName>         ::= NAME
 <SurfaceClassName>  ::= NAME
 <SurfaceName>       ::= NAME
 <SurfaceSymbolName> ::= NAME  | SYMBOLSTRING
 <TypeName>          ::= NAME
 <ValName>           ::= NAME
 <VarName>           ::= NAME
 

 
 

Simple tokens


 
 Simple tokens of the BNF above are defined as follows:
 the token name on the left corresponds to the literal character
 or characters on the right.
 

 
 ANY                 ?
 BAR                 |
 BIARROW             <=>
 COERCEARROW         <=
 COLON               :
 CONCATBOUNDARY      *
 CONTEXTBOUNDARY     -
 DOLLAR              $
 EQUAL               =
 LANGLE              <
 LARROW              <-
 LBRA                [
 MORPHEMEBOUNDARY    +
 NOTEQUAL            !=
 RARROW              =>
 RANGLE              >
 RBRA                ]
 SLASH               /
 WORDBOUNDARY        ~
 
 ALPHABETS           @Alphabets
 ATTRIBUTES          @Attributes
 CLASSES             @Classes
 GRAMMAR             @Grammar
 LEXICON             @Lexicon
 PAIRS               @Pairs
 SPELLING            @Spelling
 TYPES               @Types
 

 
 
In the section header tokens above, spaces may separate the `@' from the reserved word.

Complex tokens


 
 
NAME

is any sequence of letters, digits, underscores (`_') and periods (`.').
Examples:
 
category 33 Rule_9 __2__ Proper.Noun
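
As a rough check of this definition, the Python sketch below tests candidate names with a regular expression. It assumes that "letter" and "digit" mean ASCII letters and digits, which the text above does not state; the helper `is_name` is a hypothetical name.

```python
import re

# Assumption: letters and digits are restricted to ASCII in this sketch.
NAME = re.compile(r"[A-Za-z0-9_.]+")

def is_name(text):
    """True if `text` consists only of letters, digits, `_` and `.`."""
    return NAME.fullmatch(text) is not None

for candidate in "category 33 Rule_9 __2__ Proper.Noun".split():
    print(candidate, is_name(candidate))   # each of the examples is accepted

print(is_name("$A"))   # False: `$` is a separate token (DOLLAR)
```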
 
 
LEXICALSTRING

is a string of lexical symbols.

SURFACESTRING

is a string of surface symbols.

SYMBOLSTRING

is a string of just one character (used only in alphabet declaration).
 
 
A string consists of zero or more characters within double quotes (`"'). Characters preceded by a backslash (`\') are escaped (the usual C escaping conventions apply). Symbols that have a name longer than one character are represented using an SGML-entity-like notation: `&symbolname;'. The maximum number of symbols in a string is 127.
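
The escaping and `&symbolname;' conventions just described can be sketched in Python as follows. The sketch is not the mmorph lexer: it handles only a few C escapes (`\n', `\t', `\\', `\"' and octal codes), it assumes that symbol names inside `&...;' use NAME characters, and it treats a lone `&' as an ordinary character. The function `symbols` is a hypothetical helper that receives the text between the double quotes.

```python
import re

MAX_SYMBOLS = 127
SIMPLE_ESCAPES = {"n": "\n", "t": "\t", "\\": "\\", '"': '"'}
TOKEN = re.compile(
    r"\\([0-7]{1,3})"        # octal escape such as \011
    r"|\\(.)"                # single-character escape such as \t or \"
    r"|&([A-Za-z0-9_.]+);"   # long symbol name such as &strong_e;
    r"|(.)",                 # any other character is a symbol by itself
    re.DOTALL,
)

def symbols(body):
    """Split the text between the double quotes into a list of symbols."""
    result = []
    for octal, escaped, entity, plain in TOKEN.findall(body):
        if octal:
            result.append(chr(int(octal, 8)))
        elif escaped:
            # Unknown escapes fall back to the escaped character itself.
            result.append(SIMPLE_ESCAPES.get(escaped, escaped))
        elif entity:
            result.append(entity)
        else:
            result.append(plain)
    if len(result) > MAX_SYMBOLS:
        raise ValueError("more than 127 symbols in a string")
    return result

print(symbols("table"))          # ['t', 'a', 'b', 'l', 'e']
print(symbols("&strong_e;t"))    # ['strong_e', 't']
```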
Examples:

 
 "table"
 ","
 ""
 "double quote is \" and backslash is \\"
 "&strong_e;"
 "escape like in C : \t is ASCII tab"
 "escape with octal code: \011 is ASCII tab"
 
 
 
Tokens can be separated by one or many blanks or comments.
A blank separator is space, tab or newline.
A comment starts with a semicolon and finishes at the next newline (except when the semicolon occurs in a string).
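
The comment rule can be illustrated with a short Python sketch. It works on one line at a time and assumes that a string never spans several lines, which the text above does not guarantee; `strip_comment` is a hypothetical helper, not an mmorph function.

```python
def strip_comment(line):
    """Remove a `;` comment from one line, ignoring semicolons inside strings."""
    in_string = False
    i = 0
    while i < len(line):
        ch = line[i]
        if in_string and ch == "\\":
            i += 2   # skip the backslash and the character it escapes
            continue
        if ch == '"':
            in_string = not in_string
        elif ch == ";" and not in_string:
            return line[:i]   # the comment runs to the end of the line
        i += 1
    return line

print(strip_comment('Lexical : a b ";" c ; the lexical alphabet'))
```

In this example the quoted `";"' symbol is kept and only the trailing comment is dropped.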
Inclusion of files can be specified with the usual `#include' directive:
Example:
#include "verb.entries"
 
 
 
will splice in the content of the file verb.entries at the point where this directive occurs.
The `#' should be the first character on the line. Tabs or spaces may separate `#' and `include'. The file name must be quoted. Only tabs or spaces may occur on the rest of the line. Inclusion can be nested up to 10 levels.
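
The inclusion rules can be summarised by the following Python sketch. It is an illustration, not the mmorph preprocessor: files are simulated by a dictionary, white space between `include' and the file name is assumed to be optional, and the exact point at which the 10-level limit triggers is a guess. The names `expand` and `files` are hypothetical.

```python
import re

# `#` in column one, optional blanks, `include`, a quoted file name,
# then nothing but blanks up to the end of the line.
INCLUDE = re.compile(r'^#[ \t]*include[ \t]*"([^"]*)"[ \t]*$')
MAX_DEPTH = 10

def expand(name, files, depth=0):
    """Return the lines of `name` with every #include directive spliced in.

    `files` maps file names to their text; it stands in for the file system.
    """
    lines = []
    for line in files[name].splitlines():
        match = INCLUDE.match(line)
        if match is None:
            lines.append(line)   # ordinary line: keep it unchanged
            continue
        if depth >= MAX_DEPTH:
            raise RecursionError("#include nested more than 10 levels deep")
        lines.extend(expand(match.group(1), files, depth + 1))
    return lines

files = {
    "main": '@ Lexicon\n#include "verb.entries"\n; end of file',
    "verb.entries": 'Verb[ Inflection=base ] "bow" = "bow1"',
}
print(expand("main", files))
```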

SEE ALSO


 
 mmorph(1).
 
 
G. Russell and D. Petitpierre, MMORPH - The Multext Morphology Program, Version 2.3, October 1995, MULTEXT deliverable report for task 2.3.1.
 
 

AUTHOR


 
 Dominique Petitpierre, ISSCO, <petitp@divsun.unige.ch>
 

COMMENTS


 
 The parser for the morphology description formalism above was written
 using yacc(1) and flex(1).  Flex was written by Vern Paxson,
 <vern@ee.lbl.gov>, and is distributed in the framework of the GNU project
 under the conditions of the GNU General Public License.