sgrpg

Language: en

Version: 113499 (mandriva - 01/05/08)

Section: 1 (User commands)

NAME

sgrpg - SGML selection and transformation tool (Sgml RePort Generator).

SYNOPSIS

Usage: sgrpg [-d ddb-file] [-r] ([-v] query sub-query regexp out-fmt oargs... |
                           -f pat-file) 
   -d ddb-file:  Take Dtd from specified ddb file 
   -v: invert sense of sub-query+regexp 
   -r: attribute values in queries are regular expressions 
   -f: pattern-action file, nSGML format, sgrpg.dtd as DTD 
   query:  pattern on items to select, basically path based with terms 
     separated by /, <term>:=<GI><cond>?'*'? 
                     <GI>:=<elementName>|'.'
                     <cond>:='['<index>|<atests>|<index> <atests>']'
                     <index>:=<number>
                     <atests>:=<atest>(' '<atest>)*
                     <atest>:=<aname>('='<aval>)?

     Aname and aval are as per SGML, except that if the -r flag is given,  
     aval are regular expressions. 
     A GI of . matches any tag.  A condition with an index matches only the 
     indexth sub-element of the enclosing element.  Attribute tests are not 
     exhaustive, and will match against both explicitly present and defaulted 
     attribute values, using string equality.  Bare anames are satisfied by 
     ANY value, explicit or defaulted.  Terms ending with * match 
     any number of links in the chain, including 0. 
 
   sub-query:  selects sub-elements of query-selected item for
               regexp to match -- use '.' for whole item


   regexp:  Regular expression to match against text directly contained in
            the query-selected item itself or in a sub-query-selected
            sub-element of it.
            If empty (i.e. ''), it matches anything, including empty elements;
            this is the ONLY way to select empty elements.
 
   out-fmt:  a FORMAT string 
 
   oargs:  arguments to format, either <GI>, <DATA>, or attribute name 

DESCRIPTION

The material below may be out of date: consult LT XML documentation please.

sgrpg is an nSGML-aware query and transformation program. It selects a set of SGML elements from a document and optionally transforms them into a new format. It supports nested queries and lists of alternative queries, and hence allows more complex queries than sggrep (q.v.). In addition, it lets one specify what to output for each SGML element that matches one of the queries. This makes sgrpg the tool of choice for converting SGML into other file formats (e.g. LaTeX or another text formatting language). It is a filter, i.e. it reads from stdin and writes to stdout.
There are two ways of calling sgrpg: in the first, one specifies the query and the output format on the command line; in the second (using the -f option), more complex sequences of queries and formats are specified in a control file.

DESCRIPTION: Parameters


<query>
is an NSL query which selects the set of matching elements from the input stream.
<sub-query>
is an optional NSL query which, if present, selects sub-elements of the query-selected item for the regexp to match.
<regexp>
A regular expression to match against text directly contained in the query-selected item (if there is no sub-query) or in any sub-query-selected sub-element of the query-selected item. If empty (i.e. ''), it matches anything, including empty elements; this is the ONLY way to select empty elements.
<out-fmt>
A Format string.
<oargs>
A list of arguments to format, either <GI>, <DATA>, or attribute name.
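
The interaction of <regexp>, the empty pattern and the -v flag can be pictured with a small Python sketch. This is not sgrpg's implementation: the function name selects() and its arguments are hypothetical, substring-style matching is assumed from the "theatre" example, and the behaviour of -v combined with an empty regexp is a guess.

```python
import re


def selects(texts, regexp, invert=False):
    """Decide whether an item is selected.

    texts   -- text chunks directly contained in the item (or in its
               sub-query-selected sub-elements); empty for an empty element
    regexp  -- the <regexp> argument; '' is assumed to match anything
    invert  -- models the -v flag (assumption: it simply negates the result)
    """
    if regexp == "":
        # The empty pattern matches even when there is no text at all,
        # which is why it is the only way to reach empty elements.
        matched = True
    else:
        matched = any(re.search(regexp, text) for text in texts)
    return matched != invert


print(selects(["the theatre"], "theatre"))  # item with matching text
print(selects([], "theatre"))               # empty element, non-empty pattern
print(selects([], ""))                      # empty element, empty pattern
```

An empty element has no text for a non-empty pattern to match, so only the empty pattern selects it.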

DESCRIPTION: Input/Output

Description of the input/output files involved in this program.
Input ==> NSL input file : [stdin]
Output ==> Result of printing format statements for all matching SGML elements : [stdout]

OPTIONS

-d <ddb-file>
is the name of a file containing a representation of a DTD. Can be used if the DTD is not specified in the input document itself.
-v 
Complement operation. If this option is specified then only elements which do not match the regexp are output. Default is normal matching.
-r 
Interpret values of attributes in queries as regular expressions. Default is to treat attribute values as plain strings.
-h 
Print out usage information about sgrpg.
-f <control-file>
The name of a sgrpg control file, in nSGML format, with sgrpg.dtd as DTD. See below for a description of the format of this file.

EXAMPLES

sgrpg ".*/W" ".*" ".*" "%s/%s" "<DATA>" TYPE < temp.sgm
       prints out a list of all the <W> elements anywhere
        in the input document, in the form of <word>/<type>
        one per line.

sgrpg ".*/P/S/W" ".*" "theatre" "%s" "<DATA>" < temp.sgm
       prints out a list of all the <W> elements (inside <P>
        and <S>) which contain the string "theatre".
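
Why ".*/W" finds <W> at any depth while ".*/P/S/W" requires the <P>/<S>/<W> chain can be illustrated with a minimal Python sketch of the path rule (a term ending in * matches any number of links, including 0; a GI of . matches any tag). The function names are hypothetical, conditions in [...] are ignored, and element names are compared exactly; sgrpg itself may differ.

```python
def parse_simple_query(query):
    """Split a query into (gi, starred) pairs; [...] conditions are not handled."""
    terms = []
    for raw in query.split("/"):
        starred = raw.endswith("*")
        gi = raw[:-1] if starred else raw
        terms.append((gi, starred))
    return terms


def path_matches(terms, path):
    """Return True if the whole root-to-element path satisfies the terms."""
    if not terms:
        return not path
    (gi, starred), rest = terms[0], terms[1:]
    if starred:
        # Zero links: drop the starred term and try the remaining terms.
        if path_matches(rest, path):
            return True
        # One or more links: consume one matching element, keep the term.
        return bool(path) and gi in (".", path[0]) and path_matches(terms, path[1:])
    # Plain term: must match exactly one element of the path.
    return bool(path) and gi in (".", path[0]) and path_matches(rest, path[1:])


anywhere = parse_simple_query(".*/W")
nested = parse_simple_query(".*/P/S/W")
print(path_matches(anywhere, ["TEXT", "P", "S", "W"]))  # W deep in the tree
print(path_matches(nested, ["TEXT", "P", "W"]))         # no S between P and W
```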

NSL QUERIES


 NSL queries are patterns on items (SGML elements) to select. A query describes a path from the root of the document to the desired element. It is essentially path-based, with terms separated by /, as follows:

        <query> := <term> ('/' <term>)*

        <term>  := <GI> <cond>? '*'?

        <GI>    := <elementName> | '.' | '#'

        <cond>  := '[' <index>|<atests>|<index> <atests>']'

        <index> := <number>

        <atests>:= <atest> (' ' <atest>)*

        <atest> := <aname> ('=' <aval>)?
Aname and aval are as per SGML, except that if the -r flag is given, aval is a regular expression. A GI of . matches any tag. A condition with an index matches only the index'th sub-element of the enclosing element. Attribute tests are not exhaustive, and will match against both explicitly present and defaulted attribute values, using string equality. Bare anames are satisfied by ANY value, explicit or defaulted. Terms ending with * match any number of links in the chain, including 0.
A query which ends in an item '#' matches textual content.
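
As an illustration of the grammar and the attribute-test rules above, here is a minimal Python sketch that parses a query into terms and evaluates attribute tests. It is not the LT NSL parser: the names parse_term(), parse_query() and attributes_match() are hypothetical, attribute values are assumed to be unquoted and free of spaces and slashes, and with regex_values (modelling -r) the pattern is assumed to match anywhere in the value.

```python
import re

# <term> := <GI> <cond>? '*'?
TERM_RE = re.compile(r"^(?P<gi>[^\[\]*/]+)(?:\[(?P<cond>[^\]]*)\])?(?P<star>\*)?$")


def parse_term(text):
    """Parse one term into its GI, optional index, attribute tests and star."""
    m = TERM_RE.match(text)
    if m is None:
        raise ValueError(f"not a valid term: {text!r}")
    index, atests = None, []
    parts = (m.group("cond") or "").split()
    if parts and parts[0].isdigit():
        index = int(parts.pop(0))          # <index> := <number>
    for part in parts:                     # <atest> := <aname> ('=' <aval>)?
        aname, sep, aval = part.partition("=")
        atests.append((aname, aval if sep else None))
    return {"gi": m.group("gi"), "index": index,
            "atests": atests, "star": m.group("star") is not None}


def parse_query(query):
    """<query> := <term> ('/' <term>)*"""
    return [parse_term(term) for term in query.split("/")]


def attributes_match(atests, attrs, regex_values=False):
    """attrs holds explicit and defaulted attribute values, already merged."""
    for aname, aval in atests:
        if aname not in attrs:
            return False
        if aval is None:
            continue                       # bare aname: any value satisfies it
        if regex_values:
            if re.search(aval, attrs[aname]) is None:
                return False
        elif attrs[aname] != aval:         # default: plain string equality
            return False
    return True


print(parse_query(".*/W[2 TYPE=NN]"))
```

Because the tests are not exhaustive, attributes_match() ignores any attribute of the element that the query does not mention.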

FORMAT OF SGRPG CONTROL FILES


A sgrpg control file is an nSGML file based on the sgrpg.dtd DTD (in the lib subdirectory of the NSL release). It does not need to have a <?NSL DDB> statement since sgrpg knows what DTD you should be using.

The file consists of a sequence of <Q> elements, which is an implied

A <Q> element consists of subqueries or output format elements.

Subqueries consist of <S>, <G> or <OR> elements.

An <S> element represents a sub-query.
The LINK attribute of an <S> can be one of DEPSER, DEPSEQ, DEPPAR, INDEPENDENT (default).
INDEPENDENT means start searching at the same point in the containing element, regardless of the success or failure of the others.
DEPSER means start where the previous one finished, provided it succeeded.
DEPSEQ means the match must be the next sub-element immediately after the previous match.
DEPPAR means start at the same point in the containing element, provided the others so far have succeeded, i.e. AND.
A <G> element groups together a group of queries and/or format statements which are to be repeated... EXP, ID and REF attributes...
<OR> elements describe a disjunction of sub-queries, of which the first match wins.
Format elements consist of <F> statements, which describe output strings which are printed when we find an element which matches the query. <F> elements can contain <A> elements, which describe where to find the data required by the format string.
So
       <F S="{%s/%s}"><A TYPE=DATA><A A=TYPE>

defines a format string, the %s fields of which are filled from the data content of the matching element and the value of the TYPE attribute respectively.
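
The way the %s fields are filled can be mimicked in Python with printf-style formatting. This is only a sketch under stated assumptions: the element is modelled as a plain dict, resolve_arg() and render() are hypothetical names, the argument notation follows the command-line oargs (<GI>, <DATA>, or an attribute name), and a missing attribute is assumed to yield an empty string.

```python
def resolve_arg(arg, element):
    """Map one format argument to a value taken from the matching element."""
    if arg == "<GI>":
        return element["gi"]                # the element name
    if arg == "<DATA>":
        return element["data"]              # the element's text content
    return element["attrs"].get(arg, "")    # otherwise: an attribute name


def render(out_fmt, oargs, element):
    """Fill the %s fields of out_fmt, in order, from the resolved arguments."""
    return out_fmt % tuple(resolve_arg(arg, element) for arg in oargs)


# Hypothetical element standing in for <W TYPE=NN>theatre</W>.
word = {"gi": "W", "data": "theatre", "attrs": {"TYPE": "NN"}}
print(render("{%s/%s}", ["<DATA>", "TYPE"], word))  # prints {theatre/NN}
```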
<F> elements can alternatively be of the form <F TYPE=ELT [DN=number]>, which means: print the matching element (or the DNth daughter, if DN is specified) as normalised SGML.
<A> elements can take the following forms:

<A TYPE=GI> - the name of the SGML element.
<A TYPE=DATA [DN=number]> - The DNth bit of text content of the element (default value of DN is 0).
<A A=Attribute_name> - The value of the attribute called attribute_name.
<A TYPE=PATN [RN=number]> - the RNth match from a previous regular expression
Any of the above can have a VTYPE attribute, with a value of one of STRING, INTEGER, or FLOAT. If specified then the value of the <A> is converted to that type if possible.
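
The VTYPE conversion can be pictured with the following Python sketch. The function convert_value() is hypothetical, and the fallback to the unconverted string when conversion fails is an assumption; the text above only says the conversion happens "if possible".

```python
def convert_value(text, vtype="STRING"):
    """Convert the value of an <A> to the type named by VTYPE."""
    converters = {"STRING": str, "INTEGER": int, "FLOAT": float}
    try:
        return converters[vtype](text)
    except ValueError:
        # Assumption: when conversion is not possible, keep the raw string.
        return text


print(convert_value("42", "INTEGER"))
print(convert_value("3.5", "FLOAT"))
print(convert_value("abc", "INTEGER"))
```

A converted value would allow numeric conversions such as %d or %f in the format string, whereas an unconverted one only suits %s.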

EXAMPLE

The rule file
       <Q Q=".*/DIV1">
       <S Q=".*/TITLE"><F S="DIV1: %s
       "><A TYPE=DATA></F></S>
       <S Q=".*/DIV2">
       <S Q=".*/TITLE"><F S="DIV2: %s
       "><A TYPE=DATA></F></S>
       <S Q=".*/DIV3">
       <S Q=".*/TITLE"><F S="DIV3: %s
       "><A TYPE=DATA></F></S>
       <S Q=".*/DIV4">
       <S Q=".*/TITLE"><F S="DIV4: %s
       "><A TYPE=DATA></F></S>
       </S></S></S></Q>

prints out the titles of <DIV1> ... <DIV4> elements.

SEE ALSO

nsl-query(5), sggrep(1)

AUTHOR

Henry Thompson (ht@cogsci.ed.ac.uk)
David McKelvie (dmck@cogsci.ed.ac.uk)

Language Technology Group, Human Communication Research Centre, Edinburgh University,
2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND
Tel:(44) 131 650-4630
Fax:(44) 131 650-4587 email: dmck@cogsci.ed.ac.uk

Comments, suggestions, and bug reports are always welcome.