Regexp::Parser::Handlers.3pm

Langue: en

Version: 2004-07-06 (mandriva - 01/05/08)

Section: 3 (Bibliothèques de fonctions)

NAME

Regexp::Parser::Handlers - handlers for Perl 5 regexes

DESCRIPTION

This module holds the init() method for the Regexp::Parser class, which installs all the handlers for standard Perl 5 regexes. This documentation contains a sub-classing tutorial.

SUB-CLASSING

I will present two example sub-classes, Regexp::NoCode, and Regexp::AndBranch.

Parser Internals

The parser object is a hash reference with the following keys:

regex
A reference to the original string representation of the regex.
len
The length of the original string representation of the regex.
tree
During the first pass, "tree" is undef, which instructs the object() method not to actually create any objects. Afterwards, it is an array reference of (node) objects.
stack
Initially an array reference, used to store the tree as a new scope is entered and then exited. The general concept is:
   if (into_scope) {
     push STACK, TREE;
     TREE = CURRENT->DATA;
   }
   if (outof_scope) {
     TREE = pop STACK;
   }
 
 

After the tree has been created, this key is deleted; this gives the code a way to be sure compilation was successful.

maxpar
The highest number of parentheses. It will end up being identical to "nparen", but it is incremented during the initial pass, so that on the second pass (the tree-building), it can distinguish back-references from octal escapes. (The source code to Perl's regex compiler does the same thing.)
nparen
The number of OPENs (capturing groups) in the regex.
captures
An array reference to the 'open' nodes.
flags
An array reference of flag values. When a scope is entered, the top value is copied and pushed onto the stack. When a scope is left, the top value is popped and discarded.

It is important to do this copy-and-push before you do any flag-parsing, if you're adding a handle that might parse flags, because you do not want to accidentally affect the previous scope's flag values.

Here is example code from the handler for "(?ismx)" and "(?ismx:...)":

   # (?i:...) <-- the 'i' is only inside the (?:...)
   # (?i)     <-- the 'i' affects the rest of this scope
 
   # so if we're a (?:...), copy-and-push
   if ($type eq 'group') {
     push @{ $S->{flags} }, &Rf;
     push @{ $S->{next} }, qw< c) atom >;
   }
 
   for (split //, $on) {
     if (my $h = $S->can("FLAG_$_")) {
       my $v = $h->(1);       # 1 means this is 'on'
       if ($v) { &Rf |= $v }  # turn the flag on
       else { ... }           # the flag's value is 0
       next;
     }
     # throw an error if the flag isn't supported
   }
 
   for (map "FLAG_$_", split //, $off) {
     if (my $h = $S->can("FLAG_$_")) {
       my $v = $h->(0);        # 0 means this is 'off'
       if ($v) { &Rf &= ~$v }  # turn the flag off
       else { ... }            # the flag's value is 0
       next;
     }
     # throw an error if the flag isn't supported
   }
 
 

You'll probably not be adding handlers that have to parse flags, but if you do, remember to follow this model correctly.

next
An array reference of what handles (or ``rules'') to try to match next.

Devices and Standards

I made a few C-macro-style functions for easy access to the parser object's most important attributes:

   # access to the regex
     # reference to the regex string
     sub Rx :lvalue          { $_[0]{regex} }
 
     # the position in the regex string
     sub RxPOS :lvalue       { pos ${&Rx} }
 
     # the regex string from the current position on
     sub RxCUR               { substr ${&Rx}, &RxPOS }
 
     # the length of the regex string
     sub RxLEN               { $_[0]{len} }
 
   # access to the flag stack
     # the top flag value
     sub Rf :lvalue          { $_[0]{flags}[-1] }
 
   # access to the tree
     # is this the first pass?
     sub SIZE_ONLY           { ! $_[0]{tree} }
 
     # the most recent addition to the tree
     sub LATEST :lvalue      { $_[0]{tree}[-1] }
 
 

You may find it helpful to copy these to your sub-class. If you're curious why the "regex" value is a reference, and thus why I'm using "${&Rx}" everywhere, it's because an lvalued subroutine returning a normal scalar doesn't work quite right with a regex that's supposed to update its target's "pos()". This method, where it returns a reference to a scalar, makes it work (!).

These functions can only work if called with ampersands, and only if the parser object is the first value in @_. I made sure of this in my code; you should make sure in yours.

Matching against the regex is done in scalar context, globally, like so:

   if (${&Rx} =~ m{ \G pattern }xgc) {
     # it matched
   }
 
 

If the match fails, the "pos()" value won't be reset (due to the "/c" modifier). Remember to use scalar context. If you need to access capture groups, use the digit variables, but only if you're sure the match succeeded.

Parser Methods

my $obj = $parser->object(TYPE => ARGS...)
This creates a node of package "TYPE" and sends the constructor whatever other arguments are included. This method takes care of building the proper inheritance for the node; it uses %Regexp::Parser::loaded to keep track of which object classes have been loaded already.
$parser->init()
This method installs all the flags and handlers. Regexp::Parser does this automatically, but if you are sub-classing it, you'll probably want to call it in your own module.
   package Regexp::AndBranch;
   use base 'Regexp::Parser';
 
   sub init {
     my $self = shift;
 
     # installs Regexp::Parser's handlers
     $self->SUPER::init();
 
     # now add your own...
     $self->add_handler('&' => ...);  # see below
   }
 
 
$parser->add_flag($flag, $code)
This method creates a method of the parser "FLAG_$flag", and sets it to the code reference in $code. Example:
   $parser->add_flag("u" => sub { 0x10 });
 
 

This makes 'u' a valid flag for your regex, and creates the method "FLAG_u". This doesn't mean you can use them on "qr//", but rather that you can write "(?u:...)" or "(?u)". The values 0x01, 0x02, 0x04, and 0x08 are used for "/m", "/s", "/i", and "/x" in Perl's regexes.

The flag handler gets the parser object and a boolean as arguments. The boolean is true if the flag is going to be turned on, and false if it's going to be turned off. For "(?i-s)", "FLAG_i" would be called with a true argument, and "FLAG_s" would be called with a false one.

If the flag handler returns 0, the flag is removed from the resulting object's visual flag set, so "(?ig-o)" becomes "(?i)".

$parser->del_flag(@flags)
Deletes the handlers for the flags --- you need only pass the flag names, without the ``FLAG_'' prefix.
$parser->add_handler($seq, $code)
This method creates a method of the parser named $seq, and set it to the code reference in $code. Example:
   # continuing from above...
   sub init {
     my $self = shift;
     $self->SUPER::init();
 
     $self->add_handler('&' => sub {
       # $S will be the Regexp::AndBranch object, $self
       my ($S) = @_;
       push @{ $S->{next} }, qw< atom >;
       return $S->object('and');
     });
   }
 
 

There is a specific scheme to how you must name your handlers. If you want to install a handler for '&&', you must first install a handler for '&' that calls the handler for '&&' if it can consume an ampersand. Handle names that have no ``predecessor'' (that is, a '&&' without a '&') are pre-consumption: that is, they have not matched something yet. Handle names that do have a ``predecessor'' (that is, a '&&' with a '&') are post-consumption: they have already matched what they are named.

The handle 'atom' is pre-consumptive (because there is no 'ato' handle, basically). In order for the 'atom' handle to be executed, you must explicitly add it to the queue ("$parser->{next}").

The handle '|' is post-consumptive. It happens to be executed when 'atom' matches a '|'. This means the handler for '|' does not need to match it; it has already been consumed.

If you created a handle for '&&' without a predecessor, you would have to add it explicity to the queue for it to ever be executed. As such, it would be pre-consumptive.

There is an interesting case of the right parenthesis ')'. There cannot be one without a matching left parenthesis '('; if there is an extra ')' a fatal error is thrown. However, the nature of 'atom' is to match a character, see if there's a handler installed, and call it if there is. I don't want atom to handle ')', so the handler is:

   $self->add_handler(')' => sub {
     my ($S) = @_;
     pop @{ $S->{next} };  # there was an 'atom' there
     &RxPOS--;             # this does pos(regex)--
     return;
   });
 
 

This handler un-consumes the ')' (via "&RxPOS--") and returns false, to pretend it didn't actually match. The real closing parenthesis handler is:

   $self->add_handler('c)' => sub {
     my ($S) = @_;
     $S->error(RPe_LPAREN) if ${&Rx} !~ m{ \G \) }xgc;
     pop @{ $S->{flags} };
     return $S->object(close =>);
   });
 
 

The name is 'c)' which has no predecessor 'c', so that means it is pre-consumptive, which is why it must match the right parenthesis itself. The handler throws an error if it can't match the ')', because if the 'c)' handler gets called, it's expected to match! It pops the flag stack, and returns an object.

Finally, if you want to add a new POSIX character class, its handler must start with ``POSIX_''.

$parser->del_handler(@handle_names)
This uninstalls the given handles. You send the names (like '|' or 'atom'). Here is a very simple complete sub-class that does not allow the "(?{ ... })" and "(??{ ... })" assertions:
   package Regexp::NoCode;
   use base 'Regexp::Parser';
 
   sub init {
     my $self = shift;
     $self->SUPER::init();
     $self->del_handler(qw<
       (?{   (??{   (?p{
     >);
   }
 
 

For those of you that don't know, "(?p{ ... })" is a synonym for the more common "(??{ ... })". Using the 'p' form is deprecated, but is still allowed, so I delete its handler too. You can use this class to ensure that there is are no code-execution statements in a regex:

   use Regexp::NoCode;
   my $p = Regexp::NoCode->new;
 
   # if it failed, reject it how you choose
   if (! $p->regex($regex)) {
     reject_regex(...);
   }
 
 

Any regex containing those assertions will fail to compile and throw an error (specifically, RPe_NOTREC, ``Sequence (?xx not recognized''). If you want to throw your own error, see ``ERROR HANDLING''.

Walking an Object

Most objects inherit their ender() and walk() methods from the base object class; most have no ending node, and most don't need to to do anything to the walking stack.

When an object does have an ending node, its ender() method should return an array reference of arguments to object() that will produce its ending node:

   # the 'open' node's ender:
   sub ender {
     my $self = shift;
     [ 'close', $self->nparen ];
   }
 
 

That means that when an 'open' node is walked into, after it has been walk()ed, it will insert the matching 'close' node into the walking stack.

The purpose of adding an ending node to the walking stack is that ending nodes are all omitted from the tree because of the stacked nature of the tree. However, having them returned while walking the tree is helpful.

The walk() method is used to modify the walking stack before the node is returned. Here is the walk() method for all the quantifier and 'minmod' nodes:

   # star, plus, curly, minmod
   sub walk {
     my ($self, $walk_stack, $depth) = @_;
     unshift(@$walk_stack,
       sub { -1 },
       $self->{data},
       sub { +1 },
     ) if $depth;
   }
 
 

The two additional arguments sent are the walking stack and the current depth in the walking stack. Elements are taken from the front of the walking stack, so we add them in the order they are to be encountered with unshift(). The two code references are used to go deeper and shallower in scope; "sub { -1 }" is used to go down into a deeper scope, and "sub{ +1 }" is used to come up out of it. In between these is "$self->{data}", which is the node's child.

Creating an Object

Ok, back to our Regexp::AndBranch example. Let me explain what the '&' metacharacter will mean. If you've used vim, you might know about its '\&' regex assertion. It's an ``AND'', much like '|' is an ``OR''. The vim regex "/x\&y/" means "match y if x can be matched at the same location". Therefore it would be represented in Perl with a look-ahead around the left-hand branch: "/(?=x)y/". We can expand this to any number of branches: "/a\&b\&c\&d/" in vim would be "/(?=a)(?=b)(?=c)d/" in Perl. We will support this with the '&' metacharacter.

We have added a handler for the '&' metacharacter, but now we need to write the supporting class for the Regexp::AndBranch::and object it creates!

A method call for a Regexp::MyRx::THING object will look in its own package first, then in Regexp::MyRx::__object__ (if it exists), then in Regexp::Parser::THING (if it exists), and finally in Regexp::Parser::__object__.

Here is the definition of Regexp::AndBranch::and:

   package Regexp::AndBranch::and;
   @ISA = qw( Regexp::Parser::branch );
 
   sub new {
     my ($class, $rx) = @_;
     my $self = bless {
       rx => $rx,
       flags => $rx->{flags}[-1],
       data => [ [] ],
       family => 'branch',
       type => 'and',
       raw => '&',
       branch => 1,
     }, $class;
     return $self;
   }
 
 

We inherit the merge() method defined in Regexp::Parser::branch, which is used when two of the same node are matched in succession. We also inherit visual()Z and walk().

However, we need to define our own qr() method, because we don't want to have &'s in our real regex. We need to convert "A&B&C" to "(?=A)(?=B)C".

   # still in Regexp::AndBranch::and
   sub qr {
     my ($self) = @_;
     my @kids = @{ $self->{data} };
 
 

Here, @kids is an array that holds array references; each of those array references is the body of one and-branch. We will take the last one off and keep it normal, but the others we will make to be look-aheads. To make an object, we need to access "$self->{rx}".

     my $consume = pop @kids;
     for (@kids) {
       $_ = $self->{rx}->object(ifmatch => 1, @$_);
     }
 
 

The 'ifmatch' object is a positive looking assertion, and the argument of 1 means it's a look-ahead. We send the unrolled contents of the array reference as the contents of the look-ahead, and we're done. Now we just need to return the regex representation of our children:

     return join "",
       map($_->qr, @kids),
       map($_->qr, @$consume);
   }
 
 

Here's a sample execution:

   use Regexp::AndBranch;
   my $parser = Regexp::AndBranch->new;
 
   # matches the first number found in a string
   # that contains 'foo' somewhere init
   my $rx = q{^.*foo&\D*(\d+)};
 
   $parser->regex($rx) or die $parser->errmsg;
   print "VISUAL: ", $parser->visual, "\n";
   print "REGEX:  ", $parser->qr, "\n";
 
 

Here's the output:

   VISUAL: ^(?:.*foo&\D*(\d+))
   REGEX:  (?-xism:^(?:(?=.*foo)\D*(\d+)))
 
 

Extending the Extension

Here's a final example. I'm going to rewrite Regexp::AndBranch to handle both '&' and '!'. '!' will indicate a negative look-ahead.

   package Regexp::AndBranch;
   use base 'Regexp::Parser';
 
   sub init {
     my $self = shift;
     $self->SUPER::init();
 
     # X&Y = match Y if match X at the same place
     $self->add_handler('&' => sub {
       my ($S) = @_;
       push @{ $S->{next} }, qw< atom >;
       return $S->object(and => 1);
     });
 
     # X!Y = match Y unless match X at the same place
     $self->add_handler('!' => sub {
       my ($S) = @_;
       push @{ $S->{next} }, qw< atom >;
       return $S->object(and => 0);
     });
   }
 
 

We've added a handler, and added an argument to the constructor. The argument is a true or false value determining whether this is a positive assertion. Here's the new class for the object:

   package Regexp::AndBranch::and;
   @ISA = qw( Regexp::Parser::branch );
 
   sub new {
     my ($class, $rx, $pos) = @_;
     my $self = bless {
       rx => $rx,
       flags => $rx->{flags}[-1],
       data => [ [] ],
       family => 'branch',
       branch => 1,
       neg => !$pos,
     }, $class;
     return $self;
   }
 
 

We've added a "neg" attribute, and removed both "type" and "raw". We will replace them with methods:

   sub raw {
     my $self = shift;
     $self->{neg} ? '!' : '&';
   }
 
   sub type {
     my $self = shift;
     $self->{neg} ? 'not' : 'and';
   }
 
 

And finally, the small change to the qr() method:

   sub qr {
     my ($self) = @_;
     my @kids = @{ $self->{data} };
     my $consume = pop @kids;
     my $type = $self->{neg} ? 'unlessm' : 'ifmatch';
 
     for (@kids) {
       $_ = $self->{rx}->object($type => 1, @$_);
     }
 
     return join "",
       map($_->qr, @kids),
       map($_->qr, @$consume);
   }
 
 

Here's a sample run:

   use Regexp::AndBranch;
   my $parser = Regexp::AndBranch->new;
 
   my @RX = (
     q{^(?:.*foo&\D*(\d+))},
     q{^(?:.*foo!\D*(\d+))},
   );
 
   for (@RX) {
     $parser->regex($_) or die $parser->errmsg;
     print "VISUAL: ", $parser->visual, "\n";
     print "REGEX:  ", $parser->qr, "\n";
   }
 
 

The output is:

   VISUAL: ^(?:.*foo&\D*(\d+))
   REGEX:  (?-xism:^(?:(?=.*foo)\D*(\d+)))
   VISUAL: ^(?:.*foo!\D*(\d+))
   REGEX:  (?-xism:^(?:(?!.*foo)\D*(\d+)))
 
 

Presto!

Escape Sequences

If you are creating a new escape sequence, like '\y', your handler will receive an additional argument which tells it if it's inside a character class.

   $parser->add_handler('\y' => sub {
     my ($S, $cc) = @_;
     if ($cc) {
       # character class specific code
     }
     else {
       # elsewhere
     }
   });
 
 

Character Classes

Character classes are not returned all at once, but piece by piece. Because range checking ("[a-z]") requires knowledge of the characters on the lower and upper side of the range, objects must be created during the first pass. To accomplish this, use force_object(), which creates an object regardless of what pass it's on.

   $parser->add_handler('\y' => sub {
     my ($S, $cc) = @_;
     if ($cc) {
       # so that $S->object(...) creates an object:
       $S->warn(RPe_BADESC, "y", " in character class");
       return $S->force_object(exact => "y");
     }
     else {
       # ...
     }
   });
 
 

Also note the "RPe_BADESC" warning takes two arguments: the character that was unexpectedly escaped, and a string. If the warning is called from a character class, pass `` in character class''; otherwise, pass an empty string.

ERROR HANDLING


Creating Custom Messages

It is probably easiest to follow my module when creating warning and error messages for your sub-class.

   package Your::SubClass;
   # use constant NAME => VALUE, FMTSTRING
   use constant err_FOOBAR => 1, 'You broke the %s';
   use constant err_TOOBIG => 2, 'Regex too large';
 
 

Then you can access them via "$parser->err_FOOBAR", etc.

Throwing Warnings and Errors

There are three methods you can use when a problem arises. They use Carp::carp() or Carp::croak(). The argument list is used to fill in the format string for sprintf().

$parser->warn(RPe_ERRMSG, ARGS...)
Throws a warning only during the first pass over the regex.
$parser->awarn(RPe_ERRMSG, ARGS...)
Unconditionally throws a warning. Primarily useful when you need to throw a warning that can only be figured out the second pass.
$parser->error(RPe_ERRMSG, ARGS...)
Throws a fatal error.

SEE ALSO

Regexp::Parser, Regexp::Parser::Objects.

AUTHOR

Jeff "japhy" Pinyan, japhy@perlmonk.org Copyright (c) 2004 Jeff Pinyan japhy@perlmonk.org. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.