Lingua::Stem::Snowball.3pm
Language: en
Version: 2008-11-08 (ubuntu - 08/07/09)
Section: 3 (Library functions)
NAME
Lingua::Stem::Snowball - Perl interface to Snowball stemmers.
SYNOPSIS
    use Lingua::Stem::Snowball qw( stem );

    my @words = qw( horse hooves );

    # OO interface:
    my $stemmer = Lingua::Stem::Snowball->new( lang => 'en' );
    $stemmer->stem_in_place( \@words ); # qw( hors hoov )

    # plain interface:
    my @stems = stem( 'en', \@words );
DESCRIPTION
Stemming reduces related words to a common root form. For instance, "horse", "horses", and "horsing" all become "hors". Most commonly, stemming is deployed as part of a search application, allowing searches for a given term to match documents which contain other forms of that term.

This module is very similar to Lingua::Stem. However, Lingua::Stem is pure Perl, while Lingua::Stem::Snowball is an XS module which provides a Perl interface to the C version of the Snowball stemmers (<http://snowball.tartarus.org>).
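As a rough illustration of the search use case described above, the sketch below indexes a few documents by stem, so that a query for one form of a word finds documents containing another form. The sample documents, the %index structure, and the search() helper are hypothetical. Only new() and stem() come from this module.

```perl
use strict;
use warnings;
use Lingua::Stem::Snowball;

my $stemmer = Lingua::Stem::Snowball->new( lang => 'en' );

# Hypothetical documents, keyed by an arbitrary id.
my %docs = (
    1 => 'the horse ran',
    2 => 'two horses ran',
    3 => 'a quiet cat',
);

# Build an inverted index mapping each stem to the ids that contain it.
my %index;
for my $id ( sort keys %docs ) {
    my @words = split ' ', $docs{$id};
    for my $stem ( $stemmer->stem( \@words ) ) {
        $index{$stem}{$id} = 1;
    }
}

# Hypothetical helper: stem the query term, then look it up in the index.
sub search {
    my ($term) = @_;
    my $stem = $stemmer->stem($term);
    return sort keys %{ $index{$stem} || {} };
}

my @hits = search('horsing');    # documents 1 and 2 share the stem "hors"
print "matching documents: @hits\n";
```

Stemming both the indexed text and the query with the same stemmer is what makes the lookup work. The stems themselves never need to be real words.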
Supported Languages
The following stemmers are available (as of Lingua::Stem::Snowball 0.94):

    |-----------------------------------------------------------|
    | Language   | ISO code | default encoding | also available |
    |-----------------------------------------------------------|
    | Danish     | da       | ISO-8859-1       | UTF-8          |
    | Dutch      | nl       | ISO-8859-1       | UTF-8          |
    | English    | en       | ISO-8859-1       | UTF-8          |
    | Finnish    | fi       | ISO-8859-1       | UTF-8          |
    | French     | fr       | ISO-8859-1       | UTF-8          |
    | German     | de       | ISO-8859-1       | UTF-8          |
    | Italian    | it       | ISO-8859-1       | UTF-8          |
    | Norwegian  | no       | ISO-8859-1       | UTF-8          |
    | Portuguese | pt       | ISO-8859-1       | UTF-8          |
    | Spanish    | es       | ISO-8859-1       | UTF-8          |
    | Swedish    | sv       | ISO-8859-1       | UTF-8          |
    | Russian    | ru       | KOI8-R           | UTF-8          |
    |-----------------------------------------------------------|
Benchmarks
Here is a comparison of Lingua::Stem::Snowball and Lingua::Stem, using The Works of Edgar Allan Poe, volumes 1-5 (via Project Gutenberg) as source material. It was produced on a 3.2GHz Pentium 4 running FreeBSD 5.3 and Perl 5.8.7. (The benchmarking script is included in this distribution: bin/benchmark_stemmers.plx.) The rate column works out to total words divided by average seconds, that is, words per second.

    |--------------------------------------------------------------------|
    | total words: 454285 | unique words: 22748                          |
    |--------------------------------------------------------------------|
    | module                      | config        | avg secs | rate      |
    |--------------------------------------------------------------------|
    | Lingua::Stem 0.81           | no cache      | 2.029    | 223881    |
    | Lingua::Stem 0.81           | cache level 2 | 1.280    | 355025    |
    | Lingua::Stem::Snowball 0.94 | stem          | 1.426    | 318636    |
    | Lingua::Stem::Snowball 0.94 | stem_in_place | 0.641    | 708495    |
    |--------------------------------------------------------------------|
METHODS / FUNCTIONS
new
    my $stemmer = Lingua::Stem::Snowball->new(
        lang     => 'es',
        encoding => 'UTF-8',
    );
    die $@ if $@;
Create a Lingua::Stem::Snowball object. new() accepts the following hash style parameters:
- lang: An ISO code taken from the table of supported languages, above.
- encoding: A supported character encoding.
Be careful with the values you supply to new(). If "lang" is invalid, Lingua::Stem::Snowball does not throw an exception, but instead sets $@. Also, if you supply an invalid combination of values for "lang" and "encoding", Lingua::Stem::Snowball will not warn you, but the behavior will change: stem() will always return undef, and stem_in_place() will be a no-op.
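Because both failure modes described above are silent, it can help to wrap construction in a small checking function. The following is a minimal sketch, not part of the module: make_stemmer() is a hypothetical helper, and the probe relies on the assumption that a correctly configured stemmer returns a defined stem for a plain ASCII word.

```perl
use strict;
use warnings;
use Lingua::Stem::Snowball;

# Hypothetical helper: build a stemmer, or die with a useful message.
sub make_stemmer {
    my ( $lang, $encoding ) = @_;

    # Clear any stale error so that we only see the one set by new().
    $@ = '';
    my $stemmer = Lingua::Stem::Snowball->new(
        lang     => $lang,
        encoding => $encoding,
    );
    die "Could not create stemmer for '$lang': $@\n" if $@;

    # An invalid lang/encoding pair produces no warning, but stem() then
    # returns undef. Probe with a sample word to detect that case.
    # (Assumption: a working stemmer yields a defined stem for 'test'.)
    defined $stemmer->stem('test')
        or die "Unsupported combination: $lang / $encoding\n";

    return $stemmer;
}

my $stemmer = make_stemmer( 'es', 'UTF-8' );
```

Failing at construction time keeps a misconfigured stemmer from quietly turning stem_in_place() into a no-op later on.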
stem
    @stemmed = $stemmer->stem( WORDS, [IS_STEMMED] );
    @stemmed = stem( ISO_CODE, WORDS, [LOCALE, IS_STEMMED] );
Return lowercased and stemmed output. WORDS may be either a reference to an array of words or a single scalar word.
In a scalar context, stem() returns the first item in the array of stems:
    $stem       = $stemmer->stem($word);
    $first_stem = $stemmer->stem(\@words); # probably wrong
LOCALE has no effect; it is only there as a placeholder for backwards compatibility (see Changes). IS_STEMMED must be a reference to a scalar; if it is supplied, it will be set to 1 if the output differs from the input in some way, 0 otherwise.
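The IS_STEMMED argument can be sketched as follows, using the OO interface. The variable names are illustrative. The stems match the examples given in the DESCRIPTION section.

```perl
use strict;
use warnings;
use Lingua::Stem::Snowball;

my $stemmer = Lingua::Stem::Snowball->new( lang => 'en' );

my @words = qw( horse horses );
my $is_stemmed;    # filled in by stem() through the reference

# List context: one stem per input word.
my @stems = $stemmer->stem( \@words, \$is_stemmed );

print "stems: @stems\n";
print "output differs from input: ", ( $is_stemmed ? 'yes' : 'no' ), "\n";
```

Note that the flag is passed as \$is_stemmed, a reference to a scalar, as the description above requires.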
stem_in_place
    $stemmer->stem_in_place(\@words);
This is a high-performance, streamlined version of stem() (in fact, stem() calls stem_in_place() internally). It has no return value, instead modifying each item in an existing array of words. The words must already be in lower case.
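Since stem_in_place() does not lowercase its input, the caller has to do so first. Here is a minimal sketch, assuming plain ASCII words for which Perl's built-in lc is sufficient. Text in other encodings may need to be decoded before case folding.

```perl
use strict;
use warnings;
use Lingua::Stem::Snowball;

my $stemmer = Lingua::Stem::Snowball->new( lang => 'en' );

my @words = qw( Horse HOOVES );

# stem_in_place() expects lower-case input, so normalize first.
$_ = lc for @words;

# Modifies @words directly; there is no return value to capture.
$stemmer->stem_in_place( \@words );

print "@words\n";    # the array now holds the stems
```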
lang
    my $lang = $stemmer->lang;
    $stemmer->lang($iso_language_code);
Accessor/mutator for the lang parameter. If there is no stemmer for the supplied ISO code, the language is not changed (but $@ is set).
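A short sketch of the failure path described above. The code 'xx' is a made-up value, assumed here to have no stemmer.

```perl
use strict;
use warnings;
use Lingua::Stem::Snowball;

my $stemmer = Lingua::Stem::Snowball->new( lang => 'en' );

# Clear any stale error, then try to switch to an unsupported language.
$@ = '';
$stemmer->lang('xx');    # 'xx' is a hypothetical, unsupported code
warn "Language not changed: $@\n" if $@;

# The previous setting is still in effect.
my $current = $stemmer->lang;
print "current language: $current\n";
```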
encoding
    my $encoding = $stemmer->encoding;
    $stemmer->encoding($encoding);
Accessor/mutator for the encoding parameter.
stemmers
    my @iso_codes = stemmers();
    my @iso_codes = $stemmer->stemmers();
Returns a list of all valid language codes.
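One possible use of stemmers() is to validate a user-supplied language code before constructing a stemmer. In this sketch the function is called by its fully qualified name, so that the example does not depend on the module's export list. The $requested value and the fallback to English are illustrative choices, not module behavior.

```perl
use strict;
use warnings;
use Lingua::Stem::Snowball;

# Build a lookup table of the supported ISO codes.
my %supported = map { $_ => 1 } Lingua::Stem::Snowball::stemmers();

my $requested = 'fr';    # hypothetical user-supplied code

# Fall back to English when the requested language has no stemmer.
my $lang = $supported{$requested} ? $requested : 'en';

my $stemmer = Lingua::Stem::Snowball->new( lang => $lang );
print "using stemmer for: ", $stemmer->lang, "\n";
```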
REQUESTS & BUGS
Please report any requests, suggestions or bugs via the RT bug-tracking system at http://rt.cpan.org/ or by email to bug-Lingua-Stem-Snowball@rt.cpan.org.

http://rt.cpan.org/NoAuth/Bugs.html?Dist=Lingua-Stem-Snowball is the RT queue for Lingua::Stem::Snowball. Please check to see if your bug has already been reported.
AUTHORS
Lingua::Stem::Snowball was originally developed to provide access to stemming algorithms for the OpenFTS (full text search engine) project (<http://openfts.sourceforge.net>), by Oleg Bartunov, <oleg at sai dot msu dot su> and Teodor Sigaev, <teodor at stack dot net>.

Currently maintained by Marvin Humphrey <marvin at rectangular dot com>. Previously maintained by Fabien Potencier <fabpot at cpan dot org>.
COPYRIGHT
Copyright 2004-2006

This software may be freely copied and distributed under the same terms and conditions as Perl.
Snowball files and stemmers are covered by the BSD license.
SEE ALSO
<http://snowball.tartarus.org>, Lingua::Stem.