KinoSearch1::Analysis::Token.3pm

Language: en

Version: 2010-06-16 (ubuntu - 24/10/10)

Section: 3 (Library functions)

NAME

KinoSearch1::Analysis::Token - unit of text

SYNOPSIS

     # private class - no public API
 
 

PRIVATE CLASS

You can't instantiate a Token object at the Perl level. However, you can affect individual Tokens within a TokenBatch by way of TokenBatch's (experimental) API.

DESCRIPTION

Token is the fundamental unit used by KinoSearch1's Analyzer subclasses. Each Token has four attributes: text, start_offset, end_offset, and pos_inc (the position increment).

The text of a token is a string.

A Token's start_offset and end_offset locate it within a larger text, even if the Token's text attribute gets modified (by stemming, for instance). The Token for "beating" in the text "beating a dead horse" begins life with a start_offset of 0 and an end_offset of 7; after stemming, the text is "beat", but the end_offset is still 7.
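The behaviour of the offsets can be illustrated with a small standalone model. This is only a sketch in Python, not KinoSearch1 code: the `Token` dataclass and the `toy_stem` function are hypothetical stand-ins invented for illustration, since the real Token class is private and cannot be instantiated from Perl.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """Illustrative stand-in for a token; NOT the KinoSearch1 class."""
    text: str
    start_offset: int
    end_offset: int
    pos_inc: int = 1  # position increment, defaults to 1


def toy_stem(token):
    """Hypothetical stemmer: strips a trailing 'ing' from the text only.

    The offsets are deliberately left untouched, so they keep pointing
    at the original span in the source text.
    """
    if token.text.endswith("ing"):
        token.text = token.text[:-3]
    return token


source = "beating a dead horse"
token = Token(text="beating", start_offset=0, end_offset=7)
toy_stem(token)

# The text has changed, but the offsets still recover the original word.
original = source[token.start_offset:token.end_offset]
```

Because the offsets survive stemming, the original word can still be recovered from the source text, for example when highlighting a search hit.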

The position increment, which defaults to 1, is an advanced tool for manipulating phrase matching. Ordinarily, Tokens are assigned consecutive position numbers: 0, 1, and 2 for "three blind mice". However, if you set the position increment for "blind" to, say, 1000, then the three tokens will end up assigned to positions 0, 1, and 1001, and will no longer produce a phrase match for the query "three blind mice".
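The arithmetic in this example can be reproduced with a short sketch. It is written in Python and is not KinoSearch1 code; `assign_positions` and `is_phrase` are hypothetical helpers. It assumes that a token's increment is applied after the token itself has been assigned its position, which is the reading consistent with the numbers above, and it treats a phrase match as simply requiring consecutive positions.

```python
def assign_positions(pos_incs):
    """Assign a position to each token from its position increment.

    Assumption: each token takes the current position, and its
    increment then determines where the NEXT token lands.
    """
    positions = []
    current = 0
    for inc in pos_incs:
        positions.append(current)
        current += inc
    return positions


def is_phrase(positions):
    """Simplified phrase test: every token directly follows the previous one."""
    return all(b - a == 1 for a, b in zip(positions, positions[1:]))


# "three blind mice" with default increments of 1.
default_positions = assign_positions([1, 1, 1])

# Same tokens, but the increment for "blind" is set to 1000.
gapped_positions = assign_positions([1, 1000, 1])
```

Under these assumptions, the default increments yield positions 0, 1, and 2, while the modified increments yield 0, 1, and 1001, so "blind" and "mice" are no longer adjacent and the simplified phrase test fails.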

Copyright 2006-2010 Marvin Humphrey

LICENSE, DISCLAIMER, BUGS, etc.

See KinoSearch1 version 1.00.