unicharset_extractor

Language: en

Version: 258796 (debian - 07/07/09)

Section: 1 (User commands)

NAME

unicharset_extractor - generate the unicharset data file from tesseract box files

SYNOPSIS

unicharset_extractor is part of the process of training tesseract for a new language. Tesseract needs to know the set of possible characters it can output. To generate the unicharset data file, run the unicharset_extractor program on the bounding box files of the training pages:

unicharset_extractor fontfile_1.box fontfile_2.box ...

DESCRIPTION

This manual page briefly documents the unicharset_extractor command.

tesseract is a commercial-quality OCR engine originally developed at HP between 1985 and 1995. In 1995, this engine was among the top 3 evaluated by UNLV. It was open-sourced by HP and UNLV in 2005.

Tesseract needs access to the character properties isalpha, isdigit, isupper, and islower. This data must be encoded in the unicharset data file. Each line of this file corresponds to one character: the character in UTF-8, followed by a hexadecimal number representing a binary mask that encodes the properties. Each bit corresponds to one property; a bit set to 1 means the property is true. The bit ordering, from least significant bit to most significant bit, is: isalpha, islower, isupper, isdigit.
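As an illustration of the mask described above, here is a minimal Python sketch. It is not part of tesseract. The names property_mask, decode_mask and format_line are hypothetical, the single-space separator and the absence of a 0x prefix are assumptions (this page does not specify them), and Python's Unicode-aware str methods stand in for the C character classification functions, so results may differ for some characters.

```python
# Bit values follow the ordering given above, least significant bit first.
IS_ALPHA = 0x1
IS_LOWER = 0x2
IS_UPPER = 0x4
IS_DIGIT = 0x8


def property_mask(ch: str) -> int:
    """Build the property mask for a single character."""
    mask = 0
    if ch.isalpha():
        mask |= IS_ALPHA
    if ch.islower():
        mask |= IS_LOWER
    if ch.isupper():
        mask |= IS_UPPER
    if ch.isdigit():
        mask |= IS_DIGIT
    return mask


def decode_mask(mask: int) -> dict:
    """Turn a mask back into named boolean properties."""
    return {
        "isalpha": bool(mask & IS_ALPHA),
        "islower": bool(mask & IS_LOWER),
        "isupper": bool(mask & IS_UPPER),
        "isdigit": bool(mask & IS_DIGIT),
    }


def format_line(ch: str) -> str:
    """Format one line: character, then the mask in hexadecimal.

    Assumption: one space as separator and no 0x prefix.
    """
    return f"{ch} {property_mask(ch):x}"


if __name__ == "__main__":
    for ch in ["a", "A", "7", "."]:
        print(format_line(ch), decode_mask(property_mask(ch)))
```

Under these assumptions, a lowercase letter such as "a" gets mask 3 (isalpha and islower), an uppercase letter such as "A" gets mask 5 (isalpha and isupper), a digit gets mask 8, and a character with none of the four properties gets mask 0.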

SEE ALSO

feh(1), convert(1), mftraining(1), cntraining(1), tesseract(1), wordlist2dawg(1).

AUTHOR

tesseract was written by Ray Smith.

This manual page was written by Jeffrey Ratcliffe <Jeffrey.Ratcliffe@gmail.com>, for the Debian project (but may be used by others).