trtok - a fast and trainable tokenizer for natural languages
------------------------------------------------------------

Trtok is a universal, performance-oriented tokenizer for processing
natural languages. It reads text and tries to correctly detect sentence
boundaries and divide the text into tokens. Trtok does not implement any
specific heuristic to perform these tasks; instead it lets the user define
rules for potential joining and splitting of words into tokens and
sentences. The final decision whether to split or join words and whether
to break sentences is left to a conditional probabilistic model which is
trained from user-supplied annotated data. The way the trainer understands
the data can be extensively customized by the user, who can define his own
features and specify which features are significant for which tokens.

1) Tokenization schemes
-----------------------

The user might want to use trtok for processing more than one language or
for processing one language in several ways. These different ways of
tokenization are described by "tokenization schemes". Their definitions
reside in the "schemes" subdirectory of the installation directory. Every
folder inside "schemes" defines a single tokenization scheme by way of
various configuration files. Tokenization schemes may be nested to
represent a sort of scheme inheritance, where a scheme inherits all the
configuration files of its ancestors unless it redefines them by having a
configuration file of the same name.

a) Rough tokenization rules

The tokenizer identifies all potential token and sentence boundaries
within the text and uses them, together with the whitespace, to split the
text into short segments called rough tokens. The ambiguous boundaries are
placed according to the tokenization scheme. Files with the .split
extension define positions where a word may be broken into two tokens
(called a MAY_SPLIT). Files with the .join extension define positions
where two words may be joined into a single token (MAY_JOIN). Finally,
files with the .break extension define positions at which there might be a
sentence break (MAY_BREAK_SENTENCE).

All of the above-named files must contain lines of pairs of
whitespace-delimited regular expressions. If the text leading to a
position and the text following it match the two paired regular
expressions respectively, the ambiguous boundary (MAY_SPLIT for .split
files, MAY_JOIN for .join files or MAY_BREAK_SENTENCE for .break files) is
placed at that position. The grammar of the regular expressions in these
files is the one used by Quex and described in detail at
http://quex.sourceforge.net/doc/html/usage/patterns/context-free.html.
Take particular care with Unicode: Quex does not handle Unicode characters
directly in its regular expression syntax, so be sure to use the \UXXXXXX
escape notation if you need them. The files may contain comments, which
are lines that begin with the # symbol.
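As a purely illustrative sketch (these rules are not taken from any scheme
shipped with trtok, and the file names are hypothetical), a minimal .break
file placing a possible sentence break after terminal punctuation followed
by an uppercase ASCII letter, and a .split file allowing a trailing comma
to be split off a word, might look like this:

  # sentences.break (hypothetical): the left pattern matches the text
  # before the position, the right pattern matches the text after it
  [.!?]        [A-Z]

  # comma.split (hypothetical): allow "word," to be split into "word" ","
  [a-zA-Z]     ,

Remember that both columns use Quex's regular expression grammar, not
PCRE, so consult the Quex documentation linked above before writing more
elaborate rules.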
b) User-defined properties

Files with a .rep extension contain a single regular expression from the
family of expressions allowed in PCRE (see pcre.org). A rough token is
marked as having this property if it can be matched to the regular
expression.

Files with a .listp extension define properties using lists of token
types. If a rough token's text is exactly the same as a line from a .listp
file, then that rough token is marked as having the property defined by
that .listp file.

c) Feature selection

Every tokenization scheme must have a file named "features". For each
rough token in the vicinity of the potential split/join/sentence break, it
specifies which features are important for the decision. A typical line
starts by declaring a set of interesting offsets (0 is the rough token
preceding the decision point, -1 the one before it, +1 the one after it,
etc.). These offsets are separated by commas and intervals can be used for
convenience (e.g. -4,-2..+2,5 selects -4,-2,-1,0,1,2,5). After the offsets
comes a colon and a comma-separated list of properties. The property names
are the filenames of their definitions without the extensions and they are
limited to the common identifier character set [a-zA-Z0-9_]. The line is
closed with a terminating semicolon.

Apart from these simple features, it is possible to ask for combined
features, which bundle the values of different properties of tokens at
different offsets into a single feature value. These are defined on their
own line and are enclosed in parentheses. Inside the parentheses is a "^"
separated list of offset:property pairs. If a combined feature takes
properties from a single token only, the parenthesized expression can
appear on the right-hand side of a typical line instead of a simple
property name and the offsets within its definition are omitted.

Apart from the user-defined properties from the .rep and .listp files, the
tokenizer defines the non-binary property "%length", whose value is the
length of the rough token, and the meta-property "%Word", which generates
a property for each rough token type.

Example:

  -2..+2: %Word;
  -5..5: uppercase, abbreviation, (starts_with_number ^ ends_with_period);
  (0:fullstop ^ 1:initial)

d) Maxent training parameters

More control over the process of training the probabilistic model can be
had by manipulating the "maxent.params" file. This is an INI-style
configuration file which lets the user set the following parameters, which
get passed directly to the training toolkit.

event_cutoff=<int>
  All training events which occur fewer times than event_cutoff are
  ignored. Default 1.

n_iterations=<int>
  The maximum number of iterations the iterative method will use.
  Default 15.

method_name=lbfgs|gis
  Which of the two methods, L-BFGS or GIS, is to be used. L-BFGS is
  recommended. Default lbfgs.

smoothing_coefficient=<double>
  Sigma, the coefficient in Gaussian smoothing. Default 0 (no smoothing).

convergence_tolerance=<double>
  The model is regarded as convergent when the relative difference between
  the log-likelihoods of succeeding models is < convergence_tolerance.
  Default 1e-05.

save_as_binary=false|true
  Whether to save the model in a binary format, which is faster to load
  and smaller if Maxent was compiled with zlib support. Default false.
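To give a concrete picture, a maxent.params file requesting L-BFGS
training with mild Gaussian smoothing could look like the sketch below
(the particular values are illustrative, not recommendations):

  method_name=lbfgs
  n_iterations=30
  event_cutoff=1
  smoothing_coefficient=1.0
  convergence_tolerance=1e-05
  save_as_binary=true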
e) File lists and filename replacement regular expressions

Files [prepare|train|heldout|tokenize|evaluate].[fl|fnre] are for
convenience only and are described later.

2) Running the tokenizer
------------------------

a) Different ways of selecting input

The first argument passed to the tokenizer selects its mode, which can be
either "prepare", "train", "tokenize" or "evaluate". The second argument
is a path relative to the directory "schemes" which selects the
tokenization scheme to be used. The rest of the arguments are input files
and options.

Input files can be specified explicitly on the command line. More files
can be given using the -l (--file-list) option, which takes a path to a
file and adds every line of it as another input file.

When running in prepare mode or tokenize mode, an output file for each
input file has to be specified, and when running in train mode or evaluate
mode, a file with the annotated version has to be specified. These
secondary files are selected by taking the input file's path and
transforming it using a regular expression/replacement string. The
filename regular expression/replacement string is specified using the -r
(--filename-regexp) option. The strings look like replacement commands in
sed, where the first character can be any ASCII character; that character
separates the regular expression from the replacement string and also
terminates the entire string. Unlike sed, this special character cannot be
present anywhere else in the string (no escaping). The breed of regular
expressions used here is the one supported by PCRE; the replacement
strings contain the placeholders \0, \1... for the entire matched string,
the first captured sequence and so on.

Example:

  trtok train en/simple/brown -l data/brown/train.fl -r "|raw|txt|"

In the annotated/tokenized files, sentences are separated by newlines and
tokens are separated by spaces.
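To make the filename transformation concrete: with the replacement string
"|raw|txt|" from the example above, a (hypothetical) input file

  data/brown/cats/ca01.raw

would be paired with the annotated file

  data/brown/cats/ca01.txt

because the PCRE pattern "raw" matches the extension and is replaced by
"txt".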
If no input files or file lists were given, a default file list named
<mode_name>.fl, which is part of the tokenization scheme, is used. If no
filename regular expression/replacement string is given, the one in the
file named <mode_name>.fnre from the tokenization scheme is used. In both
cases <mode_name> expands to either "prepare", "train", "tokenize" or
"evaluate" depending on the current mode.

If no input files or file lists were given and there are no default file
lists defined by the tokenization scheme, then the tokenizer processes the
standard input and writes to the standard output. This is, however, only
possible for the "prepare" and "tokenize" modes. The standard input/output
combo can also be explicitly selected by specifying the input file "-" on
the command line.

b) Different modes of execution

In "prepare" mode, the tokenizer reads the input, splits it into rough
tokens and then outputs it with all possible splits and sentence breaks
performed. This format might be handy for manual annotators, who then only
have to join together parts of tokens and sentences.

In "train" mode, the tokenizer reads both the input and its annotated
version. It uses the annotated data to get pairs of questions (values of
features in a given context surrounding a decision point) and answers
(whether the decision point is to become a joining of tokens, a splitting
of tokens or a sentence break). These pairs are then used to train the
probabilistic model, which is stored in a file under the "build"
directory.

In "tokenize" mode, the tokenizer relies on the presence of an already
trained model and uses it to classify every decision point in the input
file and output the tokenized and segmented text.

In "evaluate" mode, the tokenizer reads both the input and its annotation
as in "train" mode, but now it also queries the trained model for an
opinion and compares it with the one found in the annotated data. The
tokenizer outputs a log of every context together with both the predicted
and the correct outcome for later analysis. The "analyze" script provided
with trtok will let you read this output and determine the accuracy of
your system.

c) Different options

If you launch trtok with no command line arguments, you will get a summary
of all the supported command line options and their meanings. These
include options for setting the encoding of the input and output files,
options for controlling the output (preserving the original tokenization,
segmentation or paragraph division), options for preprocessing the input
(whether entities are to be expanded for the duration of the tokenization
and whether they are to be kept expanded in the output; whether XML should
be hidden from tokenization), options for logging the contexts and
outcomes to a third file, and others.
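Putting the modes together, a typical session might look like the sketch
below. The scheme name is taken from the earlier example; the data paths
and the .prep/.tok extensions are purely hypothetical:

  # 1. Let trtok perform every possible split/break so annotators only
  #    have to join things back together:
  trtok prepare en/simple/brown data/ca01.raw -r "|raw|prep|"

  # 2. After the prepared files have been hand-corrected and saved with
  #    a .txt extension, train the model:
  trtok train en/simple/brown data/ca01.raw -r "|raw|txt|"

  # 3. Check the trained model against annotated data it has not seen:
  trtok evaluate en/simple/brown data/ca02.raw -r "|raw|txt|"

  # 4. Tokenize new text with the trained model:
  trtok tokenize en/simple/brown data/new.raw -r "|raw|tok|"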