trtok - a fast and trainable tokenizer for natural languages
------------------------------------------------------------

Trtok is a universal, performance-oriented tokenizer for processing
natural languages. It reads text and tries to correctly detect sentence
boundaries and divide the text into tokens. Trtok does not implement any
specific heuristic to perform these tasks; instead it lets the user define
rules for potential joining and splitting of words into tokens and
sentences. The final decision whether to split or join words and whether
to break sentences is left to a conditional probabilistic model which is
trained from user-supplied annotated data. The way the trainer understands
the data can be extensively customized by the user, who can define his own
features and specify which features are significant for which tokens.

1) Tokenization schemes
-----------------------

The user might want to use trtok for processing more than one language or
for processing one language in several ways. These different ways of
tokenization are described by "tokenization schemes". Their definitions
reside in the "schemes" subdirectory of the installation directory. Every
folder inside "schemes" defines a single tokenization scheme by way of
various configuration files. Tokenization schemes may be nested to
represent a sort of scheme inheritance, where a scheme inherits all the
configuration files of its ancestors unless it redefines them by having a
configuration file of the same name.

a) Rough tokenization rules

The tokenizer identifies all potential token and sentence boundaries
within the text and uses them, together with the whitespace, to split the
text into short segments called rough tokens. The ambiguous boundaries are
placed according to the tokenization scheme. Files with the .split
extension define positions where a word may be broken into two tokens
(called a MAY_SPLIT). Files with the .join extension define positions
where two words may be joined into a single token (MAY_JOIN). Finally,
files with the .break extension define positions at which there might be a
sentence break (MAY_BREAK_SENTENCE).

All of the above-named files must contain lines of pairs of
whitespace-delimited regular expressions. If the text leading to a
position and the text following it match the two paired regular
expressions respectively, the ambiguous boundary (MAY_SPLIT for .split
files, MAY_JOIN for .join files or MAY_BREAK_SENTENCE for .break files) is
placed at that position. The grammar of the regular expressions in these
files is the one used by Quex and described in detail at
http://quex.sourceforge.net/doc/html/usage/patterns/context-free.html.
Take particular care with Unicode: Quex does not handle Unicode characters
directly in its regular expression syntax, so be sure to use the \UXXXXXX
escape notation if you need them. The files may contain comments, which
are lines that begin with the # symbol.
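As a purely illustrative sketch (these rules are not taken from any scheme
shipped with trtok, and the file names are hypothetical), a minimal .break
file placing a possible sentence break after terminal punctuation followed
by an uppercase ASCII letter, and a .split file allowing a trailing comma
to be split off a word, might look like this:

  # sentences.break (hypothetical): the left pattern matches the text
  # before the position, the right pattern matches the text after it
  [.!?]        [A-Z]

  # comma.split (hypothetical): allow "word," to be split into "word" ","
  [a-zA-Z]     ,

Remember that both columns use Quex's regular expression grammar, not
PCRE, so consult the Quex documentation linked above before writing more
elaborate rules.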
b) User-defined properties

Files with a .rep extension contain a single regular expression from the
family of expressions allowed in PCRE (see pcre.org). A rough token is
marked as having this property if it can be matched to the regular
expression.

Files with a .listp extension define properties using lists of token
types. If a rough token's text is exactly the same as a line from a .listp
file, then that rough token is marked as having the property defined by
that .listp file.

c) Feature selection

Every tokenization scheme must have a file named "features". For each
rough token in the vicinity of the potential split/join/sentence break, it
specifies which features are important for the decision. A typical line
starts by declaring a set of interesting offsets (0 is the rough token
preceding the decision point, -1 the one before it, +1 the one after it,
etc.). These offsets are separated by commas and intervals can be used for
convenience (e.g. -4,-2..+2,5 selects -4,-2,-1,0,1,2,5). After the offsets
comes a colon and a comma-separated list of properties. The property names
are the filenames of their definitions without the extensions and they are
limited to the common identifier character set [a-zA-Z0-9_]. The line is
closed with a terminating semicolon.

Apart from these simple features, it is possible to ask for combined
features, which bundle the values of different properties of tokens at
different offsets into a single feature value. These are defined on their
own line and are enclosed in parentheses. Inside the parentheses is a "^"
separated list of offset:property pairs. If a combined feature takes
properties from a single token only, the parenthesized expression can
appear on the right-hand side of a typical line instead of a simple
property name and the offsets within its definition are omitted.

Apart from the user-defined properties from the .rep and .listp files, the
tokenizer defines the non-binary property "%length", whose value is the
length of the rough token, and the meta-property "%Word", which generates
a property for each rough token type.

Example:

  -2..+2: %Word;
  -5..5: uppercase, abbreviation, (starts_with_number ^ ends_with_period);
  (0:fullstop ^ 1:initial)

d) Maxent training parameters

More control over the process of training the probabilistic model can be
had by manipulating the "maxent.params" file. This is an INI-style
configuration file which lets the user set the following parameters, which
get passed directly to the training toolkit.

event_cutoff=<int>
  All training events which occur fewer times than event_cutoff are
  ignored. Default 1.

n_iterations=<int>
  The maximum number of iterations the iterative method will use.
  Default 15.

method_name=lbfgs|gis
  Which of the two methods, L-BFGS or GIS, is to be used. L-BFGS is
  recommended. Default lbfgs.

smoothing_coefficient=<double>
  Sigma, the coefficient in Gaussian smoothing. Default 0 (no smoothing).

convergence_tolerance=<double>
  The model is regarded as convergent when the relative difference between
  the log-likelihoods of succeeding models is < convergence_tolerance.
  Default 1e-05.

save_as_binary=false|true
  Whether to save the model in a binary format, which is faster to load
  and smaller if Maxent was compiled with zlib support. Default false.
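To give a concrete picture, a maxent.params file requesting L-BFGS
training with mild Gaussian smoothing could look like the sketch below
(the particular values are illustrative, not recommendations):

  method_name=lbfgs
  n_iterations=30
  event_cutoff=1
  smoothing_coefficient=1.0
  convergence_tolerance=1e-05
  save_as_binary=true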
e) File lists and filename replacement regular expressions

Files [prepare|train|heldout|tokenize|evaluate].[fl|fnre] are for
convenience only and are described later.

2) Running the tokenizer
------------------------

a) Different ways of selecting input

The first argument passed to the tokenizer selects its mode, which can be
either "prepare", "train", "tokenize" or "evaluate". The second argument
is a path relative to the directory "schemes" which selects the
tokenization scheme to be used. The rest of the arguments are input files
and options.

Input files can be specified explicitly on the command line. More files
can be given using the -l (--file-list) option, which takes a path to a
file and adds every line of it as another input file.

When running in prepare mode or tokenize mode, an output file for each
input file has to be specified, and when running in train mode or evaluate
mode, a file with the annotated version has to be specified. These
secondary files are selected by taking the input file's path and
transforming it using a regular expression/replacement string. The
filename regular expression/replacement string is specified using the -r
(--filename-regexp) option. The strings look like replacement commands in
sed, where the first character can be any ASCII character; that character
separates the regular expression from the replacement string and also
terminates the entire string. Unlike sed, this special character cannot be
present anywhere else in the string (no escaping). The breed of regular
expressions used here is the one supported by PCRE; the replacement
strings contain the placeholders \0, \1... for the entire matched string,
the first captured sequence and so on.

Example:

  trtok train en/simple/brown -l data/brown/train.fl -r "|raw|txt|"

In the annotated/tokenized files, sentences are separated by newlines and
tokens are separated by spaces.
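To make the filename transformation concrete: with the replacement string
"|raw|txt|" from the example above, a (hypothetical) input file

  data/brown/cats/ca01.raw

would be paired with the annotated file

  data/brown/cats/ca01.txt

because the PCRE pattern "raw" matches the extension and is replaced by
"txt".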
If no input files or file lists were given, a default file list named
<mode_name>.fl, which is part of the tokenization scheme, is used. If no
filename regular expression/replacement string is given, the one in the
file named <mode_name>.fnre from the tokenization scheme is used. In both
cases <mode_name> expands to either "prepare", "train", "tokenize" or
"evaluate" depending on the current mode.

If no input files or file lists were given and there are no default file
lists defined by the tokenization scheme, then the tokenizer processes the
standard input and writes to the standard output. This is, however, only
possible for the "prepare" and "tokenize" modes. The standard input/output
combo can also be explicitly selected by specifying the input file "-" on
the command line.

b) Different modes of execution

In "prepare" mode, the tokenizer reads the input, splits it into rough
tokens and then outputs it with all possible splits and sentence breaks
performed. This format might be handy for manual annotators, who then only
have to join together parts of tokens and sentences.

In "train" mode, the tokenizer reads both the input and its annotated
version. It uses the annotated data to get pairs of questions (values of
features in a given context surrounding a decision point) and answers
(whether the decision point is to become a joining of tokens, a splitting
of tokens or a sentence break). These pairs are then used to train the
probabilistic model, which is stored in a file under the "build"
directory.

In "tokenize" mode, the tokenizer relies on the presence of an already
trained model and uses it to classify every decision point in the input
file and output the tokenized and segmented text.

In "evaluate" mode, the tokenizer reads both the input and its annotation
as in "train" mode, but now it also queries the trained model for an
opinion and compares it with the one found in the annotated data. The
tokenizer outputs a log of every context together with both the predicted
and the correct outcome for later analysis. The "analyze" script provided
with trtok will let you read this output and determine the accuracy of
your system.

c) Different options

If you launch trtok with no command line arguments, you will get a summary
of all the supported command line options and their meanings. These
include options for setting the encoding of the input and output files,
options for controlling the output (preserving the original tokenization,
segmentation or paragraph division), options for preprocessing the input
(whether entities are to be expanded for the duration of the tokenization
and whether they are to be kept expanded in the output; whether XML should
be hidden from tokenization), options for logging the contexts and
outcomes to a third file, and others.
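Putting the modes together, a typical session might look like the sketch
below. The scheme name is taken from the earlier example; the data paths
and the .prep/.tok extensions are purely hypothetical:

  # 1. Let trtok perform every possible split/break so annotators only
  #    have to join things back together:
  trtok prepare en/simple/brown data/ca01.raw -r "|raw|prep|"

  # 2. After the prepared files have been hand-corrected and saved with
  #    a .txt extension, train the model:
  trtok train en/simple/brown data/ca01.raw -r "|raw|txt|"

  # 3. Check the trained model against annotated data it has not seen:
  trtok evaluate en/simple/brown data/ca02.raw -r "|raw|txt|"

  # 4. Tokenize new text with the trained model:
  trtok tokenize en/simple/brown data/new.raw -r "|raw|tok|"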