RoDmitry/langram

Langram - the most accurate language detection library

Crate API

Detects 317 ScriptLanguages (187 models + 130 single-language scripts).

A language that can be written in multiple scripts is detected as a separate ScriptLanguage (language + script) for each script.
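To illustrate the language + script pairing, here is a minimal sketch. The type and variant names (`Language`, `Script`, `LanguageInScript`) are hypothetical and chosen only for this example; they are not the crate's actual API.

```rust
// Hypothetical types for illustration only; not the crate's real definitions.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum Language {
    Serbian,
    English,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum Script {
    Cyrillic,
    Latin,
}

/// A detection target: one language written in one particular script.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct LanguageInScript {
    language: Language,
    script: Script,
}

/// Two targets may share a language while still being distinct targets.
fn same_language(a: LanguageInScript, b: LanguageInScript) -> bool {
    a.language == b.language
}
```

Under this sketch, Serbian in Cyrillic and Serbian in Latin compare as unequal values even though `same_language` returns `true` for them.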

Uses alphabet_detector as a word separator + language prefilter.

Based on a modified n-gram language model algorithm that uses character n-grams (1 to 5 characters) and single-word n-grams.
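As a rough illustration of what character n-grams of sizes 1 to 5 look like, the sketch below enumerates them for a single word. The function name `char_ngrams` is hypothetical, and this shows only n-gram extraction, not the library's scoring or its modified algorithm.

```rust
/// Collects every character n-gram of length 1 up to `max_n` from one word.
/// Operates on Unicode scalar values (`char`), not on bytes.
/// Hypothetical helper for illustration; not part of the crate's API.
fn char_ngrams(word: &str, max_n: usize) -> Vec<String> {
    let chars: Vec<char> = word.chars().collect();
    let mut ngrams = Vec::new();
    // Never ask for n-grams longer than the word itself.
    for n in 1..=max_n.min(chars.len()) {
        for window in chars.windows(n) {
            ngrams.push(window.iter().collect());
        }
    }
    ngrams
}
```

For the word `abc` with `max_n = 5`, this yields `a`, `b`, `c`, `ab`, `bc`, `abc`, in that order.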

RAM requirements are low. Memory usage may grow up to the size of the provided models binary file, but this memory is shared (virtual address space, mmap), so that amount of RAM does not need to be available. However, if the whole models file cannot be cached in RAM, detection speed will be affected.

This library is a complete rewrite of Lingua: much faster, more accurate, with more languages, etc.

It is also more accurate than Whatlang and Whichlang. More info in the Comparison with other language detectors.

To better understand the accuracy of different modes, look into the Accuracy report.

Setup

To use this library, you need a binary models file. It must either be placed near the executable or have its location set via LANGRAM_MODELS_PATH.

It can be:

  • Downloaded from langram_models releases;

  • Built from langram_models (recommended for big-endian targets). This option is more advanced: it lets you remove model n-grams and recompile, so that the models binary is lighter.
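The lookup order described in this section can be sketched as follows. This is an illustration under stated assumptions, not the library's actual loading code: it assumes LANGRAM_MODELS_PATH is an environment variable read at run time that holds the full path to the file, and the file name `langram_models.bin` and both function names are hypothetical.

```rust
use std::env;
use std::path::PathBuf;

/// Hypothetical file name; the real models file may be named differently.
const MODELS_FILE_NAME: &str = "langram_models.bin";

/// Prefers an explicit override, otherwise falls back to a file
/// sitting in the same directory as the executable.
fn resolve_models_path(
    env_override: Option<PathBuf>,
    exe_dir: Option<PathBuf>,
) -> Option<PathBuf> {
    env_override.or_else(|| exe_dir.map(|dir| dir.join(MODELS_FILE_NAME)))
}

/// Gathers the inputs from the process environment.
/// Assumes LANGRAM_MODELS_PATH holds the full path to the models file.
fn models_path_from_environment() -> Option<PathBuf> {
    let env_override = env::var_os("LANGRAM_MODELS_PATH").map(PathBuf::from);
    let exe_dir = env::current_exe()
        .ok()
        .and_then(|exe| exe.parent().map(|dir| dir.to_path_buf()));
    resolve_models_path(env_override, exe_dir)
}
```

Keeping the decision in `resolve_models_path` separate from the environment lookup makes the precedence easy to check without touching process state.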

License

Apache-2.0 (LICENSE-APACHE), MIT (LICENSE-MIT)
