fast-langdetect is an ultra-fast and highly accurate language detection library based on FastText, a library developed by Facebook. It is 80x faster than conventional methods and delivers up to 95% accuracy.
- Supports Python 3.9 to 3.13.
- Works offline with the lite model.
- No `numpy` required (thanks to @dalf).
This project builds upon zafercavdar/fasttext-langdetect with enhancements in packaging. For more information about the underlying model, see the official FastText documentation: Language Identification.
The lite model runs offline and is memory-friendly; the full model is larger and offers higher accuracy.
Approximate memory usage (RSS after load):
- Lite: ~45–60 MB
- Full: ~170–210 MB
- Auto: tries full first, falls back to lite only on `MemoryError`.
Notes:
- Measurements vary by Python version, OS, allocator, and import graph; treat these as practical ranges.
- Validate on your system if constrained; see `examples/memory_usage_check.py` (credit: script by github@JackyHe398); a minimal inline version is sketched below.
- Run memory checks in a clean terminal session. IDEs/REPLs may preload frameworks and inflate peak RSS (`ru_maxrss`), leading to very large peaks with near-zero deltas.
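As a rough stand-in for the bundled script, the sketch below reads peak RSS via the stdlib `resource` module before and after the first detection. The unit handling is the usual POSIX caveat, not anything specific to this library:

```python
# Minimal peak-RSS check (POSIX only). ru_maxrss is reported in
# kilobytes on Linux and in bytes on macOS; adjust accordingly.
import resource

def peak_rss_mb() -> float:
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return rss / 1024  # Linux: KB -> MB; on macOS use rss / (1024 ** 2)

before = peak_rss_mb()

from fast_langdetect import detect
detect("warm up the lite model", model="lite", k=1)  # triggers model load

print(f"peak RSS delta after load: {peak_rss_mb() - before:.1f} MB")
```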
Choose the model that best fits your constraints.
To install fast-langdetect, you can use either pip or pdm:
```bash
pip install fast-langdetect
# or
pdm add fast-langdetect
```

For higher accuracy, prefer the full model via `detect(text, model='full')`. For robust behavior under memory pressure, use `detect(text, model='auto')`, which falls back to the lite model only on `MemoryError`.
- If the sample is too long (generally over 80 characters) or too short, accuracy will be reduced.
- The model downloads to the system temporary directory by default. You can customize the location by:
  - Setting the `FTLANG_CACHE` environment variable
  - Using `LangDetectConfig(cache_dir="your/path")`
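For example, the environment-variable route can be set from Python before the first model load (the path below is illustrative):

```python
import os

# Illustrative path; set the variable before the first model load.
os.environ["FTLANG_CACHE"] = "/data/ftlang-cache"

from fast_langdetect import detect
print(detect("Bonjour tout le monde", model="lite", k=1))
```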
```python
from fast_langdetect import detect

print(detect("Hello, world!", model="auto", k=1))
print(detect("Hello 世界 こんにちは", model="auto", k=3))
```

`detect` always returns a list of candidates ordered by score. Use `model="full"` for the best accuracy or `model="lite"` for an offline-only workflow.
```python
from fast_langdetect import LangDetectConfig, LangDetector

config = LangDetectConfig(cache_dir="/custom/cache", model="lite")
detector = LangDetector(config)
print(detector.detect("Bonjour", k=1))
print(detector.detect("Hola", model="full", k=1))
```

Each `LangDetector` instance maintains its own in-memory model cache. Once loaded, models are reused for subsequent calls within the same instance. The global `detect()` function uses a shared default detector, so it also benefits from automatic caching.
Create a custom LangDetector instance when you need specific configuration (custom cache directory, input limits, etc.) or isolated model management.
Keep It Simple!
- Only `MemoryError` triggers fallback (in `model='auto'`): when loading the full model runs out of memory, it falls back to the lite model.
- I/O, network, permission, path, and integrity errors raise standard exceptions (e.g., `FileNotFoundError`, `PermissionError`) or library-specific errors where applicable; no silent fallback.
- `model='lite'` and `model='full'` never fall back, by design.
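If you pin `model='full'` but still want graceful degradation, you can reproduce the behavior by hand; this sketch simply mirrors what `model='auto'` does, it is not a separate library feature:

```python
from fast_langdetect import detect

def detect_or_degrade(text: str, k: int = 1):
    try:
        return detect(text, model="full", k=k)
    except MemoryError:
        # Same degradation model="auto" performs automatically.
        return detect(text, model="lite", k=k)

print(detect_or_degrade("Guten Morgen"))
```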
- Base error: `FastLangdetectError` (library-specific failures).
- Model loading failures: `ModelLoadError`.
- Standard Python exceptions (e.g., `ValueError`, `TypeError`, `FileNotFoundError`, `MemoryError`) propagate when they are not library-specific.
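Put together, a defensive call site might look like the following sketch; it assumes `FastLangdetectError` and `ModelLoadError` are importable from the package root:

```python
from fast_langdetect import detect, FastLangdetectError, ModelLoadError  # assumed exports

try:
    print(detect("Hello", model="full", k=1))
except ModelLoadError:
    # Catch the more specific loading error before the base error.
    print("model could not be fetched or loaded")
except FastLangdetectError:
    print("other library-specific failure")
# Standard exceptions (ValueError, FileNotFoundError, MemoryError, ...) propagate.
```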
For text splitting based on language, please refer to the split-lang repository.
You can control log verbosity and input normalization via `LangDetectConfig`:

```python
from fast_langdetect import LangDetectConfig, LangDetector

config = LangDetectConfig(max_input_length=200)
detector = LangDetector(config)
print(detector.detect("Some very long text..." * 5))
```

- Newlines are always replaced with spaces to avoid FastText errors (silent, no log).
- When truncation happens, a WARNING is logged because it may reduce accuracy.
- The default `max_input_length` is 80 characters (optimal for accuracy); increase it if you need longer samples, or set it to `None` to disable truncation entirely, as sketched below.
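A config that disables truncation entirely (accepting the accuracy caveat above) might look like:

```python
from fast_langdetect import LangDetectConfig, LangDetector

# max_input_length=None: no truncation; very long inputs may reduce accuracy.
detector = LangDetector(LangDetectConfig(max_input_length=None))
print(detector.detect("A deliberately long sample sentence. " * 10, k=1))
```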
- Default cache: if `cache_dir` is not set, models are stored under a system temp-based directory specified by `FTLANG_CACHE` or an internal default. This directory is created automatically when needed.
- User-provided cache_dir: if you set `LangDetectConfig(cache_dir=...)` to a path that does not exist, the library raises `FileNotFoundError` instead of silently creating or using another location. Create the directory yourself if that's intended, as shown below.
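A minimal sketch, assuming you want the cache under your home directory (the exact path is your choice):

```python
from pathlib import Path
from fast_langdetect import LangDetectConfig, LangDetector

# Create the directory first; a missing cache_dir raises FileNotFoundError.
cache = Path.home() / ".cache" / "fast-langdetect"  # example location
cache.mkdir(parents=True, exist_ok=True)

detector = LangDetector(LangDetectConfig(cache_dir=str(cache)))
print(detector.detect("Hallo Welt", k=1))
```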
The constructor exposes a few advanced knobs (`proxy`, `normalize_input`, `max_input_length`). These are rarely needed for typical usage and can be ignored. Prefer `detect(..., model=...)` unless you know you need them.
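If you do need them, they all go through the `LangDetectConfig` constructor. The values below are placeholders rather than recommendations, and the proxy presumably applies to model downloads:

```python
from fast_langdetect import LangDetectConfig, LangDetector

config = LangDetectConfig(
    proxy="http://127.0.0.1:8080",  # placeholder; assumed to apply to model downloads
    normalize_input=True,           # placeholder; see the normalization notes above
    max_input_length=80,            # the documented default
)
detector = LangDetector(config)
```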
fastText reports BCP-47 style tags such as `en`, `zh-cn`, `pt-br`, `yue`. The detector keeps those codes so you can decide how to display them. Choose the approach that fits your product:
- Small, fixed list? Maintain a hand-written mapping and fall back to the raw code for anything unexpected.
```python
FASTTEXT_DISPLAY_NAMES = {
    "en": "English",
    "zh": "Chinese",
    "zh-cn": "Chinese (China)",
    "zh-tw": "Chinese (Taiwan)",
    "pt": "Portuguese",
    "pt-br": "Portuguese (Brazil)",
    "yue": "Cantonese",
    "wuu": "Wu Chinese",
    "arz": "Egyptian Arabic",
    "ckb": "Central Kurdish",
    "kab": "Kabyle",
}

def code_to_display_name(code: str) -> str:
    return FASTTEXT_DISPLAY_NAMES.get(code.lower(), code)

print(code_to_display_name("pt-br"))  # Portuguese (Brazil)
print(code_to_display_name("de"))     # unknown -> raw code "de"
```

- Need coverage for all 176 fastText languages? Use a language database library that understands subtags and scripts. Two popular libraries are `langcodes` and `pycountry`.
```python
# pip install langcodes
from langcodes import Language

LANG_OVERRIDES = {
    "pt-br": "Portuguese (Brazil)",
    "zh-cn": "Chinese (China)",
    "zh-tw": "Chinese (Taiwan)",
    "yue": "Cantonese",
}

def fasttext_to_name(code: str) -> str:
    normalized = code.replace("_", "-").lower()
    if normalized in LANG_OVERRIDES:
        return LANG_OVERRIDES[normalized]
    try:
        return Language.get(normalized).display_name("en")
    except Exception:
        # Fall back to the base subtag, then to the raw code.
        base = normalized.split("-")[0]
        try:
            return Language.get(base).display_name("en")
        except Exception:
            return code

from fast_langdetect import detect

result = detect("Olá mundo", model="full", k=1)
print(fasttext_to_name(result[0]["lang"]))  # e.g. "Portuguese"
```

`pycountry` works similarly (`pip install pycountry`). Use `pycountry.languages.lookup("pt")` for fuzzy matching or `pycountry.languages.get(alpha_2="pt")` for exact lookups, and pair it with a small override dictionary for non-standard tags such as `pt-br`, `zh-cn`, or dialect codes like `yue`.
```python
# pip install pycountry
import pycountry

FASTTEXT_OVERRIDES = {
    "pt-br": "Portuguese (Brazil)",
    "zh-cn": "Chinese (China)",
    "zh-tw": "Chinese (Taiwan)",
    "yue": "Cantonese",
}

def fasttext_to_name_pycountry(code: str) -> str:
    normalized = code.replace("_", "-").lower()
    if normalized in FASTTEXT_OVERRIDES:
        return FASTTEXT_OVERRIDES[normalized]
    try:
        return pycountry.languages.lookup(normalized).name
    except LookupError:
        # Fall back to the base subtag, then to the raw code.
        base = normalized.split("-")[0]
        try:
            return pycountry.languages.lookup(base).name
        except LookupError:
            return code

from fast_langdetect import detect

result = detect("Olá mundo", model="full", k=1)
print(fasttext_to_name_pycountry(result[0]["lang"]))  # e.g. "Portuguese"
```

To load a specific model file directly, point `LangDetectConfig` at it via `custom_model_path`; the example below reuses the lite model bundled with the package:

```python
from importlib import resources
from fast_langdetect import LangDetectConfig, LangDetector

with resources.path("fast_langdetect.resources", "lid.176.ftz") as model_path:
    config = LangDetectConfig(custom_model_path=str(model_path))
    detector = LangDetector(config)
    print(detector.detect("Hello world", k=1))
```

When using a custom model via `custom_model_path`, the `model` parameter in `detect()` calls is ignored, since your custom model file is always loaded directly. The `model="lite"`, `model="full"`, and `model="auto"` parameters only apply when using the built-in models.
For detailed benchmark results, refer to zafercavdar/fasttext-langdetect#benchmark.
[1] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification
```bibtex
@article{joulin2016bag,
  title={Bag of Tricks for Efficient Text Classification},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.01759},
  year={2016}
}
```

[2] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, FastText.zip: Compressing text classification models
```bibtex
@article{joulin2016fasttext,
  title={FastText.zip: Compressing text classification models},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Douze, Matthijs and J{\'e}gou, Herv{\'e} and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1612.03651},
  year={2016}
}
```

- Code: Released under the MIT License (see `LICENSE`).
- Models: This package uses the pre-trained fastText language identification models (`lid.176.ftz`, bundled for offline use, and `lid.176.bin`, downloaded as needed). These models are licensed under the Creative Commons Attribution-ShareAlike 3.0 (CC BY-SA 3.0) license.
- Attribution: fastText language identification models by Facebook AI Research. See the fastText docs and license for details.
- Note: If you redistribute or modify the model files, you must comply with CC BY-SA 3.0. Inference usage via this library does not change the license of the model files themselves.