fast-langdetect 🚀

Overview

fast-langdetect is an ultra-fast and highly accurate language detection library based on FastText, a library developed by Facebook. It is up to 80x faster than conventional methods and delivers up to 95% accuracy.

  • Supports Python 3.9 to 3.13.
  • Works offline in low-memory mode.
  • No numpy required (thanks to @dalf).

Background

This project builds on zafercavdar/fasttext-langdetect, with packaging enhancements. For more information about the underlying model, see the official FastText documentation: Language Identification.

Memory usage

This library requires at least 200 MB of memory, even in low-memory mode.

Installation 💻

To install fast-langdetect, you can use either pip or pdm:

Using pip

pip install fast-langdetect

Using pdm

pdm add fast-langdetect

Usage 🖥️

In scenarios where accuracy matters, do not rely on the small model's detection results; pass low_memory=False to download and use the larger model.
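
For example, a minimal sketch contrasting the two modes (assuming low_memory=True is the default, as the examples below suggest; the sample sentence is illustrative):

from fast_langdetect import detect

# Small compressed model (the default, per the note above): low memory, lower accuracy
print(detect("Bonjour le monde", low_memory=True))

# Full model: downloaded on first use; more memory, better accuracy
print(detect("Bonjour le monde", low_memory=False))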

Prerequisites

  • If the sample is too long or too short, accuracy will be reduced.
  • The model is downloaded to the system temporary directory by default. You can customize the location by:
    • Setting the FTLANG_CACHE environment variable (see the sketch after this list)
    • Using LangDetectConfig(cache_dir="your/path")
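
For example, a minimal sketch that redirects the model cache through the environment variable (the path is illustrative; set it before the library first downloads a model):

import os

# Illustrative cache location; any writable directory works
os.environ["FTLANG_CACHE"] = "/custom/cache/path"

from fast_langdetect import detect
print(detect("Hello, world!"))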

Native API (Recommended)

from fast_langdetect import detect, detect_multilingual, LangDetector, LangDetectConfig, DetectError

# Simple detection
print(detect("Hello, world!"))
# Output: {'lang': 'en', 'score': 0.12450417876243591}

# Using large model for better accuracy
print(detect("Hello, world!", low_memory=False))
# Output: {'lang': 'en', 'score': 0.98765432109876}

# Custom configuration with fallback mechanism
config = LangDetectConfig(
    cache_dir="/custom/cache/path",  # Custom model cache directory
    allow_fallback=True             # Enable fallback to small model if large model fails
)
detector = LangDetector(config)

try:
    result = detector.detect("Hello world", low_memory=False)
    print(result)  # {'lang': 'en', 'score': 0.98}
except DetectError as e:
    print(f"Detection failed: {e}")

# Handling multiline text: the model expects single-line input,
# so replace newlines before detection
multiline_text = """
Hello, world!
This is a multiline text.
"""
multiline_text = multiline_text.replace("\n", " ")
print(detect(multiline_text))
# Output: {'lang': 'en', 'score': 0.8509423136711121}

# Multi-language detection
results = detect_multilingual(
    "Hello 世界 こんにちは", 
    low_memory=False,  # Use large model for better accuracy
    k=3               # Return top 3 languages
)
print(results)
# Output: [
#     {'lang': 'ja', 'score': 0.4}, 
#     {'lang': 'zh', 'score': 0.3}, 
#     {'lang': 'en', 'score': 0.2}
# ]
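
Since detect_multilingual returns a plain list of {'lang', 'score'} dicts, post-processing is ordinary Python; for example, a small sketch that keeps only candidates above a confidence threshold:

from fast_langdetect import detect_multilingual

results = detect_multilingual("Hello 世界 こんにちは", low_memory=False, k=3)
# 0.15 is an arbitrary illustrative cutoff, not a library default
confident = [r for r in results if r["score"] >= 0.15]
for r in confident:
    print(f"{r['lang']}: {r['score']:.2f}")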

Fallbacks

We provide a fallback mechanism: when allow_fallback=True, if the large model (low_memory=False) fails to load, the library falls back to the offline small model to complete the prediction.

# Disable fallback - an error is raised if the large model fails to load
# Note: fallback is also disabled when custom_model_path is set, since a
# custom model is used directly.
import tempfile

config = LangDetectConfig(
    allow_fallback=False,
    custom_model_path=None,
    cache_dir=tempfile.gettempdir(),
)
detector = LangDetector(config)

try:
    result = detector.detect("Hello world", low_memory=False)
except DetectError as e:
    print("Model loading failed and fallback is disabled")

Convenient detect_language Function

from fast_langdetect import detect_language

# Single language detection
print(detect_language("Hello, world!"))
# Output: EN

print(detect_language("Привет, мир!"))
# Output: RU

print(detect_language("你好,世界!"))
# Output: ZH

Load Custom Models

# Load model from local file
config = LangDetectConfig(
    custom_model_path="/path/to/your/model.bin",  # Use a local model file
    disable_verify=True,                          # Skip MD5 verification
)
detector = LangDetector(config)
result = detector.detect("Hello world")

Splitting Text by Language 🌐

For text splitting based on language, please refer to the split-lang repository.

Benchmark 📊

For detailed benchmark results, refer to zafercavdar/fasttext-langdetect#benchmark.

References 📚

[1] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification

@article{joulin2016bag,
  title={Bag of Tricks for Efficient Text Classification},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.01759},
  year={2016}
}

[2] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, FastText.zip: Compressing text classification models

@article{joulin2016fasttext,
  title={FastText.zip: Compressing text classification models},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Douze, Matthijs and J{\'e}gou, Herv{\'e} and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1612.03651},
  year={2016}
}