A comprehensive Natural Language Processing library for Elixir, providing tokenization, stemming, ranking algorithms, similarity metrics, and text analysis tools. Inspired by Python's NLTK and designed with idiomatic Elixir patterns.
## Features
- 🔤 Tokenization: Multiple tokenizers (standard, whitespace, regex, n-gram, keyword) with NLTK-inspired API
- ✂️ Stemming: Snowball stemmers for 7 languages (English, Spanish, Portuguese, French, German, Italian, Polish)
- 📊 Ranking Algorithms: TF-IDF and BM25 implementations for document ranking and search
- 🔍 Similarity Metrics: Levenshtein, Jaccard, Dice, Jaro-Winkler, Hamming distance, and more
- 🚫 Stopwords: Built-in stopword lists for 30+ languages
- 🔧 Text Filtering: Case conversion, length filtering, pattern replacement, and stopword removal
- 📈 Statistics: Term frequency, document frequency, corpus-level statistics
- 🔗 Co-occurrence Analysis: Term co-occurrence matrices and analysis
- 📝 N-grams: Character and word n-gram generation
- 🎯 Idiomatic Elixir: Clean, functional code following Elixir best practices
## Installation

Add `ex_nlp` to your list of dependencies in `mix.exs`:
def deps do
  [
    {:ex_nlp, "~> 0.1.0"}
  ]
end

Then run `mix deps.get`.
## Usage

### Tokenization
# Simple word tokenization
iex> ExNlp.Tokenizer.word_tokenize("Hello, world!")
["Hello", "world"]
# Get tokens with position and offset information
iex> ExNlp.Tokenizer.tokenize("Hello, world!")
[
%ExNlp.Token{text: "Hello", position: 0, start_offset: 0, end_offset: 5},
%ExNlp.Token{text: "world", position: 1, start_offset: 7, end_offset: 12}
]
# Custom regex tokenizer
iex> ExNlp.Tokenizer.regexp_tokenize("abc123 def456", "\\d+")
["123", "456"]

### Stemming

# Stem words in multiple languages
iex> ExNlp.Snowball.stem("running", :english)
"run"
iex> ExNlp.Snowball.stem("caminando", :spanish)
"camin"
iex> ExNlp.Snowball.stem_words(["running", "jumping", "beautiful"], :english)
["run", "jump", "beauti"]
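Suffix-stripping stemmers like Porter2 apply ordered rewrite rules to word endings. The sketch below is a deliberately tiny illustration of that idea, not the library's implementation (the module name and rule list are invented for this example; real Porter2 has many more rule phases plus R1/R2 region checks). It includes the consonant-undoubling step that takes "running" to "run" rather than "runn":

```elixir
defmodule ToySuffixStemmer do
  # A drastically simplified, Porter-style suffix stripper.
  @suffixes ["ing", "ed", "s"]
  @vowels ["a", "e", "i", "o", "u"]

  def stem(word), do: word |> String.downcase() |> strip_suffix() |> undouble()

  # Remove the first matching suffix, keeping at least 3 letters of stem.
  defp strip_suffix(word) do
    case Enum.find(@suffixes, fn s ->
           String.ends_with?(word, s) and String.length(word) - String.length(s) >= 3
         end) do
      nil -> word
      s -> String.slice(word, 0, String.length(word) - String.length(s))
    end
  end

  # Porter-style undoubling: "runn" -> "run".
  defp undouble(word) do
    case Enum.reverse(String.graphemes(word)) do
      [c, c | _] ->
        if c in @vowels, do: word, else: String.slice(word, 0, String.length(word) - 1)

      _ ->
        word
    end
  end
end
```

This two-phase shape (strip, then repair the stem) is the core of every Snowball stemmer; the real ones just carry far more rules per phase.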
# Check supported languages
iex> ExNlp.Snowball.supported_languages()
[:english, :spanish, :portuguese, :french, :german, :italian, :polish]

### TF-IDF Ranking

# Calculate TF-IDF score for a term in a document
iex> documents = ["The quick brown fox", "A brown dog"]
iex> ExNlp.Ranking.TfIdf.calculate("fox", "The quick brown fox", documents)
0.5108256237659907
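For intuition, here is a minimal sketch of the classic TF-IDF formulation, tf(t, d) × ln(N / df(t)). The module name and tokenizer are invented for this example, and ExNlp's exact tf weighting and idf smoothing may differ, so the numbers will not necessarily match the library's output:

```elixir
defmodule TfIdfSketch do
  # Classic TF-IDF: raw term frequency times log inverse document frequency.
  def calculate(term, document, documents) do
    docs = Enum.map(documents, &tokenize/1)
    tf = document |> tokenize() |> Enum.count(&(&1 == term))
    df = Enum.count(docs, &(term in &1))

    if df == 0, do: 0.0, else: tf * :math.log(length(docs) / df)
  end

  # Lowercase and split on anything that is not a letter or digit.
  defp tokenize(text) do
    text |> String.downcase() |> String.split(~r/[^\p{L}\p{N}]+/u, trim: true)
  end
end
```

With this formulation, "fox" in the two-document corpus above scores 1 × ln(2) ≈ 0.693, while a term appearing in every document scores 0.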
# With preprocessing options
iex> ExNlp.Ranking.TfIdf.calculate("running", "The runner is running fast", documents,
...> stem: true, language: :english, remove_stopwords: true)
0.6931471805599453

### BM25 Ranking

# Score documents against a query
iex> documents = ["BM25 is a ranking function", "used by search engines"]
iex> ExNlp.Ranking.Bm25.score(documents, ["ranking", "search"])
[1.8455076734299591, 1.0126973514850315]
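In BM25, `k1` controls term-frequency saturation and `b` controls document-length normalization. Here is a minimal sketch using one common idf smoothing, ln(1 + (N − df + 0.5) / (df + 0.5)); the module name is invented, and the library's exact variant and defaults may differ:

```elixir
defmodule Bm25Sketch do
  # Okapi BM25 over whitespace-tokenized documents.
  def score(docs, query, k1 \\ 1.2, b \\ 0.75) do
    tokenized = Enum.map(docs, &String.split(String.downcase(&1)))
    n = length(tokenized)
    avgdl = Enum.sum(Enum.map(tokenized, &length/1)) / n

    Enum.map(tokenized, fn doc ->
      Enum.reduce(query, 0.0, fn term, acc ->
        f = Enum.count(doc, &(&1 == term))
        df = Enum.count(tokenized, &(term in &1))
        idf = :math.log(1 + (n - df + 0.5) / (df + 0.5))
        # Saturating tf component, normalized by document length.
        acc + idf * (f * (k1 + 1)) / (f + k1 * (1 - b + b * length(doc) / avgdl))
      end)
    end)
  end
end
```

A document containing none of the query terms always scores exactly 0.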
# Rank documents with custom parameters
iex> ExNlp.Ranking.Bm25.score(documents, ["ranking", "search"],
...> k1: 1.5, b: 0.75, stem: true, language: :english)
[1.923456, 1.123456]

### Similarity Metrics

# Levenshtein distance
iex> ExNlp.Similarity.levenshtein("kitten", "sitting")
3
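Levenshtein distance is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other. A self-contained dynamic-programming sketch (not the library's implementation; the module name is invented):

```elixir
defmodule LevSketch do
  # Row i of the DP table holds the distances between the first i
  # graphemes of `a` and every prefix of `b`.
  def distance(a, b) do
    bs = String.graphemes(b)
    # First row: distance from "" to each prefix of b.
    first = Enum.to_list(0..length(bs))

    String.graphemes(a)
    |> Enum.with_index(1)
    |> Enum.reduce(first, fn {ca, i}, prev -> next_row(ca, i, bs, prev) end)
    |> List.last()
  end

  defp next_row(ca, i, bs, prev) do
    {row, _last} =
      bs
      |> Enum.with_index()
      |> Enum.reduce({[i], i}, fn {cb, j}, {row, left} ->
        diag = Enum.at(prev, j)
        up = Enum.at(prev, j + 1)
        cost = if ca == cb, do: 0, else: 1
        cell = Enum.min([diag + cost, left + 1, up + 1])
        {row ++ [cell], cell}
      end)

    row
  end
end
```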
# Levenshtein similarity (normalized)
iex> ExNlp.Similarity.levenshtein_similarity("kitten", "sitting")
0.5714285714285714
# Jaccard similarity
iex> ExNlp.Similarity.jaccard(["cat", "dog"], ["cat", "bird"])
0.3333333333333333
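Jaccard and the closely related Dice coefficient are plain set arithmetic: |A ∩ B| / |A ∪ B| and 2·|A ∩ B| / (|A| + |B|) respectively. A sketch with `MapSet` (module name invented for this example):

```elixir
defmodule SetSimSketch do
  # Jaccard: intersection size over union size.
  def jaccard(a, b) do
    {s1, s2} = {MapSet.new(a), MapSet.new(b)}
    MapSet.size(MapSet.intersection(s1, s2)) / MapSet.size(MapSet.union(s1, s2))
  end

  # Dice: twice the intersection size over the sum of set sizes.
  def dice(a, b) do
    {s1, s2} = {MapSet.new(a), MapSet.new(b)}
    2 * MapSet.size(MapSet.intersection(s1, s2)) / (MapSet.size(s1) + MapSet.size(s2))
  end
end
```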
# Jaro-Winkler similarity
iex> ExNlp.Similarity.jaro_winkler_similarity("martha", "marhta")
0.9611111111111111
# Dice coefficient
iex> ExNlp.Similarity.dice_coefficient(["cat", "dog"], ["cat", "bird"])
0.5

### Stopwords

# Check if a word is a stopword
iex> ExNlp.Stopwords.is_stopword?("the", :english)
true
# Remove stopwords from a list
iex> words = ["the", "quick", "brown", "fox"]
iex> ExNlp.Stopwords.remove(words, :english)
["quick", "brown", "fox"]
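Stopword removal is just set-membership filtering; a `MapSet` gives constant-time lookups. The stopword list below is a tiny illustrative sample, not the library's bundled English list:

```elixir
# Build a small stopword set (sample only).
stopwords = MapSet.new(["the", "a", "an", "is", "on", "of"])

# Reject any token found in the set.
remove = fn words, stops -> Enum.reject(words, &MapSet.member?(stops, &1)) end

remove.(["the", "quick", "brown", "fox"], stopwords)
# => ["quick", "brown", "fox"]
```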
# Get list of stopwords
iex> ExNlp.Stopwords.list(:english)
["a", "all", "and", "as", "at", ...]

### Token Filtering

# Build a filtering pipeline
iex> tokens = ExNlp.Tokenizer.tokenize("The Quick Brown Fox")
iex> tokens
...> |> ExNlp.Filter.lowercase()
...> |> ExNlp.Filter.stop_words(:english)
...> |> ExNlp.Filter.min_length(3)
[
%ExNlp.Token{text: "quick", ...},
%ExNlp.Token{text: "brown", ...},
%ExNlp.Token{text: "fox", ...}
]

### N-grams

# Character n-grams
iex> ExNlp.Ngram.char_ngrams("hello", 2)
["he", "el", "ll", "lo"]
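Both kinds of n-gram are a sliding window, which `Enum.chunk_every/4` with a step of 1 expresses directly. A sketch (helper names invented for this example):

```elixir
# Character n-grams: slide a window of n graphemes, step 1, dropping
# the short tail chunk.
char_ngrams = fn string, n ->
  string
  |> String.graphemes()
  |> Enum.chunk_every(n, 1, :discard)
  |> Enum.map(&Enum.join/1)
end

# Word n-grams: the same window over a token list.
word_ngrams = fn words, n -> Enum.chunk_every(words, n, 1, :discard) end

char_ngrams.("hello", 2)
# => ["he", "el", "ll", "lo"]
```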
# Word n-grams
iex> ExNlp.Ngram.word_ngrams(["the", "quick", "brown", "fox"], 2)
[["the", "quick"], ["quick", "brown"], ["brown", "fox"]]

### Statistics

# Term frequency in a document
iex> ExNlp.Statistics.term_frequency("cat", ["the", "cat", "sat", "on", "the", "mat"])
1
# Document frequency in a corpus
iex> corpus = [["cat", "dog"], ["cat", "bird"], ["dog", "fish"]]
iex> ExNlp.Statistics.document_frequency("cat", corpus)
2
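These corpus statistics reduce to simple counting, with `Enum.frequencies/1` doing most of the corpus-level work. A from-scratch sketch (helper names invented for this example):

```elixir
corpus = [["cat", "dog"], ["cat", "bird"], ["dog", "fish"]]

# Document frequency: in how many documents does the term appear?
doc_freq = fn term, docs -> Enum.count(docs, &(term in &1)) end

# Most frequent terms across the whole corpus, as {term, count} pairs.
most_frequent = fn docs, k ->
  docs
  |> List.flatten()
  |> Enum.frequencies()
  |> Enum.sort_by(fn {_term, count} -> -count end)
  |> Enum.take(k)
end

doc_freq.("cat", corpus)
# => 2
```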
# Most frequent terms
iex> corpus = [["cat", "dog"], ["cat", "cat"], ["dog"]]
iex> ExNlp.Statistics.most_frequent(corpus, 2)
[{"cat", 3}, {"dog", 2}]

### Co-occurrence Analysis

# Build co-occurrence matrix
iex> corpus = [["cat", "dog"], ["cat", "bird", "dog"], ["bird"]]
iex> matrix = ExNlp.Cooccurrence.cooccurrence_matrix(corpus)
iex> matrix["cat"]["dog"]
2
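A co-occurrence matrix counts, for each pair of distinct terms, how many documents contain both. A sketch using a nested map (module name invented for this example):

```elixir
defmodule CooccurSketch do
  # For each document, count every ordered pair of distinct unique terms.
  # The result is symmetric: matrix[a][b] == matrix[b][a].
  def matrix(corpus) do
    for doc <- corpus,
        terms = Enum.uniq(doc),
        a <- terms,
        b <- terms,
        a != b,
        reduce: %{} do
      acc -> update_in(acc, [Access.key(a, %{}), Access.key(b, 0)], &(&1 + 1))
    end
  end
end
```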
# Find co-occurring terms
iex> ExNlp.Cooccurrence.cooccurring_terms("cat", corpus, 2)
[{"dog", 2}, {"bird", 1}]

## Supported Stemming Languages

- English - Porter2 algorithm (Porter stemmer v2)
- Spanish - Spanish stemmer
- Portuguese - Portuguese stemmer
- French - French stemmer
- German - German stemmer
- Italian - Italian stemmer
- Polish - Polish stemmer
## Stopword Languages

Stopword lists are available for 30+ languages, including English, Spanish, Portuguese, French, German, Italian, Polish, Russian, Dutch, Swedish, Norwegian, Danish, Finnish, Turkish, Arabic, Chinese, and more. See `priv/stopwords/` for the complete list.
## Module Overview

The library is organized into logical modules:

- `ExNlp.Tokenizer` - Text tokenization with multiple strategies
- `ExNlp.Snowball` - Word stemming algorithms
- `ExNlp.Ranking` - Document ranking (TF-IDF, BM25)
- `ExNlp.Similarity` - String and set similarity metrics
- `ExNlp.Stopwords` - Stopword detection and filtering
- `ExNlp.Filter` - Token filtering and transformation
- `ExNlp.Statistics` - Text and corpus statistics
- `ExNlp.Cooccurrence` - Term co-occurrence analysis
- `ExNlp.Ngram` - N-gram generation
## Benchmarks

The library includes benchmark suites for critical operations. Run them with:
mix run benchmarks/tokenizer_bench.exs
mix run benchmarks/similarity_bench.exs
mix run benchmarks/ranking_bench.exs

## Testing

Run the test suite with:

mix test

## Documentation

Generate documentation with:

mix docs

## Contributing

Contributions are welcome! This library aims to be a comprehensive NLP toolkit for Elixir. Areas for contribution:
- Additional language support for stemming
- More stopword lists
- Additional similarity metrics
- Performance optimizations
- Documentation improvements
## Acknowledgments
- Stemming implementations based on the Snowball stemming algorithms
- Inspired by Python's NLTK and spaCy
- Stopword lists compiled from various open sources
## License
MIT License - see LICENSE file for details.