ExNLP

A comprehensive Natural Language Processing library for Elixir, providing tokenization, stemming, ranking algorithms, similarity metrics, and text analysis tools. Inspired by Python's NLTK and designed with idiomatic Elixir patterns.

Features

  • 🔤 Tokenization: Multiple tokenizers (standard, whitespace, regex, n-gram, keyword) with NLTK-inspired API
  • ✂️ Stemming: Snowball stemmers for 7 languages (English, Spanish, Portuguese, French, German, Italian, Polish)
  • 📊 Ranking Algorithms: TF-IDF and BM25 implementations for document ranking and search
  • 🔍 Similarity Metrics: Levenshtein, Jaccard, Dice, Jaro-Winkler, Hamming distance, and more
  • 🚫 Stopwords: Built-in stopword lists for 30+ languages
  • 🔧 Text Filtering: Case conversion, length filtering, pattern replacement, and stopword removal
  • 📈 Statistics: Term frequency, document frequency, corpus-level statistics
  • 🔗 Co-occurrence Analysis: Term co-occurrence matrices and analysis
  • 📝 N-grams: Character and word n-gram generation
  • 🎯 Idiomatic Elixir: Clean, functional code following Elixir best practices

Installation

Add ex_nlp to your list of dependencies in mix.exs:

def deps do
  [
    {:ex_nlp, "~> 0.1.0"}
  ]
end

Then run mix deps.get.

Quick Start

Tokenization

# Simple word tokenization
iex> ExNlp.Tokenizer.word_tokenize("Hello, world!")
["Hello", "world"]

# Get tokens with position and offset information
iex> ExNlp.Tokenizer.tokenize("Hello, world!")
[
  %ExNlp.Token{text: "Hello", position: 0, start_offset: 0, end_offset: 5},
  %ExNlp.Token{text: "world", position: 1, start_offset: 7, end_offset: 12}
]

# Custom regex tokenizer
iex> ExNlp.Tokenizer.regexp_tokenize("abc123 def456", "\\d+")
["123", "456"]

Stemming

# Stem words in multiple languages
iex> ExNlp.Snowball.stem("running", :english)
"run"

iex> ExNlp.Snowball.stem("caminando", :spanish)
"camin"

iex> ExNlp.Snowball.stem_words(["running", "jumping", "beautiful"], :english)
["run", "jump", "beauti"]

# Check supported languages
iex> ExNlp.Snowball.supported_languages()
[:english, :spanish, :portuguese, :french, :german, :italian, :polish]

Ranking Algorithms

TF-IDF

# Calculate TF-IDF score for a term in a document
iex> documents = ["The quick brown fox", "A brown dog"]
iex> ExNlp.Ranking.TfIdf.calculate("fox", "The quick brown fox", documents)
0.5108256237659907

# With preprocessing options
iex> ExNlp.Ranking.TfIdf.calculate("running", "The runner is running fast", documents,
...>   stem: true, language: :english, remove_stopwords: true)
0.6931471805599453

BM25

# Score documents against a query
iex> documents = ["BM25 is a ranking function", "used by search engines"]
iex> ExNlp.Ranking.Bm25.score(documents, ["ranking", "search"])
[1.8455076734299591, 1.0126973514850315]

# Rank documents with custom parameters
iex> ExNlp.Ranking.Bm25.score(documents, ["ranking", "search"], 
...>   k1: 1.5, b: 0.75, stem: true, language: :english)
[1.923456, 1.123456]
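
For reference, the standard Okapi BM25 per-term score — term frequency saturated by k1 and normalized by document length via b — can be sketched in plain Elixir. This is an illustrative textbook formula, not the library's code; the library's exact IDF variant may differ, so the scores above are not reproduced here.

```elixir
defmodule Bm25Sketch do
  @moduledoc "Textbook Okapi BM25 per-term score (illustrative only)."

  # tf: term frequency in the document; doc_len / avg_len: length normalization;
  # n_docs / doc_freq: corpus statistics feeding the IDF component.
  def term_score(tf, doc_len, avg_len, n_docs, doc_freq, k1 \\ 1.2, b \\ 0.75) do
    idf = :math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
  end
end

# Score grows with tf but saturates, and longer-than-average documents
# are penalized through the b term.
Bm25Sketch.term_score(1, 5, 4.5, 2, 1)
```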

Similarity Metrics

# Levenshtein distance
iex> ExNlp.Similarity.levenshtein("kitten", "sitting")
3

# Levenshtein similarity (normalized)
iex> ExNlp.Similarity.levenshtein_similarity("kitten", "sitting")
0.5714285714285714

# Jaccard similarity
iex> ExNlp.Similarity.jaccard(["cat", "dog"], ["cat", "bird"])
0.3333333333333333

# Jaro-Winkler similarity
iex> ExNlp.Similarity.jaro_winkler_similarity("martha", "marhta")
0.9611111111111111

# Dice coefficient
iex> ExNlp.Similarity.dice_coefficient(["cat", "dog"], ["cat", "bird"])
0.5
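
For intuition, the set-based metrics (Jaccard, Dice) reduce to simple MapSet arithmetic. A plain-Elixir sketch of the two results above, not the library's implementation:

```elixir
a = MapSet.new(["cat", "dog"])
b = MapSet.new(["cat", "bird"])

inter = MapSet.size(MapSet.intersection(a, b))  # 1 shared element: "cat"
union = MapSet.size(MapSet.union(a, b))         # 3 distinct elements

jaccard = inter / union                               # 1/3 ≈ 0.3333
dice = 2 * inter / (MapSet.size(a) + MapSet.size(b))  # 2/4 = 0.5
```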

Stopwords

# Check if a word is a stopword
iex> ExNlp.Stopwords.is_stopword?("the", :english)
true

# Remove stopwords from a list
iex> words = ["the", "quick", "brown", "fox"]
iex> ExNlp.Stopwords.remove(words, :english)
["quick", "brown", "fox"]

# Get list of stopwords
iex> ExNlp.Stopwords.list(:english)
["a", "all", "and", "as", "at", ...]

Text Filtering

# Build a filtering pipeline
iex> tokens = ExNlp.Tokenizer.tokenize("The Quick Brown Fox")
iex> tokens
...> |> ExNlp.Filter.lowercase()
...> |> ExNlp.Filter.stop_words(:english)
...> |> ExNlp.Filter.min_length(3)
[
  %ExNlp.Token{text: "quick", ...},
  %ExNlp.Token{text: "brown", ...},
  %ExNlp.Token{text: "fox", ...}
]

N-grams

# Character n-grams
iex> ExNlp.Ngram.char_ngrams("hello", 2)
["he", "el", "ll", "lo"]

# Word n-grams
iex> ExNlp.Ngram.word_ngrams(["the", "quick", "brown", "fox"], 2)
[["the", "quick"], ["quick", "brown"], ["brown", "fox"]]
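
Both forms reduce to a sliding window. Conceptually, in plain Elixir via Enum.chunk_every/4 (a sketch, not necessarily the library's implementation):

```elixir
# Character bigrams: split into graphemes, slide a window of size 2 with
# step 1, discard the short tail, and re-join each window.
"hello"
|> String.graphemes()
|> Enum.chunk_every(2, 1, :discard)
|> Enum.map(&Enum.join/1)
# → ["he", "el", "ll", "lo"]

# Word bigrams: the same window over a token list.
Enum.chunk_every(["the", "quick", "brown", "fox"], 2, 1, :discard)
# → [["the", "quick"], ["quick", "brown"], ["brown", "fox"]]
```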

Statistics

# Term frequency in a document
iex> ExNlp.Statistics.term_frequency("cat", ["the", "cat", "sat", "on", "the", "mat"])
1

# Document frequency in a corpus
iex> corpus = [["cat", "dog"], ["cat", "bird"], ["dog", "fish"]]
iex> ExNlp.Statistics.document_frequency("cat", corpus)
2

# Most frequent terms
iex> corpus = [["cat", "dog"], ["cat", "cat"], ["dog"]]
iex> ExNlp.Statistics.most_frequent(corpus, 2)
[{"cat", 3}, {"dog", 2}]
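
Conceptually, most_frequent is a flatten-count-sort over the corpus. In plain Elixir (a sketch, not the library's code):

```elixir
corpus = [["cat", "dog"], ["cat", "cat"], ["dog"]]

top2 =
  corpus
  |> List.flatten()                           # all tokens across documents
  |> Enum.frequencies()                       # %{"cat" => 3, "dog" => 2}
  |> Enum.sort_by(fn {_term, n} -> n end, :desc)
  |> Enum.take(2)

# → [{"cat", 3}, {"dog", 2}]
```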

Co-occurrence Analysis

# Build co-occurrence matrix
iex> corpus = [["cat", "dog"], ["cat", "bird", "dog"], ["bird"]]
iex> matrix = ExNlp.Cooccurrence.cooccurrence_matrix(corpus)
iex> matrix["cat"]["dog"]
2

# Find co-occurring terms
iex> ExNlp.Cooccurrence.cooccurring_terms("cat", corpus, 2)
[{"dog", 2}, {"bird", 1}]
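
The counts above can be reproduced conceptually with a comprehension over ordered pairs of distinct terms within each document. A plain-Elixir sketch, not the library's implementation:

```elixir
corpus = [["cat", "dog"], ["cat", "bird", "dog"], ["bird"]]

counts =
  corpus
  |> Enum.flat_map(fn doc ->
    # every ordered pair of distinct terms within one document
    for a <- doc, b <- doc, a != b, do: {a, b}
  end)
  |> Enum.frequencies()

counts[{"cat", "dog"}]
# → 2 ("cat" and "dog" appear together in two documents)
```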

Supported Languages

Stemming

  • English - Porter2 algorithm (the second-generation Porter stemmer)
  • Spanish - Spanish stemmer
  • Portuguese - Portuguese stemmer
  • French - French stemmer
  • German - German stemmer
  • Italian - Italian stemmer
  • Polish - Polish stemmer

Stopwords

Stopword lists are available for 30+ languages including: English, Spanish, Portuguese, French, German, Italian, Polish, Russian, Dutch, Swedish, Norwegian, Danish, Finnish, Turkish, Arabic, Chinese, and more. See priv/stopwords/ for the complete list.

Architecture

The library is organized into logical modules:

  • ExNlp.Tokenizer - Text tokenization with multiple strategies
  • ExNlp.Snowball - Word stemming algorithms
  • ExNlp.Ranking - Document ranking (TF-IDF, BM25)
  • ExNlp.Similarity - String and set similarity metrics
  • ExNlp.Stopwords - Stopword detection and filtering
  • ExNlp.Filter - Token filtering and transformation
  • ExNlp.Statistics - Text and corpus statistics
  • ExNlp.Cooccurrence - Term co-occurrence analysis
  • ExNlp.Ngram - N-gram generation
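
These modules are designed to compose. A hypothetical end-to-end preprocessing pipeline, sketched from only the calls documented above (tokenize, lowercase, drop stopwords, stem) — assumed composition, not canonical usage:

```elixir
# Requires ex_nlp as a dependency; output shapes follow the examples above.
"The runners were running quickly through the forest"
|> ExNlp.Tokenizer.tokenize()          # list of %ExNlp.Token{} structs
|> ExNlp.Filter.lowercase()
|> ExNlp.Filter.stop_words(:english)
|> Enum.map(& &1.text)                 # back to plain strings for stemming
|> ExNlp.Snowball.stem_words(:english)
```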

Performance

The library includes benchmark suites for critical operations. Run benchmarks with:

mix run benchmarks/tokenizer_bench.exs
mix run benchmarks/similarity_bench.exs
mix run benchmarks/ranking_bench.exs

Testing

Run the test suite with:

mix test

Documentation

Generate documentation with:

mix docs

Contributing

Contributions are welcome! This library aims to be a comprehensive NLP toolkit for Elixir. Areas for contribution:

  • Additional language support for stemming
  • More stopword lists
  • Additional similarity metrics
  • Performance optimizations
  • Documentation improvements

License

MIT License - see LICENSE file for details.
