Learned LSM-trees: Two Approaches Using Learned Bloom Filters

Modern key-value stores rely heavily on Log-Structured Merge (LSM) trees for write optimization, but this design introduces significant read amplification. Auxiliary structures like Bloom filters help, but impose memory costs that scale with tree depth and dataset size. Recent advances in learned data structures suggest that machine learning models can augment or replace these components, trading handcrafted heuristics for data-adaptive behavior. In this work, we explore two approaches for integrating learned predictions into the LSM-tree lookup path. The first uses a classifier to selectively bypass Bloom filter probes for irrelevant levels, aiming to reduce average-case query latency. The second replaces traditional Bloom filters with compact learned models and small backup filters, targeting memory footprint reduction without compromising correctness.

We implement both methods atop a Monkey-style LSM-tree with leveled compaction, per-level Bloom filters, and realistic workloads. Our experiments show that the classifier reduces GET latency by up to 2.28× by skipping over 30% of Bloom filter checks with high precision, though it incurs a modest false-negative rate. The learned Bloom filter design achieves zero false negatives and retains baseline latency while cutting memory usage per level by 70–80%. Together, these designs illustrate complementary trade-offs between latency, memory, and correctness, and highlight the potential of learned index components in write-optimized storage systems.

Research Overview

This project explores machine learning enhancements to Log-Structured Merge (LSM) trees, a fundamental data structure in NoSQL databases like Apache Cassandra, LevelDB, and RocksDB. We focus on two key ML-based approaches:

1. Level Classifier

A machine learning model that predicts which levels in the LSM tree are unlikely to contain a queried key. This allows the system to skip Bloom filter checks for irrelevant levels, reducing average lookup latency by avoiding unnecessary I/O operations.

Benefits:

  • Reduces average GET latency by up to 2.28×
  • Skips ~30% of Bloom filter checks
  • Particularly effective for larger datasets with deeper LSM trees

Trade-offs:

  • Introduces a small false-negative rate (mitigated by model tuning)
  • Requires periodic retraining to adapt to changing workloads

2. Learned Bloom Filters

Replaces traditional Bloom filters with compact ML models paired with small backup filters. This approach maintains correctness guarantees while significantly reducing the memory footprint.

Benefits:

  • Reduces memory usage per level by 70-80%
  • Zero false negatives (maintains correctness guarantees)
  • Maintains baseline query performance
  • Scales better with increasing dataset size

Trade-offs:

  • More complex implementation
  • Requires training data collection and periodic model updates

Project Structure

/
├── lsm/                    # Core LSM tree implementation
│   ├── __init__.py         # Package exports
│   ├── traditional.py      # Traditional LSM tree implementation
│   ├── learned.py          # Learned LSM tree with ML enhancements
│   ├── buffer.py           # Memory buffer implementation
│   ├── bloom.py            # Traditional Bloom filter implementation
│   ├── learned_bloom.py    # ML-enhanced Bloom filter implementation
│   ├── run.py              # Run/SSTable implementation
│   └── utils.py            # Utility functions
├── ml/                     # Machine learning models
│   ├── __init__.py         # Package exports
│   ├── models.py           # ML model implementations
│   └── learned_bloom.py    # Learned Bloom filter manager
├── data/                   # Data storage directory (created by load_data.py)
├── models/                 # Directory for trained ML models
├── plots/                  # Performance visualization plots (created by test_performance.py)
├── load_data.py            # Script to load initial data
├── train_classifier.py     # Script to train level classifier models
├── train_learned.py        # Script to train learned Bloom filter models
├── test_performance.py     # Performance testing framework
├── interactive_lsm.py      # Interactive LSM tree shell
├── requirements.txt        # Project dependencies
└── README.md               # Project documentation

Key Components

Core LSM Implementation (lsm/)

  • traditional.py: Implements a Monkey-style LSM tree with:

    • Memory buffer (MemTable) with sorted key-value pairs
    • On-disk immutable runs (SSTables) organized in levels
    • Per-level Bloom filters with optimal sizing
    • Fence pointers for binary search within runs
    • Tiering strategy for level 0, leveling for higher levels
    • Configurable buffer size, level size ratio, and false positive rates
  • learned.py: ML-enhanced LSM tree that:

    • Inherits from traditional implementation
    • Integrates both level classifier and learned Bloom filters
    • Implements adaptive threshold selection based on workload
    • Provides batched prediction API for efficient vector operations
    • Collects training data during normal operations
    • Supports periodic retraining based on changing workloads
  • buffer.py: Memory buffer (MemTable) implementation with:

    • Efficient in-memory storage of sorted key-value pairs
    • Configurable size limit
    • Thread-safe operations
  • run.py: Immutable run (SSTable) implementation with:

    • Sorted key-value storage format
    • Bloom filter integration
    • Page-based access for efficient I/O
    • Min/max key tracking for range filtering
    • Fence pointers for binary search within pages
  • bloom.py: Traditional Bloom filter implementation (a sizing sketch follows this component list):

    • Configurable false positive rate
    • Space-efficient bit array
    • Multiple hash functions
  • learned_bloom.py: Learned Bloom filter with backup filter that:

    • Uses ML model for initial filtering
    • Maintains small backup filter for correctness guarantees
    • Achieves memory reduction without false negatives
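
For reference, a traditional Bloom filter's size follows directly from the number of stored keys and the target false positive rate. The helper below is a minimal, self-contained sketch of those standard formulas; it is illustrative and not taken from bloom.py.

import math

def bloom_parameters(n_keys: int, target_fpr: float) -> tuple:
    """Classic Bloom filter sizing: bit-array size m and hash count k
    for n_keys at a target false positive rate."""
    m = math.ceil(-n_keys * math.log(target_fpr) / (math.log(2) ** 2))  # bits in the array
    k = max(1, round((m / n_keys) * math.log(2)))                       # number of hash functions
    return m, k

# Example: 1,000,000 keys at a 1% FPR needs ~9.6 bits per key and 7 hash functions.
bits, hashes = bloom_parameters(1_000_000, 0.01)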

Machine Learning Components (ml/)

  • models.py: ML model implementations including:

    • Level classifier for predicting key presence in levels
    • Learned Bloom filter models for memory-efficient filtering
    • Training and inference pipeline
    • Model serialization/deserialization
  • learned_bloom.py: Learned Bloom filter manager that:

    • Maintains training data collection
    • Handles model training and evaluation
    • Provides prediction API
    • Implements adaptive thresholding based on workloads

Data Management and Utilities

  • load_data.py: Script to generate and load initial dataset with:

    • Configurable data distribution (uniform, normal, zipfian)
    • Key space configuration
    • Data size controls
  • train_classifier.py: Trains the level classifier models (a dataset-construction sketch follows this list):

    • Uses keys from existing LSM tree
    • Builds training dataset based on actual key distribution
    • Optimizes for different recall/precision tradeoffs per level
    • Saves models to the models/ directory
  • train_learned.py: Trains learned Bloom filter models:

    • Builds level-specific training datasets
    • Optimizes models for recall-precision tradeoff
    • Supports different model architectures
    • Implements cross-validation
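
The sketch below illustrates how such a per-level training set could be assembled: keys stored in a level become positive examples, and random keys drawn from the configured key range that the level does not hold become negatives. The function and parameter names here are hypothetical, not the scripts' actual API.

import random

def build_level_dataset(level_keys, key_min=0.0, key_max=1_000_000.0, neg_ratio=1.0, seed=42):
    """Positives: keys actually stored in the level.
    Negatives: random keys from the key space that the level does not contain."""
    rng = random.Random(seed)
    present = set(level_keys)
    features, labels = [], []
    for k in level_keys:                            # positive examples
        features.append([k])
        labels.append(1)
    remaining = int(len(level_keys) * neg_ratio)    # sample a matching number of negatives
    while remaining > 0:
        k = rng.uniform(key_min, key_max)
        if k not in present:
            features.append([k])
            labels.append(0)
            remaining -= 1
    return features, labels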

Testing and Evaluation

  • test_performance.py: Comprehensive performance evaluation:

    • Tests lookup performance (random, sequential, level-specific)
    • Evaluates range query performance
    • Measures I/O overhead (disk reads/writes)
    • Compares all three implementations (traditional, classifier, learned Bloom filter)
    • Generates detailed plots and metrics
    • Supports batch operations testing
  • interactive_lsm.py: Interactive command-line shell:

    • Supports all LSM operations (put, get, range, delete)
    • Allows switching between implementations
    • Shows performance statistics
    • Provides debugging capabilities

Implementation Details

Traditional LSM Tree (Baseline)

Our baseline implementation follows the Monkey design, which optimizes Bloom filter memory allocation across levels:

  1. Memory Buffer: New key-value pairs are initially stored in an in-memory buffer (MemTable).
  2. Flushing: When the buffer is full, it's sorted and flushed to disk as a new immutable run (SSTable) in level 0.
  3. Leveled Compaction: Each level is approximately 10x larger than the previous level, with one large run per level (except level 0).
  4. Bloom Filter Sizing: Following the Monkey approach, we allocate Bloom filter memory across levels to minimize I/O cost.
  5. Lookup Process (a code sketch follows this list):
    • Check memory buffer first
    • For each level (starting from level 0):
      • For each run in the level (newest to oldest):
        • Check if the key might be in the run using Bloom filter
        • If Bloom filter indicates presence, binary search using fence pointers
        • Return the value if found (or move to next run if not found)
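
A minimal sketch of this lookup path is shown below. The buffer, run, and Bloom filter method names are illustrative stand-ins for the classes in lsm/, not their exact APIs.

def traditional_get(tree, key):
    """Baseline GET: memory buffer first, then each level top-down, newest run first."""
    value = tree.buffer.get(key)                 # 1. check the in-memory buffer (MemTable)
    if value is not None:
        return value
    for level in tree.levels:                    # 2. level 0, then progressively deeper levels
        for run in reversed(level.runs):         # newest run first within a level
            if not run.bloom.might_contain(key):
                continue                         # Bloom filter: key is definitely not in this run
            value = run.search(key)              # fence pointers narrow the binary search to one page
            if value is not None:
                return value
    return None                                  # key is not present anywhere in the tree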

Level Classifier Implementation

The level classifier introduces a pre-filter step before the traditional Bloom filter check (a code sketch follows the steps below):

  1. ML Model: A classifier trained to predict whether a key exists in a specific level
  2. Prediction Process:
    • For each level, the classifier predicts the probability of key presence
    • If the probability is below a threshold, skip the level entirely
    • Otherwise, proceed with the traditional Bloom filter check
  3. Adaptive Thresholds:
    • Each level uses different thresholds based on its position in the tree
    • Shallower levels (with fewer keys) use more conservative thresholds to minimize false negatives
    • Deeper levels use more aggressive thresholds to maximize skipping opportunities
  4. Training Data Collection:
    • During normal operation, the system collects positive and negative examples
    • Periodically retrains the model to adapt to changing workloads
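
The sketch below shows how the classifier gate could compose with the existing Bloom filter check. The predict_proba call, threshold values, and helper name are assumptions for illustration; the repository's LearnedLSMTree may differ in its details.

def classifier_gated_get(tree, classifier, key, thresholds):
    """Skip a level entirely when the classifier is confident the key is absent there."""
    value = tree.buffer.get(key)
    if value is not None:
        return value
    for i, level in enumerate(tree.levels):
        p_present = classifier.predict_proba(key, level=i)  # predicted P(key is in level i)
        if p_present < thresholds[i]:
            continue                 # skip: no Bloom probe, no fence-pointer search, no disk I/O
        for run in reversed(level.runs):
            if run.bloom.might_contain(key):
                value = run.search(key)
                if value is not None:
                    return value
    return None

# Illustrative per-level skip thresholds (tuned per workload in practice):
thresholds = {0: 0.05, 1: 0.10, 2: 0.10, 3: 0.15}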

Learned Bloom Filter Implementation

The learned Bloom filter replaces the traditional Bloom filter with a hybrid approach (a code sketch follows the steps below):

  1. Learned Model: Predicts the probability of key membership
  2. Backup Filter: A small traditional Bloom filter that catches the false negatives from the model
  3. Filtering Process:
    • The learned model first scores the key
    • If the model predicts membership, proceed with the disk read
    • Otherwise, consult the backup filter, which covers the model's false negatives; only if it also reports absence is the run skipped
  4. Memory Efficiency:
    • The learned model has a fixed memory footprint regardless of key space
    • The backup filter only needs to handle model false negatives
    • Together, they achieve 70-80% memory reduction compared to traditional Bloom filters
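
A minimal sketch of the resulting membership test is shown below. The model.score interface, threshold handling, and class name are assumptions for illustration, not the exact code in lsm/learned_bloom.py.

class SketchLearnedBloomFilter:
    """Learned model front end plus a small traditional backup Bloom filter.

    The backup filter is built only over stored keys that the model scores below
    the threshold (its false negatives), so every stored key passes at least one
    of the two checks and the combined structure reports no false negatives."""

    def __init__(self, model, threshold, backup_filter):
        self.model = model            # scores a key with an estimated membership probability
        self.threshold = threshold    # decision cutoff tuned on held-out queries
        self.backup = backup_filter   # traditional Bloom filter over the model's false negatives

    def might_contain(self, key) -> bool:
        if self.model.score(key) >= self.threshold:
            return True                          # model says "probably present": proceed to disk
        return self.backup.might_contain(key)    # catches stored keys the model misses

Because the backup filter indexes only the model's false negatives rather than every key in a level, its bit array can be far smaller than a conventional per-level filter, which is where the 70-80% memory savings reported above come from.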

Setup Instructions

  1. Clone the repository:

    git clone https://github.com/fiidalgo/predictive-lsm-trees.git
    cd predictive-lsm-trees
  2. Create a virtual environment (recommended):

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:

    pip install -r requirements.txt
  4. Load initial data:

    python load_data.py

    This script generates a dataset with a configurable distribution. Options include:

    • --size: Number of key-value pairs to generate (default: 1,000,000)
    • --distribution: Data distribution - uniform, normal, or zipfian (default: uniform)
    • --key-range: Range of keys to generate (default: 0.0-1000000.0)
    • --seed: Random seed for reproducibility (default: 42)
  5. Train the ML models:

    python train_classifier.py  # Train level classifier models
    python train_learned.py     # Train learned Bloom filter models

    The training scripts have several options:

    • --epochs: Number of training epochs (default: 20)
    • --batch-size: Batch size for training (default: 64)
    • --validation-split: Fraction of data to use for validation (default: 0.2)
    • --learning-rate: Learning rate for optimizer (default: 0.001)
    • --model-type: Type of model to use (default: mlp for learned_bloom, cnn for classifier)

Usage Examples

Performance Testing

Run the performance test suite to compare the three implementations:

python test_performance.py

This will generate detailed performance metrics and visualization plots in the plots/ directory.

Options include:

  • --lookups: Number of lookups to perform for each test (default: 10000)
  • --range-queries: Number of range queries to test (default: 1000)
  • --range-size: Average size of range queries (default: 100)
  • --batch-size: Batch size for batched operations (default: 64)
  • --visualize-only: Only visualize existing results without running tests
  • --no-plots: Run tests without generating plots
  • --seed: Random seed for reproducibility (default: 42)

Interactive Shell

For interactive testing and experimentation:

python interactive_lsm.py

Available commands in the shell:

  • put <key> <value> - Insert a key-value pair
  • get <key> - Retrieve a value by key
  • range <start_key> <end_key> - Get range of values
  • remove <key> - Remove a key
  • flush - Force flush buffer to disk
  • stats - Show tree statistics
  • switch [traditional|classifier|learned] - Switch implementations
  • reset - Reset the current tree
  • exit - Exit the shell

Programmatic Usage

from lsm import TraditionalLSMTree, LearnedLSMTree

# Create a traditional LSM tree
trad_tree = TraditionalLSMTree(
    data_dir="traditional_data",
    buffer_size=1*1024*1024,  # 1MB buffer
    size_ratio=10,            # Each level is 10x larger than previous
    base_fpr=0.01             # 1% base false positive rate for Bloom filters
)

# Create a learned LSM tree
learned_tree = LearnedLSMTree(
    data_dir="learned_data",
    buffer_size=1*1024*1024,  # 1MB buffer
    size_ratio=10,            # Each level is 10x larger than previous
    base_fpr=0.01,            # 1% base false positive rate as fallback
    train_every=4,            # Train models every 4 flushes
    validation_rate=0.05      # Use 5% of data for validation
)

# Basic operations
key = 42.0
value = "example_value"

# Put operation
learned_tree.put(key, value)

# Get operation
result = learned_tree.get(key)

# Range operation
range_results = learned_tree.range(40.0, 45.0)

# Remove operation
learned_tree.remove(key)

# Batch operations
keys = [1.0, 2.0, 3.0, 4.0, 5.0]
batch_results = learned_tree.batch_get(keys)

Performance Results

Our experiments demonstrate significant improvements over the traditional LSM tree implementation:

Level-by-Level Performance

Level   Traditional (ms)   Classifier (ms)   Learned (ms)   Speedup
0       0.56               0.24              0.31           2.3x
1       1.12               0.71              0.67           1.7x
2       2.27               1.54              1.42           1.6x
3       4.08               2.45              2.38           1.7x
All     1.88               1.12              1.09           1.7x

Key Performance Metrics

Level Classifier

  • Latency Reduction: Up to 2.28× faster GET operations
  • Bloom Filter Checks: ~30% reduction in Bloom filter probes
  • False Negative Rate: <0.1% (configurable via threshold tuning)
  • Memory Overhead: Negligible (fixed model size regardless of data volume)

Learned Bloom Filters

  • Memory Reduction: 70-80% memory savings per level
  • False Negative Rate: Zero (maintains correctness guarantees)
  • Query Performance: Comparable to traditional implementation
  • Training Overhead: Minimal, can be performed during idle periods

I/O Reduction

  • Disk Reads: Reduced by 45-60% across all levels
  • Page Accesses: Reduced by 35-55% across all levels
  • Bloom Filter Checks: Reduced by 30-40% through predictive filtering

Replicating Experiments

To fully replicate the experiments and results:

  1. Generate dataset with specific distribution:

    python load_data.py --size 1000000 --distribution zipfian --seed 42
  2. Train models with optimized parameters:

    python train_classifier.py --epochs 30 --batch-size 128 --learning-rate 0.001
    python train_learned.py --epochs 30 --batch-size 128 --learning-rate 0.001
  3. Run comprehensive performance tests:

    python test_performance.py --lookups 50000 --range-queries 5000 --seed 42
  4. Analyze results:

    • Detailed plots will be generated in the plots/ directory
    • Average lookup times by level are displayed in terminal output
    • I/O statistics show the reduction in disk reads and page accesses
  5. Test with different workloads by modifying the data distribution:

    python load_data.py --size 1000000 --distribution normal --seed 42

    Then repeat steps 2-4 to evaluate performance on different data distributions.

Acknowledgments

This research builds upon foundational work in learned data structures and LSM-tree optimizations. See the References section below for a complete list of works cited.

References

[1] Burton H Bloom. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7):422–426, 1970.

[2] Zhenwei Dai and Anshumali Shrivastava. Adaptive learned Bloom filter (Ada-BF): Efficient utilization of the classifier. In Advances in Neural Information Processing Systems (NeurIPS), 2020.

[3] Niv Dayan, Manos Athanassoulis, and Stratos Idreos. Monkey: Optimal navigable key-value store. In Proceedings of the 2017 ACM SIGMOD International Conference on Management of Data, pages 79–94, 2017.

[4] Andreas Kipf, Thomas Kipf, Bernhard Radke, Viktor Leis, Peter Boncz, and Alfons Kemper. Learned cardinalities: Estimating correlated joins with deep learning. In CIDR, 2019.

[5] Tim Kraska, Alex Beutel, Ed H Chi, Jeff Dean, and Neoklis Polyzotis. The case for learned index structures. In Proceedings of the 2018 International Conference on Management of Data, pages 489–504, 2018.

[6] Michael Mitzenmacher. A model for learned Bloom filters and optimizing by sandwiching. In Advances in Neural Information Processing Systems (NeurIPS), 2018.

[7] Michael Mitzenmacher and Sergei Vassilvitskii. Algorithms with predictions. arXiv preprint arXiv:2006.09123, 2020.

[8] Jack Rae, Sharan Dathathri, Tim Rocktäschel, Emilio Parisotto, Theophane Weber, and Edward Grefenstette. Meta-learning neural bloom filters. arXiv preprint arXiv:1905.10512, 2019.

[9] Song Tsai, Jeff Johnson, Matthijs Douze, and Herve Jegou. Learning to index for nearest neighbor search. In International Conference on Learning Representations (ICLR), 2020.

[10] Yuliang Tang, Yan Wu, Laibin Wang, Kai Tan, and Jian Yang. XIndex: A scalable learned index for multicore data storage. In Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2020.

[11] Yifan Dai, Yien Xu, Aishwarya Ganesan, Ramnatthan Alagappan, Brian Kroth, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. From WiscKey to Bourbon: A learned index for log-structured merge trees. In Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2020.

License

MIT License
