35 changes: 17 additions & 18 deletions .github/workflows/python-package.yml
```diff
@@ -1,33 +1,32 @@
-name: Python package
+name: CI

 on:
   push:
     branches: ["main"]
   pull_request:
     branches: ["main"]

 jobs:
-  build:
+  test:
     runs-on: ubuntu-latest
     steps:
       - name: Checkout repository
-        uses: actions/checkout@v4.1.1
+        uses: actions/checkout@v4

-      - name: Set up Python
-        uses: actions/setup-python@v4
+      - name: Install uv
+        uses: astral-sh/setup-uv@v4
         with:
-          python-version: "3.10"
-          cache: "pip"
+          enable-cache: true

+      - name: Set up Python
+        run: uv python install 3.11
+
       - name: Install dependencies
-        run: |
-          python -m pip install --upgrade pip
-          pip install -r requirements.txt
-          pip install -e .
+        run: uv sync --extra dev

+      - name: Run ruff check
+        run: uv run ruff check .
+
-      - name: Python Ruff Lint and Format
-        uses: adityabhangle658/ruff-python-lint-format[email protected]
+      - name: Run ruff format check
+        run: uv run ruff format --check .

-      - name: Run tests with pytest
-        run: |
-          pytest -v --tb=short .
+      - name: Run tests
+        run: uv run pytest -v --tb=short .
```
2 changes: 2 additions & 0 deletions .gitignore
@@ -15,3 +15,5 @@ env/
.env.remote
*.origami
dist/
*.pt
dump/
1 change: 1 addition & 0 deletions .python-version
@@ -0,0 +1 @@
3.11
200 changes: 200 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,200 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

ORiGAMi is a transformer-based machine learning model for supervised classification from semi-structured data (MongoDB documents, JSON files). It directly operates on JSON data without requiring manual feature extraction or flattening to tabular format.

## Development Commands

### Installation and Setup
```bash
# Install in development mode
pip install -e .

# Install with dev dependencies
pip install -e ".[dev]"

# Install from requirements file
pip install -r requirements.txt
```

### Code Quality
```bash
# Run linting with ruff
ruff check .

# Format code with ruff
ruff format .

# Run tests
pytest
```

### CLI Usage
```bash
# Basic CLI help
origami --help

# Train a model
origami train <source> [options]

# Make predictions
origami predict <source> --target-field <field> [options]
```

## Architecture Overview

### Core Components

- **`origami/cli/`**: Command-line interface with `train` and `predict` commands
- **`origami/model/`**: Core transformer model implementation
  - `origami.py`: Main ORiGAMi transformer model
  - `vpda.py`: Visibly pushdown automaton (VPDA) implementation backing the guardrails
  - `positions.py`: Position encoding implementations
- **`origami/preprocessing/`**: Data preprocessing pipeline
  - `encoder.py`: Tokenization and encoding of JSON data
  - `pipelines.py`: Data processing pipelines
  - `df_dataset.py`: Dataset handling for pandas DataFrames
- **`origami/inference/`**: Model inference and prediction
  - `predictor.py`: Main prediction interface
  - `embedder.py`: Embedding generation
  - `autocomplete.py`: Autocompletion functionality
  - `sampler.py`: Generates samples from the learned model distribution
  - `mc_estimator.py`: Monte Carlo cardinality estimator for query selectivity
  - `rejection_estimator.py`: Rejection sampling cardinality estimator
- **`origami/utils/`**: Utilities and configuration
  - `config.py`: Configuration classes using OmegaConf
  - `query_utils.py`: Query evaluation and selectivity calculation utilities
  - `common.py`: Common utilities, symbols, operators, and helper functions

### Key Architecture Features

1. **Transformer-based**: Uses transformer architecture for processing sequential JSON tokens
2. **Guardrails System**: Three modes (NONE, STRUCTURE_ONLY, STRUCTURE_AND_VALUES) to enforce valid JSON generation
3. **Position Encoding**: Multiple methods (INTEGER, SINE_COSINE, KEY_VALUE) for sequence positioning
4. **Shuffled Training**: Can train with shuffled key/value pairs for better generalization
5. **Schema-aware**: Tracks field paths and value vocabularies for validation
6. **Cardinality Estimation**: Monte Carlo estimator for query selectivity prediction

### Configuration System

The project uses OmegaConf for configuration management with dataclasses:
- `ModelConfig`: Model architecture parameters
- `TrainConfig`: Training hyperparameters
- `PipelineConfig`: Data processing options
- `DataConfig`: Data source configuration

### Data Sources Supported

- MongoDB collections (with +srv URI support)
- JSON files (.json, .jsonl)
- CSV files
- Directories containing supported file types

### Model Presets

Available model sizes: xs, small (default), medium, large, xl
- Default: 4 layers, 4 attention heads, 128 hidden dimensions

## Testing

Tests are organized by component:
- `tests/cli/`: CLI functionality tests
- `tests/model/`: Model component tests
- `tests/preprocessing/`: Data preprocessing tests
- `tests/inference/`: Inference component tests (including MC estimator)
- `tests/utils/`: Utility function tests

Run tests with: `pytest`

## Key Files

- `setup.py`: Package configuration with dependencies
- `pyproject.toml`: Build system and ruff configuration
- `CLI.md`: Detailed CLI documentation
- `requirements.txt`: Python dependencies
- `experiments/`: Experiment scripts for paper reproduction
- `notebooks/`: Example Jupyter notebooks
  - `example_origami_mc_ce.ipynb`: Monte Carlo cardinality estimation demo


## Monte Carlo Cardinality Estimation

The `MCEstimator` class provides query selectivity estimation using Monte Carlo sampling:

### Usage Example
```python
from origami.inference import MCEstimator
from mdbrtools.query import Query, Predicate

# Initialize estimator with trained model and pipeline
estimator = MCEstimator(model, pipeline, batch_size=1000)

# Create a query
query = Query()
query.add_predicate(Predicate('field_name', 'gte', (min_value,)))
query.add_predicate(Predicate('field_name', 'lte', (max_value,)))

# Estimate selectivity
probability, samples = estimator.estimate(query, n=1000)
cardinality = probability * collection_size
```

### How It Works
1. Calculates query region size |E| (number of discrete states matching query)
2. Generates n uniform samples within the query region
3. Computes model probability f(x) for each sample
4. Returns Monte Carlo estimate: P(query) = |E| * mean(f(x))
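The estimate in step 4 reduces to a one-liner over the sampled model probabilities; here is a minimal sketch with made-up values standing in for f(x):

```python
import numpy as np

def mc_probability(region_size: int, model_probs: np.ndarray) -> float:
    """Monte Carlo estimate: P(query) = |E| * mean(f(x))."""
    return region_size * float(np.mean(model_probs))

# |E| = 100 discrete states in the query region; f(x) values from the model
model_probs = np.array([0.001, 0.002, 0.0015])
p = mc_probability(100, model_probs)  # 100 * 0.0015 = 0.15
```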

### Query Utilities
- `evaluate_ground_truth(query, docs)`: Count documents matching query predicates
- `calculate_selectivity(query, docs)`: Calculate fraction of documents matching query
- `compare_estimate_to_ground_truth(query, docs, estimated_prob)`: Compare estimates with actual counts and compute error metrics (q-error, relative error, etc.)
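The error metrics mentioned above are simple to compute by hand; the helpers below are illustrative stand-ins, not the actual `query_utils` signatures:

```python
def selectivity(matches, docs):
    # fraction of documents satisfying the predicate
    return sum(1 for d in docs if matches(d)) / len(docs)

def q_error(estimated, actual):
    # symmetric multiplicative error: max ratio in either direction
    return max(estimated / actual, actual / estimated)

docs = [{"price": p} for p in (5, 10, 15, 20)]
actual = selectivity(lambda d: d["price"] >= 10, docs)  # 3/4 = 0.75
qe = q_error(0.5, actual)                               # max(2/3, 3/2) = 1.5
```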

## Sampling and Rejection Sampling

### Sampler
The `Sampler` class generates unbiased samples from the learned model distribution:

```python
from origami.inference import Sampler

# Initialize sampler
sampler = Sampler(model, encoder, schema, temperature=1.0)

# Generate samples
documents, log_probs = sampler.sample(n=1000)

# documents: list of dicts sampled from P_model(x)
# log_probs: numpy array of log P(document)
```

### Rejection Sampling Estimator
The `RejectionEstimator` uses rejection sampling for query selectivity estimation:

```python
from origami.inference import Sampler, RejectionEstimator

# Initialize sampler and estimator
sampler = Sampler(model, encoder, schema)
estimator = RejectionEstimator(sampler)

# Create a query
query = Query()
query.add_predicate(Predicate('field_name', 'gte', (min_value,)))
query.add_predicate(Predicate('field_name', 'lte', (max_value,)))

# Estimate selectivity
selectivity, accepted_samples = estimator.estimate(query, n=1000)
cardinality = selectivity * collection_size
```

**How it works:**
1. Samples n documents from the learned model distribution
2. Rejects samples that don't match the query predicates
3. Returns unbiased estimate: selectivity = (# accepted) / n
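Stripped of the model, the accept/reject bookkeeping in these three steps amounts to the following sketch, with a plain predicate standing in for query matching:

```python
def rejection_selectivity(samples, matches_query):
    # keep only samples that satisfy every query predicate
    accepted = [doc for doc in samples if matches_query(doc)]
    # unbiased estimate: fraction of model samples inside the query region
    return len(accepted) / len(samples), accepted

samples = [{"x": i} for i in range(10)]  # stand-in for model-drawn documents
sel, accepted = rejection_selectivity(samples, lambda d: d["x"] >= 5)
```

The variance of this estimate grows as selectivity shrinks (few samples are accepted), which is why the MC estimator below is preferred for rare queries.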

**When to use:**
- **Rejection Sampling**: Best for common queries (high selectivity)
- **MC Sampling**: Best for rare queries (low selectivity)
33 changes: 25 additions & 8 deletions README.md
````diff
@@ -10,7 +10,9 @@

 ## Disclaimer

-Please note: This tool is not officially supported or endorsed by MongoDB, Inc. The code is released for use "AS IS" without any warranties of any kind, including, but not limited to its installation, use, or performance. Do not run this tool against critical production systems.
+This is a personal fork of the original [mongodb-labs/origami](https://github.com/mongodb-labs/origami) project. While I was the original author, I have since left MongoDB and am continuing development and maintenance of this fork independently.
+
+This tool is not officially supported or endorsed by MongoDB, Inc. The code is released for use "AS IS" without any warranties of any kind, including, but not limited to its installation, use, or performance. Do not run this tool against critical production systems.

 ## Overview

@@ -22,22 +24,37 @@ ORiGAMi circumvents this by directly operating on JSON data. Once a model is tra

 ## Installation

-ORiGAMi requires Python version 3.10 or 3.11. We recommend using a virtual environment, such as
-Python's native [`venv`](https://docs.python.org/3/library/venv.html).
+ORiGAMi requires Python 3.11. We recommend using [`uv`](https://docs.astral.sh/uv/) for dependency management and virtual environments.

-To install ORiGAMi with `pip`, use
+### Install from PyPI

 ```shell
 pip install origami-ml
 ```

-You can also clone the repository to your local machine and install the dependencies manually:
+### Install from source with uv (recommended for development)
+
+First, install `uv` if you haven't already:
+
+```shell
+curl -LsSf https://astral.sh/uv/install.sh | sh
+```
+
+Then clone and install the project:

 ```shell
-git clone https://github.com/mongodb-labs/origami.git
+git clone https://github.com/rueckstiess/origami.git
 cd origami
-pip install -r requirements.txt
-pip install -e .
+uv sync --extra dev
 ```
+
+This will automatically create a virtual environment, install Python 3.11 if needed, and install all dependencies.
+
+To run commands in the uv environment:
+
+```shell
+uv run origami --help
+uv run pytest
+```

 ## Usage
````