
Conversation

@rueckstiess
Contributor

Summary

  • Fixed Sampler to skip bad/invalid samples instead of failing
  • Fixed 'pipe not fitted' warnings in preprocessing pipeline
  • Improved MC estimator performance by using random.choice() instead of np.random.choice()
  • Added OnTime dataset cardinality estimation notebook demonstrating the improvements
  • Applied ruff formatting

Changes

  • origami/inference/sampler.py: Skip invalid samples during generation instead of raising errors
  • origami/preprocessing/pipes.py: Avoid 'pipe not fitted' warnings by checking fit state
  • origami/inference/mc_estimator.py: Use random.choice() for better performance
  • notebooks/example_origami_ontime_ce.ipynb: New demo notebook showing cardinality estimation on OnTime dataset
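The `random.choice()` swap is worth a quick illustration: for single draws from a plain Python list inside a loop, NumPy's per-call overhead (argument validation, array conversion) tends to dominate. A minimal benchmark sketch, not taken from the PR itself:

```python
import random
import timeit

import numpy as np

values = list(range(1000))

# Drawing one element at a time from a Python list: random.choice()
# skips NumPy's per-call setup work, which dominates when samples
# are drawn one by one inside a tight loop.
t_stdlib = timeit.timeit(lambda: random.choice(values), number=10_000)
t_numpy = timeit.timeit(lambda: np.random.choice(values), number=10_000)

print(f"random.choice:    {t_stdlib:.3f}s")
print(f"np.random.choice: {t_numpy:.3f}s")
```

Note that `np.random.choice` is still the better tool for vectorized draws (`size=n` in a single call); the win here is specific to one-at-a-time sampling.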

Test plan

  • Run existing tests to ensure no regressions
  • Verify notebook executes successfully with OnTime dataset
  • Confirm sampling now handles edge cases gracefully
  • Check that MC estimator performance has improved

🤖 Generated with Claude Code

Thomas Rueckstiess and others added 15 commits October 22, 2025 16:32
This commit adds a new MCEstimator class that estimates query selectivity
using Monte Carlo sampling. The estimator leverages trained ORiGAMi models
to predict the probability distribution over documents and estimates how
many documents match a given query.

New files:
- origami/inference/mc_estimator.py: Core MCEstimator implementation
- origami/utils/query_utils.py: Query evaluation and comparison utilities
- tests/inference/test_mc_estimator.py: Comprehensive test suite
- notebooks/example_origami_mc_ce.ipynb: Example notebook demonstrating usage

Modified files:
- origami/inference/__init__.py: Export MCEstimator
- origami/utils/__init__.py: Export query utility functions
- origami/utils/common.py: Add OPERATORS dict and comparison helper functions
- CLAUDE.md: Document MC estimator architecture and usage

The MCEstimator works by:
1. Calculating query region size based on discretized value spaces
2. Generating uniform samples within the query constraints
3. Computing model probabilities for each sample
4. Returning Monte Carlo estimate: P(query) = |E| * mean(f(x))

Includes utility functions for ground truth evaluation and error metrics
(q-error, relative error, absolute error) for comparing estimates.
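The four steps above can be condensed into a small routine. This is a sketch of the estimation idea only; `model_prob`, `sample_uniform`, and `region_size` are hypothetical stand-ins, not the actual `MCEstimator` API:

```python
import random

def mc_estimate(model_prob, sample_uniform, region_size, n=1000):
    """Monte Carlo estimate of P(query) = |E| * mean(f(x)).

    model_prob:     hypothetical callable returning the model density f(x)
    sample_uniform: hypothetical callable drawing one uniform sample
                    from the query region E
    region_size:    |E|, the size of the discretized query region
    """
    total = 0.0
    for _ in range(n):
        x = sample_uniform()      # step 2: uniform sample within constraints
        total += model_prob(x)    # step 3: model probability for the sample
    return region_size * (total / n)  # step 4: |E| * mean(f(x))

# Toy check: a uniform density f(x) = 1/|E| over a region of size 4
# should give P(query) = |E| * (1/|E|) = 1.0, whatever is sampled.
est = mc_estimate(lambda x: 0.25, lambda: random.randint(0, 3), region_size=4)
```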

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
…ents

- Introduced SortFieldsPipe to sort document fields alphabetically, ensuring consistent field ordering across all documents.
- Updated MCEstimator to detect the presence of SortFieldsPipe in the pipeline and sort generated samples accordingly.
- Enhanced the sort_dict_fields utility function to return an OrderedDict with alphabetically sorted keys.
- Added unit tests for SortFieldsPipe to verify its functionality across various scenarios, including handling of empty documents and preservation of value types.
- Updated existing tests to check for the correct detection of SortFieldsPipe in MCEstimator.
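The `sort_dict_fields` behavior described above can be sketched in a few lines. This is a simplified, top-level-only version for illustration; the actual utility may recurse into nested documents:

```python
from collections import OrderedDict

def sort_dict_fields(doc):
    # Return an OrderedDict with the document's keys in alphabetical
    # order, leaving the values untouched (top-level fields only).
    return OrderedDict(sorted(doc.items()))

doc = {"zip": "2000", "age": 42, "name": "Ada"}
print(list(sort_dict_fields(doc)))  # ['age', 'name', 'zip']
```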
…ator

Fixed a systematic 2x overestimation bug in the rejection sampling cardinality
estimator caused by duplicate document generation in the Sampler class.

## Bug Description

The Sampler._sample_batch() method was adding completed sequences twice:
1. When sequences reached PAD token (saved to completed_rows)
2. Again in the "handle uncompleted sequences" block after loop

This occurred because when all sequences completed, the code would break
BEFORE removing them from the idx tensor, causing them to be added again
at the end. Result: n samples requested → 2n documents returned, with each
document appearing exactly twice consecutively.

## Fix

Reordered the completion handling logic in _sample_batch():
- Move sequence removal BEFORE checking if all sequences are done
- Change from: save → break (if all done) → remove (never reached!)
- Change to: save → remove → break (if idx.size(0) == 0)

This ensures completed sequences are removed from the batch immediately,
preventing duplicate additions.
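The control-flow bug is easiest to see in a framework-free sketch (the real `_sample_batch()` operates on a torch `idx` tensor; the names and shapes below are illustrative only). Here every "sequence" completes on the first step, mimicking the all-done case that triggered the duplication:

```python
def sample_batch(n, buggy=True):
    # Simplified stand-in for Sampler._sample_batch().
    rows = [f"doc{i}" for i in range(n)]    # active sequences
    completed = []
    while rows:
        done = list(range(len(rows)))       # all sequences hit PAD at once
        for i in done:
            completed.append(rows[i])       # save completed sequences
        if buggy:
            break                           # bug: break BEFORE removing
        rows = [r for i, r in enumerate(rows) if i not in done]
        if not rows:                        # fix: remove, THEN break
            break
    # "handle uncompleted sequences" block: anything still in rows is
    # flushed to the output. With the buggy ordering, the just-saved
    # rows are still present here and get appended a second time.
    completed.extend(rows)
    return completed

print(len(sample_batch(3, buggy=True)))   # 6: every doc appears twice
print(len(sample_batch(3, buggy=False)))  # 3
```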

## Impact on RejectionEstimator

The RejectionEstimator computed the acceptance rate as len(accepted)/n, but the Sampler returned 2n documents for n requested, so the accepted count was doubled while the denominator stayed at n, producing a systematic 2x overestimation of selectivity.

After fix:
- MC Estimator: Median Q-Error 1.09x (unchanged, was already correct)
- Rejection Estimator: Median Q-Error 1.08x (was ~2x before fix)
- Both estimators now track ground truth accurately

Note: Occasional probabilistic duplicates in sampling output are expected
and correct behavior (model genuinely sampling same document by chance).
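The arithmetic behind the 2x overestimation can be shown with a toy acceptance-rate calculation (the predicate and sample values are illustrative, not the real estimator):

```python
def rejection_estimate(samples, matches_query, n_requested):
    # Selectivity estimate from rejection sampling: fraction of the
    # n requested model samples that satisfy the query predicate.
    accepted = [s for s in samples if matches_query(s)]
    return len(accepted) / n_requested

matches = lambda s: s < 10                       # 10% true selectivity
unique = list(range(100))                        # correct sampler output
doubled = [s for s in unique for _ in range(2)]  # duplicate-bug output

print(rejection_estimate(unique, matches, 100))   # 0.1 (correct)
print(rejection_estimate(doubled, matches, 100))  # 0.2 (2x overestimate)
```

The denominator stays at the requested n, so duplicating every sample exactly doubles the estimate, which matches the ~2x median Q-Error observed before the fix.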

## Files Changed

- origami/inference/sampler.py: Fixed _sample_batch() duplicate bug
- origami/inference/rejection_estimator.py: New rejection sampling estimator
- notebooks/example_origami_mc_ce.ipynb: Updated to compare both estimators
- tests/: Added test files for both new classes
- CLAUDE.md: Updated documentation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
…mators

Add Monte Carlo and Rejection Sampling Cardinality Estimators
…ndom.choice() instead of np.random.choice() in MCEstimator, notebook CE on OnTime data
@rueckstiess
Contributor Author

Closing - creating PR against fork instead
