
Conversation

@rueckstiess
Contributor

Summary

  • Fixed Sampler to skip bad/invalid samples instead of failing
  • Fixed 'pipe not fitted' warnings in preprocessing pipeline
  • Improved MC estimator performance by using random.choice() instead of np.random.choice()
  • Added OnTime dataset cardinality estimation notebook demonstrating the improvements
  • Applied ruff formatting

Changes

  • origami/inference/sampler.py: Skip invalid samples during generation instead of raising errors
  • origami/preprocessing/pipes.py: Avoid 'pipe not fitted' warnings by checking fit state
  • origami/inference/mc_estimator.py: Use random.choice() for better performance
  • notebooks/example_origami_ontime_ce.ipynb: New demo notebook showing cardinality estimation on OnTime dataset
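The `random.choice()` swap is worth a quick illustration: for single draws from a plain Python list inside a loop, NumPy's per-call overhead (argument validation, array conversion) tends to dominate. A minimal benchmark sketch, not taken from the PR itself:

```python
import random
import timeit

import numpy as np

values = list(range(1000))

# Drawing one element at a time from a Python list: random.choice()
# skips NumPy's per-call setup work, which dominates when samples
# are drawn one by one inside a tight loop.
t_stdlib = timeit.timeit(lambda: random.choice(values), number=10_000)
t_numpy = timeit.timeit(lambda: np.random.choice(values), number=10_000)

print(f"random.choice:    {t_stdlib:.3f}s")
print(f"np.random.choice: {t_numpy:.3f}s")
```

Note that `np.random.choice` is still the better tool for vectorized draws (`size=n` in a single call); the win here is specific to one-at-a-time sampling.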

Test plan

  • Run existing tests to ensure no regressions
  • Verify notebook executes successfully with OnTime dataset
  • Confirm sampling now handles edge cases gracefully
  • Check that MC estimator performance has improved

🤖 Generated with Claude Code

Thomas Rueckstiess and others added 15 commits October 22, 2025 16:32
This commit adds a new MCEstimator class that estimates query selectivity
using Monte Carlo sampling. The estimator leverages trained ORiGAMi models
to predict the probability distribution over documents and estimates how
many documents match a given query.

New files:
- origami/inference/mc_estimator.py: Core MCEstimator implementation
- origami/utils/query_utils.py: Query evaluation and comparison utilities
- tests/inference/test_mc_estimator.py: Comprehensive test suite
- notebooks/example_origami_mc_ce.ipynb: Example notebook demonstrating usage

Modified files:
- origami/inference/__init__.py: Export MCEstimator
- origami/utils/__init__.py: Export query utility functions
- origami/utils/common.py: Add OPERATORS dict and comparison helper functions
- CLAUDE.md: Document MC estimator architecture and usage

The MCEstimator works by:
1. Calculating query region size based on discretized value spaces
2. Generating uniform samples within the query constraints
3. Computing model probabilities for each sample
4. Returning Monte Carlo estimate: P(query) = |E| * mean(f(x))

Includes utility functions for ground truth evaluation and error metrics
(q-error, relative error, absolute error) for comparing estimates.
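The four steps above can be condensed into a small routine. This is a sketch of the estimation idea only; `model_prob`, `sample_uniform`, and `region_size` are hypothetical stand-ins, not the actual `MCEstimator` API:

```python
import random

def mc_estimate(model_prob, sample_uniform, region_size, n=1000):
    """Monte Carlo estimate of P(query) = |E| * mean(f(x)).

    model_prob:     hypothetical callable returning the model density f(x)
    sample_uniform: hypothetical callable drawing one uniform sample
                    from the query region E
    region_size:    |E|, the size of the discretized query region
    """
    total = 0.0
    for _ in range(n):
        x = sample_uniform()      # step 2: uniform sample within constraints
        total += model_prob(x)    # step 3: model probability for the sample
    return region_size * (total / n)  # step 4: |E| * mean(f(x))

# Toy check: a uniform density f(x) = 1/|E| over a region of size 4
# should give P(query) = |E| * (1/|E|) = 1.0, whatever is sampled.
est = mc_estimate(lambda x: 0.25, lambda: random.randint(0, 3), region_size=4)
```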

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
…ents

- Introduced SortFieldsPipe to sort document fields alphabetically, ensuring consistent field ordering across all documents.
- Updated MCEstimator to detect the presence of SortFieldsPipe in the pipeline and sort generated samples accordingly.
- Enhanced the sort_dict_fields utility function to return an OrderedDict with alphabetically sorted keys.
- Added unit tests for SortFieldsPipe to verify its functionality across various scenarios, including handling of empty documents and preservation of value types.
- Updated existing tests to check for the correct detection of SortFieldsPipe in MCEstimator.
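The `sort_dict_fields` behavior described above can be sketched in a few lines. This is a simplified, top-level-only version for illustration; the actual utility may recurse into nested documents:

```python
from collections import OrderedDict

def sort_dict_fields(doc):
    # Return an OrderedDict with the document's keys in alphabetical
    # order, leaving the values untouched (top-level fields only).
    return OrderedDict(sorted(doc.items()))

doc = {"zip": "2000", "age": 42, "name": "Ada"}
print(list(sort_dict_fields(doc)))  # ['age', 'name', 'zip']
```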
…ator

Fixed a systematic 2x overestimation bug in the rejection sampling cardinality
estimator caused by duplicate document generation in the Sampler class.

## Bug Description

The Sampler._sample_batch() method was adding completed sequences twice:
1. When sequences reached PAD token (saved to completed_rows)
2. Again in the "handle uncompleted sequences" block after loop

This occurred because when all sequences completed, the code would break
BEFORE removing them from the idx tensor, causing them to be added again
at the end. Result: n samples requested → 2n documents returned, with each
document appearing exactly twice consecutively.

## Fix

Reordered the completion handling logic in _sample_batch():
- Move sequence removal BEFORE checking if all sequences are done
- Change from: save → break (if all done) → remove (never reached!)
- Change to: save → remove → break (if idx.size(0) == 0)

This ensures completed sequences are removed from the batch immediately,
preventing duplicate additions.
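The control-flow bug is easiest to see in a framework-free sketch (the real `_sample_batch()` operates on a torch `idx` tensor; the names and shapes below are illustrative only). Here every "sequence" completes on the first step, mimicking the all-done case that triggered the duplication:

```python
def sample_batch(n, buggy=True):
    # Simplified stand-in for Sampler._sample_batch().
    rows = [f"doc{i}" for i in range(n)]    # active sequences
    completed = []
    while rows:
        done = list(range(len(rows)))       # all sequences hit PAD at once
        for i in done:
            completed.append(rows[i])       # save completed sequences
        if buggy:
            break                           # bug: break BEFORE removing
        rows = [r for i, r in enumerate(rows) if i not in done]
        if not rows:                        # fix: remove, THEN break
            break
    # "handle uncompleted sequences" block: anything still in rows is
    # flushed to the output. With the buggy ordering, the just-saved
    # rows are still present here and get appended a second time.
    completed.extend(rows)
    return completed

print(len(sample_batch(3, buggy=True)))   # 6: every doc appears twice
print(len(sample_batch(3, buggy=False)))  # 3
```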

## Impact on RejectionEstimator

The RejectionEstimator computed the acceptance rate as len(accepted)/n, but the Sampler returned 2n documents for n requested, so the accepted count was doubled while the denominator stayed at n, producing a systematic 2x overestimation of selectivity.

After fix:
- MC Estimator: Median Q-Error 1.09x (unchanged, was already correct)
- Rejection Estimator: Median Q-Error 1.08x (was ~2x before fix)
- Both estimators now track ground truth accurately

Note: Occasional probabilistic duplicates in sampling output are expected
and correct behavior (model genuinely sampling same document by chance).
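The arithmetic behind the 2x overestimation can be shown with a toy acceptance-rate calculation (the predicate and sample values are illustrative, not the real estimator):

```python
def rejection_estimate(samples, matches_query, n_requested):
    # Selectivity estimate from rejection sampling: fraction of the
    # n requested model samples that satisfy the query predicate.
    accepted = [s for s in samples if matches_query(s)]
    return len(accepted) / n_requested

matches = lambda s: s < 10                       # 10% true selectivity
unique = list(range(100))                        # correct sampler output
doubled = [s for s in unique for _ in range(2)]  # duplicate-bug output

print(rejection_estimate(unique, matches, 100))   # 0.1 (correct)
print(rejection_estimate(doubled, matches, 100))  # 0.2 (2x overestimate)
```

The denominator stays at the requested n, so duplicating every sample exactly doubles the estimate, which matches the ~2x median Q-Error observed before the fix.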

## Files Changed

- origami/inference/sampler.py: Fixed _sample_batch() duplicate bug
- origami/inference/rejection_estimator.py: New rejection sampling estimator
- notebooks/example_origami_mc_ce.ipynb: Updated to compare both estimators
- tests/: Added test files for both new classes
- CLAUDE.md: Updated documentation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
…mators

Add Monte Carlo and Rejection Sampling Cardinality Estimators
…ndom.choice() instead of np.random.choice() in MCEstimator, notebook CE on OnTime data
@rueckstiess
Contributor Author

Closing - creating PR against fork instead
