-
Notifications
You must be signed in to change notification settings - Fork 1
Fix sampling, MC estimator improvements, and OnTime CE notebook #16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This commit adds a new MCEstimator class that estimates query selectivity using Monte Carlo sampling. The estimator leverages trained ORiGAMi models to predict the probability distribution over documents and estimates how many documents match a given query. New files: - origami/inference/mc_estimator.py: Core MCEstimator implementation - origami/utils/query_utils.py: Query evaluation and comparison utilities - tests/inference/test_mc_estimator.py: Comprehensive test suite - notebooks/example_origami_mc_ce.ipynb: Example notebook demonstrating usage Modified files: - origami/inference/__init__.py: Export MCEstimator - origami/utils/__init__.py: Export query utility functions - origami/utils/common.py: Add OPERATORS dict and comparison helper functions - CLAUDE.md: Document MC estimator architecture and usage The MCEstimator works by: 1. Calculating query region size based on discretized value spaces 2. Generating uniform samples within the query constraints 3. Computing model probabilities for each sample 4. Returning Monte Carlo estimate: P(query) = |E| * mean(f(x)) Includes utility functions for ground truth evaluation and error metrics (q-error, relative error, absolute error) for comparing estimates. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
…ents - Introduced SortFieldsPipe to sort document fields alphabetically, ensuring consistent field ordering across all documents. - Updated MCEstimator to detect the presence of SortFieldsPipe in the pipeline and sort generated samples accordingly. - Enhanced the sort_dict_fields utility function to return an OrderedDict with alphabetically sorted keys. - Added unit tests for SortFieldsPipe to verify its functionality across various scenarios, including handling of empty documents and preservation of value types. - Updated existing tests to check for the correct detection of SortFieldsPipe in MCEstimator.
…ator Fixed a systematic 2x overestimation bug in the rejection sampling cardinality estimator caused by duplicate document generation in the Sampler class. ## Bug Description The Sampler._sample_batch() method was adding completed sequences twice: 1. When sequences reached PAD token (saved to completed_rows) 2. Again in the "handle uncompleted sequences" block after loop This occurred because when all sequences completed, the code would break BEFORE removing them from the idx tensor, causing them to be added again at the end. Result: n samples requested → 2n documents returned, with each document appearing exactly twice consecutively. ## Fix Reordered the completion handling logic in _sample_batch(): - Move sequence removal BEFORE checking if all sequences are done - Change from: save → break (if all done) → remove (never reached!) - Change to: save → remove → break (if idx.size(0) == 0) This ensures completed sequences are removed from the batch immediately, preventing duplicate additions. ## Impact on RejectionEstimator The RejectionEstimator was counting acceptance rate as len(accepted)/n, but n samples were duplicated, so accepted samples were also duplicated, leading to systematic 2x overestimation of selectivity. After fix: - MC Estimator: Median Q-Error 1.09x (unchanged, was already correct) - Rejection Estimator: Median Q-Error 1.08x (was ~2x before fix) - Both estimators now track ground truth accurately Note: Occasional probabilistic duplicates in sampling output are expected and correct behavior (model genuinely sampling same document by chance). ## Files Changed - origami/inference/sampler.py: Fixed _sample_batch() duplicate bug - origami/inference/rejection_estimator.py: New rejection sampling estimator - notebooks/example_origami_mc_ce.ipynb: Updated to compare both estimators - tests/: Added test files for both new classes - CLAUDE.md: Updated documentation 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
…mators Add Monte Carlo and Rejection Sampling Cardinality Estimators
CI tests, migration to `uv`
…ndom.choice() instead of np.random.choice() in MCEstimator, notebook CE on OnTime data
Contributor
Author
|
Closing - creating PR against fork instead |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Summary
random.choice()instead ofnp.random.choice()Changes
origami/inference/sampler.py: Skip invalid samples during generation instead of raising errorsorigami/preprocessing/pipes.py: Avoid 'pipe not fitted' warnings by checking fit stateorigami/inference/mc_estimator.py: Userandom.choice()for better performancenotebooks/example_origami_ontime_ce.ipynb: New demo notebook showing cardinality estimation on OnTime datasetTest plan
🤖 Generated with Claude Code