Skip to content

fix(plan): do not use min/max stats for >= NaN float predicate#27675

Closed
jonathansergio wants to merge 2 commits into
pola-rs:mainfrom
jonathansergio:fix-float-geq-nan-skip-predicate
Closed

fix(plan): do not use min/max stats for >= NaN float predicate#27675
jonathansergio wants to merge 2 commits into
pola-rs:mainfrom
jonathansergio:fix-float-geq-nan-skip-predicate

Conversation

@jonathansergio
Copy link
Copy Markdown

Resolves #26805

What does this PR do?

Fixes an incorrect skip-batch optimization for floating-point columns: prevent using min/max statistics when the operator is >= and the literal is NaN.

Why?

Min/max stats exclude NaN. For col >= NaN, NaN satisfies the predicate (TotalOrd: NaN >= NaN is true). Generating a skip predicate like max(A) < NaN may evaluate to true (e.g. 0.0 < NaN) and incorrectly skip a batch that contains NaN, producing wrong results.

Changes

  • Updated can_use_min_max_stats in skip_batches.rs:
    • Split Gt and GtEq handling:
    • Keep Some(O::Gt) => lv_is_nan (safe: nothing is greater than NaN).
    • Change Some(O::GtEq) => false (unsafe: >= NaN may match NaN rows but stats exclude NaN).
  • Expanded docstring to explain rationale.
  • Tests in test_skip_batch_predicate.py:
    • Fixed an incorrect assertion for >= NaN.
    • Added a deterministic regression test test_float_geq_nan_skip_batch_predicate() that ensures col >= NaN does not produce a skip predicate and that rows with NaN are not skipped.
    • Removed the PR-number suffix from the test name.

Backward compatibility

No API or semantic changes for users. This only disables an unsafe optimization in a specific NaN case to avoid incorrect filtering.

Tests

  • Added regression test test_float_geq_nan_skip_batch_predicate.
  • Local tests for the modified module pass.
Screenshot from 2026-05-20 12-41-30

Notes

  • Hypothesis (the parametric test) discovered the failing example (e.g., [NaN, null, 0.0, null]). The root cause was introduced when enabling rowgroup/rowgroup skipping for float columns (feat: Enable rowgroup skipping for float columns #26805).
  • Recommend merging this PR before the CSV-related PR so the CSV PR CI will be stable.
  1. I used an AI assistant to analyze the predicate logic and craft the fix.
  2. I confirm that I have reviewed all changes myself, and I believe they are relevant and correct.

@ritchie46
Copy link
Copy Markdown
Member

This is already fixed in main. Your other PR can rebase on main and that should be fine.

@ritchie46 ritchie46 closed this May 20, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented May 20, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 81.27%. Comparing base (9d8a77e) to head (ab83baa).
⚠️ Report is 5 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #27675      +/-   ##
==========================================
- Coverage   81.48%   81.27%   -0.22%     
==========================================
  Files        1840     1840              
  Lines      256109   256110       +1     
  Branches     3181     3180       -1     
==========================================
- Hits       208696   208145     -551     
- Misses      46588    47139     +551     
- Partials      825      826       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

first-contribution First contribution by user fix Bug fix title needs formatting

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants