-
Notifications
You must be signed in to change notification settings - Fork 1.7k
feat: Add progress bar with ETA estimation to datafusion-cli #17867
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
This commit implements a comprehensive progress bar feature for the DataFusion CLI, providing real-time feedback during query execution with ETA estimation. Key features: - Progress bar with percentage, throughput, and ETA display - Kalman filter smoothed ETA estimation algorithm - TTY auto-detection (shows progress on terminal, disabled when piped) - Configurable progress styles (bar/spinner) and update intervals - Support for multiple data sources (Parquet, CSV, JSON) - Graceful fallback to spinner mode when totals are unknown CLI flags added: - --progress {auto|on|off}: Progress bar mode (default: auto) - --progress-style {bar|spinner}: Visual style (default: bar) - --progress-interval <ms>: Update frequency (default: 200ms) - --progress-estimator {kalman|linear}: ETA algorithm (default: kalman) The implementation uses DataFusion's existing ExecutionPlan metrics and statistics APIs to provide accurate progress tracking with minimal performance overhead. Addresses: apache#17812
- Replace deprecated statistics() with partition_statistics(None) - Use datafusion_common::instant::Instant for WASM compatibility - Replace tokio::spawn with SpawnedTask::spawn for better cancel safety - Fix needless borrow in metrics polling - Simplify progress reporter lifecycle and remove unused shutdown_tx - Add required datafusion-common and datafusion-common-runtime dependencies - Fix all formatting issues
@MrPowers and @timsaucer were talking about such a feature at the sync today |
I am not qualified to review the code, but the functionality is quite exciting!! |
I tried it locally and it's not working, it's always 0% during execution data generation: https://github.com/clflushopt/tpchgen-rs/tree/main/tpchgen-cli
I also tried several queries takes longer to execute, the result is the same. Do you have a working example to demo this feature? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @EeshanBembi
} | ||
|
||
/// Kalman filter update step | ||
fn kalman_update(&mut self, measured_rate: f64, dt: f64) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI, there is an algebra lib which may make code neat.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@xudong963 Great suggestion on the algebra library! I actually looked into using nalgebra for
the Kalman filter matrix operations.
After some consideration, I decided to stick with the current hand-optimized approach for a
few reasons:
- The 2x2 matrix operations are pretty straightforward and fast as-is
- Adding nalgebra would bring in quite a few extra dependencies
- The current code is actually quite readable with the explicit math
- Keeps the CLI nice and lightweight
That said, if we need more complex state estimation, we can consider using it
} | ||
|
||
/// Create a visual progress bar | ||
fn create_bar(&self, percent: f64) -> String { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI, there is a progress bar crate that you may like.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@xudong963 Thanks for the progress bar crate suggestion! I did take a look at indicatif and found that it's pretty much the standard for progress bars in Rust.
Ultimately decided to keep our custom implementation for a few reasons:
- It's working really well (tested it with massive queries processing 500M+ rows!)
- Zero extra dependencies to worry about
- We get exactly the database-specific formatting we want (rows/bytes/etc.)
- The whole thing is only ~200 lines, so pretty manageable
I figured the custom route made sense here but I'm open to changing it if that makes more sense
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some doc for the algorithms is helpful
And some tuning comments for Kalman params make a lot of sense for reviewers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@xudong963 Good call on the documentation! I've added much more detailed explanations for the
Kalman filter implementation.
Now includes:
- Full breakdown of the 2D state vector model and how it tracks
[progress_rate, acceleration]
- Explanation of the state transition and measurement models
- Step-by-step walkthrough of the prediction and update equations
- Clear guidance on tuning the noise parameters (with recommended ranges)
You can check it out in datafusion-cli/src/progress/estimator.rs
around lines 91-130. Should
be much easier for reviewers and future maintainers to understand what's going on under the
hood!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is a Kalman filter not overkill for an ETA estimate in a TUI? Perhaps a simple alpha filter would suffice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point.
Switched to a simple exponential moving average instead. Much cleaner.
- Fix metric name extraction in metrics_poll.rs to properly handle all MetricValue variants - Add comprehensive Kalman filter documentation with algorithm explanations - Add parameter tuning guidance for process_noise and measurement_noise - Fix missing progress field in cli-session-context example - Fix clippy warning in estimator tests The 0% progress issue was caused by hardcoded empty metric names. Progress tracking now works correctly for all query types with real-time row counts and time updates.
@2010YOUY01 Thanks for reporting this! The 0% progress issue has been fixed (changes will be Root Cause: The metric name extraction was hardcoded, so no metrics were Fix Applied: Updated Verification: Tested with your exact scenario, it now show real progress:
The progress bar now works correctly for all query types including your |
Resolves merge conflicts between progress bar functionality and instrumented object store registry: - Combined both progress bar config and instrumented registry fields in PrintOptions struct - Updated CLI arguments to support both features - Modified examples and tests to include both fields - Maintains backward compatibility while enabling both features
Resolves compilation error in command.rs test where PrintOptions initialization was missing the progress field after the merge that added progress bar functionality. Changes: - Add progress: ProgressConfig::default() to PrintOptions initialization - Import ProgressConfig in test module
|
||
impl TotalsVisitor { | ||
/// Check if this plan node is a data source (leaf node that reads data) | ||
fn is_data_source(&self, plan: &dyn ExecutionPlan) -> bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be a bit too brittle. Any changes in the execution plan names would break progress reporting. Is 'is leaf node' not a sufficient filter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, that was brittle. Changed it to check plan.children().is_empty() instead. Won't break if plan names change
- Replace complex Kalman filter with simple exponential moving average for ETA estimation - Alpha filter is more appropriate for TUI progress bars and easier to maintain - Fix brittle string-based data source detection in plan introspection - Use ExecutionPlan::children().is_empty() instead of string matching on plan names - Update config and CLI to use "Alpha" instead of "Kalman" estimator option - All tests pass with new alpha filter implementation
Summary
Adds a comprehensive progress bar feature to the DataFusion CLI with DuckDB-style ETA estimation,
providing real-time feedback during query execution.
CLI Usage
New flags added:
Example Output
Progress bar (when totals known):
▉▉▉▉▉▊▏ 63% 12.3M / 19.5M rows • 48.1 MB/s • ETA 00:27
Spinner (when totals unknown):
⠋ rows: 1.2M elapsed: 00:11
Implementation Details
TTY-aware display
Test Plan
Backwards Compatibility
Closes #17812