-
Notifications
You must be signed in to change notification settings - Fork 1.4k
[RDF] Add Skewness and Kurtosis actions to RDataFrame using Welford's algorithm #20517
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
Thanks for the pointers to the numerically stable algorithm, I think I will borrow this for the global statistics of ROOT's new histograms 😉 |
|
@hahnjo Thanks, It’s really cool that the algorithm might be useful for the new histograms too. 🚀 I’m currently working to implement the weighted version (West's algorithm) to support weighted skewness/kurtosis in RDF. The parallel merge steps get complex for cubic terms, but I verified the prototype against calculations, it holds up perfectly. If you end up needing the weighted merge derivations (adapted from West/Chan) for the histogram work, please let me know—happy to share my prototype if it saves you some digging! 😊 |
|
Hello. A bit of context around this PR might be useful. Did you propose these changes to address an issue related to a task at hand or about which you learned? In other words, is this done to fix someone's problems? |
|
@dpiparo Hello, thanks for looking into this. Motivation for this was primarily about the feature completeness. Since Mean and StdDev are already supported, adding Skewness and Kurtosis seemed like a logical step to bring RDataFrame closer to parity with other data analysis tools. I used Welford's algorithm to make sure the calculations are numerically stable, and I'm currently working on the statistical analysis for weighted distributions for these metrics as well. I was also happy to see Jonas mention above that this approach might be helpful for the new histograms, so hopefully, this proves to be a useful addition to the codebase. Best regards, |
|
Hello @edish-github, I could imagine extending RDF, but before that, I have a little request 😅: We typically use |
a011edb to
bc4d707
Compare
bc4d707 to
64041cc
Compare
|
Hello @hageboeck, thanks for reviewing! I've cleaned up the diff by reverting the unrelated whitespace changes and applying clang-format strictly to the new code. I also added the example usage snippets to the documentation to match the style of the other actions. On a side note, I am really enjoying working with the RDF internals. Currently I'm working on tackling weighted skewness/kurtosis next (e.g., Best regards, |
|
Hello @hageboeck (and @vepadulano @martamaja10), I've updated the PR and ran the tests locally. Best, |
|
Hello @hageboeck (@vepadulano and @martamaja10) It's been a little while, and I completely respect your time and busy schedules. I just wanted to gently bump this PR to see if we can get the workflows approved or if you have any initial feedback on the implementation. I am standing by for any feedback or changes required. Thanks! |
vepadulano
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @edish-github for this contribution! I have left some minor comments to start the review process. Even if I started the CI workflows now, there would be no new test to verify the new features. Could you add a few tests so that the CI will have something to process? For example you could add them in tree/dataframe/test/dataframe_simple.cxx
| #include <utility> // std::index_sequence | ||
| #include <vector> | ||
| #include <numeric> // std::accumulate in MeanHelper | ||
| #include <cmath> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to know that we need this header, please add a comment to the right signaling which function is used from this header
| } | ||
| }; | ||
|
|
||
| // Implements Welford's Online Algorithm for Skewness |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for reference, I would add a link to the explanation of the algorithm, e.g. https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm
| SkewnessHelper(SkewnessHelper &&) = default; | ||
| SkewnessHelper(const SkewnessHelper &) = delete; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should apply rule of five here (missing assignment operators and destructor)
| // Helper functions for RMergeableValue | ||
| std::unique_ptr<RMergeableValueBase> GetMergeableValue() const final { return nullptr; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a TODO here, without the mergeable value implementation we can't support these methods in distributed RDataFrame
|
Hi @vepadulano, really appreciate your detailed feedback! I have updated the PR to address the comments: Documentation: Added the explanation for and the Wikipedia references for Welford's algorithm. |
This Pull Request:
Adds
.Skewness()and.Kurtosis()actions to RDataFrame.Motivation
In high-energy physics, looking beyond the mean and standard deviation is usually critical.
But currently it requires users to either bin data into
TH1(losing precision) or write manual loops. This action allows for exact calculation in a single pass.Implementation Details
I used Welford's Online Algorithm, which allows us to calculate mean, variance, skewness, and kurtosis simultaneously in one pass. This is numerically stable and fits the RDataFrame parallel map-reduce pattern.
ROOT::RVec.However, benchmarks on my local machine (M3, Clang -O3) showed that the compiler auto-vectorizes the scalar Welford loop very effectively, so i went for the scalar implementationChanges or fixes:
SkewnessHelperandKurtosisHelperintree/dataframe/inc/ROOT/RDF/ActionHelpers.hxx..Skewness()and.Kurtosis()intree/dataframe/inc/ROOT/RDF/RInterface.hxx.tree/dataframe/inc/ROOT/RDF/InterfaceUtils.hxx.Checklist: