Using MetPy to split up testing/training/validation xarray datasets for Machine Learning #3579
Comments
@anacmontoya and I have been working on a function/notebook that might be a good starting point for this work.
Here's the code: https://gist.github.com/anacmontoya/35156d81fec1fe790b67916d2339d793
As for what to add, a few things jump out at me from a climate modeling perspective:
I don't think MetPy needs to fully support all of these, but it would be good to have a way of specifying the splits that could accommodate them.
All great points! When we first wrote this function, I was mainly trying to match the scikit-learn interface/output of train_test_split, but for xarray. Most of your requests are, I think, straightforward enough. I do like the idea of adding an even/odd year split, or more advanced sampling that is not as easily done.
I think a lot of it could be handled by just allowing the user to specify a list of elements for each split instead of the boundaries. What would be really keen is if those lists could contain just years, instead of the full set of datetimes within each year. The next step beyond that would then be to allow the user to change the date when the year begins/ends, so that you could use water years or winters or whatever depending on what you're studying...
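To make the year-list idea concrete, here is a rough sketch of what it could look like. `split_by_years`, its `splits` mapping, and the `year_start_month` argument are all hypothetical names made up for illustration, not anything in MetPy or the gist. It also assumes the time coordinate is a regular `datetime64` axis, and that a shifted year is labelled by the calendar year in which it ends (so water year 2002 runs Oct 2001 through Sep 2002).

```python
import numpy as np
import pandas as pd
import xarray as xr


def split_by_years(ds, splits, year_start_month=1, dim="time"):
    """Hypothetical helper: assign whole 'years' to named splits.

    splits maps a split name to a list of year labels. When
    year_start_month is not 1, each year is labelled by the calendar
    year in which it ends (an assumption made for this sketch).
    """
    month = ds[dim].dt.month.values
    year = ds[dim].dt.year.values
    if year_start_month != 1:
        # Months on or after the start month belong to the next label.
        year = np.where(month >= year_start_month, year + 1, year)
    return {
        name: ds.isel({dim: np.flatnonzero(np.isin(year, years))})
        for name, years in splits.items()
    }


# Synthetic daily series, 2000 through 2005.
time = pd.date_range("2000-01-01", "2005-12-31", freq="D")
ds = xr.Dataset(
    {"precip": ("time", np.random.default_rng(0).random(time.size))},
    coords={"time": time},
)

# Odd water years for training, even water years for testing.
out = split_by_years(
    ds,
    {"train": [2001, 2003, 2005], "test": [2002, 2004]},
    year_start_month=10,
)
```

Samples whose year label is not listed in any split (here, Jan to Sep 2000 and Oct to Dec 2005) are simply left out, which may or may not be the behavior you'd want by default.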
What should we add?
Creating testing/training/validation datasets is a key step in machine learning workflows. For climate/weather ML analysis, these datasets are usually split along the time dimension.
Scikit-learn has a function that does this for 2D arrays / pandas DataFrames (train_test_split), but it can't split xarray datasets.
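As a rough illustration of the kind of split being asked for, here is a minimal sketch of a chronological train/validation/test split on an xarray dataset. `split_by_time` and its arguments are hypothetical names chosen for this example, not an existing MetPy or scikit-learn function, and the sketch assumes a `datetime64` time coordinate (a cftime calendar would need different handling).

```python
import numpy as np
import pandas as pd
import xarray as xr


def split_by_time(ds, train_end, val_end, dim="time"):
    """Hypothetical helper: chronological train/validation/test split.

    Samples before train_end go to training, samples from train_end up
    to (but not including) val_end go to validation, and everything
    from val_end onward goes to testing.
    """
    times = ds[dim].values
    train_end = np.datetime64(train_end)
    val_end = np.datetime64(val_end)

    train_idx = np.flatnonzero(times < train_end)
    val_idx = np.flatnonzero((times >= train_end) & (times < val_end))
    test_idx = np.flatnonzero(times >= val_end)

    return tuple(ds.isel({dim: idx}) for idx in (train_idx, val_idx, test_idx))


# Synthetic daily data, 2000 through 2009, on a tiny lat/lon grid.
time = pd.date_range("2000-01-01", "2009-12-31", freq="D")
values = np.random.default_rng(0).normal(size=(time.size, 2, 3))
ds = xr.Dataset(
    {"t2m": (("time", "lat", "lon"), values)},
    coords={"time": time, "lat": [40.0, 41.0], "lon": [-105.0, -104.0, -103.0]},
)

train, val, test = split_by_time(ds, "2007-01-01", "2009-01-01")
```

Boolean masks are used here rather than `ds.sel(time=slice(...))` because label-based slices in xarray include both endpoints, which would put a boundary timestamp into two splits.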
Improvements on the scikit-learn implementation:
Big questions:
Reference
No response