Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Converters for Resampling <-> data.table #1162

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from
Draft

Conversation

mllg
Copy link
Member

@mllg mllg commented Sep 4, 2024

No description provided.

@jemus42
Copy link
Member

jemus42 commented Sep 4, 2024

Observations from implementing this for my use case where the as.data.table(Resampling) is stored as CSV:

  • I needed to ensure the dt is keyed appropriately (data.table::fread(..., key = c("set", "iteration"))) for the join(?) to work (x[list("train"), list(ids = list(row_id)), by = "iteration"]$ids), to avoid a warning. Maybe asserting keys is useful?
  • assert_factor(x$set, any.missing = FALSE) might be a little strict, given that it should suffice for set to be a character and assert_set_equal(unique(x$set), c("train", "test"))? When taking the detour to a CSV file set could end up being either character or factor I guess.

@berndbischl
Copy link
Member

@mllg can you please describe what this supposed does or solve?

@berndbischl
Copy link
Member

i mean, i can see what happens in the code. i think this is good, if we test and doc this well

@berndbischl
Copy link
Member

i would add a small feature like this:
a) if we go from a Resample (instantiated) to a table, this is easy.
b) if we read from a table, we go to a ResamplingCustom (if the user gives no extra type)
and we perform some very basic validity checks
c) if the user provides the type (we should at least for now support HO, subsamp, CV, rep-CV) we convert into that type (not custom) and perform a few more validity checks

@jemus42
Copy link
Member

jemus42 commented Oct 25, 2024

Motivation here was something I do for the survival benchmark:

Load tasks, then either

  • Instantiate resamplings on tasks and write resamplings to disk as portable format for future reproducibility (Resampling > data.table -> CSV)

or, if resamplings are already stored on disk:

  • Load resampling from CSV, convert to Resampling, and use those for the benchmark.

That way the the stored resamplings are always "the truth", which is nice because I found that instantiating resamplings with a set seed in a loop over the tasks was not so smart because excluding a task then changes the RNG state for subsequent instantiations.
Also, if you git commit the CSVs of the resamplings you immediately see in the diff if something unexpected happens here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants