feat: torch cache-able dataset, with sampling support #1591
Conversation
One small question
python/python/lance/sampler.py
Outdated
    if n >= len(dataset):
        # Don't have enough data in the dataset. Just do a full scan
        dataset.to_batches(columns=columns, batch_size=batch_size)
This should have a return/yield here?
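For illustration, a minimal self-contained sketch (stand-in names, not the PR's actual code) of why the missing yield matters and how `yield from` fixes it:

```python
def to_batches():
    # Stand-in for dataset.to_batches(): a generator of record batches.
    yield [1, 2]
    yield [3, 4]

def sampler(n, dataset_len):
    # Sketch of the reviewed logic. Without `yield from`, the generator
    # returned by to_batches() would be created and silently discarded;
    # `yield from` forwards its batches to the caller.
    if n >= dataset_len:
        yield from to_batches()
        return
    yield [0]  # placeholder for the actual sampling path
```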
And maybe also add this condition in test_sampler, where n > len(ds).
Good catch. Wrote too much Rust recently.
The cache looks useful. I don't know how useful the sampling functions are, given that they do nothing to randomize the order in which the dataset is iterated over, and ML users seem to care a lot about randomization.
Co-authored-by: Will Jones <[email protected]>
This can be done with shuffle in [torch.utils.data.DataLoader](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) or a tf.data pipeline. Sampling was mostly used for training IVF kmeans on GPU. If you are concerned about the API capability, I can mark it as a private API.
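As context for the shuffle point, here is a pure-Python sketch of what DataLoader's shuffle=True does conceptually (the names below are illustrative stand-ins, not the torch API):

```python
import random

class IndexableDataset:
    # Stand-in for a map-style torch Dataset: just __getitem__/__len__.
    def __init__(self, rows):
        self.rows = rows

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, i):
        return self.rows[i]

def iterate_shuffled(dataset, seed=None):
    # Conceptually what DataLoader(shuffle=True) does: permute the
    # indices, then fetch items in that randomized order.
    indices = list(range(len(dataset)))
    random.Random(seed).shuffle(indices)
    for i in indices:
        yield dataset[i]
```

The point is that randomization happens at the loader level via index permutation, so the dataset itself doesn't need to shuffle.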
python/python/lance/cache.py
Outdated
def __del__(self):
    if self.cache_dir is not None:
        self.cache_dir.cleanup()
        self.cache_dir = None
I don't think this is guaranteed to be called, since __del__ only runs during GC.
Maybe we can use atexit instead?
atexit is strictly later than GC?
GC isn't guaranteed to happen on an object, but atexit is.
Ok, added it to both atexit and __exit__.
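A minimal sketch of that combination (class and attribute names assumed here, not the PR's actual code): register the cleanup with atexit so the temp directory is removed at interpreter exit even if __del__ is never invoked, and also expose it through the context-manager protocol:

```python
import atexit
import tempfile

class CachedDataset:
    def __init__(self):
        self.cache_dir = tempfile.TemporaryDirectory()
        # atexit runs registered callbacks at interpreter shutdown,
        # regardless of whether this object was ever garbage-collected.
        atexit.register(self._cleanup)

    def _cleanup(self):
        # Idempotent: safe to call from both __exit__ and atexit.
        if self.cache_dir is not None:
            self.cache_dir.cleanup()
            self.cache_dir = None

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self._cleanup()
```

Note that atexit.register keeps a reference to the instance until shutdown, which is why the cleanup must be idempotent.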
        self.cache_dir.cleanup()
        self.cache_dir = None

    def __iter__(self):
nit: do we want to check and make sure the second stream doesn't iterate faster than the first stream?
Ok, added one.
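As an illustration of such a guard (a standalone sketch with assumed names, not the PR's implementation): only the first iterator may pull from the source, and a later iterator that outruns the cached prefix raises instead of reading ahead:

```python
class CachedIterable:
    # First iteration fills the cache from the source; later iterations
    # replay the cache and raise if they try to get ahead of the first.
    def __init__(self, source):
        self._source = iter(source)
        self._cache = []
        self._started = False
        self._done = False

    def __iter__(self):
        first = not self._started
        self._started = True
        i = 0
        while True:
            if i < len(self._cache):
                yield self._cache[i]
                i += 1
                continue
            if self._done:
                return
            if not first:
                raise RuntimeError(
                    "second stream iterated past the first stream"
                )
            try:
                self._cache.append(next(self._source))
            except StopIteration:
                self._done = True
                return
            yield self._cache[i]
            i += 1
```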
writer = pa.ipc.new_stream(str(self.cache_file), batch.schema)
writer.write(batch)
does this append or overwrite?