
feat: torch cache-able dataset, with sampling support #1591

Merged 19 commits from lei/sampler into main on Nov 14, 2023
Conversation

@eddyxu (Contributor) commented Nov 13, 2023

PyTorch Dataset

  • Work with PyTorch pipelines
  • Local filesystem-backed cache
  • Faster pseudo-random sampling
  • Support projection and filtering

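For readers skimming the PR, here is a rough sketch of the mechanism being added. This is not the code from the PR; all class, argument, and helper names below are illustrative only. It combines a one-time scan (with projection and filtering) with a local Arrow IPC cache file that later epochs replay.

import pyarrow as pa
import pyarrow.dataset as ds
import torch


class CachedBatchDataset(torch.utils.data.IterableDataset):
    """Illustrative sketch, not the PR's API: scan once, then replay
    Arrow record batches from a local filesystem cache."""

    def __init__(self, source: ds.Dataset, columns=None, filter=None,
                 batch_size=1024, cache_file="/tmp/lance_torch_cache.arrow"):
        self.source = source
        self.columns = columns        # projection
        self.filter = filter          # filtering
        self.batch_size = batch_size
        self.cache_file = cache_file  # local-filesystem-backed cache
        self._cached = False

    def __iter__(self):
        if not self._cached:
            writer = None
            for batch in self.source.to_batches(
                columns=self.columns, filter=self.filter,
                batch_size=self.batch_size,
            ):
                if writer is None:
                    # Open the IPC stream once; every later batch is appended.
                    writer = pa.ipc.new_stream(self.cache_file, batch.schema)
                writer.write(batch)
                yield batch
            if writer is not None:
                writer.close()
                self._cached = True
        else:
            # Later epochs read from the local cache instead of re-scanning.
            reader = pa.ipc.open_stream(pa.OSFile(self.cache_file, "rb"))
            for batch in reader:
                yield batch

Something shaped like this can be handed to torch.utils.data.DataLoader directly, since IterableDataset instances are consumed by the loader's workers.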
@AyushExel (Contributor) left a comment:

One small question


if n >= len(dataset):
    # Dont have enough data in the dataset. Just do a full scan
    dataset.to_batches(columns=columns, batch_size=batch_size)
Contributor:
This should have a return/yield here?

Contributor:
And maybe also add this condition in test_sampler, where n > len(ds).

Contributor Author:
Good catch. Wrote too much Rust recently.
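For reference, a minimal sketch of what the fixed branch could look like; the function name and surrounding code are illustrative, not the exact code in this PR:

def maybe_full_scan(dataset, n, columns, batch_size=1024):
    # Illustrative helper: if more rows are requested than exist,
    # fall back to a full scan and yield every batch.
    if n >= len(dataset):
        # Don't have enough data in the dataset. Just do a full scan.
        yield from dataset.to_batches(columns=columns, batch_size=batch_size)
        return
    ...  # otherwise, sample n rows pseudo-randomly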

@eddyxu eddyxu self-assigned this Nov 13, 2023
@eddyxu eddyxu changed the title from "feat: torch on disk sampler" to "feat: torch cache-able dataset, with sampling support" Nov 13, 2023
@eddyxu eddyxu requested a review from changhiskhan November 13, 2023 20:40
@eddyxu eddyxu marked this pull request as ready for review November 13, 2023 20:40
@wjones127 (Contributor) left a comment:

The cache looks useful. I don't know how useful the sampling functions are given they do nothing to randomize the order in which the dataset is iterated over, and ML users seem to care a lot about randomization.

@eddyxu (Contributor Author) commented Nov 13, 2023

> how useful the sampling functions are given they do nothing to randomize the order in which the dataset is iterated over, and ML users seem to care a lot about randomization.

This can be done with shuffle in torch.utils.data.DataLoader (https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) or a tf.data pipeline.

Sampling is mostly used for training IVF k-means on GPU. If you are concerned about the API capability, I can mark it as a private API.
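For example, order randomization can be layered on at the loader level rather than in the sampler itself. A small runnable sketch with a toy map-style dataset:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy map-style dataset; shuffle=True makes the DataLoader draw indices in a
# random order each epoch. (Iterable-style datasets have to shuffle inside
# the pipeline itself; DataLoader's shuffle flag does not apply to them.)
dataset = TensorDataset(torch.arange(10).unsqueeze(1), torch.arange(10))
loader = DataLoader(dataset, batch_size=4, shuffle=True)
for features, labels in loader:
    print(features.squeeze(1).tolist())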

Comment on lines 43 to 46
def __del__(self):
    if self.cache_dir is not None:
        self.cache_dir.cleanup()
        self.cache_dir = None
Contributor:
I don't think this is guaranteed to be called, since it is only called during GC.

Contributor:
Maybe we can use atexit instead?

Contributor Author:
atexit is strictly later than GC?

Contributor:
GC isn't guaranteed to happen on an object, but atexit is.

Contributor Author:
OK, added it to both atexit and __exit__.
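A minimal sketch of that combination (class, attribute, and method names are illustrative, not the PR's code):

import atexit
import tempfile


class CachedDataset:
    def __init__(self):
        self.cache_dir = tempfile.TemporaryDirectory()
        # atexit runs at interpreter shutdown even if __del__ is never invoked.
        atexit.register(self._cleanup)

    def _cleanup(self):
        if self.cache_dir is not None:
            self.cache_dir.cleanup()
            self.cache_dir = None

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Context-manager use cleans up deterministically, before shutdown.
        self._cleanup()

One caveat: atexit.register(self._cleanup) keeps a reference to the object until interpreter shutdown, so the __exit__ path is still the way to release the cache directory early.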

    self.cache_dir.cleanup()
    self.cache_dir = None

def __iter__(self):
Contributor:
nit: do we want to check and make sure the second stream doesn't iterate faster than the first stream?

Contributor Author:
Ok, added one.
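A rough sketch of that kind of guard (flag and helper names are illustrative):

class CachedDataset:
    _filling_cache = False  # illustrative flag, set while the first pass writes the cache

    def __iter__(self):
        if self._filling_cache:
            # Refuse to let a second stream run ahead of the iterator that is
            # still writing the cache file and read a partial cache.
            raise RuntimeError("dataset is already being iterated; cache is incomplete")
        self._filling_cache = True
        try:
            yield from self._scan_and_fill_cache()  # illustrative helper
        finally:
            self._filling_cache = False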

Comment on lines +55 to +56
writer = pa.ipc.new_stream(str(self.cache_file), batch.schema)
writer.write(batch)
Contributor:
does this append or overwrite?
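For what it's worth, pa.ipc.new_stream(path, schema) opens the file fresh, so calling it once per batch would overwrite the previous contents; keeping a single writer open and writing each batch to it is what appends to the same stream. A minimal, self-contained sketch (file name and data are made up):

import pyarrow as pa

schema = pa.schema([("x", pa.int64())])

# Open the stream writer once, then append each batch to the same stream.
writer = pa.ipc.new_stream("cache.arrow", schema)
for start in range(0, 6, 2):
    batch = pa.RecordBatch.from_arrays(
        [pa.array(list(range(start, start + 2)), type=pa.int64())],
        schema=schema,
    )
    writer.write(batch)
writer.close()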

@eddyxu eddyxu merged commit f450ebc into main Nov 14, 2023
@eddyxu eddyxu deleted the lei/sampler branch November 14, 2023 01:39