
feat: torch cache-able dataset, with sampling support #1591

Merged 19 commits from lei/sampler into main on Nov 14, 2023
Conversation

@eddyxu (Contributor) commented Nov 13, 2023

PyTorch Dataset

  • Work with PyTorch pipelines
  • Local filesystem-backed cache
  • Faster pseudo-random sampling
  • Support projection and filtering

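For readers skimming the PR, here is a rough sketch of the mechanism being added. This is not the code from the PR; all class, argument, and helper names below are illustrative only. It combines a one-time scan (with projection and filtering) with a local Arrow IPC cache file that later epochs replay.

import pyarrow as pa
import pyarrow.dataset as ds
import torch


class CachedBatchDataset(torch.utils.data.IterableDataset):
    """Illustrative sketch, not the PR's API: scan once, then replay
    Arrow record batches from a local filesystem cache."""

    def __init__(self, source: ds.Dataset, columns=None, filter=None,
                 batch_size=1024, cache_file="/tmp/lance_torch_cache.arrow"):
        self.source = source
        self.columns = columns        # projection
        self.filter = filter          # filtering
        self.batch_size = batch_size
        self.cache_file = cache_file  # local-filesystem-backed cache
        self._cached = False

    def __iter__(self):
        if not self._cached:
            writer = None
            for batch in self.source.to_batches(
                columns=self.columns, filter=self.filter,
                batch_size=self.batch_size,
            ):
                if writer is None:
                    # Open the IPC stream once; every later batch is appended.
                    writer = pa.ipc.new_stream(self.cache_file, batch.schema)
                writer.write(batch)
                yield batch
            if writer is not None:
                writer.close()
                self._cached = True
        else:
            # Later epochs read from the local cache instead of re-scanning.
            reader = pa.ipc.open_stream(pa.OSFile(self.cache_file, "rb"))
            for batch in reader:
                yield batch

Something shaped like this can be handed to torch.utils.data.DataLoader directly, since IterableDataset instances are consumed by the loader's workers.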
@AyushExel (Contributor) left a comment:

One small question


if n >= len(dataset):
    # Dont have enough data in the dataset. Just do a full scan
    dataset.to_batches(columns=columns, batch_size=batch_size)
Contributor:
This should have a return/yield here?

Contributor:
And maybe also add this condition in test_sampler, where n > len(ds).

Contributor Author:
Good catch. Wrote too much Rust recently.
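For reference, a minimal sketch of what the fixed branch could look like; the function name and surrounding code are illustrative, not the exact code in this PR:

def maybe_full_scan(dataset, n, columns, batch_size=1024):
    # Illustrative helper: if more rows are requested than exist,
    # fall back to a full scan and yield every batch.
    if n >= len(dataset):
        # Don't have enough data in the dataset. Just do a full scan.
        yield from dataset.to_batches(columns=columns, batch_size=batch_size)
        return
    ...  # otherwise, sample n rows pseudo-randomly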

@eddyxu eddyxu self-assigned this Nov 13, 2023
@eddyxu eddyxu changed the title from "feat: torch on disk sampler" to "feat: torch cache-able dataset, with sampling support" Nov 13, 2023
@eddyxu eddyxu requested a review from changhiskhan November 13, 2023 20:40
@eddyxu eddyxu marked this pull request as ready for review November 13, 2023 20:40
@wjones127 (Contributor) left a comment:

The cache looks useful. I don't know how useful the sampling functions are given they do nothing to randomize the order in which the dataset is iterated over, and ML users seem to care a lot about randomization.

@eddyxu (Contributor Author) commented Nov 13, 2023

> how useful the sampling functions are given they do nothing to randomize the order in which the dataset is iterated over, and ML users seem to care a lot about randomization.

This can be done with shuffle in torch.utils.data.DataLoader (https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) or a tf.data pipeline.

Sampling is mostly used for training IVF k-means on GPU. If you are concerned about the API capability, I can mark it as a private API.
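For example, order randomization can be layered on at the loader level rather than in the sampler itself. A small runnable sketch with a toy map-style dataset:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy map-style dataset; shuffle=True makes the DataLoader draw indices in a
# random order each epoch. (Iterable-style datasets have to shuffle inside
# the pipeline itself; DataLoader's shuffle flag does not apply to them.)
dataset = TensorDataset(torch.arange(10).unsqueeze(1), torch.arange(10))
loader = DataLoader(dataset, batch_size=4, shuffle=True)
for features, labels in loader:
    print(features.squeeze(1).tolist())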

Comment on lines 43 to 46
def __del__(self):
    if self.cache_dir is not None:
        self.cache_dir.cleanup()
        self.cache_dir = None
Contributor:
I don't think this is guaranteed to be called, since it is only called during GC.

Contributor:
Maybe we can use atexit instead?

Contributor Author:
atexit is strictly later than GC?

Contributor:
GC isn't guaranteed to happen on an object, but atexit is.

Contributor Author:
OK, added it to both atexit and __exit__.
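A minimal sketch of that combination (class, attribute, and method names are illustrative, not the PR's code):

import atexit
import tempfile


class CachedDataset:
    def __init__(self):
        self.cache_dir = tempfile.TemporaryDirectory()
        # atexit runs at interpreter shutdown even if __del__ is never invoked.
        atexit.register(self._cleanup)

    def _cleanup(self):
        if self.cache_dir is not None:
            self.cache_dir.cleanup()
            self.cache_dir = None

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Context-manager use cleans up deterministically, before shutdown.
        self._cleanup()

One caveat: atexit.register(self._cleanup) keeps a reference to the object until interpreter shutdown, so the __exit__ path is still the way to release the cache directory early.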

    self.cache_dir.cleanup()
    self.cache_dir = None

def __iter__(self):
Contributor:
nit: do we want to check and make sure the second stream doesn't iterate faster than the first stream?

Contributor Author:
Ok, added one.
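A rough sketch of that kind of guard (flag and helper names are illustrative):

class CachedDataset:
    _filling_cache = False  # illustrative flag, set while the first pass writes the cache

    def __iter__(self):
        if self._filling_cache:
            # Refuse to let a second stream run ahead of the iterator that is
            # still writing the cache file and read a partial cache.
            raise RuntimeError("dataset is already being iterated; cache is incomplete")
        self._filling_cache = True
        try:
            yield from self._scan_and_fill_cache()  # illustrative helper
        finally:
            self._filling_cache = False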

Comment on lines +55 to +56
writer = pa.ipc.new_stream(str(self.cache_file), batch.schema)
writer.write(batch)
Contributor:
does this append or overwrite?
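For what it's worth, pa.ipc.new_stream(path, schema) opens the file fresh, so calling it once per batch would overwrite the previous contents; keeping a single writer open and writing each batch to it is what appends to the same stream. A minimal, self-contained sketch (file name and data are made up):

import pyarrow as pa

schema = pa.schema([("x", pa.int64())])

# Open the stream writer once, then append each batch to the same stream.
writer = pa.ipc.new_stream("cache.arrow", schema)
for start in range(0, 6, 2):
    batch = pa.RecordBatch.from_arrays(
        [pa.array(list(range(start, start + 2)), type=pa.int64())],
        schema=schema,
    )
    writer.write(batch)
writer.close()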

@eddyxu eddyxu merged commit f450ebc into main Nov 14, 2023
@eddyxu eddyxu deleted the lei/sampler branch November 14, 2023 01:39