[feat] caching on demand #1653
Where does the cached file go? Is there a way to get the equivalent cached filesystem, so you can interact with the local cached files? I wonder what the connection with fsspec.open_local should be.
(A separate issue, which I think was mentioned elsewhere, is whether there should be an Intake "cache" reader that acts on any filetype in exactly this way and returns a datatype object of the same type as the original, but with the appropriate local path.)
Oh, I see now that open_local does what I'm looking for. It ends up calling open_many, which downloads the file to the cache. I was unaware of this method.
In [2]: import fsspec
In [3]: import fsspec.callbacks
In [4]: fsspec.callbacks.DEFAULT_CALLBACK = fsspec.callbacks.TqdmCallback()
In [5]: url = "github://albertdefusco:datasets@main/auto-mpg.csv"
In [6]: fn = fsspec.open_local(f"simplecache::{url}")
100%|███████████████████████████████████████████████████████████████████████████| 21021/21021 [00:00<00:00, 107260905.58it/s]
In [7]: fn
Out[7]: '/var/folders/cg/wyhz9cvx06vdt65k8tr31trc0000gp/T/tmpu9_vcbpt/5a6fcc477034509ca56b58f0c0db6a17baa534ecf3c31fde04ddda8ee9a0f7e8'
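On the question of where the cached file goes: the session above lands in a temporary directory, but the location can be pinned by passing cache_storage to the caching layer. The following is a minimal sketch, not part of this PR. It assumes fsspec's in-memory filesystem as a stand-in for the remote source (so no network is needed), and the path /demo/auto-mpg.csv and its contents are made up for illustration.

```python
import os
import tempfile

import fsspec

# Stand-in "remote" file in fsspec's in-memory filesystem.
# The path and contents are hypothetical, for illustration only.
mem = fsspec.filesystem("memory")
mem.pipe("/demo/auto-mpg.csv", b"mpg,cyl\n18.0,8\n")

# Directory chosen by the caller to hold cached copies.
cache_dir = tempfile.mkdtemp()

# Options for a layer in a chained URL are passed under that layer's name.
local_path = fsspec.open_local(
    "simplecache::memory://demo/auto-mpg.csv",
    simplecache={"cache_storage": cache_dir},
)

# The cached copy is an ordinary local file inside cache_dir.
print(local_path)
print(os.path.getsize(local_path))
```

As in the session above, the cached filename is a hash rather than the original name, so the returned path is the reliable way to locate the file.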
You're right that there is something missing when working with more than one file. I see where having a local fs for the cached items would be useful. I'll give it some thought.
In [1]: import fsspec
In [2]: import fsspec.callbacks
In [3]: fsspec.callbacks.DEFAULT_CALLBACK = fsspec.callbacks.TqdmCallback()
In [4]: url = "github://albertdefusco:datasets@main/weather/*.csv"
In [5]: paths = fsspec.open_local(f"filecache::{url}")
100%|███████████████████████████████████████████████████████████████████████████| 39576/39576 [00:00<00:00, 137868583.97it/s]
79382it [00:00, 291807397.13it/s]
119207it [00:00, 344820963.40it/s]
159723it [00:00, 368902432.70it/s]
198669it [00:00, 410481862.75it/s]
In [6]: paths
Out[6]:
['/var/folders/cg/wyhz9cvx06vdt65k8tr31trc0000gp/T/tmpgseqn860/5955f07f6293ad9a465f38ed02edec0131ce19985be9ad2c443ca8184b2b1065',
'/var/folders/cg/wyhz9cvx06vdt65k8tr31trc0000gp/T/tmpgseqn860/0c501daeded3e12aa5c2e473a8dc79893dc38cdeecf84a7581fe44513d11e747',
'/var/folders/cg/wyhz9cvx06vdt65k8tr31trc0000gp/T/tmpgseqn860/2e40e4c868226583370492e940167bdff8f309c46bb16f1794126ccbd0f13f38',
'/var/folders/cg/wyhz9cvx06vdt65k8tr31trc0000gp/T/tmpgseqn860/58c6f6669cd55443cca8722320a0f0cb13340be88e1049b0c3f8ba8fa722113b',
'/var/folders/cg/wyhz9cvx06vdt65k8tr31trc0000gp/T/tmpgseqn860/d7f5750dd42f0200576c8b039a995c981c1d603f2f6dbd6532f9b959d2dd5540']
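Until there is a dedicated local filesystem for cached items, one workaround is to hand the returned paths to fsspec's ordinary local filesystem. This is only a sketch under assumptions: the in-memory filesystem stands in for the remote source, and the /demo/weather files and their contents are hypothetical.

```python
import tempfile

import fsspec

# Hypothetical stand-in "remote" files in the in-memory filesystem.
mem = fsspec.filesystem("memory")
mem.pipe("/demo/weather/2020.csv", b"day,temp\n1,10\n")
mem.pipe("/demo/weather/2021.csv", b"day,temp\n1,12\n2,13\n")

cache_dir = tempfile.mkdtemp()

# A glob pattern makes open_local return a list of local paths.
paths = fsspec.open_local(
    "filecache::memory://demo/weather/*.csv",
    filecache={"cache_storage": cache_dir},
)

# Interact with the cached copies through the plain local filesystem.
local = fsspec.filesystem("file")
sizes = [local.size(p) for p in paths]
print(list(zip(paths, sizes)))
```

This does not recover the original filenames, since the cached names are hashes; it only shows that the cached files behave like any other local files.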
This PR solves a use case I have where I want to use filecache or simplecache to assist me in downloading a file and using the local path. Further, I may be working with extremely large files and I want to avoid calling .read(), since that places the whole file into memory after it has been cached.
Todo:
A new method called .cache_path(path) has been added to invoke caching (I'm open to a better name) and return the local filename.
Here's an example with simplecache; the same works for filecache.