Maybe the caching filesystem should take a callback= kwarg and apply these to the get operations? That seems a bit more obvious to me than adding things into open().
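A rough sketch of what that filesystem-level callback could look like, using stand-in classes only. `CountingCallback`, `FakeRemote` and `CachingWrapper` are hypothetical names invented for illustration and are not fsspec classes; the `set_size` / `relative_update` method names are merely modeled on fsspec's callback interface.

```python
import os
import tempfile


class CountingCallback:
    """Stand-in progress callback that records how many bytes have arrived."""

    def __init__(self):
        self.size = None
        self.value = 0

    def set_size(self, size):
        self.size = size

    def relative_update(self, inc=1):
        self.value += inc


class FakeRemote:
    """Stand-in remote store that keeps its files in a dict."""

    def __init__(self, files):
        self.files = files

    def get_file(self, rpath, lpath, callback=None, chunk=4):
        data = self.files[rpath]
        if callback is not None:
            callback.set_size(len(data))
        # Copy in small chunks so the callback sees incremental progress.
        with open(lpath, "wb") as out:
            for start in range(0, len(data), chunk):
                piece = data[start:start + chunk]
                out.write(piece)
                if callback is not None:
                    callback.relative_update(len(piece))


class CachingWrapper:
    """Stand-in whole-file cache: the callback is given once, at construction."""

    def __init__(self, remote, cache_dir, callback=None):
        self.remote = remote
        self.cache_dir = cache_dir
        self.callback = callback

    def open(self, path, mode="rb"):
        local = os.path.join(self.cache_dir, path.replace("/", "_"))
        if not os.path.exists(local):
            # Cache miss: download, reporting progress through the stored callback.
            self.remote.get_file(path, local, callback=self.callback)
        return open(local, mode)


cache_dir = tempfile.mkdtemp()
progress = CountingCallback()
fs = CachingWrapper(FakeRemote({"data/a.bin": b"0123456789"}), cache_dir, callback=progress)
with fs.open("data/a.bin") as f:
    payload = f.read()
print(progress.value, "of", progress.size, "bytes reported")
```

With this shape, `open()` keeps its usual signature and every cache miss reports progress; a cache hit never touches the callback.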
Hi Fsspec experts,
I'm presently using fsspec with one of the built-in caching file systems, `WholeFileCacheFileSystem`, and enjoy using the callback feature that is present in functions like `.get()` to communicate to users the download progress of getting files from remote file stores. I am dealing with some files that are several GB.

Since I largely want to pipe these remote files into downstream load functions (like `numpy.load()`, for example), I use `.open()`, which gets an IOReader, instead of `.get()`, since I just want a copy of the file in the cache and not elsewhere on the system. The trouble is that although these caching file systems use `.get_file` to pull from remote when the file is not present in the cache, a callback passed to `.open` doesn't make it all the way to the get call.

Since the default callback in fsspec does nothing, my users are left not knowing whether the program is doing anything while it's pulling files.

I propose modifying lines such as:
to include kwargs that are relevant to `get_file`. Unfortunately, the kwargs present in `WholeFileCacheFileSystem._open()` can include some parameters not accepted by `get_file`, so some filtering solution is needed.

For the package I'm working on, I have a pretty primitive fix just to verify it works:
but a potential full fix would be a more complete solution that covers all parameters in `get_file`. I'm curious what the maintainers think. I'd be happy to open a PR if something simple like this is fine for the built-in caching file systems.

Here's a minimal working example I am testing using the Tqdm plug-in, with fsspec version 2024.2.0 and above. I haven't tested this with file systems other than HTTP and a third-party file system for Hugging Face.
(Note: with the fix, on a second run no progress bar is shown because the file is loaded from the cache, as expected.)
Thanks!
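
For reference, one way the kwarg filtering mentioned above could be sketched: keep only the keyword arguments that the target function names explicitly in its signature. This is an illustrative assumption, not the fix from this issue and not fsspec code; `filter_kwargs`, `named_keyword_params` and the stand-in `get_file` below are hypothetical.

```python
import inspect


def named_keyword_params(func):
    """Names of parameters that func declares explicitly and accepts by keyword."""
    allowed_kinds = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    return {
        name
        for name, param in inspect.signature(func).parameters.items()
        if param.kind in allowed_kinds
    }


def filter_kwargs(func, kwargs, exclude=()):
    """Keep only the kwargs that func names explicitly, minus any in exclude."""
    allowed = named_keyword_params(func) - set(exclude)
    return {key: value for key, value in kwargs.items() if key in allowed}


# Stand-in with the general shape of a get_file method (hypothetical, not fsspec's).
def get_file(rpath, lpath, callback=None, outfile=None, **kwargs):
    return {"rpath": rpath, "lpath": lpath, "callback": callback, "extra": kwargs}


# kwargs as they might arrive in an _open() call; only "callback" is relevant here.
open_kwargs = {"mode": "rb", "block_size": 1024, "callback": "progress-bar"}

# rpath and lpath are passed positionally, so they are excluded from the filter.
passed = filter_kwargs(get_file, open_kwargs, exclude=("rpath", "lpath"))
result = get_file("remote/file.npy", "cache/abc", **passed)
print(passed)  # {'callback': 'progress-bar'}
```

Catch-all `**kwargs` parameters are deliberately ignored, so unrelated options such as `mode` are dropped instead of being forwarded to whatever the download call passes them on to. A fixed allow-list (for example just `callback`) would be an even simpler alternative.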