Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Passing callbacks to get_file calls used by CacheFileSystems #1623

Open
NickGeneva opened this issue Jun 10, 2024 · 1 comment
Open

Passing callbacks to get_file calls used by CacheFileSystems #1623

NickGeneva opened this issue Jun 10, 2024 · 1 comment

Comments

@NickGeneva
Copy link

Hi Fsspec experts,

I'm presently using Fsspec with one of the built in caching file systems, WholeFileCacheFileSystem, and enjoy using the callback feature that is present in functions like .get() to communicate to users the download progress of getting files from remote file stores. I am dealing is some files that are several Gb.

Since I largely want to pipe these remote files into downstream load functions (like numpy.load() for example) I use .open() which gets a IOReader instead of .get() since I just want a copy of it in the cache not elsewhere on the system. The trouble is that despite these caching file systems using .get_file to pull from remote when the file is not present in the cache, a callback passed with .open doesn't make it all the way to the get call.

Since the default callback in fsspec is to do nothing, my users are left without knowledge of if the program is doing anything while its pulling files.

I propose modifying lines such as:

To include kwargs that are relevant to get_file. Unfortunately present kwargs in the WholeFileCacheFileSystem._open() can have some parameters not accepted by get_file, thus some filtering solution is needed.

For the package I'm working on, I have a pretty primitive fix just to verify it works:

if "callback" in kwargs:  # Patch here
    self.fs.get_file(path, fn, callback=kwargs["callback"])
else:
    self.fs.get_file(path, fn)

but a potential full fix would be a more complete solution that covers all parameters in get_file. I'm curious what the maintainers think. I'd be happy to open a PR if something simple like this is fine for the built in caching file systems.

Here's a minimal working example I am testing using the Tqdm plug in, fsspec version 2024.2.0 and above. I haven't tested this with many other file systems other than Http and also a third party file system for Huggingface.

url = "https://upload.wikimedia.org/wikipedia/en/4/42/Master_chief_halo_infinite.png"
fs_http = HTTPFileSystem(block_size=2**10)
fs_http = WholeFileCacheFileSystem(
    fs=fs_http, cache_storage=".cache", same_names=True
)

filename = os.path.basename(url)
with TqdmCallback(
    tqdm_kwargs={
        "desc": "desc",
        "bar_format": f"Downloading {filename}: " +"{percentage:.0f}%|{bar}{r_bar}",
        "unit": "B",
        "unit_scale": True,
        "unit_divisor": 1024,
    },
) as callback:
    f = fs_http.open(url, callback=callback)

(Note, with the fix and on second run, no progress bar is shown because file is loaded from cache as expected.)

Thanks!

@martindurant
Copy link
Member

Maybe the caching filesystem should take a callback= kwarg and apply these to the get operations? That seems a bit more obvious to me than adding things into open().

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants