fsspec-related issue or question #825

Closed
kthyng opened this issue Jun 18, 2024 · 13 comments · Fixed by #828

Comments

@kthyng
Contributor

kthyng commented Jun 18, 2024

This might be user-error in which case I apologize in advance. I'm not very versed in the nuances of fsspec.

I can open this file with xarray in the usual way, but I can't figure out the right combination for my intake catalog, at least with intake v2.

import fsspec
import intake
import xarray as xr

# this works
url = "https://researchworkspace.com/files/42712165/lower-ci_system-B_2006-2007.nc"
of_local = fsspec.open_local(f"simplecache://::{url}", mode="rb")
ds = xr.open_dataset(of_local)

# this doesn't work
data = intake.readers.datatypes.NetCDF3(url)  # should this be HDF5?
initial_reader = data.to_reader("xarray:Dataset", chunks={})
initial_reader.read()

Hits error at

                f = fsspec.open(data.url, **(data.storage_options or {})).open()

https://github.com/intake/intake/blob/cdea0c903948187784451f4a92804c349b4da700/intake/readers/readers.py#L1113C24-L1113C45

which uses fsspec.open instead of fsspec.open_local so I assume that is the issue. Can I create the same behavior using data.storage_options?

Thanks!

@martindurant
Member

I had forgotten about open_local. Probably readers that might need this should have an extra flag in their kwargs - which might only be XArrayDatasetReader. It is highly unusual for a package to accept fsspec files sometimes, but require local file names (not file-like objects) in other situations!

@kthyng
Contributor Author

kthyng commented Jun 19, 2024

Probably readers that might need this should have an extra flag in their kwargs

Is there such a flag I can use to get the behavior I need?

which might only be XArrayDatasetReader

I am not sure what you mean about XArrayDatasetReader. I tried it for the reader to see if that is what you meant but I got the same error.

It is highly unusual for a package to accept fsspec files sometimes, but require local file names (not file-like objects) in other situations!

Should I be taking a different route altogether?

@martindurant
Member

Something like this

--- a/intake/readers/readers.py
+++ b/intake/readers/readers.py
@@ -1080,7 +1080,7 @@ class XArrayDatasetReader(FileReader):
     other_urls = {"xarray:open_dataset": "filename_or_obj"}
     url_arg = "paths"

-    def _read(self, data, **kw):
+    def _read(self, data, open_local=False, **kw):
         from xarray import open_dataset, open_mfdataset

         if "engine" not in kw:
@@ -1100,6 +1100,9 @@ class XArrayDatasetReader(FileReader):
                 kw["group"] = data.path
         if isinstance(data.url, (tuple, set, list)) or "*" in data.url:
             # use fsspec.open_files? (except for zarr)
+            if open_local:
+                files = fsspec.open_local(data.url, **(data.storage_options or {}))
+                return open_mfdataset(files, **kw)
             return open_mfdataset(data.url, **kw)
         else:

@martindurant
Member

And then

initial_reader = data.to_reader("xarray:Dataset", chunks={}, open_local=True)

(I haven't tested whether this works, but it would be something similar.)

@kthyng
Contributor Author

kthyng commented Jun 19, 2024

Oh ok sure, I can try making a PR like that. I wasn't quite getting your drift before.

@kthyng
Contributor Author

kthyng commented Jun 20, 2024

PR #828 addresses this. I wasn't careful with the url and in my intake catalog case above, the url wasn't in the "simplecache" encasing. Adding that as a case to the fsspec logic fixed my issue, but I also added the case for open_local thinking someone might want that in the future. I am not very familiar with fsspec so please take a look at how I did it.

This code snippet didn't work before and now does:

import intake
import xarray as xr
import fsspec

url = "https://researchworkspace.com/files/42712165/lower-ci_system-B_2006-2007.nc"
url = f"simplecache://::{url}"
data = intake.readers.datatypes.NetCDF3(url)
initial_reader = data.to_reader("xarray:Dataset", chunks={}, open_local=True)
initial_reader.read()
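
The URL handling described in this comment can be sketched roughly as follows. This is not the code from PR #828: `leading_protocol`, `wants_local_copy`, and `open_for_xarray` are hypothetical helper names, and the set of whole-file cache protocols is an assumption about which fsspec caching filesystems can hand back a local path. The idea is only that a URL chained behind "simplecache" (or an explicit `open_local=True`) goes through `fsspec.open_local`, while everything else keeps using `fsspec.open`.

```python
# Assumption: these fsspec caching protocols keep a whole local copy of the
# file and can therefore be used with fsspec.open_local.
WHOLE_FILE_CACHES = {"simplecache", "filecache"}


def leading_protocol(url: str) -> str:
    """Return the first protocol of a (possibly chained) fsspec URL, or ""."""
    first = url.split("::", 1)[0]
    if "://" in first:
        return first.split("://", 1)[0]
    # In a chain such as "simplecache::https://...", the first link is bare.
    return first if "::" in url else ""


def wants_local_copy(url: str, open_local: bool = False) -> bool:
    """True when the reader should ask fsspec for a local file name."""
    return open_local or leading_protocol(url) in WHOLE_FILE_CACHES


def open_for_xarray(url, storage_options=None, open_local=False):
    """Return a local path or a file-like object (hypothetical helper)."""
    import fsspec  # imported lazily so the helpers above need no dependencies

    so = storage_options or {}
    if wants_local_copy(url, open_local):
        # Returns a local file name; fsspec raises if the URL's filesystem
        # cannot provide local files (for example a plain https:// URL).
        return fsspec.open_local(url, **so)
    # Otherwise hand xarray an open file-like object, as the reader did before.
    return fsspec.open(url, **so).open()
```

Both spellings that appear in this thread, "simplecache::https://..." and "simplecache://::https://...", resolve to the same leading protocol in this sketch.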

So far I have been using the NetCDF3 reader for my netCDF4 files — is that your recommendation?

@martindurant
Member

So far I have been using the NetCDF3 reader for my netCDF4 files

I suppose they are all passing to xarray, which handles all these file types and is making a decent guess. netCDF3 is NOT the same as netCDF4, the latter of which is just a special case of an HDF5 file. So you probably wanted HDF5, and maybe then you didn't need a local copy at all.

>>> fsspec.filesystem("http").head("https://researchworkspace.com/files/42712165/lower-ci_system-B_2006-2007.nc")
b'\x89HDF....'
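
The check above relies on the file's leading bytes. A minimal sketch of the same idea, assuming the signature sits at offset 0: `sniff_netcdf_flavour` is a hypothetical helper, and the returned labels simply mirror the intake datatype names used in this thread. HDF5 (and therefore netCDF4) files start with an 8-byte signature, while classic netCDF files start with "CDF" followed by a version byte.

```python
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"   # 8-byte HDF5 signature
NETCDF_CLASSIC_VERSIONS = {1, 2, 5}     # classic, 64-bit offset, CDF-5


def sniff_netcdf_flavour(head: bytes) -> str:
    """Guess the container format from the first bytes of a file.

    Returns "HDF5", "NetCDF3", or "unknown". Only offset 0 is inspected,
    which is a simplifying assumption of this sketch.
    """
    if head.startswith(HDF5_SIGNATURE):
        return "HDF5"
    if len(head) >= 4 and head[:3] == b"CDF" and head[3] in NETCDF_CLASSIC_VERSIONS:
        return "NetCDF3"
    return "unknown"
```

The bytes returned by the `head` call shown above could be passed straight to this function to decide which datatype to declare in the catalog.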

@kthyng
Contributor Author

kthyng commented Jun 21, 2024

I have been experimenting with HDF5 too, since like you said netCDF4 is a special case of HDF5. But, for example, without the code in the PR, the following again doesn't work with the same error as before:

import intake
import xarray as xr
import fsspec

url = "https://researchworkspace.com/files/42712165/lower-ci_system-B_2006-2007.nc"
#url = f"simplecache://::{url}"
data = intake.readers.datatypes.HDF5(url)
initial_reader = data.to_reader("xarray:Dataset", chunks={})#, open_local=True)
initial_reader.read()

But, sometimes one might want to have the option to get the local cache of the file too.

I have been trying netCDF3 vs HDF5 in other cases too and haven't reached a clear conclusion about which to use, other than that netCDF3 seemed to work more reliably than HDF5.

@martindurant
Member

If you want the caching to be optional, you could make a user parameter, where url="{MAYBE_CACHE}https://researchworkspace.com/files/42712165/lower-ci_system-B_2006-2007.nc" and MAYBE_CACHE can have the values ["", "simplecache::"].
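
As a plain-Python illustration of that templating idea (how the parameter is actually declared in an intake catalog is not shown here, and `render_url` is a hypothetical helper), the optional prefix is just a string substituted into the URL:

```python
URL_TEMPLATE = (
    "{MAYBE_CACHE}https://researchworkspace.com/files/42712165/"
    "lower-ci_system-B_2006-2007.nc"
)
ALLOWED_PREFIXES = ("", "simplecache::")


def render_url(template: str, maybe_cache: str = "") -> str:
    """Fill in the optional caching prefix, rejecting unexpected values."""
    if maybe_cache not in ALLOWED_PREFIXES:
        raise ValueError(f"MAYBE_CACHE must be one of {ALLOWED_PREFIXES!r}")
    return template.format(MAYBE_CACHE=maybe_cache)
```

With the default empty prefix the URL is read remotely; with "simplecache::" the same entry is routed through a local cache.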

But your actual problem is "The HTTP server doesn't appear to support range requests.", which the remote reader needs in order to have random access. You are actually seeing fsspec/filesystem_spec#1631, fsspec/filesystem_spec#1626, which will be resolved somehow soon. I checked, and your server does not explicitly say that it doesn't accept Range. Actually, the server sends the whole file every time, whatever you ask for; I wonder if we should explicitly handle that case when the file is small enough.
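
One way to observe this server behaviour directly is to send a small Range request and see what comes back. The sketch below uses only the standard library; `classify_range_response` and `probe_range_support` are hypothetical names, and this is not how fsspec itself performs the check.

```python
import urllib.request


def classify_range_response(status: int, received: int, requested: int) -> str:
    """Interpret the reply to a request for `requested` bytes."""
    if status == 206:
        return "partial"      # server honoured the Range header
    if status == 200 and received > requested:
        return "whole-file"   # Range ignored, full body sent instead
    if status == 200:
        return "ambiguous"    # file may simply be no longer than the request
    return "error"


def probe_range_support(url: str, nbytes: int = 4, timeout: float = 30.0) -> str:
    """Ask for the first `nbytes` bytes of `url` and classify the reply."""
    request = urllib.request.Request(
        url, headers={"Range": f"bytes=0-{nbytes - 1}"}
    )
    # urlopen raises urllib.error.HTTPError for 4xx/5xx replies.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Read one byte more than requested: enough to notice an oversized
        # reply without downloading the whole file.
        body = response.read(nbytes + 1)
        return classify_range_response(response.status, len(body), nbytes)
```

A "whole-file" result corresponds to the situation described above, where the server returns everything regardless of the requested range.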

@kthyng
Contributor Author

kthyng commented Jun 21, 2024

If you want the caching to be optional, you could make a user parameter, where url="{MAYBE_CACHE}https://researchworkspace.com/files/42712165/lower-ci_system-B_2006-2007.nc" and MAYBE_CACHE can have the values ["", "simplecache::"].

Ah, interesting. Thanks for that idea. What I meant at the moment though is that I think it's worth adding the modification I made in the PR to intake because that way a user can have the "simplecache" prefix to their url and have it work; otherwise I don't think it will go through the logic in the xarray reader correctly.

But your actual problem is "The HTTP server doesn't appear to support range requests." ...

I see! Yes, every error I have been hitting has been that one, and "simplecache" has been an accidental workaround. Thanks.

@martindurant
Member

In [11]: import intake
    ...: import xarray as xr
    ...: import fsspec
    ...:
    ...: url = "https://researchworkspace.com/files/42712165/lower-ci_system-B_2006-2007.nc"
    ...: #url = f"simplecache://::{url}"
    ...: data = intake.readers.datatypes.HDF5(url, storage_options={"cache_type": "all"})
    ...: initial_reader = data.to_reader("xarray:Dataset", chunks={})#, open_local=True)
    ...: initial_reader.read()
<xarray.Dataset>
Dimensions:  (x: 26, y: 35, time: 7630)
Coordinates:
    lat      (x, y) float64 dask.array<chunksize=(26, 35), meta=np.ndarray>
    lon      (x, y) float64 dask.array<chunksize=(26, 35), meta=np.ndarray>
  * time     (time) datetime64[ns] 2006-11-12 ... 2007-11-10T19:00:00
    z        float64 ...
  * x        (x) int64 0 1 2 3 4 5 6 7 8 9 10 ... 16 17 18 19 20 21 22 23 24 25
  * y        (y) int64 0 1 2 3 4 5 6 7 8 9 10 ... 25 26 27 28 29 30 31 32 33 34
Data variables:
    u        (time, x, y) float64 dask.array<chunksize=(3815, 13, 18), meta=np.ndarray>
    v        (time, x, y) float64 dask.array<chunksize=(3815, 13, 18), meta=np.ndarray>
    crs      int32 ...

@martindurant
Member

(this caches in memory, not on disk)
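
To illustrate what "caches in memory" means here, a toy stand-in (not fsspec's implementation; `WholeFileMemoryCache` and its `fetch_all` callable are hypothetical): the whole file is fetched once, and every later read is served as a slice of those bytes, so no partial request is ever sent to the server. That is presumably why this option works even when the server ignores Range headers.

```python
class WholeFileMemoryCache:
    """Illustration only: download everything once, then slice from memory."""

    def __init__(self, fetch_all):
        # fetch_all is any zero-argument callable returning the file's bytes.
        self._fetch_all = fetch_all
        self._data = None

    def read(self, start: int, end: int) -> bytes:
        """Return bytes [start, end), fetching the file on first use."""
        if self._data is None:
            self._data = self._fetch_all()
        return self._data[start:end]
```

Nothing is written to disk in this scheme, so the data has to be downloaded again in each new session.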

@kthyng
Contributor Author

kthyng commented Jun 21, 2024

This looks like a good way to move forward with this catalog. Thank you!


2 participants