Best practice for using rename_target_files #392

Open
rsignell opened this issue Nov 9, 2023 · 5 comments

@rsignell

rsignell commented Nov 9, 2023

We've created some references for NetCDF3-64bit-offset files, which we need to do locally (since we can't access them from object storage).

So to convert the local combined64.json to point to the files on object storage, we did:

from kerchunk.utils import rename_target_files

rename_target_files('combined64.json',
                   {'/shared/users/rsignell/data/jzambon/nc64/his_20231027.nc':'s3://rsignellbucket1/jzambon/his_20231027.nc',
                    '/shared/users/rsignell/data/jzambon/nc64/his_20231029.nc':'s3://rsignellbucket1/jzambon/his_20231029.nc',
                    '/shared/users/rsignell/data/jzambon/nc64/his_20231030.nc':'s3://rsignellbucket1/jzambon/his_20231030.nc',
                    '/shared/users/rsignell/data/jzambon/nc64/his_20231031.nc':'s3://rsignellbucket1/jzambon/his_20231031.nc'},
                    'combined64_s3.json')

which works fine for our test case (4 files), but we are guessing there is a smarter way for lots of URLs, right?

@rsignell changed the title from "Best practice for using" to "Best practice for using rename_target_files" on Nov 9, 2023
@martindurant
Member

martindurant commented Nov 9, 2023

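You can phrase the dict as a comprehension:

{k: k.replace('/shared/users/rsignell/data/jzambon/nc64', 's3://rsignellbucket1/jzambon') for k in fs.glob('/shared/users/rsignell/data/jzambon/nc64/*.nc')}

where fs is a local filesystem instance, e.g. fsspec.filesystem("file"). Note that the replacement prefix has no trailing slash, so the resulting URLs don't end up with a double "//".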
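
For reference, here is a minimal sketch of the same idea wrapped in a helper, using only the standard library instead of an fsspec filesystem. The name build_rename_map and its parameters are made up for illustration and are not part of kerchunk. It assumes the remote objects keep the same base names as the local files, and that the local paths it produces match the paths recorded in the reference file exactly.

```python
import glob
import os


def build_rename_map(local_dir, remote_prefix, pattern="*.nc"):
    """Map each local file matching `pattern` to the same file name under `remote_prefix`.

    Hypothetical helper: assumes remote objects reuse the local base names.
    """
    # Normalise both sides so joining never produces a double slash.
    local_dir = local_dir.rstrip("/")
    remote_prefix = remote_prefix.rstrip("/")
    # Sorted only to make the resulting dict order predictable.
    paths = sorted(glob.glob(os.path.join(local_dir, pattern)))
    return {p: f"{remote_prefix}/{os.path.basename(p)}" for p in paths}


renames = build_rename_map(
    "/shared/users/rsignell/data/jzambon/nc64",
    "s3://rsignellbucket1/jzambon",
)
```

The resulting dict can then be passed as the second argument, exactly as in the original post: rename_target_files('combined64.json', renames, 'combined64_s3.json').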

That's all I can immediately think of.

Have you tried rename_target_files with parquet? I don't think that's come up yet.

@rsignell
Author

rsignell commented Nov 9, 2023

  1. I like the dict comprehension!
  2. I have not tried rename_target_files with parquet. And I guess we would need that if we were working with NetCDF3 or NetCDF3-64-bit-offset files where the references got too big and we want to access them from object storage!

@martindurant
Member

kerchunk.netCDF3 does support scanning directly from remote.

Following #391 (I think), version= should be inferred, so there is no need to pass it, and that PR also enables writing references directly to parquet during the initial file scan.

@martindurant
Member

martindurant commented Nov 9, 2023

https://github.com/fsspec/kerchunk/pull/391/files#diff-5fc74e71e7b4cdb2921590ed60a21bae7a9fe30c8ffeb62a3fb13066ebb01bbdR73 (and actually, it won't allow version= to override the value here, which maybe I should fix)

@martindurant
Member

OK, should now work whether you pass version= or not.
