Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow disabling inlining for MultiZarrToZarr #506

Merged
merged 5 commits into from
Sep 26, 2024

Conversation

maresb
Copy link
Contributor

@maresb maresb commented Sep 21, 2024

I am trying to combine some massive S3-based Zarr references. A MultiZarrToZarr operation which would otherwise take a minute would take days because the size of each file on S3 is being queried. Thus I want to be able to pass in inline_threshold=None so that my datasets combine in a reasonable timeframe.

@martindurant
Copy link
Member

I think this was already possible with inline_threashold=0 ?

@maresb
Copy link
Contributor Author

maresb commented Sep 24, 2024

Nope, if inline_threshold=0 then fs.info(fn) still gets called in order to ensure that the file size isn't negative. 😅

We could change this so that we use 0 in place of None. That would probably make more sense actually. Then my conditional self.inline is not None becomes self.inline > 0.

@martindurant
Copy link
Member

That sounds like a good plan

@maresb
Copy link
Contributor Author

maresb commented Sep 25, 2024

Ok @martindurant, that should take care of it.

Now the code should be functionally identical to the previous, except that the file size lookup is skipped when unneeded. In other words, no more inline_threshold=None but instead inline_threshold=0 is optimized.

kerchunk/combine.py Outdated Show resolved Hide resolved
kerchunk/combine.py Outdated Show resolved Hide resolved
@maresb
Copy link
Contributor Author

maresb commented Sep 25, 2024

Oh, I'm sorry, I'm on my laptop at the moment and it looks like I forgot to set up pre-commit 🙈

Please feel free to squash this PR.

@maresb
Copy link
Contributor Author

maresb commented Sep 25, 2024

Hmm, this test failure I can reproduce locally, also from main. I'm guessing it's a dependency issue.

@martindurant
Copy link
Member

@Anu-Ra-g , any idea?

>       assert (
            idx_df["typeOfLevel"][[42, 46, 49, 50]]
            == ["heightAboveGround", "isobaricInhPa", "surface", "heightAboveGround"]
        ).all()
E       AssertionError: assert False
E        +  where False = <bound method NDFrame._add_numeric_operations.<locals>.all of 42     True\n46     True\n49     True\n50    False\nName: typeOfLevel, dtype: bool>()
E        +    where <bound method NDFrame._add_numeric_operations.<locals>.all of 42     True\n46     True\n49     True\n50    False\nName: typeOfLevel, dtype: bool> = 42    heightA... dtype: object == ['heightAbove...tAboveGround']
E             Full diff:
E             - [
E             -  'heightAboveGround',
E             ? ^^                 --
E             + 42    heightAboveGround
E             ? ^^^^^^
E             -  'isobaricInhPa',...
E
E             ...Full output truncated (8 lines hidden), use '-vv' to show.all

@maresb
Copy link
Contributor Author

maresb commented Sep 25, 2024

Interesting, it seems like something's getting mangled. When I run locally the assert above passes for me, but the subsequent one fails.

Note that for me stepType is instant instead of the asserted accum.

idx_df.iloc[83]
varname                                                             wz
typeOfLevel                                              isobaricInhPa
stepType                                                       instant
name                                       Geometric vertical velocity
isobaricInhPa                                                    975.0
step                                                   0 days 06:00:00
time                                               2023-09-28 00:00:00
valid_time                                         2023-09-28 06:00:00
uri                            ~/repos/kerchunk/kerchunk/tests/gfs....
offset                                                        21234675
length                                                         1035696
inline_value                                                      None
surface                                                            NaN
heightAboveGround                                                  NaN
meanSea                                                            NaN
Name: 83, dtype: object

@maresb
Copy link
Contributor Author

maresb commented Sep 26, 2024

I'm guessing it's a dependency issue.

The culprit is eccodes v2.38.0.

@Anu-Ra-g
Copy link
Contributor

Anu-Ra-g commented Sep 26, 2024

@maresb I tried to emulate the above steps on my local system. The error is not occurring consistently.
And this object is still accum when I tried it locally,

varname                                                           watr
typeOfLevel                                                    surface
stepType                                                         accum
name                                                      Water runoff
isobaricInhPa                                                      NaN
step                                                   0 days 06:00:00
time                                               2023-09-28 00:00:00
valid_time                                         2023-09-28 06:00:00
uri                  ./kerchunk/tests/gfs.t00z.pgrb2.0p25.f006.test...
offset                                                        59215098
length                                                          348767
inline_value                                                      None
surface                                                            0.0
heightAboveGround                                                  NaN
meanSea                                                            NaN
Name: 83, dtype: object

Even though the error is occuring, the remaining steps work fine.
image

@maresb
Copy link
Contributor Author

maresb commented Sep 26, 2024

Thanks @Anu-Ra-g! May I ask which version of eccodes you are using?

@Anu-Ra-g
Copy link
Contributor

I'm on eccodes version 2.36.0

@maresb
Copy link
Contributor Author

maresb commented Sep 26, 2024

Since the test failure is unrelated to this PR, I recommend moving discussion to #508.

@Anu-Ra-g, to reproduce the failure you need to upgrade to eccodes v2.38.0.

@martindurant martindurant merged commit b59e0a6 into fsspec:main Sep 26, 2024
2 of 5 checks passed
@maresb maresb deleted the allow-no-inline branch September 26, 2024 13:19
@maresb
Copy link
Contributor Author

maresb commented Sep 26, 2024

Thanks a lot @martindurant for the merge!!! 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants