retry bulk rm #608

Merged: 9 commits, Mar 18, 2024
Conversation

martindurant (Member)

Fixes #558

@martindurant (Member Author)

cc @slevang @mil-ad - this can't really be tested in CI

@slevang (Contributor) left a review comment

This looks good - I tried it on my example in #558 and added some additional logging:

  1. Only a few random files end up in remaining.
  2. All the errors I'm seeing are 429.
  3. The failed files all go through on the first retry, and the overall request then succeeds.

Thanks!

@martindurant marked this pull request as ready for review on February 29, 2024 at 16:43
@martindurant (Member Author)

I did some cleanup here, if people wouldn't mind trying again.

@mil-ad commented Feb 29, 2024

Thanks @martindurant! I'll try reproducing tomorrow.

@slevang (Contributor) commented Feb 29, 2024

On the latest commit I'm getting failures like this now:

  File "/opt/miniconda3/envs/salient/lib/python3.11/site-packages/zarr/storage.py", line 1531, in rmdir
    self.fs.rm(store_path, recursive=True)
  File "/opt/miniconda3/envs/salient/lib/python3.11/site-packages/fsspec/asyn.py", line 118, in wrapper
    return sync(self.loop, func, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda3/envs/salient/lib/python3.11/site-packages/fsspec/asyn.py", line 103, in sync
    raise return_result
  File "/opt/miniconda3/envs/salient/lib/python3.11/site-packages/fsspec/asyn.py", line 56, in _runner
    result[0] = await coro
                ^^^^^^^^^^
  File "/opt/miniconda3/envs/salient/lib/python3.11/site-packages/gcsfs/core.py", line 1286, in _rm
    raise errors[0]
OSError: {
  "code": 503,
  "message": "Server is not responding",
  "errors": [
    {
      "message": "Server is not responding",
      "domain": "global",
      "reason": "backendError"
    }
  ]
}

When I use the previous commit, today I am also getting 503s instead of 429s, but the failed files are correctly filtered out and retried.

@martindurant (Member Author)

I made a small change to batch the leftovers, if any, instead of repeating for each batch.

I don't see how a 503 is escaping, unless you are running out of retries (we could add logging here):

    if code in [200, 204]:
        out.append(path)        # delete succeeded
    elif code in errs and i < 5:
        remaining.append(path)  # retriable error with retries left: try again
    else:
        ...

This should mean a 503 gets the path put back in remaining and retried, so long as retries are left. Maybe the server really was flaky? errs contains [500, 501, 502, 503, 504, 408, 429].
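
For illustration, a minimal self-contained sketch of this retry pattern (not the gcsfs implementation; delete_batch is a hypothetical callable returning (path, status_code) pairs): successes are dropped, retriable codes are collected into remaining for another pass, and anything left over is raised at the end.

    # Sketch only, under the assumptions above; not gcsfs code.
    RETRIABLE = {500, 501, 502, 503, 504, 408, 429}
    MAX_RETRIES = 5

    def delete_with_retries(delete_batch, paths):
        remaining = list(paths)
        failures = []                              # (path, code) that are not retriable
        for attempt in range(MAX_RETRIES):
            results = delete_batch(remaining)      # hypothetical batch-delete call
            remaining = []
            for path, code in results:
                if code in (200, 204):
                    continue                       # deleted successfully
                elif code in RETRIABLE:
                    remaining.append(path)         # retry on the next pass
                else:
                    failures.append((path, code))  # permanent failure
            if not remaining:
                break
        if remaining or failures:
            raise OSError(f"could not delete: {remaining + [p for p, _ in failures]}")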

@slevang (Contributor) commented Mar 1, 2024

Of course I can't trip any errors now that I try again (regardless of commit). I'll try again tomorrow.

When I hit the 503 earlier, I did have a print statement that should have logged if anything went into remaining, but I never saw it, which suggests the error was somehow not getting flagged as retriable.

@slevang (Contributor) left a review comment

It just had the wrong enumerator for the current retry state. With this change it seems to work well.

Any particular reason for dropping the batchsize to 20? You can apparently go as high as 2000, although I expect failures may become more likely if the batch is too big. I don't know if it makes any difference to speed, and this is already much faster than gsutil.
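
As a hypothetical illustration of the enumerator mix-up mentioned above (names and constants here are illustrative, not from gcsfs): the value compared against the retry limit has to be the retry attempt, not the index of whatever else is being enumerated.

    # Sketch only: 'attempt' must count retry passes, not batches.
    MAX_RETRIES = 5
    RETRIABLE = frozenset({500, 501, 502, 503, 504, 408, 429})

    def classify(code, attempt):
        """Decide what to do with one result, given the current retry attempt."""
        if code in (200, 204):
            return "ok"
        if code in RETRIABLE and attempt < MAX_RETRIES:
            return "retry"    # retries remain; queue for another pass
        return "error"        # give up and surface the error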

@slevang (Contributor) commented Mar 2, 2024

On batch size, Google says this:

You should not include more than 100 calls in a single batch request. If you need to make more calls than that, use multiple batch requests. The total batch request payload must be less than 10MB.
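
A minimal sketch of respecting that documented limit (function and constant names are illustrative, not the gcsfs API): split the list of objects into batch requests of at most 100 calls each.

    # Illustrative only: chunk object paths into batch requests of <= 100 calls.
    MAX_CALLS_PER_BATCH = 100

    def chunked(paths, size=MAX_CALLS_PER_BATCH):
        for i in range(0, len(paths), size):
            yield paths[i : i + size]

    # e.g. 250 deletions become three batch requests of 100, 100 and 50 calls
    sizes = [len(c) for c in chunked([f"obj-{i}" for i in range(250)])]
    assert sizes == [100, 100, 50]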

@martindurant (Member Author)

I was thinking that, with the requests being concurrent, smaller batches would be better. Maybe that's wrong, since sending the requests should take about the same bandwidth regardless.

We have two batch sizes here: in the outer _rm and the inner _rm_files. Should they be the same?

Co-authored-by: Sam Levang <[email protected]>
@slevang (Contributor) commented Mar 3, 2024

I'm not sure, but that seems ok. Then the outer _rm ends up handling all the batching, and each call to _rm_files only runs a single chunk, so the code could actually be simplified to drop a loop from _rm_files, right?

Also, should a user be able to configure batchsize from the public API? AbstractFileSystem.rm() doesn't pass through additional **kwargs.

@martindurant (Member Author) commented Mar 3, 2024

Actually, it's this one: https://github.com/fsspec/filesystem_spec/blob/master/fsspec/asyn.py#L319

And you are right: if _rm_files is not designed to be called from outside, there's no reason for it to do its own internal looping.
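
A sketch of the simplification being agreed on here (illustrative names, not the actual gcsfs code): the outer remove handles all the batching, and the inner helper issues exactly one batch request with no loop of its own.

    # Illustrative only: the outer call batches, the inner helper handles one chunk.
    import asyncio

    async def _rm_one_chunk(paths):
        # placeholder for a single HTTP batch-delete request
        await asyncio.sleep(0)
        return [(p, 204) for p in paths]

    async def _rm(paths, batchsize=100):
        results = []
        for i in range(0, len(paths), batchsize):
            results.extend(await _rm_one_chunk(paths[i : i + batchsize]))
        return results

    # usage: asyncio.run(_rm([f"obj-{i}" for i in range(250)]))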

@martindurant merged commit ffe56bd into fsspec:main on Mar 18, 2024 (5 checks passed)
@martindurant deleted the rm_retry branch on March 18, 2024 at 19:56
@slevang (Contributor) commented Mar 18, 2024

Thanks!

copybara-service bot pushed a commit to google/Xee that referenced this pull request on Jul 24, 2024:

This updates the xee dataflow example to prevent users from accidentally deleting their storage bucket when running the example.

This is a simple fix for a bug in a recent [push to gcsfs](fsspec/gcsfs#608) which, paired with some [logic in the zarr library for writing datasets](https://github.com/zarr-developers/zarr-python/blob/df4c25f70c8a1e2b43214d7f26e80d34df502e7e/src/zarr/v2/storage.py#L567), allows users to accidentally remove their bucket when writing to the root of a cloud storage bucket. This is problematic because users may have other data in the bucket they are writing to, and accidental deletion of the bucket removes everything.

Changes in this PR include:
1. pinning the `gcsfs` version to `<=2024.4.0`, before the PR that introduced the bug
2. pointing the example to write to a subdirectory of the bucket

PiperOrigin-RevId: 655683820
copybara-service bot pushed a commit to google/Xee that referenced this pull request on Jul 25, 2024:

This updates the xee dataflow example to prevent users from accidentally deleting their storage bucket when running the example.

This is a simple fix for a bug in a recent [push to gcsfs](fsspec/gcsfs#608) which, paired with some [logic in the zarr library for writing datasets](https://github.com/zarr-developers/zarr-python/blob/df4c25f70c8a1e2b43214d7f26e80d34df502e7e/src/zarr/v2/storage.py#L567), allows users to accidentally remove their bucket when writing to the root of a cloud storage bucket. This is problematic because users may have other data in the bucket they are writing to, and accidental deletion of the bucket removes everything.

Changes in this PR include:
1. pinning the `gcsfs` version to `<=2024.2.0`, before the PR that introduced the bug
2. pointing the example to write to a subdirectory of the bucket

PiperOrigin-RevId: 656046609
Closes: Errors when deleting a directory with huge number of files (#558)