retry bulk rm #608

Merged: 9 commits, Mar 18, 2024
Conversation

martindurant (Member)

Fixes #558

@martindurant (Member Author)

cc @slevang @mil-ad - this can't really be tested in CI

@slevang (Contributor) left a review comment

This looks good - I tried it on my example in #558 and added some additional logging:

  1. Only a few random files end up in remaining.
  2. All the errors I'm seeing are 429.
  3. The failed files all go through on the first retry, and the overall request then succeeds.

Thanks!

@martindurant marked this pull request as ready for review on February 29, 2024 at 16:43
@martindurant (Member Author)

I did some cleanup here, if people wouldn't mind trying again.

@mil-ad commented Feb 29, 2024

Thanks @martindurant! I'll try reproducing tomorrow.

@slevang (Contributor) commented Feb 29, 2024

On the latest commit I'm getting failures like this now:

  File "/opt/miniconda3/envs/salient/lib/python3.11/site-packages/zarr/storage.py", line 1531, in rmdir
    self.fs.rm(store_path, recursive=True)
  File "/opt/miniconda3/envs/salient/lib/python3.11/site-packages/fsspec/asyn.py", line 118, in wrapper
    return sync(self.loop, func, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda3/envs/salient/lib/python3.11/site-packages/fsspec/asyn.py", line 103, in sync
    raise return_result
  File "/opt/miniconda3/envs/salient/lib/python3.11/site-packages/fsspec/asyn.py", line 56, in _runner
    result[0] = await coro
                ^^^^^^^^^^
  File "/opt/miniconda3/envs/salient/lib/python3.11/site-packages/gcsfs/core.py", line 1286, in _rm
    raise errors[0]
OSError: {
  "code": 503,
  "message": "Server is not responding",
  "errors": [
    {
      "message": "Server is not responding",
      "domain": "global",
      "reason": "backendError"
    }
  ]
}

When I use the previous commit, today I am also getting 503s instead of 429s, but the failed files are correctly filtered out and retried.

@martindurant (Member Author)

I made a small change to batch the leftovers, if any, instead of repeating for each batch.

I don't see how a 503 is escaping, unless you are running out of retries (we could add logging here):

    if code in [200, 204]:
        out.append(path)        # delete succeeded
    elif code in errs and i < 5:
        remaining.append(path)  # retriable error with retries left: try again
    else:
        ...

This should mean a 503 gets the path put back in remaining and retried, so long as retries are left. Maybe the server really was flaky? errs contains [500, 501, 502, 503, 504, 408, 429].
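
For illustration, a minimal self-contained sketch of this retry pattern (not the gcsfs implementation; delete_batch is a hypothetical callable returning (path, status_code) pairs): successes are dropped, retriable codes are collected into remaining for another pass, and anything left over is raised at the end.

    # Sketch only, under the assumptions above; not gcsfs code.
    RETRIABLE = {500, 501, 502, 503, 504, 408, 429}
    MAX_RETRIES = 5

    def delete_with_retries(delete_batch, paths):
        remaining = list(paths)
        failures = []                              # (path, code) that are not retriable
        for attempt in range(MAX_RETRIES):
            results = delete_batch(remaining)      # hypothetical batch-delete call
            remaining = []
            for path, code in results:
                if code in (200, 204):
                    continue                       # deleted successfully
                elif code in RETRIABLE:
                    remaining.append(path)         # retry on the next pass
                else:
                    failures.append((path, code))  # permanent failure
            if not remaining:
                break
        if remaining or failures:
            raise OSError(f"could not delete: {remaining + [p for p, _ in failures]}")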

@slevang (Contributor) commented Mar 1, 2024

Of course I can't trip any errors now that I try again (regardless of commit). I'll try again tomorrow.

When I hit the 503 earlier, I did have a print statement that should have logged if anything went into remaining, but I never saw it, which suggests the error was somehow not getting flagged as retriable.

@slevang (Contributor) left a review comment

It just had the wrong enumerator for the current retry state. With this change it seems to work well.

Any particular reason for dropping the batchsize to 20? You can apparently go as high as 2000, although I expect failures may become more likely if the batch is too big. I don't know if it makes any difference to speed, and this is already much faster than gsutil.
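
As a hypothetical illustration of the enumerator mix-up mentioned above (names and constants here are illustrative, not from gcsfs): the value compared against the retry limit has to be the retry attempt, not the index of whatever else is being enumerated.

    # Sketch only: 'attempt' must count retry passes, not batches.
    MAX_RETRIES = 5
    RETRIABLE = frozenset({500, 501, 502, 503, 504, 408, 429})

    def classify(code, attempt):
        """Decide what to do with one result, given the current retry attempt."""
        if code in (200, 204):
            return "ok"
        if code in RETRIABLE and attempt < MAX_RETRIES:
            return "retry"    # retries remain; queue for another pass
        return "error"        # give up and surface the error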

@slevang (Contributor) commented Mar 2, 2024

On batch size, Google says this:

You should not include more than 100 calls in a single batch request. If you need to make more calls than that, use multiple batch requests. The total batch request payload must be less than 10MB.
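
A minimal sketch of respecting that documented limit (function and constant names are illustrative, not the gcsfs API): split the list of objects into batch requests of at most 100 calls each.

    # Illustrative only: chunk object paths into batch requests of <= 100 calls.
    MAX_CALLS_PER_BATCH = 100

    def chunked(paths, size=MAX_CALLS_PER_BATCH):
        for i in range(0, len(paths), size):
            yield paths[i : i + size]

    # e.g. 250 deletions become three batch requests of 100, 100 and 50 calls
    sizes = [len(c) for c in chunked([f"obj-{i}" for i in range(250)])]
    assert sizes == [100, 100, 50]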

@martindurant (Member Author)

I was thinking that, with the requests being concurrent, smaller batches would be better. Maybe that's wrong, since sending the requests should take about the same bandwidth regardless.

We have two batch sizes here: in the outer _rm and the inner _rm_files. Should they be the same?

Co-authored-by: Sam Levang <[email protected]>
@slevang (Contributor) commented Mar 3, 2024

I'm not sure, but that seems ok. Then the outer _rm ends up handling all the batching, and each call to _rm_files only runs a single chunk, so the code could actually be simplified to drop a loop from _rm_files, right?

Also, should a user be able to configure batchsize from the public API? AbstractFileSystem.rm() doesn't pass through additional **kwargs.

@martindurant (Member Author) commented Mar 3, 2024

Actually, it's this one: https://github.com/fsspec/filesystem_spec/blob/master/fsspec/asyn.py#L319

And you are right: if _rm_files is not designed to be called from outside, there's no reason for it to do its own internal looping.
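
A sketch of the simplification being agreed on here (illustrative names, not the actual gcsfs code): the outer remove handles all the batching, and the inner helper issues exactly one batch request with no loop of its own.

    # Illustrative only: the outer call batches, the inner helper handles one chunk.
    import asyncio

    async def _rm_one_chunk(paths):
        # placeholder for a single HTTP batch-delete request
        await asyncio.sleep(0)
        return [(p, 204) for p in paths]

    async def _rm(paths, batchsize=100):
        results = []
        for i in range(0, len(paths), batchsize):
            results.extend(await _rm_one_chunk(paths[i : i + batchsize]))
        return results

    # usage: asyncio.run(_rm([f"obj-{i}" for i in range(250)]))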

@martindurant merged commit ffe56bd into fsspec:main on Mar 18, 2024 (5 checks passed)
@martindurant deleted the rm_retry branch on March 18, 2024 at 19:56
@slevang (Contributor) commented Mar 18, 2024

Thanks!

copybara-service bot pushed a commit to google/Xee that referenced this pull request on Jul 24, 2024:

This updates the xee dataflow example to prevent users from accidentally deleting their storage bucket when running the example.

This is a simple fix for a bug in a recent [push to gcsfs](fsspec/gcsfs#608) which, paired with some [logic in the zarr library for writing datasets](https://github.com/zarr-developers/zarr-python/blob/df4c25f70c8a1e2b43214d7f26e80d34df502e7e/src/zarr/v2/storage.py#L567), allows users to accidentally remove their bucket when writing to the root of a cloud storage bucket. This is problematic because users may have other data in the bucket they are writing to, and accidental deletion of the bucket removes everything.

Changes in this PR include:
1. pinning the `gcsfs` version to `<=2024.4.0`, before the PR that introduced the bug
2. pointing the example to write to a subdirectory of the bucket

PiperOrigin-RevId: 655683820
copybara-service bot pushed a commit to google/Xee that referenced this pull request on Jul 25, 2024:

This updates the xee dataflow example to prevent users from accidentally deleting their storage bucket when running the example.

This is a simple fix for a bug in a recent [push to gcsfs](fsspec/gcsfs#608) which, paired with some [logic in the zarr library for writing datasets](https://github.com/zarr-developers/zarr-python/blob/df4c25f70c8a1e2b43214d7f26e80d34df502e7e/src/zarr/v2/storage.py#L567), allows users to accidentally remove their bucket when writing to the root of a cloud storage bucket. This is problematic because users may have other data in the bucket they are writing to, and accidental deletion of the bucket removes everything.

Changes in this PR include:
1. pinning the `gcsfs` version to `<=2024.2.0`, before the PR that introduced the bug
2. pointing the example to write to a subdirectory of the bucket

PiperOrigin-RevId: 656046609
Closes: Errors when deleting a directory with huge number of files (#558)