Skip to content

Conversation

@AdrS
Copy link
Contributor

@AdrS AdrS commented Apr 21, 2025

This is to enable customizing how sets are serialized to increase the pickling determinism. I'm modifying the vendored cloudpickle as a stop-gap measure until the cloudpickle maintainers review cloudpipe/cloudpickle#563.

Note: It's easiest to review this change with the setting to hide whitespace.

Issue: #34410

This is to enable customizing how sets are serialized to increase the
pickling determinism. I'm modifying the vendored cloudpickle as a
stop-gap measure until the cloudpickle maintainers review
cloudpipe/cloudpickle#563.

Issue: apache#34410
@github-actions
Copy link
Contributor

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @jrmccluskey for label python.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@jrmccluskey
Copy link
Contributor

Passing to @claudevdm who has been working with dill and cloudpickle as of late, should have better context here

with _pickle_lock:
with io.BytesIO() as file:
pickler = cloudpickle.CloudPickler(file)
pickler = cloudpickle.PurePythonPickler(file)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed offline we probably want to have this default to the fast pickler, and add an option to use slow pickler.

@claudevdm
Copy link
Collaborator

Should we vendor the cloudpickle tests as well if we are modifying the implementation this much?

@claudevdm
Copy link
Collaborator

There are a bunch of failing tests e.g.

self = <apache_beam.internal.cloudpickle.cloudpickle.PurePythonPickler object at 0x7a8974470940>
obj = b''

    def memoize(self, obj):
        """Store an object in the memo."""
    
        # The Pickler memo is a dictionary mapping object ids to 2-tuples
        # that contain the Unpickler memo key and the object being memoized.
        # The memo key is written to the pickle and will become
        # the key in the Unpickler's memo.  The object is stored in the
        # Pickler memo so that transient objects are kept alive during
        # pickling.
    
        # The use of the Unpickler memo length as the memo key is just a
        # convention.  The only requirement is that the memo values be unique.
        # But there appears no advantage to any other scheme, and this
        # scheme allows the Unpickler memo to be implemented as a plain (but
        # growable) array, indexed by memo key.
        if self.fast:
            return
>       assert id(obj) not in self.memo
E       AssertionError

@github-actions
Copy link
Contributor

github-actions bot commented May 2, 2025

Reminder, please take a look at this pr: @jrmccluskey

@github-actions
Copy link
Contributor

github-actions bot commented May 6, 2025

Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @tvalentyn for label python.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

@tvalentyn
Copy link
Contributor

Should we vendor the cloudpickle tests as well if we are modifying the implementation this much?

+1.

Given cloudpickle authors are not responsive, we might have to maintain this fork until our changes can be upstreamed.

@github-actions
Copy link
Contributor

Reminder, please take a look at this pr: @tvalentyn

@tvalentyn
Copy link
Contributor

waiting on author

@github-actions
Copy link
Contributor

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Jul 15, 2025
@github-actions
Copy link
Contributor

This pull request has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Jul 22, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants