Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[hotfix] fix data race in method drain() in TaskMailboxImpl.java #26219

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

raulpardo
Copy link

What is the purpose of the change

This is a hotfix fixing a data race in drain() method in TaskMailboxImpl.java. This can lead to undesired executions. For instance, consider the that mailbox thread is adding a mail by executing putFirst, and concurrently another thread is executing drain(). This can produce the following execution:

  1. The read of batch here is executed before addFirst(mail) (in this line) adds the element,
  2. Then, mail is added to batch (consequentely, mail is not in drainedMails)
  3. Finally, batch.clear() is executed.

This execution results in missing mail (as it is neither included in batch nor drainedMails).

This hotfix PR resolves the data race by including the read to batch and its clearing within the critical section in the drain() method.

Brief change log

  • Reading and clearing variable batch in the method drain() in TaskMailboxImpl.java are moved into the critical section within the method.

Verifying this change

This change is a trivial rework / code cleanup without any test coverage.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: don't know
  • The runtime per-record code paths (performance sensitive): don't know
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: don't know
  • The S3 file system connector: don't know

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

@raulpardo raulpardo changed the title [hotfix] fix data race in method in [hotfix] fix data race in drain() method in TaskMailboxImpl.java Feb 26, 2025
@raulpardo raulpardo changed the title [hotfix] fix data race in drain() method in TaskMailboxImpl.java [hotfix] fix data race in method drain() in TaskMailboxImpl.java Feb 26, 2025
@flinkbot
Copy link
Collaborator

flinkbot commented Feb 26, 2025

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

@raulpardo
Copy link
Author

CI report:

* [9400657](https://github.com/apache/flink/commit/9400657bdbc244a1eb03d157b13de9ca5f7f7555) UNKNOWN

Bot commands

The link to the commit is unkown because I amended the commit message to clarify the subject of the PR. The new commit link is 7e3700a, and the changes are the same.

Copy link
Contributor

@davidradl davidradl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we have a unit test?

@@ -243,11 +243,11 @@ private Mail takeOrNull(Deque<Mail> queue, int priority) {

@Override
public List<Mail> drain() {
List<Mail> drainedMails = new ArrayList<>(batch);
batch.clear();
Copy link
Contributor

@davidradl davidradl Feb 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we be eagerly grabbing the lock in all these methods so all processing is done under the lock? for example all references to batch should be under the lock as the value could change under us. I was thinking we could synchronize all the methods referencing batch, so we do not need the lock.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good idea that would solve all data races. However, I noticed that the current implementation seems to be acquiring locks depending on what thread is executing the method by invoking checkIsMailboxThread(). It might be that this was introduced for performance reasons; as acquiring the lock is a rather slow operation. But this is not entirely clear to me, as there are other methods like createBatch that acquire the lock even though the first instruction is to check is to ensure that only the mailbox thread is executing this.

This is why I reported only the data race on the drain method, as there is no instruction indicating that only the mailbox thread can execute that method.

@raulpardo
Copy link
Author

raulpardo commented Mar 2, 2025

can we have a unit test?

Sure, I can write a test. I am looking at this page to read about how to write unit tests for flink. I would like to add a java test, but there is no information on how to run Java tests. Is there some documentation on this? (it would be better, if it is possible to write a junit5 test. The test must be executed repeatedly to increase the chance of triggering the bug, and junit5 has a @RepeatedTest functionality which would be handy in this case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants