-
Notifications
You must be signed in to change notification settings - Fork 13.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[hotfix] fix data race in method drain()
in TaskMailboxImpl.java
#26219
base: master
Are you sure you want to change the base?
Conversation
drain()
method in TaskMailboxImpl.java
drain()
method in TaskMailboxImpl.java
drain()
in TaskMailboxImpl.java
The link to the commit is unkown because I amended the commit message to clarify the subject of the PR. The new commit link is 7e3700a, and the changes are the same. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we have a unit test?
@@ -243,11 +243,11 @@ private Mail takeOrNull(Deque<Mail> queue, int priority) { | |||
|
|||
@Override | |||
public List<Mail> drain() { | |||
List<Mail> drainedMails = new ArrayList<>(batch); | |||
batch.clear(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we be eagerly grabbing the lock in all these methods so all processing is done under the lock? for example all references to batch should be under the lock as the value could change under us. I was thinking we could synchronize all the methods referencing batch, so we do not need the lock.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good idea that would solve all data races. However, I noticed that the current implementation seems to be acquiring locks depending on what thread is executing the method by invoking checkIsMailboxThread()
. It might be that this was introduced for performance reasons; as acquiring the lock is a rather slow operation. But this is not entirely clear to me, as there are other methods like createBatch
that acquire the lock even though the first instruction is to check is to ensure that only the mailbox thread is executing this.
This is why I reported only the data race on the drain
method, as there is no instruction indicating that only the mailbox thread can execute that method.
Sure, I can write a test. I am looking at this page to read about how to write unit tests for flink. I would like to add a java test, but there is no information on how to run Java tests. Is there some documentation on this? (it would be better, if it is possible to write a junit5 test. The test must be executed repeatedly to increase the chance of triggering the bug, and junit5 has a |
What is the purpose of the change
This is a hotfix fixing a data race in
drain()
method inTaskMailboxImpl.java
. This can lead to undesired executions. For instance, consider the that mailbox thread is adding amail
by executingputFirst
, and concurrently another thread is executingdrain()
. This can produce the following execution:batch
here is executed beforeaddFirst(mail)
(in this line) adds the element,mail
is added tobatch
(consequentely,mail
is not indrainedMails
)batch.clear()
is executed.This execution results in missing
mail
(as it is neither included inbatch
nordrainedMails
).This hotfix PR resolves the data race by including the read to
batch
and its clearing within the critical section in thedrain()
method.Brief change log
batch
in the methoddrain()
inTaskMailboxImpl.java
are moved into the critical section within the method.Verifying this change
This change is a trivial rework / code cleanup without any test coverage.
Does this pull request potentially affect one of the following parts:
@Public(Evolving)
: noDocumentation