[#1796] fix(spark): Implicitly unregister map output on fetch failure #1797
base: master
Conversation
Could you help review this? @advancedxy If I understand anything incorrectly, feel free to point it out.
Test Results
2 116 files (-533), 2 116 suites (-533), 2h 28m 20s ⏱️ (-2h 58m 20s)
For more details on these failures and errors, see this check.
Results for commit 73e5020. ± Comparison against base commit f8e4329.
This pull request removes 289 tests.
♻️ This comment has been updated with latest results.
@@ -371,6 +374,19 @@ public static RssException reportRssFetchFailedException(
        rssFetchFailedException.getMessage());
    RssReportShuffleFetchFailureResponse response = client.reportShuffleFetchFailure(req);
    if (response.getReSubmitWholeStage()) {
      TaskContext taskContext = TaskContext.get();
      RssReassignServersRequest rssReassignServersRequest =
Thanks for bringing this up.
I think I made a mistake in the previous impl, which doesn't unregister all the map output on shuffle fetch failure.
I think the right place to unregister the map output is ShuffleManagerGrpcService's reportShuffleFetchFailure. When enough fetch failures have been reported, it should unregister all the map output and tell the client to report a FetchFailedException. The shuffle server reassignment could be triggered too if configured to do so.
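For illustration, a minimal sketch of what such server-side accounting could look like; the class, method names, and threshold handling below are assumptions for discussion, not Uniffle's actual API:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: count fetch-failure reports per shuffle on the shuffle manager
// side and, once the configured limit is reached, drop the shuffle's map output and
// tell the client to surface a FetchFailedException so the parent stage is recomputed.
class FetchFailureTracker {
  private final Map<Integer, AtomicInteger> failuresPerShuffle = new ConcurrentHashMap<>();
  private final int maxFetchFailures; // assumed limit, e.g. derived from spark.task.maxFailures

  FetchFailureTracker(int maxFetchFailures) {
    this.maxFetchFailures = maxFetchFailures;
  }

  // Returns true if the caller should report a FetchFailedException and resubmit the stage.
  boolean reportShuffleFetchFailure(int shuffleId) {
    int count =
        failuresPerShuffle.computeIfAbsent(shuffleId, id -> new AtomicInteger()).incrementAndGet();
    if (count >= maxFetchFailures) {
      unregisterAllMapOutput(shuffleId); // placeholder for the real unregister call
      return true;
    }
    return false;
  }

  private void unregisterAllMapOutput(int shuffleId) {
    // In the real service this would clear the map output registered for shuffleId; elided here.
  }
}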
Good insight. Thanks for your advice; I will go on with this approach.
cc @yl09099 Could you help check the write failure logic? I have refactored that part of the code to align it with the fetch failure handling.
LOG.warn("The shuffleId:{}, stageId:{} has been retried. Ignore it.");
  return false;
}
if (shuffleStatus.getTaskFailureAttemptCount() >= sparkTaskMaxFailures) { |
In SMJ (sort-merge join), one stage has two shuffle readers. If a task fails against two different shuffle readers, the condition readerShuffleStatus.getTaskFailureAttemptCount() >= sparkTaskMaxFailures will not behave as expected, because failures are counted per shuffle reader. For example, with sparkTaskMaxFailures = 4, a task that fails twice on each of the two readers never reaches the threshold for either reader, even though it has failed four times in total.
Let me think more about this case. Do you have any further solutions?
Should we add a state for the stage that contains all the shuffleStatus in this stage? Something like:
class RssShuffleStageFailureState {
  int stageId;
  List<RssShuffleStatus> shuffleStatusList;
  // e.g. decide the stage retry from the failure counts of all shuffles in this stage
  boolean activateStageRetry();
}
Sounds good. StageId is a good trigger condition for the retry checking. But I'm not sure whether I have missed something, especially in some corner cases. Could you give some extra advice? @jerqi @advancedxy
Done
    && writerShuffleStatus.isStageAttemptRetried(stageAttemptNumber)) {
  return true;
}
return false;
}
In one case, the reader triggers a retry and that retry is recorded. Later, after the writer fails to write data several times and should trigger a retry, this method reports that the retry has already been performed.
Yes. A retry for the same stageId and attempt number will only occur once. Is this incorrect? @yl09099
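If reader-triggered and writer-triggered retries need to be tracked independently for the same stage attempt, one option is sketched below; the class and method names are hypothetical and only illustrate keeping the two retry records separate:

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: keep separate records of retries triggered by fetch (reader)
// failures and by write failures, so a write failure can still trigger a retry for a
// stage attempt that was already retried because of a fetch failure.
class StageRetryRecord {
  private final Set<Integer> readerRetriedAttempts = ConcurrentHashMap.newKeySet();
  private final Set<Integer> writerRetriedAttempts = ConcurrentHashMap.newKeySet();

  // Each returns true only for the first retry of the given stage attempt number.
  boolean markReaderRetried(int stageAttemptNumber) {
    return readerRetriedAttempts.add(stageAttemptNumber);
  }

  boolean markWriterRetried(int stageAttemptNumber) {
    return writerRetriedAttempts.add(stageAttemptNumber);
  }
}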
Could you help review this? @advancedxy
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #1797 +/- ##
============================================
- Coverage 53.53% 53.43% -0.11%
- Complexity 2356 2395 +39
============================================
Files 368 371 +3
Lines 16852 17156 +304
Branches 1540 1571 +31
============================================
+ Hits 9022 9167 +145
- Misses 7303 7451 +148
- Partials 527 538 +11
☔ View full report in Codecov by Sentry.
There are CI failures; you may need to take a look at those first. I'll take a look at this later tonight or tomorrow.
Thanks for working on this. I did a quick overview of this change, and I think it's quite large to review. It would be best to keep it smaller.
Emm... Sorry, I don't think this is a huge change; it is mostly based on your previous great work and just fixes some bugs. And I don't have much time to rework and split it into multiple PRs, so I hope we could review it in this PR to check whether any critical problems exist.
I agree it's not huge, but it's large enough that it requires sufficient mental capacity and time to review, which unfortunately I don't have until this weekend.
Well understood. However, I think this PR should be split into at least two PRs: one handling fetch failures and one handling write failures. The write failure is different from the fetch failure and should be decoupled. @jerqi do you have the time to review this by any chance? Otherwise it will take some time and will probably be reviewed this weekend; does that sound right to you? @zuston
Good to know. Thanks for your clear reply. @advancedxy
Could you explain the difference between fetch failure and write failure?
For starters, you should report shuffle fetch failure and write failure via different request types. The stage retry logic is also a bit off, as you need to retry the parent stage for a fetch failure but the current stage for a write failure. They might share common logic such as removing map output data, but they should be handled in separate PRs instead of one, which makes this PR huge and hard to review.
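A minimal sketch of that difference, using placeholder types rather than Uniffle's real request classes:

// Hypothetical sketch: a fetch failure invalidates the parent (map) stage's output and
// retries the parent stage, while a write failure only reruns the current stage.
enum ShuffleFailureType { FETCH_FAILURE, WRITE_FAILURE }

class StageRetryDecision {
  final int stageIdToRetry;
  final boolean unregisterMapOutput;

  StageRetryDecision(int stageIdToRetry, boolean unregisterMapOutput) {
    this.stageIdToRetry = stageIdToRetry;
    this.unregisterMapOutput = unregisterMapOutput;
  }

  static StageRetryDecision decide(ShuffleFailureType type, int currentStageId, int parentStageId) {
    if (type == ShuffleFailureType.FETCH_FAILURE) {
      // The parent stage's map output is suspect: drop it and recompute upstream.
      return new StageRetryDecision(parentStageId, true);
    }
    // The current stage failed to write its own output: rerun it, possibly on
    // reassigned shuffle servers, without invalidating the parent stage's output.
    return new StageRetryDecision(currentStageId, false);
  }
}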
After digging into this feature, I found there are many bugs and improvements that need to be done, so I have to split them into small patches. This PR will serve as the collection of those patches to test the availability.
Yes. This is the difference that I have distinguished in the current PR.
What changes were proposed in this pull request?
Why are the changes needed?
Fixes: #1796, #1801, #1798.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Unit tests