feat: add StragglerDetectionWrapper decorator class and report for monitoring rank performances #697
@@ -0,0 +1,32 @@
try:
move this file to flagscale/train/straggler_detection.py
fixed
flagscale/train/train.py
)

stragglers = report.identify_stragglers(gpu_rel_threshold=0.7, gpu_indiv_threshold=0.7)
pprint(stragglers)
display your results in the PR description
fixed
Implemented a class-based decorator wrapping nvidia-resiliency-ext's straggler detection functionality.
Key features:
- --straggler-detection-level parameter
- --straggler-detection-interval
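For readers unfamiliar with the pattern, here is a minimal sketch of a class-based decorator that reports at a fixed interval. It is not the code in this PR and does not call nvidia-resiliency-ext: the class name `TimingWrapper`, the `on_report` callback, and the report fields are hypothetical, and plain wall-clock timing stands in for the library's detection. The `interval` argument only loosely mirrors what `--straggler-detection-interval` might control, which is an assumption.

```python
import functools
import time
from typing import Any, Callable, Dict, List


class TimingWrapper:
    """Hypothetical class-based decorator: times calls, reports every `interval` calls."""

    def __init__(self, interval: int, on_report: Callable[[Dict[str, Any]], None]):
        if interval < 1:
            raise ValueError("interval must be >= 1")
        self.interval = interval
        self.on_report = on_report
        self._durations: List[float] = []
        self._calls = 0

    def __call__(self, func: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(func)  # keep the wrapped function's name and docstring
        def wrapped(*args: Any, **kwargs: Any) -> Any:
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Record the duration even if the wrapped call raised.
                self._durations.append(time.perf_counter() - start)
                self._calls += 1
                if self._calls % self.interval == 0:
                    self._emit_report(func.__name__)

        return wrapped

    def _emit_report(self, name: str) -> None:
        # Summarize the calls seen since the last report, then reset the window.
        durations, self._durations = self._durations, []
        self.on_report(
            {
                "section": name,
                "calls": len(durations),
                "mean_seconds": sum(durations) / len(durations),
                "max_seconds": max(durations),
            }
        )


# Usage: collect reports in a list; a real wrapper would log or compare ranks.
reports: List[Dict[str, Any]] = []


@TimingWrapper(interval=3, on_report=reports.append)
def train_step(x: int) -> int:
    return x * 2


results = [train_step(i) for i in range(7)]
```

With seven calls and an interval of three, the sketch emits two reports, one after the third call and one after the sixth; the seventh call stays in the open window. Keeping the state on the decorator instance is what makes the class-based form convenient here, since the interval and the accumulated timings live in one place.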