@Eric-Wei-10 (Contributor) commented Jul 28, 2025

Implemented a class-based decorator that wraps the straggler detection functionality of nvidia-resiliency-ext.
Key features:

  1. Enables optional straggler detection through the --straggler-detection-level parameter
  2. Makes the performance reporting interval (in iterations) configurable via --straggler-detection-interval
  3. Generates periodic reports containing:
    • Per-function CPU performance metrics
    • Aggregate GPU performance metrics across all monitored functions
  4. Identifies stragglers based on individual and relative performance scores. Output example:
[default0]:Rank 0 GPUs relative perf: {0: 0.9606720209121704, 1: 0.9886988401412964, 2: 0.9780263304710388, 3: 0.9869872331619263, 4: 0.9935539364814758, 5: 0.9700131416320801, 6: 0.9966533780097961, 7: 0.9721004366874695}
[default0]:Rank 0 GPUs individual perf: {0: 1.0, 1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0, 5: 1.0, 6: 1.0, 7: 1.0}
[default0]:Rank 0 sections relative perf: {'microbatch_forward': {0: 0.9669181704521179, 1: 0.8367228507995605, 2: 0.9881120324134827, 3: 0.9784548878669739, 4: 0.952782928943634, 5: 0.9913772940635681, 6: 1.0, 7: 0.9699316620826721}, 'microbatch_backward': {0: 0.6724448800086975, 1: 1.0, 2: 0.6805492639541626, 3: 0.6923477649688721, 4: 0.8935050368309021, 5: 0.6902452111244202, 6: 0.6816517114639282, 7: 0.7524182200431824}}
[default0]:Rank 0 sections individual perf: {'microbatch_forward': {0: 1.0, 1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0, 5: 1.0, 6: 1.0, 7: 1.0}, 'microbatch_backward': {0: 1.0, 1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0, 5: 1.0, 6: 1.0, 7: 1.0}}
[default0]:Rank 0,1,2,3,5,7 is/are identified as 'straggler_gpus_relative' stragglers
[default0]:No rank is identified as as 'straggler_gpus_individual' stragglers
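
For readers unfamiliar with the pattern, here is a minimal sketch of a class-based decorator that times a wrapped function and emits a report every N calls. It is not the code in this PR and it does not call nvidia-resiliency-ext. The class name `SectionTimer` and its `interval`, `enabled`, and `report_fn` parameters are hypothetical; they only loosely mirror the --straggler-detection-interval and --straggler-detection-level options. It measures CPU wall-clock time only, with no GPU metrics.

```python
import functools
import time


class SectionTimer:
    """Hypothetical class-based decorator: time each call, report every `interval` calls."""

    def __init__(self, name, interval=10, enabled=True, report_fn=print):
        self.name = name            # label for the monitored section
        self.interval = interval    # number of calls between reports
        self.enabled = enabled      # when False, the function is left unwrapped
        self.report_fn = report_fn  # callback that receives each report dict
        self.calls = 0
        self.window = []            # durations collected since the last report

    def __call__(self, func):
        if not self.enabled:
            return func

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Record the duration even if the wrapped call raises.
                self.window.append(time.perf_counter() - start)
                self.calls += 1
                if self.calls % self.interval == 0:
                    self._report()

        return wrapper

    def _report(self):
        mean_s = sum(self.window) / len(self.window)
        self.report_fn({"section": self.name, "calls": self.calls, "mean_s": mean_s})
        self.window.clear()


# Usage: collect reports in a list instead of printing them.
reports = []


@SectionTimer("microbatch_forward", interval=3, report_fn=reports.append)
def forward(x):
    return x * 2


outputs = [forward(i) for i in range(7)]
```

Keeping the state on the decorator instance means each monitored function gets its own counters, and passing `enabled=False` returns the original function so the disabled path adds no overhead.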

@@ -0,0 +1,32 @@
try:

Collaborator

move this file to flagscale/train/straggler_detection.py

Contributor Author

fixed

)

stragglers = report.identify_stragglers(gpu_rel_threshold=0.7, gpu_indiv_threshold=0.7)
pprint(stragglers)

Collaborator

display your results in the PR description

Contributor Author

fixed
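
To make the thresholds in the hunk above concrete, here is a rough sketch of how a cutoff can turn per-rank scores into a straggler list. It rests on two assumptions: that a relative score is the best rank's time divided by this rank's time, and that a rank is flagged when its score falls below the threshold. The helpers `relative_scores` and `find_stragglers` are hypothetical and are not part of nvidia-resiliency-ext. The scores are the "GPUs relative perf" values from the sample output in the PR description.

```python
def relative_scores(times_by_rank):
    """Assumed definition: best (smallest) time divided by each rank's time."""
    best = min(times_by_rank.values())
    return {rank: best / t for rank, t in times_by_rank.items()}


def find_stragglers(scores, threshold):
    """Assumed rule: a rank is a straggler when its score is below the threshold."""
    return sorted(rank for rank, score in scores.items() if score < threshold)


# "GPUs relative perf" values copied from the sample output in the PR description.
gpu_relative = {
    0: 0.9606720209121704,
    1: 0.9886988401412964,
    2: 0.9780263304710388,
    3: 0.9869872331619263,
    4: 0.9935539364814758,
    5: 0.9700131416320801,
    6: 0.9966533780097961,
    7: 0.9721004366874695,
}

flagged_at_07 = find_stragglers(gpu_relative, threshold=0.7)
flagged_at_099 = find_stragglers(gpu_relative, threshold=0.99)
```

Under these assumptions, a cutoff of 0.7 flags no rank, while a hypothetical cutoff of 0.99 flags ranks 0, 1, 2, 3, 5 and 7, the same set the sample log reports. This suggests, but does not confirm, that the sample run used a higher relative threshold than the 0.7 shown in the hunk.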

@Eric-Wei-10 force-pushed the dev-wx branch 22 times, most recently from cc3d3b9 to 4cf9b53 (August 6, 2025 04:01)
@Eric-Wei-10 force-pushed the dev-wx branch 2 times, most recently from 3613752 to d7a8b4a (August 25, 2025 06:54)
@Eric-Wei-10 force-pushed the dev-wx branch 2 times, most recently from 03b5abf to 6e8b324 (August 25, 2025 09:14)
zhaoyinglia

This comment was marked as duplicate.
