fix(megatron_training_lib): correct training-log parsing and drop sudo from libbnxt cp by atnair-amd · Pull Request #177 · ROCm/cvs

atnair-amd · 2026-05-19T03:41:36Z

Summary

Three fixes in MegatronLlamaTrainingJob, all in
cvs/lib/megatron_training_lib.py. The first two share a single failure
mode: the parser silently sees the wrong slice of training.log, and
downstream code treats an empty or mid-run metrics dict as PASS rather
than raising an explicit error. The third is a one-line drive-by in the
same file's container-exec path; it is included here rather than split out
because it only removes a sudo prefix and shares no surface area with the
parser fixes.

What changed

1. `TRAINING_PROGRESS_PATTERNS` — match the final iter, not every iter

cvs/lib/megatron_training_lib.py (lines 44-46).

The previous patterns matched any throughput per GPU: / tokens/GPU/s
line. Megatron emits those every iteration after the first, so
_is_training_complete (call site at line 705) returned True after iter 1
and the poll loop tail-extracted metrics mid-run.

Replaced with a single pattern anchored on Megatron's final-iter marker:

TRAINING_PROGRESS_PATTERNS = [
    r'iteration\s+(\d+)/\s+\1\s*\|',
]

The \1 backreference requires current-iter == total-iter, so the
pattern fires exactly once per run (e.g. iteration 100/ 100 | ...).
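
To see why the backreference matters, here is a minimal sketch that runs the pattern above against a few synthetic log lines. The sample lines and the helper name is_final_iteration_line are illustrative assumptions; they are not taken from the library or from real Megatron output.

```python
import re

# Pattern as given in this PR description.
FINAL_ITER = re.compile(r'iteration\s+(\d+)/\s+\1\s*\|')


def is_final_iteration_line(line: str) -> bool:
    """Return True only when current-iter equals total-iter on the line."""
    return FINAL_ITER.search(line) is not None


# Synthetic lines for illustration only; metric fields are elided.
samples = [
    " iteration 1/ 100 | ...",
    " iteration 10/ 100 | ...",
    " iteration 100/ 100 | ...",
    " throughput per GPU: ...",
]

matches = [line for line in samples if is_final_iteration_line(line)]
print(matches)
```

Only the 100/ 100 line matches. The 10/ 100 line is rejected because, after \1 consumes "10", the next character is "0" rather than optional whitespace followed by "|".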

2. `get_training_results_dict` — filter before tail

cvs/lib/megatron_training_lib.py (~line 587).

Was:

out_dict = self.phdl.exec(
    f'cat {self.log_dir}/megatron-logs/out-node{last_node_num}/training.log | tail -15'
)

On a healthy run, post-training cleanup (aiter import spam, validation eval
banner, NCCL teardown warnings) writes well over 15 lines after the final
iter, pushing every iteration N/... line out of the tail -15 window.
_parse_training_results then matches nothing and returns an empty
results_dict, which downstream code treats as PASS.

Now:

out_dict = self.phdl.exec(
    f"grep -E 'iteration[[:space:]]+[0-9]+/[[:space:]]+[0-9]+[[:space:]]*\\|' "
    f"{self.log_dir}/megatron-logs/out-node{last_node_num}/training.log | tail -15"
)

The grep keeps only iteration lines; tail -15 then yields the last 15
of those, which is the input _parse_training_results was always assumed
to receive.
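
The effect of filtering before tailing can be sketched in pure Python. This is an illustration under assumptions, not the library's code: the helper last_iteration_lines and the synthetic log are hypothetical, and the regex is a Python translation of the grep ERE shown above.

```python
import re
from collections import deque

# Python equivalent of the grep -E expression above (no backreference:
# it keeps every iteration line, not just the final one).
ITER_LINE = re.compile(r'iteration\s+[0-9]+/\s+[0-9]+\s*\|')


def last_iteration_lines(lines, n=15):
    """Emulate `grep -E <pattern> | tail -n`: filter first, then keep the last n."""
    return list(deque((ln for ln in lines if ITER_LINE.search(ln)), maxlen=n))


# Synthetic log: 20 iteration lines followed by 30 lines of teardown noise.
log = [f" iteration {i}/ 20 | ..." for i in range(1, 21)]
log += [f"teardown noise {j}" for j in range(30)]

naive_tail = log[-15:]                 # what `cat ... | tail -15` would return
filtered = last_iteration_lines(log)   # what `grep ... | tail -15` would return

print(any(ITER_LINE.search(ln) for ln in naive_tail))
print(filtered[0], "->", filtered[-1])
```

In this synthetic case the naive tail contains no iteration lines at all, while the filtered tail holds iterations 6 through 20.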

3. `exec_nic_setup_scripts` — drop `sudo` from the libbnxt `docker exec`

cvs/lib/megatron_training_lib.py (~line 387, broadcom/thor branch).

The training container already runs as root, so the nested sudo in:

f'docker exec {self.container_name} /bin/bash -c "sudo cp ... so.host ... so; ..."'

is unnecessary. On training images that don't install sudo at all, the
exec returns 127 and the libbnxt copy never happens — surfacing later in
the run as RDMA / verbs failures rather than a clean setup error.
Dropping the sudo prefix makes the command work on both flavors of
image; root-in-container can cp into /usr/lib directly.
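
Exit status 127 is the shell's standard "command not found" code. The sketch below shows one way to detect it at setup time so that a missing binary surfaces as a clear error instead of a later RDMA failure. It is an assumption-laden illustration, not what this PR implements (the PR simply drops sudo): the helper names and the placeholder command no_such_binary_for_demo are hypothetical, and it runs a local POSIX shell rather than docker exec.

```python
import subprocess


def run_in_shell(cmd: str) -> int:
    """Run cmd through /bin/sh and return its exit status."""
    proc = subprocess.run(
        ["/bin/sh", "-c", cmd],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return proc.returncode


def check_setup_step(cmd: str) -> int:
    """Raise a clear error when the shell reports 'command not found' (127)."""
    rc = run_in_shell(cmd)
    if rc == 127:
        raise RuntimeError(f"command not found while running setup step: {cmd!r}")
    return rc


# A command that exists versus a placeholder that does not.
print(run_in_shell("true"))
print(run_in_shell("no_such_binary_for_demo"))
```

A check of this kind would turn the silent failure described above into an immediate setup error, independent of whether the image ships sudo.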

…o from libbnxt cp

Three fixes in MegatronLlamaTrainingJob. The first two share a single failure mode (the parser silently sees the wrong slice of training.log and downstream code treats an empty/mid-run dict as PASS); the third is a one-line drive-by in the same file.

1. TRAINING_PROGRESS_PATTERNS matched any `throughput per GPU:` / `tokens/GPU/s` line. Megatron emits those every iteration after the first, so _is_training_complete returned True after iter 1 and the poll loop tail-extracted metrics mid-run. Replaced with a single pattern anchored on `iteration N/ N |`; the `\1` backreference forces current-iter == total-iter, so it fires exactly once per run (at the final iteration line).

2. get_training_results_dict was `cat ... | tail -15` on training.log. Post-training cleanup on a healthy run (aiter import spam, validation eval banner, NCCL teardown warnings) can write well over 15 lines after the final iter, pushing every iteration line out of the tail window. _parse_training_results then matched nothing and returned an empty results_dict, which downstream code treats as PASS. Pre-filter with `grep -E 'iteration N/ N |'` before `tail -15` so the tail window contains the last 15 iteration lines, which is what the parser was always assumed to be fed.

3. exec_nic_setup_scripts broadcom/thor branch runs `docker exec ... /bin/bash -c "sudo cp ... so.host ... so; ..."`. The training container already runs as root, so the nested sudo is unnecessary; on training images that don't install sudo it returns 127 and the libbnxt copy never happens, surfacing later as RDMA/verbs failures. Dropping the sudo prefix makes the command work on both flavors.

anujmittal-amd · 2026-05-21T17:48:36Z

 TRAINING_PROGRESS_PATTERNS = [
-    r'throughput per GPU(?:\s*\([^)]*\))?\s*:|tokens\/GPU\/s\s+[0-9]+',
-    r'throughput per GPU:|tokens\/GPU\/s\s+[0-9]+',
+    r'iteration\s+(\d+)/\s+\1\s*\|',


r'iteration\s+(\d+)/\s+\1\s*\|' requires whitespace after /. If Megatron logs ever emit iteration 100/100 | or vary spacing by version, _is_training_complete() may stop matching entirely.
Would r'iteration\s+(\d+)\s*/\s*\1\s*\|' be safer here?

Good catch — switched to r'iteration\s+(\d+)\s*/\s*\1\s*\|' in f91e6c8 so spacing around / is tolerated.

anujmittal-amd · 2026-05-21T17:49:13Z

        last_node_num = len(self.host_list) - 1
-        out_dict = self.phdl.exec(f'cat {self.log_dir}/megatron-logs/out-node{last_node_num}/training.log | tail -15')
+        out_dict = self.phdl.exec(
+            f"grep -E 'iteration[[:space:]]+[0-9]+/[[:space:]]+[0-9]+[[:space:]]*\\|' "


Same comment as above

Same fix applied to the grep ERE in f91e6c8: /[[:space:]]+ → [[:space:]]*/[[:space:]]* around the /.

Per review on #177: the previous regex/grep both required at least one space after the '/'. Megatron versions that emit 'iteration 100/100 |' (no space) or 'iteration 100 / 100 |' (space on both sides) would have stopped matching, silently regressing _is_training_complete and get_training_results_dict. Switch both to use [[:space:]]*/[[:space:]]* (and the Python equivalent \s*/\s*) so any spacing around '/' is accepted.
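
The difference between the strict and relaxed patterns can be checked with a short sketch. The sample lines are synthetic and the constant names are hypothetical; only the two regexes come from this thread.

```python
import re

# Original pattern: requires at least one whitespace character after '/'.
STRICT = re.compile(r'iteration\s+(\d+)/\s+\1\s*\|')
# Relaxed pattern from the follow-up commit: any spacing around '/'.
RELAXED = re.compile(r'iteration\s+(\d+)\s*/\s*\1\s*\|')

# Synthetic final-iteration lines with three spacing variants.
variants = [
    "iteration 100/ 100 | ...",
    "iteration 100/100 | ...",
    "iteration 100 / 100 | ...",
]

strict_hits = [bool(STRICT.search(v)) for v in variants]
relaxed_hits = [bool(RELAXED.search(v)) for v in variants]
print(strict_hits)
print(relaxed_hits)

# The trailing pipe must stay escaped: an unescaped '|' adds an empty
# alternative, so the pattern would match any string at all.
UNESCAPED = re.compile(r'iteration\s+(\d+)\s*/\s*\1\s*|')
print(bool(UNESCAPED.search("unrelated line")))
```

The strict pattern accepts only the first variant, while the relaxed one accepts all three and still rejects mid-run lines such as iteration 10/100 because the backreference must be followed by optional whitespace and a literal |.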

atnair-amd requested a review from anujmittal-amd May 19, 2026 03:42

anujmittal-amd reviewed May 21, 2026

atnair-amd wants to merge 2 commits into main from atnair/megatron-log-output-fix
