fix(megatron_training_lib): correct training-log parsing and drop sudo from libbnxt cp #177

Open
atnair-amd wants to merge 2 commits into main from atnair/megatron-log-output-fix

@atnair-amd
Contributor

Summary

Three fixes in MegatronLlamaTrainingJob, all in
cvs/lib/megatron_training_lib.py. The first two share a single failure
mode — the parser silently sees the wrong slice of training.log and
downstream code treats an empty / mid-run metrics dict as PASS rather
than an explicit error. The third is a one-line drive-by in the same
file's container-exec path, included here rather than split out because
it only removes a single word and has no shared surface area.

What changed

1. TRAINING_PROGRESS_PATTERNS — match the final iter, not every iter

cvs/lib/megatron_training_lib.py (lines 44-46).

The previous patterns matched any throughput per GPU: / tokens/GPU/s
line. Megatron emits those every iteration after the first, so
_is_training_complete (call site at line 705) returned True after iter 1
and the poll loop tail-extracted metrics mid-run.

Replaced with a single pattern anchored on Megatron's final-iter marker:

TRAINING_PROGRESS_PATTERNS = [
    r'iteration\s+(\d+)/\s+\1\s*\|',
]

The \1 backreference requires current-iter == total-iter, so the
pattern fires exactly once per run (e.g. iteration 100/ 100 | ...).
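
The effect of the backreference can be checked in isolation. The following is a minimal sketch, not code from the PR: the pattern is the one quoted above, while the helper name `is_final_iteration` and the two sample log lines are hypothetical and only shaped like the `iteration 100/ 100 | ...` example.

```python
import re

# Pattern quoted from the PR description; \1 must repeat the captured
# current-iteration number, so only "N/ N" lines can match.
FINAL_ITER = re.compile(r'iteration\s+(\d+)/\s+\1\s*\|')

# Hypothetical log lines (assumed format, not copied from a real training.log).
MID_RUN_LINE = " iteration       10/     100 | throughput per GPU: 412.3 |"
FINAL_LINE = " iteration      100/     100 | throughput per GPU: 415.0 |"


def is_final_iteration(line: str) -> bool:
    """Return True only when current-iter equals total-iter on this line."""
    return FINAL_ITER.search(line) is not None


if __name__ == "__main__":
    for line in (MID_RUN_LINE, FINAL_LINE):
        print(is_final_iteration(line), line.strip())
```

Note that `10/ 100` is rejected even though `100` starts with `10`: after the backreference consumes `10`, the pattern requires optional whitespace and then `|`, and the remaining `0` blocks the match.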

2. get_training_results_dict — filter before tail

cvs/lib/megatron_training_lib.py (~line 587).

Was:

out_dict = self.phdl.exec(
    f'cat {self.log_dir}/megatron-logs/out-node{last_node_num}/training.log | tail -15'
)

On a healthy run, post-training cleanup (aiter import spam, validation eval
banner, NCCL teardown warnings) writes well over 15 lines after the final
iter, pushing every iteration N/... line out of the tail -15 window.
_parse_training_results then matches nothing and returns an empty
results_dict, which downstream code treats as PASS.

Now:

out_dict = self.phdl.exec(
    f"grep -E 'iteration[[:space:]]+[0-9]+/[[:space:]]+[0-9]+[[:space:]]*\\|' "
    f"{self.log_dir}/megatron-logs/out-node{last_node_num}/training.log | tail -15"
)

The grep keeps only iteration lines; tail -15 then yields the last 15
of those, which is the input _parse_training_results was written to
expect.
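
The difference between the two orderings can be reproduced without a cluster. This is a minimal sketch in pure Python, assuming a hypothetical log of 100 iteration lines followed by 20 lines of teardown noise; the regex is a Python rendering of the grep ERE above, and the function names are illustrative only.

```python
import re

# Python equivalent of the grep ERE in the PR ([[:space:]] -> \s, [0-9] -> \d).
ITER_LINE = re.compile(r'iteration\s+\d+/\s+\d+\s*\|')


def tail(lines, n=15):
    """Emulate `tail -n`: keep the last n lines."""
    return lines[-n:]


def tail_only(lines, n=15):
    """Old behaviour: tail first, then see which lines the parser could use."""
    return [line for line in tail(lines, n) if ITER_LINE.search(line)]


def filter_then_tail(lines, n=15):
    """New behaviour: keep iteration lines, then take the last n of them."""
    return tail([line for line in lines if ITER_LINE.search(line)], n)


# Hypothetical log contents (assumed shape, not a real training.log).
LOG = [f" iteration {i:8d}/{100:8d} | lm loss: 1.0 |" for i in range(1, 101)]
LOG += [f"teardown warning {k}" for k in range(20)]

if __name__ == "__main__":
    print(len(tail_only(LOG)))         # iteration lines visible to the old path
    print(len(filter_then_tail(LOG)))  # iteration lines visible to the new path
```

With more than 15 trailing noise lines, the old ordering hands the parser no iteration lines at all, while the new ordering always hands it the most recent ones.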

3. exec_nic_setup_scripts — drop sudo from the libbnxt docker exec

cvs/lib/megatron_training_lib.py (~line 387, broadcom/thor branch).

The training container already runs as root, so the nested sudo in:

f'docker exec {self.container_name} /bin/bash -c "sudo cp ... so.host ... so; ..."'

is unnecessary. On training images that don't install sudo at all, the
exec returns 127 and the libbnxt copy never happens — surfacing later in
the run as RDMA / verbs failures rather than a clean setup error.
Dropping the sudo prefix makes the command work on both flavors of
image; root-in-container can cp into /usr/lib directly.
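
The 127 mentioned above is the shell's conventional "command not found" exit status. A minimal sketch of that behaviour, assuming a POSIX `/bin/sh` on the machine running it; the binary name is a hypothetical stand-in for a missing `sudo`, and nothing here touches Docker or the real setup script.

```python
import subprocess

# Hypothetical name chosen so that it is not on PATH; stands in for `sudo`
# on an image that does not install it.
MISSING_BINARY = "hypothetical-missing-binary-xyz"


def run_in_shell(command: str) -> int:
    """Run a command line via /bin/sh -c and return its exit status."""
    completed = subprocess.run(
        ["/bin/sh", "-c", command],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return completed.returncode


if __name__ == "__main__":
    status = run_in_shell(f"{MISSING_BINARY} cp src dst")
    if status == 127:
        print("command not found: the copy never ran")
```

Checking for this status at setup time is one way to turn the failure into an explicit error instead of a later RDMA / verbs symptom.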

…o from libbnxt cp

Three fixes in MegatronLlamaTrainingJob. The first two share a single
failure mode (the parser silently sees the wrong slice of training.log
and downstream code treats an empty/mid-run dict as PASS); the third is
a one-line drive-by in the same file.

1. TRAINING_PROGRESS_PATTERNS matched any `throughput per GPU:` /
   `tokens/GPU/s` line. Megatron emits those every iteration after the
   first, so _is_training_complete returned True after iter 1 and the
   poll loop tail-extracted metrics mid-run. Replaced with a single
   pattern anchored on `iteration N/ N |`; the `\1` backreference
   forces current-iter == total-iter, so it fires exactly once per run
   (at the final iteration line).

2. get_training_results_dict was `cat ... | tail -15` on training.log.
   Post-training cleanup on a healthy run (aiter import spam,
   validation eval banner, NCCL teardown warnings) can write well over
   15 lines after the final iter, pushing every iteration line out of
   the tail window. _parse_training_results then matched nothing and
   returned an empty results_dict, which downstream code treats as
   PASS. Pre-filter with a `grep -E` that keeps every `iteration N/ M |`
   line before `tail -15`, so the tail window contains the last 15
   iteration lines, which is the input the parser was written to expect.

3. exec_nic_setup_scripts broadcom/thor branch runs `docker exec ...
   /bin/bash -c "sudo cp ... so.host ... so; ..."`. The training
   container already runs as root, so the nested sudo is unnecessary;
   on training images that don't install sudo it returns 127 and the
   libbnxt copy never happens, surfacing later as RDMA/verbs failures.
   Dropping the sudo prefix makes the command work on both flavors.
atnair-amd requested a review from anujmittal-amd on May 19, 2026 03:42
Comment thread cvs/lib/megatron_training_lib.py Outdated
TRAINING_PROGRESS_PATTERNS = [
r'throughput per GPU(?:\s*\([^)]*\))?\s*:|tokens\/GPU\/s\s+[0-9]+',
r'throughput per GPU:|tokens\/GPU\/s\s+[0-9]+',
r'iteration\s+(\d+)/\s+\1\s*\|',

r'iteration\s+(\d+)/\s+\1\s*\|' requires whitespace after /. If Megatron logs ever emit iteration 100/100 | or vary spacing by version, _is_training_complete() may stop matching entirely.
Would r'iteration\s+(\d+)\s*/\s*\1\s*\|' be safer here?

Contributor Author

Good catch — switched to r'iteration\s+(\d+)\s*/\s*\1\s*\|' in f91e6c8 so spacing around / is tolerated.

Comment thread cvs/lib/megatron_training_lib.py Outdated
last_node_num = len(self.host_list) - 1
out_dict = self.phdl.exec(f'cat {self.log_dir}/megatron-logs/out-node{last_node_num}/training.log | tail -15')
out_dict = self.phdl.exec(
f"grep -E 'iteration[[:space:]]+[0-9]+/[[:space:]]+[0-9]+[[:space:]]*\\|' "

Same comment as above

Contributor Author

Same fix applied to the grep ERE in f91e6c8: [[:space:]]+/[[:space:]]+ is now [[:space:]]*/[[:space:]]* around the /.

Per review on #177: the previous regex/grep both required at least
one space after the '/'. Megatron versions that emit
'iteration 100/100 |' (no space) or 'iteration 100 / 100 |' (space
on both sides) would have stopped matching, silently regressing
_is_training_complete and get_training_results_dict. Switch both
to use [[:space:]]*/[[:space:]]* (and the Python equivalent
\s*/\s*) so any spacing around '/' is accepted.
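
A quick check of the relaxed Python pattern against the spacing variants named in this commit message. This is a minimal sketch: the regex is the one quoted in the review reply, while the sample strings and the helper name `matches_final_iter` are hypothetical.

```python
import re

# Relaxed pattern from the review reply: any spacing around '/' is accepted,
# and \1 still forces current-iter == total-iter.
FINAL_ITER = re.compile(r'iteration\s+(\d+)\s*/\s*\1\s*\|')

# Hypothetical lines covering the spacing variants discussed in the review.
SHOULD_MATCH = [
    "iteration 100/ 100 | ...",   # space after '/', as in the PR's example
    "iteration 100/100 | ...",    # no space
    "iteration 100 / 100 | ...",  # space on both sides
]
SHOULD_NOT_MATCH = [
    "iteration 10/100 | ...",     # mid-run: 10 != 100
    "iteration 99 / 100 | ...",   # one iteration short
]


def matches_final_iter(line: str) -> bool:
    """Return True when the line is a final-iteration marker."""
    return FINAL_ITER.search(line) is not None


if __name__ == "__main__":
    for line in SHOULD_MATCH + SHOULD_NOT_MATCH:
        print(matches_final_iter(line), line)
```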