fix(megatron_training_lib): correct training-log parsing and drop sudo from libbnxt cp#177
Open
atnair-amd wants to merge 2 commits into
Open
fix(megatron_training_lib): correct training-log parsing and drop sudo from libbnxt cp#177atnair-amd wants to merge 2 commits into
atnair-amd wants to merge 2 commits into
Conversation
…o from libbnxt cp Three fixes in MegatronLlamaTrainingJob. The first two share a single failure mode (the parser silently sees the wrong slice of training.log and downstream code treats an empty/mid-run dict as PASS); the third is a one-line drive-by in the same file. 1. TRAINING_PROGRESS_PATTERNS matched any `throughput per GPU:` / `tokens/GPU/s` line. Megatron emits those every iteration after the first, so _is_training_complete returned True after iter 1 and the poll loop tail-extracted metrics mid-run. Replaced with a single pattern anchored on `iteration N/ N |`; the `\1` backreference forces current-iter == total-iter, so it fires exactly once per run (at the final iteration line). 2. get_training_results_dict was `cat ... | tail -15` on training.log. Post-training cleanup on a healthy run (aiter import spam, validation eval banner, NCCL teardown warnings) can write well over 15 lines after the final iter, pushing every iteration line out of the tail window. _parse_training_results then matched nothing and returned an empty results_dict, which downstream code treats as PASS. Pre-filter with `grep -E 'iteration N/ N |'` before `tail -15` so the tail window contains the last 15 iteration lines, which is what the parser was always assumed to be fed. 3. exec_nic_setup_scripts broadcom/thor branch runs `docker exec ... /bin/bash -c "sudo cp ... so.host ... so; ..."`. The training container already runs as root, so the nested sudo is unnecessary; on training images that don't install sudo it returns 127 and the libbnxt copy never happens, surfacing later as RDMA/verbs failures. Dropping the sudo prefix makes the command work on both flavors.
| TRAINING_PROGRESS_PATTERNS = [ | ||
| r'throughput per GPU(?:\s*\([^)]*\))?\s*:|tokens\/GPU\/s\s+[0-9]+', | ||
| r'throughput per GPU:|tokens\/GPU\/s\s+[0-9]+', | ||
| r'iteration\s+(\d+)/\s+\1\s*\|', |
There was a problem hiding this comment.
r'iteration\s+(\d+)/\s+\1\s*|' requires whitespace after /. If Megatron logs ever emit iteration 100/100 | or vary spacing by version, _is_training_complete() may stop matching entirely.
Would r'iteration\s+(\d+)\s*/\s*\1\s*|' be safer here?
Contributor
Author
There was a problem hiding this comment.
Good catch — switched to r'iteration\s+(\d+)\s*/\s*\1\s*\|' in f91e6c8 so spacing around / is tolerated.
| last_node_num = len(self.host_list) - 1 | ||
| out_dict = self.phdl.exec(f'cat {self.log_dir}/megatron-logs/out-node{last_node_num}/training.log | tail -15') | ||
| out_dict = self.phdl.exec( | ||
| f"grep -E 'iteration[[:space:]]+[0-9]+/[[:space:]]+[0-9]+[[:space:]]*\\|' " |
Contributor
Author
There was a problem hiding this comment.
Same fix applied to the grep ERE in f91e6c8: [[:space:]]+/[[:space:]]+ → [[:space:]]*/[[:space:]]* around the /.
Per review on #177: the previous regex/grep both required at least one space after the '/'. Megatron versions that emit 'iteration 100/100 |' (no space) or 'iteration 100 / 100 |' (space on both sides) would have stopped matching, silently regressing _is_training_complete and get_training_results_dict. Switch both to use [[:space:]]*/[[:space:]]* (and the Python equivalent \s*/\s*) so any spacing around '/' is accepted.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Three fixes in
MegatronLlamaTrainingJob, all incvs/lib/megatron_training_lib.py. The first two share a single failuremode — the parser silently sees the wrong slice of
training.loganddownstream code treats an empty / mid-run metrics dict as PASS rather
than an explicit error. The third is a one-line drive-by in the same
file's container-exec path; included here rather than split out because
it's a single character change with no shared surface area.
What changed
1.
TRAINING_PROGRESS_PATTERNS— match the final iter, not every itercvs/lib/megatron_training_lib.py(lines 44-46).The previous patterns matched any
throughput per GPU:/tokens/GPU/sline. Megatron emits those every iteration after the first, so
_is_training_complete(call site at line 705) returned True after iter 1and the poll loop tail-extracted metrics mid-run.
Replaced with a single pattern anchored on megatron's final-iter marker:
The
\1backreference requires current-iter == total-iter, so thepattern fires exactly once per run (e.g.
iteration 100/ 100 | ...).2.
get_training_results_dict— filter before tailcvs/lib/megatron_training_lib.py(~line 587).Was:
On a healthy run, post-training cleanup (aiter import spam, validation eval
banner, NCCL teardown warnings) writes well over 15 lines after the final
iter, pushing every
iteration N/...line out of thetail -15window._parse_training_resultsthen matches nothing and returns an emptyresults_dict, which downstream code treats as PASS.Now:
The
grepkeeps only iteration lines;tail -15then yields the last 15of those, which is what
_parse_training_resultswas already assumed tobe fed.
3.
exec_nic_setup_scripts— dropsudofrom the libbnxtdocker execcvs/lib/megatron_training_lib.py(~line 387, broadcom/thor branch).The training container already runs as root, so the nested
sudoin:f'docker exec {self.container_name} /bin/bash -c "sudo cp ... so.host ... so; ..."'is unnecessary. On training images that don't install
sudoat all, theexec returns 127 and the libbnxt copy never happens — surfacing later in
the run as RDMA / verbs failures rather than a clean setup error.
Dropping the
sudoprefix makes the command work on both flavors ofimage; root-in-container can
cpinto/usr/libdirectly.