lighteval support after checkpoint, UX refactor #222
eliebak wants to merge 45 commits into huggingface:main
Conversation
…ange this condition just to slurm in the future and support local lighteval)
…lighteval and log the eval to wandb
…h, solve serialization pb of xPath
What's the reason we have to "Install lighteval from https://github.com/eliebak/lighteval/tree/current-nanotron otherwise it will not work"?
I changed the installation to https://github.com/huggingface/lighteval/tree/nanotron-compatible in the .toml file. It's an older version of lighteval that works with the current code; at the time I made the change, lighteval was changing a bit, so I used an older commit, the one used for the FW ablation. I (or other ppl) can update to the current version after :)
src/nanotron/trainer.py
Outdated
current_time = datetime.datetime.now().strftime("%d/%m/%Y_%H:%M:%S")
if dist.get_rank(self.parallel_context.world_pg) == self.logger_ranks[0] and wandb is not None:
datetime.datetime.now().strftime("%d/%m/%Y_%H:%M:%S")
src/nanotron/trainer.py
Outdated
# Log initial tokens to set the starting point
wandb.log({"Tokens": initial_tokens})
print(f"Initial Tokens: {initial_tokens}")
log_rank(message, logger=logger, level=logging.INFO, rank=0)
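For context, here is a minimal stand-in for what a log_rank-style helper does; the real nanotron signature may differ, and current_rank is a hypothetical parameter standing in for dist.get_rank():

```python
import logging

def log_rank(msg: str, logger: logging.Logger, level: int, rank: int = 0,
             current_rank: int = 0) -> None:
    # Minimal stand-in for nanotron's log_rank: only the chosen rank emits,
    # so an 8-process run logs the message once instead of eight times.
    if current_rank == rank:
        logger.log(level, msg)

logger = logging.getLogger("demo")
logger.setLevel(logging.INFO)
log_rank("Initial Tokens: 0", logger=logger, level=logging.INFO,
         rank=0, current_rank=0)  # emitted on rank 0 only
```

Unlike a bare print, this keeps multi-rank logs readable and routes through the logging level machinery.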
src/nanotron/trainer.py
Outdated
if wandb is not None and dist.get_rank(self.parallel_context.dp_pg) == 0:
    if self.config.general.wandb_id is None:
        self.config.general.wandb_id = wandb.run.id
        self.config.general.wandb_project = wandb.run.project
why don't we keep it as the default in the initial config?
It was because we previously needed to store wandb.run.id (and project) to pass to lighteval so the evals would be saved to the same wandb run. But we don't save to wandb anymore, so there's no need for this; will remove it. Thanks for noticing!
src/nanotron/trainer.py
Outdated
"Update the wandb run due to resume from checkpoint", logger=logger, level=logging.WARNING, rank=0
)
self.config.general.wandb_id = wandb.run.id
self.config.general.wandb_project = wandb.run.project
src/nanotron/utils.py
Outdated
import re
from contextlib import ExitStack, contextmanager
from typing import ContextManager, List, Optional
import json
Seems like we don't use it? Could you rerun pre-commit please.
]

lighteval = [
    "lighteval[nanotron]@git+https://github.com/huggingface/lighteval.git",
tokens=tokens,
optimizer=optimizer,
data_stages=data_stages,
lighteval=lighteval,
src/nanotron/serialize/main.py
Outdated
import os
from pathlib import Path
from typing import Optional, cast
from datasets.download.streaming_download_manager import xPath
src/nanotron/config/config.py
Outdated
class S3UploadArgs:
    """Arguments related to uploading checkpoints on s3"""

    remove_after_upload: bool
src/nanotron/config/config.py
Outdated
if hasattr(self, "_post_init_done"):
    return
self._post_init_done = True
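The guard in this hunk can be sketched in isolation; Config below is a hypothetical reduced dataclass, not the real nanotron config:

```python
from dataclasses import dataclass

@dataclass
class Config:  # hypothetical reduced config, just to show the guard
    runs: int = 0

    def __post_init__(self):
        # __post_init__ may be invoked again later (e.g. to re-validate
        # after mutating fields); the flag makes repeated calls no-ops
        if hasattr(self, "_post_init_done"):
            return
        self._post_init_done = True
        self.runs += 1

cfg = Config()
cfg.__post_init__()  # second call returns early; runs stays at 1
```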
eval_slurm_config: Optional[str] = None
eval_slurm_template: Optional[str] = None
lighteval_config_path: Optional[str] = None
is_s3_available: Optional[bool] = None
Remove? Because we set it in lines 406 and 409?
I think it's clearer to have it in the config and define it later, no? Not sure what the standard practice is here. (Update: I removed it; I think it's better, you're right.)
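A sketch of the "define it later" option discussed here: derive the flag in __post_init__ instead of declaring it as a config field. Field names below are hypothetical, not the actual nanotron config:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class GeneralArgs:  # hypothetical, mirroring the discussion above
    s3_upload_path: Optional[str] = None

    def __post_init__(self):
        # derived, not declared: users never set this in the YAML,
        # so it does not belong in the serialized config schema
        self.is_s3_available = self.s3_upload_path is not None

print(GeneralArgs(s3_upload_path="s3://bucket/ckpts").is_s3_available)  # True
print(GeneralArgs().is_s3_available)  # False
```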
src/nanotron/config/config.py
Outdated
return config


def save_as_yaml(config, config_class, file_path: str):
xrsrke
left a comment
Looks good overall. I left a few requested changes and one question: when is lighteval supposed to run? It doesn't seem to launch any lighteval runs after each checkpoint is saved on Slurm? Thanks
src/nanotron/trainer.py
Outdated
# LogItem("consumed_samples", self.consumed_train_samples, "human_format"),  # , "12d"),
LogItem(
    "consumed_tokens",
    self.metadata.consumed_train_samples * self.config.tokens.sequence_length,
    "human_format",
),  # , "12d"),
# LogItem(
#     "consumed_tokens",
#     self.metadata.consumed_train_samples * self.config.tokens.sequence_length,
#     "human_format",
# ),  # , "12d"),
elif isinstance(value, xPath):
    result[field.name] = str(value)
Why xPath? I thought we replaced it with Path.
No, we still need it in s3upload to upload to s3 (with Path it treats an s3 path as a local path).
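The local-path problem with plain pathlib can be seen directly; PurePosixPath is used here only to make the demo deterministic across platforms:

```python
from pathlib import PurePosixPath

# pathlib treats "s3://bucket" as an ordinary relative path and collapses
# the "//" after the scheme, mangling the S3 URL
local_view = str(PurePosixPath("s3://my-bucket/checkpoints") / "10")
print(local_view)  # s3:/my-bucket/checkpoints/10 -- no longer a valid S3 URL
```

This is why an xPath-style wrapper that keeps the URL string intact is still needed for the S3 upload path.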
xrsrke
left a comment
Hi, I think you made some breaking changes; checkpoint saving is broken. To reproduce:
CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nproc-per-node 8 --master_port=25621 run_train.py --config /fsx/phuc/temp/env_for_review_pr/nanotron/pr_config_new.yaml
Traceback (most recent call last):
  File "/fsx/phuc/temp/env_for_review_pr/nanotron/run_train.py", line 237, in <module>
    trainer.train(dataloader)
  File "/fsx/phuc/temp/env_for_review_pr/nanotron/src/nanotron/trainer.py", line 489, in train
    self.save_checkpoint()
  File "/fsx/phuc/temp/env_for_review_pr/nanotron/src/nanotron/trainer.py", line 912, in save_checkpoint
    checkpoint_path = Path(checkpoints_path) / f"{self.iteration_step}"
  File "/fsx/phuc/temp/env_for_review_pr/env/lib/python3.10/pathlib.py", line 960, in __new__
    self = cls._from_parts(args)
  File "/fsx/phuc/temp/env_for_review_pr/env/lib/python3.10/pathlib.py", line 594, in _from_parts
    drv, root, parts = self._parse_args(args)
  File "/fsx/phuc/temp/env_for_review_pr/env/lib/python3.10/pathlib.py", line 578, in _parse_args
    a = os.fspath(a)
TypeError: expected str, bytes or os.PathLike object, not NoneType
[2024-10-02 20:09:03,528] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2082316 closing signal SIGTERM
[2024-10-02 20:09:03,529] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2082319 closing signal SIGTERM
[2024-10-02 20:09:03,529] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2082323 closing signal SIGTERM
[2024-10-02 20:09:03,706] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 2082317) of binary: /fsx/phuc/temp/env_for_review_pr/env/bin/python
Traceback (most recent call last):
File "/fsx/phuc/temp/env_for_review_pr/env/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/fsx/phuc/temp/env_for_review_pr/env/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/fsx/phuc/temp/env_for_review_pr/env/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/fsx/phuc/temp/env_for_review_pr/env/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/fsx/phuc/temp/env_for_review_pr/env/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/fsx/phuc/temp/env_for_review_pr/env/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
In your config, checkpoints_path is set to null.
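The failure is easy to reproduce standalone: a null checkpoints_path in the YAML deserializes to None, and pathlib rejects it:

```python
from pathlib import Path

checkpoints_path = None  # what a `checkpoints_path: null` YAML entry becomes
try:
    checkpoint_path = Path(checkpoints_path) / "10"
except TypeError as e:
    # on Python 3.10: "expected str, bytes or os.PathLike object, not NoneType"
    print(type(e).__name__)  # TypeError
```

Validating the config (or defaulting the path) before calling save_checkpoint would surface this much earlier than step time.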
Info

Builds on the s3upload work from PR #220.

Changes
- Automated lighteval job execution after each saved checkpoint, via self.s3_mover.post_upload_callback (S3 enabled) or self.post_checkpoint_callback (local)
- Enhanced wandb logging
- New scripts for users: launcher.py and create_config.py
- Fix model parameter counting (model_config.get_llama_param_count())
- Fix get_flops() method and block_compute_costs
- Customizable slurm folder

Usage Instructions

create_config.py:
python create_config.py --save_path <path_to_save>

launcher.py, basic usage (logs go to logs_path/run):
python launcher.py --logs_path <path_to_log> --run <name_of_the_run>

Additional options:
--slurm --nodes <number_of_node>
python nanotron/launcher.py [other_args] --override KEY1=VALUE1 KEY2=VALUE2 ...

Minor Changes
- trust_remote_code=True globally

Recommended Workflow
- launcher.py
- create_config.py
- slurm folder (if using Slurm)

Fancy prints ✨

Command:
python launcher.py --base-config "smollm-360M-4nodes" --run "smol-360M" --override "tokens.micro_batch_size=8" "optimizer.learning_rate_scheduler.learning_rate=1e-3" --slurm --nodes 4

Output
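A hedged sketch of how `--override KEY=VALUE` pairs like the ones in the command above could be applied to a nested config dict; apply_overrides is a hypothetical helper, not the actual launcher.py implementation:

```python
import ast

def apply_overrides(config: dict, overrides: list) -> dict:
    # hypothetical helper: walk dotted keys and set the leaf value
    for item in overrides:
        dotted_key, raw = item.split("=", 1)
        *parents, leaf = dotted_key.split(".")
        node = config
        for part in parents:
            node = node.setdefault(part, {})
        try:
            node[leaf] = ast.literal_eval(raw)  # numbers, bools, lists, ...
        except (ValueError, SyntaxError):
            node[leaf] = raw  # fall back to keeping the raw string
    return config

cfg = apply_overrides({}, ["tokens.micro_batch_size=8",
                           "optimizer.learning_rate_scheduler.learning_rate=1e-3"])
print(cfg["tokens"]["micro_batch_size"])  # 8
```

Using ast.literal_eval keeps "8" an int and "1e-3" a float while leaving non-literal values such as run names as strings.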