[fsdp] fix: Detect whether model needs trust_remote_code during FSDP HuggingFace checkpoint save #5200
thvasilo wants to merge 3 commits into verl-project:main from
Conversation
Code Review
This pull request addresses a crash during FSDP checkpoint saving for models that require trust_remote_code=True by introducing logic to detect and pass this flag. However, there is a critical security concern: automatically enabling trust_remote_code bypasses an important security boundary, potentially allowing arbitrary code execution from malicious models during checkpoint saving. Furthermore, the new remote-code detection logic has a potential critical issue that could lead to an AttributeError. The .pre-commit-config.yaml update is a good improvement.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
About the security concern raised: my thinking was that if we get to the saving point for a model that required remote code, we can assume the user has enabled trust_remote_code for the actor model; otherwise training would have failed at model loading time. Let me know if you'd prefer the explicit approach though.
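That assumption can be expressed as a small detection helper. This is only a sketch, and the function name is hypothetical (not the PR's actual code); it relies on the fact that a HuggingFace config loaded from custom code carries an `auto_map` attribute pointing auto classes at the custom modeling files:

```python
def model_needs_trust_remote_code(config) -> bool:
    """Heuristic: a HuggingFace config whose `auto_map` maps auto
    classes to custom modeling files can only have been loaded with
    trust_remote_code=True, so the save path may safely reuse that flag."""
    return bool(getattr(config, "auto_map", None))
```

Under this reasoning, no new user-facing flag is needed: loading already acted as the security gate.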
What does this PR do?
Ensures that trust_remote_code is passed correctly to `auto_model_cls.from_config` when the model requires remote code during saving.

While testing GRPO full fine-tuning of the `nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16` model I got

To replicate the issue use an 80GB+ GPU and run
After applying the fix I'm able to save the model
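The fix point can be sketched as follows. The helper name `build_model_for_save` is illustrative, assuming an Auto* class whose `from_config` accepts a `trust_remote_code` keyword, as the transformers Auto classes do:

```python
def build_model_for_save(auto_model_cls, config):
    # Pass trust_remote_code=True only when the config indicates custom
    # code (the `auto_map` attribute); otherwise keep the default so
    # ordinary models are unaffected by the change.
    trust_remote_code = bool(getattr(config, "auto_map", None))
    return auto_model_cls.from_config(config, trust_remote_code=trust_remote_code)
```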
Update .pre-commit config
pre-commit checks were failing because the Python compilation check scans the entire directory, including .venv, which can contain Python files that do not conform to the requirements. This PR adds some exclusions to avoid this.
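For reference, excluding virtual-environment directories from pre-commit's Python compile check can look like the excerpt below. This is a sketch only; the hook `rev` and the exact exclusion patterns in the PR may differ:

```yaml
# .pre-commit-config.yaml (excerpt)
- repo: https://github.com/pre-commit/pre-commit-hooks
  rev: v4.5.0
  hooks:
    - id: check-ast            # fails on Python files that do not compile
      exclude: ^(\.venv/|venv/)
```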
Checklist Before Starting
- Title format: `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `veomni`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`, like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If the change is breaking, add `[BREAKING]` to the beginning of the title, e.g. `[BREAKING][fsdp, megatron] feat: dynamic batching`

Test
Despite my best efforts I'm not able to produce a unit test that detects this, I can only observe the issue while training with Ray.
I've written https://gist.github.com/thvasilo/76596a638440cd7f342ba4d23a2efb2e, which replicates the process, but it does not trigger the error. My current assumption is an execution environment discrepancy: I'm using torchrun to run the test, which shares processes, whereas my job fails when using Ray.
API and Usage Example
No API changes; custom models are detected from their contents.
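One way such content-based detection can work (a sketch under my own assumptions, not necessarily verl's exact implementation): inspect the model directory's config.json for an `auto_map` entry, which is the mechanism HuggingFace uses to wire auto classes to custom code files:

```python
import json
from pathlib import Path

def dir_needs_trust_remote_code(model_dir: str) -> bool:
    """Return True if the saved model's config.json references
    custom code via the HuggingFace `auto_map` mechanism."""
    cfg_path = Path(model_dir) / "config.json"
    if not cfg_path.is_file():
        return False
    with cfg_path.open() as f:
        return "auto_map" in json.load(f)
```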
Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Run `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- Request CI via the `ci-request` channel in the `verl` Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
- If your PR changes the `recipe` submodule, please also update the reference to the submodule commit via `git submodule update --remote` or `cd recipe && git pull origin main`.