[Auto-Parallel] fix auto_dp in fsdp #77147
Conversation
Your PR was submitted successfully. Thank you for contributing to the open-source project!
Codecov Report
❌ Patch coverage is

Additional details and impacted files:

    @@           Coverage Diff            @@
    ##           develop   #77147   +/-   ##
    ==========================================
      Coverage        ?    96.82%
    ==========================================
      Files           ?         1
      Lines           ?        63
      Branches        ?         0
    ==========================================
      Hits            ?        61
      Misses          ?         2
      Partials        ?         0
    self._shard_fn._register_hook_for_param_grad(param)
    if not in_auto_dp_mode():
        self._shard_fn._register_hook_for_param_grad(param)
    if in_auto_dp_mode():
Could L78 simply be changed to `else`?
The `_register_comm_hook` here is outside the for loop, so this cannot be turned into an `else`.
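A rough sketch of the structure being described, reconstructed from the diff context (not the exact PR code): the per-parameter grad hook is registered inside the parameter loop, while `_register_comm_hook` runs once outside it, so the two checks cannot be merged into a single if/else.

    # Sketch only; function and attribute names follow the diff,
    # the loop structure is assumed from the reply above.
    for param in model.parameters():
        if not param.trainable:
            continue
        if not in_auto_dp_mode():
            # per-parameter grad hook, registered inside the loop
            self._shard_fn._register_hook_for_param_grad(param)

    if in_auto_dp_mode():
        # registered once per model, outside the loop, so it cannot
        # be the `else` branch of the check above
        self._register_comm_hook(model)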
    if not param.trainable:
        continue
    new_placements = [dist.Replicate() for _ in param.placements]
    replicte_param = dist.reshard(
1. Typo: replicte_param -> replicate_param.
2. Reshard is applied to every parameter here; doesn't that add unnecessary overhead? Should we first check whether the parameter already matches the target state?
Added a check that skips the redundant operation when the param already matches the target. Thanks!
Fixed the typo, thanks.
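A sketch of how the two review points could look once addressed (the exact skip condition is an assumption based on the replies above, not the PR's final code):

    import paddle.distributed as dist

    for key, param in sublayers._parameters.items():
        if not param.trainable:
            continue
        # Skip the reshard when the parameter is already fully replicated,
        # avoiding a redundant all_gather (condition assumed from the reply).
        if all(isinstance(p, dist.Replicate) for p in param.placements):
            continue
        new_placements = [dist.Replicate() for _ in param.placements]
        replicate_param = dist.reshard(  # typo fixed: replicte_param -> replicate_param
            param, param.process_mesh, new_placements
        )
        param.get_tensor()._share_data_with(replicate_param.get_tensor())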
    def shard_comm(*_):
        for key, param in sublayers._parameters.items():
            if param.trainable:
                new_placements = get_placement_with_sharding(param, 0)
Does the `0` here mean dimension 0? Suggest giving it a named variable to improve readability.
All the related dimensions in fully_shard.py have been changed to the dp dimension, thanks!
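One way the hard-coded 0 could be replaced with a named dp-axis variable (illustrative only; the actual variable name and lookup used in fully_shard.py may differ):

    # Hypothetical naming: look the dp axis up from the process mesh once,
    # then pass it explicitly instead of the literal 0.
    dp_axis = param.process_mesh.dim_names.index("dp")  # assumes a mesh axis named "dp"
    new_placements = get_placement_with_sharding(param, dp_axis)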
    self._register_comm_hook(model)
    os.environ["skip_sharding3_output_reshard"] = "1"

    def _register_comm_hook(self, model):
The communication logic here handles each parameter separately; should we consider improving communication efficiency later?
Added a check that skips the redundant operation when the param already matches the target. Thanks!
Later we hope to improve communication efficiency with a tensor_fusion-like approach.
    replicte_param = dist.reshard(
        param, param.process_mesh, new_placements
    )
    param.get_tensor()._share_data_with(replicte_param.get_tensor())
Could this in-place operation cause the backward pass to fail to find the correct data address?
It will not cause the backward pass to lose the correct address.
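One way to see the intuition behind this reply (an illustration, not from the PR): `_share_data_with` only swaps the underlying storage, so the Python Tensor object that the layer and the autograd graph hold onto stays the same.

    # Illustration only: the parameter object's identity is preserved,
    # only its underlying DenseTensor storage is replaced.
    original = param
    replicate_param = dist.reshard(param, param.process_mesh, new_placements)
    param.get_tensor()._share_data_with(replicate_param.get_tensor())
    assert param is original  # same Tensor object; any hook or autograd edge
                              # that captured `param` still points at it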
liym27 left a comment:
LGTM
PR Category
Auto Parallel
PR Types
Bug fixes
Description
Support the FSDP + auto_dp scenario, verified on a multimodal model. With FSDP under auto parallel, reshard ops (all_gather and slice) are normally inserted automatically in the forward and backward passes based on sharding propagation. auto_dp marks the dp dimension as fake_replicate, which changes the original sharding propagation, so the all_gather and slice are no longer inserted automatically. This PR therefore inserts reshard in pre_forward, post_forward, pre_backward, and post_backward. Because lm_head and embedding share the same parameter, the backward communication is inserted via layer-level hooks. (With a parameter-level hook such as param._register_backward_hook, the hook would only be added on the embedding layer, which is not what we want.) card-92763
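A minimal sketch of the layer-level hook idea described above, shown with forward hooks only (the PR also handles the backward direction; the helper names and structure here are assumptions, not the PR's code):

    import paddle.distributed as dist

    def _gather_params(layer):
        # Before the layer runs: reshard sharded params to Replicate (all_gather).
        for param in layer._parameters.values():
            if param is None or not param.trainable:
                continue
            if all(isinstance(p, dist.Replicate) for p in param.placements):
                continue  # already replicated, skip redundant communication
            full = dist.reshard(
                param, param.process_mesh,
                [dist.Replicate() for _ in param.placements],
            )
            param.get_tensor()._share_data_with(full.get_tensor())

    def _reshard_back(layer, dp_axis=0):
        # After the layer runs: reshard params back to the sharded placement (slice).
        for param in layer._parameters.values():
            if param is None or not param.trainable:
                continue
            sharded = dist.reshard(
                param, param.process_mesh,
                get_placement_with_sharding(param, dp_axis),
            )
            param.get_tensor()._share_data_with(sharded.get_tensor())

    # Layer-level hooks (rather than param._register_backward_hook) so that
    # every layer sharing a parameter, e.g. lm_head and embedding, triggers
    # the communication.
    for sublayer in model.sublayers():
        sublayer.register_forward_pre_hook(lambda layer, inputs: _gather_params(layer))
        sublayer.register_forward_post_hook(
            lambda layer, inputs, outputs: _reshard_back(layer)
        )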