Autotp training #6922

Closed
wants to merge 85 commits
85 commits
674a873
auto tp training
inkcherry Apr 3, 2024
a2e4c47
update parallel_states
inkcherry Apr 23, 2024
f4eb142
Merge branch 'master' into HEAD
inkcherry Nov 19, 2024
dd081ed
WA skips assertions, the loss remains exactly consistent with the low…
inkcherry Nov 19, 2024
cdaed2f
save/load ckpt & save/load hf model basic POC
inkcherry Nov 22, 2024
9aad0e7
finish all the basic functionalities
inkcherry Nov 27, 2024
2bb11fd
update
inkcherry Nov 28, 2024
e75c1c2
use groups for parallel_states
inkcherry Dec 2, 2024
840a5f2
enable bwd allreduce, enable scale loss by gas
inkcherry Dec 2, 2024
60bd6ab
add dataloader check
inkcherry Dec 4, 2024
9266383
refactor autoTP step1
inkcherry Dec 4, 2024
07174a9
rm parallel_states
inkcherry Dec 5, 2024
ee6323e
refactor autoTP step2
inkcherry Dec 5, 2024
6461b84
update ut step1
inkcherry Dec 10, 2024
4d73011
update
inkcherry Dec 11, 2024
c79c3bb
add uts
inkcherry Dec 11, 2024
97e659c
finished all ut code base
inkcherry Dec 12, 2024
a15905b
add lr scheduler test
inkcherry Dec 12, 2024
e9802b0
refine ut
inkcherry Dec 12, 2024
88b8acf
fix bcast_objlist
inkcherry Dec 15, 2024
868be0b
refine layers.py
inkcherry Dec 15, 2024
3788e07
refine gather
inkcherry Dec 15, 2024
27b24f6
pass codegen350M +TP2 ut
inkcherry Dec 16, 2024
3d7b89f
add mode choice
inkcherry Dec 16, 2024
47a6b0b
fix chatglm
inkcherry Dec 16, 2024
3a23997
fix chatglm2 with transformers=4.40 version
inkcherry Dec 16, 2024
e3ec46e
uneven
inkcherry Dec 16, 2024
9685879
fix uneven
inkcherry Dec 16, 2024
7b99b03
fix training
inkcherry Dec 16, 2024
570645f
refine code
inkcherry Dec 17, 2024
3729b64
remove skip bcase&reduce
inkcherry Dec 17, 2024
62d8858
fix typo
inkcherry Dec 17, 2024
dd17313
format
inkcherry Dec 17, 2024
93cf6f5
refine code
inkcherry Dec 18, 2024
87c4bc2
refine code
inkcherry Dec 18, 2024
1714bb5
refine
inkcherry Dec 18, 2024
dadf915
update yuan
inkcherry Dec 19, 2024
86c9399
optimize usage of move function
inkcherry Dec 19, 2024
2526dc6
refine args usage
inkcherry Dec 19, 2024
c9fd699
format
inkcherry Dec 19, 2024
797e71f
zero1 compatible
inkcherry Dec 19, 2024
86ae65e
remove wa
inkcherry Dec 22, 2024
3e40024
fix cpu device name
inkcherry Dec 22, 2024
7d94b77
fix lm-head
inkcherry Dec 23, 2024
b297950
add detach
inkcherry Dec 23, 2024
67ce220
fix ipex integration
inkcherry Dec 23, 2024
f818be9
fix tied_embedding
inkcherry Dec 24, 2024
11c98f6
Merge remote-tracking branch 'origin/master' into autotp_training
inkcherry Jan 2, 2025
e22b625
format
inkcherry Jan 2, 2025
8531b64
Merge branch 'master' into autotp_training
tjruwase Jan 6, 2025
8d19e01
Merge branch 'master' into autotp_training
loadams Jan 6, 2025
060d48b
remove outdated comments
inkcherry Jan 13, 2025
6667ba1
Enhance unit test coverage
inkcherry Jan 13, 2025
84c9335
update ut
inkcherry Jan 13, 2025
cb29d7c
sequential some tests
inkcherry Jan 13, 2025
a49e77e
format
inkcherry Jan 13, 2025
0ef5274
use parameterized save path
inkcherry Jan 13, 2025
481088d
Merge remote-tracking branch 'my/autotp_training' into autotp_training
inkcherry Jan 13, 2025
f740de0
refactor infer/training path
inkcherry Jan 15, 2025
726004d
format
inkcherry Jan 15, 2025
bd8de77
remove empty line
inkcherry Jan 15, 2025
c334da0
remove autotp_size config from zero scope
inkcherry Jan 15, 2025
29eef07
update
inkcherry Jan 15, 2025
ba47ed1
format
inkcherry Jan 15, 2025
bbde63f
fix layer typo and rename
inkcherry Jan 15, 2025
bdca62c
fix python3.9
inkcherry Jan 15, 2025
5d89422
refine code
inkcherry Jan 15, 2025
0a9caff
refine
inkcherry Jan 15, 2025
c923a3b
refine config
inkcherry Jan 16, 2025
92be193
improve ut coverage for save
inkcherry Jan 17, 2025
23bd0fc
fix process exit early
inkcherry Jan 17, 2025
358f395
improve ut coverage
inkcherry Jan 17, 2025
cdfb54c
Merge remote-tracking branch 'origin/master' into autotp_training
inkcherry Jan 17, 2025
6d030c4
fix zero1 regression
inkcherry Jan 17, 2025
f9e7756
Merge branch 'master' into autotp_training
inkcherry Jan 20, 2025
6e7f846
fix ci
inkcherry Jan 20, 2025
c4fde7e
Merge branch 'autotp_training' of https://github.com/inkcherry/DeepSp…
inkcherry Jan 20, 2025
05bcecd
skip overflow test
inkcherry Jan 21, 2025
86f1c77
Merge branch 'master' into autotp_training
inkcherry Jan 22, 2025
668cb1a
Skip xpu tests until the ci is updated
inkcherry Jan 23, 2025
2e042a4
Merge branch 'autotp_training' of https://github.com/inkcherry/DeepSp…
inkcherry Jan 23, 2025
e08a234
Merge branch 'master' into autotp_training
delock Jan 24, 2025
20588f2
Merge branch 'master' into autotp_training
tjruwase Jan 30, 2025
1e05996
Merge branch 'master' into autotp_training
hwchen2017 Jan 30, 2025
affeb88
Merge branch 'master' into autotp_training
inkcherry Feb 5, 2025
sequential some tests
inkcherry committed Jan 13, 2025
commit cb29d7cf6c887e12ad6283644663fef40fb4055b
3 changes: 2 additions & 1 deletion tests/unit/model_parallelism/test_autotp_training.py
@@ -136,6 +136,7 @@ def process_linear_layer(hidden_dim, input):
     torch_loss.backward()
     return torch_linear, torch_out
 
+@pytest.mark.sequential
 @pytest.mark.parametrize("tp_size", [2,4])
 class TestTpLayerFwdBwd(DistributedTest):
     world_size = 4
@@ -234,7 +235,7 @@ def testColumnParallel(self, tp_size: int):
                     out.contiguous(),
                     atol=1e-3)
 
-
+@pytest.mark.sequential
 class TestParamsGather(DistributedTest):
     world_size = 4
     reuse_dist_env = True
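The hunks above tag two distributed test classes with a custom `@pytest.mark.sequential` marker so the runner can keep them from executing concurrently. For readers unfamiliar with custom pytest markers, here is a minimal sketch of how such a class-level mark is applied and later detected; the class and helper names are hypothetical illustrations, not part of the DeepSpeed test suite:

```python
import pytest

# Illustrative sketch only. In a real suite the marker would typically be
# registered in conftest.py to silence "unknown marker" warnings, e.g.:
#
#   def pytest_configure(config):
#       config.addinivalue_line(
#           "markers", "sequential: run this test serially, not in parallel")

@pytest.mark.sequential
@pytest.mark.parametrize("tp_size", [2, 4])
class FakeTpLayerTest:  # hypothetical stand-in for TestTpLayerFwdBwd
    world_size = 4

def has_marker(cls, name):
    # pytest attaches class-level marks to the class's `pytestmark` list.
    return any(m.name == name for m in getattr(cls, "pytestmark", []))

print(has_marker(FakeTpLayerTest, "sequential"))  # → True
```

A test runner (or a conftest hook such as `pytest_collection_modifyitems`) can then inspect `item.get_closest_marker("sequential")` on collected items to schedule those tests serially.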