Support frozen weights #185
Conversation
@tscholak PR is ready for review, feel free to volunteer another lucky reviewer. I guess it's too big for a full review; checking the general structure should be fine.
No regression found on the tutorial, so things seem to be working properly.
Might be worth trying another example closer to our experimental setup though (@RaymondLi0 ?)
Looks good overall. This is mostly a refactor, and I appreciate that tests were updated and extended. I'm not going to deep-review every detail of the checkpointing and loading logic, but from what I see, the transition to named shards and frozen weight handling is clean and backward-compatible.
One requirement before merging: we need a real-world, non-trivial training run post-refactor to confirm nothing breaks under load. Ideally something with a bit of complexity, perhaps a multi-stage setup with MoEs and a mix of frozen and unfrozen weights.
@RaymondLi0: can you own this and re-run, from this branch, the chilled/frozen MoE test job from back when you experimented with this? Or something else that is similarly complex and can be compared to previous results in wandb.
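For context, a minimal PyTorch sketch (not Fast-LLM code) of what a frozen/unfrozen mix means here: frozen parameters get no gradients and no optimizer state, while the rest train normally.

```python
import torch
import torch.nn as nn

# Toy two-stage model; freeze the first stage only.
model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8))
for p in model[0].parameters():
    p.requires_grad = False  # "frozen": no gradient, no optimizer state

# Only unfrozen parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)

x = torch.randn(4, 8)
loss = model(x).square().mean()
loss.backward()
optimizer.step()
```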
Sure! So just to confirm, the goal would be to reproduce a previously run experiment with frozen weights and observe lower memory usage with the same loss curve. Is my understanding correct?
@RaymondLi0 Something like this. I already confirmed the lower memory usage, so the loss curve is the main thing to check (maybe throughput too). My main concern is regressions from the refactoring more than the new frozen feature, so strictly speaking the test doesn't even need to use frozen weights.
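A hedged sketch of how the memory claim can be checked, assuming a CUDA device: measure peak allocated memory over a few training steps and compare before/after the refactor.

```python
import torch

torch.cuda.reset_peak_memory_stats()
# ... run a few training steps here ...
peak_gib = torch.cuda.max_memory_allocated() / 2**30
print(f"peak allocated: {peak_gib:.2f} GiB")
```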
You could also run two experiments, one with lr-scale set to 0 and one set to 1e-12 or something ridiculously small, and make sure that the losses are consistent.
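A minimal sketch of how that comparison could be scripted; the helper name, tolerances, and the per-step loss lists (e.g. exported from wandb) are illustrative assumptions.

```python
import math

def losses_consistent(losses_a, losses_b, rel_tol=1e-3, abs_tol=1e-6):
    # True if the two loss curves agree step-by-step within tolerance.
    return all(
        math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
        for a, b in zip(losses_a, losses_b)
    )
```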
Sounds good, I'll re-run that experiment then.
@jlamypoirier I got the following error trying to re-run a SLAM experiment:
Dataset sampling changes, most likely.
Since it's unrelated to this PR, let's merge and open a separate issue? It doesn't look like a big difference, but it's likely hiding a real bug. (@tscholak this is using the legacy dataset, so it should not have changed at all.) Which commit was the previous run using? Could we bisect to the commit that caused that difference? (And re-run the old one to rule out other factors?)
LGTM! thanks!
Re-running with the old commit fails because of the memmap version:
(indeed, one of the datasets was deleted and re-tokenized a few weeks ago).
You aren't using the exact same prepared dataset? That could explain the difference...
The data was re-tokenized, but it should be the same, except the version is different...
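One hypothetical way to settle this: hash the raw token stream of both copies, ignoring the on-disk format version. The paths, token dtype, and flat-binary layout below are assumptions, not the actual dataset format.

```python
import hashlib
import numpy as np

def token_digest(path, dtype=np.uint16):
    # Hash the token content only; header/version differences don't matter here.
    tokens = np.memmap(path, dtype=dtype, mode="r")
    return hashlib.sha256(tokens.tobytes()).hexdigest()

# Hypothetical paths for the old and re-tokenized copies:
# assert token_digest("old/dataset.bin") == token_digest("new/dataset.bin")
```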
The original commit was before this fix: 3921dae
✨ Description
Fix #183
- Rework `Stage` shards so we can have a separate one for frozen weights. (811739a)
- Store shards as a dict `{shard_name: shard}` instead of a single tensor of shape `(num_shards, shard_size)` (see the sketch below). (b9b017f)
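A sketch of that data-structure change; the shard names and sizes are illustrative, not the exact Fast-LLM internals.

```python
import torch

shard_size = 1024

# Before: one tensor of shape (num_shards, shard_size), shards indexed by position.
shards_old = torch.zeros(3, shard_size)

# After: a dict {shard_name: shard}, so e.g. frozen weights get their own entry.
shards_new = {
    "weights": torch.zeros(shard_size),
    "frozen_weights": torch.zeros(shard_size),
    "grads": torch.zeros(shard_size),
}
```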
🔍 Type of change
Select all that apply: