[feat] support for DeepseekV2 #129
Comments
This is causing loss calculations to be wildly different for some reason; I will investigate further.
I was able to fix this issue as follows: `modeling_mod.DeepseekV2MLP.forward = LigerSwiGLUMLP.forward`
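For reference, a minimal sketch of wiring that patch up end to end. The checkpoint name and the way `modeling_mod` is resolved are assumptions here, since the DeepseekV2 modeling file is loaded dynamically under `trust_remote_code`:

```python
import importlib

from liger_kernel.transformers.swiglu import LigerSwiGLUMLP
from transformers import AutoModelForCausalLM

# Checkpoint name is illustrative; any DeepseekV2 repo with remote code works.
model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-V2-Lite", trust_remote_code=True
)

# The remote modeling file is registered under transformers_modules.*;
# resolve it from the instantiated model class rather than hard-coding a path.
modeling_mod = importlib.import_module(type(model).__module__)

# Patch at the class level so every DeepseekV2MLP instance (including the
# already-constructed ones) dispatches to Liger's fused SwiGLU forward.
# This assumes DeepseekV2MLP keeps the usual gate_proj / up_proj / down_proj
# attribute names and a silu activation, which LigerSwiGLUMLP.forward expects.
modeling_mod.DeepseekV2MLP.forward = LigerSwiGLUMLP.forward
```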
The RoPE method seems to be modified in DeepseekV2?

llama:

```python
cos = cos.unsqueeze(unsqueeze_dim)
sin = sin.unsqueeze(unsqueeze_dim)
q_embed = (q * cos) + (rotate_half(q) * sin)
k_embed = (k * cos) + (rotate_half(k) * sin)
return q_embed, k_embed
```

deepseekv2:

```python
cos = cos[position_ids].unsqueeze(unsqueeze_dim)
sin = sin[position_ids].unsqueeze(unsqueeze_dim)
b, h, s, d = q.shape
q = q.view(b, h, s, d // 2, 2).transpose(4, 3).reshape(b, h, s, d)
b, h, s, d = k.shape
k = k.view(b, h, s, d // 2, 2).transpose(4, 3).reshape(b, h, s, d)
q_embed = (q * cos) + (rotate_half(q) * sin)
k_embed = (k * cos) + (rotate_half(k) * sin)
return q_embed, k_embed
```
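In other words, DeepseekV2 stores the rotary dimensions interleaved and permutes them into the LLaMA half-split layout before applying the same `rotate_half` rotation. A self-contained sketch of that equivalence in plain PyTorch; after the layout change, a LLaMA-convention fused RoPE kernel (such as Liger's) could in principle be slotted in for the last four lines:

```python
import torch

def rotate_half(x):
    # Standard LLaMA helper: rotate the two halves of the last dimension.
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)

def interleaved_to_half_split(x):
    # (x0, y0, x1, y1, ...) -> (x0, x1, ..., y0, y1, ...): the permutation
    # DeepseekV2 applies to q and k before the standard rotation.
    b, h, s, d = x.shape
    return x.view(b, h, s, d // 2, 2).transpose(4, 3).reshape(b, h, s, d)

def apply_deepseek_v2_rope(q, k, cos, sin, position_ids, unsqueeze_dim=1):
    # DeepseekV2 additionally gathers cos/sin by position_ids.
    cos = cos[position_ids].unsqueeze(unsqueeze_dim)
    sin = sin[position_ids].unsqueeze(unsqueeze_dim)
    q = interleaved_to_half_split(q)
    k = interleaved_to_half_split(k)
    # From here on this is exactly the LLaMA-style rotation, so a fused
    # LLaMA-convention RoPE kernel could replace these four lines.
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed
```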
DeepseekV2 uses MLA (Multi-head Latent Attention) to reduce the KV cache.
Yeah, the DeepseekV2 one is quite interesting, as it uses decoupled RoPE. As for the MLA part, it mainly targets inference-time speedups by absorbing the low-rank projection matrices into the original linear matrices. Feel free to first try implementing the layers apart from that, and gradually improve with separate PRs. Thanks for the interesting feature request and rapid kick-off~
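For readers unfamiliar with MLA, here is a minimal conceptual sketch of the low-rank KV path only; the names and dimensions are made up for illustration and do not match the real modeling code, and the decoupled RoPE key path is omitted:

```python
import torch.nn as nn

class MLAKVCompression(nn.Module):
    # Keys and values are reconstructed from a small shared latent c_kv, so
    # only c_kv (plus the decoupled RoPE key, omitted here) has to be cached,
    # which is what shrinks the KV cache.
    def __init__(self, d_model=512, kv_lora_rank=64, n_heads=8, head_dim=64):
        super().__init__()
        self.w_dkv = nn.Linear(d_model, kv_lora_rank, bias=False)            # down-projection
        self.w_uk = nn.Linear(kv_lora_rank, n_heads * head_dim, bias=False)  # up-projection (K)
        self.w_uv = nn.Linear(kv_lora_rank, n_heads * head_dim, bias=False)  # up-projection (V)

    def forward(self, hidden_states):
        c_kv = self.w_dkv(hidden_states)  # (b, s, kv_lora_rank): what gets cached
        k = self.w_uk(c_kv)               # (b, s, n_heads * head_dim)
        v = self.w_uv(c_kv)
        return c_kv, k, v
```

The inference trick mentioned above is that `w_uk` can be folded into the query projection (and `w_uv` into the output projection), so attention can run directly against the cached latent without materializing the full keys and values.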
🚀 The feature, motivation and pitch
It would be nice to support DeepseekV2 models. Unfortunately, the modeling code is not yet accepted into transformers and requires trust_remote_code=True.
I'm monkey-patching it myself for now, and wanted to leave some notes that may be helpful when support is added officially down the road.
One initial issue when swapping in swiglu: the loss calculations came out wildly different until DeepseekV2MLP.forward was patched at the class level (see the comments above).