[feat] support for DeepseekV2 #129

Open
tmm1 opened this issue Aug 27, 2024 · 4 comments
Labels: feature, help wanted, huggingface

Comments

tmm1 (Contributor) commented Aug 27, 2024

🚀 The feature, motivation and pitch

It would be nice to support DeepseekV2 models. Unfortunately, the modeling code has not yet been accepted into transformers and requires trust_remote_code=True.

I'm monkey-patching it in myself for now, and wanted to leave some notes that may be helpful when support is added officially down the road.

import sys

from accelerate import init_empty_weights
from transformers import AutoModelForCausalLM

from liger_kernel.transformers.cross_entropy import LigerCrossEntropyLoss
from liger_kernel.transformers.rms_norm import LigerRMSNorm
from liger_kernel.transformers.rope import liger_rotary_pos_emb
from liger_kernel.transformers.swiglu import LigerSwiGLUMLP

# Load on the meta device just to force the remote modeling code to be
# downloaded and imported, so the dynamically generated module can be patched.
with init_empty_weights():
    model = AutoModelForCausalLM.from_pretrained("deepseek-ai/DeepSeek-Coder-V2-Lite-Base", trust_remote_code=True)
    modeling_mod = sys.modules[model.__class__.__module__]

modeling_mod.apply_rotary_pos_emb = liger_rotary_pos_emb
modeling_mod.DeepseekV2RMSNorm = LigerRMSNorm
modeling_mod.DeepseekV2MLP = LigerSwiGLUMLP
modeling_mod.CrossEntropyLoss = LigerCrossEntropyLoss
# deepseekv2_lce_forward is a custom fused-linear-cross-entropy forward
# (not shown here), analogous to liger's llama lce_forward.
modeling_mod.DeepseekV2ForCausalLM.forward = deepseekv2_lce_forward
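
Note that the class swaps (DeepseekV2RMSNorm, DeepseekV2MLP) only affect models constructed after the patch, so the real weights presumably get loaded with a second from_pretrained call once the module has been patched; the function swaps (apply_rotary_pos_emb, CrossEntropyLoss) are looked up at call time and take effect either way.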

One initial issue when swapping in the SwiGLU kernel:

  File "/mnt/ML/huggingface/modules/transformers_modules/deepseek-ai/DeepSeek-Coder-V2-Lite-Base/ea9b066cee82f82906fdd58898cb3788b1c5d770/modeling_deepseek.py", line 555, in <listcomp>
    DeepseekV2MLP(
TypeError: LigerSwiGLUMLP.__init__() got an unexpected keyword argument 'intermediate_size'
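
The error comes from the MoE block in modeling_deepseek.py, which (per the traceback) constructs its experts as DeepseekV2MLP(config, intermediate_size=...), while LigerSwiGLUMLP takes only a config. One possible shim, assuming LigerSwiGLUMLP reads hidden_size and intermediate_size off its config argument (PatchedSwiGLUMLP is hypothetical, not part of liger):

    import copy

    # Hypothetical shim: accept DeepseekV2MLP's extra kwargs by overriding the
    # relevant fields on a copied config before delegating to LigerSwiGLUMLP.
    class PatchedSwiGLUMLP(LigerSwiGLUMLP):
        def __init__(self, config, hidden_size=None, intermediate_size=None):
            config = copy.deepcopy(config)
            if hidden_size is not None:
                config.hidden_size = hidden_size
            if intermediate_size is not None:
                config.intermediate_size = intermediate_size
            super().__init__(config)

(The simpler fix in the next comment sidesteps this entirely.)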
tmm1 (Author) commented Aug 27, 2024

> modeling_mod.apply_rotary_pos_emb = liger_rotary_pos_emb

This is causing loss calculations to be wildly different for some reason.

I will investigate further.


> TypeError: LigerSwiGLUMLP.__init__() got an unexpected keyword argument 'intermediate_size'

I was able to fix this issue as follows:

modeling_mod.DeepseekV2MLP.forward = LigerSwiGLUMLP.forward
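
This works because LigerSwiGLUMLP.forward (as far as I can tell) only touches self.gate_proj, self.up_proj, and self.down_proj, all of which DeepseekV2MLP also defines, so the original __init__ (including its intermediate_size kwarg) stays intact and only the activation path is replaced with the fused kernel.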

tmm1 (Author) commented Aug 27, 2024

> this is causing loss calculations to be wildly different for some reason

The RoPE method seems to be modified in DeepseekV2?

llama:

    cos = cos.unsqueeze(unsqueeze_dim)
    sin = sin.unsqueeze(unsqueeze_dim)
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed

deepseekv2:

    cos = cos[position_ids].unsqueeze(unsqueeze_dim)
    sin = sin[position_ids].unsqueeze(unsqueeze_dim)

    b, h, s, d = q.shape
    q = q.view(b, h, s, d // 2, 2).transpose(4, 3).reshape(b, h, s, d)

    b, h, s, d = k.shape
    k = k.view(b, h, s, d // 2, 2).transpose(4, 3).reshape(b, h, s, d)

    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed
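
So DeepseekV2 keeps the rotary pairs interleaved along the head dimension and permutes them into the half-split layout before rotating, whereas llama's apply_rotary_pos_emb (and, I assume, liger_rotary_pos_emb, which mirrors it) expects the half-split layout directly. That would explain the loss mismatch. A rough sketch of an adapter (deepseekv2_compat_rotary_pos_emb is hypothetical, and it assumes the Liger kernel accepts per-token gathered cos/sin like llama's implementation does):

    def deepseekv2_compat_rotary_pos_emb(q, k, cos, sin, position_ids, unsqueeze_dim=1):
        # Permute interleaved pairs [x0, y0, x1, y1, ...] into the half-split
        # layout [x0, x1, ..., y0, y1, ...] that rotate_half-style kernels expect.
        b, h, s, d = q.shape
        q = q.view(b, h, s, d // 2, 2).transpose(4, 3).reshape(b, h, s, d)
        b, h, s, d = k.shape
        k = k.view(b, h, s, d // 2, 2).transpose(4, 3).reshape(b, h, s, d)
        # DeepseekV2 caches cos/sin for all positions and gathers by position_ids,
        # so do the gather here before handing off to the fused kernel.
        return liger_rotary_pos_emb(q, k, cos[position_ids], sin[position_ids], position_ids, unsqueeze_dim)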

@ByronHsu added the help wanted, huggingface, and feature labels on Aug 28, 2024
xinyubai1209 commented

DeepSeek V2 uses MLA (Multi-head Latent Attention) to reduce the KV cache.
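
For context, a rough sketch of the MLA caching idea (illustrative names and shapes, not DeepseekV2's actual module): instead of caching full per-head K/V tensors, MLA caches a small low-rank latent per token and up-projects it when attention is computed.

    import torch.nn as nn

    # Illustrative sketch of MLA's KV compression: only the latent c_kv needs
    # to be cached, shrinking the cache from 2 * num_heads * head_dim values
    # per token down to kv_lora_rank values.
    class MLACacheSketch(nn.Module):
        def __init__(self, hidden_size=4096, kv_lora_rank=512, num_heads=32, head_dim=128):
            super().__init__()
            self.kv_down = nn.Linear(hidden_size, kv_lora_rank, bias=False)        # compress
            self.k_up = nn.Linear(kv_lora_rank, num_heads * head_dim, bias=False)  # decompress K
            self.v_up = nn.Linear(kv_lora_rank, num_heads * head_dim, bias=False)  # decompress V

        def forward(self, hidden_states):
            c_kv = self.kv_down(hidden_states)  # this latent is what gets cached
            return self.k_up(c_kv), self.v_up(c_kv)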

qingquansong (Collaborator) commented Aug 28, 2024

Yeah, the DeepseekV2 one is quite interesting, as it uses decoupled RoPE.

As for MLA: it mainly targets inference speedups, by absorbing the low-rank projection matrices into the original linear matrices. Feel free to first implement the layers apart from that, and gradually improve with separate PRs. Thanks for the interesting feature request and the rapid kick-off!
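
To illustrate the absorption point with toy shapes (a sketch, not DeepseekV2's code): because the up-projections are plain linear maps, the key up-projection can be folded into the query path once at load time, so attention scores can be computed directly against the cached latent without ever materializing K.

    import torch

    rank, head_dim, hidden = 8, 4, 16
    W_uq = torch.randn(head_dim, hidden)  # per-head query up-projection (toy)
    W_uk = torch.randn(head_dim, rank)    # per-head key up-projection (toy)

    h = torch.randn(hidden)               # current token's hidden state
    c_kv = torch.randn(10, rank)          # cached latents for 10 past tokens

    # Naive: materialize q and every k, then take dot products.
    q = W_uq @ h
    k = c_kv @ W_uk.T                     # (10, head_dim)
    scores_naive = k @ q                  # (10,)

    # Absorbed: fold W_uk into the query path ahead of time; K is never built.
    W_absorbed = W_uk.T @ W_uq            # (rank, hidden), precomputed once
    scores_absorbed = c_kv @ (W_absorbed @ h)

    assert torch.allclose(scores_naive, scores_absorbed, atol=1e-4)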
