MoE kernel #206
Comments
Will do more research on this; if anyone has any insights into what could/should be implemented, or details on how, cc me.
Maybe a preliminary step would be to support, for example, mixtral/nllb_moe from Hugging Face, so the integration is ready when the layers are done?
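A hedged sketch of what such a preliminary integration could look like: monkey-patch the HF Mixtral MoE block so checkpoints load unchanged while the forward dispatches to a custom implementation once it exists. `fused_moe_forward` is a hypothetical placeholder; only the module/class names come from transformers' Mixtral implementation.

```python
from transformers.models.mixtral import modeling_mixtral


def fused_moe_forward(self, hidden_states):
    # Placeholder: would call the (yet-to-be-written) Triton MoE kernel and return
    # the same outputs as the original MixtralSparseMoeBlock.forward.
    raise NotImplementedError


def apply_moe_patch():
    # Swap the forward in place; model loading and checkpoints are unaffected.
    modeling_mixtral.MixtralSparseMoeBlock.forward = fused_moe_forward
```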
@S1ro1 one straightforward idea is to parallelize the expert forward passes (just like the MegaBlocks implementation does). Right now, in the HF model code, the MoE block is executed sequentially, expert by expert. Not sure whether it's worth implementing the load-balancing loss too; I haven't seen an actual profiling trace of MoE model training.
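A minimal sketch contrasting the two strategies, under simplifying assumptions (top-1 routing without routing weights, plain ReLU MLPs instead of Mixtral's gated SwiGLU experts); all names and shapes are illustrative, not taken from the HF source.

```python
import torch


def sequential_moe(hidden, experts, router_logits):
    # HF-style loop: iterate over experts in Python, each processing only its tokens.
    expert_idx = router_logits.argmax(dim=-1)            # (T,)
    out = torch.zeros_like(hidden)
    for i, expert in enumerate(experts):
        mask = expert_idx == i
        if mask.any():
            out[mask] = expert(hidden[mask])
    return out


def grouped_moe(hidden, w1, w2, router_logits):
    # MegaBlocks-style idea: sort tokens by expert, pad each group to a common size,
    # then replace the Python loop with a single batched GEMM pair over all experts.
    # w1: (E, d_model, d_ff), w2: (E, d_ff, d_model) are the stacked expert weights.
    E, T = w1.shape[0], hidden.shape[0]
    expert_idx = router_logits.argmax(dim=-1)             # (T,)
    counts = torch.bincount(expert_idx, minlength=E)      # tokens routed to each expert
    cap = int(counts.max())                               # pad every group to this size
    order = torch.argsort(expert_idx)                     # tokens grouped by expert
    sorted_idx = expert_idx[order]
    offsets = torch.cumsum(counts, 0) - counts            # start of each expert's group
    within = torch.arange(T, device=hidden.device) - offsets[sorted_idx]
    buf = hidden.new_zeros(E, cap, hidden.shape[-1])
    buf[sorted_idx, within] = hidden[order]               # scatter tokens into groups
    out_buf = torch.relu(buf @ w1) @ w2                   # one batched matmul pair
    out = torch.empty_like(hidden)
    out[order] = out_buf[sorted_idx, within]              # gather back to token order
    return out
```

A real kernel (as in MegaBlocks) would avoid the padding by using grouped or block-sparse GEMMs, but the token-sorting structure is the same.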
@yundai424 Haven't seen one either; I'm going to try patching either Mixtral or NLLB with our kernels and profile it, and will decide what to do after that. Edit: to address your comment, parallelizing the experts is certainly low-hanging fruit.
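A hedged profiling sketch with `torch.profiler`; the checkpoint name is a placeholder (any HF MoE checkpoint that fits on the available GPUs works), and nothing here is specific to this repo's kernels.

```python
import torch
from torch.profiler import ProfilerActivity, profile
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-v0.1"   # placeholder MoE checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer("profiling the MoE block", return_tensors="pt").to(model.device)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             record_shapes=True) as prof:
    out = model(**inputs, labels=inputs["input_ids"])
    out.loss.backward()

# Sort by GPU time to see how much of a step is spent in the expert MLPs vs. routing.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```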
🚀 The feature, motivation and pitch
Currently the most popular library is probably https://github.com/databricks/megablocks. It would be interesting to implement this in Triton and make it HF compatible.
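For reference, a minimal Triton sketch, not the proposed MoE kernel itself: a fused SiLU(gate) * up elementwise kernel of the kind an expert MLP would call between its two GEMMs, assuming contiguous CUDA tensors. It only illustrates the Triton-plus-PyTorch integration style; the MegaBlocks-style piece (grouped/block-sparse GEMMs over tokens sorted by expert) is considerably more involved.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def silu_mul_kernel(gate_ptr, up_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    g = tl.load(gate_ptr + offsets, mask=mask).to(tl.float32)
    u = tl.load(up_ptr + offsets, mask=mask).to(tl.float32)
    silu = g / (1.0 + tl.exp(-g))          # SiLU(g) = g * sigmoid(g)
    tl.store(out_ptr + offsets, silu * u, mask=mask)


def silu_mul(gate: torch.Tensor, up: torch.Tensor) -> torch.Tensor:
    assert gate.is_cuda and gate.is_contiguous() and gate.shape == up.shape
    out = torch.empty_like(gate)
    n = gate.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    silu_mul_kernel[grid](gate, up, out, n, BLOCK_SIZE=1024)
    return out
```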
Alternatives
No response
Additional context
No response