-
@iseeyuan Do you think it would be possible to get to a point where we can meet model definitions where they are - even if we're maybe getting 50%-80% of theoretical peak performance? The reason I ask is that there is a significant burden involved in writing the model the way that ET expects. If we could provide a "peak performance path" and an "out-of-box / just works" path, that would be very nice. My experience with some internal teams is that they try to run existing language models on ET and drop it when the out-of-box performance is far behind what they expect. Aside from that, I think standardizing the language model APIs will be a big win for usability. Thanks for putting this together.
-
meta comment: you need to format the RFC a bit. The indentation and spacing are off enough that it makes for a hard read.
-
@iseeyuan I feel that you should split this RFC into two: one for the model architecture definition and one for the export_llama_lib refactoring.
-
To what extent is this true? If someone doesn't have permission to rewrite the modeling code, does that mean the model won't work for that backend at all? Or will it still work but just not achieve the best performance? Maybe @kimishpatel @digantdesai @cccclai can comment about it?
I agree that we should provide tools to make source code rewriting easier. However, I don’t think rewriting should always be done in the original modeling code, as this could impact dozens of models and make OSS contributions increasingly difficult as more models are covered. (This is how I interpret the proposal, given its goal of unifying code and reducing boilerplate.) For example, as a user, if the code I modify could affect the performance of numerous models and use cases, I’d be hesitant to make changes and would likely defer to ET developers instead. This not only places us at the center of enablement and improvement but also increases the risk of making contributions more intimidating.
Many of the proposed ideas already exist in HF. For example, the ability to add and register different attention implementations is already supported (pointer). Additionally, the lifted cache is already exported as IO in Exported IR (example). My impression is that this proposal is leaning toward consolidating HF Transformers' definitions and owning them in our repo, aiming to support as many transformer models as possible—including text, audio, and vision transformers. Can this approach scale effectively? One of the core principles HF Transformers upholds is "single model, single file" (as mentioned at PTC 2024). I believe they are fully aware of the downside of this approach—namely, redundant code—but it provides significant flexibility in isolating performance impacts across models and reduces the complexity of performance testing. So far, this strategy has proven highly successful.
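For concreteness, a minimal sketch of the HF pattern being referenced, where the attention implementation is selected via config rather than by editing the modeling code. The model id is only an example, and the exact API for registering a custom attention function differs across transformers releases:

```python
# Illustrative only: selecting an attention implementation in HF Transformers
# without touching the modeling code. The model id is just an example; custom
# attention functions can also be registered, though that registration API
# varies by transformers version.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",       # example model id
    attn_implementation="sdpa",      # or "eager", "flash_attention_2", ...
)
```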
I want to second this. Some ML engineers who just want to prototype quickly in Python shouldn’t need to be aware of the runtime code (C++). Take HF workflow as an example—good UX means an ML engineer should be able to experiment with different models and recipes, validating end-to-end in Python without needing any knowledge of the underlying runtime. This requires the interface to runtime(s) to be not only backend-agnostic but also model-agnostic.
Back to the key problem highlighted in this proposal, having multiple modeling sources in our repo is indeed a challenge, but is having multiple modeling sources itself a problem? I see these as two distinct issues, and the latter doesn’t seem avoidable—it will happen somewhere regardless.
You mean decoder-only transformers, right? What about encoder-only transformers (like BERT) and encoder-decoder transformers (like T5)? What's the plan for non-transformer models, such as diffusion models or Timm models? If we're heading down this path, I think we need to consider the full picture.

Q: Should the ExecuTorch repo serve as a recipe repository? If so, how many recipes do you expect to host in the ExecuTorch repo?

This proposal seems to imply that the ExecuTorch repo will also function as a recipe repository. I agree that providing a default recipe for each backend makes sense. However, that alone doesn't justify the need to host these recipes within ExecuTorch. Some of the proposed ideas, such as controlling recipes via a configuration file, are already well-supported by Hugging Face, not just for eager but also for ONNX and TFLite. Why is it necessary to rebuild a similar mechanism and maintain it in our repo?

From the perspective of building a vibrant community, I think it is key that recipes are separated from the core. While we can offer a default recipe for each backend as an option, we shouldn't restrict users to copying and customizing them for their own needs. To encourage organic community growth, users should be able to create as many recipes as they want and make recipes shareable so that other OSS users can benefit. This level of openness wouldn't be possible if recipes were tightly coupled in our repo.
-
Thanks for the feedback! I'm planning to revamp the RFC to highlight:
-
Okay, updated the discussion to V2. Thanks again for your comments. Please take a look and let me know if you have further comments! cc @GregoryComer @kimishpatel @cccclai @digantdesai @guangy10 @larryliu0820 @jackzhxng
-
Updated the discussion to V3 with further clarifications.
-
tl;dr
Users run into several issues when deploying LLMs on device with the existing ExecuTorch flow. The goal of this RFC is to streamline the end-to-end on-device LLM deployment flow via ExecuTorch. Potential use cases:
Some success metrics for this project:
Non-goals
Context
Popular LLMs share similar transformer-based architectures. This fixed architecture brings some convenience to deployment; llama.cpp is one example. However, when deploying to a variety of backends, the flows can differ due to the backends' different limitations. Those limitations (from a software point of view, though they may be rooted in HW constraints such as available memory and NPU limitations) include:
Note that these limitations may apply to other model architectures as well, for example diffusion models.
The existing ExecuTorch flow is explained in this readme. There are a number of pain points with the existing flow.
Too many options and error prone
There's a long CLI with a number of args. Users may find it difficult to understand the details of the args in each step: export, dynamic shapes, quantization schemes, etc. For a given backend, only one combination of those args works, which is error prone (an example is different quantization primitives affecting the accuracy when exporting the new Llama model).
Complicated build command
There are two build stages for a runtime, each with a long CMake config/build CLI.
Missing features that other frameworks have
An example is LoRA, which is supported by llama.cpp, ONNX GenAI, etc.
Scattered code
Sometimes, updating the export recipe is not sufficient or efficient. Supporting a specific backend may involve a separate copy of the model definition (and possibly QAT) and a different version of the runtime code. At the same time, we can see a trend toward scale, with more use cases and models to be supported. The appendix has a table summarizing the existing code versions, their unique properties, and their use cases.
Works only for models with the Llama architecture
It works for Llama models (and models with exactly the same architecture). Supporting a new LLM model requires significant work; the types of new models are categorized below.
Types of new LLM models
There are three types of “new” LLM models from the inference point of view:
RFC
We are exploring efficient solutions for those types. The high-level design thoughts are:
Note: the diagram is meant to give a big picture of what the end-to-end flow looks like. We'll focus on the entry points and related parts, and mention other components for reference.
Entry points
export_llm
The interface can be as simple as:
The first two APIs can be used for LLM models of type 1 above, and the third API can be applied to all three types.
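Purely for illustration, one possible shape of such an interface is sketched below; the function names, signatures, and recipe fields are hypothetical placeholders and do not reproduce the RFC's actual API (only check_point and model_config are taken from the text above).

```python
# Hypothetical sketch only: names, signatures, and fields are placeholders,
# not the RFC's actual interface.
from dataclasses import dataclass
from typing import Optional

import torch.nn as nn


@dataclass
class ExportRecipe:
    backend: str = "xnnpack"            # e.g. "xnnpack", "coreml", "qnn" (illustrative)
    quantization: Optional[str] = None  # e.g. "8da4w" (illustrative)


def export_llm(model_name: str, check_point: str, model_config: str,
               recipe: Optional[ExportRecipe] = None) -> None:
    """Hypothetical API 1: export a known (type-1) architecture registered by name."""
    raise NotImplementedError


def export_llm_from_hf(hf_model_id: str,
                       recipe: Optional[ExportRecipe] = None) -> None:
    """Hypothetical API 2: export a type-1 model pulled from Hugging Face."""
    raise NotImplementedError


def export_llm_from_module(model: nn.Module, example_inputs: tuple,
                           recipe: Optional[ExportRecipe] = None) -> None:
    """Hypothetical API 3: export a user-supplied nn.Module (covers all three types)."""
    raise NotImplementedError
```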
The check_point and model_config arguments are the same as in the existing export_llama. Compared to the existing flow, the differences are:
Usage
Runtime
Runtime is another user-facing entry point. It's deployed to PC, Android, or iOS, with the capability to load and run LLM models. Currently, there are APIs in
The LLM model artifacts (.pte files) can be from
Ref: Mengwei’s RFC: LLM Runner APIs
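As a reference point for the PC path, an exported .pte artifact can already be smoke-tested from Python via the ExecuTorch pybindings before any C++ runner is involved. A minimal sketch follows; the artifact name and the (tokens, input_pos) input signature are assumptions that depend on how the model was exported.

```python
# Minimal sketch: load an exported .pte and run a single forward pass from
# Python. The file name and the (tokens, input_pos) input signature are
# assumptions; they depend on the export configuration.
import torch
from executorch.extension.pybindings.portable_lib import _load_for_executorch

module = _load_for_executorch("llama_model.pte")  # hypothetical artifact name
tokens = torch.tensor([[1]], dtype=torch.long)    # assumed prompt token input
input_pos = torch.tensor([0], dtype=torch.long)   # assumed KV-cache position input
logits = module.forward([tokens, input_pos])[0]
print(logits.shape)
```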
Relationship between etLLM and other libraries like HF optimum-executorch
Specifically, we provide ET-friendly and performant components so that users can swap the components in an existing model with ours. An example is SDPA (context: SDPA Interfaces). There are three places where we can do such transforms:
The ET-friendly and performant artifacts (component definitions, source-transform passes, graph-transform passes, etc.) are provided under the etLLM umbrella and can be used in existing model definitions, including HF model definitions or users' native LLM model definitions.
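A minimal sketch of what a source-transform pass in this style can look like, written as a module swap; the ETFriendlySDPA class and the name-matching rule are hypothetical stand-ins, not the actual etLLM components.

```python
# Illustrative module-swap source transform. ETFriendlySDPA and the "sdpa"
# name-matching rule are hypothetical stand-ins for the real ET-provided
# components and matching logic.
import torch
import torch.nn as nn


class ETFriendlySDPA(nn.Module):
    """Placeholder for an ET-friendly SDPA (e.g. one lowered to a custom op)."""

    def forward(self, q, k, v, mask=None):
        # In this sketch, simply fall back to the stock PyTorch implementation.
        return torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=mask)


def replace_sdpa(module: nn.Module) -> nn.Module:
    """Recursively swap submodules whose name suggests an SDPA block."""
    for name, child in module.named_children():
        if "sdpa" in name.lower():
            setattr(module, name, ETFriendlySDPA())
        else:
            replace_sdpa(child)
    return module
```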