[RFC] Adding support for performant kernels in ET for MCUs #9767
digantdesai started this conversation in Ideas
-
FYI @freddan80, @robell
-
Hi Digant, thanks for the writeup. The optimized Q/DQ ops should be easy to integrate as an op-lib,
Finally, how will external libraries be fetched for the op-libs?
Motivation
Strategically deploying ExecuTorch (ET) on microcontroller units (MCUs) is vital, given the rise of AI on MCUs, specifically those using Cortex-M CPUs. We need to invest in enhancing performance, optimizing binary size, and addressing MCU deployment issues for ET. This RFC discusses potential implementation choices in ET for a new MCU backend.
Context
This document compares two backend implementation choices for ET focused on size-constrained microcontroller environments. The overarching goal is to position ET as a comparable solution to TFLite-micro in the MCU space. Here are the two options:
Option 1: Op-lib-backend
Details
ET Cadence backend implemented as an op-lib with passes and a quantizer: [link](https://github.com/pytorch/executorch/tree/a4925e4ca0c86e69ab560445720d38400fa40090/backends/cadence). Here are some key characteristics:
- Does not use the ET delegate interface
- Heavily repurposes ET infrastructure to implement the backend
- Build time: selective build
- Runtime: interpreter, memory planning
- TFLite-micro style, although ops are not stateful in ET
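To make the op-lib flow concrete, here is a minimal AoT sketch. It is illustrative only: `TinyConv` and `CortexMQuantizer` are hypothetical stand-ins (a real MCU backend would ship its own quantizer), while the export/quantize/lower calls follow the public PT2E and exir APIs that the Cadence backend builds on.

```python
import torch
from torch.export import export
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
from executorch.exir import to_edge

class TinyConv(torch.nn.Module):  # stand-in model, for illustration only
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 8, 3)

    def forward(self, x):
        return torch.relu(self.conv(x))

model = TinyConv().eval()
example_inputs = (torch.randn(1, 3, 32, 32),)

# 1. Quantize with a backend-specific quantizer.
#    CortexMQuantizer is hypothetical -- the MCU op-lib would provide one.
graph = export(model, example_inputs).module()
prepared = prepare_pt2e(graph, CortexMQuantizer())
prepared(*example_inputs)  # calibration run
quantized = convert_pt2e(prepared)

# 2. Lower to edge IR. Ops stay as ops: no partitioner, no delegate blob.
edge = to_edge(export(quantized, example_inputs))

# 3. Backend passes rewrite the graph (e.g., Q/DQ fusion), then the .pte is
#    serialized. The core ET interpreter executes it, and selective build
#    keeps only the kernels this program actually references.
program = edge.to_executorch()
with open("model_oplib.pte", "wb") as f:
    f.write(program.buffer)
```

The key property here is that everything after export stays in the standard ET program format; the "backend" lives entirely in the quantizer, the passes, and the kernel library.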
Option 2: Delegate
Details
ET XNNPACK delegate: [link](https://github.com/pytorch/executorch/tree/879b94f27717b8f264dae55fe89771524316eb6d/backends/xnnpack) (a new delegate for MCUs). Here are some key characteristics:
- Relies on ET delegate infrastructure
- AoT: partitioner
- Build time: vanilla build (no selective build today)
- Runtime: delegate APIs with state, private interpreter, private memory planning (optional)
- Enables features such as runtime graph recompilation and memory planning
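For contrast, the delegate flow below lowers tagged subgraphs into an opaque blob AoT. `MCUPartitioner` is a hypothetical stand-in; the `to_backend` call itself is how existing delegates such as XNNPACK are invoked today.

```python
import torch
from torch.export import export
from executorch.exir import to_edge

model = TinyConv().eval()          # same toy model as in the sketch above
example_inputs = (torch.randn(1, 3, 32, 32),)

edge = to_edge(export(model, example_inputs))

# The partitioner tags supported subgraphs AoT; the backend's preprocess()
# then serializes each tagged subgraph into a delegate blob inside the .pte.
# MCUPartitioner is hypothetical -- XnnpackPartitioner is the existing analog.
lowered = edge.to_backend(MCUPartitioner())

# At runtime the core interpreter only calls the delegate's init()/execute()
# entry points; all partitioning cost was paid AoT, per the comparison below.
program = lowered.to_executorch()
with open("model_delegate.pte", "wb") as f:
    f.write(program.buffer)
```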
Looking at the competition around this aspect of the design: TFLite has delegates that perform partitioning at runtime, but TFLite-micro dropped them to reduce complexity on small devices. In contrast, ET delegates are lightweight at runtime, with most of the complexity, such as partitioning, shifted AoT. Additionally, TFLite-micro operators can have state, like ET delegates but unlike ET operators, which are stateless.
The goal of this review is to get feedback from relevant parties and reach alignment on which option to pursue, before we involve partners and start implementing an MCU backend with performant kernels. Please raise other dimensions particularly relevant to MCU deployment that I failed to consider. Non-goals for this review are how to achieve performance through kernels, and kernel or operator implementation details.
Milestones
Requirements
(These are roughly in prioritized order.)
Comparisons
- Users can manually execute additional optimization passes between the to_edge and to_executorch steps, such as bn_conv_fusion (see the pass sketch after this list).
- It may be challenging to guarantee correctness throughout the lowering process.
- This comes with AoT code complexity and a significant amount of boilerplate code.
+ Partitioner tagging is immutable, and delegate blob semantics are well-defined.
+ There is no need to consider local versus global state for the subgraph; everything is global.
+ Transparent fx graph for debugging and profiling. No need for additional devtools.
+ Overloads ET's serialization, so there is no need for another schema.
+ Easy global optimizations, like memory format handling and input/output tensor validation.
+ Encapsulates all backend passes, hiding them from the user and allowing more aggressive optimizations in isolation from the rest of ET.
+ The intermediate IR can support multiple ISAs. It is legal to do edge_ir++-like things inside the delegate's preprocess.
+ Allows delegate state, which is needed for performance (e.g., with CMSIS).
± Per-subgraph AoT memory planning is needed. Cross-delegate memory-plan sharing at runtime can get tricky (e.g., workspace sharing in XNNPACK).
+ Easy to compose with other delegates.
- Needs more wiring and adds an extra layer of complexity for profiling and debugging.
- Need to create special [de]serialization, though it may not add PTE or code-size bloat at runtime (if using flatbuffers).
- ISA differences and 3p library availability might bubble up all the way to the top as flags/passes for users to deal with.
+ The IR can be ISA-independent (e.g., XNNPACK) or ISA-specific. The serialized artifact can be portable, i.e., the same PTE works on a Cortex-M3 and a Cortex-M55.
+ Hides all the complexity inside the delegate.
+ Friendly to 3p library lowering, especially with persistent state.
± Need to think about composability with other delegates and their delegate blobs; this might not be a real issue.
- Need some effort to implement cross-delegate memory plan combining.
- Need to support totally static memory planning for delegates through ET memory planning
- Edge_ir++ and my_ops.yml should stay aligned after running arbitrary passes, since non-edge-IR ops can't be lowered to portable kernels. This can be somewhat opaque to debug.
+ Easy for a 3p to specialize and polish it to their needs.
+ Easy to add support for new ISA or MCUs with slightly different
+ Quite flexible: we can hack up the build setup to do whatever we want.
- Need to write new infra to leverage this information when building selected_delegate.so.
± Delegate state can help with perf, but delegate state management can also cost some perf.
± Can be done through AoT tensor processing.
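To illustrate the user-run pass point mentioned in the first bullet of this list, here is a toy sketch. `AnnotateConvPass` and the `"mcu_kernel"` tag are hypothetical; `EdgeProgramManager.transform` is the existing exir hook for running passes between to_edge and to_executorch.

```python
# Reuses `model`, `example_inputs`, and `export` from the sketches above.
from torch.export import export
from torch.fx.passes.infra.pass_base import PassBase, PassResult
from executorch.exir import to_edge

class AnnotateConvPass(PassBase):
    """Toy example: tag conv nodes so a later stage could pick an
    MCU-specific kernel. A real bn_conv_fusion pass would rewrite the
    graph rather than just annotate it."""

    def call(self, graph_module):
        modified = False
        for node in graph_module.graph.nodes:
            if node.op == "call_function" and "convolution" in str(node.target):
                node.meta["mcu_kernel"] = "cmsis_conv"  # hypothetical tag
                modified = True
        return PassResult(graph_module, modified)

edge = to_edge(export(model, example_inputs))
edge = edge.transform([AnnotateConvPass()])  # user-controlled pass point
program = edge.to_executorch()
```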
Recommendation
Op-lib-backend is the more suitable choice for implementing the MCU backend in ET, given its efficient use of ExecuTorch's infrastructure. This yields a significantly lighter implementation than the delegate approach, which matters for MCU deployments. In most cases where runtime op state is required, AoT processing can be employed; for truly runtime scenarios, mutable buffers in ExecuTorch can be explored further (sketched below). To improve the user experience, we can elevate the op-lib-backend to first-class backend status, akin to delegates, from both API and UX perspectives. This will require additional design effort, which will be addressed in collaboration with the Jarvis team during implementation.
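As a rough sketch of that mutable-buffer route for op state (a direction to explore, not a committed design): state is declared as a module buffer AoT so the exported program, rather than the kernel, owns it, which keeps ET ops stateless. `RunningSum` is a made-up example.

```python
import torch
from torch.export import export

class RunningSum(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.register_buffer("acc", torch.zeros(1))  # persistent state

    def forward(self, x):
        # In-place buffer update; torch.export records this as a buffer
        # mutation, so the state lives in the program, not in the op.
        self.acc.add_(x.sum())
        return self.acc.clone()

ep = export(RunningSum(), (torch.randn(4),))
# Lowered through to_edge/to_executorch, `acc` can be memory-planned
# statically and updated across invocations by stateless kernels.
```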