[RFC] Adding support for performant kernels in ET for MCUs #9767
digantdesai started this conversation in Ideas
-
FYI @freddan80, @robell
-
Hi Digant, thanks for the writeup. The optimized Q/DQ ops should be easy to integrate as an op-lib,
Finally, how will external libraries be fetched for the op-libs?
Motivation
Strategically deploying ExecuTorch (ET) on microcontroller units (MCUs) is vital, given the rise of AI on MCUs, specifically those using Cortex-M CPUs. We need to invest in enhancing performance, optimizing binary size, and addressing MCU deployment issues for ET. This RFC discusses potential implementation choices in ET for a new MCU backend.
Context
This document compares two backend implementation choices for ET focused on size-constrained microcontroller environments. The overarching goal is to position ET as a comparable solution to TFLite-micro in the MCU space. Here are the two options:
Option 1: Op-lib-backend
Details
ET Cadence backend implemented as an op-lib with passes and a quantizer: [link](https://github.com/pytorch/executorch/tree/a4925e4ca0c86e69ab560445720d38400fa40090/backends/cadence). Here are some key characteristics:
- Does not use the ET delegate interface
- Heavily repurposes ET infrastructure to implement the backend
- Build time: selective build
- Runtime: interpreter, memory planning
- TFLite-micro style, although ops are not stateful in ET
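To make the op-lib flow concrete, here is a minimal AoT sketch. It is illustrative only: `TinyConv` and `CortexMQuantizer` are hypothetical stand-ins (a real MCU backend would ship its own quantizer), while the export/quantize/lower calls follow the public PT2E and exir APIs that the Cadence backend builds on.

```python
import torch
from torch.export import export
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
from executorch.exir import to_edge

class TinyConv(torch.nn.Module):  # stand-in model, for illustration only
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 8, 3)

    def forward(self, x):
        return torch.relu(self.conv(x))

model = TinyConv().eval()
example_inputs = (torch.randn(1, 3, 32, 32),)

# 1. Quantize with a backend-specific quantizer.
#    CortexMQuantizer is hypothetical -- the MCU op-lib would provide one.
graph = export(model, example_inputs).module()
prepared = prepare_pt2e(graph, CortexMQuantizer())
prepared(*example_inputs)  # calibration run
quantized = convert_pt2e(prepared)

# 2. Lower to edge IR. Ops stay as ops: no partitioner, no delegate blob.
edge = to_edge(export(quantized, example_inputs))

# 3. Backend passes rewrite the graph (e.g., Q/DQ fusion), then the .pte is
#    serialized. The core ET interpreter executes it, and selective build
#    keeps only the kernels this program actually references.
program = edge.to_executorch()
with open("model_oplib.pte", "wb") as f:
    f.write(program.buffer)
```

The key property here is that everything after export stays in the standard ET program format; the "backend" lives entirely in the quantizer, the passes, and the kernel library.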
Option 2: Delegate
Details
ET XNNPACK delegate: [link](https://github.com/pytorch/executorch/tree/879b94f27717b8f264dae55fe89771524316eb6d/backends/xnnpack) (a new delegate for MCUs). Here are some key characteristics:
- Relies on ET delegate infrastructure
- AoT: partitioner
- Build time: vanilla build (no selective build today)
- Runtime: delegate APIs with state, private interpreter, private memory planning (optional)
- Enables features such as runtime graph recompilation and memory planning
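For contrast, the delegate flow below lowers tagged subgraphs into an opaque blob AoT. `MCUPartitioner` is a hypothetical stand-in; the `to_backend` call itself is how existing delegates such as XNNPACK are invoked today.

```python
import torch
from torch.export import export
from executorch.exir import to_edge

model = TinyConv().eval()          # same toy model as in the sketch above
example_inputs = (torch.randn(1, 3, 32, 32),)

edge = to_edge(export(model, example_inputs))

# The partitioner tags supported subgraphs AoT; the backend's preprocess()
# then serializes each tagged subgraph into a delegate blob inside the .pte.
# MCUPartitioner is hypothetical -- XnnpackPartitioner is the existing analog.
lowered = edge.to_backend(MCUPartitioner())

# At runtime the core interpreter only calls the delegate's init()/execute()
# entry points; all partitioning cost was paid AoT, per the comparison below.
program = lowered.to_executorch()
with open("model_delegate.pte", "wb") as f:
    f.write(program.buffer)
```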
Looking at the competition around this aspect of the design: TFLite has delegates that perform partitioning at runtime, but TFLite-micro dropped them to reduce complexity on small devices. In contrast, ET delegates are lightweight at runtime, with most of the complexity, such as partitioning, shifted AoT. Additionally, TFLite-micro operators can have state, like ET delegates but unlike ET operators, which are stateless.
The goal of this review is to get feedback from relevant parties and reach alignment on which option to pursue, before we involve partners and start implementing an MCU backend with performant kernels. Please raise other dimensions particularly relevant to MCU deployment that I failed to consider. Non-goals for this review are how to achieve performance through kernels, and kernel or operator implementation details.
Milestones
Requirements
(These are roughly in prioritized order.)
Comparisons
- Users can manually execute additional optimization passes between the to_edge and to_executorch steps, such as bn_conv_fusion (see the pass sketch after this list).
- It may be challenging to guarantee correctness throughout the lowering process.
- This comes with AoT code complexity and a significant amount of boilerplate code.
+ Partitioner tagging is immutable, and delegate blob semantics are well-defined.
+ There is no need to consider local versus global state for the subgraph; everything is global.
+ Transparent fx graph for debugging and profiling. No need for additional devtools.
+ Overloads ET's serialization, so there is no need for another schema.
+ Easy global optimizations, like memory format handling and input/output tensor validation.
+ Encapsulates all backend passes, hiding them from the user and allowing more aggressive optimizations in isolation from the rest of ET.
+ The intermediate IR can support multiple ISAs. It is legal to do edge_ir++-like things inside the delegate's preprocess.
+ Allows delegate state, which is needed for performance (e.g., with CMSIS).
± Per-subgraph AoT memory planning is needed. Cross-delegate memory-plan sharing at runtime can get tricky (e.g., workspace sharing in XNNPACK).
+ Easy to compose with other delegates.
- Needs more wiring and adds an extra layer of complexity for profiling and debugging.
- Need to create special [de]serialization, though it may not add PTE or code-size bloat at runtime (if using flatbuffers).
- ISA differences and 3p library availability might bubble up all the way to the top as flags/passes for users to deal with.
+ The IR can be ISA-independent (e.g., XNNPACK) or ISA-specific. The serialized artifact can be portable, i.e., the same PTE works on a Cortex-M3 and a Cortex-M55.
+ Hides all the complexity inside the delegate.
+ Friendly to 3p library lowering, especially with persistent state.
± Need to think about composability with other delegates and their delegate blobs; this might not be a real issue.
- Need some effort to implement cross-delegate memory plan combining.
- Need to support totally static memory planning for delegates through ET memory planning
- Edge_ir++ and my_ops.yml should stay aligned after running arbitrary passes, since non-edge-IR ops can't be lowered to portable kernels. This can be somewhat opaque to debug.
+ Easy for a 3p to specialize and polish it to their needs.
+ Easy to add support for new ISA or MCUs with slightly different
+ Quite flexible: we can hack up the build setup to do whatever we want.
- Need to write new infra to leverage this information when building selected_delegate.so.
± Delegate state can help with perf, but delegate state management can also cost some perf.
± Can be done through AoT tensor processing.
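To illustrate the user-run pass point mentioned in the first bullet of this list, here is a toy sketch. `AnnotateConvPass` and the `"mcu_kernel"` tag are hypothetical; `EdgeProgramManager.transform` is the existing exir hook for running passes between to_edge and to_executorch.

```python
# Reuses `model`, `example_inputs`, and `export` from the sketches above.
from torch.export import export
from torch.fx.passes.infra.pass_base import PassBase, PassResult
from executorch.exir import to_edge

class AnnotateConvPass(PassBase):
    """Toy example: tag conv nodes so a later stage could pick an
    MCU-specific kernel. A real bn_conv_fusion pass would rewrite the
    graph rather than just annotate it."""

    def call(self, graph_module):
        modified = False
        for node in graph_module.graph.nodes:
            if node.op == "call_function" and "convolution" in str(node.target):
                node.meta["mcu_kernel"] = "cmsis_conv"  # hypothetical tag
                modified = True
        return PassResult(graph_module, modified)

edge = to_edge(export(model, example_inputs))
edge = edge.transform([AnnotateConvPass()])  # user-controlled pass point
program = edge.to_executorch()
```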
Recommendation
Op-lib-backend is the more suitable choice for implementing the MCU backend in ET, given its efficient use of ExecuTorch's infrastructure. This yields a significantly lighter implementation than the delegate approach, which matters for MCU deployments. In most cases where runtime op state is required, AoT processing can be employed; for truly runtime scenarios, mutable buffers in ExecuTorch can be explored further (sketched below). To improve the user experience, we can elevate the op-lib-backend to first-class backend status, akin to delegates, from both API and UX perspectives. This will require additional design effort, which will be addressed in collaboration with the Jarvis team during implementation.
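As a rough sketch of that mutable-buffer route for op state (a direction to explore, not a committed design): state is declared as a module buffer AoT so the exported program, rather than the kernel, owns it, which keeps ET ops stateless. `RunningSum` is a made-up example.

```python
import torch
from torch.export import export

class RunningSum(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.register_buffer("acc", torch.zeros(1))  # persistent state

    def forward(self, x):
        # In-place buffer update; torch.export records this as a buffer
        # mutation, so the state lives in the program, not in the op.
        self.acc.add_(x.sum())
        return self.acc.clone()

ep = export(RunningSum(), (torch.randn(4),))
# Lowered through to_edge/to_executorch, `acc` can be memory-planned
# statically and updated across invocations by stateless kernels.
```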