
Enabling L2+ Optimizations for EPs #23517

Draft: wants to merge 21 commits into base: main
Conversation


@chilo-ms chilo-ms commented Jan 28, 2025

There are some requirements to modify the graph that are specific to the hardware.
ORT has a hardcoded EP list for optimizations, but that can't scale and is hard to extend to support EP custom optimizations.

Here is the prototype to enable L2+ optimizations for EPs (the original overview was provided by @skottmckay), as well as the TRT EP implementation of the ConstantFoldingDQ optimization.

Signatures for selection/optimization functions:

  - Selection: std::function<std::vector<std::unique_ptr<ComputeCapability>>(const GraphViewer&)>
  - Optimization: std::function<Status(const Graph&, const ComputeCapability& this_optimization, ComputeCapability& cc_to_update)>
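The two signatures can be sketched with simplified stand-in types (the real `GraphViewer`/`Graph`/`Status`/`ComputeCapability` live in the ORT codebase; this is illustrative only):

```cpp
#include <functional>
#include <memory>
#include <vector>

// Simplified stand-ins for the real ORT types.
struct GraphViewer {};
struct Graph {};
struct Status {
  bool ok = true;
};
struct ComputeCapability;

// Selection: given a read-only view of the graph, return the capabilities
// (node groups) a pre-defined optimizer could handle.
using SelectionFunc =
    std::function<std::vector<std::unique_ptr<ComputeCapability>>(const GraphViewer&)>;

// Optimization: given the graph, the capability describing this optimization,
// and the overall capability to update, apply the transformation.
using OptimizationFunc = std::function<Status(
    const Graph&, const ComputeCapability& this_optimization, ComputeCapability& cc_to_update)>;

struct ComputeCapability {
  OptimizationFunc optimize_func;  // set by the optimizer's selection step
  std::vector<std::unique_ptr<ComputeCapability>> nodes_to_optimize;
};
```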

GetCapability

  • call the (new) provider bridge API to look up a pre-defined optimizer by name and get its selection function

    • ComputeCapability.optimize_func would be set by the optimizer to the function that performs the optimization
  • the EP has to update the ComputeCapability it returns to include the optimization ComputeCapability in nodes_to_optimize, so that ORT can later perform the optimization/transformation accordingly
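The GetCapability flow above can be sketched as follows. The types and the `LookupSelectionFunc` helper are simplified stand-ins, not the real provider bridge API:

```cpp
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct GraphViewer {};
struct ComputeCapability {
  std::vector<std::unique_ptr<ComputeCapability>> nodes_to_optimize;
};
using SelectionFunc =
    std::function<std::vector<std::unique_ptr<ComputeCapability>>(const GraphViewer&)>;

// Stand-in for the (new) provider bridge lookup-by-name API.
SelectionFunc LookupSelectionFunc(const std::string& name) {
  if (name != "ConstantFoldingDQ") return nullptr;
  return [](const GraphViewer&) {
    std::vector<std::unique_ptr<ComputeCapability>> v;
    v.push_back(std::make_unique<ComputeCapability>());  // one DQ group selected
    return v;
  };
}

// The EP records the selected per-optimization capabilities in
// nodes_to_optimize on the capability it returns.
std::unique_ptr<ComputeCapability> GetCapability(const GraphViewer& graph) {
  auto capability = std::make_unique<ComputeCapability>();
  if (SelectionFunc select = LookupSelectionFunc("ConstantFoldingDQ")) {
    for (auto& opt : select(graph)) {
      capability->nodes_to_optimize.push_back(std::move(opt));
    }
  }
  return capability;
}
```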

GraphPartitioner

  • After assigning the ComputeCapability to the EP, and prior to Compile, if the ComputeCapability has nodes_to_optimize, iterate over that list
    • the optimization function needs to be called with
      • a mutable Graph instance
      • the ComputeCapability for the individual optimization
      • the overall ComputeCapability so it can be updated
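The partitioner step above can be sketched like this, again with simplified stand-in types; the real step would run against ORT's mutable Graph:

```cpp
#include <functional>
#include <memory>
#include <vector>

struct Graph {
  int optimized_groups = 0;  // stand-in for real graph mutation
};
struct Status {
  bool ok = true;
};
struct ComputeCapability {
  std::function<Status(Graph&, const ComputeCapability&, ComputeCapability&)> optimize_func;
  std::vector<std::unique_ptr<ComputeCapability>> nodes_to_optimize;
};

// After the capability is assigned and before Compile: walk nodes_to_optimize
// and invoke each entry's optimization function with the mutable graph, the
// individual optimization's capability, and the overall capability.
Status RunCapabilityOptimizations(Graph& graph, ComputeCapability& overall) {
  for (auto& opt : overall.nodes_to_optimize) {
    if (!opt->optimize_func) continue;
    Status s = opt->optimize_func(graph, *opt, overall);
    if (!s.ok) return s;
  }
  return Status{};
}
```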

dq_node_index_set.insert(index);
}

ConstantFoldingDQ* transformer = static_cast<ConstantFoldingDQ*>(graph_transformer_mgr.GetTransformerByName(optimizer_name));
Contributor

Just a thought to potentially get rid of GraphTransformerManager:

Would it make sense to instantiate the ConstantFoldingDQ transformer directly instead of querying the graph_transformer_manager? For example, we have code in InferenceSession::TransformGraph that directly instantiates an instance of EnsureUniqueDQForNodeUnit and just calls .Apply() on it:

EnsureUniqueDQForNodeUnit ensure_unique_dq_for_node_unit{};
ORT_RETURN_IF_ERROR_SESSIONID_(apply_transformer_once(ensure_unique_dq_for_node_unit, *session_logger_, graph));

So in this case:

// Would need to pass the EP name to GetEPOptimizerByName()?
std::unique_ptr<ConstantFoldingDQ> const_fold_dq_transformer = std::make_unique<ConstantFoldingDQ>(/*...*/);

const_fold_dq_transformer->Apply(/*...*/);

Contributor

This may also allow the EP to provide key/value strings to configure an optimizer. This way we can avoid having to create a completely new derived class every time an EP needs a slightly different variation of an optimizer.

Contributor

Ah, this wouldn't work because we need to pass in the CPU EP.

Contributor

I think it makes sense to instantiate/run directly. May need some structure to facilitate that. e.g. if we need things like the CPU allocator to be available to create an optimizer.

In this case I think we need the derived class as we're augmenting the behavior of the constant folding optimizer, but I expect this to be an exception rather than typical.

For other optimizers I expect we'll want something like key/value strings for configuration instead of creating derived classes. We need to figure out what sort of configuration is required. Some existing examples: an EP supports a subset of data types for a fusion, or a subset of operators for a fusion (e.g. a subset of activation ops can be fused into Conv to create a FusedConv node).

Contributor Author
@chilo-ms chilo-ms Feb 5, 2025

Want to continue the discussion about key/value strings to configure an optimizer.

I added a new standalone and singleton registration/lookup class. See the comment here.

The question I have is when it is better to instantiate the optimizer(s).

One option is to instantiate the optimizer(s) upfront during InferenceSession initialization.
However, by doing so, we might need another way to update/reconfigure an optimizer instance with key/value strings (e.g. transformer.update(key_value_strings)) if the EP needs the configurations.

Another option is to wait until the EP requests the optimizer, and instantiate it (consuming the key/value strings) while ORT runs the optimization function, after GetCapability but before Compile. We need to make sure the registration/lookup class saves the CPU allocator, CPU EP, etc. from InferenceSession so that it can successfully instantiate the optimizer.

Update: This PR provides these two options and chooses the second option.
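The second option can be sketched as deferring optimizer construction until the optimization function actually runs, so the EP's key/value strings (and resources saved from InferenceSession, like the CPU allocator) are consumed by the constructor. All names below are illustrative stand-ins:

```cpp
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

struct CpuResources {};  // stand-in for CPU allocator / CPU EP saved at session init

class ConstantFoldingDQSketch {
 public:
  ConstantFoldingDQSketch(const CpuResources&, std::unordered_map<std::string, std::string> cfg)
      : cfg_(std::move(cfg)) {}
  // Pretend the optimization succeeds only when a node_index_set was supplied.
  bool Apply() const { return cfg_.count("node_index_set") > 0; }

 private:
  std::unordered_map<std::string, std::string> cfg_;
};

// The lookup class hands back a function; the optimizer itself is only built
// when the partitioner invokes it, after GetCapability and before Compile.
std::function<bool()> MakeOptimizationFunc(const CpuResources& res,
                                           std::unordered_map<std::string, std::string> cfg) {
  return [&res, cfg = std::move(cfg)]() {
    ConstantFoldingDQSketch optimizer(res, cfg);  // constructed lazily here
    return optimizer.Apply();
  };
}
```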

Comment on lines 28 to 30
GraphPartitioner(KernelRegistryManager& kernel_registry_mgr, const GraphTransformerManager& graph_transformer_mgr, const ExecutionProviders& providers)
: kernel_registry_mgr_(kernel_registry_mgr),
graph_transformer_mgr_(graph_transformer_mgr),
Contributor

Instead of wiring in the GraphTransformerManager would it be better to add a new standalone class that provides lookup-based access to a set of L2 optimizers that are directly usable by an EP? That would decouple the GraphPartitioner from the optimizer registration/lookup a bit more. I don't think GraphTransformerManager is providing value in this case as I don't think we need to loop.

The registration/lookup class for re-usable optimizers could be a singleton with a static 'create' method that calls GenerateTransformersForEP and saves the returned list. We could have InferenceSession call the 'create' method to provide any other required things like the CPU allocator, and potentially apply the optimization level when doing so if we want to do that on the ORT side. GraphPartitioner could call a static 'get' method to get the instance so we don't need to wire it through from the inference session.

Contributor Author
@chilo-ms chilo-ms Feb 4, 2025

I added a new standalone class called GraphOptimizerRegistry to provide registration and lookup access.
This lookup class is a singleton; its 'create' method is called during InferenceSession initialization, and it calls GenerateTransformersForEP and saves the returned list. BTW, I'm not sure this 'create' function needs to be static?

Given that this lookup instance is a singleton using a static variable, the provider bridge can now access the instance directly, which means we don't need to get the instance in the graph partitioner and feed it to GetCapability. Also, we don't need to change GetCapability's signature to add a new parameter (i.e. GraphOptimizerRegistry&).
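The singleton registration/lookup pattern being discussed can be sketched like this. The real GraphOptimizerRegistry is the one added by this PR; the members below (Get/Register/Lookup) and the simplified selection signature are illustrative stand-ins:

```cpp
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

class GraphOptimizerRegistrySketch {
 public:
  using SelectionFunc = std::function<int()>;  // simplified stand-in signature

  // Meyers singleton: the static local makes Get() callable from the provider
  // bridge without wiring the instance through GraphPartitioner.
  static GraphOptimizerRegistrySketch& Get() {
    static GraphOptimizerRegistrySketch instance;
    return instance;
  }

  // InferenceSession initialization would call something like this with the
  // list produced by GenerateTransformersForEP.
  void Register(const std::string& name, SelectionFunc fn) {
    optimizers_[name] = std::move(fn);
  }

  const SelectionFunc* Lookup(const std::string& name) const {
    auto it = optimizers_.find(name);
    return it == optimizers_.end() ? nullptr : &it->second;
  }

 private:
  GraphOptimizerRegistrySketch() = default;
  std::unordered_map<std::string, SelectionFunc> optimizers_;
};
```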

onnxruntime/core/optimizer/constant_folding.h (outdated review threads, resolved)
const InlinedHashSet<std::string>& excluded_initializers = {}) noexcept;

bool AllowConstantFolding(const Node& node) const;
Status UpdateNodeIndexSet(InlinedHashSet<onnxruntime::NodeIndex>& node_index_set);
Contributor

Can we create a new instance of the optimizer for each node_index_set instead of having an update method in order to keep it a little cleaner?

Contributor Author

The optimizer can only be instantiated during InferenceSession initialization because it needs things like the CPU allocator or the CPU EP. But we don't know what the node_index_set will be until the graph partitioner calls the ConstantFoldingDQ optimization function that takes in the graph instance.

I wonder how we can do this without adding the UpdateNodeIndexSet member function.

Contributor Author

Update:
I modified the code to instantiate the optimizer inside the optimization function.
Now the node_index_set can be consumed by the constructor, and UpdateNodeIndexSet can be removed.

onnxruntime/core/session/provider_bridge_ort.cc (outdated review thread, resolved)
}
cc_to_update.sub_graph->nodes = updated_nodes;

auto original_meta_def = cc_to_update.sub_graph->GetMetaDef();
Contributor

Instead of creating a new MetaDef, now that we have a use case where it makes sense to update it, could we instead add a method to get a mutable MetaDef from sub_graph?

*/

std::function<std::vector<std::unique_ptr<ComputeCapability>>(const GraphViewer&)> selection_func;
auto status = g_host->GetEPOptimizerByName("ConstantFoldingDQ", graph_transformer_mgr, selection_func);
Contributor

Should this check status?

Contributor Author

Added status checking.

@github-actions github-actions bot left a comment

You can commit the suggested changes from lintrunner.

Outdated review threads, resolved:
onnxruntime/core/framework/compute_capability.h
onnxruntime/core/optimizer/constant_folding.h
onnxruntime/core/optimizer/graph_optimizer_registry.cc
@github-actions github-actions bot left a comment

You can commit the suggested changes from lintrunner.
