
Enabling L2+ Optimizations for EPs #23517

Draft: wants to merge 21 commits into base: main
Conversation


@chilo-ms chilo-ms commented Jan 28, 2025

There are some requirements to modify the graph that are specific to the hardware.
ORT has a hardcoded EP list for optimizations, but that can't scale and is hard to extend to support EP custom optimizations.

Here is the prototype to enable L2+ optimizations for EPs (the original overview was provided by @skottmckay), as well as the TRT EP implementation of the ConstantFoldingDQ optimization.

Signatures for selection/optimization functions:

  - Selection: std::function<std::vector<std::unique_ptr<ComputeCapability>>(const GraphViewer&)>
  - Optimization: std::function<Status(const Graph&, const ComputeCapability& this_optimization, ComputeCapability& cc_to_update)>
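The two signatures can be sketched with simplified stand-in types (the real `GraphViewer`/`Graph`/`Status`/`ComputeCapability` live in the ORT codebase; this is illustrative only):

```cpp
#include <functional>
#include <memory>
#include <vector>

// Simplified stand-ins for the real ORT types.
struct GraphViewer {};
struct Graph {};
struct Status {
  bool ok = true;
};
struct ComputeCapability;

// Selection: given a read-only view of the graph, return the capabilities
// (node groups) a pre-defined optimizer could handle.
using SelectionFunc =
    std::function<std::vector<std::unique_ptr<ComputeCapability>>(const GraphViewer&)>;

// Optimization: given the graph, the capability describing this optimization,
// and the overall capability to update, apply the transformation.
using OptimizationFunc = std::function<Status(
    const Graph&, const ComputeCapability& this_optimization, ComputeCapability& cc_to_update)>;

struct ComputeCapability {
  OptimizationFunc optimize_func;  // set by the optimizer's selection step
  std::vector<std::unique_ptr<ComputeCapability>> nodes_to_optimize;
};
```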

GetCapability

  • call the (new) provider bridge API to look up a pre-defined optimizer by name and get its selection function

    • ComputeCapability.optimize_func would be set by the optimizer to the function that performs the optimization
  • the EP has to update the ComputeCapability it returns to include the optimization ComputeCapability in nodes_to_optimize, so that ORT can later perform the optimization/transformation accordingly
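The GetCapability flow above can be sketched as follows. The types and the `LookupSelectionFunc` helper are simplified stand-ins, not the real provider bridge API:

```cpp
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct GraphViewer {};
struct ComputeCapability {
  std::vector<std::unique_ptr<ComputeCapability>> nodes_to_optimize;
};
using SelectionFunc =
    std::function<std::vector<std::unique_ptr<ComputeCapability>>(const GraphViewer&)>;

// Stand-in for the (new) provider bridge lookup-by-name API.
SelectionFunc LookupSelectionFunc(const std::string& name) {
  if (name != "ConstantFoldingDQ") return nullptr;
  return [](const GraphViewer&) {
    std::vector<std::unique_ptr<ComputeCapability>> v;
    v.push_back(std::make_unique<ComputeCapability>());  // one DQ group selected
    return v;
  };
}

// The EP records the selected per-optimization capabilities in
// nodes_to_optimize on the capability it returns.
std::unique_ptr<ComputeCapability> GetCapability(const GraphViewer& graph) {
  auto capability = std::make_unique<ComputeCapability>();
  if (SelectionFunc select = LookupSelectionFunc("ConstantFoldingDQ")) {
    for (auto& opt : select(graph)) {
      capability->nodes_to_optimize.push_back(std::move(opt));
    }
  }
  return capability;
}
```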

GraphPartitioner

  • After assigning the ComputeCapability to the EP, and prior to Compile, if the ComputeCapability has nodes_to_optimize, iterate over that list
    • the optimization function needs to be called with
      • a mutable Graph instance
      • the ComputeCapability for the individual optimization
      • the overall ComputeCapability so it can be updated
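The partitioner step above can be sketched like this, again with simplified stand-in types; the real step would run against ORT's mutable Graph:

```cpp
#include <functional>
#include <memory>
#include <vector>

struct Graph {
  int optimized_groups = 0;  // stand-in for real graph mutation
};
struct Status {
  bool ok = true;
};
struct ComputeCapability {
  std::function<Status(Graph&, const ComputeCapability&, ComputeCapability&)> optimize_func;
  std::vector<std::unique_ptr<ComputeCapability>> nodes_to_optimize;
};

// After the capability is assigned and before Compile: walk nodes_to_optimize
// and invoke each entry's optimization function with the mutable graph, the
// individual optimization's capability, and the overall capability.
Status RunCapabilityOptimizations(Graph& graph, ComputeCapability& overall) {
  for (auto& opt : overall.nodes_to_optimize) {
    if (!opt->optimize_func) continue;
    Status s = opt->optimize_func(graph, *opt, overall);
    if (!s.ok) return s;
  }
  return Status{};
}
```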

dq_node_index_set.insert(index);
}

ConstantFoldingDQ* transformer = static_cast<ConstantFoldingDQ*>(graph_transformer_mgr.GetTransformerByName(optimizer_name));
Contributor

Just a thought to potentially get rid of GraphTransformerManager:

Would it make sense to instantiate the ConstantFoldingDQ transformer directly instead of querying the graph_transformer_manager? For example, we have code in InferenceSession::TransformGraph that directly instantiates an instance of EnsureUniqueDQForNodeUnit and just calls .Apply() on it:

EnsureUniqueDQForNodeUnit ensure_unique_dq_for_node_unit{};
ORT_RETURN_IF_ERROR_SESSIONID_(apply_transformer_once(ensure_unique_dq_for_node_unit, *session_logger_, graph));

So in this case:

// Would need to pass the EP name to GetEPOptimizerByName()?
std::unique_ptr<ConstantFoldingDQ> const_fold_dq_transformer = std::make_unique<ConstantFoldingDQ>(/*...*/);

const_fold_dq_transformer->Apply(/*...*/);

Contributor

This may also allow the EP to provide key/value strings to configure an optimizer. This way we can avoid having to create a completely new derived class every time an EP needs a slightly different variation of an optimizer.

Contributor

Ah, this wouldn't work because we need to pass in the CPU EP.

Contributor

I think it makes sense to instantiate/run directly. May need some structure to facilitate that. e.g. if we need things like the CPU allocator to be available to create an optimizer.

In this case I think we need the derived class as we're augmenting the behavior of the constant folding optimizer, but I expect this to be an exception rather than typical.

For other optimizers I expect we'll want something like key/value strings for configuration instead of creating derived classes. We need to figure out what sort of configuration is required. Some existing examples: an EP supports a subset of data types for a fusion, or a subset of operators for a fusion (e.g. a subset of activation ops can be fused into Conv to create a FusedConv node).

Contributor Author
@chilo-ms chilo-ms Feb 5, 2025

Want to continue the discussion about key/value strings to configure an optimizer.

I added a new standalone and singleton registration/lookup class. See the comment here.

The question I have is when it is better to instantiate the optimizer(s).

One option is to instantiate the optimizer(s) upfront during InferenceSession initialization.
However, by doing so, we might need another way to update/reconfigure an optimizer instance with key/value strings (e.g. transformer.update(key_value_strings)) if the EP needs the configurations.

Another option is to wait until the EP requests the optimizer, and instantiate it (consuming the key/value strings) while ORT runs the optimization function, after GetCapability but before Compile. We need to make sure the registration/lookup class saves the CPU allocator, CPU EP, etc. from InferenceSession so that it can successfully instantiate the optimizer.

Update: This PR provides these two options and chooses the second option.
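The second option can be sketched as deferring optimizer construction until the optimization function actually runs, so the EP's key/value strings (and resources saved from InferenceSession, like the CPU allocator) are consumed by the constructor. All names below are illustrative stand-ins:

```cpp
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

struct CpuResources {};  // stand-in for CPU allocator / CPU EP saved at session init

class ConstantFoldingDQSketch {
 public:
  ConstantFoldingDQSketch(const CpuResources&, std::unordered_map<std::string, std::string> cfg)
      : cfg_(std::move(cfg)) {}
  // Pretend the optimization succeeds only when a node_index_set was supplied.
  bool Apply() const { return cfg_.count("node_index_set") > 0; }

 private:
  std::unordered_map<std::string, std::string> cfg_;
};

// The lookup class hands back a function; the optimizer itself is only built
// when the partitioner invokes it, after GetCapability and before Compile.
std::function<bool()> MakeOptimizationFunc(const CpuResources& res,
                                           std::unordered_map<std::string, std::string> cfg) {
  return [&res, cfg = std::move(cfg)]() {
    ConstantFoldingDQSketch optimizer(res, cfg);  // constructed lazily here
    return optimizer.Apply();
  };
}
```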

Comment on lines 28 to 30
GraphPartitioner(KernelRegistryManager& kernel_registry_mgr, const GraphTransformerManager& graph_transformer_mgr, const ExecutionProviders& providers)
: kernel_registry_mgr_(kernel_registry_mgr),
graph_transformer_mgr_(graph_transformer_mgr),
Contributor

Instead of wiring in the GraphTransformerManager would it be better to add a new standalone class that provides lookup-based access to a set of L2 optimizers that are directly usable by an EP? That would decouple the GraphPartitioner from the optimizer registration/lookup a bit more. I don't think GraphTransformerManager is providing value in this case as I don't think we need to loop.

The registration/lookup class for re-usable optimizers could be a singleton with a static 'create' method that calls GenerateTransformersForEP and saves the returned list. We could have InferenceSession call the 'create' method to provide any other required things like the CPU allocator, and potentially apply the optimization level when doing so if we want to do that on the ORT side. GraphPartitioner could call a static 'get' method to get the instance so we don't need to wire it through from the inference session.

Contributor Author
@chilo-ms chilo-ms Feb 4, 2025

I added a new standalone class called GraphOptimizerRegistry to provide registration and lookup access.
This lookup class is a singleton; its 'create' method is called during InferenceSession initialization, and it calls GenerateTransformersForEP and saves the returned list. BTW, I'm not sure this 'create' function needs to be static?

Given that this lookup instance is a singleton using a static variable, the provider bridge can now access the instance directly, which means we don't need to get the instance in the graph partitioner and feed it to GetCapability. Also, we don't need to change GetCapability's signature to add a new parameter (i.e. GraphOptimizerRegistry&).
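The singleton registration/lookup pattern being discussed can be sketched like this. The real GraphOptimizerRegistry is the one added by this PR; the members below (Get/Register/Lookup) and the simplified selection signature are illustrative stand-ins:

```cpp
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

class GraphOptimizerRegistrySketch {
 public:
  using SelectionFunc = std::function<int()>;  // simplified stand-in signature

  // Meyers singleton: the static local makes Get() callable from the provider
  // bridge without wiring the instance through GraphPartitioner.
  static GraphOptimizerRegistrySketch& Get() {
    static GraphOptimizerRegistrySketch instance;
    return instance;
  }

  // InferenceSession initialization would call something like this with the
  // list produced by GenerateTransformersForEP.
  void Register(const std::string& name, SelectionFunc fn) {
    optimizers_[name] = std::move(fn);
  }

  const SelectionFunc* Lookup(const std::string& name) const {
    auto it = optimizers_.find(name);
    return it == optimizers_.end() ? nullptr : &it->second;
  }

 private:
  GraphOptimizerRegistrySketch() = default;
  std::unordered_map<std::string, SelectionFunc> optimizers_;
};
```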

onnxruntime/core/optimizer/constant_folding.h (outdated review threads, resolved)
const InlinedHashSet<std::string>& excluded_initializers = {}) noexcept;

bool AllowConstantFolding(const Node& node) const;
Status UpdateNodeIndexSet(InlinedHashSet<onnxruntime::NodeIndex>& node_index_set);
Contributor

Can we create a new instance of the optimizer for each node_index_set instead of having an update method in order to keep it a little cleaner?

Contributor Author

The optimizer can only be instantiated during InferenceSession initialization because it needs things like the CPU allocator or the CPU EP. But we don't know what the node_index_set will be until the graph partitioner calls the ConstantFoldingDQ optimization function that takes in the graph instance.

I wonder how we can do this without adding the UpdateNodeIndexSet member function.

Contributor Author

Update:
I modified the code to instantiate the optimizer inside the optimization function.
Now the node_index_set can be consumed by the constructor, and UpdateNodeIndexSet can be removed.

onnxruntime/core/session/provider_bridge_ort.cc (outdated review thread, resolved)
}
cc_to_update.sub_graph->nodes = updated_nodes;

auto original_meta_def = cc_to_update.sub_graph->GetMetaDef();
Contributor

Instead of creating a new MetaDef, now that we have a use case where it makes sense to update it, could we instead add a method to get a mutable MetaDef from sub_graph?

*/

std::function<std::vector<std::unique_ptr<ComputeCapability>>(const GraphViewer&)> selection_func;
auto status = g_host->GetEPOptimizerByName("ConstantFoldingDQ", graph_transformer_mgr, selection_func);
Contributor

Should this check status?

Contributor Author

Added status checking.

@github-actions github-actions bot left a comment

You can commit the suggested changes from lintrunner.

Outdated review threads, resolved:
onnxruntime/core/framework/compute_capability.h
onnxruntime/core/optimizer/constant_folding.h
onnxruntime/core/optimizer/graph_optimizer_registry.cc
@github-actions github-actions bot left a comment

You can commit the suggested changes from lintrunner.
