Conversation

russellbrooks

Optional passthrough of onnxruntime.SessionOptions to the underlying ONNX InferenceSession – re: slack thread

When inference runs in a virtualized environment, e.g. a Docker container, the ONNX inference session detects the hardware of the underlying host machine, such as the number of CPU cores, rather than what the container actually has access to (similar issues arise with other Python multiprocessing tools). This can oversaturate the CPU cores and cause noisy-neighbor issues across containers sharing the host (e.g. a large EC2 instance with 96 cores).
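
The same host-level view is visible from Python itself (a minimal sketch, assuming a container limited to 4 CPUs on a 96-core host):

import os

# Inside a container capped at 4 CPUs (e.g. docker run --cpus=4) on a
# 96-core host, this still reports the host's logical core count -- the
# same host-level view onnxruntime uses when sizing its default thread pools.
print(os.cpu_count())  # -> 96, not 4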

These passthrough options can be specified in the existing JSON file pointed to by the env variable UNSTRUCTURED_DEFAULT_MODEL_INITIALIZE_PARAMS_JSON_PATH. As an example, here's JSON that would limit model inference to 4 CPU cores:

{
    "model_name": "detectron2_onnx",
    "session_options_dict": {
        "intra_op_num_threads": 4,
        "inter_op_num_threads": 4
    }
}
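
To apply it, point the env variable at that file before the model is initialized (a minimal sketch; the file path is illustrative and the variable could just as well be set in the shell or Dockerfile):

import os

# Illustrative path -- use wherever your deployment mounts the JSON above.
os.environ["UNSTRUCTURED_DEFAULT_MODEL_INITIALIZE_PARAMS_JSON_PATH"] = "/path/to/model_init_params.json"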

Param reference:

  • Intra-Op Parallelism: Controls the number of threads used for parallel execution within a single operator. For example, if a matrix multiplication can be parallelized, this setting determines how many threads work on that operation. The default is 0, which lets onnxruntime choose.
  • Inter-Op Parallelism: Controls the number of threads that can run different operators in parallel. For instance, if the model architecture allows multiple layers or operations to execute simultaneously (i.e., they are not dependent on each other's output), this setting manages how many such operations can run at the same time. The default is 0, which lets onnxruntime choose.
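
These two values correspond to attributes on onnxruntime.SessionOptions. A minimal sketch of the equivalent manual setup (the model path is illustrative, and this is not the exact wiring inside unstructured-inference):

import onnxruntime as ort

# Equivalent manual configuration of the two thread-pool settings.
session_options = ort.SessionOptions()
session_options.intra_op_num_threads = 4  # threads within a single operator
session_options.inter_op_num_threads = 4  # operators allowed to run concurrently
session = ort.InferenceSession("detectron2.onnx", sess_options=session_options)  # illustrative path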
