[Model builder] Add option to exclude cache in inputs and outputs #1162

xenova · 2024-12-22T12:09:07Z

In certain cases (e.g., single-round conversations), the graph does not need past_key_values inputs or present outputs, as with https://huggingface.co/livekit/turn-detector (and its usage here).

So, this PR adds an option to exclude these inputs and outputs from the graph.

Example usage:

python builder.py -m livekit/turn-detector -o converted -e cpu -p fp32 --extra_options exclude_cache=True

Output graph signature:

For comparison, graph signature w/ cache IO:
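A quick way to tell the two signatures apart is to scan the graph's input and output names for cache tensors. The sketch below is only illustrative: `cache_io_names` is a hypothetical helper, the name lists are made up for a one-layer model, and the `past_key_values.` / `present.` prefixes are assumed from the names used in this PR.

```python
def cache_io_names(input_names, output_names):
    """Return the input and output names that look like KV cache tensors."""
    # Prefixes are assumptions based on the `past_key_values` inputs and
    # `present` outputs mentioned in this PR.
    cache_inputs = [n for n in input_names if n.startswith("past_key_values.")]
    cache_outputs = [n for n in output_names if n.startswith("present.")]
    return cache_inputs, cache_outputs


# Hypothetical signature of a one-layer model exported with cache IO.
with_cache = cache_io_names(
    ["input_ids", "attention_mask", "past_key_values.0.key", "past_key_values.0.value"],
    ["logits", "present.0.key", "present.0.value"],
)

# Hypothetical signature of the same model exported with exclude_cache=True.
without_cache = cache_io_names(["input_ids", "attention_mask"], ["logits"])
```

For a real export, the name lists could come from the loaded model's inputs and outputs instead of the hard-coded lists used here.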

xenova · 2024-12-22T12:11:42Z

@microsoft-github-policy-service agree [company="{Hugging Face}"]

xenova · 2024-12-22T12:12:13Z

@microsoft-github-policy-service agree company="Hugging Face"

xenova · 2024-12-22T13:20:17Z

This modification only works for models w/ GQA. Maybe someone with a bit more experience with the model builder could help get it working for models w/ MHA? 😇

ambroser53 · 2025-01-09T12:00:09Z

Does this give a memory overhead improvement?

xenova · 2025-01-09T17:32:49Z

> Does this give a memory overhead improvement?

Indeed - not having to store and pass these values back has improved execution time in my tests.

kunal-vaishnavi · 2025-01-11T01:54:04Z

src/python/py/models/builder.py

@@ -3267,6 +3275,8 @@ def get_args():
                exclude_lm_head = Remove language modeling head from your ONNX model.
                    Use this option when you want to remove the language modeling head from within your ONNX model.
                    Instead of `logits`, you will have `hidden_states` as the output to your ONNX model.
+                exclude_cache = Remove cache inputs and outputs from your ONNX model.


Suggested change

-                exclude_cache = Remove cache inputs and outputs from your ONNX model.
+                exclude_kv_cache = Remove KV cache inputs and outputs from your ONNX model.

kunal-vaishnavi · 2025-01-11T01:56:15Z

src/python/py/models/builder.py

@@ -3267,6 +3275,8 @@ def get_args():
                exclude_lm_head = Remove language modeling head from your ONNX model.
                    Use this option when you want to remove the language modeling head from within your ONNX model.
                    Instead of `logits`, you will have `hidden_states` as the output to your ONNX model.
+                exclude_cache = Remove cache inputs and outputs from your ONNX model.
+                    Use this option when you want to remove the `past_key_values` inputs and `present` outputs from within your ONNX model.


Suggested change

-                    Use this option when you want to remove the `past_key_values` inputs and `present` outputs from within your ONNX model.
+                    Use this option when you want to remove the `past_key_values` inputs and `present` outputs from within your ONNX model.
+                    Note that this should be used when you want to run ONNX models with ONNX Runtime only. ONNX Runtime GenAI requires the KV cache inputs and outputs for inference.

kunal-vaishnavi · 2025-01-11T01:56:42Z

src/python/py/models/builder.py

@@ -111,6 +111,8 @@ def __init__(self, config, io_dtype, onnx_dtype, ep, cache_dir, extra_options):
        elif self.include_hidden_states:
            self.output_names = ["hidden_states"] + self.output_names

+        self.exclude_cache = "exclude_cache" in extra_options


Suggested change

-        self.exclude_cache = "exclude_cache" in extra_options
+        self.exclude_cache = extra_options.get("exclude_cache", False)
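The two forms differ once the option value has been parsed into a boolean. A membership test is true whenever the key is present, even if the user passed `exclude_cache=False`, while `.get` returns the parsed value. A minimal illustration, using a hand-written `extra_options` dict rather than the builder's real parsing:

```python
# Assume the option was passed as exclude_cache=False and already parsed to a bool.
extra_options = {"exclude_cache": False}

# Membership only checks that the key exists, so this wrongly enables the option.
by_membership = "exclude_cache" in extra_options

# .get returns the stored value, and falls back to False when the key is absent.
by_value = extra_options.get("exclude_cache", False)
when_missing = {}.get("exclude_cache", False)
```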

kunal-vaishnavi · 2025-01-11T02:04:38Z

src/python/py/models/builder.py

+                past_k, past_v, present_k, present_v = "", "", "", ""
+        else:
+            past_k, past_v = "", ""
+            present_k = f"present.{layer_id}.key"


I think present_k and present_v should be empty strings since the KV cache inputs and outputs are not in the model.
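The suggestion can be sketched as a small helper that returns all four names as empty strings when the cache is excluded. In ONNX, an empty string as a node input or output name marks that optional slot as unused. `kv_io_names` is a hypothetical helper, not a function from builder.py, and the `past_key_values.{layer_id}.*` names are assumed by analogy with the `present.{layer_id}.key` name in the diff.

```python
def kv_io_names(layer_id, exclude_cache):
    """Return (past_k, past_v, present_k, present_v) names for one layer."""
    if exclude_cache:
        # No KV cache inputs or outputs in the graph: leave every slot empty.
        return "", "", "", ""
    return (
        f"past_key_values.{layer_id}.key",    # assumed naming
        f"past_key_values.{layer_id}.value",  # assumed naming
        f"present.{layer_id}.key",
        f"present.{layer_id}.value",
    )


excluded = kv_io_names(0, exclude_cache=True)
included = kv_io_names(3, exclude_cache=False)
```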

kunal-vaishnavi · 2025-01-11T02:27:24Z

Thanks for the contribution! Can you also update the following places?

Add the option in the validation of the boolean extra options

onnxruntime-genai/src/python/py/models/builder.py

Lines 3133 to 3141 in 44e541e

        bools = ["int4_is_symmetric", "exclude_embeds", "exclude_lm_head", "include_hidden_states", "enable_cuda_graph", "use_8bits_moe", "use_qdq", "include_prompt_templates"]
        for key in bools:
            if key in kv_pairs:
                if kv_pairs[key] in {"false", "False", "0"}:
                    kv_pairs[key] = False
                elif kv_pairs[key] in {"true", "True", "1"}:
                    kv_pairs[key] = True
                else:
                    raise ValueError(f"{key} must be false/False/0 or true/True/1.")

Add the option in the README and provide a usage example

For MultiHeadAttention where the repeat KV operation occurs, you may need to remove the Concat node that combines past_kv and curr_kv to produce present_kv.

onnxruntime-genai/src/python/py/models/builder.py

Lines 1287 to 1304 in 44e541e

        # Make the initial subgraph
        #
        #                                                       +------> Gather --> Unsqueeze -----+
        #                                                       |                                  |
        #                                         past_kv       +------> Gather --> Unsqueeze -----+---> Mul --> Concat (4D)
        #                                            |          |                                  |
        # root_input --> Reshape --> Transpose --> Concat --> Shape ---> Gather --> Unsqueeze -----+---> Concat (5D)
        #                                            |          |                                  |
        #                                        present_kv     +------> Gather --> Unsqueeze -----+
        reshape_1_name = f"{basename}/Reshape_1"
        reshape_1_inputs = [root_input, f"/model/constants/TensorProto.INT64/1D/0, 0, {self.num_kv_heads}, -1"]
        self.make_reshape(reshape_1_name, reshape_1_inputs, dtype=self.io_dtype, shape=['batch_size', 'sequence_length', self.num_kv_heads, self.head_size])
        transpose_1_name = f"{basename}/Transpose_1"
        transpose_1_input = f"{reshape_1_name}/output_0"
        self.make_transpose(transpose_1_name, transpose_1_input, dtype=self.io_dtype, shape=['batch_size', self.num_kv_heads, 'sequence_length', self.head_size], perm=[0,2,1,3])
        concat_1_name = f"{basename}/Concat_1"
        concat_1_inputs = [past_kv, f"{transpose_1_name}/output_0"]
        self.make_node("Concat", inputs=concat_1_inputs, outputs=[present_kv], name=concat_1_name, axis=2)
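For the first request in this comment (validating the new option as a boolean), the change would amount to adding the option name to the `bools` list. A standalone sketch of that validation, with the loop lifted into a hypothetical `parse_bool_options` function and a shortened option list so it can run outside builder.py:

```python
# Shortened list for illustration; "exclude_cache" is the newly added entry.
BOOL_OPTIONS = ["exclude_embeds", "exclude_lm_head", "include_hidden_states", "exclude_cache"]


def parse_bool_options(kv_pairs, bools=BOOL_OPTIONS):
    """Convert string values of known boolean options into real booleans."""
    parsed = dict(kv_pairs)  # copy so the caller's dict is left untouched
    for key in bools:
        if key in parsed:
            if parsed[key] in {"false", "False", "0"}:
                parsed[key] = False
            elif parsed[key] in {"true", "True", "1"}:
                parsed[key] = True
            else:
                raise ValueError(f"{key} must be false/False/0 or true/True/1.")
    return parsed


# Mirrors `--extra_options exclude_cache=True` from the PR description.
options = parse_bool_options({"exclude_cache": "True"})
```

Without the entry in the list, the value would stay the string "True" or "False", and both strings are truthy in Python.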

Add option to exclude cache in inputs and outputs

af4df27

RyanUnderhill requested a review from kunal-vaishnavi January 7, 2025 01:28

kunal-vaishnavi reviewed Jan 11, 2025
