Whisper Redesigned Solution #23549

Open · wants to merge 38 commits into main
Conversation

kunal-vaishnavi (Contributor) commented on Jan 31, 2025

Description

This PR redesigns how Whisper is created and supported in ONNX Runtime. The new solution leverages previous optimization work and is designed to be used in conjunction with ONNX Runtime GenAI.

Some of the changes include:

  • Redesigned export that creates new ONNX models without needing a WhisperBeamSearch op
    • Creates one encoder model that also pre-computes the cross-attention KV caches (since they only need to be computed once)
    • Creates one decoder model that can be used during pre-fill and token generation (see the runtime sketch after this list)
    • Creates one jump-times model that can be used for word-level timestamps
    • Removes the need for a WhisperBeamSearch op to chain the encoder and decoder subgraphs
    • Removes the need to duplicate the decoder's weights in memory
      • The previous solution with the WhisperBeamSearch op created an encoder-decoder-init model and a decoder-with-past model, so the decoder's weights were stored twice, once in each model.
    • Removes the need for separate logic to export the PyTorch model coming from OpenAI vs. the PyTorch model coming from Hugging Face
  • Refactors common parameters and logic used in the CPU and CUDA attention kernels
    • Adds DUMP_STRING to enable easy logging of intermediate information when debugging an issue in debug builds. This info is not printed in release builds, so it does not impact performance.
    • Integrates DecoderMaskedMultiHeadAttention into MultiHeadAttention
    • Enables past-present buffer sharing in the MultiHeadAttention op for improved performance
    • Adds cache_indirection and past_sequence_length as new optional inputs to MultiHeadAttention
    • Adds output_qk as new optional output to MultiHeadAttention
    • Enables calculating output_qk tensor with FP16 or FP32 precision, regardless of the model's precision
  • Adds CI tests that run end-to-end across the various flag combinations used by many customers internally and externally
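
To illustrate how the separate encoder and decoder models are meant to be consumed, here is a minimal sketch that loads them with ONNX Runtime directly (ONNX Runtime GenAI normally drives the generation loop). The file names below are assumptions for illustration; the actual input and output names come from the exported models.

    import onnxruntime as ort

    # Hypothetical file names; the real ones come from the export step.
    encoder = ort.InferenceSession("whisper_encoder.onnx", providers=["CPUExecutionProvider"])
    decoder = ort.InferenceSession("whisper_decoder.onnx", providers=["CPUExecutionProvider"])

    # The encoder runs once per audio segment; its outputs include the
    # pre-computed cross-attention KV caches consumed by the decoder.
    print([o.name for o in encoder.get_outputs()])

    # The same decoder session serves both pre-fill and token generation,
    # so its weights are kept in memory only once.
    print([i.name for i in decoder.get_inputs()])
    print([o.name for o in decoder.get_outputs()])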

The existing solutions are still available if desired.
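
As a rough sketch of the MultiHeadAttention changes listed above (cache_indirection, past_sequence_length, and output_qk), the snippet below builds a com.microsoft.MultiHeadAttention node that wires in the new optional inputs and requests the new optional output. Skipped optional inputs are passed as empty strings, per ONNX convention; the exact input/output ordering shown here is an assumption for illustration, and the contrib op schema in this PR is authoritative.

    from onnx import helper

    # Positions of the optional inputs/outputs are illustrative assumptions;
    # consult the updated MultiHeadAttention contrib-op schema for the real order.
    mha_node = helper.make_node(
        "MultiHeadAttention",
        inputs=[
            "query", "key", "value",
            "",                        # bias (not used in this sketch)
            "",                        # key padding mask (not used in this sketch)
            "",                        # attention bias (not used in this sketch)
            "past_key", "past_value",  # KV caches, shareable with present_key/present_value
            "past_sequence_length",    # new optional input
            "cache_indirection",       # new optional input for beam search bookkeeping
        ],
        outputs=[
            "output",
            "present_key", "present_value",
            "output_qk",               # new optional output (FP16 or FP32)
        ],
        domain="com.microsoft",
        num_heads=8,
    )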

Known Issues

  • The FP32 CPU model with the WhisperBeamSearch op and output QK is currently disabled because ONNX Runtime only supports output QK kernels on CUDA, not on CPU.
  • The FP32 CPU model has a bug with Neg --> Shape in the jump-times model when exporting the model to contain the WhisperBeamSearch op.
  • The DecoderMaskedMultiHeadAttention CPU kernel has a parity mismatch with the DecoderMaskedMultiHeadAttention CUDA kernel.
  • Using DecoderMaskedMultiHeadAttention for the FP32 CPU model is not enabled. Currently, it uses MultiHeadAttention to avoid the parity mismatch issue.

Motivation and Context

Using the beam search op has made it more difficult to debug and fix errors that are encountered. This new approach is more flexible and more customizable for users (e.g. by running with ONNX Runtime GenAI). It also helps this issue.

kunal-vaishnavi and others added 30 commits April 25, 2024 18:32
        return model


    def fix_past_sequence_length(model: ModelProto):

Check notice (Code scanning / CodeQL): Explicit returns mixed with implicit (fall through) returns
Mixing implicit and explicit returns may indicate an error as implicit returns always return None.
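
As a side note, a minimal illustration of the pattern this notice describes (not the PR's code): when one path returns a value and another falls off the end of the function, the fall-through path silently returns None, so making every exit explicit is clearer.

    # Flagged pattern: one path returns a node, the other falls through (returns None).
    def find_mha_node(graph):
        for node in graph.node:
            if node.op_type == "MultiHeadAttention":
                return node
        # implicit fall-through here returns None

    # Clearer: make the fall-through explicit.
    def find_mha_node_explicit(graph):
        for node in graph.node:
            if node.op_type == "MultiHeadAttention":
                return node
        return None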

        diff = np.abs(pt_outputs[i] - ort_outputs[i])
        logger.warning(f"Comparing {output_name}...")
        logger.warning(f"Max diff: {np.max(diff)}")
    except:  # noqa: E722

Check notice (Code scanning / CodeQL): Except block handles 'BaseException'
Except block directly handles BaseException.
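
For context on this notice, a small example (not the PR's code): a bare except also catches KeyboardInterrupt and SystemExit, while catching Exception keeps those control-flow exceptions intact.

    import numpy as np

    pt_output = np.array([1.0, 2.0])
    ort_output = np.array([1.0, 2.5])

    try:
        diff = np.abs(pt_output - ort_output)
        print(f"Max diff: {np.max(diff)}")
    except Exception as e:  # narrower than a bare `except:`, which handles BaseException
        print(f"Comparison failed: {e}")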

@@ -0,0 +1,195 @@

    import numpy as np

    import onnxruntime as ort

Check notice (Code scanning / CodeQL): Module is imported with 'import' and 'import from' (test)
Module 'onnxruntime' is imported with both 'import' and 'import from'.
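
For reference, the pattern this notice points at and one consistent alternative (not the PR's code):

    # Flagged pattern: the same module pulled in two ways, e.g.
    #   import onnxruntime as ort
    #   from onnxruntime import InferenceSession
    # A consistent alternative: import the module once and qualify its members.
    import onnxruntime as ort

    print(ort.__version__)  # members referenced through the single module alias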