
Add context support to Qwen3 ASR#126

Open
lucasnewman wants to merge 2 commits into Blaizzy:main from lucasnewman:qwen3asr-context

Conversation

@lucasnewman
Collaborator

Resolves #101

@lucasnewman lucasnewman requested a review from Blaizzy March 25, 2026 21:03

@NicolasArnouts NicolasArnouts left a comment


Background & Ecosystem Context

Context/system prompting is a documented, first-class feature of Qwen3-ASR. The official model (QwenLM/Qwen3-ASR) has a context parameter on transcribe() designed for hotword biasing — providing domain-specific vocabulary (medical terms, product names, proper nouns) to improve recognition accuracy. It supports up to ~10,000 tokens of context.

The prompt format across the ecosystem is consistent:

    <|im_start|>system\n{context}<|im_end|>\n
    <|im_start|>user\n<|audio_start|>{audio_pads}<|audio_end|><|im_end|>\n
    <|im_start|>assistant\n{language prefix}
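As a rough illustration, the template above can be modeled in Python. This is a sketch only: `build_prompt_text` and its parameter names are hypothetical stand-ins, not the PR's Swift API, and the language prefix is passed in verbatim rather than formatted, since its exact shape is model-specific.

```python
def build_prompt_text(context: str = "", audio_pads: str = "", language_prefix: str = "") -> str:
    """Assemble a Qwen3-ASR-style chat prompt following the template above."""
    system = f"<|im_start|>system\n{context}<|im_end|>\n"
    user = f"<|im_start|>user\n<|audio_start|>{audio_pads}<|audio_end|><|im_end|>\n"
    # The assistant turn is left open so decoding continues from the (optional) prefix.
    assistant = f"<|im_start|>assistant\n{language_prefix}"
    return system + user + assistant
```

Note that every implementation in the table below ultimately produces some variant of this string; the differences are only in where the context text is spliced in.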

Cross-implementation comparison

| Project | Parameter name | Implementation |
| --- | --- | --- |
| QwenLM/Qwen3-ASR (official) | context | System message content |
| Blaizzy/mlx-audio (Python) | system_prompt | _build_prompt() injects into system turn |
| k2-fsa/sherpa-onnx (C++) | hotwords string | Concatenated into system segment |
| dseditor/QwenASRMiniTool | context | Token-level splice at position 3 |
| This PR (Swift) | context | String interpolation into system turn |

The PR's naming choice (context) aligns with the official Qwen API, not the Python mlx-audio's system_prompt. This is arguably the better name.


What's Good

  1. Clean refactor: Splitting buildPromptText (returns String) from buildPrompt (tokenizes to MLXArray) is a solid design decision. It makes the text output independently testable without needing a tokenizer.

  2. Complete threading: Context flows through every call site — generate(), generateStream(), generateSingleChunk(), buildPrompt(), and both STTGenerationModel protocol conformance methods.

  3. Concurrency is sound: String is Sendable, so context safely crosses the Task.detached boundary in the streaming path without needing UncheckedSendableBox wrapping.

  4. Each chunk gets the same context: Both streaming and non-streaming paths pass the same context to every chunk's buildPrompt() call — correct behavior since context biasing should be consistent across the entire transcription.

  5. Test covers the key behavior: Verifies context injection, audio token count, and assistant prefix in one focused test.

  6. Consistent with codebase patterns: GraniteSpeech already has a model-specific prompt: String? parameter that's NOT in STTGenerateParameters — this PR follows the exact same pattern.


Issues

1. Test uses context as natural language instruction — but Qwen3-ASR expects hotword vocabulary

The test uses:

    context: "Prefer product names over pronouns."

But Qwen3-ASR's context feature is trained for hotword biasing (space-separated vocabulary terms), not natural language instructions. Every official example in QwenLM/Qwen3-ASR uses vocabulary terms:

    context=["交易 停滞"]  # space-separated hotwords (here: "trading stagnation")

The model treats the system prompt as conditioning tokens for recognition bias, not as an instruction to follow. A natural language sentence like "Prefer product names over pronouns" could cause those words to leak into the transcription output (known issue: QwenLM/Qwen3-ASR#106, #140).

Consider changing the test to use a realistic hotword example:

    context: "TypeWhisper MLX CoreML"

This better documents the intended usage for future contributors.

2. Trailing newline mismatch with Python reference (minor correctness)

The Python _build_prompt() adds a trailing \n after the system prompt:

    system_content = f"{system_prompt}\n" if system_prompt else ""

The PR does not add a trailing newline:

    return "<|im_start|>system\n\(context)<|im_end|>\n"

With context "foo bar":

  • Python: <|im_start|>system\nfoo bar\n<|im_end|>\n
  • Swift PR: <|im_start|>system\nfoo bar<|im_end|>\n

This may affect tokenization and therefore model behavior. The official Qwen chat template also includes the trailing newline. I'd recommend matching the Python behavior:

    let systemContent = context.isEmpty ? "" : "\(context)\n"
    return "<|im_start|>system\n\(systemContent)<|im_end|>\n"
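The divergence is exactly one byte; a quick Python check (illustrative only, using literal strings rather than either implementation) pins it down:

```python
ctx = "foo bar"
python_ref = f"<|im_start|>system\n{ctx}\n<|im_end|>\n"  # mlx-audio adds the trailing \n
swift_pr = f"<|im_start|>system\n{ctx}<|im_end|>\n"      # this PR omits it

# The strings differ only by the newline before <|im_end|>, but that
# single character can change how the boundary tokenizes.
assert python_ref != swift_pr
assert python_ref.replace("\n<|im_end|>", "<|im_end|>", 1) == swift_pr
```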

3. Redundant systemContext assignment (nit)

    let systemContext = context.isEmpty ? "" : context

This is a no-op — systemContext always equals context. If the intent was to add the trailing newline (per issue #2 above), this is where it should go. As-is, just use context directly.

4. context: "" hardcoded at protocol boundary — intentional but worth documenting

Both STTGenerationModel conformance methods hardcode context: "":

    public func generate(audio: MLXArray, generationParameters: STTGenerateParameters) -> STTOutput {
        generate(
            audio: audio,
            ...
            context: "",  // Always empty through protocol API
            ...
        )
    }

This means callers using the generic STTGenerationModel protocol can never pass context. This is consistent with how GraniteSpeech handles prompt (also not exposed through the protocol), so it's a valid design choice. However, consider:

  • The Swift CLI already accepts --context but prints a warning that it's ignored (App.swift:278-280). This PR doesn't wire it up — is that intentional for a follow-up?
  • A one-line doc comment on context in generate() explaining it's for hotword biasing (space-separated vocabulary terms) would help.

5. buildPromptText is internal while buildPrompt is public (deliberate?)

buildPromptText is func (internal) while buildPrompt is public func. The test accesses it via @testable import. If this is intentional (only expose the tokenized version publicly), that's fine. But if external callers would benefit from the text version (e.g., for debugging/logging prompt content), consider making it public.

6. Missing test cases

The test covers context with a value, but doesn't verify:

  • Empty context (default): that the prompt remains <|im_start|>system\n<|im_end|>\n — no stray whitespace
  • No language (auto-detect): that context + language=nil produces the right prompt (no assistant prefix)
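Modeled in Python against a stand-in builder (hypothetical names mirroring the template earlier in this review, not the PR's Swift code), the two missing cases would assert roughly:

```python
def build_prompt_text(context: str = "", audio_pads: str = "", language_prefix: str = "") -> str:
    # Stand-in for the PR's buildPromptText, following the ecosystem template.
    return (
        f"<|im_start|>system\n{context}<|im_end|>\n"
        f"<|im_start|>user\n<|audio_start|>{audio_pads}<|audio_end|><|im_end|>\n"
        f"<|im_start|>assistant\n{language_prefix}"
    )

# Case 1: empty context (default) yields an empty system turn, no stray whitespace.
assert build_prompt_text().startswith("<|im_start|>system\n<|im_end|>\n")

# Case 2: context with auto-detect (no language prefix) ends at the open assistant turn.
out = build_prompt_text(context="TypeWhisper MLX CoreML")
assert out.endswith("<|im_start|>assistant\n")
```

The Swift equivalents would be two short additions to the existing focused test.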

7. String = "" vs String? = nil type choice

GraniteSpeech in this same codebase uses prompt: String? = nil. The Python mlx-audio uses system_prompt: str | None = None. The PR uses context: String = "". Not blocking, but String? = nil would be more consistent with existing patterns.


Known Qwen3-ASR Context Gotchas (FYI, not PR issues)

From the Qwen3-ASR issue tracker:

  • Context leaking: Hotwords can appear verbatim in transcription output (QwenLM/Qwen3-ASR#106)
  • Hotword repetition: With the 0.6B model in streaming mode, phonetically similar speech can cause the model to repeat the hotword list (QwenLM/Qwen3-ASR#140)
  • Effectiveness is unpredictable: The Qwen maintainer describes context biasing effectiveness as "玄学" (metaphysical/unpredictable)

Verdict

Approve with minor changes. The implementation is solid, correctly threaded through all code paths, concurrency-safe, and follows existing codebase patterns.

Should fix:

  • Update test to use realistic hotword example instead of natural language instruction (documents correct usage)
  • Add trailing \n after context to match the Python reference and official Qwen chat template
  • Remove the redundant systemContext variable

Nice to have:

  • Add test for empty context (default path)
  • Wire up CLI --context to actually use this feature (could be a follow-up PR)
  • Consider String? = nil for consistency with GraniteSpeech's prompt: String?
  • Doc comment on the context parameter explaining it's for space-separated hotword biasing


Development

Successfully merging this pull request may close these issues.

Qwen3 STT support context parameter for better accuracy
