
Add context support to Qwen3 ASR#126

Open
lucasnewman wants to merge 2 commits into Blaizzy:main from lucasnewman:qwen3asr-context

Conversation

@lucasnewman
Collaborator

Resolves #101

@lucasnewman lucasnewman requested a review from Blaizzy March 25, 2026 21:03

@NicolasArnouts NicolasArnouts left a comment


Background & Ecosystem Context

Context/system prompting is a documented, first-class feature of Qwen3-ASR. The official model (QwenLM/Qwen3-ASR) has a context parameter on transcribe() designed for hotword biasing — providing domain-specific vocabulary (medical terms, product names, proper nouns) to improve recognition accuracy. It supports up to ~10,000 tokens of context.

The prompt format across the ecosystem is consistent:

    <|im_start|>system\n{context}<|im_end|>\n
    <|im_start|>user\n<|audio_start|>{audio_pads}<|audio_end|><|im_end|>\n
    <|im_start|>assistant\n{language prefix}
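As a rough illustration, the template above can be modeled in Python. This is a sketch only: `build_prompt_text` and its parameter names are hypothetical stand-ins, not the PR's Swift API, and the language prefix is passed in verbatim rather than formatted, since its exact shape is model-specific.

```python
def build_prompt_text(context: str = "", audio_pads: str = "", language_prefix: str = "") -> str:
    """Assemble a Qwen3-ASR-style chat prompt following the template above."""
    system = f"<|im_start|>system\n{context}<|im_end|>\n"
    user = f"<|im_start|>user\n<|audio_start|>{audio_pads}<|audio_end|><|im_end|>\n"
    # The assistant turn is left open so decoding continues from the (optional) prefix.
    assistant = f"<|im_start|>assistant\n{language_prefix}"
    return system + user + assistant
```

Note that every implementation in the table below ultimately produces some variant of this string; the differences are only in where the context text is spliced in.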

Cross-implementation comparison

| Project | Parameter name | Implementation |
| --- | --- | --- |
| QwenLM/Qwen3-ASR (official) | context | System message content |
| Blaizzy/mlx-audio (Python) | system_prompt | _build_prompt() injects into system turn |
| k2-fsa/sherpa-onnx (C++) | hotwords string | Concatenated into system segment |
| dseditor/QwenASRMiniTool | context | Token-level splice at position 3 |
| This PR (Swift) | context | String interpolation into system turn |

The PR's naming choice (context) aligns with the official Qwen API, not the Python mlx-audio's system_prompt. This is arguably the better name.


What's Good

  1. Clean refactor: Splitting buildPromptText (returns String) from buildPrompt (tokenizes to MLXArray) is a solid design decision. It makes the text output independently testable without needing a tokenizer.

  2. Complete threading: Context flows through every call site — generate(), generateStream(), generateSingleChunk(), buildPrompt(), and both STTGenerationModel protocol conformance methods.

  3. Concurrency is sound: String is Sendable, so context safely crosses the Task.detached boundary in the streaming path without needing UncheckedSendableBox wrapping.

  4. Each chunk gets the same context: Both streaming and non-streaming paths pass the same context to every chunk's buildPrompt() call — correct behavior since context biasing should be consistent across the entire transcription.

  5. Test covers the key behavior: Verifies context injection, audio token count, and assistant prefix in one focused test.

  6. Consistent with codebase patterns: GraniteSpeech already has a model-specific prompt: String? parameter that's NOT in STTGenerateParameters — this PR follows the exact same pattern.


Issues

1. Test uses context as natural language instruction — but Qwen3-ASR expects hotword vocabulary

The test uses:

    context: "Prefer product names over pronouns."

But Qwen3-ASR's context feature is trained for hotword biasing (space-separated vocabulary terms), not natural language instructions. Every official example in QwenLM/Qwen3-ASR uses vocabulary terms:

    context=["交易 停滞"]  # space-separated hotwords (here: "trading stagnation")

The model treats the system prompt as conditioning tokens for recognition bias, not as an instruction to follow. A natural language sentence like "Prefer product names over pronouns" could cause those words to leak into the transcription output (known issue: QwenLM/Qwen3-ASR#106, #140).

Consider changing the test to use a realistic hotword example:

    context: "TypeWhisper MLX CoreML"

This better documents the intended usage for future contributors.

2. Trailing newline mismatch with Python reference (minor correctness)

The Python _build_prompt() adds a trailing \n after the system prompt:

    system_content = f"{system_prompt}\n" if system_prompt else ""

The PR does not add a trailing newline:

    return "<|im_start|>system\n\(context)<|im_end|>\n"

With context "foo bar":

  • Python: <|im_start|>system\nfoo bar\n<|im_end|>\n
  • Swift PR: <|im_start|>system\nfoo bar<|im_end|>\n

This may affect tokenization and therefore model behavior. The official Qwen chat template also includes the trailing newline. I'd recommend matching the Python behavior:

    let systemContent = context.isEmpty ? "" : "\(context)\n"
    return "<|im_start|>system\n\(systemContent)<|im_end|>\n"
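The divergence is exactly one byte; a quick Python check (illustrative only, using literal strings rather than either implementation) pins it down:

```python
ctx = "foo bar"
python_ref = f"<|im_start|>system\n{ctx}\n<|im_end|>\n"  # mlx-audio adds the trailing \n
swift_pr = f"<|im_start|>system\n{ctx}<|im_end|>\n"      # this PR omits it

# The strings differ only by the newline before <|im_end|>, but that
# single character can change how the boundary tokenizes.
assert python_ref != swift_pr
assert python_ref.replace("\n<|im_end|>", "<|im_end|>", 1) == swift_pr
```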

3. Redundant systemContext assignment (nit)

    let systemContext = context.isEmpty ? "" : context

This is a no-op — systemContext always equals context. If the intent was to add the trailing newline (per issue #2 above), this is where it should go. As-is, just use context directly.

4. context: "" hardcoded at protocol boundary — intentional but worth documenting

Both STTGenerationModel conformance methods hardcode context: "":

    public func generate(audio: MLXArray, generationParameters: STTGenerateParameters) -> STTOutput {
        generate(
            audio: audio,
            ...
            context: "",  // Always empty through protocol API
            ...
        )
    }

This means callers using the generic STTGenerationModel protocol can never pass context. This is consistent with how GraniteSpeech handles prompt (also not exposed through the protocol), so it's a valid design choice. However, consider:

  • The Swift CLI already accepts --context but prints a warning that it's ignored (App.swift:278-280). This PR doesn't wire it up — is that intentional for a follow-up?
  • A one-line doc comment on context in generate() explaining it's for hotword biasing (space-separated vocabulary terms) would help.

5. buildPromptText is internal while buildPrompt is public (deliberate?)

buildPromptText is func (internal) while buildPrompt is public func. The test accesses it via @testable import. If this is intentional (only expose the tokenized version publicly), that's fine. But if external callers would benefit from the text version (e.g., for debugging/logging prompt content), consider making it public.

6. Missing test cases

The test covers context with a value, but doesn't verify:

  • Empty context (default): that the prompt remains <|im_start|>system\n<|im_end|>\n — no stray whitespace
  • No language (auto-detect): that context + language=nil produces the right prompt (no assistant prefix)
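Modeled in Python against a stand-in builder (hypothetical names mirroring the template earlier in this review, not the PR's Swift code), the two missing cases would assert roughly:

```python
def build_prompt_text(context: str = "", audio_pads: str = "", language_prefix: str = "") -> str:
    # Stand-in for the PR's buildPromptText, following the ecosystem template.
    return (
        f"<|im_start|>system\n{context}<|im_end|>\n"
        f"<|im_start|>user\n<|audio_start|>{audio_pads}<|audio_end|><|im_end|>\n"
        f"<|im_start|>assistant\n{language_prefix}"
    )

# Case 1: empty context (default) yields an empty system turn, no stray whitespace.
assert build_prompt_text().startswith("<|im_start|>system\n<|im_end|>\n")

# Case 2: context with auto-detect (no language prefix) ends at the open assistant turn.
out = build_prompt_text(context="TypeWhisper MLX CoreML")
assert out.endswith("<|im_start|>assistant\n")
```

The Swift equivalents would be two short additions to the existing focused test.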

7. String = "" vs String? = nil type choice

GraniteSpeech in this same codebase uses prompt: String? = nil. The Python mlx-audio uses system_prompt: str | None = None. The PR uses context: String = "". Not blocking, but String? = nil would be more consistent with existing patterns.


Known Qwen3-ASR Context Gotchas (FYI, not PR issues)

From the Qwen3-ASR issue tracker:

  • Context leaking: Hotwords can appear verbatim in transcription output (QwenLM/Qwen3-ASR#106)
  • Hotword repetition: With the 0.6B model in streaming mode, phonetically similar speech can cause the model to repeat the hotword list (QwenLM/Qwen3-ASR#140)
  • Effectiveness is unpredictable: The Qwen maintainer describes context biasing effectiveness as "玄学" (metaphysical/unpredictable)

Verdict

Approve with minor changes. The implementation is solid, correctly threaded through all code paths, concurrency-safe, and follows existing codebase patterns.

Should fix:

  • Update test to use realistic hotword example instead of natural language instruction (documents correct usage)
  • Add trailing \n after context to match the Python reference and official Qwen chat template
  • Remove the redundant systemContext variable

Nice to have:

  • Add test for empty context (default path)
  • Wire up CLI --context to actually use this feature (could be a follow-up PR)
  • Consider String? = nil for consistency with GraniteSpeech's prompt: String?
  • Doc comment on the context parameter explaining it's for space-separated hotword biasing


Development

Successfully merging this pull request may close these issues.

Qwen3 STT support context parameter for better accuracy
