Background & Ecosystem Context
Context/system prompting is a documented, first-class feature of Qwen3-ASR. The official model (QwenLM/Qwen3-ASR) has a context parameter on transcribe() designed for hotword biasing — providing domain-specific vocabulary (medical terms, product names, proper nouns) to improve recognition accuracy. It supports up to ~10,000 tokens of context.
The prompt format across the ecosystem is consistent:

```
<|im_start|>system\n{context}<|im_end|>\n
<|im_start|>user\n<|audio_start|>{audio_pads}<|audio_end|><|im_end|>\n
<|im_start|>assistant\n{language prefix}
```
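Assuming the Swift implementation mirrors this template, the assembly can be sketched in Python (the function and parameter names here are illustrative, loosely based on mlx-audio's `_build_prompt`):

```python
def build_prompt_text(context: str, audio_pads: str, language_prefix: str) -> str:
    """Assemble the Qwen3-ASR chat-template prompt (illustrative sketch)."""
    return (
        f"<|im_start|>system\n{context}<|im_end|>\n"
        f"<|im_start|>user\n<|audio_start|>{audio_pads}<|audio_end|><|im_end|>\n"
        f"<|im_start|>assistant\n{language_prefix}"
    )

prompt = build_prompt_text("TypeWhisper MLX CoreML", "<|audio_pad|>" * 3, "")
```

With an empty language prefix, the prompt ends at the assistant turn and the model auto-detects the language.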
Cross-implementation comparison
| Project | Parameter name | Implementation |
|---|---|---|
| QwenLM/Qwen3-ASR (official) | `context` | System message content |
| Blaizzy/mlx-audio (Python) | `system_prompt` | `_build_prompt()` injects into system turn |
| k2-fsa/sherpa-onnx (C++) | `hotwords` string | Concatenated into system segment |
| dseditor/QwenASRMiniTool | `context` | Token-level splice at position 3 |
| This PR (Swift) | `context` | String interpolation into system turn |
The PR's naming choice (`context`) aligns with the official Qwen API rather than mlx-audio's `system_prompt`. This is arguably the better name.
What's Good
- Clean refactor: Splitting `buildPromptText` (returns `String`) from `buildPrompt` (tokenizes to `MLXArray`) is a solid design decision. It makes the text output independently testable without needing a tokenizer.
- Complete threading: Context flows through all 5 call sites — `generate()`, `generateStream()`, `generateSingleChunk()`, `buildPrompt()`, and both `STTGenerationModel` protocol conformance methods.
- Concurrency is sound: `String` is `Sendable`, so context safely crosses the `Task.detached` boundary in the streaming path without needing `UncheckedSendableBox` wrapping.
- Each chunk gets the same context: Both streaming and non-streaming paths pass the same `context` to every chunk's `buildPrompt()` call — correct behavior, since context biasing should be consistent across the entire transcription.
- Test covers the key behavior: Verifies context injection, audio token count, and assistant prefix in one focused test.
- Consistent with codebase patterns: GraniteSpeech already has a model-specific `prompt: String?` parameter that is NOT in `STTGenerateParameters` — this PR follows the exact same pattern.
Issues
1. Test uses `context` as a natural-language instruction — but Qwen3-ASR expects hotword vocabulary

The test uses:

```swift
context: "Prefer product names over pronouns."
```

But Qwen3-ASR's context feature is trained for hotword biasing (space-separated vocabulary terms), not natural-language instructions. Every official example in QwenLM/Qwen3-ASR uses vocabulary terms:

```python
context=["交易 停滞"]  # space-separated hotwords ("trading", "stagnation")
```

The model treats the system prompt as conditioning tokens for recognition bias, not as an instruction to follow. A natural-language sentence like "Prefer product names over pronouns" could cause those words to leak into the transcription output (known issues: QwenLM/Qwen3-ASR#106, #140).

Consider changing the test to use a realistic hotword example:

```swift
context: "TypeWhisper MLX CoreML"
```

This better documents the intended usage for future contributors.
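Assembling such a context from a vocabulary list is just a space-join (terms here are the illustrative ones suggested above):

```python
# Build a hotword-biasing context from domain vocabulary (illustrative terms)
hotwords = ["TypeWhisper", "MLX", "CoreML"]
context = " ".join(hotwords)
```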
2. Trailing newline mismatch with Python reference (minor correctness)

The Python `_build_prompt()` adds a trailing `\n` after the system prompt:

```python
system_content = f"{system_prompt}\n" if system_prompt else ""
```

The PR does not add a trailing newline:

```swift
return "<|im_start|>system\n\(context)<|im_end|>\n"
```

With context "foo bar":

- Python: `<|im_start|>system\nfoo bar\n<|im_end|>\n`
- Swift PR: `<|im_start|>system\nfoo bar<|im_end|>\n`

This may affect tokenization and therefore model behavior. The official Qwen chat template also includes the trailing newline. I'd recommend matching the Python behavior:

```swift
let systemContent = context.isEmpty ? "" : "\(context)\n"
return "<|im_start|>system\n\(systemContent)<|im_end|>\n"
```

3. Redundant `systemContext` assignment (nit)

```swift
let systemContext = context.isEmpty ? "" : context
```

This is a no-op — `systemContext` always equals `context`. If the intent was to add the trailing newline (per issue #2 above), this is where it should go. As-is, just use `context` directly.
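The recommended behavior can be sanity-checked in Python (a sketch mirroring mlx-audio's `_build_prompt`, covering just the system turn):

```python
def system_turn(context: str) -> str:
    # Trailing \n only when context is non-empty, matching the Python reference
    system_content = f"{context}\n" if context else ""
    return f"<|im_start|>system\n{system_content}<|im_end|>\n"

# Matches the Python reference output for "foo bar"
assert system_turn("foo bar") == "<|im_start|>system\nfoo bar\n<|im_end|>\n"
# Empty context collapses cleanly, with no stray newline
assert system_turn("") == "<|im_start|>system\n<|im_end|>\n"
```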
4. `context: ""` hardcoded at protocol boundary — intentional but worth documenting

Both `STTGenerationModel` conformance methods hardcode `context: ""`:

```swift
public func generate(audio: MLXArray, generationParameters: STTGenerateParameters) -> STTOutput {
    generate(
        audio: audio,
        ...
        context: "", // Always empty through protocol API
        ...
    )
}
```

This means callers using the generic `STTGenerationModel` protocol can never pass context. This is consistent with how GraniteSpeech handles `prompt` (also not exposed through the protocol), so it's a valid design choice. However, consider:

- The Swift CLI already accepts `--context` but prints a warning that it's ignored (App.swift:278-280). This PR doesn't wire it up — is that intentional for a follow-up?
- A one-line doc comment on `context` in `generate()` explaining it's for hotword biasing (space-separated vocabulary terms) would help.
5. `buildPromptText` is internal while `buildPrompt` is public (deliberate?)

`buildPromptText` is declared `func` (internal) while `buildPrompt` is `public func`. The test accesses it via `@testable import`. If this is intentional (only expose the tokenized version publicly), that's fine. But if external callers would benefit from the text version (e.g., for debugging/logging prompt content), consider making it `public`.
6. Missing test cases
The test covers context with a value, but doesn't verify:

- Empty context (default): that the prompt remains `<|im_start|>system\n<|im_end|>\n` — no stray whitespace
- No language (auto-detect): that context + `language = nil` produces the right prompt (no assistant prefix)
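These two invariants could be sketched as follows (a minimal Python mirror of the prompt text; the Swift tests would express the same assertions in XCTest — function and parameter names here are assumptions):

```python
def build_prompt_text(context: str, audio_pads: str, language_prefix: str) -> str:
    """Minimal mirror of the prompt template for checking invariants."""
    return (
        f"<|im_start|>system\n{context}<|im_end|>\n"
        f"<|im_start|>user\n<|audio_start|>{audio_pads}<|audio_end|><|im_end|>\n"
        f"<|im_start|>assistant\n{language_prefix}"
    )

# Empty context: the system turn stays exactly "<|im_start|>system\n<|im_end|>\n"
empty = build_prompt_text("", "<|audio_pad|>", "")
assert empty.startswith("<|im_start|>system\n<|im_end|>\n")

# Auto-detect (no language): the prompt ends at the assistant turn, no prefix
auto = build_prompt_text("TypeWhisper MLX CoreML", "<|audio_pad|>", "")
assert auto.endswith("<|im_start|>assistant\n")
```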
7. `String = ""` vs `String? = nil` type choice

GraniteSpeech in this same codebase uses `prompt: String? = nil`. The Python mlx-audio uses `system_prompt: str | None = None`. The PR uses `context: String = ""`. Not blocking, but `String? = nil` would be more consistent with existing patterns.
Known Qwen3-ASR Context Gotchas (FYI, not PR issues)
From the Qwen3-ASR issue tracker:
- Context leaking: Hotwords can appear verbatim in transcription output (QwenLM/Qwen3-ASR#106)
- Hotword repetition: With the 0.6B model in streaming mode, phonetically similar speech can cause the model to repeat the hotword list (QwenLM/Qwen3-ASR#140)
- Effectiveness is unpredictable: The Qwen maintainer describes context biasing effectiveness as "玄学" (metaphysical/unpredictable)
Verdict
Approve with minor changes. The implementation is solid, correctly threaded through all code paths, concurrency-safe, and follows existing codebase patterns.
Should fix:
- Update test to use realistic hotword example instead of natural language instruction (documents correct usage)
- Add a trailing `\n` after context to match the Python reference and official Qwen chat template
- Remove the redundant `systemContext` variable
Nice to have:
- Add test for empty context (default path)
- Wire up CLI `--context` to actually use this feature (could be a follow-up PR)
- Consider `String? = nil` for consistency with GraniteSpeech's `prompt: String?`
- Add a doc comment on the `context` parameter explaining it's for space-separated hotword biasing
Resolves #101