
[Model] Support Qwen3 models with enable_thinking field #686


Merged
merged 2 commits into mlc-ai:main on May 5, 2025

Conversation

@CharlieFRuan (Contributor) commented on May 4, 2025

Overview

  • This PR adds the following Qwen3 models to WebLLM's prebuilt models: Qwen3-0.6B (q0f16, q0f32, q4f16_1, q4f32_1) and Qwen3-1.7B/4B/8B (q4f16_1, q4f32_1).
  • In addition, we add an extra_body field with an extra_body.enable_thinking subfield to support switching between thinking and non-thinking mode. To prevent Qwen3 from thinking, use:
```typescript
let request = {
  messages: [
    {
      role: "user",
      content: "How many r's are there in the word strawberry?",
    },
  ],
  extra_body: {
    enable_thinking: false,
  },
};
```
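For illustration, a request like the one above can be built with a small helper and then passed to an initialized WebLLM engine. The helper `withThinkingDisabled` below is hypothetical, not part of this PR; only the `extra_body.enable_thinking` field itself comes from the PR:

```typescript
// Hypothetical helper (not part of this PR): builds a chat completion
// request with Qwen3 thinking disabled via the new extra_body field.
function withThinkingDisabled(messages: { role: string; content: string }[]) {
  return {
    messages,
    extra_body: {
      enable_thinking: false, // Qwen3 will skip its <think>...</think> block
    },
  };
}

const request = withThinkingDisabled([
  { role: "user", content: "How many r's are there in the word strawberry?" },
]);
// The request can then be passed to engine.chat.completions.create(request)
// on an initialized WebLLM MLCEngine.
```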

Internal notes

  • Internally, enable_thinking is achieved as follows:
    • Add an extra_body field (containing enable_thinking) to ChatCompletionRequest
    • Add an enable_thinking field to GenerationConfig, with engine.ts forwarding the value
    • In llm_chat.ts, during prefillStep(), when enable_thinking is false, we call conversation.appendEmptyThinkingReplyHeader() instead of the normal appendReplyHeader()
    • In conversation.ts, adjust getPromptArrayInternal() to support a reply header with an empty thinking block, tracked via the field isLastMessageEmptyThinkingReplyHeader
    • This is tested in tests/conversation.test.ts
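The empty-thinking reply header described above can be sketched as follows. This is a simplified illustration of the idea, not the actual conversation.ts code; the constant emptyThinkingBlockStr and the field name come from the PR description, while buildReplyHeader and its parameters are hypothetical:

```typescript
// Hardcoded empty thinking block, as described in this PR.
const emptyThinkingBlockStr = "<think>\n\n</think>\n\n";

// Simplified sketch: when the last message is an empty-thinking reply
// header, the assistant role string is followed by a pre-filled empty
// <think> block, so the model continues generating after the block
// instead of producing thinking tokens of its own.
function buildReplyHeader(
  assistantRole: string,
  roleContentSep: string,
  isLastMessageEmptyThinkingReplyHeader: boolean,
): string {
  const header = assistantRole + roleContentSep;
  return isLastMessageEmptyThinkingReplyHeader
    ? header + emptyThinkingBlockStr
    : header;
}
```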

Future work

  • Currently we hardcode const emptyThinkingBlockStr = "<think>\n\n</think>\n\n";. This should be configurable per model in the future, perhaps as part of ConvConfig.
  • Optimize multi-turn chat with Qwen3. Currently we strictly require all messages to match, but we could modify compareConversationObject() in engine.ts to tolerate a few mismatched trailing messages (in this case, messages with the thinking tokens stripped), so that longer conversations that have already stripped thinking tokens can still reuse the KV cache.
  • Perhaps we should separate the thinking tokens from the rest of the returned response, instead of asking users to parse them out on their own.
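Until such separation exists, callers must strip the thinking block themselves. Below is a hedged sketch of client-side parsing, assuming Qwen3 emits at most one leading <think>...</think> block; splitThinking is a hypothetical helper, not a WebLLM API:

```typescript
// Split a Qwen3 response into its thinking block and the final answer.
// Returns thinking as null when the model was run with
// extra_body.enable_thinking: false (or otherwise produced no <think> block).
function splitThinking(content: string): { thinking: string | null; answer: string } {
  const match = content.match(/^<think>([\s\S]*?)<\/think>\s*/);
  if (!match) {
    return { thinking: null, answer: content };
  }
  return { thinking: match[1].trim(), answer: content.slice(match[0].length) };
}
```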

@CharlieFRuan requested a review from Copilot on May 4, 2025 at 22:58

@Copilot (Copilot AI) left a comment


Pull Request Overview

This PR adds support for Qwen3 models by introducing a new enable_thinking field and related changes across the API protocols, conversation handling, configuration, tests, and examples.

  • New tests and constants for Qwen3 configuration are introduced.
  • The chat completion API and conversation methods now support an extra_body.enable_thinking flag.
  • Examples and documentation have been updated to demonstrate the new Qwen3 functionality.

Reviewed Changes

Copilot reviewed 12 out of 14 changed files in this pull request and generated no comments.

Summary per file:
  • tests/conversation.test.ts: Added tests to verify Qwen3-specific behavior with empty thinking blocks.
  • tests/constants.ts: Introduced new Qwen3 config JSON with enable_thinking support, though the conv_template name remains "qwen2".
  • src/openai_api_protocols/chat_completion.ts: Added extra_body field with enable_thinking flag.
  • src/llm_chat.ts: Updated message appending logic to conditionally disable thinking tokens.
  • src/engine.ts: Forwarded the enable_thinking flag from the extra_body field.
  • src/conversation.ts: Added methods for appending empty thinking headers and managing their lifecycle.
  • src/config.ts: Updated GenerationConfig and prebuiltAppConfig with Qwen3 models.
  • examples/simple-chat-ts/src/simple_chat.ts: Configured extra_body for Qwen3 models in the simple chat example.
  • examples/qwen3/src/qwen3_example.ts: Provided example usage of Qwen3 models with varying enable_thinking configurations.
  • examples/qwen3/src/qwen3_example.html: Updated HTML wrapper to load the new Qwen3 example.
  • examples/qwen3/README.md: Updated documentation with instructions for running Qwen3 demos.
Files not reviewed (2)
  • examples/qwen3/package.json: Language not supported
  • package.json: Language not supported
Comments suppressed due to low confidence (1)

tests/constants.ts:271

  • [nitpick] The conv_template name in the Qwen3 configuration is set to "qwen2", which may be confusing. Consider updating it to "qwen3" for consistency with the model type.
      "name": "qwen2",

@CharlieFRuan marked this pull request as ready for review on May 5, 2025 at 03:07
@CharlieFRuan merged commit 089bbd0 into mlc-ai:main on May 5, 2025
1 check passed
CharlieFRuan added a commit that referenced this pull request May 5, 2025
### Change
- The only change is #686, which:
  - Adds prebuilt models:
    - Qwen3-0.6B: `q0f16, q0f32, q4f16_1, q4f32_1`
    - Other Qwen3: `{1.7B, 4B, 8B} x {q4f16_1, q4f32_1}`
- Supports `extra_body: {enable_thinking: false}` for Qwen3 models to toggle thinking
    - See `examples/qwen3` for more on Qwen3 usage
- Also bumps the `web-tokenizers` package to `0.1.6` to resolve Rust-related issues


### TVMjs
- No change, version `0.18.0-dev2` just like 0.2.71
@CharlieFRuan mentioned this pull request on May 5, 2025
@CharlieFRuan (Contributor, Author) commented:

As a reference for using Qwen3: WebLLM Chat adds a thinking toggle button to the toolbar, allowing you to switch thinking on or off within the same multi-turn conversation.

[Screenshot: WebLLM Chat's thinking toggle button in the toolbar]
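Since enable_thinking is read from each ChatCompletionRequest, such a toggle can simply be applied per turn. A hypothetical sketch (buildTurnRequest and the toggle state are illustrative, not WebLLM Chat's actual code):

```typescript
// Hypothetical per-turn toggle: each request carries its own
// extra_body.enable_thinking, so thinking can be switched on or off
// mid-conversation without resetting the chat history.
function buildTurnRequest(
  history: { role: string; content: string }[],
  userInput: string,
  thinkingEnabled: boolean,
) {
  return {
    messages: [...history, { role: "user", content: userInput }],
    extra_body: { enable_thinking: thinkingEnabled },
  };
}
```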
