[Bug]: model_switch tool does not persist across turns; gateway/UI path ignores it entirely #6173

@NiuBlibing

Description

Affected component

runtime/daemon

Severity

S2 - degraded behavior

Current behavior

The model_switch tool advertises that it switches the active model "immediately for the current conversation"
(crates/zeroclaw-runtime/src/tools/model_switch.rs:26), but in practice the
switch is silently lost in both major entry paths.

Path A — channel orchestrator (process_channel_message → handle_message)

When the LLM calls model_switch, the tool sets a process-global
MODEL_SWITCH_REQUEST (crates/zeroclaw-runtime/src/agent/loop_.rs:91).
run_tool_call_loop picks it up and bubbles it out as an error;
the orchestrator handler at
crates/zeroclaw-channels/src/orchestrator/mod.rs:3189-3217 catches it,
builds a new provider, mutates the local route variable, clears the
global flag, and continues the current loop:

Ok(new_prov) => {
    active_provider = Arc::from(new_prov);
    route.provider = new_provider;          // local var only
    route.model = new_model;                // local var only
    clear_model_switch_request();
    // ❌ no set_route_selection(ctx, &history_key, route.clone())
    continue;
}

The change is never written back to ctx.route_overrides. The next inbound
message calls get_route_selection(ctx, &history_key)
(orchestrator/mod.rs:2636), reads the stale override (or the default), and
runs on the original provider/model. Compare with the /model slash command
handler (orchestrator/mod.rs:1791, 1830) which correctly calls
set_route_selection to persist.

The same handler also has two related defects:

  • it uses ctx.api_key (mod.rs:3191) — the startup global key — instead of
    the route-specific api_key from ctx.model_routes, so a switch into a
    provider that needs a different key fails with auth errors;
  • the freshly built provider is not written into provider_cache, so each
    switch rebuilds a provider instance.
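
To make the persistence gap concrete, here is a minimal, self-contained model of it. It is a sketch, not the orchestrator code: `Ctx`, `RouteSelection`, and `handle_message_with_switch` are hypothetical stand-ins, and the real `get_route_selection` / `set_route_selection` signatures may differ (this version takes `&mut Ctx` purely for simplicity).

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the orchestrator's per-conversation route.
#[derive(Clone, Debug, PartialEq)]
pub struct RouteSelection {
    pub provider: String,
    pub model: String,
}

/// Hypothetical stand-in for the channel context; only the override map is modelled.
pub struct Ctx {
    pub default_route: RouteSelection,
    pub route_overrides: HashMap<String, RouteSelection>,
}

/// Read the sticky override for a conversation, falling back to the default route.
pub fn get_route_selection(ctx: &Ctx, history_key: &str) -> RouteSelection {
    ctx.route_overrides
        .get(history_key)
        .cloned()
        .unwrap_or_else(|| ctx.default_route.clone())
}

/// Persist a route so later messages in the same conversation see it.
pub fn set_route_selection(ctx: &mut Ctx, history_key: &str, route: RouteSelection) {
    ctx.route_overrides.insert(history_key.to_string(), route);
}

/// Simulate one message in which the LLM switches model mid-loop.
/// With `persist == false` only the local `route` changes, as in the reported handler.
/// Returns the route that the rest of this turn runs on.
pub fn handle_message_with_switch(
    ctx: &mut Ctx,
    history_key: &str,
    new_provider: &str,
    new_model: &str,
    persist: bool,
) -> RouteSelection {
    let mut route = get_route_selection(ctx, history_key);
    route.provider = new_provider.to_string(); // local variable only
    route.model = new_model.to_string(); // local variable only
    if persist {
        // The write-back that the in-loop swap is missing.
        set_route_selection(ctx, history_key, route.clone());
    }
    route
}
```

With `persist` set to `false`, the turn itself runs on the new model, but a later `get_route_selection` for the same key still returns the default. With `persist` set to `true`, the next message sees the switched route.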

Path B — gateway / built-in daemon UI (/ws/chat, /webhook)

The gateway's WebSocket chat (crates/zeroclaw-gateway/src/ws.rs:166)
constructs a fresh Agent via Agent::from_config and runs
agent.turn_streamed, which calls self.provider.stream_chat directly
(crates/zeroclaw-runtime/src/agent/agent.rs:1122). It does not invoke
run_tool_call_loop, and grep MODEL_SWITCH agent/agent.rs returns zero
matches. ModelSwitchTool is registered in all_tools_with_runtime
(tools/mod.rs:368), so the LLM can call it and the global
MODEL_SWITCH_REQUEST gets set — but nothing consumes it on this path,
so the switch is a complete no-op for the duration of the WS connection.
/webhook (run_gateway_chat_simple, lib.rs:1374-1384) has the same
property: it calls state.provider.chat() directly with no tool loop.
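
The shape of the gateway problem can be reduced to a global request slot that is written by the tool but only read by one of the two loops. The sketch below is illustrative only: the static here is a hypothetical stand-in for the real MODEL_SWITCH_REQUEST (whose actual type is not shown in this issue), and `model_switch_tool`, `turn_with_tool_loop`, and `turn_direct` are invented names for the three roles.

```rust
use std::sync::Mutex;

/// Hypothetical stand-in for the process-global request slot: (provider, model).
pub static MODEL_SWITCH_REQUEST: Mutex<Option<(String, String)>> = Mutex::new(None);

/// What the tool does in this model: record the request and report success.
/// It has no way of knowing whether anything will consume the request.
pub fn model_switch_tool(provider: &str, model: &str) -> &'static str {
    *MODEL_SWITCH_REQUEST.lock().unwrap() = Some((provider.to_string(), model.to_string()));
    "Model switch requested"
}

/// A loop that checks the slot between iterations and applies the switch,
/// clearing the slot as it does so.
pub fn turn_with_tool_loop(active_model: &mut String) {
    let pending = MODEL_SWITCH_REQUEST.lock().unwrap().take();
    if let Some((_provider, model)) = pending {
        *active_model = model;
    }
}

/// A direct provider call that never looks at the slot, so a pending
/// request has no effect on which model serves the turn.
pub fn turn_direct(active_model: &str) -> String {
    format!("served by {active_model}")
}
```

In this model, calling `model_switch_tool` and then `turn_direct` leaves the request sitting in the slot and the original model in use, which matches what is observed on /ws/chat and /webhook.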

Expected behavior

model_switch should either persist the change for at least the rest of the
sender's conversation in every path where the tool is exposed, or — if the
intent really is "current turn only" — it should be removed from paths where
even that is impossible, and the tool description should reflect the actual
guarantee.

Preferred fix — eliminate the divergence at the source: implement the
built-in web UI's /ws/chat (and ideally /webhook plus the WhatsApp / Linq
/ Nextcloud Talk gateway endpoints) as a proper channel under
zeroclaw-channels, so every inbound message flows through
process_channel_message → handle_message → run_tool_call_loop like any
other channel. That single change makes model_switch, /model, sticky
route_overrides, classifier-based routing, the provider_cache, the config
mtime hot-reload, autosave-on-message, and all future channel-level features
work uniformly across CLI, Telegram, Discord, the daemon UI, and webhooks —
without each gateway entrypoint reinventing a parallel, partially-broken
agent loop. The current bifurcation (Agent::turn_streamed vs.
run_tool_call_loop) is the root cause of this bug class; surface fixes on
either side will keep diverging.

If the architectural unification above is out of scope for this fix, the
narrower per-path repairs are:

  • Channel path: after a successful in-loop swap, call
    set_route_selection(ctx, &history_key, route.clone()) so the new
    provider/model survives into the next message; resolve the per-route
    api_key from ctx.model_routes (mirroring the SetModel slash-command
    handler) instead of falling back to ctx.api_key; and seed the
    provider_cache with the new instance.
  • Gateway/UI path (only as a stopgap until the unification above): either
    (a) consume MODEL_SWITCH_REQUEST inside Agent::turn_streamed between
    iterations and rebuild self.provider, or (b) drop ModelSwitchTool from
    the gateway-built tool registry and surface a clear error to the LLM if it
    is invoked.
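
For the channel-path bullet, the key resolution and cache seeding could look roughly like the sketch below. Everything in it is an assumption made for illustration: `ModelRoute`, `Provider`, `resolve_api_key`, and `get_or_build_provider` are hypothetical, the cache is keyed by provider name only, and falling back to the global key when a route has no key of its own is a guess at the intended policy rather than what the SetModel handler actually does.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical shape of one entry in ctx.model_routes.
pub struct ModelRoute {
    pub provider: String,
    pub model: String,
    pub api_key: Option<String>,
}

/// Hypothetical provider handle; the real one would wrap an HTTP client.
pub struct Provider {
    pub name: String,
    pub api_key: Option<String>,
}

/// Prefer the key configured on the matching route; use the global key
/// only when the route does not define one (assumed policy).
pub fn resolve_api_key<'a>(
    routes: &'a [ModelRoute],
    global_key: Option<&'a str>,
    provider: &str,
    model: &str,
) -> Option<&'a str> {
    routes
        .iter()
        .find(|r| r.provider == provider && r.model == model)
        .and_then(|r| r.api_key.as_deref())
        .or(global_key)
}

/// Return the cached provider if present; otherwise build one and seed the
/// cache so that repeated switches reuse the same instance.
pub fn get_or_build_provider(
    cache: &mut HashMap<String, Arc<Provider>>,
    provider: &str,
    api_key: Option<&str>,
) -> Arc<Provider> {
    cache
        .entry(provider.to_string())
        .or_insert_with(|| {
            Arc::new(Provider {
                name: provider.to_string(),
                api_key: api_key.map(str::to_string),
            })
        })
        .clone()
}
```

A real cache would probably need a key that also distinguishes credentials or base URLs, so that two routes sharing a provider name but using different keys do not collide.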

In all cases, the tool description in tools/model_switch.rs:26 should state
the actual scope of the switch.

Steps to reproduce

# Channel path (e.g., Telegram, Discord, CLI):
# 1. Start daemon with default provider = anthropic, model = claude-sonnet-4-6.
# 2. Send a message that prompts the LLM to call:
#      model_switch { action: "set", provider: "openai", model: "gpt-4o" }
# 3. Observe that within the same turn, subsequent LLM calls go to gpt-4o.
# 4. Send a second user message in the same conversation.
# 5. Observe that the agent is back on anthropic/claude-sonnet-4-6.

# Gateway/UI path:
# 1. Start daemon, open the built-in web UI and connect to /ws/chat.
# 2. Prompt the model to call model_switch as above.
# 3. Observe: the tool returns "Model switch requested" but every subsequent
#    streamed turn is still served by the original provider/model. This holds
#    for the entire WS connection's lifetime.

Impact

Affected users: anyone relying on model_switch for runtime model swapping
(agents using cost-tier routing, fallback escalation, or "use a strong model
for this one task" patterns) — i.e. exactly the use case the tool advertises.

Frequency: always.

Consequence: the tool silently lies. On the channel path it appears to work
within a single turn but reverts on the next message; on the gateway/UI path
it is a complete no-op. Models that rely on the documented "switch takes
effect immediately" semantics will plan around guarantees that are not
actually delivered, leading to wrong-model responses, unexpected billing, and
hard-to-debug routing behaviour.

Logs / stack traces

N/A — this is a logic defect, not a crash. The misleading
{"message":"Model switch requested", ...} tool result in
tools/model_switch.rs:156-164 is itself the smoking gun: the tool reports
success even when nobody downstream will act on it.

ZeroClaw version

master at eebd7b634f91c37f7a976e03ba3f29d9b76a1ca9.

Operating system

Linux

Regression?

Unknown

Pre-flight checks

  • I reproduced this on the latest master branch or latest release.
  • I redacted secrets, tokens, and personal data from all submitted content.

Metadata

Labels

bug, domain:architecture, gateway, priority:p1, risk: high, runtime, tool

Status

Backlog