Affected component
runtime/daemon
Severity
S2 - degraded behavior
Current behavior
The model_switch tool advertises that it switches the active model "immediately for the current conversation"
(crates/zeroclaw-runtime/src/tools/model_switch.rs:26), but in practice the
switch is silently lost in both major entry paths.
Path A — channel orchestrator (process_channel_message → handle_message)
When the LLM calls model_switch, the tool sets a process-global
MODEL_SWITCH_REQUEST (crates/zeroclaw-runtime/src/agent/loop_.rs:91).
run_tool_call_loop picks it up and bubbles it out as an error;
the orchestrator handler at
crates/zeroclaw-channels/src/orchestrator/mod.rs:3189-3217 catches it,
builds a new provider, mutates the local route variable, clears the
global flag, and continues the current loop:
```rust
Ok(new_prov) => {
    active_provider = Arc::from(new_prov);
    route.provider = new_provider; // local var only
    route.model = new_model;       // local var only
    clear_model_switch_request();
    // ❌ no set_route_selection(ctx, &history_key, route.clone())
    continue;
}
```
The change is never written back to ctx.route_overrides. The next inbound
message calls get_route_selection(ctx, &history_key)
(orchestrator/mod.rs:2636), reads the stale override (or the default), and
runs on the original provider/model. Compare with the /model slash command
handler (orchestrator/mod.rs:1791, 1830) which correctly calls
set_route_selection to persist.
The same handler also has two related defects:
- it uses ctx.api_key (mod.rs:3191) — the startup global key — instead of
  the route-specific api_key from ctx.model_routes, so a switch into a
  provider that needs a different key fails with auth errors;
- the freshly built provider is not written into provider_cache, so each
  switch rebuilds a provider instance.
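The write-back defect reduces to a few lines. The sketch below is a toy model only — RouteSelection, the override map, and the history key are simplified stand-ins for the real ctx.route_overrides types, not the actual zeroclaw-channels code:

```rust
use std::collections::HashMap;

// Toy stand-ins for the orchestrator's per-conversation route state;
// the real ctx.route_overrides and RouteSelection types are richer.
#[derive(Clone, Debug, PartialEq)]
struct RouteSelection {
    provider: String,
    model: String,
}

fn default_route() -> RouteSelection {
    RouteSelection {
        provider: "anthropic".into(),
        model: "claude-sonnet-4-6".into(),
    }
}

// Simulates two turns: turn one mutates only its local copy of the
// route (as the handler does today), turn two re-reads the override
// map. Returns the provider seen on turn two.
fn turn_two_provider() -> String {
    let route_overrides: HashMap<String, RouteSelection> = HashMap::new();
    let history_key = "telegram:12345";

    // Turn 1: read the route, then mutate only the local copy.
    let mut route = route_overrides
        .get(history_key)
        .cloned()
        .unwrap_or_else(default_route);
    route.provider = "openai".into();
    route.model = "gpt-4o".into();
    // ❌ missing write-back:
    // route_overrides.insert(history_key.to_string(), route);

    // Turn 2: the next inbound message resolves the route from scratch
    // and sees the stale default.
    route_overrides
        .get(history_key)
        .cloned()
        .unwrap_or_else(default_route)
        .provider
}

fn main() {
    println!("turn 2 runs on: {}", turn_two_provider());
}
```

The local mutation is invisible to the next resolution; only an insert into the shared map would survive.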
Path B — gateway / built-in daemon UI (/ws/chat, /webhook)
The gateway's WebSocket chat (crates/zeroclaw-gateway/src/ws.rs:166)
constructs a fresh Agent via Agent::from_config and runs
agent.turn_streamed, which calls self.provider.stream_chat directly
(crates/zeroclaw-runtime/src/agent/agent.rs:1122). It does not invoke
run_tool_call_loop, and a grep for MODEL_SWITCH in agent/agent.rs returns no
matches. ModelSwitchTool is registered in all_tools_with_runtime
(tools/mod.rs:368), so the LLM can call it and the global
MODEL_SWITCH_REQUEST gets set — but nothing consumes it on this path,
so the switch is a complete no-op for the duration of the WS connection.
/webhook (run_gateway_chat_simple, lib.rs:1374-1384) has the same
property: it calls state.provider.chat() directly with no tool loop.
Expected behavior
model_switch should either persist the change for at least the rest of the
sender's conversation in every path where the tool is exposed, or — if the
intent really is "current turn only" — it should be removed from paths where
even that is impossible, and the tool description should reflect the actual
guarantee.
Preferred fix — eliminate the divergence at the source: implement the
built-in web UI's /ws/chat (and ideally /webhook plus the WhatsApp / Linq
/ Nextcloud Talk gateway endpoints) as a proper channel under
zeroclaw-channels, so every inbound message flows through
process_channel_message → handle_message → run_tool_call_loop like any
other channel. That single change makes model_switch, /model, sticky
route_overrides, classifier-based routing, the provider_cache, the config
mtime hot-reload, autosave-on-message, and all future channel-level features
work uniformly across CLI, Telegram, Discord, the daemon UI, and webhooks —
without each gateway entrypoint reinventing a parallel, partially-broken
agent loop. The current bifurcation (Agent::turn_streamed vs.
run_tool_call_loop) is the root cause of this bug class; surface fixes on
either side will keep diverging.
If the architectural unification above is out of scope for this fix, the
narrower per-path repairs are:
- Channel path: after a successful in-loop swap, call
set_route_selection(ctx, &history_key, route.clone()) so the new
provider/model survives into the next message; resolve the per-route
api_key from ctx.model_routes (mirroring the SetModel slash-command
handler) instead of falling back to ctx.api_key; and seed the
provider_cache with the new instance.
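The channel-path repair can be sketched against a toy model. Everything below is illustrative — Ctx, ModelRoute, and the helper bodies are assumptions mirroring the intent of the real set_route_selection and ctx.model_routes, not the actual signatures:

```rust
use std::collections::HashMap;

// Illustrative stand-ins only; the real Ctx, ModelRoute, and
// RouteSelection types in zeroclaw-channels are richer than this.
#[derive(Clone, Debug, PartialEq)]
struct RouteSelection { provider: String, model: String }

struct ModelRoute { provider: String, api_key: Option<String> }

struct Ctx {
    api_key: Option<String>,       // startup global key
    model_routes: Vec<ModelRoute>, // per-route config
    route_overrides: HashMap<String, RouteSelection>,
}

// Mirrors the intent of the real set_route_selection helper.
fn set_route_selection(ctx: &mut Ctx, history_key: &str, route: RouteSelection) {
    ctx.route_overrides.insert(history_key.to_string(), route);
}

// Prefer the route-specific key; fall back to the global one.
fn route_api_key(ctx: &Ctx, provider: &str) -> Option<String> {
    ctx.model_routes
        .iter()
        .find(|r| r.provider == provider)
        .and_then(|r| r.api_key.clone())
        .or_else(|| ctx.api_key.clone())
}

// Returns (provider persisted for the next message, key used to build it).
fn demo() -> (String, Option<String>) {
    let mut ctx = Ctx {
        api_key: Some("sk-global".into()),
        model_routes: vec![ModelRoute {
            provider: "openai".into(),
            api_key: Some("sk-openai-route".into()),
        }],
        route_overrides: HashMap::new(),
    };
    let history_key = "telegram:12345";

    // In-loop swap succeeded; build the new route...
    let route = RouteSelection { provider: "openai".into(), model: "gpt-4o".into() };
    let key = route_api_key(&ctx, &route.provider);
    // ...and persist it so the next inbound message sees it.
    set_route_selection(&mut ctx, history_key, route);

    (ctx.route_overrides[history_key].provider.clone(), key)
}

fn main() {
    let (provider, key) = demo();
    println!("persisted {provider}, key = {key:?}");
}
```

The two fixes compose: the persisted override makes the switch sticky, and the per-route key lookup keeps the rebuilt provider from authenticating with the wrong credential.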
- Gateway/UI path (only as a stopgap until the unification above): either
(a) consume MODEL_SWITCH_REQUEST inside Agent::turn_streamed between
iterations and rebuild self.provider, or (b) drop ModelSwitchTool from
the gateway-built tool registry and surface a clear error to the LLM if it
is invoked.
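Option (a) can be sketched with a toy global slot. The shape of the real MODEL_SWITCH_REQUEST (agent/loop_.rs:91) is assumed here, and the real fix would rebuild a provider instance rather than swap strings:

```rust
use std::sync::Mutex;

// Toy model of the process-global request slot; names and shapes
// are illustrative, not the real zeroclaw-runtime definitions.
static MODEL_SWITCH_REQUEST: Mutex<Option<(String, String)>> = Mutex::new(None);

fn take_model_switch_request() -> Option<(String, String)> {
    MODEL_SWITCH_REQUEST.lock().unwrap().take()
}

struct Agent { provider: String, model: String }

impl Agent {
    // Between streamed iterations, drain the global flag and swap the
    // provider before the next stream_chat call.
    fn maybe_apply_switch(&mut self) {
        if let Some((provider, model)) = take_model_switch_request() {
            self.provider = provider;
            self.model = model;
        }
    }
}

fn demo() -> (String, String) {
    let mut agent = Agent {
        provider: "anthropic".into(),
        model: "claude-sonnet-4-6".into(),
    };
    // The tool sets the request mid-turn...
    *MODEL_SWITCH_REQUEST.lock().unwrap() =
        Some(("openai".into(), "gpt-4o".into()));
    // ...and turn_streamed would drain it before its next iteration.
    agent.maybe_apply_switch();
    (agent.provider, agent.model)
}

fn main() {
    let (p, m) = demo();
    println!("next iteration uses {p}/{m}");
}
```

Even as a stopgap, draining the flag with take() matters: it keeps one WS connection's switch from leaking into unrelated sessions that share the process-global slot.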
In all cases, the tool description in tools/model_switch.rs:26 should state
the actual scope of the switch.
Steps to reproduce
```shell
# Channel path (e.g., Telegram, Discord, CLI):
# 1. Start daemon with default provider = anthropic, model = claude-sonnet-4-6.
# 2. Send a message that prompts the LLM to call:
#      model_switch { action: "set", provider: "openai", model: "gpt-4o" }
# 3. Observe that within the same turn, subsequent LLM calls go to gpt-4o.
# 4. Send a second user message in the same conversation.
# 5. Observe that the agent is back on anthropic/claude-sonnet-4-6.

# Gateway/UI path:
# 1. Start daemon, open the built-in web UI and connect to /ws/chat.
# 2. Prompt the model to call model_switch as above.
# 3. Observe: the tool returns "Model switch requested" but every subsequent
#    streamed turn is still served by the original provider/model. This holds
#    for the entire WS connection's lifetime.
```
Impact
Affected users: anyone relying on model_switch for runtime model swapping
(agents using cost-tier routing, fallback escalation, or "use a strong model
for this one task" patterns) — i.e. exactly the use case the tool advertises.
Frequency: always.
Consequence: the tool silently lies. On the channel path it appears to work
within a single turn but reverts on the next message; on the gateway/UI path
it is a complete no-op. Models that rely on the documented "switch takes
effect immediately" semantics will plan around guarantees that are not
actually delivered, leading to wrong-model responses, unexpected billing, and
hard-to-debug routing behaviour.
Logs / stack traces
N/A — this is a logic defect, not a crash. The misleading
{"message":"Model switch requested", ...} tool result in
tools/model_switch.rs:156-164 is itself the smoking gun: the tool reports
success even when nobody downstream will act on it.
ZeroClaw version
master at eebd7b634f91c37f7a976e03ba3f29d9b76a1ca9.
Operating system
Linux
Regression?
Unknown
Pre-flight checks