feat: add health monitoring for MCP clients #1168
Conversation
This stack of pull requests is managed by Graphite.
Caution: Review failed. The pull request is closed.
📝 Walkthrough
Summary by CodeRabbit
Release Notes
Walkthrough
Adds a per-client health monitoring subsystem, exposes explicit MCP client connection state, integrates monitoring into the MCP manager lifecycle and client connection logic, and renames/refactors the tools handler to a tools manager. Also removes local state derivation in client aggregation.
Changes
Sequence Diagram(s)
sequenceDiagram
participant Client as MCP Client
participant Mgr as MCPManager
participant HealthMgr as HealthMonitorManager
participant Monitor as ClientHealthMonitor
rect rgb(240,248,255)
Note over Client,Mgr: Connection flow
Client->>Mgr: connectToMCPClient()
Mgr->>Client: establish transport
Mgr->>Client: set client.State = Connected
Mgr->>HealthMgr: NewClientHealthMonitor(clientID)
HealthMgr->>Monitor: StartMonitoring(monitor)
Monitor->>Monitor: Start() (goroutine)
end
rect rgb(240,255,240)
Note over Monitor,Client: Periodic health checks
loop every interval
Monitor->>Client: ping (with timeout)
alt success
Client-->>Monitor: pong
Monitor->>Mgr: reset failure counter
else failure
Monitor->>Monitor: increment failures
alt failures >= max
Monitor->>Mgr: update client.State = Disconnected
end
end
end
end
rect rgb(255,240,240)
Note over Mgr,HealthMgr: Teardown
Mgr->>HealthMgr: StopAll()
HealthMgr->>Monitor: StopMonitoring(clientID)
Monitor->>Monitor: Stop() (cancel context)
end
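For illustration, a minimal sketch of the loop the diagram describes: tick on an interval, ping with a timeout, reset the failure counter on success, and mark the client disconnected after a run of failures. The type, field, and callback names below are assumptions drawn from the diagram, not the PR's actual API.

```go
package health

import (
	"context"
	"time"
)

// Sketch only: fields and callbacks are stand-ins for the real implementation.
type clientHealthMonitor struct {
	interval    time.Duration
	pingTimeout time.Duration
	maxFailures int
	ping        func(ctx context.Context) error // pings the MCP client
	onUnhealthy func()                          // marks the client disconnected
}

func (m *clientHealthMonitor) monitorLoop(ctx context.Context) {
	ticker := time.NewTicker(m.interval)
	defer ticker.Stop()

	failures := 0
	for {
		select {
		case <-ctx.Done():
			return // stopped via context cancellation
		case <-ticker.C:
			pingCtx, cancel := context.WithTimeout(ctx, m.pingTimeout)
			err := m.ping(pingCtx)
			cancel()

			if err == nil {
				failures = 0 // healthy: reset the failure counter
				continue
			}
			failures++
			if failures >= m.maxFailures {
				m.onUnhealthy() // e.g. set client.State = Disconnected
			}
		}
	}
}
```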
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks and finishing touches
✅ Passed checks (3 passed)
📜 Recent review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
Actionable comments posted: 1
🧹 Nitpick comments (1)
core/mcp/health_monitor.go (1)
56-70: Consider linking monitor context to manager context.
The monitor creates a context from context.Background() (line 65). This means the health monitor goroutine won't automatically stop when the MCPManager's context is cancelled. While StopAll() handles this in Cleanup(), consider deriving from chm.manager.ctx for defense-in-depth.
Proposed change to derive from manager context:

func (chm *ClientHealthMonitor) Start() {
    chm.mu.Lock()
    defer chm.mu.Unlock()
    if chm.isMonitoring {
        return // Already monitoring
    }
    chm.isMonitoring = true
-   chm.ctx, chm.cancel = context.WithCancel(context.Background())
+   chm.ctx, chm.cancel = context.WithCancel(chm.manager.ctx)
    chm.ticker = time.NewTicker(chm.interval)
    go chm.monitorLoop()
    logger.Debug(fmt.Sprintf("%s Health monitor started for client %s (interval: %v)", MCPLogPrefix, chm.clientID, chm.interval))
}
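For background on why deriving from the manager's context helps: a context created with context.WithCancel(parent) is cancelled automatically when the parent is cancelled, so any goroutine selecting on it winds down with the manager. A minimal standalone illustration (not code from this PR):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	parent, cancelParent := context.WithCancel(context.Background())
	child, cancelChild := context.WithCancel(parent) // derived from parent
	defer cancelChild()

	go func() {
		<-child.Done() // fires when either child or parent is cancelled
		fmt.Println("monitor goroutine exiting:", child.Err())
	}()

	cancelParent() // cancelling the parent propagates to the child
	time.Sleep(50 * time.Millisecond)
}
```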
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- core/bifrost.go
- core/mcp/clientmanager.go
- core/mcp/health_monitor.go
- core/mcp/mcp.go
- core/schemas/mcp.go
- transports/bifrost-http/handlers/mcp.go
🧰 Additional context used
📓 Path-based instructions (1)
**
⚙️ CodeRabbit configuration file
always check the stack if there is one for the current PR. do not give localized reviews for the PR, always see all changes in the light of the whole stack of PRs (if there is a stack, if there is no stack you can continue to make localized suggestions/reviews)
Files:
- core/schemas/mcp.go
- transports/bifrost-http/handlers/mcp.go
- core/mcp/clientmanager.go
- core/bifrost.go
- core/mcp/health_monitor.go
- core/mcp/mcp.go
🧠 Learnings (2)
📚 Learning: 2025-12-09T17:07:42.007Z
Learnt from: qwerty-dvorak
Repo: maximhq/bifrost PR: 1006
File: core/schemas/account.go:9-18
Timestamp: 2025-12-09T17:07:42.007Z
Learning: In core/schemas/account.go, the HuggingFaceKeyConfig field within the Key struct is currently unused and reserved for future Hugging Face inference endpoint deployments. Do not flag this field as missing from OpenAPI documentation or require its presence in the API spec until the feature is actively implemented and used. When the feature is added, update the OpenAPI docs accordingly; otherwise, treat this field as non-breaking and not part of the current API surface.
Applied to files:
- core/schemas/mcp.go
- transports/bifrost-http/handlers/mcp.go
- core/mcp/clientmanager.go
- core/bifrost.go
- core/mcp/health_monitor.go
- core/mcp/mcp.go
📚 Learning: 2025-12-12T08:25:02.629Z
Learnt from: Pratham-Mishra04
Repo: maximhq/bifrost PR: 1000
File: transports/bifrost-http/integrations/router.go:709-712
Timestamp: 2025-12-12T08:25:02.629Z
Learning: In transports/bifrost-http/**/*.go, update streaming response handling to align with OpenAI Responses API: use typed SSE events such as response.created, response.output_text.delta, response.done, etc., and do not rely on the legacy data: [DONE] termination marker. Note that data: [DONE] is only used by the older Chat Completions and Text Completions streaming APIs. Ensure parsers, writers, and tests distinguish SSE events from the [DONE] sentinel and handle each event type accordingly for correct stream termination and progress updates.
Applied to files:
transports/bifrost-http/handlers/mcp.go
🧬 Code graph analysis (4)
core/schemas/mcp.go (1)
  ui/lib/types/mcp.ts (1)
    MCPConnectionState (5-5)
core/mcp/clientmanager.go (3)
  core/mcp.go (1)
    MCPLogPrefix (31-31)
  core/schemas/mcp.go (3)
    MCPConnectionStateConnected (89-89)
    MCPConnectionTypeSSE (75-75)
    MCPConnectionStateDisconnected (90-90)
  core/mcp/health_monitor.go (2)
    NewClientHealthMonitor (35-53)
    DefaultHealthCheckInterval (14-14)
core/mcp/health_monitor.go (3)
  core/mcp/mcp.go (2)
    MCPManager (41-49)
    MCPLogPrefix (23-23)
  transports/bifrost-http/handlers/mcp.go (1)
    MCPManager (20-24)
  core/schemas/mcp.go (3)
    MCPConnectionStateDisconnected (90-90)
    MCPConnectionStateConnected (89-89)
    MCPConnectionState (86-86)
core/mcp/mcp.go (4)
  core/mcp/toolmanager.go (2)
    ToolsManager (22-34)
    NewToolsManager (53-82)
  core/schemas/mcp.go (1)
    MCPClientState (96-104)
  core/mcp/health_monitor.go (2)
    HealthMonitorManager (183-186)
    NewHealthMonitorManager (189-193)
  core/mcp.go (1)
    MCPManager (49-56)
⏰ Context from checks skipped due to timeout of 900000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
- GitHub Check: Graphite / mergeability_check
- GitHub Check: Graphite / mergeability_check
- GitHub Check: Graphite / mergeability_check
- GitHub Check: Graphite / mergeability_check
- GitHub Check: Graphite / mergeability_check
- GitHub Check: Graphite / mergeability_check
- GitHub Check: Graphite / mergeability_check
- GitHub Check: Graphite / mergeability_check
🔇 Additional comments (13)
core/schemas/mcp.go (1)
103-103: LGTM - State field addition is well-designed.
The State field is correctly added without a JSON tag since MCPClientState is an internal struct. The public MCPClient struct (line 119) properly exposes state via json:"state". This separation maintains clean internal vs. external API boundaries.
transports/bifrost-http/handlers/mcp.go (1)
168-179: LGTM - State now sourced from centralized health monitoring.
The change correctly uses connectedClient.State from the health-monitored MCPClientState instead of deriving state locally. The fallback for clients in config but not in Bifrost (failed initialization) is properly marked as MCPConnectionStateError.
core/bifrost.go (1)
1803-1807: LGTM - State now uses centralized tracking.
The change correctly sources client.State from the health-monitored MCPClientState instead of deriving state locally based on connection status. This aligns with the new health monitoring subsystem.
core/mcp/clientmanager.go (4)
140-143: Improved cleanup with health monitoring integration.
Good integration - health monitoring is properly stopped before closing transport connections, ensuring clean resource cleanup.
406-406: State correctly set to connected after successful initialization.
Setting the state to MCPConnectionStateConnected at this point is appropriate since the client has been initialized and tools have been retrieved.
447-449: Health monitor started correctly after connection setup.
The monitor is created and started at the appropriate time - after the client is fully connected and registered. This ensures health checks can properly access the client state.
434-444: No deadlock risk from this pattern.
The OnConnectionLost() call registers an asynchronous callback; the callback function body executes later in the SSE transport goroutine after the lock is released. The brief lock acquisition in the callback to update client state is safe. Consider moving the callback registration outside the lock scope (after line 399) for consistency with your existing defensive programming patterns, though the current implementation is functionally safe.
core/mcp/mcp.go (2)
41-49: Well-structured MCPManager with health monitoring integration.
The struct cleanly separates concerns with toolsManager for tool handling and healthMonitorManager for client health tracking. The health monitor manager is correctly initialized in the constructor before any client connections are made.
261-264: Correct cleanup order - stop monitors before acquiring lock.
healthMonitorManager.StopAll() is correctly called before acquiring m.mu.Lock(). This prevents potential deadlocks since health monitors acquire m.mu during their checks.
core/mcp/health_monitor.go (4)
12-17: Reasonable default health check configuration.
The 10-second interval, 5-second timeout, and 5 consecutive failures before marking unhealthy provide a good balance between responsiveness and avoiding false positives from transient network issues.
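For reference, the defaults discussed above would look roughly like this. DefaultHealthCheckInterval is the only constant name cited elsewhere in this review; the other two names and the exact layout are assumptions.

```go
const (
	// DefaultHealthCheckInterval is how often each MCP client is pinged.
	DefaultHealthCheckInterval = 10 * time.Second
	// DefaultHealthCheckTimeout bounds a single ping attempt.
	DefaultHealthCheckTimeout = 5 * time.Second
	// DefaultMaxConsecutiveFailures is how many pings must fail in a row
	// before the client is marked unhealthy.
	DefaultMaxConsecutiveFailures = 5
)
```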
175-180: Use RLock for read-only access.
getConsecutiveFailures() only reads the counter, so using RLock()/RUnlock() instead of Lock()/Unlock() would allow concurrent reads. However, sync.Mutex doesn't have RLock(); to get read-write semantics you'd need to change chm.mu to sync.RWMutex. Given the low frequency of access, the current implementation is acceptable.
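If read-mostly access ever becomes a concern, the alternative alluded to above is sync.RWMutex, which does provide shared read locking. A minimal generic sketch of that alternative (not the current implementation):

```go
type failureCounter struct {
	mu       sync.RWMutex
	failures int
}

func (c *failureCounter) increment() {
	c.mu.Lock() // exclusive lock for writes
	defer c.mu.Unlock()
	c.failures++
}

func (c *failureCounter) get() int {
	c.mu.RLock() // shared lock: concurrent readers don't block each other
	defer c.mu.RUnlock()
	return c.failures
}
```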
182-193: HealthMonitorManager provides clean lifecycle management.
The manager correctly maintains a registry of monitors and provides proper start/stop semantics. StartMonitoring handles the case of replacing an existing monitor gracefully.
220-229: StopAll correctly cleans up all monitors.
The implementation properly stops all monitors and clears the map. This is called from MCPManager.Cleanup() before acquiring other locks, which prevents deadlocks.
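A sketch of the StopAll shape described above; the registry field names are assumptions consistent with the review text:

```go
func (hmm *HealthMonitorManager) StopAll() {
	hmm.mu.Lock()
	defer hmm.mu.Unlock()

	for _, monitor := range hmm.monitors {
		monitor.Stop() // only takes the monitor's own mutex, so no lock cycle
	}
	// Clear the registry so the manager can be reused after a full stop.
	hmm.monitors = make(map[string]*ClientHealthMonitor)
}
```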
Actionable comments posted: 0
♻️ Duplicate comments (1)
core/mcp/health_monitor.go (1)
141-161: Race condition fix verified.
The previous race condition (identified in past review) has been addressed. The method now holds chm.manager.mu.Lock() throughout the read-compare-write sequence and logs after releasing the lock. This is correct.
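The pattern being verified looks roughly like this: the whole read-compare-write runs under the manager lock, and logging happens only after the lock is released. The map name (clientMap) and message text are assumptions based on the review, not the actual source.

```go
func (chm *ClientHealthMonitor) updateClientState(newState schemas.MCPConnectionState) {
	chm.manager.mu.Lock()
	client, exists := chm.manager.clientMap[chm.clientID]
	if !exists || client.State == newState {
		chm.manager.mu.Unlock()
		return // nothing to do, or the client was removed
	}
	client.State = newState
	chm.manager.mu.Unlock()

	// Log outside the critical section to keep lock hold times short.
	logger.Debug(fmt.Sprintf("%s Client %s is now %s", MCPLogPrefix, chm.clientID, newState))
}
```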
🧹 Nitpick comments (1)
core/mcp/health_monitor.go (1)
56-70: Consider using the manager's context as a parent.
The Start method creates a new context with context.Background(). This means the health monitor won't respect the manager's context cancellation (e.g., during shutdown). Consider deriving from the manager's context for proper lifecycle management.
Proposed fix:

func (chm *ClientHealthMonitor) Start() {
    chm.mu.Lock()
    defer chm.mu.Unlock()
    if chm.isMonitoring {
        return // Already monitoring
    }
    chm.isMonitoring = true
-   chm.ctx, chm.cancel = context.WithCancel(context.Background())
+   chm.ctx, chm.cancel = context.WithCancel(chm.manager.ctx)
    chm.ticker = time.NewTicker(chm.interval)
    go chm.monitorLoop()
    logger.Debug(fmt.Sprintf("%s Health monitor started for client %s (interval: %v)", MCPLogPrefix, chm.clientID, chm.interval))
}
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- core/bifrost.go
- core/mcp/clientmanager.go
- core/mcp/health_monitor.go
- core/mcp/mcp.go
- core/schemas/mcp.go
- transports/bifrost-http/handlers/mcp.go
🚧 Files skipped from review as they are similar to previous changes (2)
- core/schemas/mcp.go
- transports/bifrost-http/handlers/mcp.go
🧰 Additional context used
📓 Path-based instructions (1)
**
⚙️ CodeRabbit configuration file
always check the stack if there is one for the current PR. do not give localized reviews for the PR, always see all changes in the light of the whole stack of PRs (if there is a stack, if there is no stack you can continue to make localized suggestions/reviews)
Files:
- core/mcp/clientmanager.go
- core/mcp/mcp.go
- core/bifrost.go
- core/mcp/health_monitor.go
🧠 Learnings (1)
📚 Learning: 2025-12-09T17:07:42.007Z
Learnt from: qwerty-dvorak
Repo: maximhq/bifrost PR: 1006
File: core/schemas/account.go:9-18
Timestamp: 2025-12-09T17:07:42.007Z
Learning: In core/schemas/account.go, the HuggingFaceKeyConfig field within the Key struct is currently unused and reserved for future Hugging Face inference endpoint deployments. Do not flag this field as missing from OpenAPI documentation or require its presence in the API spec until the feature is actively implemented and used. When the feature is added, update the OpenAPI docs accordingly; otherwise, treat this field as non-breaking and not part of the current API surface.
Applied to files:
- core/mcp/clientmanager.go
- core/mcp/mcp.go
- core/bifrost.go
- core/mcp/health_monitor.go
🧬 Code graph analysis (3)
core/mcp/clientmanager.go (4)
  core/mcp/mcp.go (1)
    MCPLogPrefix (23-23)
  core/mcp.go (1)
    MCPLogPrefix (31-31)
  core/schemas/mcp.go (3)
    MCPConnectionStateConnected (89-89)
    MCPConnectionTypeSSE (75-75)
    MCPConnectionStateDisconnected (90-90)
  core/mcp/health_monitor.go (2)
    NewClientHealthMonitor (35-53)
    DefaultHealthCheckInterval (14-14)
core/mcp/mcp.go (2)
  core/mcp/toolmanager.go (2)
    ToolsManager (22-34)
    NewToolsManager (53-82)
  core/mcp/health_monitor.go (2)
    HealthMonitorManager (185-188)
    NewHealthMonitorManager (191-195)
core/mcp/health_monitor.go (2)
  core/mcp/mcp.go (2)
    MCPManager (41-49)
    MCPLogPrefix (23-23)
  core/schemas/mcp.go (3)
    MCPConnectionStateDisconnected (90-90)
    MCPConnectionStateConnected (89-89)
    MCPConnectionState (86-86)
⏰ Context from checks skipped due to timeout of 900000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Graphite / mergeability_check
- GitHub Check: Graphite / mergeability_check
- GitHub Check: Graphite / mergeability_check
- GitHub Check: Graphite / mergeability_check
- GitHub Check: Graphite / mergeability_check
🔇 Additional comments (20)
core/mcp/health_monitor.go (8)
12-17: LGTM! Reasonable health check defaults.
The constants define sensible default values for health monitoring: 10-second intervals, 5-second timeout, and 5 consecutive failures before marking unhealthy.
19-53: LGTM! Well-structured health monitor with proper initialization.
The ClientHealthMonitor struct correctly encapsulates per-client health tracking with its own mutex for failure counting, and the constructor properly applies the default interval when zero is passed.
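The zero-interval fallback mentioned above typically looks like this. Only NewClientHealthMonitor and DefaultHealthCheckInterval are names confirmed elsewhere in this review; the parameter list and struct fields are assumptions.

```go
func NewClientHealthMonitor(manager *MCPManager, clientID string, interval time.Duration) *ClientHealthMonitor {
	if interval <= 0 {
		// Callers passing zero get the default cadence.
		interval = DefaultHealthCheckInterval
	}
	return &ClientHealthMonitor{
		manager:  manager,
		clientID: clientID,
		interval: interval,
	}
}
```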
72-89: LGTM! Clean stop implementation.
The Stop method correctly guards against double-stop, cleans up the ticker and cancels the context.
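A sketch of the double-stop guard described above, using the isMonitoring, ticker, and cancel fields that appear in the Start() proposal earlier in this review (the exact body is an assumption):

```go
func (chm *ClientHealthMonitor) Stop() {
	chm.mu.Lock()
	defer chm.mu.Unlock()

	if !chm.isMonitoring {
		return // already stopped; safe to call multiple times
	}
	chm.isMonitoring = false
	chm.ticker.Stop()
	chm.cancel() // signals monitorLoop to exit via ctx.Done()
}
```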
91-101: LGTM! Standard select-based monitor loop.
The loop correctly listens for both context cancellation and ticker events.
104-139: Potential self-call to Stop() while holding no lock is safe, but Stop() can be called multiple times.
In performHealthCheck, when the client no longer exists (line 111), chm.Stop() is called. This is safe because Stop() has its own lock and guards against double-stopping. The health check logic correctly handles nil connections and ping failures with appropriate state transitions.
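The self-stop path being discussed is roughly the following. The map lookup, Conn/Ping calls, and the handleFailure/resetFailures helpers are assumptions used only to illustrate the branches the review describes.

```go
func (chm *ClientHealthMonitor) performHealthCheck() {
	chm.manager.mu.Lock()
	client, exists := chm.manager.clientMap[chm.clientID]
	chm.manager.mu.Unlock()

	if !exists {
		chm.Stop() // client was removed; Stop is idempotent, so this is safe
		return
	}
	if client.Conn == nil {
		chm.handleFailure() // a nil connection counts as a failed check
		return
	}

	ctx, cancel := context.WithTimeout(chm.ctx, DefaultHealthCheckTimeout)
	defer cancel()
	if err := client.Conn.Ping(ctx); err != nil {
		chm.handleFailure()
		return
	}
	chm.resetFailures()
}
```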
163-182: LGTM! Thread-safe failure counter operations.
The increment, reset, and get methods for consecutive failures correctly use the monitor's mutex for synchronization.
222-231: LGTM! StopAll correctly iterates and cleans up.
The method stops all monitors and reinitializes the map while holding the lock.
197-209: No deadlock risk in StartMonitoring().
Stop() only acquires chm.mu (the ClientHealthMonitor's own mutex) and does not call updateClientState() or interact with the manager's lock. Since Stop() cannot trigger lock acquisition on chm.manager.mu (which is hmm.mu), there is no circular lock dependency. The design is safe.
core/bifrost.go (1)
1803-1811: LGTM! Correct state propagation from health-monitored client state.
The change from computing state locally to using client.State directly is correct. The health monitoring subsystem now manages the State field, and this ensures the public API returns the actual monitored state. This is a key integration point for the new health monitoring feature.
core/mcp/clientmanager.go (4)
140-143: LGTM! Health monitor cleanup on client removal.
Correctly stops health monitoring before proceeding with other cleanup tasks. The log message update to reference "MCP server" instead of "client" improves clarity.
154-156: LGTM! Log message consistency.
Log message correctly references "MCP server" for the connection being closed.
402-418: LGTM! Initial connected state properly set.
Setting client.State = schemas.MCPConnectionStateConnected after successful connection establishment is correct. This provides the initial state that the health monitor will subsequently track.
434-450: SSE connection-lost callback and health monitor initialization look correct.
The OnConnectionLost hook properly:
- Logs the connection loss with the error
- Acquires the lock before updating state
- Updates state to disconnected only if client still exists
The health monitor is started after connection setup, which is the correct ordering.
One observation: the health monitor is started even after registering the OnConnectionLost hook. Both mechanisms can update the state to "disconnected" - the health monitor via failed pings, and the callback via transport-level detection. This is actually good for redundancy (health monitor catches silent failures, callback catches immediate disconnects).
core/mcp/mcp.go (7)
41-49: LGTM! MCPManager struct correctly updated.
The addition of the healthMonitorManager field and renaming toolsHandler to toolsManager is clean. The struct maintains proper organization with related fields grouped together.
51-53: LGTM! Generic tool function type is well-defined.
The MCPToolFunction[T any] type provides a clean abstraction for typed tool handlers.
78-94: LGTM! Manager initialization correctly sets up health monitoring.
The constructor properly initializes healthMonitorManager using NewHealthMonitorManager() before any clients are added. The toolsManager is also correctly initialized with the manager reference for client access.
106-112: LGTM! Tool management delegation is clean.
The delegation to toolsManager for AddToolsToRequest and GetAvailableTools maintains a clean separation of concerns.
133-156: LGTM! Tool execution and config update properly delegated.
ExecuteChatTool, ExecuteResponsesTool, and UpdateToolManagerConfig all correctly delegate to toolsManager.
199-201: LGTM! Agent execution delegated to tools manager.
Both CheckAndExecuteAgentForChatRequest and CheckAndExecuteAgentForResponsesRequest correctly delegate to the tools manager after validating preconditions.
Also applies to: 250-252
261-264: Correct cleanup ordering: stop health monitors first.
Stopping health monitors before acquiring the main lock and removing clients prevents potential issues where health checks might run on clients being cleaned up. This is the correct ordering.
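The ordering being praised looks roughly like this (a sketch of Cleanup with the body past the lock elided; only the StopAll-before-lock shape is taken from the review):

```go
func (m *MCPManager) Cleanup() error {
	// Stop health monitors first, before taking m.mu: monitors acquire m.mu
	// during their checks, so stopping them under the lock could deadlock.
	m.healthMonitorManager.StopAll()

	m.mu.Lock()
	defer m.mu.Unlock()
	// ... close client transports and clear internal maps ...
	return nil
}
```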
Summary
Implement health monitoring for MCP clients to detect and handle connection state changes automatically.
Changes
- Added health_monitor.go file with a health monitoring system for MCP clients
- ClientHealthMonitor to periodically check client health via ping
- HealthMonitorManager to manage multiple client health monitors
Type of change
Affected areas
How to test
Breaking changes
Related issues
Improves reliability of MCP client connections and provides better visibility into connection states.
Security considerations
No significant security implications. The health monitoring system only interacts with already established connections.
Checklist