-
Notifications
You must be signed in to change notification settings - Fork 173
Implement service bus monitoring with heartbeat detection and automatic recovery #4601
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…tions Co-authored-by: marrobi <[email protected]>
|
@copilot can the supervisor be moved into |
Unit Test Results671 tests 671 ✅ 7s ⏱️ Results for commit 7eac68b. ♻️ This comment has been updated with latest results. |
|
@copilot can you add something in the supervisor function to check that there is still a heartbeat. |
…ection Co-authored-by: marrobi <[email protected]>
Added heartbeat monitoring to the supervisor function. Each worker process now updates a heartbeat file every iteration, and the supervisor checks these timestamps to detect stuck processes. Stale processes (no heartbeat update for 5 minutes) are automatically terminated and restarted. This ensures the system can recover from processes that are alive but not actively working. Changes include:
Commit: c7072b9 |
|
@copilot this needs to be for deployment status updater, ion the API, not the runner. |
…s updater Co-authored-by: marrobi <[email protected]>
Moved the heartbeat monitoring from the resource processor to the deployment status updater in the API as requested. The implementation now includes:
Commit: 202e726 |
|
@copilot fix the linting issues:
/github/workspace/api_app/service_bus/deployment_status_updater.py:57:1: W293 blank line contains whitespace
|
Co-authored-by: marrobi <[email protected]>
Fixed all linting issues by removing trailing whitespace from blank lines in both deployment_status_updater.py and test_deployment_status_update.py. Also incremented the API version from 0.24.3 to 0.24.4 as requested. Commit: 381bd9c |
|
/test-extended 381bd9c |
|
🤖 pr-bot 🤖 🏃 Running extended tests: https://github.com/microsoft/AzureTRE/actions/runs/15887800326 (with refid (in response to this comment from @marrobi) |
|
/test-extended 75d77dd |
|
🤖 pr-bot 🤖 🏃 Running extended tests: https://github.com/microsoft/AzureTRE/actions/runs/15901591226 (with refid (in response to this comment from @marrobi) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR implements comprehensive service bus monitoring with heartbeat detection and automatic recovery to eliminate stuck VM operations in the TRE UI.
- Introduces a new ServiceBusConsumer base class that provides heartbeat management and a supervisor pattern.
- Updates DeploymentStatusUpdater and AirlockStatusUpdater to inherit from the new base class and integrates heartbeat updates.
- Enhances logging by applying a worker ID filter and updates the main application to use supervisor tasks.
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| api_app/tests_ma/test_service_bus/test_service_bus_consumer.py | Adds comprehensive unit tests for heartbeat and supervisor functionality |
| api_app/services/logging.py | Implements WorkerIdFilter and applies worker ID to log format |
| api_app/service_bus/service_bus_consumer.py | Introduces heartbeat monitoring and supervisor functions in a new base class |
| api_app/service_bus/deployment_status_updater.py | Refactors to inherit from ServiceBusConsumer and integrates heartbeat updates |
| api_app/service_bus/airlock_request_status_update.py | Refactors to inherit from ServiceBusConsumer and integrates heartbeat updates |
| api_app/main.py | Updates to schedule supervisor tasks for service bus consumers |
| api_app/_version.py | Bumps API version to reflect the new changes |
Comments suppressed due to low confidence (1)
api_app/tests_ma/test_service_bus/test_service_bus_consumer.py:117
- [nitpick] Consider using a custom exception or a cancellation flag to exit the infinite loop in tests instead of KeyboardInterrupt, to clearly indicate that the termination is part of the test design.
except KeyboardInterrupt:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
|
/test-extended |
|
🤖 pr-bot 🤖
(in response to this comment from @marrobi) |
|
/test-extended ba8d1e9 |
|
🤖 pr-bot 🤖 🏃 Running extended tests: https://github.com/microsoft/AzureTRE/actions/runs/19174005010 (with refid (in response to this comment from @marrobi) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.
|
|
||
| def __init__(self, heartbeat_file_prefix: str): | ||
| # Create a unique identifier for this worker process | ||
| import tempfile |
Copilot
AI
Nov 7, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The import tempfile statement should be moved to the top of the file with other imports rather than being inside the __init__ method. This follows Python best practices and PEP 8 style guidelines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@copilot apply changes based on this feedback
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved import tempfile to the top of the file with other imports and added an explanatory comment to the except clause handling asyncio.CancelledError.
Commit: ff963ac
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: marrobi <[email protected]>
Resolves #4464
What is being addressed
VM operations in the TRE UI were getting stuck in "awaiting_action" status indefinitely. The resource processor completed operations successfully (VMs started/stopped in Azure), but the API never received status updates, leaving operations stuck in the UI.
Root cause: Service bus consumers (
DeploymentStatusUpdaterandAirlockStatusUpdater) could fail silently or hang without detection or recovery mechanisms.How is this addressed
Implemented a comprehensive monitoring and auto-recovery system with dual detection:
Key Features
ServiceBusConsumer Base Class: Provides heartbeat monitoring and supervisor functionality for all service bus consumers.
Dual Monitoring:
receive_messages()tasks fail or completeAutomatic Recovery:
Implementation Details
Files Changed
service_bus/service_bus_consumer.py- New base class with monitoring (follows PEP 8 import conventions)service_bus/deployment_status_updater.py- Inherits from base classservice_bus/airlock_request_status_update.py- Same integrationmain.py- Uses supervisor functionsservices/logging.py- Worker ID filterResult
Before: Manual API restart required when consumers failed
After: Self-healing system with zero manual intervention needed
This eliminates indefinitely stuck operations and makes the TRE service bus system resilient to both transient failures and silent hangs.
Documentation and versioning
✨ Let Copilot coding agent set things up for you — coding agent works faster and does higher quality work when set up for your repo.