Contributor

Copilot AI commented Jun 25, 2025

Resolves #4464

What is being addressed

VM operations in the TRE UI were getting stuck in "awaiting_action" status indefinitely. The resource processor completed operations successfully (VMs started/stopped in Azure), but the API never received status updates, leaving operations stuck in the UI.

Root cause: Service bus consumers (DeploymentStatusUpdater and AirlockStatusUpdater) could fail silently or hang without detection or recovery mechanisms.

How is this addressed

Implemented a comprehensive monitoring and auto-recovery system with dual detection:

Key Features

ServiceBusConsumer Base Class: Provides heartbeat monitoring and supervisor functionality for all service bus consumers.

Dual Monitoring:

  • Task monitoring: Detects when receive_messages() tasks fail or complete
  • Heartbeat monitoring: Detects silent hangs via file-based heartbeat timestamps (5-minute staleness threshold)

Automatic Recovery:

  • Failed tasks restart after 5 seconds
  • Silent hangs are detected once the heartbeat is more than 5 minutes stale, and the consumer is then restarted
  • Rate limiting prevents restart loops
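
The task-monitoring half of this design can be illustrated with a minimal sketch. It is not the code from this PR: the function name supervise, the window-based rate limit, and the constants MAX_RESTARTS_PER_WINDOW and RESTART_WINDOW_SECONDS are assumptions made for illustration. Only the 5-second restart delay comes from the description above.

```python
import asyncio
import time

RESTART_DELAY_SECONDS = 5      # from the description: failed tasks restart after 5 seconds
MAX_RESTARTS_PER_WINDOW = 5    # assumed value, not taken from the PR
RESTART_WINDOW_SECONDS = 60    # assumed value, not taken from the PR


async def supervise(make_task,
                    restart_delay=RESTART_DELAY_SECONDS,
                    max_restarts=MAX_RESTARTS_PER_WINDOW,
                    window=RESTART_WINDOW_SECONDS):
    """Run make_task() in a loop, restarting it after it fails or returns.

    Raises RuntimeError once more than max_restarts restarts would fall
    inside the sliding window, so a permanently broken task cannot spin.
    """
    restart_times = []
    while True:
        try:
            await make_task()
        except asyncio.CancelledError:
            # Cancellation is a deliberate shutdown signal, so propagate it.
            raise
        except Exception as exc:
            print(f"consumer task failed: {exc!r}")
        else:
            print("consumer task completed unexpectedly")

        now = time.monotonic()
        # Keep only the restarts that are still inside the window.
        restart_times = [t for t in restart_times if now - t < window]
        if len(restart_times) >= max_restarts:
            raise RuntimeError("restart limit exceeded, giving up")
        restart_times.append(now)
        await asyncio.sleep(restart_delay)
```

Giving up with an exception is only one possible policy. A longer back-off sleep would keep the service self-healing at the cost of hiding a persistent fault for longer.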

Implementation Details

  • Enhanced error handling: Fixed type safety issues and improved exception handling with explanatory comments
  • Configuration: Extracted magic numbers to named constants for maintainability
  • Code quality: Moved imports to follow Python best practices and PEP 8 guidelines
  • Logging: Added worker ID tracking for better observability
  • Testing: Comprehensive unit and edge case test coverage
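
As a rough illustration of the worker ID tracking mentioned above, a logging filter can stamp every record with an identifier that the format string then prints. This is a sketch under assumptions: the PR names a WorkerIdFilter, but the constructor argument, the worker_id attribute name, and the use of the process ID as the default are guesses, not the PR's implementation.

```python
import logging
import os


class WorkerIdFilter(logging.Filter):
    """Attach a worker identifier to every log record that passes through."""

    def __init__(self, worker_id=None):
        super().__init__()
        # Assumption: fall back to the process ID when no ID is supplied.
        self.worker_id = worker_id if worker_id is not None else os.getpid()

    def filter(self, record):
        record.worker_id = self.worker_id
        return True  # never drop the record, only annotate it


def build_handler(stream, worker_id=None):
    """Return a stream handler whose format includes the worker ID."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("[worker %(worker_id)s] %(message)s"))
    handler.addFilter(WorkerIdFilter(worker_id))
    return handler
```

Attaching the filter to the handler rather than to a single logger means records from any logger routed through that handler carry the attribute before formatting.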

Files Changed

  • service_bus/service_bus_consumer.py - New base class with monitoring (follows PEP 8 import conventions)
  • service_bus/deployment_status_updater.py - Inherits from base class
  • service_bus/airlock_request_status_update.py - Inherits from base class (same integration as deployment_status_updater.py)
  • main.py - Uses supervisor functions
  • services/logging.py - Worker ID filter
  • Complete test suite for all functionality

Result

Before: Manual API restart required when consumers failed
After: Self-healing system with zero manual intervention needed

This eliminates indefinitely stuck operations and makes the TRE service bus system resilient to both transient failures and silent hangs.

Documentation and versioning

  • Updated CHANGELOG.md
  • Incremented API version from 0.24.3 to 0.24.4

Copilot AI changed the title from "[WIP] TRE processor/UI does not seem to be picking up VM resource status changes/operations" to "Add restart mechanism to deployment status updater to fix stuck VM operations" Jun 25, 2025
Copilot AI requested a review from marrobi June 25, 2025 11:09
Copilot finished work on behalf of marrobi June 25, 2025 11:09
Member

marrobi commented Jun 25, 2025

@copilot can the supervisor be moved into async def runner(process_number: int, config: dict)?

Copilot finished work on behalf of marrobi June 25, 2025 11:26

github-actions bot commented Jun 25, 2025

Unit Test Results

671 tests   671 ✅  7s ⏱️
  1 suites    0 💤
  1 files      0 ❌

Results for commit 7eac68b.

♻️ This comment has been updated with latest results.

Member

marrobi commented Jun 25, 2025

@copilot can you add something in the supervisor function to check that there is still a heartbeat?

Contributor Author

Copilot AI commented Jun 25, 2025

@copilot can you add something in the supervisor function to check that there is still a heartbeat?

Added heartbeat monitoring to the supervisor function. Each worker process now updates a heartbeat file every iteration, and the supervisor checks these timestamps to detect stuck processes. Stale processes (no heartbeat update for 5 minutes) are automatically terminated and restarted. This ensures the system can recover from processes that are alive but not actively working. Changes include:

  • update_heartbeat() function to write process heartbeat timestamps
  • check_process_heartbeat() function to validate heartbeat freshness
  • Enhanced check_runners() function to monitor heartbeats and restart stuck processes
  • Comprehensive unit tests for the new functionality
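
A file-based heartbeat of the kind described here could look roughly like the sketch below. The function names echo the ones listed above, but the signatures, the file naming scheme in heartbeat_path, and the injectable now parameter are assumptions for illustration, not the committed code.

```python
import os
import tempfile
import time

HEARTBEAT_STALENESS_SECONDS = 300  # the 5-minute threshold from the comment above


def heartbeat_path(prefix):
    """Hypothetical naming scheme: one heartbeat file per process in the temp dir."""
    return os.path.join(tempfile.gettempdir(), f"{prefix}_{os.getpid()}.heartbeat")


def update_heartbeat(path, now=None):
    """Write the current timestamp to the heartbeat file."""
    timestamp = time.time() if now is None else now
    with open(path, "w") as handle:
        handle.write(str(timestamp))


def check_heartbeat(path, max_age=HEARTBEAT_STALENESS_SECONDS, now=None):
    """Return True if the heartbeat file exists and is at most max_age seconds old."""
    try:
        with open(path) as handle:
            last_beat = float(handle.read().strip())
    except (OSError, ValueError):
        # A missing or corrupt file is treated the same as a stale heartbeat.
        return False
    current = time.time() if now is None else now
    return current - last_beat <= max_age
```

Passing now explicitly keeps the staleness check testable without sleeping for five minutes.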

Commit: c7072b9

Copilot finished work on behalf of marrobi June 25, 2025 21:00
Member

marrobi commented Jun 25, 2025

@copilot this needs to be for the deployment status updater, in the API, not the runner.

Contributor Author

Copilot AI commented Jun 25, 2025

@copilot this needs to be for the deployment status updater, in the API, not the runner.

Moved the heartbeat monitoring from the resource processor to the deployment status updater in the API as requested. The implementation now includes:

  • Heartbeat file updates in receive_messages()
  • check_heartbeat() method to validate heartbeat freshness
  • supervisor_with_heartbeat_check() function that monitors the heartbeat and restarts the task if it becomes stale (no update for 5 minutes)
  • Updated main.py to use the supervisor function
  • Comprehensive unit tests for the heartbeat functionality
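
The idea of a supervisor that restarts a task whose heartbeat has gone stale can be sketched as follows. This is an illustrative stand-in, not supervisor_with_heartbeat_check() from the PR: the Heartbeat class keeps the timestamp in memory instead of a file, and the names supervise_with_heartbeat, check_interval, and max_restarts are assumptions. The max_restarts bound exists only so the example terminates.

```python
import asyncio
import time


class Heartbeat:
    """In-memory stand-in for the heartbeat file (assumption for brevity)."""

    def __init__(self):
        self.last_beat = time.monotonic()

    def beat(self):
        self.last_beat = time.monotonic()

    def is_fresh(self, max_age):
        return time.monotonic() - self.last_beat <= max_age


async def supervise_with_heartbeat(make_worker, heartbeat, max_age,
                                   check_interval, max_restarts=None):
    """Cancel and restart the worker whenever its heartbeat goes stale.

    Returns the number of restarts performed. With max_restarts=None the
    loop runs until the supervisor itself is cancelled.
    """
    restarts = 0
    task = asyncio.create_task(make_worker(heartbeat))
    try:
        while max_restarts is None or restarts < max_restarts:
            await asyncio.sleep(check_interval)
            if not heartbeat.is_fresh(max_age):
                # The worker is alive but silent: cancel it and wait for cleanup.
                task.cancel()
                await asyncio.gather(task, return_exceptions=True)
                heartbeat.beat()  # give the replacement a fresh grace period
                restarts += 1
                task = asyncio.create_task(make_worker(heartbeat))
    finally:
        task.cancel()
        await asyncio.gather(task, return_exceptions=True)
    return restarts
```

Resetting the heartbeat before starting the replacement avoids an immediate second restart while the new task is still connecting.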

Commit: 202e726

Copilot finished work on behalf of marrobi June 25, 2025 21:17
Member

marrobi commented Jun 25, 2025

@copilot fix the linting issues:

2025-06-25 21:33:34 [ERROR] Found errors in [flake8] linter!
2025-06-25 21:33:34 [ERROR] Error code: 1. Command output:


/github/workspace/api_app/service_bus/deployment_status_updater.py:57:1: W293 blank line contains whitespace
/github/workspace/api_app/service_bus/deployment_status_updater.py:60:1: W293 blank line contains whitespace
/github/workspace/api_app/service_bus/deployment_status_updater.py:63:1: W293 blank line contains whitespace
/github/workspace/api_app/service_bus/deployment_status_updater.py:95:1: W293 blank line contains whitespace
/github/workspace/api_app/service_bus/deployment_status_updater.py:101:1: W293 blank line contains whitespace
/github/workspace/api_app/service_bus/deployment_status_updater.py:111:1: W293 blank line contains whitespace
/github/workspace/api_app/service_bus/deployment_status_updater.py:125:1: W293 blank line contains whitespace
/github/workspace/api_app/service_bus/deployment_status_updater.py:128:1: W293 blank line contains whitespace

2025-06-25 21:33:34 [INFO] ---------------------------
2025-06-25 21:33:34 [INFO] File:[/github/workspace/api_app/tests_ma/test_service_bus/test_deployment_status_update.py]
2025-06-25 21:33:34 [ERROR] Found errors in [flake8] linter!
2025-06-25 21:33:34 [ERROR] Error code: 1. Command output:

/github/workspace/api_app/tests_ma/test_service_bus/test_deployment_status_update.py:460:1: W293 blank line contains whitespace
/github/workspace/api_app/tests_ma/test_service_bus/test_deployment_status_update.py:472:1: W293 blank line contains whitespace
/github/workspace/api_app/tests_ma/test_service_bus/test_deployment_status_update.py:492:1: W293 blank line contains whitespace

And increment the API version.

Contributor Author

Copilot AI commented Jun 25, 2025

@copilot fix the linting issues:

2025-06-25 21:33:34 [ERROR] Found errors in [flake8] linter!
...

Fixed all linting issues by removing trailing whitespace from blank lines in both deployment_status_updater.py and test_deployment_status_update.py. Also incremented the API version from 0.24.3 to 0.24.4 as requested.
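
For reference, the W293 fix amounts to emptying lines that contain only whitespace. A small helper along these lines would do it; the function name is hypothetical and this is not how the commit was produced. Note that it would also alter whitespace-only lines inside multi-line string literals.

```python
def strip_blank_line_whitespace(source):
    """Remove spaces and tabs from lines that contain nothing else (flake8 W293)."""
    cleaned = []
    for line in source.splitlines(keepends=True):
        if line.strip() == "":
            # Keep only the line ending, dropping the stray indentation.
            cleaned.append(line[len(line.rstrip("\r\n")):])
        else:
            cleaned.append(line)
    return "".join(cleaned)
```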

Commit: 381bd9c

Copilot finished work on behalf of marrobi June 25, 2025 21:39
Member

marrobi commented Jun 25, 2025

/test-extended 381bd9c


🤖 pr-bot 🤖

🏃 Running extended tests: https://github.com/microsoft/AzureTRE/actions/runs/15887800326 (with refid ce7e1779)

(in response to this comment from @marrobi)

Member

marrobi commented Jun 26, 2025

/test-extended 75d77dd


🤖 pr-bot 🤖

🏃 Running extended tests: https://github.com/microsoft/AzureTRE/actions/runs/15901591226 (with refid ce7e1779)

(in response to this comment from @marrobi)

marrobi requested review from Copilot June 26, 2025 12:21
Contributor

Copilot AI left a comment

Pull Request Overview

This PR implements comprehensive service bus monitoring with heartbeat detection and automatic recovery to eliminate stuck VM operations in the TRE UI.

  • Introduces a new ServiceBusConsumer base class that provides heartbeat management and a supervisor pattern.
  • Updates DeploymentStatusUpdater and AirlockStatusUpdater to inherit from the new base class and integrates heartbeat updates.
  • Enhances logging by applying a worker ID filter and updates the main application to use supervisor tasks.

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

File - Description

  • api_app/tests_ma/test_service_bus/test_service_bus_consumer.py - Adds comprehensive unit tests for heartbeat and supervisor functionality
  • api_app/services/logging.py - Implements WorkerIdFilter and applies worker ID to log format
  • api_app/service_bus/service_bus_consumer.py - Introduces heartbeat monitoring and supervisor functions in a new base class
  • api_app/service_bus/deployment_status_updater.py - Refactors to inherit from ServiceBusConsumer and integrates heartbeat updates
  • api_app/service_bus/airlock_request_status_update.py - Refactors to inherit from ServiceBusConsumer and integrates heartbeat updates
  • api_app/main.py - Updates to schedule supervisor tasks for service bus consumers
  • api_app/_version.py - Bumps API version to reflect the new changes

Comments suppressed due to low confidence (1)

api_app/tests_ma/test_service_bus/test_service_bus_consumer.py:117

  • [nitpick] Consider using a custom exception or a cancellation flag to exit the infinite loop in tests instead of KeyboardInterrupt, to clearly indicate that the termination is part of the test design.
        except KeyboardInterrupt:
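
One way to follow this suggestion is a test-only sentinel exception that breaks the loop, so the test cannot be confused with a real interrupt. The sketch below is hypothetical: StopTestLoop, poll_forever, and run_until_sentinel are illustrative names and do not appear in the PR's test file.

```python
import asyncio


class StopTestLoop(Exception):
    """Hypothetical test-only sentinel used to break out of an infinite loop."""


async def poll_forever(step):
    """Stand-in for a supervisor loop that never returns on its own."""
    while True:
        await step()


def run_until_sentinel(iterations):
    """Drive poll_forever for a fixed number of iterations, then stop it."""
    calls = []

    async def step():
        calls.append(len(calls))
        if len(calls) >= iterations:
            raise StopTestLoop()

    try:
        asyncio.run(poll_forever(step))
    except StopTestLoop:
        pass  # expected: the sentinel is how the test ends the loop
    return len(calls)
```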

Contributor

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

marrobi requested review from jonnyry and tamirkamara June 26, 2025 17:02
marrobi changed the title from "Implement comprehensive service bus monitoring with heartbeat detection and automatic recovery" to "Implement service bus monitoring with heartbeat detection and automatic recovery" Jun 30, 2025
marrobi requested a review from a team as a code owner November 7, 2025 14:41
Member

marrobi commented Nov 7, 2025

/test-extended


github-actions bot commented Nov 7, 2025

🤖 pr-bot 🤖

⚠️ When using /test-extended on external PRs, the SHA of the checked commit must be specified

(in response to this comment from @marrobi)

marrobi requested a review from Copilot November 7, 2025 16:06
Member

marrobi commented Nov 7, 2025

/test-extended ba8d1e9


github-actions bot commented Nov 7, 2025

🤖 pr-bot 🤖

🏃 Running extended tests: https://github.com/microsoft/AzureTRE/actions/runs/19174005010 (with refid ce7e1779)

(in response to this comment from @marrobi)

Contributor

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.


def __init__(self, heartbeat_file_prefix: str):
    # Create a unique identifier for this worker process
    import tempfile

Copilot AI Nov 7, 2025

The import tempfile statement should be moved to the top of the file with other imports rather than being inside the __init__ method. This follows Python best practices and PEP 8 style guidelines.

Member

@copilot apply changes based on this feedback

Contributor Author

Moved import tempfile to the top of the file with other imports and added an explanatory comment to the except clause handling asyncio.CancelledError.

Commit: ff963ac

Copilot finished work on behalf of marrobi November 7, 2025 16:25
Development

Successfully merging this pull request may close these issues.

TRE processor/UI does not seem to be picking up VM resource status changes/operations

2 participants