Better monitoring #336

v1kko · 2025-11-28T10:56:10Z

This PR lets the node agents (where applicable) monitor each instance that registers itself in the instance registry.

The monitoring is quite basic and can be extended. The results are logged at DEBUG level to the main log.

- Implement more testing for the profiling
- Important: send the monitoring data less frequently than every 0.1 s, and find out how we want to do this
  - It turns out that psutil has to measure CPU usage over a certain period, which would block the main thread, so I created another thread that takes approximately 1 second to do this. The result is sent back only when it is ready, so about once every 1.1 seconds.
- Do we support MPI out of the box, or do we add support for that later?
  - Something for later
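A minimal sketch of the background-sampling idea from the list above, not the code in this PR: a worker thread runs a blocking measurement, and the caller picks up the newest result only when one is ready. `BackgroundSampler`, `take` and `fake_cpu_sample` are hypothetical names; with psutil installed, the blocking call could be `psutil.Process(pid).cpu_percent(interval=1.0)`.

```python
import threading
import time
from typing import Callable, Optional


class BackgroundSampler:
    """Run a blocking measurement in a worker thread (hypothetical helper)."""

    def __init__(self, sample: Callable[[], float]) -> None:
        self._sample = sample
        self._latest: Optional[float] = None
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def _run(self) -> None:
        while not self._stop.is_set():
            value = self._sample()  # blocks for the whole sampling period
            with self._lock:
                self._latest = value

    def take(self) -> Optional[float]:
        """Return the newest result once, or None if no new result is ready."""
        with self._lock:
            value, self._latest = self._latest, None
        return value

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()


def fake_cpu_sample() -> float:
    """Stand-in for a blocking call such as
    psutil.Process(pid).cpu_percent(interval=1.0)."""
    time.sleep(0.05)
    return 42.0
```

In this sketch the main loop would call `take()` on every iteration and send a usage message only when the return value is not `None`, so the main thread never waits for a measurement.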

After this is merged, there should probably also be a way to easily extract this info from the database.

- Fix executables call with an empty env. When the environment is emptied, it is not guaranteed that "ls" or "sleep" are in the PATH of the new shell. Use /usr/bin/env "cmd" to ensure the system version of this command is used.
- Fix all the python tests related to register_instance
- Fix all libmuscle/python tests
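The `/usr/bin/env` fix described above can be illustrated with a small Python sketch. It assumes a POSIX system where `/usr/bin/env` exists; `run_with_empty_env` is a hypothetical helper, not a function from this PR.

```python
import subprocess
from typing import List


def run_with_empty_env(command: List[str]) -> subprocess.CompletedProcess:
    """Run a command with an empty environment via /usr/bin/env.

    The launcher is given by absolute path, so starting it does not depend
    on PATH being set in the (empty) environment of the new process.
    """
    return subprocess.run(['/usr/bin/env'] + command, env={}, check=False)


if __name__ == '__main__':
    result = run_with_empty_env(['sleep', '0'])
    print('exit code:', result.returncode)
```

With `PATH` unset, `env` should fall back to the system's default search path when looking up `sleep`, which is what makes the call independent of the caller's environment.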


v1kko · 2025-11-28T11:03:52Z

Closes #312

LourensVeen

Okay, I've left a bunch of comments, but I also still have some second thoughts about whether it's good to merge this with the profiling events. It may be better to have a separate table (instance, rank, hostname, pid) in the ProfileStore, and another one with (hostname, pid, time_start, time_stop, cpu, mem_max). You can then join those together and/or with the other data in all sorts of creative ways when analysing the results.

Let me sleep on that, and maybe discuss tomorrow?
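The two-table layout proposed above could look roughly like this in SQLite. This is a sketch under assumptions: the table names `instance_processes` and `resource_usage`, the column types, and both helper functions are made up for illustration and are not the ProfileStore schema.

```python
import sqlite3
from typing import List, Tuple


def create_tables(conn: sqlite3.Connection) -> None:
    """Create the two hypothetical tables from the review comment."""
    conn.executescript('''
        CREATE TABLE instance_processes (
            instance TEXT, "rank" INTEGER, hostname TEXT, pid INTEGER);
        CREATE TABLE resource_usage (
            hostname TEXT, pid INTEGER, time_start INTEGER,
            time_stop INTEGER, cpu REAL, mem_max INTEGER);
    ''')


def usage_per_rank(
        conn: sqlite3.Connection) -> List[Tuple[str, int, float, int]]:
    """Join usage samples to instances: mean CPU and peak memory per rank."""
    return conn.execute('''
        SELECT p.instance, p."rank", AVG(u.cpu), MAX(u.mem_max)
        FROM instance_processes AS p
        JOIN resource_usage AS u
            ON u.hostname = p.hostname AND u.pid = p.pid
        GROUP BY p.instance, p."rank"
        ORDER BY p.instance, p."rank"
    ''').fetchall()
```

Keeping the usage samples keyed on (hostname, pid) means the agents do not need to know instance names, and the join to instances happens only at analysis time.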

libmuscle/cpp/src/libmuscle/mmp_client.cpp

LourensVeen · 2025-12-03T13:40:18Z

libmuscle/python/libmuscle/manager/mmp_server.py

    def _register_instance(
            self, instance_id: str, locations: List[str],
-            ports: List[List[str]], version: str = '') -> Any:
+            ports: List[List[str]], pid: int, hostname: str, version: str = '') -> Any:


Instances that use MPI will have a hostname/pid for each rank. How would that be passed here?

v1kko · 2025-12-10

Only the process calling "register_instance" will be monitored. If we also want to monitor all the other MPI processes, then I think the following needs to be done:

- The MPI process that calls register_instance must provide all PIDs and hostnames
- We need to call this method for each provided PID/hostname pair

I think the backend is flexible enough to have many PIDs for a single instance ID.
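As a rough sketch of the idea in this reply, the registration could be repeated once per (hostname, pid) pair, with the backend keeping a set of processes per instance ID. `MonitoredProcesses` and `register_all_ranks` are hypothetical names and do not reflect the actual manager code.

```python
from collections import defaultdict
from typing import DefaultDict, Iterable, Set, Tuple

Process = Tuple[str, int]  # (hostname, pid)


class MonitoredProcesses:
    """Hypothetical backend mapping one instance ID to many processes."""

    def __init__(self) -> None:
        self._by_instance: DefaultDict[str, Set[Process]] = defaultdict(set)

    def register(self, instance_id: str, hostname: str, pid: int) -> None:
        self._by_instance[instance_id].add((hostname, pid))

    def processes(self, instance_id: str) -> Set[Process]:
        # .get() avoids creating an empty entry for unknown instances
        return set(self._by_instance.get(instance_id, set()))


def register_all_ranks(
        registry: MonitoredProcesses, instance_id: str,
        ranks: Iterable[Process]) -> None:
    """Register every MPI rank's (hostname, pid) under one instance ID."""
    for hostname, pid in ranks:
        registry.register(instance_id, hostname, pid)
```

How the registering rank would collect the other ranks' hostnames and PIDs is left open here, as it is in the discussion.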

libmuscle/python/libmuscle/native_instantiator/test/test_process_manager.py

LourensVeen · 2025-12-03T13:53:26Z

libmuscle/python/libmuscle/manager/profile_store.py

                    e.stop_time.nanoseconds, port_name, port_operator,
                    e.port_length, e.slot, e.message_number, e.message_size,
-                    e.message_timestamp)
+                    e.message_timestamp, e.cpu_percent, e.memory_usage)


I'm guessing that both cpu_percent and memory_usage are averages? Or is memory_usage a maximum? That would probably actually be more useful.

libmuscle/python/libmuscle/manager/test/test_profile_database.py

libmuscle/python/libmuscle/native_instantiator/agent/map_client.py

libmuscle/python/libmuscle/manager/instance_manager.py

LourensVeen · 2025-12-03T16:58:16Z

libmuscle/python/libmuscle/manager/profile_store.py

        cur.execute("COMMIT")
        cur.close()

+    def add_event(


Why a separate function? Can't we just call add_events(instance_id, [event]) directly instead?

v1kko · 2025-12-10

I made it because a single agent might submit events for various instances, which with add_events means you either have to pre-sort them or submit arrays of length 1. Either way is fine, I think.
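The pre-sorting alternative mentioned in this reply could be sketched as follows. It assumes a store with an `add_events(instance_id, events)` method as named in the review comment; `group_by_instance`, `submit_grouped` and `RecordingStore` are hypothetical and exist only for this example.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Tuple

Report = Tuple[str, Any]  # (instance_id, event)


def group_by_instance(reports: Iterable[Report]) -> Dict[str, List[Any]]:
    """Group events by instance ID, keeping the order within each instance."""
    grouped: Dict[str, List[Any]] = defaultdict(list)
    for instance_id, event in reports:
        grouped[instance_id].append(event)
    return dict(grouped)


class RecordingStore:
    """Stand-in for the profile store; it only records the calls it gets."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, List[Any]]] = []

    def add_events(self, instance_id: str, events: List[Any]) -> None:
        self.calls.append((instance_id, events))


def submit_grouped(store: RecordingStore, reports: Iterable[Report]) -> None:
    """Make one add_events call per instance instead of one per event."""
    for instance_id, events in group_by_instance(reports).items():
        store.add_events(instance_id, events)
```

Grouping trades a little bookkeeping in the agent path for fewer calls into the store; a single-event `add_event` avoids the grouping step entirely.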

libmuscle/python/libmuscle/native_instantiator/agent_manager.py

libmuscle/python/libmuscle/native_instantiator/native_instantiator.py

LourensVeen · 2025-12-03T17:44:00Z

Oh, and could you merge develop into your branch? I've fixed the CI, so then we can see what it says.


v1kko added 6 commits November 25, 2025 13:23

Add extra assertion to tests

6d21081

Add profiling event to profiling database

b1b2021

communicate muscle processes to be monitored

08ef79e

Implementing the Protocol for monitoring

3bbc2c9

Both for initialization and monitoring itself

Add the last mile to get everything in the sqlite database

d23c172

LourensVeen reviewed Dec 3, 2025


v1kko added 9 commits December 4, 2025 08:58

Merge branch 'multiscale:develop' into better_monitoring

2f157c7

Add (skeleton) MLPServer, split events to events/usage_evenst

f57c879

Pass resource usage through MLPServer

f867167

Get MLPServer address to Agents

919b62d

remove wrong implementation to monitor

c1058a9

Removing leftovers

6fb36b1

Fix mypy issues

4a84e76

Make environment optional, if no environment for a new process is specified, use the existing env (default behaviour)

d1d1a32

Fix all flake8 issues

4e07f29

v1kko force-pushed the better_monitoring branch from b9d4647 to 4e07f29 December 10, 2025 15:30

v1kko added 10 commits December 10, 2025 16:34

pass procid and hostname from the instance instead of the MMPClient

b54c0e3

Fix profile_database_test to count forward instead of backwards

154f29b

Fix private variable access of the instance_manager

1366ede

Fix mmp_client.cpp::register_instance headers

0b1d965

Fix report usage implementation, only report back once per second, and monitor asynchronous

9a65742

fix mlp_tests

1f9af1a

fix threading issue with py3.8

b7cccee

fix last remaining threading issue by force closing the Pool

bcead64

Fix tests by calling MLPClient.close() at the end of the tests

935fc58

Usage collection now non-threaded, but with simple timestamps, to satisfy threading tests

8c9a8d3
