Skip to content

Conversation

@v1kko
Copy link
Contributor

@v1kko v1kko commented Nov 28, 2025

This PR implements a way to get the node agents (if applicable) to monitor each instance that registers itself in the instance registry

The monitoring is quite basic and can be extended, it is logged with a DEBUG level to the main log.

  • Implement more testing for the profiling
  • Important Send the monitoring data less frequently than every 0.1s, find out how we want to do this
    • It turns out that psutil has to do cpu profiling over a certain period, this would block the main thread, so I created another thread that takes approx 1 second to do this. Then this result is sent back only when it is ready, thus about once every 1.1 seconds.
  • Do we support MPI out-of-the-box, or do we add support for that later?
    • something for later

After this is merged, probably there should also be a way to easily extract this info from the database

- Fix executables call with an empty env

  When the environment is emptied, it is not guaranteed that "ls" or "sleep"
  are in the  PATH of the new shell.
  Use /usr/bin/env "cmd" to ensure the system version of this command is used.

- Fix all the python tests related to register_instance

- Fix all libmuscle/python tests
Both for initialization and monitoring itself
@v1kko
Copy link
Contributor Author

v1kko commented Nov 28, 2025

Closes #312

Copy link
Contributor

@LourensVeen LourensVeen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I've left a bunch of comments, but I also still have some second thoughts whether it's good to merge this with the profiling events. It may be better to have a separate table (instance, rank, hostname, pid) in the ProfileStore, and another one with (hostname, pid, time_start, time_stop, cpu, mem_max), and then you can join that together and/or with the other data in all sorts of creative ways when analysing the results.

Let me sleep on that, and maybe discuss tomorrow?

def _register_instance(
self, instance_id: str, locations: List[str],
ports: List[List[str]], version: str = '') -> Any:
ports: List[List[str]], pid: int, hostname: str, version: str = '') -> Any:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instances that use MPI will have a hostname/pid for each rank. How would that be passed here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only the process calling "register_instance" will be monitored, if we want to also monitor all the other MPI processes, then I think that the following needs to be done:

  • The MPI Process that calls register_instance must provide all PIDS and hostnames
  • We need to call this method for each provided PID/Hostname Pair

I think the backend is flexible enough to have Many PIDs for a single Instance ID

e.stop_time.nanoseconds, port_name, port_operator,
e.port_length, e.slot, e.message_number, e.message_size,
e.message_timestamp)
e.message_timestamp, e.cpu_percent, e.memory_usage)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guessing that both cpu_percent and memory_usage are averages? Or is memory_usage a maximum? That would probably actually be more useful.

cur.execute("COMMIT")
cur.close()

def add_event(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why a separate function? Can't we just call add_events(instance_id, [event]) directly instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made it because a single agent might submit for various instances, which with add_events means you have to pre-sort, or submit with arrays of length 1. but either way is fine I think.

@LourensVeen
Copy link
Contributor

Oh, and could you merge develop into your branch? I've fixed the CI, so then we can see what it says.

@v1kko v1kko force-pushed the better_monitoring branch from b9d4647 to 4e07f29 Compare December 10, 2025 15:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants