-
Notifications
You must be signed in to change notification settings - Fork 20
Better monitoring #336
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Better monitoring #336
Conversation
- Fix executables call with an empty env When the environment is emptied, it is not guaranteed that "ls" or "sleep" are in the PATH of the new shell. Use /usr/bin/env "cmd" to ensure the system version of this command is used. - Fix all the python tests related to register_instance - Fix all libmuscle/python tests
Both for initialization and monitoring itself
|
Closes #312 |
LourensVeen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I've left a bunch of comments, but I also still have some second thoughts whether it's good to merge this with the profiling events. It may be better to have a separate table (instance, rank, hostname, pid) in the ProfileStore, and another one with (hostname, pid, time_start, time_stop, cpu, mem_max), and then you can join that together and/or with the other data in all sorts of creative ways when analysing the results.
Let me sleep on that, and maybe discuss tomorrow?
| def _register_instance( | ||
| self, instance_id: str, locations: List[str], | ||
| ports: List[List[str]], version: str = '') -> Any: | ||
| ports: List[List[str]], pid: int, hostname: str, version: str = '') -> Any: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instances that use MPI will have a hostname/pid for each rank. How would that be passed here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only the process calling "register_instance" will be monitored, if we want to also monitor all the other MPI processes, then I think that the following needs to be done:
- The MPI Process that calls register_instance must provide all PIDS and hostnames
- We need to call this method for each provided PID/Hostname Pair
I think the backend is flexible enough to have Many PIDs for a single Instance ID
libmuscle/python/libmuscle/native_instantiator/test/test_process_manager.py
Outdated
Show resolved
Hide resolved
| e.stop_time.nanoseconds, port_name, port_operator, | ||
| e.port_length, e.slot, e.message_number, e.message_size, | ||
| e.message_timestamp) | ||
| e.message_timestamp, e.cpu_percent, e.memory_usage) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm guessing that both cpu_percent and memory_usage are averages? Or is memory_usage a maximum? That would probably actually be more useful.
libmuscle/python/libmuscle/manager/test/test_profile_database.py
Outdated
Show resolved
Hide resolved
libmuscle/python/libmuscle/native_instantiator/agent/map_client.py
Outdated
Show resolved
Hide resolved
| cur.execute("COMMIT") | ||
| cur.close() | ||
|
|
||
| def add_event( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why a separate function? Can't we just call add_events(instance_id, [event]) directly instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made it because a single agent might submit for various instances, which with add_events means you have to pre-sort, or submit with arrays of length 1. but either way is fine I think.
libmuscle/python/libmuscle/native_instantiator/agent_manager.py
Outdated
Show resolved
Hide resolved
libmuscle/python/libmuscle/native_instantiator/native_instantiator.py
Outdated
Show resolved
Hide resolved
|
Oh, and could you merge develop into your branch? I've fixed the CI, so then we can see what it says. |
…cified, use the existing env (default behaviour)
b9d4647 to
4e07f29
Compare
…d monitor asynchronous
…isfy threading tests
This PR implements a way to get the node agents (if applicable) to monitor each instance that registers itself in the instance registry
The monitoring is quite basic and can be extended, it is logged with a DEBUG level to the main log.
After this is merged, probably there should also be a way to easily extract this info from the database