profiling: run cylc 7 profile tests against cylc 8 #141
Yeah, this should be a priority once rc1 is out (feature freeze, more or less, then bug-fix, profile, optimize...). Cylc 8 ought to be a lot more efficient, but we haven't really proved that yet.
Here is my first go at profiling:

Small workflow

Datetime cycling example from the tutorials; tasks with no scripts/sleeps.

| | 7.8.10 | 8.0rc1.dev01 |
|---|---|---|
| Wall clock time | 0:01:46 | 0:04:03 |
| User + Sys time (s) | 21.0 | 31.4 |
| % of CPU this job got | 19 | 12 |
| Max resident set size (kB) | 27,464 | 48,704 |
Complex workflow

A while ago I had taken the "Complex" suite from Oliver's ~/cylc-run and updated it to Cylc 8 syntax: https://gist.github.com/MetRonnie/98d42d0b090b1fee7f24a000b90be0af

Tasks with 1 s sleeps (note: for some reason I used `env-script` instead of `script` in `[runtime][root]`).

| | 7.8.10 | 8.0rc1.dev01 |
|---|---|---|
| Wall clock time | 0:25:18 | 1:11:06 |
| User + Sys time (s) | 469.3 | 771.9 |
| % of CPU this job got | 30 | 18 |
| Max resident set size (kB) | 71,664 | 101,748 |
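For reference, numbers like these can be collected by wrapping the scheduler run in GNU time; a sketch, not the exact commands used above (the workflow name is a placeholder, and the `--no-detach` flag keeps the scheduler in the foreground so `time` can observe it):

```bash
# Cylc 7: run the suite in the foreground under GNU time
/usr/bin/time -v cylc run --no-detach my-workflow

# Cylc 8: the equivalent invocation (cylc play replaces cylc run)
/usr/bin/time -v cylc play --no-detach my-workflow

# -v (GNU time) reports wall clock, user/sys CPU, %CPU and max RSS,
# i.e. the four rows in the tables above
```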
P.S. An indication of how verbose the logging has become (at log level INFO): I noticed there were 9 log files for Cylc 8 vs 3 for Cylc 7, and the first 4 of the Cylc 8 log files had already been deleted, as the default is to keep only 5 at any time.
Footnotes
The Cylc 7 results are as expected; the Cylc 8 results less so, and will need investigating. To help reproduce these results:

Ran on my VLD box while not using it (with only a VSCode window with a few tabs, a file manager window and 3 terminal sessions open). Also, I did the Cylc 8 complex test while keeping the VDI connection open during the middle of the day (doing stuff in Windows in the meantime), whereas for the Cylc 7 complex test I started it last thing in the evening and closed the VDI connection to let it run overnight. I don't know whether a difference in load on the VLD servers depending on the time of day, or whether the virtual desktop connection was open or not, would affect the results.
Profiled the small datetime-cycling workflow using memory-profiler (memory-usage graph not reproduced).
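For reference, a minimal sketch of how memory-profiler can be driven from the command line to produce a graph like that; the workflow name is a placeholder:

```bash
# sample the RSS of the scheduler (and its child processes) while it runs
mprof run --include-children cylc play --no-detach small-workflow

# plot the recorded samples as a memory-over-time graph
mprof plot
```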
I've profiled `cylc validate` on the complex workflow. The program as a whole weighs ~3.5 times more than the config alone; the difference adds up to ~38 MB. Measuring the overhead of a "Hello World" Python script I get 6,909,952 B (~7 MB); subtracting that from the 38 MB leaves us with ~30 MB of unaccounted memory.
This is approximately 60% of the total usage. From introspection it looks like the bulk of that is imports: Python does fill up memory with a lot of loaded modules.

Methodology

Obtaining the inner measurement:

```diff
diff --git a/cylc/flow/scripts/validate.py b/cylc/flow/scripts/validate.py
index 53a9b824f..d1c134323 100755
--- a/cylc/flow/scripts/validate.py
+++ b/cylc/flow/scripts/validate.py
@@ -168,3 +168,5 @@ def main(parser: COP, options: 'Values', reg: str) -> None:
     print(cparse('<green>Valid for cylc-%s</green>' % CYLC_VERSION))
     profiler.stop()
+    from pympler.asizeof import asized
+    print(asized(cfg, detail=2).format(depth=2))
```

```console
$ cylc validate complex
...
<cylc.flow.config.WorkflowConfig object at 0x7f80e9030ad0> size=14322992 flat=64
    __dict__ size=14322928 flat=416
        [V] taskdefs: {'ensemble_install_startdata_cold': <c....def.TaskDef object at 0x7f80eb1dce60>} size=9974040 flat=36976
        [V] edges: {<cylc.flow.cycling.iso8601.ISO8601Seq....mble_forecast_012_tN', False, False)}} size=2080024 flat=248
        [V] pcfg: <cylc.flow.cfgspec.workflow.RawWorkflowConfig object at 0x7f80e7c8f6d0> size=1615208 flat=64
        [V] runtime: {'parents': {'root': [], 'HPC': ['root....fication_to_highway_happyland_yay2']}} size=625096 flat=248
        [V] ns_defn_order: ['root', 'HPC', 'HPC_BACKUP', 'CORETYP....tification_to_highway_happyland_yay2'] size=10040 flat=10040
        [V] leaves: ['archive_logs', 'brief_archive', 'bri....', 'recover1', 'recover2', 'recover3'] size=9200 flat=9200
        [V] clock_offsets: {'long_start_00': <cylc.flow.cycling.i....601Interval object at 0x7f80ea82f5d0>} size=1736 flat=656
        [V] options: <Values at 0x7f80e565c690: {'check_cir....emplatevars_file': None, 'icp': None}> size=1688 flat=64
```

Obtaining the outer measurement:

```console
$ # (after reversing the above diff)
$ /usr/bin/time -lp cylc validate complex  # note: BSD time args, use -v for GNU time
...
real 1.96
user 1.85
sys 0.10
51855360 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
24744 page reclaims
0 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
8 voluntary context switches
170 involuntary context switches
```

Obtaining the Python baseline:

```console
$ /usr/bin/time -lp python -c 'print("hello world")'
hello world
real 0.03
user 0.02
sys 0.00
6909952 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
1706 page reclaims
0 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
0 voluntary context switches
2 involuntary context switches
```
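As a cross-check of the arithmetic above, with the numbers copied straight from the outputs (a sketch; integer division, so the result is rounded down):

```bash
outer=51855360    # max RSS of cylc validate
config=14322992   # pympler size of the WorkflowConfig
baseline=6909952  # max RSS of a hello-world Python process
echo "unaccounted: $(( (outer - config - baseline) / 1024 / 1024 )) MB"  # ~29 MB
```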
Simple test comparing Python with and without importing `cylc.flow.scheduler`:
```console
$ cat test.py
from cylc.flow.scheduler import Scheduler
$ /usr/bin/time -lp python test.py
real 0.47
user 0.34
sys 0.10
40615936 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
10017 page reclaims
0 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
49 voluntary context switches
73 involuntary context switches
$ /usr/bin/time -lp python -c 'print("hello world")'
hello world
real 0.03
user 0.02
sys 0.00
6909952 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
1706 page reclaims
0 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
0 voluntary context switches
1 involuntary context switches
```
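One way to drill into where that import cost goes (a follow-up sketch, not part of the original measurements) is CPython's built-in `-X importtime` flag, which reports per-module import times:

```bash
# per-module import timing table, written to stderr (Python 3.7+)
python -X importtime -c 'import cylc.flow.scheduler' 2> importtime.log

# columns are "self time | cumulative time | module"; sort by cumulative
# time to see which subtrees dominate
sort -t'|' -k2 -n importtime.log | tail -20
```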
If I strip `cylc/flow/scripts/validate.py` down to:

```python
def main(*a, **k) -> None:
    """cylc validate CLI."""
    return
```

Max RSS: 19,248 kB

Adding the imports back in:

```python
from ansimarkup import parse as cparse
from optparse import Values
import sys
import textwrap

from cylc.flow import LOG, __version__ as CYLC_VERSION
from cylc.flow.config import WorkflowConfig
from cylc.flow.exceptions import (
    WorkflowConfigError,
    TaskProxySequenceBoundsError,
    TriggerExpressionError
)
import cylc.flow.flags
from cylc.flow.loggingutil import disable_timestamps
from cylc.flow.option_parsers import (
    CylcOptionParser as COP,
    Options
)
from cylc.flow.profiler import Profiler
from cylc.flow.task_proxy import TaskProxy
from cylc.flow.templatevars import get_template_vars
from cylc.flow.terminal import cli_function
from cylc.flow.workflow_files import parse_reg


def main(*a, **k) -> None:
    """cylc validate CLI."""
    return
```

Max RSS: 31,520 kB

And with the unmodified file I get max RSS: 35,632 kB.

In graphical form: (chart not reproduced). For comparison in Cylc 7 the full ...
SnakeViz for ... (interactive profile graph not reproduced).
That's mostly sleep. We can reduce the main loop sleep for more responsiveness; however, that raises CPU usage.
Here's the complex workflow running with memory tracking (graph not reproduced). The largest object, at ~12 MB, is the config, followed by the task pool at ~4 MB; nothing to explain the >100 MB reported, which suggests the remaining memory is perhaps not associated directly with the Scheduler object? The orange line sloping upwards, which looks like a memory leak, is the data that generated this graph accumulating.
That was deliberate (on my part anyway) to help with Cylc 8 development; I expected to pare it back a lot once things stabilized a bit.
@MetRonnie - did you do this for Cylc 7 too?
No, Python 2 vs 3 might make it difficult, but I'll try.
The Cylc 7 ... (SnakeViz profile not reproduced).

So a lot of CPU time is taken up by DB ops in Cylc 7 too.
Not sure I understand the time difference: it seems kinda impossible... It's almost as if the job communication isn't working and it's switched to polling mode... Is the memory overhead the loaded libraries?
How can you tell if the job communication is happening via polling? I'm getting ... in job scripts (example and log excerpts not reproduced).
Let's say there are 5000 sequential tasks (maybe more in parallel); if each job/task message takes 200 ms longer, with 3 messages per job, that's 3000 s or 50 min extra, which would account for the difference (I don't know the exact numbers). I'm, of course, assuming that what we found previously (the time spent loading network libraries, the auth, etc.) is the issue. This is still a concern, as someone could use Cylc for running many small jobs in the same way... We need to find out exactly where.
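The back-of-envelope arithmetic, for the record (0.2 s written as a division by 5 to stay in integers):

```bash
# 5000 jobs x 3 messages x 0.2 s extra per message
echo "$((5000 * 3)) messages -> $((5000 * 3 / 5)) s -> $((5000 * 3 / 5 / 60)) min"
# 15000 messages -> 3000 s -> 50 min
```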
Debug mode will tell you how the messages are being received (zmq or otherwise) in the workflow log.
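For example (a sketch; the exact log wording and log path vary between versions, so the grep pattern and path are assumptions):

```bash
# run the workflow in debug mode so message receipt is logged
cylc play --debug --no-detach my-workflow

# then search the scheduler log for how task messages arrived
grep -ri 'message' ~/cylc-run/my-workflow/log/ | less
```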
What does Cylc 7 look like? Looks like that's it... If so, then at least it's not the Scheduler internals (I suppose).
I think we now omit the contact file installation for polling platforms. You can also run the workflow in debug mode.
We run ... (excerpt not reproduced). It does wait for the started message, but only at the end of the job.
Comparing the same tasks (all ...): (comparison not reproduced).
We should create some sort of job profiler, because that's obviously where the issue is. We can find out exactly what's taking so long by fiddling with ... The auth will be doing file reads, and we could print out timing for library loading, client/connection instantiation, and message send/receive.
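A rough sketch of the kind of instrumentation meant here, run from inside a job environment; treat it as illustrative (in particular, `cylc message` reading the workflow/job IDs from the job's `CYLC_*` environment variables, and the module path `cylc.flow.network.client`, are assumptions):

```bash
#!/usr/bin/env bash
# crude job-side profiling: where does a task message spend its time?

# cost of loading the client libraries alone (zmq, auth machinery, etc.)
time python -c 'import cylc.flow.network.client'

# end-to-end cost of one message (library load + auth file reads +
# connection instantiation + send/ack)
time cylc message -- "custom message"
```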
I think the only reason going via the UIS could be faster is that it already has the libraries loaded and a client/connection instantiated. If we find this is it, one option may be a long-lived client for the life of the job (which may mean having the job wrapper in Python)... But I don't know, it could introduce other delays I suppose... Surely there's a way to slim down the client end with ZMQ (again, assuming this is the issue).
(coming back to this issue after we have a release out)
Here's a basic auto-profiling script I wrote:

Basic profiling script: `./profile`

```bash
#!/usr/bin/env bash
# (rest of script not preserved)
```
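The body of the script didn't survive; purely as a sketch of what such a wrapper might do, based on the `/usr/bin/time` approach used elsewhere in this thread (every name and flag below is an assumption, not the original script):

```bash
#!/usr/bin/env bash
# run a workflow under GNU time and collect the headline numbers
set -eu

WORKFLOW="$1"   # e.g. "complex"
OUT="profile-results-$(date +%Y%m%dT%H%M%S).txt"

# -v prints wall clock, user/sys CPU, %CPU and max RSS to stderr
/usr/bin/time -v cylc play --no-detach "$WORKFLOW" 2> "$OUT"

grep -E 'Elapsed|User time|System time|Percent of CPU|Maximum resident' "$OUT"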
Cylc 7 had an on-board automated profile battery which provided us with insights into Cylc's scaling characteristics.
This profile battery will need a total re-write for Cylc 8 (there is an issue for that: #38). However, that is a relatively large job and one for which we don't have time at the moment, so I suggest manually running some of the old scaling tests to give us an idea of how scaling has changed and how Cylc 8 compares against Cylc 7.
Quick overview of the main scaling dimensions of Cylc: (list not preserved).
These dimensions are targeted by suites located in `etc/dev-suites` in the Cylc 7 source code. For MO people, some (rather dated) results from the old automated profile battery can be found here: https://www-nwp/~osanders/profiling-results/2016/
To test scaling you will need to run the same experiment with different template variables.
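For example, something like the following (a sketch; `TASKS` is a hypothetical template variable name, though `-s`/`--set` is the real mechanism for overriding Jinja2 template variables; timing is covered just below):

```bash
# same workflow at several sizes, by overriding a template variable
for n in 100 1000 10000; do
    cylc validate complex -s "TASKS=$n"
done
```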
To profile them I suggest using `/usr/bin/time`, as we have a good level of understanding of this command and results will be comparable to prior tests. In the output we are interested in wall clock time, user + sys CPU time, % CPU, and maximum resident set size (the metrics used in the tables above).
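With GNU time these map onto the following fields of the `-v` output (shown for reference; the field names are standard GNU time output):

```bash
/usr/bin/time -v sleep 1 2>&1 | grep -E \
    'Elapsed \(wall clock\)|User time|System time|Percent of CPU|Maximum resident'
# Elapsed (wall clock) time      -> "Wall clock time"
# User time + System time (s)    -> "User + Sys time (s)"
# Percent of CPU this job got    -> "% of CPU this job got"
# Maximum resident set size (kB) -> "Max resident set size (kB)"
```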
For fair tests you need to run all experiments on the same hardware and ensure the machine is otherwise idle. Running on a cluster may be a good idea, but you may need to claim a whole node to meet these conditions (with the obvious resource-wastage implications).
We don't have any up-to-date results for Cylc 7, so you will need to profile that too; however, you can use the automated profile battery there if you like. I think you run it like so:

```console
$ cylc profile-battery -e complex -v 7.x.x
```