
[cpu] higher check runner counts may produce CPU spikes #919

Open
truthbk opened this issue Dec 8, 2017 · 4 comments
Labels
[deprecated] team/agent-core Deprecated. Use metrics-logs / shared-components labels instead..

Comments

@truthbk (Member) commented Dec 8, 2017

Describe what happened:

Due to the way we process and schedule checks, when the number of check runner goroutines is high there's a chance of experiencing CPU spikes.

As discussed, we believe this happens because we schedule checks to run at fixed intervals: when the number of runners is high, and in particular when the checks are likely to wait on system calls (check instances, i.e. check runs, release the GIL while waiting for the OS to return), the number of Python checks running concurrently goes up, and CPU utilization goes up with it. A lower number of check runners reduces the concurrency and lowers the CPU utilization.

A single Python check runner (aside from long-running checks) replicates the Agent 5 behavior, where instances ran serially, resulting in the lowest possible CPU load.

Overall CPU usage was also higher even after averaging, so this isn't just a spike but an increase in total CPU load (scheduling and context-switching overhead?).

Describe what you expected:

A flatter, lower CPU profile/footprint would be preferable.

Steps to reproduce the issue:

  • High check runner count
  • IO-intensive checks (waiting on system calls increases concurrent work in the Python interpreter) with multiple instances

Possible fixes

  • Reduce the number of workers.
  • Modify instance scheduling: for example, spread the execution of checks over the interval, as opposed to queueing everything at the start of each collection iteration, where all work becomes available at once.
  • Do nothing and embrace the higher concurrency at the expense of unexpected spikes. Probably not a satisfactory solution, since the overall CPU footprint also increases.
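A minimal sketch of the second option, spreading execution over the interval. The `schedule_spread` helper and its shape are hypothetical (the real scheduler lives in the Go collector); this only illustrates the idea:

```python
import threading
import time

def schedule_spread(instances, interval, run):
    """Offset each instance's first run so the instances are spread
    evenly across the collection interval, instead of all of them
    becoming runnable at the top of the interval."""
    step = interval / max(len(instances), 1)
    timers = []
    for i, inst in enumerate(instances):
        # offset i * step staggers the start of each instance
        t = threading.Timer(i * step, run, args=(inst,))
        t.start()
        timers.append(t)
    return timers
```

The trade-off masci raises below applies here: with Autodiscovery changing the instance count at any time, keeping these offsets balanced adds real complexity.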
@olivielpeau (Member) commented:
On a machine with 2 cores, running about 15 Python checks total (most of them process checks), here's a typical CPU graph depending on the number of check runners that are running "real" checks (i.e. excluding the runners that run long-running checks):

[Screenshot: CPU usage graph at varying check runner counts, 2017-12-08]

The spikes happen every 2 minutes because that's the frequency at which process checks refresh their caches (when they do, they call psutil.process_iter and iterate over all processes to find matching processes).

@olivielpeau (Member) commented:

I've done some more testing: this behavior of the Python runtime can be reproduced outside of Agent 6, with 2 simple Python scripts that mimic the behavior of the process check.

Given:

```python
import psutil  # third-party: pip install psutil

def list_psutil_processes():
    for p in psutil.process_iter():
        print(p.name())
```

Every n seconds (accounting for the total execution time of each sequence to compute the next run), we run directly with python:

  • Script 1 runs list_psutil_processes in 20 threading.Threads that are all started concurrently (at the same time).
  • Script 2 runs list_psutil_processes in 20 threading.Threads started sequentially (each one is started only once the previous one has joined).
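The scripts themselves aren't attached to the issue; a rough reconstruction, with the target function parameterized so that the start/join pattern is the only difference between the two (the `run_*` names are mine):

```python
import threading

def list_psutil_processes():
    import psutil  # third-party: pip install psutil
    for p in psutil.process_iter():
        print(p.name())

def run_concurrent(target=list_psutil_processes, n=20):
    # Script 1: start all threads at once, then wait for all of them
    threads = [threading.Thread(target=target) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

def run_sequential(target=list_psutil_processes, n=20):
    # Script 2: start each thread only after the previous one has joined
    for _ in range(n):
        t = threading.Thread(target=target)
        t.start()
        t.join()
```

With psutil-heavy work, the concurrent variant keeps many threads bouncing in and out of the GIL at once, which is consistent with the CPU difference measured below.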

On Dev account:
https://app.datadoghq.com/dash/417599?live=false&page=0&is_auto=false&from_ts=1512757351829&to_ts=1512758196361&tile_size=m

The scripts use on average, on an 8-core Linux VM:

  • 90% of a CPU core for script 1
  • 10% of a CPU core for script 2

@masci (Contributor) commented Dec 8, 2017

In general, I'm more interested in getting numbers that describe how much overhead a highly concurrent scheduler adds to the metrics collection cycle. The use case here (IO-intensive checks with multiple instances) looks a lot like a corner case, and I'd like to collect more info and feedback before claiming we have a fire to put out.

With this in mind, and regarding the possible fixes:

Modify instance scheduling

I strongly advise against this: the implementation would add significant complexity, especially considering that Autodiscovery can easily change the number of check instances running at any given time.

Reduce the number of workers.

This can be done to some extent to provide a more reasonable default, but IMO we should still prefer concurrency. Users should be able to reduce it, even drastically, but only if spikes happen and when that represents a problem.

@xvello (Contributor) commented Dec 11, 2017

Spiking CPU/memory usage will cause issues with the Docker agent if limits are set:

  • if the CPU limit is reached, the whole container is frozen for a bit, possibly making dogstatsd unresponsive
  • if the memory limit is reached, the OOM killer will kill one process in the container

This means that if we want reliable behaviour in containers, we must aim for the flattest resource usage profile possible. Could we "autoscale" the runner count depending on the total collection time?
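A sketch of what that autoscaling could look like, purely as an illustration: `scale_runners` and its thresholds are made up, not Agent code, and the real runner pool is in Go.

```python
def scale_runners(current, collection_time, interval,
                  min_runners=1, max_runners=8):
    """Pick a runner count from the last collection cycle's duration:
    shrink toward serial execution when there is plenty of headroom
    (flatter CPU), grow only when a cycle risks overrunning its
    interval. Thresholds (0.8 / 0.4) are arbitrary for this sketch."""
    utilization = collection_time / interval
    if utilization > 0.8 and current < max_runners:
        return current + 1   # cycle nearly overran: add a worker
    if utilization < 0.4 and current > min_runners:
        return current - 1   # lots of headroom: shrink toward serial
    return current
```

The two thresholds leave a dead band in between so the worker count doesn't oscillate between consecutive cycles.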

@masci masci added the [deprecated] team/agent-core Deprecated. Use metrics-logs / shared-components labels instead.. label Dec 18, 2017