
[BUG] Metrics values are wrong when using serverless-init #19446

Closed
terlenbach opened this issue Sep 13, 2023 · 18 comments · Fixed by #20507 · May be fixed by #19447

Comments

@terlenbach

Problem summary

The statsd agent launched by serverless-init pushes incorrect metric values.

Problem description

On macOS, Agent version 7.47.0

Given the very simple Python program:

from time import sleep
from datadog import DogStatsd

client = DogStatsd()
counter = 0
while True:
    client.increment('test.terlenbach.page.views')
    counter += 1
    sleep(1)

which increments a counter every second.

First run

I run a DogStatsd agent via Docker like this:

docker run -it -p8125:8125/udp -e DD_SERVICE=test-terlenbach -e DD_ENV=dev -e DD_HOSTNAME=test -e DD_LOG_LEVEL=debug -e DD_SITE=datadoghq.eu -e DD_DOGSTATSD_NON_LOCAL_TRAFFIC=true -e DD_LOG_PAYLOADS=true datadog/dogstatsd

and then run the Python application.

Second run

I run the same Python application using serverless-init via this command line:

DD_SERVICE=test-terlenbach DD_ENV=dev DD_SITE=datadoghq.eu DD_LOG_LEVEL=DEBUG DD_LOG_PAYLOADS=true ./serverless-init python ./app.py

Results

[Screenshot: graph comparing the metric values reported for the two runs]

The purple line corresponds to the first run, i.e. using dogstatsd.
The blue line corresponds to the second run, i.e. using serverless-init.

Interesting logs related to the flush of the metrics for the first run:

2023-09-13 15:12:10 UTC | DSD | DEBUG | (pkg/aggregator/demultiplexer.go:119 in func1) | Flushing serie: {"metric":"test.terlenbach.page.views","points":[[1694617910,0],[1694617920,0.8]],"tags":["env:dev","service:bg-test-terlenbach","version:1"],"host":"test","type":"rate","interval":10} 

2023-09-13 15:12:24 UTC | DSD | DEBUG | (pkg/aggregator/demultiplexer.go:119 in func1) | Flushing serie: {"metric":"test.terlenbach.page.views","points":[[1694617930,1]],"tags":["env:dev","service:bg-test-terlenbach","version:1"],"host":"test","type":"rate","interval":10}

2023-09-13 15:12:39 UTC | DSD | DEBUG | (pkg/aggregator/demultiplexer.go:119 in func1) | Flushing serie: {"metric":"test.terlenbach.page.views","points":[[1694617940,1]],"tags":["env:dev","service:bg-test-terlenbach","version:1"],"host":"test","type":"rate","interval":10}

2023-09-13 15:12:54 UTC | DSD | DEBUG | (pkg/aggregator/demultiplexer.go:119 in func1) | Flushing serie: {"metric":"test.terlenbach.page.views","points":[[1694617960,1],[1694617950,1]],"tags":["env:dev","service:bg-test-terlenbach","version:1"],"host":"test","type":"rate","interval":10}

2023-09-13 15:13:09 UTC | DSD | DEBUG | (pkg/aggregator/demultiplexer.go:119 in func1) | Flushing serie: {"metric":"test.terlenbach.page.views","points":[[1694617970,1]],"tags":["env:dev","service:bg-test-terlenbach","version:1"],"host":"test","type":"rate","interval":10} 

Interesting logs related to the flush of the metrics for the second run:

2023-09-13 17:17:31 CEST | SERVERLESS_INIT | DEBUG | (pkg/aggregator/demultiplexer.go:147 in func1) | Flushing serie: {"metric":"test.terlenbach.page.views","points":[[1694618240,0.1],[1694618250,0.2]],"tags":["env:dev","service:bg-test-terlenbach","version:1","datadog_init_version:xxx"],"host":"","type":"rate","interval":10}

2023-09-13 17:17:34 CEST | SERVERLESS_INIT | DEBUG | (pkg/aggregator/demultiplexer.go:147 in func1) | Flushing serie: {"metric":"test.terlenbach.page.views","points":[[1694618250,0.3]],"tags":["env:dev","service:bg-test-terlenbach","version:1","datadog_init_version:xxx"],"host":"","type":"rate","interval":10}

2023-09-13 17:17:37 CEST | SERVERLESS_INIT | DEBUG | (pkg/aggregator/demultiplexer.go:147 in func1) | Flushing serie: {"metric":"test.terlenbach.page.views","points":[[1694618250,0.3]],"tags":["env:dev","service:bg-test-terlenbach","version:1","datadog_init_version:xxx"],"host":"","type":"rate","interval":10}

2023-09-13 17:17:40 CEST | SERVERLESS_INIT | DEBUG | (pkg/aggregator/demultiplexer.go:147 in func1) | Flushing serie: {"metric":"test.terlenbach.page.views","points":[[1694618250,0.2],[1694618260,0.1]],"tags":["env:dev","service:bg-test-terlenbach","version:1","datadog_init_version:xxx"],"host":"","type":"rate","interval":10}

2023-09-13 17:17:43 CEST | SERVERLESS_INIT | DEBUG | (pkg/aggregator/demultiplexer.go:147 in func1) | Flushing serie: {"metric":"test.terlenbach.page.views","points":[[1694618260,0.3]],"tags":["env:dev","service:bg-test-terlenbach","version:1","datadog_init_version:xxx"],"host":"","type":"rate","interval":10}

As we can see, the metrics sent to Datadog differ depending on whether we use serverless-init or dogstatsd, and the values sent when using serverless-init are wrong.

@IvanTopolcic
Contributor

This is actually a known limitation and why we advise users to only use distribution metrics, which use server-side aggregation. When you call increment, you're implicitly using a count metric, which is aggregated on the client side and can lead to erroneous values. When we have multiple containers with the same hostname (i.e., when your service scales up) in Cloud Run, Container Apps, etc., these metrics are overwritten on the backend instead of aggregated, which can lead to incorrect values in some cases.

I did see your PR, but I don't believe it'd address the problem as a whole; it might only work in this one specific case. I'd recommend trying distribution metrics and checking whether the problem persists, as that's our recommended solution.

@terlenbach
Author

@IvanTopolcic thanks for answering.
I did the same test with a distribution metric and the results also differ depending on whether I use dogstatsd or serverless-init. However, if I use the “fixed” version of serverless-init, the metrics are the same.

@terlenbach
Author

terlenbach commented Sep 15, 2023

@IvanTopolcic One of my colleagues already created a PR to add the value of the environment variable CONTAINER_APP_REPLICA_NAME as a metric dimension so that metrics don't get overwritten on scale-up: #19332.

This issue concerns another problem: the aggregation is wrong in the serverless version of dogstatsd, and hence, even with one replica, the metrics are wrong.

@IvanTopolcic
Contributor

We do appreciate the contributions!

In your testing, did you create a new metric once you switched from non-distribution to distribution metrics? If you just change the name, it'll still have some non-distribution metric behavior in the backend. I recommend trying the same thing but with a completely new metric. You should be able to view the metrics summary page and the metric should be classified as a serverless / distribution metric.

@terlenbach
Author

I did create a new metric. The problem is that the aggregation performed by dogstatsd before sending the metrics to Datadog is wrong (I can see in the logs that the payload sent to Datadog contains the metrics with wrong values).

As you saw in my PR, IMHO the problem comes from the extra 10 seconds added when calling ForceFlushToSerializer in the serverless version. For series at least, these extra 10 seconds put us in a situation where the current bucket always gets closed: in https://github.com/DataDog/datadog-agent/blob/main/pkg/aggregator/time_sampler.go#L124 and https://github.com/DataDog/datadog-agent/blob/main/pkg/aggregator/time_sampler.go#L71 we see that a bucket gets closed when bucketStartTimestamp+s.interval < timestamp, which is always true if we always add 10 seconds to the current timestamp. If you look at the logs shared in this issue, you'll notice that the metric value is wrong.
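
To make the condition easier to follow, here is a minimal Python sketch of the bucket-close check described above (illustrative only: the actual logic lives in pkg/aggregator/time_sampler.go and is written in Go; the timestamps are just examples):

BUCKET_INTERVAL = 10  # seconds, same as the flush interval in the logs

def bucket_start(timestamp):
    # Truncate a timestamp to the start of its 10-second bucket.
    return timestamp - (timestamp % BUCKET_INTERVAL)

def is_bucket_closed(bucket_start_ts, flush_timestamp):
    # Mirrors the condition bucketStartTimestamp + s.interval < timestamp.
    return bucket_start_ts + BUCKET_INTERVAL < flush_timestamp

now = 1694618243                    # a flush happens "now"
current_bucket = bucket_start(now)  # 1694618240, still being filled

# Regular flush: the current bucket stays open until its 10 seconds are over.
print(is_bucket_closed(current_bucket, now))       # False

# Serverless flush with the extra 10 seconds: the same bucket is always closed
# early, so a partial value is sent now and a different value is sent again at
# the next flush.
print(is_bucket_closed(current_bucket, now + 10))  # True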

To give you more information, the problem only arises when this flush statement is called: https://github.com/DataDog/datadog-agent/blob/main/cmd/serverless-init/main.go#L115, which means that if my program runs for less than 3 seconds, the metrics sent while the program is terminating are correct.

@IvanTopolcic
Contributor

IvanTopolcic commented Sep 15, 2023

Hmm, I'm fairly certain this is the shadowing issue I mentioned above. My guess is that somehow the metric you're using isn't being properly marked as a non-aggregate distribution metric. As an example, I copied your code and changed it to use a distribution metric:

from time import sleep
from datadog import DogStatsd

client = DogStatsd()
counter = 0
while True:
    client.distribution('test.terlenbach.page.views', 1)
    counter += 1
    sleep(1)

It's correctly sending one metric every second, as you can see below:

[Screenshot: distribution metric graph showing 10 points per 10-second interval]

(One interval is 10 seconds, so we see 10 metrics).

The bucketing doesn't matter for distribution metrics, as all of the aggregation is done server-side. There shouldn't be anything happening on the client that would modify your distribution metrics. If you're using a non-distribution metric, or your metric is incorrectly set to a non-aggregate mode (this is why I asked you to create a new metric), it's possible that the backend drops some of the payloads because it believes they are duplicates and uses only the last timestamp.

If you're sending distribution metrics, you should see a Flushing Sketches message similar to the one below:

2023-09-15 17:10:33 UTC | SERVERLESS_INIT | DEBUG | Flushing Sketches: {"metric":"test.terlenbach.page.views","tags":["env:datadog-demo","service:datadog-demo-run-python","version:1","datadog_init_version:1.1.1"],"host":"","interval":10,"points":[{"sketch":{"summary":{"Min":1,"Max":1,"Sum":3,"Avg":1,"Cnt":3}},"ts":1694797830}]}

Notice that we flush every 3 seconds and send the metric every 1s, as in your example. In the above payload we can see that we're correctly getting a count of 3 metrics.

@IvanTopolcic
Contributor

To add one more thing, introducing this change would blow up the cardinality of metric data, which might cause billing to increase drastically, especially if new containers are spun up often. If you can verify with 100% certainty that creating a completely new metric with a different name, and only ever sending it as a distribution metric, still doesn't work, we can go from there; but as you can see above, it does submit the correct number of metrics.

We've had customers in the past who initially sent metrics as a gauge or something else and then swapped to distribution. While this may appear to work, it doesn't, and it still causes the overwriting issue, leading to metric loss. The only way to avoid this is to submit a completely new metric entirely and make sure it's never anything besides a distribution metric.

@terlenbach
Author

@IvanTopolcic thanks for taking the time to test the distribution metric.
So I tried the distribution metric again, and indeed the values reported by Datadog seem fine. I had actually stopped at what was sent to Datadog, and that's why I believed this metric was also wrong; let me explain a bit more.

Here is what I get with serverless-init:

2023-09-16 10:48:44 CEST | SERVERLESS_INIT | DEBUG | (pkg/aggregator/demultiplexer.go:156 in func2) | Flushing Sketches: {"metric":"test.terlenbach.new.distribution","tags":["env:dev","service:bg-test-terlenbach","version:1","datadog_init_version:xxx"],"host":"","interval":10,"points":[{"sketch":{"summary":{"Min":1,"Max":1,"Sum":3,"Avg":1,"Cnt":3}},"ts":1694854120}]}
2023-09-16 10:48:47 CEST | SERVERLESS_INIT | DEBUG | (pkg/aggregator/demultiplexer.go:156 in func2) | Flushing Sketches: {"metric":"test.terlenbach.new.distribution","tags":["env:dev","service:bg-test-terlenbach","version:1","datadog_init_version:xxx"],"host":"","interval":10,"points":[{"sketch":{"summary":{"Min":1,"Max":1,"Sum":3,"Avg":1,"Cnt":3}},"ts":1694854120}]}
2023-09-16 10:48:50 CEST | SERVERLESS_INIT | DEBUG | (pkg/aggregator/demultiplexer.go:156 in func2) | Flushing Sketches: {"metric":"test.terlenbach.new.distribution","tags":["env:dev","service:bg-test-terlenbach","version:1","datadog_init_version:xxx"],"host":"","interval":10,"points":[{"sketch":{"summary":{"Min":1,"Max":1,"Sum":2,"Avg":1,"Cnt":2}},"ts":1694854120},{"sketch":{"summary":{"Min":1,"Max":1,"Sum":1,"Avg":1,"Cnt":1}},"ts":1694854130}]}

And this is what I get with dogstatsd:

2023-09-16 08:49:32 UTC | DSD | DEBUG | (pkg/aggregator/demultiplexer.go:128 in func2) | Flushing Sketches: {"metric":"test.terlenbach.new.distribution","tags":["env:dev","service:bg-test-terlenbach","version:1"],"host":"test","interval":10,"points":[{"sketch":{"summary":{"Min":1,"Max":1,"Sum":7,"Avg":1,"Cnt":7}},"ts":1694854160}]}
2023-09-16 08:49:47 UTC | DSD | DEBUG | (pkg/aggregator/demultiplexer.go:128 in func2) | Flushing Sketches: {"metric":"test.terlenbach.new.distribution","tags":["env:dev","service:bg-test-terlenbach","version:1"],"host":"test","interval":10,"points":[{"sketch":{"summary":{"Min":1,"Max":1,"Sum":10,"Avg":1,"Cnt":10}},"ts":1694854170}]}
2023-09-16 08:50:02 UTC | DSD | DEBUG | (pkg/aggregator/demultiplexer.go:128 in func2) | Flushing Sketches: {"metric":"test.terlenbach.new.distribution","tags":["env:dev","service:bg-test-terlenbach","version:1"],"host":"test","interval":10,"points":[{"sketch":{"summary":{"Min":1,"Max":1,"Sum":10,"Avg":1,"Cnt":10}},"ts":1694854180},{"sketch":{"summary":{"Min":1,"Max":1,"Sum":10,"Avg":1,"Cnt":10}},"ts":1694854190}]}

We can see that with serverless-init different values are sent for the same bucket (with the same ts), whereas with dogstatsd, there is always one value for each bucket.

From what I understand, with distribution metrics no deduplication is performed on the backend side, and values are aggregated even though they come with the same timestamp. So, in this case, yes, distribution metrics work; at least, the values reported by Datadog are fine.
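
As a rough illustration of that understanding (this is an assumption about the server-side merge, simplified to plain counts; real distribution points are sketches, not bare counters):

from collections import defaultdict

# (timestamp, Cnt) pairs taken from the serverless-init sketch logs above.
flushed_points = [
    (1694854120, 3),  # first flush
    (1694854120, 3),  # second flush, same bucket again
    (1694854120, 2),  # third flush, same bucket once more
    (1694854130, 1),  # third flush, next bucket
]

# Assumed backend behavior: points sharing a timestamp are merged, so the
# partial counts for a bucket add up instead of overwriting each other.
merged = defaultdict(int)
for ts, cnt in flushed_points:
    merged[ts] += cnt

print(dict(merged))  # {1694854120: 8, 1694854130: 1}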

The behavior between dogstatsd and serverless-init differs though, even with distribution metrics.

Interestingly, this is what I get with the modified version of serverless-init (i.e. the one from the linked PR):

2023-09-16 10:55:30 CEST | SERVERLESS_INIT | DEBUG | (pkg/aggregator/demultiplexer.go:156 in func2) | Flushing Sketches: {"metric":"test.terlenbach.new.distribution","tags":["env:dev","service:bg-test-terlenbach","version:1","datadog_init_version:xxx"],"host":"","interval":10,"points":[{"sketch":{"summary":{"Min":1,"Max":1,"Sum":4,"Avg":1,"Cnt":4}},"ts":1694854520}]}
2023-09-16 10:55:42 CEST | SERVERLESS_INIT | DEBUG | (pkg/aggregator/demultiplexer.go:156 in func2) | Flushing Sketches: {"metric":"test.terlenbach.new.distribution","tags":["env:dev","service:bg-test-terlenbach","version:1","datadog_init_version:xxx"],"host":"","interval":10,"points":[{"sketch":{"summary":{"Min":1,"Max":1,"Sum":10,"Avg":1,"Cnt":10}},"ts":1694854530}]}
2023-09-16 10:55:51 CEST | SERVERLESS_INIT | DEBUG | (pkg/aggregator/demultiplexer.go:156 in func2) | Flushing Sketches: {"metric":"test.terlenbach.new.distribution","tags":["env:dev","service:bg-test-terlenbach","version:1","datadog_init_version:xxx"],"host":"","interval":10,"points":[{"sketch":{"summary":{"Min":1,"Max":1,"Sum":10,"Avg":1,"Cnt":10}},"ts":1694854540}]}
2023-09-16 10:56:00 CEST | SERVERLESS_INIT | DEBUG | (pkg/aggregator/demultiplexer.go:156 in func2) | Flushing Sketches: {"metric":"test.terlenbach.new.distribution","tags":["env:dev","service:bg-test-terlenbach","version:1","datadog_init_version:xxx"],"host":"","interval":10,"points":[{"sketch":{"summary":{"Min":1,"Max":1,"Sum":10,"Avg":1,"Cnt":10}},"ts":1694854550}]}

This is the same behavior as dogstatsd: only one value per timestamp. Should the serverless version of dogstatsd behave differently from the classic one when it comes to computing sketches?

Anyway, I do acknowledge that distribution metrics work with serverless-init (but we don't want to use them, as this adds more dimensions to our metrics and makes them more expensive). However, if we focus on count metrics, we can see that the client-side aggregation is wrong, and I still think this comes from those extra 10 seconds :)

@IvanTopolcic
Contributor

Yep, we can definitely change the flushing behavior so that at least it isn't as confusing. I agree -- although it isn't wrong per se, it's a little confusing if you're just looking at the log output.

@IvanTopolcic
Contributor

IvanTopolcic commented Sep 25, 2023

Would you be able to test if #19724 fixes the issue for you? This won't fix the issue of backend payloads being dropped, but it should fix the client-side aggregation having incorrect values.

@terlenbach
Author

Would you be able to test if #19724 fixes the issue for you? This won't fix the issue of backend payloads being dropped, but it should fix the client-side aggregation having incorrect values.

This is what I get:

2023-09-26 09:20:03 CEST | SERVERLESS_INIT | DEBUG | (pkg/aggregator/demultiplexer.go:147 in func1) | Flushing serie: {"metric":"test.terlenbach.increment","points":[[1695712790,0.6],[1695712800,0.4]],"tags":["datadog_init_version:xxx","env:dev","service:bg-test-terlenbach","version:1"],"host":"","type":"rate","interval":10}
2023-09-26 09:20:13 CEST | SERVERLESS_INIT | DEBUG | (pkg/aggregator/demultiplexer.go:147 in func1) | Flushing serie: {"metric":"test.terlenbach.increment","points":[[1695712800,0.6],[1695712810,0.4]],"tags":["datadog_init_version:xxx","env:dev","service:bg-test-terlenbach","version:1"],"host":"","type":"rate","interval":10}
2023-09-26 09:20:23 CEST | SERVERLESS_INIT | DEBUG | (pkg/aggregator/demultiplexer.go:147 in func1) | Flushing serie: {"metric":"test.terlenbach.increment","points":[[1695712810,0.6],[1695712820,0.4]],"tags":["datadog_init_version:xxx","env:dev","service:bg-test-terlenbach","version:1"],"host":"","type":"rate","interval":10}
2023-09-26 09:20:33 CEST | SERVERLESS_INIT | DEBUG | (pkg/aggregator/demultiplexer.go:147 in func1) | Flushing serie: {"metric":"test.terlenbach.increment","points":[[1695712820,0.6],[1695712830,0.4]],"tags":["datadog_init_version:xxx","env:dev","service:bg-test-terlenbach","version:1"],"host":"","type":"rate","interval":10}
2023-09-26 09:20:43 CEST | SERVERLESS_INIT | DEBUG | (pkg/aggregator/demultiplexer.go:147 in func1) | Flushing serie: {"metric":"test.terlenbach.increment","points":[[1695712830,0.6],[1695712840,0.4]],"tags":["datadog_init_version:xxx","env:dev","service:bg-test-terlenbach","version:1"],"host":"","type":"rate","interval":10}

By interpreting these logs, this is what the client sent:

Flush timestamp    Points sent to DD
1695712803         [[1695712790,0.6],[1695712800,0.4]]
1695712813         [[1695712800,0.6],[1695712810,0.4]]
1695712823         [[1695712810,0.6],[1695712820,0.4]]
1695712833         [[1695712820,0.6],[1695712830,0.4]]
1695712843         [[1695712830,0.6],[1695712840,0.4]]

With these points sent by the client, Datadog will display a count of 6 whereas it should be 10, so unfortunately this does not fix the issue.

Here we can see that every 10 seconds the client sends the values of 2 different buckets: the one that just got closed, and the next one, because some values have already been added to it and we flush with a cutOffTime set to now()+10 seconds (https://github.com/DataDog/datadog-agent/blob/main/comp/dogstatsd/server/server.go#L500).

If we focus on the first log line, at timestamp 1695712803, to me it's normal that the bucket with key 1695712790 is sent, but it's premature to send the one with key 1695712800: buckets cover a time range of 10 seconds and there is no aggregation on the server side, so it should be sent only once, at a timestamp >= 1695712800+10.
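
As a worked illustration of where the 6 comes from (assuming, as discussed above, that for rate series the backend keeps only the last point it receives for a given timestamp):

INTERVAL = 10  # rate interval in seconds

# Bucket 1695712800 is reported twice across consecutive flushes (see the
# table above): first prematurely with rate 0.4, then again with rate 0.6.
reports_in_arrival_order = [0.4, 0.6]

kept_rate = reports_in_arrival_order[-1]  # last write wins -> 0.6
displayed_count = kept_rate * INTERVAL    # 0.6 * 10 = 6
expected_count = 1.0 * INTERVAL           # 1 increment per second -> 10

print(displayed_count, expected_count)    # 6.0 10.0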

@IvanTopolcic
Contributor

Could you update to 1.1.2-rc1 and see if this fixes the bug for you?

@terlenbach
Author

Hi @IvanTopolcic, after a quick test, yes it seems that this new version fixes the bug!

@IvanTopolcic
Contributor

Perfect! We're going to get this fix in the next release. And thanks again for your patience :-)

@terlenbach
Author

Hi @IvanTopolcic, I just saw that version 1.1.2 has officially been released. Unfortunately, it seems that this version behaves differently from 1.1.2-rc1 and that the metrics being sent are still wrong.

@IvanTopolcic
Contributor

Hey @terlenbach sorry for the late response. That's correct, it'll hopefully land in 1.1.3! Again, sorry for the delay.

@IvanTopolcic IvanTopolcic linked a pull request Oct 30, 2023 that will close this issue
@IvanTopolcic IvanTopolcic reopened this Nov 1, 2023
@IvanTopolcic
Contributor

We've released version 1.1.3. Would you mind upgrading and letting us know if this has fixed the issue?

@terlenbach
Author

Hi @IvanTopolcic, I confirm that version 1.1.3 fixes the issue. Thanks.
