[BUG] Metrics values are wrong when using serverless-init #19446
This is actually a known limitation and why we advise users to only use distribution metrics, which use server-side aggregation. When you call increment, you're implicitly using a count metric, which is aggregated on the client side and can lead to erroneous values. When we have multiple containers with the same hostname (i.e., when your service scales up) in Cloud Run, Container Apps, etc., these metrics are overwritten instead of aggregated on the backend, which can lead to incorrect values in some cases. I did see your PR, but I don't believe it'd address the problem as a whole and might only work in this one specific case. I'd recommend trying distribution metrics and checking whether the problem persists, as that's our recommended solution.
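For reference, switching from a count to a distribution metric with the official datadog Python client looks roughly like the sketch below; the metric names and the local DogStatsD address are assumptions for illustration.

```python
from datadog import initialize, statsd

# Assumed local DogStatsD endpoint; adjust host/port to your setup.
initialize(statsd_host="127.0.0.1", statsd_port=8125)

# Count metric: implicitly created by increment(), aggregated client-side,
# and prone to being overwritten when several containers share a hostname.
statsd.increment("example.requests")  # hypothetical metric name

# Distribution metric: each point is aggregated on the server side instead.
statsd.distribution("example.requests.dist", 1)  # hypothetical metric name
```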
@IvanTopolcic thanks for answering.
@IvanTopolcic One of my colleagues already created a PR to add the value of the env variable CONTAINER_APP_REPLICA_NAME as a metric dimension so that metrics don't get overwritten in case of scale-up: #19332. This issue concerns another problem: the aggregation is wrong in the serverless version of dogstatsd, so even with one replica the metrics are wrong.
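For context, the effect described is roughly that of tagging every metric with the replica name; a hypothetical client-side illustration is below (the actual PR adds the dimension inside the agent, not in application code).

```python
import os

from datadog import initialize, statsd

initialize(statsd_host="127.0.0.1", statsd_port=8125)

# CONTAINER_APP_REPLICA_NAME is set per replica by Azure Container Apps;
# using it as a tag keeps metrics from different replicas distinct.
replica = os.environ.get("CONTAINER_APP_REPLICA_NAME", "unknown")
statsd.increment("example.requests", tags=[f"replica_name:{replica}"])  # hypothetical names
```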
We do appreciate the contributions! In your testing, did you create a new metric once you switched from non-distribution to distribution metrics? If you just change the type and keep the same name, it'll still have some non-distribution metric behavior in the backend. I recommend trying the same thing but with a completely new metric. You should be able to view the metrics summary page, and the metric should be classified as a serverless / distribution metric.
I did create a new metric. The problem is that the aggregation performed by dogstatsd before sending the metrics to Datadog is wrong (I can see in the logs that the payload sent to Datadog contains the metrics with wrong values). As you saw in my PR, IMHO the problem comes from the extra 10 seconds added when flushing. To give you more information, the problem arises only when this flush statement is called: https://github.com/DataDog/datadog-agent/blob/main/cmd/serverless-init/main.go#L115, which means that if my program runs for less than 3 seconds, the metrics sent while the program is terminating are correct.
To add one more thing, introducing this change would blow up the cardinality of metric data, which might cause billing to increase drastically, especially if new containers are spun up often. I think if you can verify with 100% certainty that creating a completely new metric with a different name, and ensuring you're sending it as a distribution metric, doesn't work, we can go from there, but as you can see above it does submit the correct amount of metrics. We've had customers in the past who initially sent metrics as a gauge or something else and then swapped to distribution. While this may appear to work, it actually doesn't and still causes the overwriting issue, leading to metric loss. The only way to avoid this is to submit a completely new metric entirely and make sure it's never anything besides a distribution metric.
@IvanTopolcic thanks for taking the time to test the distribution metric. Here is what I get with serverless-init:
And this is what I get with dogstatsd:
We can see that with serverless-init, different values are sent for the same bucket (with the same ts), whereas with dogstatsd there is always one value for each bucket. From what I understand, with distribution metrics no deduplication is performed on the backend side, and values are aggregated even though they come with the same timestamp. So, in this case, yes, distribution metrics work: at least the values reported by Datadog are fine. The behavior between dogstatsd and serverless-init differs, though, even with distribution metrics. Interestingly, this is what I get with the modified version of serverless-init (i.e. the one from the linked PR):
This is the same behavior as dogstatsd: only one value per timestamp. Should the serverless version of dogstatsd behave differently from the classic one when it comes to computing sketches? Anyway, I do acknowledge that distribution metrics work with serverless-init (but we don't want to use them, as this adds more dimensions to our metrics and makes them more expensive). If we focus on count metrics, though, we can see that the client-side aggregation is wrong, and I still think this comes from those extra 10 seconds :)
Yep, we can definitely change the flushing behavior so at least it isn't as confusing. I agree -- although it isn't wrong per se, it's a little confusing if you're just looking at the log output.
Would you be able to test if #19724 fixes the issue for you? This won't fix the issue of backend payloads being dropped, but it should fix the client-side aggregation having incorrect values.
This is what I get:
And by interpreting these logs, this is what the client sent:
With these points sent by the client, Datadog will display a count of 6 whereas it should be 10, so unfortunately this does not fix the issue. Here we can see that every 10 seconds the client sends the values of 2 different buckets: the one that just closed, and the next one, because some values have already been added to it and we ask to flush with a cutOffTime set to now()+10 seconds (https://github.com/DataDog/datadog-agent/blob/main/comp/dogstatsd/server/server.go#L500). If we focus on the first log line, at timestamp 1695712803, it seems normal to me that the bucket with key 1695712790 is sent, but it's premature to send the one with key 1695712800: buckets have a time range of 10 seconds and there is no aggregation on the server side, so it should be sent only once, at a timestamp >= 1695712800+10.
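To make the timing argument concrete, here is a small illustrative sketch (not the agent's actual code) of how 10-second bucket keys relate to the flush times in those logs:

```python
BUCKET_SIZE = 10  # seconds, per the bucket time range described above

def bucket_key(ts: int) -> int:
    # A bucket is keyed by the start of its 10-second window.
    return ts - (ts % BUCKET_SIZE)

now = 1695712803
print(bucket_key(1695712795))         # 1695712790: already closed, fine to send at `now`
print(bucket_key(now))                # 1695712800: still open at `now`
print(bucket_key(now) + BUCKET_SIZE)  # 1695712810: earliest time it is safe to send it
```

Flushing the still-open 1695712800 bucket at 1695712803, with no server-side aggregation to merge a later re-submission, is what produces the undercount described above.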
Could you update to
Hi @IvanTopolcic, after a quick test, yes, it seems that this new version fixes the bug!
Perfect! We're going to get this fix in the next release. And thanks again for your patience :-)
Hi @IvanTopolcic, I just saw that the version
Hey @terlenbach sorry for the late response. That's correct, it'll hopefully land in 1.1.3! Again, sorry for the delay. |
We've released version 1.1.3. Would you mind upgrading and letting us know if this has fixed the issue? |
Hi @IvanTopolcic, I confirm that version 1.1.3 fixes the issue. Thanks. |
Problem summary
The statsd agent launched by serverless-init pushes wrong metrics.
Problem description
On macOS, Agent version 7.47.0.
Given the following very simple Python program, which increments a counter every second:
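A minimal sketch of such a program, assuming the official datadog Python package and a DogStatsD server listening on the default localhost:8125 address (metric name is hypothetical):

```python
import time

from datadog import initialize, statsd

# Assumed configuration: DogStatsD reachable on localhost:8125.
initialize(statsd_host="127.0.0.1", statsd_port=8125)

if __name__ == "__main__":
    # Increment a count metric once per second, as described in the report.
    for _ in range(60):
        statsd.increment("example.counter")  # hypothetical metric name
        time.sleep(1)
```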
First run
I run a DogStatsD agent via Docker like this:
and then run the Python application.
Second run
I run the same Python application using serverless-init via this command line:
Results
The purple line corresponds to the first run, i.e. using dogstatsd.
The blue line corresponds to the second run, i.e. using serverless-init.
Interesting logs related to the flush of the metrics for the first run:
Interesting logs related to the flush of the metrics for the second run:
As we can see, the metrics sent to Datadog differ depending on whether we use serverless-init or dogstatsd, and the metrics sent when using serverless-init are wrong.