
[BUG] Metrics values are wrong when using serverless-init #19446

Closed
terlenbach opened this issue Sep 13, 2023 · 18 comments · Fixed by #20507 · May be fixed by #19447

Comments

@terlenbach

Problem summary

The statsd agent launched by serverless-init pushes incorrect metric values.

Problem description

On macOS, Agent version 7.47.0

Given the very simple Python program:

from time import sleep
from datadog import DogStatsd

client = DogStatsd()
counter = 0
while True:
    client.increment('test.terlenbach.page.views')
    counter += 1
    sleep(1)

which increments a counter every second.

First run

I run a DogStatsd agent via Docker like this:

docker run -it -p8125:8125/udp -e DD_SERVICE=test-terlenbach -e DD_ENV=dev -e DD_HOSTNAME=test -e DD_LOG_LEVEL=debug -e DD_SITE=datadoghq.eu -e DD_DOGSTATSD_NON_LOCAL_TRAFFIC=true -e DD_LOG_PAYLOADS=true datadog/dogstatsd

and then run the Python application.

Second run

I run the same Python application using serverless-init via this command line:

DD_SERVICE=test-terlenbach DD_ENV=dev DD_SITE=datadoghq.eu DD_LOG_LEVEL=DEBUG DD_LOG_PAYLOADS=true ./serverless-init python ./app.py

Results

[Screenshot: graph comparing the metric values reported for the two runs]

The purple line corresponds to the first run, i.e. using dogstatsd.
The blue line corresponds to the second run, i.e. using serverless-init.

Interesting logs related to the flush of the metrics for the first run:

2023-09-13 15:12:10 UTC | DSD | DEBUG | (pkg/aggregator/demultiplexer.go:119 in func1) | Flushing serie: {"metric":"test.terlenbach.page.views","points":[[1694617910,0],[1694617920,0.8]],"tags":["env:dev","service:bg-test-terlenbach","version:1"],"host":"test","type":"rate","interval":10} 

2023-09-13 15:12:24 UTC | DSD | DEBUG | (pkg/aggregator/demultiplexer.go:119 in func1) | Flushing serie: {"metric":"test.terlenbach.page.views","points":[[1694617930,1]],"tags":["env:dev","service:bg-test-terlenbach","version:1"],"host":"test","type":"rate","interval":10}

2023-09-13 15:12:39 UTC | DSD | DEBUG | (pkg/aggregator/demultiplexer.go:119 in func1) | Flushing serie: {"metric":"test.terlenbach.page.views","points":[[1694617940,1]],"tags":["env:dev","service:bg-test-terlenbach","version:1"],"host":"test","type":"rate","interval":10}

2023-09-13 15:12:54 UTC | DSD | DEBUG | (pkg/aggregator/demultiplexer.go:119 in func1) | Flushing serie: {"metric":"test.terlenbach.page.views","points":[[1694617960,1],[1694617950,1]],"tags":["env:dev","service:bg-test-terlenbach","version:1"],"host":"test","type":"rate","interval":10}

2023-09-13 15:13:09 UTC | DSD | DEBUG | (pkg/aggregator/demultiplexer.go:119 in func1) | Flushing serie: {"metric":"test.terlenbach.page.views","points":[[1694617970,1]],"tags":["env:dev","service:bg-test-terlenbach","version:1"],"host":"test","type":"rate","interval":10} 

Interesting logs related to the flush of the metrics for the second run:

2023-09-13 17:17:31 CEST | SERVERLESS_INIT | DEBUG | (pkg/aggregator/demultiplexer.go:147 in func1) | Flushing serie: {"metric":"test.terlenbach.page.views","points":[[1694618240,0.1],[1694618250,0.2]],"tags":["env:dev","service:bg-test-terlenbach","version:1","datadog_init_version:xxx"],"host":"","type":"rate","interval":10}

2023-09-13 17:17:34 CEST | SERVERLESS_INIT | DEBUG | (pkg/aggregator/demultiplexer.go:147 in func1) | Flushing serie: {"metric":"test.terlenbach.page.views","points":[[1694618250,0.3]],"tags":["env:dev","service:bg-test-terlenbach","version:1","datadog_init_version:xxx"],"host":"","type":"rate","interval":10}

2023-09-13 17:17:37 CEST | SERVERLESS_INIT | DEBUG | (pkg/aggregator/demultiplexer.go:147 in func1) | Flushing serie: {"metric":"test.terlenbach.page.views","points":[[1694618250,0.3]],"tags":["env:dev","service:bg-test-terlenbach","version:1","datadog_init_version:xxx"],"host":"","type":"rate","interval":10}

2023-09-13 17:17:40 CEST | SERVERLESS_INIT | DEBUG | (pkg/aggregator/demultiplexer.go:147 in func1) | Flushing serie: {"metric":"test.terlenbach.page.views","points":[[1694618250,0.2],[1694618260,0.1]],"tags":["env:dev","service:bg-test-terlenbach","version:1","datadog_init_version:xxx"],"host":"","type":"rate","interval":10}

2023-09-13 17:17:43 CEST | SERVERLESS_INIT | DEBUG | (pkg/aggregator/demultiplexer.go:147 in func1) | Flushing serie: {"metric":"test.terlenbach.page.views","points":[[1694618260,0.3]],"tags":["env:dev","service:bg-test-terlenbach","version:1","datadog_init_version:xxx"],"host":"","type":"rate","interval":10}

As we can see, the metrics sent to Datadog differ depending on whether we use serverless-init or dogstatsd, and the values sent when using serverless-init are wrong.

@IvanTopolcic
Contributor

This is actually a known limitation and why we advise users to only use distribution metrics, which use server-side aggregation. When you call increment, you're implicitly using a count metric, which is aggregated on the client side and can lead to erroneous values. When we have multiple containers with the same hostname (i.e., when your service scales up) in Cloud Run, Container Apps, etc., these metrics are overwritten on the backend instead of aggregated, which can lead to incorrect values in some cases.

I did see your PR, but I don't believe it'd address the problem as a whole; it might only work in this one specific case. I'd recommend trying distribution metrics and checking whether the problem persists, as that's our recommended solution.

@terlenbach
Author

@IvanTopolcic thanks for answering.
I did the same test with a distribution metric and the results also differ depending on whether I use dogstatsd or serverless-init. However, if I use the “fixed” version of serverless-init, the metrics are the same.

@terlenbach
Author

terlenbach commented Sep 15, 2023

@IvanTopolcic One of my colleagues already created a PR to add the value of the environment variable CONTAINER_APP_REPLICA_NAME as a metric dimension so that metrics don't get overwritten on scale-up: #19332.

This issue concerns another problem: the aggregation is wrong in the serverless version of dogstatsd, and hence, even with one replica, the metrics are wrong.

@IvanTopolcic
Contributor

We do appreciate the contributions!

In your testing, did you create a new metric once you switched from non-distribution to distribution metrics? If you just change the name, it'll still have some non-distribution metric behavior in the backend. I recommend trying the same thing but with a completely new metric. You should be able to view the metrics summary page and the metric should be classified as a serverless / distribution metric.

@terlenbach
Author

I did create a new metric. The problem is that the aggregation performed by dogstatsd before sending the metrics to Datadog is wrong (I can see in the logs that the payload sent to Datadog contains the metrics with wrong values).

As you saw in my PR, IMHO the problem comes from the extra 10 seconds added when calling ForceFlushToSerializer in the serverless version. For series at least, these extra 10 seconds put us in a situation where the current bucket always gets closed: in https://github.com/DataDog/datadog-agent/blob/main/pkg/aggregator/time_sampler.go#L124 and https://github.com/DataDog/datadog-agent/blob/main/pkg/aggregator/time_sampler.go#L71 we see that a bucket gets closed when bucketStartTimestamp+s.interval < timestamp, which is always true if we always add 10 seconds to the current timestamp. If you look at the logs shared in this issue, you'll notice that the metric value is wrong.
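
To make the condition easier to follow, here is a minimal Python sketch of the bucket-close check described above (illustrative only: the actual logic lives in pkg/aggregator/time_sampler.go and is written in Go; the timestamps are just examples):

BUCKET_INTERVAL = 10  # seconds, same as the flush interval in the logs

def bucket_start(timestamp):
    # Truncate a timestamp to the start of its 10-second bucket.
    return timestamp - (timestamp % BUCKET_INTERVAL)

def is_bucket_closed(bucket_start_ts, flush_timestamp):
    # Mirrors the condition bucketStartTimestamp + s.interval < timestamp.
    return bucket_start_ts + BUCKET_INTERVAL < flush_timestamp

now = 1694618243                    # a flush happens "now"
current_bucket = bucket_start(now)  # 1694618240, still being filled

# Regular flush: the current bucket stays open until its 10 seconds are over.
print(is_bucket_closed(current_bucket, now))       # False

# Serverless flush with the extra 10 seconds: the same bucket is always closed
# early, so a partial value is sent now and a different value is sent again at
# the next flush.
print(is_bucket_closed(current_bucket, now + 10))  # True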

To give you more information, the problem only arises when this flush statement is called: https://github.com/DataDog/datadog-agent/blob/main/cmd/serverless-init/main.go#L115, which means that if my program runs for less than 3 seconds, the metrics sent while the program is terminating are correct.

@IvanTopolcic
Contributor

IvanTopolcic commented Sep 15, 2023

Hmm, I'm fairly certain this is the shadowing issue I mentioned above. My guess is that somehow the metric you're using isn't being properly marked as a non-aggregate distribution metric. As an example, I copied your code and changed it to use a distribution metric:

from time import sleep
from datadog import DogStatsd

client = DogStatsd()
counter = 0
while True:
    client.distribution('test.terlenbach.page.views', 1)
    counter += 1
    sleep(1)

It's correctly sending one metric every second, as you can see below:

[Screenshot: distribution metric graph showing 10 points per 10-second interval]

(One interval is 10 seconds, so we see 10 metrics).

The bucketing doesn't matter for distribution metrics, as all of the aggregation is done server-side. There shouldn't be anything happening on the client that would modify your distribution metrics. If you're using a non-distribution metric, or your metric is incorrectly set to a non-aggregate mode (this is why I asked you to create a new metric), it's possible that the backend drops some of the payloads because it believes they are duplicates and uses only the last timestamp.

If you're sending distribution metrics, you should see a Flushing Sketches message similar to the one below:

2023-09-15 17:10:33 UTC | SERVERLESS_INIT | DEBUG | Flushing Sketches: {"metric":"test.terlenbach.page.views","tags":["env:datadog-demo","service:datadog-demo-run-python","version:1","datadog_init_version:1.1.1"],"host":"","interval":10,"points":[{"sketch":{"summary":{"Min":1,"Max":1,"Sum":3,"Avg":1,"Cnt":3}},"ts":1694797830}]}

Notice that we flush every 3 seconds and send the metric every 1s, as in your example. In the above payload we can see that we're correctly getting a count of 3 metrics.

@IvanTopolcic
Contributor

To add one more thing, introducing this change would blow up the cardinality of metric data, which might cause billing to increase drastically, especially if new containers are spun up often. If you can verify with 100% certainty that creating a completely new metric with a different name, and only ever sending it as a distribution metric, still doesn't work, we can go from there; but as you can see above, it does submit the correct number of metrics.

We've had customers in the past who initially sent metrics as a gauge or something else and then swapped to distribution. While this may appear to work, it doesn't, and it still causes the overwriting issue, leading to metric loss. The only way to avoid this is to submit a completely new metric entirely and make sure it's never anything besides a distribution metric.

@terlenbach
Author

@IvanTopolcic thanks for taking the time to test the distribution metric.
So I tried the distribution metric again, and indeed the values reported by Datadog seem fine. I had actually stopped at what was sent to Datadog, and that's why I believed this metric was also wrong; let me explain a bit more.

Here is what I get with serverless-init:

2023-09-16 10:48:44 CEST | SERVERLESS_INIT | DEBUG | (pkg/aggregator/demultiplexer.go:156 in func2) | Flushing Sketches: {"metric":"test.terlenbach.new.distribution","tags":["env:dev","service:bg-test-terlenbach","version:1","datadog_init_version:xxx"],"host":"","interval":10,"points":[{"sketch":{"summary":{"Min":1,"Max":1,"Sum":3,"Avg":1,"Cnt":3}},"ts":1694854120}]}
2023-09-16 10:48:47 CEST | SERVERLESS_INIT | DEBUG | (pkg/aggregator/demultiplexer.go:156 in func2) | Flushing Sketches: {"metric":"test.terlenbach.new.distribution","tags":["env:dev","service:bg-test-terlenbach","version:1","datadog_init_version:xxx"],"host":"","interval":10,"points":[{"sketch":{"summary":{"Min":1,"Max":1,"Sum":3,"Avg":1,"Cnt":3}},"ts":1694854120}]}
2023-09-16 10:48:50 CEST | SERVERLESS_INIT | DEBUG | (pkg/aggregator/demultiplexer.go:156 in func2) | Flushing Sketches: {"metric":"test.terlenbach.new.distribution","tags":["env:dev","service:bg-test-terlenbach","version:1","datadog_init_version:xxx"],"host":"","interval":10,"points":[{"sketch":{"summary":{"Min":1,"Max":1,"Sum":2,"Avg":1,"Cnt":2}},"ts":1694854120},{"sketch":{"summary":{"Min":1,"Max":1,"Sum":1,"Avg":1,"Cnt":1}},"ts":1694854130}]}

And this is what I get with dogstatsd:

2023-09-16 08:49:32 UTC | DSD | DEBUG | (pkg/aggregator/demultiplexer.go:128 in func2) | Flushing Sketches: {"metric":"test.terlenbach.new.distribution","tags":["env:dev","service:bg-test-terlenbach","version:1"],"host":"test","interval":10,"points":[{"sketch":{"summary":{"Min":1,"Max":1,"Sum":7,"Avg":1,"Cnt":7}},"ts":1694854160}]}
2023-09-16 08:49:47 UTC | DSD | DEBUG | (pkg/aggregator/demultiplexer.go:128 in func2) | Flushing Sketches: {"metric":"test.terlenbach.new.distribution","tags":["env:dev","service:bg-test-terlenbach","version:1"],"host":"test","interval":10,"points":[{"sketch":{"summary":{"Min":1,"Max":1,"Sum":10,"Avg":1,"Cnt":10}},"ts":1694854170}]}
2023-09-16 08:50:02 UTC | DSD | DEBUG | (pkg/aggregator/demultiplexer.go:128 in func2) | Flushing Sketches: {"metric":"test.terlenbach.new.distribution","tags":["env:dev","service:bg-test-terlenbach","version:1"],"host":"test","interval":10,"points":[{"sketch":{"summary":{"Min":1,"Max":1,"Sum":10,"Avg":1,"Cnt":10}},"ts":1694854180},{"sketch":{"summary":{"Min":1,"Max":1,"Sum":10,"Avg":1,"Cnt":10}},"ts":1694854190}]}

We can see that with serverless-init different values are sent for the same bucket (with the same ts), whereas with dogstatsd, there is always one value for each bucket.

From what I understand, with distribution metrics no deduplication is performed on the backend side, and values are aggregated even though they come with the same timestamp. So, in this case, yes, distribution metrics work; at least, the values reported by Datadog are fine.
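
As a rough illustration of that understanding (this is an assumption about the server-side merge, simplified to plain counts; real distribution points are sketches, not bare counters):

from collections import defaultdict

# (timestamp, Cnt) pairs taken from the serverless-init sketch logs above.
flushed_points = [
    (1694854120, 3),  # first flush
    (1694854120, 3),  # second flush, same bucket again
    (1694854120, 2),  # third flush, same bucket once more
    (1694854130, 1),  # third flush, next bucket
]

# Assumed backend behavior: points sharing a timestamp are merged, so the
# partial counts for a bucket add up instead of overwriting each other.
merged = defaultdict(int)
for ts, cnt in flushed_points:
    merged[ts] += cnt

print(dict(merged))  # {1694854120: 8, 1694854130: 1}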

The behavior between dogstatsd and serverless-init differs though, even with distribution metrics.

Interestingly, this is what I get with the modified version of serverless-init (i.e. the one from the linked PR):

2023-09-16 10:55:30 CEST | SERVERLESS_INIT | DEBUG | (pkg/aggregator/demultiplexer.go:156 in func2) | Flushing Sketches: {"metric":"test.terlenbach.new.distribution","tags":["env:dev","service:bg-test-terlenbach","version:1","datadog_init_version:xxx"],"host":"","interval":10,"points":[{"sketch":{"summary":{"Min":1,"Max":1,"Sum":4,"Avg":1,"Cnt":4}},"ts":1694854520}]}
2023-09-16 10:55:42 CEST | SERVERLESS_INIT | DEBUG | (pkg/aggregator/demultiplexer.go:156 in func2) | Flushing Sketches: {"metric":"test.terlenbach.new.distribution","tags":["env:dev","service:bg-test-terlenbach","version:1","datadog_init_version:xxx"],"host":"","interval":10,"points":[{"sketch":{"summary":{"Min":1,"Max":1,"Sum":10,"Avg":1,"Cnt":10}},"ts":1694854530}]}
2023-09-16 10:55:51 CEST | SERVERLESS_INIT | DEBUG | (pkg/aggregator/demultiplexer.go:156 in func2) | Flushing Sketches: {"metric":"test.terlenbach.new.distribution","tags":["env:dev","service:bg-test-terlenbach","version:1","datadog_init_version:xxx"],"host":"","interval":10,"points":[{"sketch":{"summary":{"Min":1,"Max":1,"Sum":10,"Avg":1,"Cnt":10}},"ts":1694854540}]}
2023-09-16 10:56:00 CEST | SERVERLESS_INIT | DEBUG | (pkg/aggregator/demultiplexer.go:156 in func2) | Flushing Sketches: {"metric":"test.terlenbach.new.distribution","tags":["env:dev","service:bg-test-terlenbach","version:1","datadog_init_version:xxx"],"host":"","interval":10,"points":[{"sketch":{"summary":{"Min":1,"Max":1,"Sum":10,"Avg":1,"Cnt":10}},"ts":1694854550}]}

This is the same behavior as dogstatsd: only one value per timestamp. Should the serverless version of dogstatsd behave differently from the classic one when it comes to computing sketches?

Anyway, I do acknowledge that distribution metrics work with serverless-init (but we don't want to use them, as this adds more dimensions to our metrics and makes them more expensive). However, if we focus on count metrics, we can see that the client-side aggregation is wrong, and I still think this comes from those extra 10 seconds :)

@IvanTopolcic
Contributor

Yep, we can definitely change the flushing behavior so that at least it isn't as confusing. I agree -- although it isn't wrong per se, it's a little confusing if you're just looking at the log output.

@IvanTopolcic
Contributor

IvanTopolcic commented Sep 25, 2023

Would you be able to test if #19724 fixes the issue for you? This won't fix the issue of backend payloads being dropped, but it should fix the client-side aggregation having incorrect values.

@terlenbach
Author

Would you be able to test if #19724 fixes the issue for you? This won't fix the issue of backend payloads being dropped, but it should fix the client-side aggregation having incorrect values.

This is what I get:

2023-09-26 09:20:03 CEST | SERVERLESS_INIT | DEBUG | (pkg/aggregator/demultiplexer.go:147 in func1) | Flushing serie: {"metric":"test.terlenbach.increment","points":[[1695712790,0.6],[1695712800,0.4]],"tags":["datadog_init_version:xxx","env:dev","service:bg-test-terlenbach","version:1"],"host":"","type":"rate","interval":10}
2023-09-26 09:20:13 CEST | SERVERLESS_INIT | DEBUG | (pkg/aggregator/demultiplexer.go:147 in func1) | Flushing serie: {"metric":"test.terlenbach.increment","points":[[1695712800,0.6],[1695712810,0.4]],"tags":["datadog_init_version:xxx","env:dev","service:bg-test-terlenbach","version:1"],"host":"","type":"rate","interval":10}
2023-09-26 09:20:23 CEST | SERVERLESS_INIT | DEBUG | (pkg/aggregator/demultiplexer.go:147 in func1) | Flushing serie: {"metric":"test.terlenbach.increment","points":[[1695712810,0.6],[1695712820,0.4]],"tags":["datadog_init_version:xxx","env:dev","service:bg-test-terlenbach","version:1"],"host":"","type":"rate","interval":10}
2023-09-26 09:20:33 CEST | SERVERLESS_INIT | DEBUG | (pkg/aggregator/demultiplexer.go:147 in func1) | Flushing serie: {"metric":"test.terlenbach.increment","points":[[1695712820,0.6],[1695712830,0.4]],"tags":["datadog_init_version:xxx","env:dev","service:bg-test-terlenbach","version:1"],"host":"","type":"rate","interval":10}
2023-09-26 09:20:43 CEST | SERVERLESS_INIT | DEBUG | (pkg/aggregator/demultiplexer.go:147 in func1) | Flushing serie: {"metric":"test.terlenbach.increment","points":[[1695712830,0.6],[1695712840,0.4]],"tags":["datadog_init_version:xxx","env:dev","service:bg-test-terlenbach","version:1"],"host":"","type":"rate","interval":10}

By interpreting these logs, this is what the client sent:

Flush timestamp    Points sent to DD
1695712803         [[1695712790,0.6],[1695712800,0.4]]
1695712813         [[1695712800,0.6],[1695712810,0.4]]
1695712823         [[1695712810,0.6],[1695712820,0.4]]
1695712833         [[1695712820,0.6],[1695712830,0.4]]
1695712843         [[1695712830,0.6],[1695712840,0.4]]

With these points sent by the client, Datadog will display a count of 6 whereas it should be 10, so unfortunately this does not fix the issue.

Here we can see that every 10 seconds the client sends the values of 2 different buckets: the one that just got closed, and the next one, because some values have already been added to it and we flush with a cutOffTime set to now()+10 seconds (https://github.com/DataDog/datadog-agent/blob/main/comp/dogstatsd/server/server.go#L500).

If we focus on the first log line, at timestamp 1695712803, to me it's normal that the bucket with key 1695712790 is sent, but it's premature to send the one with key 1695712800: buckets cover a time range of 10 seconds and there is no aggregation on the server side, so it should be sent only once, at a timestamp >= 1695712800+10.
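
As a worked illustration of where the 6 comes from (assuming, as discussed above, that for rate series the backend keeps only the last point it receives for a given timestamp):

INTERVAL = 10  # rate interval in seconds

# Bucket 1695712800 is reported twice across consecutive flushes (see the
# table above): first prematurely with rate 0.4, then again with rate 0.6.
reports_in_arrival_order = [0.4, 0.6]

kept_rate = reports_in_arrival_order[-1]  # last write wins -> 0.6
displayed_count = kept_rate * INTERVAL    # 0.6 * 10 = 6
expected_count = 1.0 * INTERVAL           # 1 increment per second -> 10

print(displayed_count, expected_count)    # 6.0 10.0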

@IvanTopolcic
Contributor

Could you update to 1.1.2-rc1 and see if this fixes the bug for you?

@terlenbach
Author

Hi @IvanTopolcic, after a quick test, yes it seems that this new version fixes the bug!

@IvanTopolcic
Contributor

Perfect! We're going to get this fix in the next release. And thanks again for your patience :-)

@terlenbach
Author

Hi @IvanTopolcic, I just saw that version 1.1.2 has officially been released. Unfortunately, it seems that this version behaves differently from 1.1.2-rc1 and that the metrics being sent are still wrong.

@IvanTopolcic
Contributor

Hey @terlenbach sorry for the late response. That's correct, it'll hopefully land in 1.1.3! Again, sorry for the delay.

@IvanTopolcic IvanTopolcic linked a pull request Oct 30, 2023 that will close this issue
@IvanTopolcic IvanTopolcic reopened this Nov 1, 2023
@IvanTopolcic
Contributor

We've released version 1.1.3. Would you mind upgrading and letting us know if this has fixed the issue?

@terlenbach
Author

Hi @IvanTopolcic, I confirm that version 1.1.3 fixes the issue. Thanks.
