We're running a heavily loaded Node.js app in Fargate using the dd-trace node library, with the Datadog Agent in a sidecar, sending metrics and traces to it via UDS. "Heavily loaded" means tens of thousands of WebSocket connections, with event loop delays sporadically spiking over 1,000 ms during bursts of traffic, but otherwise healthy in the normal case.
We sometimes see a top-level uncaught EPIPE exception that crashes the app. Unfortunately, with Node it has been impossible to capture the full stack trace. But when this happens, we see this log in the Datadog Agent a few hundred milliseconds before the app's error:
Agent log:

```
2025-02-19T23:25:30.916-08:00 - 2025-02-20 07:25:30 UTC | TRACE | ERROR | (pkg/trace/api/api.go:618 in handleTraces) | Cannot decode v0.4 traces payload: read unix /var/run/datadog/apm.socket->@: i/o timeout
```
App log:

```
2025-02-19T23:25:31.325-08:00 Uncaught Exception: [Error] [uncaughtException] write EPIPE: undefined
```
If we set DD_TRACE_PARTIAL_FLUSH_MIN_SPANS to 100 (down from the default of 1000), the frequency of these errors drops to almost zero, but not completely: it still happens once in a while across a cluster handling millions of connections.
I think this is happening because of a combination of two things.

On the agent side:
datadog-agent/pkg/trace/api/api.go (line 585 at commit 69867d9)

And a race window in Node.js itself:
nodejs/node#55952
In handleTraces(), the agent does a select, waiting on either a semaphore or a time.After, but it does not also select on req.Context().Done(), a context created with WithTimeout using a different timeout by timeoutMiddleware(), which is set up in buildMux().
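To make the suspected gap concrete, here is a minimal Go sketch of the select pattern described above. The names, socket path, timeout value, and error responses are illustrative only, not the agent's actual code; the commented-out case is the kind of branch this report suggests is missing.

```go
package main

import (
	"log"
	"net"
	"net/http"
	"time"
)

// Sketch of a handler that waits for a slot on a semaphore or gives up
// after its own time.After, but never watches req.Context().Done(). If
// the middleware's per-request deadline expires first, the handler has
// no branch that can still write a graceful HTTP response.
func handleTracesSketch(sem chan struct{}) http.HandlerFunc {
	return func(w http.ResponseWriter, req *http.Request) {
		select {
		case sem <- struct{}{}:
			defer func() { <-sem }()
			// ... decode and process the trace payload ...
		case <-time.After(5 * time.Second):
			// Graceful path: the client still gets an HTTP response.
			http.Error(w, "too many concurrent payloads", http.StatusTooManyRequests)
		// The branch this report suggests is missing: selecting on the
		// request context as well, so the handler can answer gracefully
		// when the middleware's deadline wins the race.
		// case <-req.Context().Done():
		//	http.Error(w, req.Context().Err().Error(), http.StatusServiceUnavailable)
		}
	}
}

func main() {
	sem := make(chan struct{}, 10) // concurrency limit; capacity is illustrative
	mux := http.NewServeMux()
	mux.Handle("/v0.4/traces", handleTracesSketch(sem))
	// Serve over UDS to mirror the setup above; path is a stand-in for
	// /var/run/datadog/apm.socket.
	ln, err := net.Listen("unix", "/tmp/apm-sketch.socket")
	if err != nil {
		log.Fatal(err)
	}
	log.Fatal(http.Serve(ln, mux))
}
```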
So my guess is that the agent is ungracefully closing the connection rather than sending a graceful response, as it would from the case <-time.After branch.
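For context, a context-based timeout middleware of the kind described usually looks something like the hedged sketch below (the agent's actual timeoutMiddleware()/buildMux() may differ). The key point is that it only arms a deadline on the request's context; turning that deadline into a graceful response is still up to the handler, which is exactly what the missing ctx.Done() case would provide.

```go
package sketch

import (
	"context"
	"net/http"
	"time"
)

// Hedged sketch of a timeout middleware: it attaches a
// context.WithTimeout to every request it wraps. By itself it writes
// nothing to the client when the deadline fires; a handler that ignores
// req.Context().Done() therefore ends the exchange without an HTTP
// response when this deadline, rather than its own time.After, expires.
func timeoutMiddleware(d time.Duration, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), d)
		defer cancel()
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}
```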
And this ungraceful UDS close is hitting the small race window within Node.js itself (linked above).