UDS timeout appears to be triggering a NodeJS crash from unhandled EPIPE exception #34461

bataras · 2025-02-26T06:38:24Z

We're running a heavily loaded NodeJS app in Fargate using the dd-trace node lib, running datadog agent in a sidecar, sending metrics and traces to it via UDS. "Heavily loaded" means 10s of K websocket connections with event loop delays sporadically spiking over 1,000ms with bursts of traffic, but otherwise good in the normal case.

We sometimes see a top level uncaught exception EPIPE error that crashes the app. Unfortunately, with Node it's been impossible to see the full stack. But when this happens, we see this log in the datadog agent a few hundred ms before the app:

agent log:
2025-02-19T23:25:30.916-08:00 - 2025-02-20 07:25:30 UTC | TRACE | ERROR | (pkg/trace/api/api.go:618 in handleTraces) | Cannot decode v0.4 traces payload: read unix /var/run/datadog/apm.socket->@: i/o timeout

app log:
2025-02-19T23:25:31.325-08:00 Uncaught Exception: [Error] [uncaughtException] write EPIPE: undefined

If we set DD_TRACE_PARTIAL_FLUSH_MIN_SPANS to 100 (from the default of 1000), the frequency of these errors drops to -almost- 0. But not completely. It still happens once in a while across a cluster handling millions of connections.

I think this is happening for the following reasons:

The agent:

datadog-agent/pkg/trace/api/api.go

Line 585 in 69867d9

case r.recvsem <- struct{}{}:

And a race window in NodeJS itself:
nodejs/node#55952

In handleTraces(), the agent is doing a select, waiting for a semaphore or a "time.After", but it's not also selecting on req.Context().Done() which is a context WithTimeout that is initialized with a different timeout by timeoutMiddleware(), initialized in buildMux()

So my guess is the agent is ungracefully closing the connection rather than sending a graceful response as from case <-time.After...

And this ungraceful UDS close is hitting the small race window within NodeJS itself (linked above)

github-actions bot added the team/agent-apm trace-agent label Feb 26, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

UDS timeout appears to be triggering a NodeJS crash from unhandled EPIPE exception #34461

UDS timeout appears to be triggering a NodeJS crash from unhandled EPIPE exception #34461

bataras commented Feb 26, 2025 •

edited

Loading

UDS timeout appears to be triggering a NodeJS crash from unhandled EPIPE exception #34461

UDS timeout appears to be triggering a NodeJS crash from unhandled EPIPE exception #34461

Comments

bataras commented Feb 26, 2025 • edited Loading

bataras commented Feb 26, 2025 •

edited

Loading