Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UDS timeout appears to be triggering a NodeJS crash from unhandled EPIPE exception #34461

Open
bataras opened this issue Feb 26, 2025 · 0 comments
Labels
team/agent-apm trace-agent

Comments

@bataras
Copy link

bataras commented Feb 26, 2025

We're running a heavily loaded NodeJS app in Fargate using the dd-trace node lib, running datadog agent in a sidecar, sending metrics and traces to it via UDS. "Heavily loaded" means 10s of K websocket connections with event loop delays sporadically spiking over 1,000ms with bursts of traffic, but otherwise good in the normal case.

We sometimes see a top level uncaught exception EPIPE error that crashes the app. Unfortunately, with Node it's been impossible to see the full stack. But when this happens, we see this log in the datadog agent a few hundred ms before the app:

agent log:
2025-02-19T23:25:30.916-08:00 - 2025-02-20 07:25:30 UTC | TRACE | ERROR | (pkg/trace/api/api.go:618 in handleTraces) | Cannot decode v0.4 traces payload: read unix /var/run/datadog/apm.socket->@: i/o timeout

app log:
2025-02-19T23:25:31.325-08:00 Uncaught Exception: [Error] [uncaughtException] write EPIPE: undefined

If we set DD_TRACE_PARTIAL_FLUSH_MIN_SPANS to 100 (from the default of 1000), the frequency of these errors drops to -almost- 0. But not completely. It still happens once in a while across a cluster handling millions of connections.

I think this is happening for the following reasons:

The agent:

case r.recvsem <- struct{}{}:

And a race window in NodeJS itself:
nodejs/node#55952

In handleTraces(), the agent is doing a select, waiting for a semaphore or a "time.After", but it's not also selecting on req.Context().Done() which is a context WithTimeout that is initialized with a different timeout by timeoutMiddleware(), initialized in buildMux()

So my guess is the agent is ungracefully closing the connection rather than sending a graceful response as from case <-time.After...

And this ungraceful UDS close is hitting the small race window within NodeJS itself (linked above)

@github-actions github-actions bot added the team/agent-apm trace-agent label Feb 26, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
team/agent-apm trace-agent
Projects
None yet
Development

No branches or pull requests

1 participant