fix: fall back to configured seed URLs when pool is exhausted during discovery #104
Open
Conversation
When every discovered node becomes unreachable between discovery intervals, pool.Next() returns an error and discovery aborts, preventing the client from self-healing even if the originally configured URLs are reachable. getNodesInfo now attempts each c.urls entry as a fallback when pool.Next() fails. The first seed URL that returns a valid /_nodes/http response wins; if all fail, the pool error is joined with the per-seed errors via errors.Join so the root cause is preserved.

Closes #93
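A minimal sketch of that fallback as described here, with hypothetical names throughout (the anonymous pool parameter, the fetch callback, and the raw []byte result are stand-ins, not the identifiers in the diff):

```go
package discovery

import (
	"context"
	"errors"
	"net/url"
)

// getNodesInfoWithSeedFallback sketches the behaviour described in the PR
// body: use the pool if it can hand out a connection, otherwise try each
// configured seed URL in order and join all errors.
func getNodesInfoWithSeedFallback(
	ctx context.Context,
	pool interface{ Next() (*url.URL, error) },
	seedURLs []*url.URL,
	fetch func(context.Context, *url.URL) ([]byte, error),
) ([]byte, error) {
	u, err := pool.Next()
	if err == nil {
		return fetch(ctx, u)
	}

	// Pool exhausted: the first seed URL with a valid /_nodes/http
	// response wins.
	errs := []error{err}
	for _, seed := range seedURLs {
		body, seedErr := fetch(ctx, seed)
		if seedErr == nil {
			return body, nil
		}
		errs = append(errs, seedErr)
	}
	// Keep the pool error and every per-seed error in the result so the
	// root cause is preserved.
	return nil, errors.Join(errs...)
}
```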
pool.Next() on a statusConnectionPool rarely returns an error when nodes are dead - it force-resurrects the last dead connection and hands it back. The self-healing scenario from #93 (all nodes unreachable but seed URLs are up) actually surfaces as a roundTrip failure on the returned connection, not as a pool.Next() error. Extend the fallback: if the pool-selected connection's fetch fails (network error, non-2xx response, or decode error), call pool.OnFailure so the pool state stays accurate, then try each configured seed URL. Short-circuit on context cancellation to avoid hammering seed URLs when the parent context is already done.
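A sketch of this broader trigger, with hypothetical stand-in types rather than the client's real declarations (connectionPool, connection, client, and the fetchNodesInfo field are illustrative); what matters is the control flow: pool connection first, OnFailure on a failed fetch, short-circuit on cancellation, then the seed URLs:

```go
package discovery

import (
	"context"
	"errors"
	"net/url"
)

// Hypothetical stand-ins for the client's types; only the shape matters here.
type nodeInfo struct{ Name, URL string }

type connection struct{ URL *url.URL }

type connectionPool interface {
	Next() (*connection, error)
	OnFailure(*connection) error
}

type client struct {
	pool connectionPool
	urls []*url.URL
	// fetchNodesInfo GETs /_nodes/http from a single URL and decodes it
	// (a function field here only to keep the sketch short).
	fetchNodesInfo func(ctx context.Context, u *url.URL) ([]nodeInfo, error)
}

// getNodesInfo tries the pool-selected connection first and falls back to
// the configured seed URLs on any failure.
func (c *client) getNodesInfo(ctx context.Context) ([]nodeInfo, error) {
	var errs []error

	conn, err := c.pool.Next()
	if err != nil {
		errs = append(errs, err)
	} else {
		nodes, fetchErr := c.fetchNodesInfo(ctx, conn.URL)
		if fetchErr == nil {
			return nodes, nil
		}
		// Report the failed node so the pool can mark it dead and keep
		// its live/dead bookkeeping accurate.
		_ = c.pool.OnFailure(conn)
		errs = append(errs, fetchErr)
	}

	// Short-circuit: if the parent context is already done, walking the
	// seed URLs would only pile up more context errors.
	if ctx.Err() != nil {
		return nil, errors.Join(append(errs, ctx.Err())...)
	}

	// Fall back to the configured seed URLs; the first valid response wins.
	for _, u := range c.urls {
		nodes, seedErr := c.fetchNodesInfo(ctx, u)
		if seedErr == nil {
			return nodes, nil
		}
		errs = append(errs, seedErr)
	}
	return nil, errors.Join(errs...)
}
```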
TestCloseConcurrent caught a race: when Close cancels the parent context of an in-flight DiscoverNodesContext, fetchNodesInfo returns a ctx error, which previously triggered pool.OnFailure -> scheduleResurrect. That schedules a goroutine via resurrectWaitGroup.Add(1) concurrent with statusConnectionPool.Close calling resurrectWaitGroup.Wait(), violating sync.WaitGroup's contract. A canceled request is a client-side signal, not a node fault, so skip OnFailure when ctx.Err() != nil. Discovery still reports the underlying error; the pool state simply isn't perturbed by our own cancellation.
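A minimal sketch of that guard, reusing the hypothetical shapes from the sketch above (reportFetchFailure is an illustrative helper name, not one taken from the diff):

```go
// reportFetchFailure decides whether a failed discovery fetch should be
// reported to the pool. When the context was canceled (e.g. Close()
// canceling an in-flight DiscoverNodesContext), the failure is a
// client-side signal, not a node fault, so calling OnFailure here would
// schedule a resurrect goroutine that races the pool's own Close/Wait.
func (c *client) reportFetchFailure(ctx context.Context, conn *connection) {
	if ctx.Err() != nil {
		return // our own cancellation: leave the pool state untouched
	}
	_ = c.pool.OnFailure(conn)
}
```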
Benchmark Results (ubuntu-latest, windows-latest, macos-latest), compared using benchstat.
Summary
Fixes #93.
- Discovery now falls back to the configured seed URLs when the pool-based fetch of /_nodes/http fails, giving the client a chance to self-heal when every discovered node is unreachable but a seed URL is still up.
- getNodesInfo attempts the pool-selected connection first. If pool.Next() errors, or the fetch via the pool connection fails (network error, non-2xx, decode error), each c.urls entry is tried in order. On pool-connection failure we also call pool.OnFailure(conn) so the pool state stays accurate.
- All failures are joined with errors.Join so the root cause is preserved and still unwrappable via errors.Is / errors.As (a small illustration follows this section).
- The single-URL fetch is extracted into fetchNodesInfo so the pool and fallback paths share one implementation.

Discussion point: why fall back on request failure, not just on pool.Next() error

The issue text describes the trigger as pool.Next() returning an error, "for example every node has been marked dead and is waiting for resurrection". In practice that's not how statusConnectionPool.Next() behaves today:
- Next() only returns an error when both live and dead are empty (or the pool is closed, or resurrect itself errors).
- scheduleResurrect also auto-resurrects dead nodes after defaultResurrectTimeoutInitial (60s, doubling up to ~32min).

So "every node marked dead" doesn't produce a pool.Next() error - Next() hands back a force-resurrected dead node, and the real failure surfaces inside c.roundTrip(req) when that node is actually unreachable. A fix that only kicks in on pool.Next() errors would miss the scenario the issue is trying to solve and would only help the degenerate "pool has zero connections" case.

Caveat of the broader trigger: on a transient network blip where a pool node fails a single discovery request, we now walk every seed URL before giving up, which can amplify latency. I think that's acceptable for the discovery path (it runs on a long interval, off the hot request path), but happy to narrow the trigger if reviewers prefer.
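A tiny self-contained illustration of the errors.Join claim in the summary, using stand-in errors rather than the client's real ones:

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

func main() {
	// Stand-ins for the pool error and one per-seed fetch error.
	poolErr := errors.New("no connection available")
	seedErr := fmt.Errorf("seed http://127.0.0.1:9200: %w", context.DeadlineExceeded)

	joined := errors.Join(poolErr, seedErr)

	// Both the pool error and the wrapped cause remain matchable.
	fmt.Println(errors.Is(joined, poolErr))                  // true
	fmt.Println(errors.Is(joined, context.DeadlineExceeded)) // true
}
```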
Test plan
- go test -v -race ./... passes
- go vet ./... clean
- New test: pool.OnFailure called exactly once, seed URL contacted after the dead node, discovery succeeds (a rough sketch of the shape follows below)
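A rough shape such a test could take against the hypothetical types from the earlier sketches; the real test in the diff will differ, and fakePool here is a stand-in that counts OnFailure calls:

```go
// fakePool stands in for the status pool: it always hands out the same
// connection and counts OnFailure calls.
type fakePool struct {
	conn       *connection
	onFailures int
}

func (p *fakePool) Next() (*connection, error)  { return p.conn, nil }
func (p *fakePool) OnFailure(*connection) error { p.onFailures++; return nil }

func TestDiscoveryFallsBackToSeedURL(t *testing.T) {
	deadURL, _ := url.Parse("http://10.255.255.1:9200") // unreachable pool node
	seedURL, _ := url.Parse("http://127.0.0.1:9200")    // healthy seed

	var calls []*url.URL
	pool := &fakePool{conn: &connection{URL: deadURL}}
	c := &client{
		pool: pool,
		urls: []*url.URL{seedURL},
		fetchNodesInfo: func(_ context.Context, u *url.URL) ([]nodeInfo, error) {
			calls = append(calls, u)
			if u == deadURL {
				return nil, errors.New("connection refused")
			}
			return []nodeInfo{{Name: "node-1"}}, nil
		},
	}

	nodes, err := c.getNodesInfo(context.Background())
	if err != nil || len(nodes) != 1 {
		t.Fatalf("expected discovery to succeed via the seed URL, got nodes=%v err=%v", nodes, err)
	}
	if pool.onFailures != 1 {
		t.Fatalf("expected exactly one OnFailure call, got %d", pool.onFailures)
	}
	if len(calls) != 2 || calls[0] != deadURL || calls[1] != seedURL {
		t.Fatalf("expected the seed URL to be tried after the dead node, got %v", calls)
	}
}
```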