DataColumnSidecarsByRange potential StackOverflow #10793

@Alleysira

Summary

DataColumnSidecarsByRangeMessageHandler.sendDataColumnSidecars recursively chains thenCompose calls to send the next data column sidecar. If each sidecar load and response future is already complete, the next recursive call executes on the same stack. A near-limit Fulu response can include up to 128 * 128 = 16,384 data column sidecars, which is enough to trigger StackOverflowError in this completed-future execution model.

This appears to be a robustness issue. We confirmed the failure with a local PoC, but did not reproduce a crash against a live Teku node in Kurtosis (see below).

Thanks for your attention!

Affected code

    private SafeFuture<RequestState> sendDataColumnSidecars(final RequestState requestState) {
      return requestState
          .loadNextDataColumnSidecar()
          .thenCompose(
              maybeDataColumnSidecar ->
                  maybeDataColumnSidecar
                      .map(requestState::sendDataColumnSidecar)
                      .orElse(SafeFuture.COMPLETE))
          .thenCompose(
              __ -> {
                if (requestState.isComplete()) {
                  return SafeFuture.completedFuture(requestState);
                } else {
                  return sendDataColumnSidecars(requestState);
                }
              });
    }

BTW, Teku's BeaconBlocksByRangeMessageHandler and ExecutionPayloadEnvelopesByRangeMessageHandler already use an iterative guard for already-completed futures to avoid this failure mode.
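
The nesting can be observed in isolation. The following is a minimal standalone sketch, assuming plain java.util.concurrent.CompletableFuture in place of Teku's SafeFuture; the class and method names are hypothetical and not part of Teku. It counts how many sendAll calls are active at the same time when every future is already complete.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical demo class, not Teku code.
public class CompletedFutureRecursionDemo {
  private static int active = 0;     // sendAll calls currently on the stack
  private static int maxActive = 0;  // highest value of 'active' seen so far

  // Same shape as the affected method: the thenCompose callback recurses.
  static CompletableFuture<Integer> sendAll(final int remaining) {
    active++;
    maxActive = Math.max(maxActive, active);
    try {
      // The future is already complete, so thenCompose runs the callback
      // immediately on the calling thread, before this method returns.
      return CompletableFuture.completedFuture(remaining)
          .thenCompose(
              r -> r == 0 ? CompletableFuture.completedFuture(0) : sendAll(r - 1));
    } finally {
      active--;
    }
  }

  // Returns the deepest nesting of sendAll calls reached for 'items' items.
  public static int maxNestingFor(final int items) {
    active = 0;
    maxActive = 0;
    sendAll(items).join();
    return maxActive;
  }

  public static void main(final String[] args) {
    System.out.println("items=10 maxNesting=" + maxNestingFor(10));   // items=10 maxNesting=11
    System.out.println("items=100 maxNesting=" + maxNestingFor(100)); // items=100 maxNesting=101
  }
}
```

Nesting grows linearly with the number of items, and each level also carries the CompletableFuture frames between one sendAll call and the next, which is why a response of 16,384 sidecars can exhaust the stack.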

Local reproduction

Main.java reproduces the failure by making every load and send return an already-completed future:

javac Main.java
java -Xss8m Main 16384

Observed locally:

recursive thenCompose loop         FAILURE sent=14000 elapsed_ms=28 error=java.lang.StackOverflowError

Kurtosis live-node result

We also tested against a Teku node in a Kurtosis Fulu devnet. The request used start_slot=0, count=128, and all 128 columns.

The tested range only had 65 blob-bearing slots, so Teku served 65 * 128 = 8,320 response chunks successfully:

request[1] chunks=8320 bytes=258807949 duration=18.960851833s first_error=none stream_error="" error_text=""
health_after=200 head_after=69
summary requests=1 total_chunks=8320 stream_errors=0 server_errors=0 requested_sidecars_per_request=16384
verdict=live path served thousands of chunks without an observable requester-side crash signal

Teku logs showed the matching request completed successfully:

ReqResp inbound data_column_sidecars_by_range, columns: 8320/16384 in 18941 ms

Suggested fix

Consider applying a guard similar to the one in the existing sendNextBlock method:

    private SafeFuture<RequestState> sendNextBlock(final RequestState requestState) {
      SafeFuture<Boolean> blockFuture = processNextBlock(requestState);
      // Avoid risk of StackOverflowException by iterating when the block future is already complete
      // Using thenCompose on the completed future would execute immediately and recurse back into
      // this method to send the next block. When not already complete, thenCompose is executed
      // on a separate thread so doesn't recurse on the same stack.
      while (blockFuture.isDone() && !blockFuture.isCompletedExceptionally()) {
        if (blockFuture.join()) {
          return completedFuture(requestState);
        }
        blockFuture = processNextBlock(requestState);
      }
      return blockFuture.thenCompose(
          complete -> complete ? completedFuture(requestState) : sendNextBlock(requestState));
    }
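
Applied to the data column handler, the same guard might look like the following. This is a standalone sketch under stated assumptions, not a patch: it uses CompletableFuture instead of SafeFuture, RequestState is a hypothetical stub whose futures are always complete, and processNext and sendAll are illustrative helpers that do not exist in Teku.

```java
import java.util.Optional;
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch, not Teku code.
public class IterativeSidecarSender {

  // Stub standing in for Teku's RequestState; sidecars are modelled as integers.
  static final class RequestState {
    private final int total;
    private int loaded = 0;
    private int sent = 0;

    RequestState(final int total) {
      this.total = total;
    }

    CompletableFuture<Optional<Integer>> loadNextDataColumnSidecar() {
      return CompletableFuture.completedFuture(
          loaded < total ? Optional.of(loaded++) : Optional.empty());
    }

    CompletableFuture<Void> sendDataColumnSidecar(final Integer sidecar) {
      sent++;
      return CompletableFuture.completedFuture(null);
    }

    boolean isComplete() {
      return loaded >= total;
    }

    int sent() {
      return sent;
    }
  }

  // One step: load the next sidecar, send it if present, report completion.
  static CompletableFuture<Boolean> processNext(final RequestState state) {
    return state
        .loadNextDataColumnSidecar()
        .thenCompose(
            maybeSidecar ->
                maybeSidecar
                    .map(state::sendDataColumnSidecar)
                    .orElse(CompletableFuture.completedFuture(null)))
        .thenApply(ignored -> state.isComplete());
  }

  static CompletableFuture<RequestState> sendDataColumnSidecars(final RequestState state) {
    CompletableFuture<Boolean> step = processNext(state);
    // Loop instead of recursing while each step has already completed normally.
    while (step.isDone() && !step.isCompletedExceptionally()) {
      if (step.join()) {
        return CompletableFuture.completedFuture(state);
      }
      step = processNext(state);
    }
    // Pending or failed step: chain as before; failures propagate through thenCompose.
    return step.thenCompose(
        complete ->
            complete
                ? CompletableFuture.completedFuture(state)
                : sendDataColumnSidecars(state));
  }

  // Sends 'total' stub sidecars and returns how many were sent.
  public static int sendAll(final int total) {
    return sendDataColumnSidecars(new RequestState(total)).join().sent();
  }

  public static void main(final String[] args) {
    System.out.println("sent=" + sendAll(16384)); // sent=16384
  }
}
```

With the loop in place, stack depth no longer depends on how many sidecars complete synchronously; recursion happens only after a step that was still pending when it was checked.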
