Summary
DataColumnSidecarsByRangeMessageHandler.sendDataColumnSidecars recursively chains thenCompose calls to send the next data column sidecar. If each sidecar load and response future is already complete, the next recursive call executes on the same stack. A near-limit Fulu response can include up to 128 * 128 = 16,384 data column sidecars, which is enough to trigger StackOverflowError in this completed-future execution model.
This appears to be a robustness issue: we confirmed the failure with a local PoC, but did not reproduce a crash against a live Teku node in Kurtosis (see below).
Thanks for your attention!
Affected code
private SafeFuture<RequestState> sendDataColumnSidecars(final RequestState requestState) {
  return requestState
      .loadNextDataColumnSidecar()
      .thenCompose(
          maybeDataColumnSidecar ->
              maybeDataColumnSidecar
                  .map(requestState::sendDataColumnSidecar)
                  .orElse(SafeFuture.COMPLETE))
      .thenCompose(
          __ -> {
            if (requestState.isComplete()) {
              return SafeFuture.completedFuture(requestState);
            } else {
              return sendDataColumnSidecars(requestState);
            }
          });
}
BTW, Teku's BeaconBlocksByRangeMessageHandler and ExecutionPayloadEnvelopesByRangeMessageHandler already use an iterative guard for already-completed futures to avoid this failure mode.
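The same-stack behaviour this report relies on can be checked in isolation. The sketch below uses plain java.util.concurrent.CompletableFuture rather than Teku's SafeFuture, and the class and method names are hypothetical. It only shows that thenCompose runs its callback inline, before returning, when the source future is already complete, and defers it when the source is still pending.

```java
import java.util.concurrent.CompletableFuture;

public class CompletedComposeInline {

  // Returns true if the thenCompose callback had already run by the time
  // thenCompose itself returned, i.e. it ran inline on the caller's stack.
  static boolean ranInline(final CompletableFuture<Void> source) {
    final boolean[] ran = {false};
    source.thenCompose(
        __ -> {
          ran[0] = true;
          return CompletableFuture.<Void>completedFuture(null);
        });
    return ran[0];
  }

  public static void main(final String[] args) {
    // Already-completed source: the callback executes immediately.
    final boolean completed = ranInline(CompletableFuture.completedFuture(null));
    // Pending source: the callback is only registered, not executed.
    final boolean pending = ranInline(new CompletableFuture<>());
    System.out.println("completed=" + completed + " pending=" + pending);
    // prints: completed=true pending=false
  }
}
```

In the pending case the callback runs later, on whichever thread completes the source future, so it does not nest inside the frame that registered it.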
Local reproduction
Main.java reproduces the failure by making every load and send return an already-completed future:
javac Main.java
java -Xss8m Main 16384
Observed locally:
recursive thenCompose loop FAILURE sent=14000 elapsed_ms=28 error=java.lang.StackOverflowError
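For readers without the Main.java used above, here is a separate, minimal sketch of the same idea. It is not the reproduction file from this report: State, loadAndSend and sendRecursive are hypothetical stand-ins, and plain CompletableFuture replaces SafeFuture. Instead of overflowing the stack, it counts how many sendRecursive invocations are nested at once, which shows the depth growing by one per item when every future is already complete.

```java
import java.util.concurrent.CompletableFuture;

public class RecursiveComposeDepth {

  // Hypothetical stand-in for RequestState.
  static final class State {
    final int total;
    int sent = 0;
    int depth = 0; // sendRecursive invocations currently on the stack
    int maxDepth = 0; // deepest nesting observed

    State(final int total) {
      this.total = total;
    }

    // Models a load plus send that both finish synchronously.
    CompletableFuture<Void> loadAndSend() {
      sent++;
      return CompletableFuture.completedFuture(null);
    }

    boolean isComplete() {
      return sent >= total;
    }
  }

  // Same shape as the affected method: recurse from inside thenCompose.
  static CompletableFuture<State> sendRecursive(final State s) {
    s.depth++;
    s.maxDepth = Math.max(s.maxDepth, s.depth);
    try {
      return s.loadAndSend()
          .thenCompose(
              __ -> s.isComplete() ? CompletableFuture.completedFuture(s) : sendRecursive(s));
    } finally {
      s.depth--;
    }
  }

  public static void main(final String[] args) {
    final State s = new State(500);
    sendRecursive(s).join();
    System.out.println("sent=" + s.sent + " maxDepth=" + s.maxDepth);
    // prints: sent=500 maxDepth=500
  }
}
```

The item count is kept small on purpose so the sketch finishes on a default stack. Each nesting level costs several JVM frames, so raising it toward 16384 approaches the overflow reported above.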
Kurtosis live-node result
We also tested against a Teku node in a Kurtosis Fulu devnet. The request used start_slot=0, count=128, and 128 columns.
The tested range only had 65 blob-bearing slots, so Teku served 65 * 128 = 8,320 response chunks successfully:
request[1] chunks=8320 bytes=258807949 duration=18.960851833s first_error=none stream_error="" error_text=""
health_after=200 head_after=69
summary requests=1 total_chunks=8320 stream_errors=0 server_errors=0 requested_sidecars_per_request=16384
verdict=live path served thousands of chunks without an observable requester-side crash signal
Teku logs showed the matching request completed successfully:
ReqResp inbound data_column_sidecars_by_range, columns: 8320/16384 in 18941 ms
Suggested fix
Consider a guard similar to the one in BeaconBlocksByRangeMessageHandler.sendNextBlock:
private SafeFuture<RequestState> sendNextBlock(final RequestState requestState) {
  SafeFuture<Boolean> blockFuture = processNextBlock(requestState);
  // Avoid risk of StackOverflowException by iterating when the block future is already complete
  // Using thenCompose on the completed future would execute immediately and recurse back into
  // this method to send the next block. When not already complete, thenCompose is executed
  // on a separate thread so doesn't recurse on the same stack.
  while (blockFuture.isDone() && !blockFuture.isCompletedExceptionally()) {
    if (blockFuture.join()) {
      return completedFuture(requestState);
    }
    blockFuture = processNextBlock(requestState);
  }
  return blockFuture.thenCompose(
      complete -> complete ? completedFuture(requestState) : sendNextBlock(requestState));
}
Code references
Affected code: teku/networking/eth2/src/main/java/tech/pegasys/teku/networking/eth2/rpc/beaconchain/methods/DataColumnSidecarsByRangeMessageHandler.java, lines 211 to 227 in 71dd996
Existing guard: teku/networking/eth2/src/main/java/tech/pegasys/teku/networking/eth2/rpc/beaconchain/methods/BeaconBlocksByRangeMessageHandler.java, lines 204 to 218 in 71dd996