Trin is occasionally sending invalid/incomplete content #1610
Comments
I'm working on reproducing this. Here is my script, adapted with the aim of halting when an error is encountered. I was worried it would keep looping after a failure and I would miss it :)

```sh
ATTEMPTS=0
# `jq -e .result` exits non-zero when the response has no .result field,
# so the loop stops on the first error response.
while [[ ATTEMPTS -eq 0 ]] || (echo $RESULT | jq -e .result >/dev/null); do
    ATTEMPTS=$(( ATTEMPTS + 1 ))
    RESULT=$(curl -s -X POST -H 'Content-Type: application/json' -d '{"jsonrpc":"2.0","id":"1","method":"portal_historyFindContent","params":["enr:-LS4QKsj-3uj6kWH0rLl8-LOf3eeI7o6TkGNk5OMXecV4ETwY0u-F8JqCfAD1P6OusOFZoENox95jBmOvXx5dIK5Fn2EZ1tWG2OqdCBjOTNlMTI5ZGNmYTAwNmM1YjQwZjY3M2I3Yzk4OGU5MjNkZGViNmI2gmlkgnY0gmlwhIDHliOJc2VjcDI1NmsxoQOYfB94zD_LtTgL22BsteJJUnzT8mEQ9BxV3ASaXXuhcYN1ZHCCIzE", "0x017ec54d5714a58002c993856e0d952452fe39d494808647fbdf3ebcf6e3545b9c"]}' http://localhost:8547)
    echo -n "attempt: $ATTEMPTS, at: "
    date
    sleep 2
done
echo $RESULT | jq
```

So far, I've run it ~500 times on the latest trin without reproducing. If I can't reproduce with a few thousand attempts, then I think the next step is to reproduce with an older version of trin. If that does show the error, then I think we can reasonably close the issue as resolved.
Ok, got it in about 1k attempts:

```json
{
  "jsonrpc": "2.0",
  "id": "1",
  "error": {
    "code": -32099,
    "message": "FindContent request timeout: FailedValidation(\"Network: History, Reason: Content validation failed for content key BlockBody(BlockBodyKey { block_hash: [126, 197, 77, 87, 20, 165, 128, 2, 201, 147, 133, 110, 13, 149, 36, 82, 254, 57, 212, 148, 128, 134, 71, 251, 223, 62, 188, 246, 227, 84, 91, 156] }) with error: Block Body content has invalid encoding: BytesInvalid(\\\"Invalid block body ssz bytes\\\")\")"
  }
}
```
Yup, got a partial result. In this example, the invalid result is just the tail end of the correct one. Valid result:
Invalid result:
The invalid result is missing the first 4096 hex characters (2048 bytes), but is otherwise correct. I compared the strings via this python:

```python
from difflib import SequenceMatcher

# The longest common block shows how much of the valid payload survives in the
# invalid one (find_longest_match() with no arguments needs Python 3.9+).
SequenceMatcher(None, invalid, valid, autojunk=False).find_longest_match()
```

I will repeat it a couple times to see if the shape of the invalid result is the same every time.
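The same suffix check can be sketched in Rust as well; the placeholder strings below stand in for the hex payloads of the two responses, which are not reproduced in this thread:

```rust
fn main() {
    // Placeholders for the `content` hex strings from the valid and invalid responses.
    let valid = "00112233445566778899aabbccddeeff";
    let invalid = &valid[8..];

    if valid.ends_with(invalid) {
        // The bad response is a clean tail of the good one; report how much
        // of the prefix was dropped, in hex characters and in bytes.
        let missing = valid.len() - invalid.len();
        println!(
            "invalid result is missing the first {} hex chars ({} bytes)",
            missing,
            missing / 2
        );
    } else {
        println!("invalid result is not a clean suffix of the valid one");
    }
}
```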
Huh, I got the exact same result on the next attempt. It is missing the first 2048 bytes again.
Lol, of course it's the same node: this test explicitly sends a find-content request to the same node over and over again...
That target node was a little old, so I reproduced against a more recent node. Same result: missing the first 2048 bytes.
Ok, I was finally able to get control of a sending node that could reproduce the issue. I added logs like this on both sides:

```diff
diff --git a/crates/portalnet/src/utp/controller.rs b/crates/portalnet/src/utp/controller.rs
index e7dfc433..8c7ce868 100644
--- a/crates/portalnet/src/utp/controller.rs
+++ b/crates/portalnet/src/utp/controller.rs
@@ -183,7 +183,7 @@ impl UtpController {
             })?;

         let mut data = vec![];
-        stream.read_to_eof(&mut data).await
+        let byte_len = stream.read_to_eof(&mut data).await
             .map_err(|err| {
                 self.metrics
                     .report_utp_outcome(UtpDirectionLabel::Inbound, UtpOutcomeLabel::FailedDataTx);
@@ -192,6 +192,7 @@
                     "Unable to locate content on the network: error reading data from {message}"
                 )
             })?;
+        tracing::warn!("Read {} bytes from uTP stream", byte_len);

         // report utp tx as successful, even if we go on to fail to process the payload
         self.metrics
@@ -241,6 +242,7 @@
             }
         };
+        tracing::warn!("Writing {} bytes to uTP stream", data.len());

         match stream.write(data).await {
             Ok(write_size) => {
                 if write_size != data.len() {
```

The number of bytes sent into …
So the first window appears to default to 2048 bytes, which seemed like a good first place to look... But even if I change that constant, the first 2048 bytes are still getting dropped. So I think it's a red herring.
This data is a little bit sketchy, because there's a decent amount of randomness involved, but here are some scenarios that seem to minimize or prevent the bug:
Normally the bug reproduces within 500-1000 runs, but with the above scenarios, I don't see the bug in 3500 or more attempts. The bug does reproduce if tracing is enabled on the receiver side. See two successful traces followed by an unsuccessful trace here:
At first glance, it looks like the failure pattern is something like:
@carver could you run your test with this potential "fix" and see if there is any improvement?
I'm running a test right now with this patch:

```diff
diff --git a/src/conn.rs b/src/conn.rs
index d2336e0..23bc8e0 100644
--- a/src/conn.rs
+++ b/src/conn.rs
@@ -184,6 +184,7 @@ pub struct Connection<const N: usize, P: ConnectionPeer> {
     pending_writes: VecDeque<QueuedWrite>,
     writable: Notify,
     latest_timeout: Option<Instant>,
+    syn_ack: Option<Packet>,
 }

 impl<const N: usize, P: ConnectionPeer> Connection<N, P> {
@@ -228,6 +229,7 @@ impl<const N: usize, P: ConnectionPeer> Connection<N, P> {
             pending_writes: VecDeque::new(),
             writable: Notify::new(),
             latest_timeout: None,
+            syn_ack: None,
         }
     }

@@ -254,6 +256,7 @@ impl<const N: usize, P: ConnectionPeer> Connection<N, P> {
             }
             Endpoint::Acceptor((syn, syn_ack)) => {
                 let state = self.state_packet().unwrap();
+                self.syn_ack = Some(state.clone());
                 self.socket_events
                     .send(SocketEvent::Outgoing((state, self.peer.clone())))
                     .unwrap();
@@ -787,7 +790,16 @@ impl<const N: usize, P: ConnectionPeer> Connection<N, P> {
         // NOTE: We should probably move this so that multiple incoming packets can be processed
         // before we send a STATE.
         match packet.packet_type() {
-            PacketType::Syn | PacketType::Fin | PacketType::Data => {
+            PacketType::Syn => {
+                if let Some(state) = &self.syn_ack {
+                    let event = SocketEvent::Outgoing((state.clone(), self.peer.clone()));
+                    if self.socket_events.send(event).is_err() {
+                        tracing::warn!("Cannot re-transmit syn ack packet: socket closed channel");
+                        return;
+                    }
+                }
+            }
+            PacketType::Fin | PacketType::Data => {
                 if let Some(state) = self.state_packet() {
                     let event = SocketEvent::Outgoing((state, self.peer.clone()));
                     if self.socket_events.send(event).is_err() {
```

If that doesn't solve it, I'll try 151 next. I don't have intuition for how 151 would be related, but I'm happy to try it anyway :)
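As a standalone illustration of the idea in this patch (the struct and field names below are invented for the sketch and are not the utp-rs types): the acceptor builds its STATE reply to the first SYN once, caches it, and replays exactly that cached packet when the SYN is retransmitted, rather than answering with a freshly generated STATE the way the removed `PacketType::Syn | PacketType::Fin | PacketType::Data` arm did.

```rust
// Toy model of the cache-and-replay behaviour; not utp-rs code.
#[derive(Clone, Debug, PartialEq)]
struct StatePacket {
    seq_nr: u16,
    ack_nr: u16,
}

struct Acceptor {
    seq_nr: u16,
    // Mirrors the new `syn_ack: Option<Packet>` field in the patch.
    syn_ack: Option<StatePacket>,
}

impl Acceptor {
    fn new() -> Self {
        Acceptor { seq_nr: 100, syn_ack: None }
    }

    /// Handle an incoming SYN and return the STATE packet to send back.
    fn on_syn(&mut self, syn_seq: u16) -> StatePacket {
        if let Some(cached) = &self.syn_ack {
            // Duplicate SYN: the initiator never saw our first STATE,
            // so replay the cached packet unchanged.
            return cached.clone();
        }
        // First SYN: build the STATE once and remember it for later replays.
        let state = StatePacket { seq_nr: self.seq_nr, ack_nr: syn_seq };
        self.syn_ack = Some(state.clone());
        state
    }
}

fn main() {
    let mut acceptor = Acceptor::new();
    let first = acceptor.on_syn(41);
    // Pretend the first STATE was lost and the initiator retransmits its SYN.
    let replayed = acceptor.on_syn(41);
    assert_eq!(first, replayed, "a retransmitted SYN should get back the identical STATE");
}
```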
With the above patch I am not able to reproduce the bug, but... I am also not seeing any line like this from the log:
My understanding is that the issue starts with a timeout. I'm not seeing any timeout with this patch (even with …). I think the next step is to build a more targeted uTP test, rather than just relying on randomness to reproduce it. I don't think I'll be able to do that in my random weekend downtime, so I'll pick it back up tomorrow.
Got one! I don't know why it took >20k attempts, instead of the usual 1k, but here is a SYN timeout without the partially-dropped content bug:
This seems to validate the diagnosis and the basic shape of the solution. I would still like to add a test that reproduces the failure, to prevent regressions.
I noticed this issue when running fluffy and requesting (`FindContent`) a lot of block headers and bodies. Occasionally the block body sent back fails validation. It fails already at the SSZ decoding step. The uTP socket went through proper connection closing (received the remote FIN).

Now this per se is perhaps not an actual issue. There might be error cases where a uTP socket gets closed with an early FIN without having sent all the bytes of the content. However, what I am often seeing is that when validation fails, part of the received content is a valid part of the encoded block body, but it is always the later part of the body. A big chunk of bytes at the beginning of the content is missing.
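For intuition on why the decode fails immediately when the front of the payload is missing: an SSZ container with variable-size fields starts with 4-byte little-endian offsets into the payload, so once the first bytes are dropped, whatever gets read as the first offset is really body data. A toy check along those lines (not the fluffy or trin decoder):

```rust
// Toy illustration of why a payload missing its prefix is rejected at the
// first SSZ offset check; not the real block body decoder.
fn first_ssz_offset_is_plausible(bytes: &[u8], fixed_part_len: usize) -> bool {
    if bytes.len() < 4 {
        return false;
    }
    let offset = u32::from_le_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]) as usize;
    // The first offset must point just past the fixed-size part and stay in bounds.
    offset >= fixed_part_len && offset <= bytes.len()
}

fn main() {
    // Toy payload: two 4-byte offsets (an 8-byte fixed part) followed by variable data.
    let mut valid = Vec::new();
    valid.extend_from_slice(&8u32.to_le_bytes());  // first field starts at byte 8
    valid.extend_from_slice(&12u32.to_le_bytes()); // second field starts at byte 12
    valid.extend_from_slice(&[0xaa, 0xbb, 0xcc, 0xdd, 0x01, 0x02]);

    assert!(first_ssz_offset_is_plausible(&valid, 8));
    // Drop the beginning of the payload, as happens in this bug: decoding
    // now fails right away because the "offset" is garbage.
    assert!(!first_ssz_offset_is_plausible(&valid[5..], 8));
}
```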
Occasionally I also see the validation fail because 0 data bytes got sent. Curious to hear when that would occur, but this is probably more of a normal error case where a socket got closed early.
I have only seen it so far with block bodies, but I am only testing block headers and block bodies, and headers don't require a uTP connection, so perhaps it will occur for any requested data that is bigger than the discv5 payload size.
Manually requesting (JSON-RPC `FindContent`) the same content again from the same node usually works (= valid data), so it is an issue that happens rarely. I have, however, been able to reproduce it by retrying semi-manually in a loop.

Typical error logs from the requesting fluffy node:
And the case of data bytes sent = 0:
I think it is also reproducible with a Trin node as the requesting node, which makes me deduce that the issue is on the content sender side (hence the creation of this issue). I've only seen it occur so far with Trin nodes as the sender of the content, but I am not 100% sure, as it depends on what the error messages I am seeing actually mean (see logs below).
To reproduce:
Run:
It might take a while but eventually you will hit a response which will take much longer and look like:
or
I'm not sure which of those (if either) relates to the error case where the beginning of the content is missing, but I am assuming here that one of the two does, hence the creation of this issue.