Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Trin is occasionally sending invalid/incomplete content #1610

Closed
kdeme opened this issue Dec 20, 2024 · 12 comments · Fixed by #1732
Closed

Trin is occasionally sending invalid/incomplete content #1610

kdeme opened this issue Dec 20, 2024 · 12 comments · Fixed by #1732

Comments

@kdeme
Copy link
Contributor

kdeme commented Dec 20, 2024

I noticed this issue when running fluffy and requesting (FindContent) a lot of block headers and bodies. Occasionally the block body send back fails validation. It fails already at the SSZ decoding step. The uTP socket went through proper connection closing (received the remote FIN).

Now this per se is perhaps not an actual issue. There might be some error cases where a uTP socket gets closed with an early FIN without having send all the bytes of the content. However what I am often seeing is that when validation fails a part of the content received is a valid part of the encoded block, but it is always the later part of the block body. A big chunk of bytes of the beginning of the content is missing.

Occasionally I also see the validation fail because 0 data bytes got send. Curious to hear when that would occur, but this is probably more of an normal error case where a socket got closed early.

I have seen it only so far with block bodies, but I am only testing block headers & block bodies. And headers don't require uTP connection, so perhaps it will occur for any data requested that is bigger than the discv5 payload size.

Manually requesting (JSON-RPC FindContent) the same content again from the same node usually works (= valid data) so it is an issue that happens rarely. I have however been able to reproduce this by retrying it semi-manually in a loop.

Typical error logs from requesting fluffy node:

WRN 2024-12-20 17:50:22.843+01:00 Validation of block body failed            topics="portal_hist" error="SSZ PortalBlockBodyLegacy: object dynamic portion starts at invalid offset" node=enr:-LS4QCJ4igzi-8VZzG7fuwwzzbab3u5kgDo2XlpiyBdzhSgsdxpnIJo1HjD3jVydTpBOyhJHvQTPHXau9YnykxZPjCuEZ1tVrWOqdCBjOTNlMTI5ZGNmYTAwNmM1YjQwZjY3M2I3Yzk4OGU5MjNkZGViNmI2gmlkgnY0gmlwhI_0qIWJc2VjcDI1NmsxoQIIu0viHReNIKnjhuSCXDMcC6TLKEDF5oMmqogiPISN4oN1ZHCCIzE contentKey=0x010e8434574dd477d537a77007c7607715ae73fff4447347abd320b8255fd548f5

And the case of data bytes send = 0:

WRN 2024-12-20 17:50:16.730+01:00 Validation of block body failed            topics="portal_hist" error="SSZ PortalBlockBodyLegacy: input of insufficient size" node=enr:-LS4QKBFkP1FnZr7sdzj-bicU6KhWCtpRby6leiRTjO8MgNmOM2mzRLvC9okXvI57OP69SjI9DTGEAzKj6w2VPIkuKuEZ1tWemOqdCBjOTNlMTI5ZGNmYTAwNmM1YjQwZjY3M2I3Yzk4OGU5MjNkZGViNmI2gmlkgnY0gmlwhLKAMrKJc2VjcDI1NmsxoQIjF3KPpXlg6aJ10rzOCiBGTMIkhuEZAGssSi8VlhQ-vIN1ZHCCIzE contentKey=0x01be29a407ad4522d25366cc2e7913bb6005c965fca007af7e6f6773fe29a7445e

I think it is also reproducible with a Trin node as requesting node, which makes me deduce that the issue is on the content sender side (hence the creation of this issue). And I've only seen it occur so far with Trin nodes as sender of the content. But I am not 100% sure as it depends on what the error message I am seeing actually mean (see logs below).

To reproduce:

# Run trin, I used v0.1.0
RUST_LOG=info cargo run -- --web3-http-address http://127.0.0.1:8547 --web3-transport http --discovery-port 9010 --bootnodes default --mb 200 --no-stun

Run:

while true; do curl -s -X POST -H 'Content-Type: application/json' -d '{"jsonrpc":"2.0","id":"1","method":"portal_historyFindContent","params":["enr:-LS4QKsj-3uj6kWH0rLl8-LOf3eeI7o6TkGNk5OMXecV4ETwY0u-F8JqCfAD1P6OusOFZoENox95jBmOvXx5dIK5Fn2EZ1tWG2OqdCBjOTNlMTI5ZGNmYTAwNmM1YjQwZjY3M2I3Yzk4OGU5MjNkZGViNmI2gmlkgnY0gmlwhIDHliOJc2VjcDI1NmsxoQOYfB94zD_LtTgL22BsteJJUnzT8mEQ9BxV3ASaXXuhcYN1ZHCCIzE", "0x017ec54d5714a58002c993856e0d952452fe39d494808647fbdf3ebcf6e3545b9c"]}' http://localhost:8547 | jq ; sleep 2; done

It might take a while but eventually you will hit a response which will take much longer and look like:

{
  "jsonrpc": "2.0",
  "id": "1",
  "error": {
    "code": -32099,
    "message": "FindContent request timeout: FailedValidation(\"Network: History, Reason: Content validation failed for content key BlockBody(BlockBodyKey { block_hash: [126, 197, 77, 87, 20, 165, 128, 2, 201, 147, 133, 110, 13, 149, 36, 82, 254, 57, 212, 148, 128, 134, 71, 251, 223, 62, 188, 246, 227, 84, 91, 156] }) with error: Block Body content has invalid encoding: BytesInvalid(\\\"Invalid block body ssz bytes\\\")\")"
  }
}

or

{
  "jsonrpc": "2.0",
  "id": "1",
  "error": {
    "code": -32099,
    "message": "Failed to parse error message from history subnetwork: Error(\"expected value\", line: 1, column: 1)"
  }
}

I'm not sure which/if one of those relates to the error case of the beginning of the content that is missing, but I assuming here that one of the two does, hence the creation of this issue.

@carver
Copy link
Collaborator

carver commented Mar 5, 2025

I'm working on reproducing this. Here is my script, adapted with the aim of halting when an error is encountered. I was worried it would keep looping after a failure and I would miss it :)

ATTEMPTS=0
while [[ ATTEMPTS -eq 0 ]] || (echo $RESULT | jq -e .result >/dev/null); do
 ATTEMPTS=$(( ATTEMPTS + 1))
 RESULT=$(curl -s -X POST -H 'Content-Type: application/json' -d '{"jsonrpc":"2.0","id":"1","method":"portal_historyFindContent","params":["enr:-LS4QKsj-3uj6kWH0rLl8-LOf3eeI7o6TkGNk5OMXecV4ETwY0u-F8JqCfAD1P6OusOFZoENox95jBmOvXx5dIK5Fn2EZ1tWG2OqdCBjOTNlMTI5ZGNmYTAwNmM1YjQwZjY3M2I3Yzk4OGU5MjNkZGViNmI2gmlkgnY0gmlwhIDHliOJc2VjcDI1NmsxoQOYfB94zD_LtTgL22BsteJJUnzT8mEQ9BxV3ASaXXuhcYN1ZHCCIzE", "0x017ec54d5714a58002c993856e0d952452fe39d494808647fbdf3ebcf6e3545b9c"]}' http://localhost:8547);
 echo -n "attempt: $ATTEMPTS, at: ";
 date;
 sleep 2;
done;
echo $RESULT | jq;

So far, I've run it ~500 times on the latest trin without reproducing. If I can't reproduce with a few thousand attempts, then I think the next step is to reproduce with an older version of trin. If that does show the error, then I think we can reasonably close the issue as resolved.

@carver
Copy link
Collaborator

carver commented Mar 5, 2025

Ok, got it in about 1k attempts:

{
  "jsonrpc": "2.0",
  "id": "1",
  "error": {
    "code": -32099,
    "message": "FindContent request timeout: FailedValidation(\"Network: History, Reason: Content validation failed for content key BlockBody(BlockBodyKey { block_hash: [126, 197, 77, 87, 20, 165, 128, 2, 201, 147, 133, 110, 13, 149, 36, 82, 254, 57, 212, 148, 128, 134, 71, 251, 223, 62, 188, 246, 227, 84, 91, 156] }) with error: Block Body content has invalid encoding: BytesInvalid(\\\"Invalid block body ssz bytes\\\")\")"
  }
}

@carver
Copy link
Collaborator

carver commented Mar 17, 2025

Yup, got a partial result. In this example, the invalid result is just the tail end of the correct one.

Valid result:

08000000ed130000a40000001401000083010000f10100005f020000cd0200003a030000a80300001704000086040000f604000064050000d305000041060000b00600001e070000a10700000f0800007e0800000109000071090000f4090000770a0000fb0a00007e0b0000010c0000840c0000070d0000770d0000fa0d00007d0e0000000f0000830f000006100000891000000c1100008f1100001212000081120000f012000062130000f86e82011d8506fc23ac008252089432be343b94f860124dc4fee278fdcbd38c102d88880e432e9ee7d7e000801ba0260b26abb88e953ba807a3dacdc2b50beed5bdb83c7b7ccd3b3f398b30ac290aa07c0132ffb5deb370d616b2c482bae62b9330b6a9cfd36a1a0b2f68911169cfddf86d81c88506fc23ac008252089432be343b94f860124dc4fee278fdcbd38c102d888817fbd1c5ea840400801ca0e6531ef56e87ae8f6cd0b0e9de9c75acd326f1535a475f6cc114056ea3be1037a07ca443bd3dc045a7e882e9fc0101feb5d72a5b32aff42cc1a141d1149f4c600af86c048506fc23ac008252089432be343b94f860124dc4fee278fdcbd38c102d88880e27cced4171d000801ca009a41d1e00580cc25be195c2c97305a4124895138431ffc263bdd58731757b2ea045046996a46996f606800606d50b2cbe9af15de335234fcb8fb8b27ec75b14c0f86c1c8506fc23ac008252089432be343b94f860124dc4fee278fdcbd38c102d88880f7d0351368a0400801ba03c77f169d2301722fc56c4fe0058bd691d6fcdf489200ea280659a30dd1a317ca0569c50a802b6b9280a89afbf7d113b18e85aa2b5b116029715cc2723abd8a2aff86c648506fc23ac008252089432be343b94f860124dc4fee278fdcbd38c102d88880e851a448382e400801ba0c3acc77a6b3555cd31cd0d2083c3e22e0c2277e8a5139bb97f40e5c2a7d38d29a078a947d2aa1fe0f941064cc51f51b9cc916cb5322218e194317f87b3c7af4e15f86b1c8506fc23ac008252089432be343b94f860124dc4fee278fdcbd38c102d88880e25b829b8a0c800801ca0ce32fe20a6297bc416d759d26105e177fa3ae3e130c9892c356e046f4a4ee4649f75e97a82361b478d53d86408a60b4bf2967c4606e8f81c1fad64c2feecfd02f86c2a8506fc23ac008252089432be343b94f860124dc4fee278fdcbd38c102d88880e37dfc343d92c00801ba0ab2d9ed6ebe4c42e4c9ca99886bbace36f6dfb515425b061b72746c75db16e87a03bce9caa47846181cc9daa9da30c8bb228f511ed6fad7272013146e7e97cec75f86d819b8506fc23ac008252089432be343b94f860124dc4fee278fdcbd38c102d88880e26f8039a1a4800801ca05b800a71cc7bf2987742c34000b4e772ce13568739a7e256632c1f4fd7de4552a07ed22cb4b0b0e50968c7a4ff8344db106357f26f50706a65fcede582c20fcf2ff86d81e08506fc23ac008252089432be343b94f860124dc4fee278fdcbd38c102d88884e0423cb14d41c00801ba0bcfdc66d48bd9887c1cb1cf9fb00c708e33651f5285ad9bff4d4f87b3bec1ba9a07c15c7540df833cf6208a0e5a88c1bd8a02ef2845980b62da8b4f5492535bde3f86e82012b8506fc23ac008252089432be343b94f860124dc4fee278fdcbd38c102d88880effc08b46061800801ca048d7dea1c8fa340a414dc87ad1e751666bd5f99efe7467b422c5bf9b8c8f4475a0699a8e19befb34398c16d1a8c1c926085c2a3406ea2874d4956b7167550c8e67f86c2e8506fc23ac008252089432be343b94f860124dc4fee278fdcbd38c102d88880e91eb26caef4800801ba0eda6b16858818de9edde478cb32f633354286be19c204943f1dc67951d3ffd49a06a7f171d04f3b3f0c495b7124f5afd1e6c33db20dd49f100bbd48f8e5ef95bdef86d81be8506fc23ac008252089432be343b94f860124dc4fee278fdcbd38c102d88880ed15636cb25e800801ca00e546101effd90a5a22fd3953c10bae139643783548dd70f6e159b68324e5015a04aebf1272ae5ff0d68250b34ce794a266bdae60c6c9f770cd1949467889ed502f86c068506fc23ac008252089432be343b94f860124dc4fee278fdcbd38c102d88880e13c9bbe960dc00801ba099483c373e8751eb6faa9ce920a9d77f354b8bdcc83e74a895d7395fc9f0045ba02c5c30c2eceda6e65ec741174948f3c43d8f6132c2d92890e431a8b975b827a7f86d81d08506fc23ac008252089432be343b94f860124dc4fee278fdcbd38c102d88880edabd51b0324800801ba06bb0588f38c6c75c886f9af248bc0ae33fe3883c2db5628045f2a702cfc6f6ffa0639345cac3bd658033c4ea39e4ac85ce44e0d1fcd5cf07d7ac925fb3ee82db85f86c1f8506fc23ac008252089432be343b94f860124dc4fee278fdcbd38c102d88880e4207d382550c00801ca0a26aae53e3b2acad379efa0d8aacf52a35b5d965236922b4fd6bb3bde1a25e6ba05eac4ece696c40159e2728e91064f9fe26050a63381b77eb3fd5373ff0c83bcff8810e8504a817c80083015f9094353765840d98943d116912997e3039afb16a0ea5880f25e9fb3e63bc009400000000000000000000000000000000000000001ca0811b0e6da5e5cf08936c41b0188cb9a34c442b59f04aabff5ac7ac54d001ba63a059b48d458c7c6176b1174f5e76b3890340c0d0e319429e56e9c761053f88b98ef86c038504a817c8008252089491337a300e0361bddb2e377dd4e88ccb7796663d880e1c2533bb9bd000801ca01286af8af9072fcd1d902e5c2db94c6449811c71147f51a94fe4060cbc188a4fa05580a63bed14933544415f72be1d1b9b13580b0f0096ffa1fcaefe9282860861f86d81df8504a817c80082562294fbb1b73c4f0bda4f67dca266ce6ef42f520fbb988807684b9e81eb4800801ba0a2f4e366669b855bc0eb1e0617e3d5fd50052fa5c31176d85c5a4c453a415dd4a078526c1fcfdd5405eaeb652943f2f7fb97a28b40da03e7c9c7216ea07c135425f8810d8504a817c80083015f9094353765840d98943d116912997e3039afb16a0ea5880e67c4a6d804d8809400000000000000000000000000000000000000001ca017bf01d4a6f52378e2c8861fafc3a05379ba301d4d3c7188343b209fad208826a00922be30f035dca08a421e20b80d7e29df5b9237b2834f08af20a924883a63acf86e168504a817c8008301d8a894af13834ec1eb11756b2803f6ba704fcd9f89c011896e0be8c4995af80000801ba071a60317c7f7882690b6d91028d278ed51698ba5c915cf8058da92bf0629f9aba07a858f6898cae1215c0979321082b98a58bea526ac99b2bf22449021bed3d1a8f881098504a817c80083015f9094353765840d98943d116912997e3039afb16a0ea58802e578cac36ca4009400000000000000000000000000000000000000001ca010eaf58e6ee25c13cd7f278bb3ef43b233d1ca1a5a40a156ed1fdbc09675fadfa07a82ae99c54939e94ac5c2de13e4a1ab853ee8f6a025ca9f31e84819a6c731bcf881158504a817c80083015f9094353765840d98943d116912997e3039afb16a0ea58801657e6096f060009400000000000000000000000000000000000000001ca00ba740fa7accf10ebd8b7a4c0b4d36d7ba57d62851d40b45a81fa76af0cb8427a0744959963f5ecad9e75007b6d6fb7407d31c35936fb1c80f988d995ab193ea3cf88281d08504a817c80083015f9094353765840d98943d116912997e3039afb16a0ea588025142257eeb6c009400000000000000000000000000000000000000001ba0b14ea032f7bf39277a830e1b6dcee3f490812e67ba44d9e4dc5c4fc1e720cdf4a026270b487b1b1f005d9f449637e8e8c362c5b929b5f46a9be0ba11a5b71b5a81f8811f8504a817c80083015f9094353765840d98943d116912997e3039afb16a0ea5880706a390b307a3c09400000000000000000000000000000000000000001ba05a1b56464215126fc835c1285189307a1c32b8d9155667fcd50e42799badb794a0427d7a720311249aae09fcf12ddedc1c54f2a8a9852c7b224fda1c0a1e6bee32f881108504a817c80083015f9094353765840d98943d116912997e3039afb16a0ea5880263015d110f1be09400000000000000000000000000000000000000001ba0e006565cadc89711b3098f7bffcbf05f8bc01ce36318a077645a444e501fdb7da07e695322360d1c219eca7a486c6b34273993da528ae7d8652cfb2d658f13d175f881198504a817c80083015f9094353765840d98943d116912997e3039afb16a0ea58808925f22573818009400000000000000000000000000000000000000001ca007810de0e0e13ee44d21545807f5535fb111c870c2ecf47e6697d71553503b16a07e706242d19abb0520738dc2d3fc9dfb253e288f77340fa634b0d0fd40452edbf881038504a817c80083015f9094353765840d98943d116912997e3039afb16a0ea5880e04fd4781cc78009400000000000000000000000000000000000000001ba05587e8c47e7975132ac231f909085332c1ffce444ea7a4f7f11f4b875a33ae2ca06d491b510ec65ee9b14d22534b7d575c132d72adaec5e88029f7594027d7840bf86e82014a8504a817c80082562294fbb1b73c4f0bda4f67dca266ce6ef42f520fbb988806c1d876eb692000801ca0a3f098f5c8efeb64886caf149a6f1ebb38959e85853e6628ec653d0f0208ed5aa014e5f64ae7284e58035853369560a9ed01d9e26804c6ee05f10be79c69b9773bf881078504a817c80083015f9094353765840d98943d116912997e3039afb16a0ea5880189e31f6f6a0c009400000000000000000000000000000000000000001ca096046faff2901b5a79df075b7d83c5c554627ae7bbe20055800e8b7c3afd8e3ea0424cccf47990812beadd33d45a51196d46a6f04afb6b02feeeeefa2e8a55b2e5f8811f8504a817c80083015f9094353765840d98943d116912997e3039afb16a0ea5880ab792eb33b810009400000000000000000000000000000000000000001ca01885c06e16a8344055a0110e3486beb126fda9b18f763f40c0e54ba16f28e403a0758ace4b39b3277be7acbea4c1f200a85af303e5cd120ae988da5291748ef2b5f8811f8504a817c80083015f9094353765840d98943d116912997e3039afb16a0ea588018d3e3446bacff09400000000000000000000000000000000000000001ca0d6cbe7437799fbce926881e2295b861e3dceb42383749990e90f7e8e052c2e35a0121f902fe7351f3f8141679c93fdc0529859506f1aee7340b296f8e7c601d2d6f8811a8504a817c80083015f9094353765840d98943d116912997e3039afb16a0ea58802c75cc2c2a0b7e09400000000000000000000000000000000000000001ca008fadac4da7361d84d7f6eb896e20ab988995d2f65eeeb21c4098bb7964fc103a049c66116be1eddaeb659ff008cda4d116d23ded6964983076d0a313fce68acbcf881118504a817c80083015f9094353765840d98943d116912997e3039afb16a0ea588029af34926e4fc009400000000000000000000000000000000000000001ca0bd3293e745ea47f1771e5a51ce3557a08a8024f93242830be905641370de6ecaa05b11ca99d841f56f7ed2ee0bce70b36c3ec414eb2d7162638fa110feb4c10931f8811f8504a817c80083015f9094353765840d98943d116912997e3039afb16a0ea5880d92f3c5cf02e8009400000000000000000000000000000000000000001ba01617e59b9b9933ec959a82e977377768fe9f316e0d10ac47d06e12834e280d65a04ea855e63a592d6a65ac666a7f0f9456a7949e95b7ec92c113de2b46664d3d3af881268504a817c80083015f9094353765840d98943d116912997e3039afb16a0ea5880debe8770acc98009400000000000000000000000000000000000000001ba0a1118f9a8119e106012587602326c858414bac53b96f9c2cf6425d05d16575b0a06e6d12088cf75056b8d30b4550747494201c1277a5199fdd181ae155806246e6f881158504a817c80083015f9094353765840d98943d116912997e3039afb16a0ea5880e11c8d56600f5009400000000000000000000000000000000000000001ba02db6befeefe54b936587e5e308c6d130194d907137e7ff0ebbc34853359fb319a056487db75f6ffb931f507ab68bea17b5db0043e2fe4c806cf230f6765831017bf8811f8504a817c80083015f9094353765840d98943d116912997e3039afb16a0ea58803d2e354aa7b7fe09400000000000000000000000000000000000000001ba094a6ff21478aa2897083fe74020e25f976bda9226d4cc706c21681b8774fe83ea02026fcfad1a7e55fcc429198edbd8b9062c7bce5774ce6fc4353298e2a0dac36f86d8201488504a817c80082562294fbb1b73c4f0bda4f67dca266ce6ef42f520fbb988802e3358d95428400801ba0be29b37fa70931eac8df904a52a996347d25e13fd122228c76ebfd1c700b75849f0cd0a47e50781de3280e20903cadd57f9c3f6ea2f7a122f3fab60526820ae7f86d81e58504a817c80082562294fbb1b73c4f0bda4f67dca266ce6ef42f520fbb988802e0b7a6d381dc00801ba0f777862c0b8b276f68bd8c0bab3f513d84a7260cc9b85fa298b0cd6c6d6f6dfba01353066bc9f8095f5313ab1a393efe70c82eb0d044e0e4508e5bacf32fe9f162f870830113028504a817c80083015f90940013c1d20fb157fda0391afb2c35c54b84dc68b1880de3ebb0eb4d40f8801ca0f3bb804a3ffb975f4b9527a2bfeb16c1c121fc236f37a95cb64a83e926df9cb1a01dee27e677e0fe3eb2fdf77de0d276ef764ac6552265f18714c9e051e4ded9c0f8811f8504a817c80083015f9094353765840d98943d116912997e3039afb16a0ea588230bf317e39fe8009400000000000000000000000000000000000000001ba08c08736644cf801af866ced8a95164dee06468e17e71c1fa9adbaba8e81622f3a07b7a26c087a02b4cc9b9d2120b0a812d19a36c0151f9bcf16de5abf5ce7bc215c0

Invalid result:

415f72be1d1b9b13580b0f0096ffa1fcaefe9282860861f86d81df8504a817c80082562294fbb1b73c4f0bda4f67dca266ce6ef42f520fbb988807684b9e81eb4800801ba0a2f4e366669b855bc0eb1e0617e3d5fd50052fa5c31176d85c5a4c453a415dd4a078526c1fcfdd5405eaeb652943f2f7fb97a28b40da03e7c9c7216ea07c135425f8810d8504a817c80083015f9094353765840d98943d116912997e3039afb16a0ea5880e67c4a6d804d8809400000000000000000000000000000000000000001ca017bf01d4a6f52378e2c8861fafc3a05379ba301d4d3c7188343b209fad208826a00922be30f035dca08a421e20b80d7e29df5b9237b2834f08af20a924883a63acf86e168504a817c8008301d8a894af13834ec1eb11756b2803f6ba704fcd9f89c011896e0be8c4995af80000801ba071a60317c7f7882690b6d91028d278ed51698ba5c915cf8058da92bf0629f9aba07a858f6898cae1215c0979321082b98a58bea526ac99b2bf22449021bed3d1a8f881098504a817c80083015f9094353765840d98943d116912997e3039afb16a0ea58802e578cac36ca4009400000000000000000000000000000000000000001ca010eaf58e6ee25c13cd7f278bb3ef43b233d1ca1a5a40a156ed1fdbc09675fadfa07a82ae99c54939e94ac5c2de13e4a1ab853ee8f6a025ca9f31e84819a6c731bcf881158504a817c80083015f9094353765840d98943d116912997e3039afb16a0ea58801657e6096f060009400000000000000000000000000000000000000001ca00ba740fa7accf10ebd8b7a4c0b4d36d7ba57d62851d40b45a81fa76af0cb8427a0744959963f5ecad9e75007b6d6fb7407d31c35936fb1c80f988d995ab193ea3cf88281d08504a817c80083015f9094353765840d98943d116912997e3039afb16a0ea588025142257eeb6c009400000000000000000000000000000000000000001ba0b14ea032f7bf39277a830e1b6dcee3f490812e67ba44d9e4dc5c4fc1e720cdf4a026270b487b1b1f005d9f449637e8e8c362c5b929b5f46a9be0ba11a5b71b5a81f8811f8504a817c80083015f9094353765840d98943d116912997e3039afb16a0ea5880706a390b307a3c09400000000000000000000000000000000000000001ba05a1b56464215126fc835c1285189307a1c32b8d9155667fcd50e42799badb794a0427d7a720311249aae09fcf12ddedc1c54f2a8a9852c7b224fda1c0a1e6bee32f881108504a817c80083015f9094353765840d98943d116912997e3039afb16a0ea5880263015d110f1be09400000000000000000000000000000000000000001ba0e006565cadc89711b3098f7bffcbf05f8bc01ce36318a077645a444e501fdb7da07e695322360d1c219eca7a486c6b34273993da528ae7d8652cfb2d658f13d175f881198504a817c80083015f9094353765840d98943d116912997e3039afb16a0ea58808925f22573818009400000000000000000000000000000000000000001ca007810de0e0e13ee44d21545807f5535fb111c870c2ecf47e6697d71553503b16a07e706242d19abb0520738dc2d3fc9dfb253e288f77340fa634b0d0fd40452edbf881038504a817c80083015f9094353765840d98943d116912997e3039afb16a0ea5880e04fd4781cc78009400000000000000000000000000000000000000001ba05587e8c47e7975132ac231f909085332c1ffce444ea7a4f7f11f4b875a33ae2ca06d491b510ec65ee9b14d22534b7d575c132d72adaec5e88029f7594027d7840bf86e82014a8504a817c80082562294fbb1b73c4f0bda4f67dca266ce6ef42f520fbb988806c1d876eb692000801ca0a3f098f5c8efeb64886caf149a6f1ebb38959e85853e6628ec653d0f0208ed5aa014e5f64ae7284e58035853369560a9ed01d9e26804c6ee05f10be79c69b9773bf881078504a817c80083015f9094353765840d98943d116912997e3039afb16a0ea5880189e31f6f6a0c009400000000000000000000000000000000000000001ca096046faff2901b5a79df075b7d83c5c554627ae7bbe20055800e8b7c3afd8e3ea0424cccf47990812beadd33d45a51196d46a6f04afb6b02feeeeefa2e8a55b2e5f8811f8504a817c80083015f9094353765840d98943d116912997e3039afb16a0ea5880ab792eb33b810009400000000000000000000000000000000000000001ca01885c06e16a8344055a0110e3486beb126fda9b18f763f40c0e54ba16f28e403a0758ace4b39b3277be7acbea4c1f200a85af303e5cd120ae988da5291748ef2b5f8811f8504a817c80083015f9094353765840d98943d116912997e3039afb16a0ea588018d3e3446bacff09400000000000000000000000000000000000000001ca0d6cbe7437799fbce926881e2295b861e3dceb42383749990e90f7e8e052c2e35a0121f902fe7351f3f8141679c93fdc0529859506f1aee7340b296f8e7c601d2d6f8811a8504a817c80083015f9094353765840d98943d116912997e3039afb16a0ea58802c75cc2c2a0b7e09400000000000000000000000000000000000000001ca008fadac4da7361d84d7f6eb896e20ab988995d2f65eeeb21c4098bb7964fc103a049c66116be1eddaeb659ff008cda4d116d23ded6964983076d0a313fce68acbcf881118504a817c80083015f9094353765840d98943d116912997e3039afb16a0ea588029af34926e4fc009400000000000000000000000000000000000000001ca0bd3293e745ea47f1771e5a51ce3557a08a8024f93242830be905641370de6ecaa05b11ca99d841f56f7ed2ee0bce70b36c3ec414eb2d7162638fa110feb4c10931f8811f8504a817c80083015f9094353765840d98943d116912997e3039afb16a0ea5880d92f3c5cf02e8009400000000000000000000000000000000000000001ba01617e59b9b9933ec959a82e977377768fe9f316e0d10ac47d06e12834e280d65a04ea855e63a592d6a65ac666a7f0f9456a7949e95b7ec92c113de2b46664d3d3af881268504a817c80083015f9094353765840d98943d116912997e3039afb16a0ea5880debe8770acc98009400000000000000000000000000000000000000001ba0a1118f9a8119e106012587602326c858414bac53b96f9c2cf6425d05d16575b0a06e6d12088cf75056b8d30b4550747494201c1277a5199fdd181ae155806246e6f881158504a817c80083015f9094353765840d98943d116912997e3039afb16a0ea5880e11c8d56600f5009400000000000000000000000000000000000000001ba02db6befeefe54b936587e5e308c6d130194d907137e7ff0ebbc34853359fb319a056487db75f6ffb931f507ab68bea17b5db0043e2fe4c806cf230f6765831017bf8811f8504a817c80083015f9094353765840d98943d116912997e3039afb16a0ea58803d2e354aa7b7fe09400000000000000000000000000000000000000001ba094a6ff21478aa2897083fe74020e25f976bda9226d4cc706c21681b8774fe83ea02026fcfad1a7e55fcc429198edbd8b9062c7bce5774ce6fc4353298e2a0dac36f86d8201488504a817c80082562294fbb1b73c4f0bda4f67dca266ce6ef42f520fbb988802e3358d95428400801ba0be29b37fa70931eac8df904a52a996347d25e13fd122228c76ebfd1c700b75849f0cd0a47e50781de3280e20903cadd57f9c3f6ea2f7a122f3fab60526820ae7f86d81e58504a817c80082562294fbb1b73c4f0bda4f67dca266ce6ef42f520fbb988802e0b7a6d381dc00801ba0f777862c0b8b276f68bd8c0bab3f513d84a7260cc9b85fa298b0cd6c6d6f6dfba01353066bc9f8095f5313ab1a393efe70c82eb0d044e0e4508e5bacf32fe9f162f870830113028504a817c80083015f90940013c1d20fb157fda0391afb2c35c54b84dc68b1880de3ebb0eb4d40f8801ca0f3bb804a3ffb975f4b9527a2bfeb16c1c121fc236f37a95cb64a83e926df9cb1a01dee27e677e0fe3eb2fdf77de0d276ef764ac6552265f18714c9e051e4ded9c0f8811f8504a817c80083015f9094353765840d98943d116912997e3039afb16a0ea588230bf317e39fe8009400000000000000000000000000000000000000001ba08c08736644cf801af866ced8a95164dee06468e17e71c1fa9adbaba8e81622f3a07b7a26c087a02b4cc9b9d2120b0a812d19a36c0151f9bcf16de5abf5ce7bc215c0

The invalid result is missing the first 4096 hex characters (2048 bytes), but is otherwise correct. I compared the strings via this python:

from difflib import SequenceMatcher
SequenceMatcher(None, invalid, valid, autojunk=False).find_longest_match()

I will repeat it a couple times to see if the shape of the invalid result is the same every time.

@carver
Copy link
Collaborator

carver commented Mar 18, 2025

Huh, I got same exact result after the next attempt. It is missing the first 2048 bytes again.

Also, they both came from the same host: enr:-LS4QKsj-3uj6kWH0rLl8-LOf3eeI7o6TkGNk5OMXecV4ETwY0u-F8JqCfAD1P6OusOFZoENox95jBmOvXx5dIK5Fn2EZ1tWG2OqdCBjOTNlMTI5ZGNmYTAwNmM1YjQwZjY3M2I3Yzk4OGU5MjNkZGViNmI2gmlkgnY0gmlwhIDHliOJc2VjcDI1NmsxoQOYfB94zD_LtTgL22BsteJJUnzT8mEQ9BxV3ASaXXuhcYN1ZHCCIzE

Lol, of course it's the same node, this test explicitly sends a find-content request to the same node over and over again...

@carver
Copy link
Collaborator

carver commented Mar 18, 2025

That target node was a little old, so I reproduced against enr:-JO4QL8ix2g8TRqnOhA39OTCWl4XaGVX9bZqvRwLgONdr_GPa00eVfHKHx6d-KG8u9Qgl6J5AtPMkHvexl6IewfqfGiEZ8C68mOJdCBlYjYzZDA5gmlkgnY0gmlwhIDHliOJc2VjcDI1NmsxoQOYfB94zD_LtTgL22BsteJJUnzT8mEQ9BxV3ASaXXuhcYN1ZHCCIzE which is at v0.2.1 ( eb63d09 ).

Same result: missing the first 2048 bytes.

@carver
Copy link
Collaborator

carver commented Mar 21, 2025

Ok, I was finally able to get control of a sending node that could reproduce the issue. I added logs like this on both sides:

diff --git a/crates/portalnet/src/utp/controller.rs b/crates/portalnet/src/utp/controller.rs
index e7dfc433..8c7ce868 100644
--- a/crates/portalnet/src/utp/controller.rs
+++ b/crates/portalnet/src/utp/controller.rs
@@ -183,7 +183,7 @@ impl UtpController {
         })?;
 
         let mut data = vec![];
-        stream.read_to_eof(&mut data).await
+        let byte_len = stream.read_to_eof(&mut data).await
             .map_err(|err| {
                 self.metrics
                     .report_utp_outcome(UtpDirectionLabel::Inbound, UtpOutcomeLabel::FailedDataTx);
@@ -192,6 +192,7 @@ impl UtpController {
                     "Unable to locate content on the network: error reading data from {message}"
                 )
             })?;
+        tracing::warn!("Read {} bytes from uTP stream", byte_len);
 
         // report utp tx as successful, even if we go on to fail to process the payload
         self.metrics
@@ -241,6 +242,7 @@ impl UtpController {
             }
         };
 
+        tracing::warn!("Writing {} bytes to uTP stream", data.len());
         match stream.write(data).await {
             Ok(write_size) => {
                 if write_size != data.len() {

The number of bytes sent into write(data) was 5102, and the number of bytes returned by read_to_eof() was 3054 (on the failure at the 511th attempt). So the bug is necessarily at or below the uTP logic.

@carver
Copy link
Collaborator

carver commented Mar 22, 2025

So the first window appears to default to 2048 bytes, which seemed like a good first place to look... But even if I change that constant, the first 2048 bytes are still getting dropped. So I think it's a red herring.

@carver
Copy link
Collaborator

carver commented Mar 23, 2025

This data is a little bit sketchy, because there's a decent amount of randomness involved, but here are some scenarios that seem to minimize or prevent the bug:

  1. The sending node is running with a single CPU
  2. The sending node is running with RUST_LOG="info,utp_rs=trace" (who doesn't love a heisenbug?)

Normally the bug reproduces within 500-1000 runs, but with the above scenarios, I don't see the bug in 3500 or more attempts.

The bug does reproduce if the trace is turned on on the receiver side. See two successful traces followed by an unsuccessful trace here:

2025-03-23T04:13:09.367323Z DEBUG uTP{send=13831 recv=13830}: utp_rs::conn: uTP conn starting... Peer { id: 0x05207bda395575c2134e5b22c11483a8969cf7004b3fbbc562b22e0eb2347b21, peer: Some(UtpPeer { enr: Enr { id: Some("v4"), seq: 1742595518, NodeId: 0x05207bda395575c2134e5b22c11483a8969cf7004b3fbbc562b22e0eb2347b21, signature: "9773b0fd5dab7ef676dc8bd11f190afbea02ffe0e28df8d9677aa40f58304e4e050e9a4e3fa7f71a442d9b04ba778e1be87a2b53dd7e390870b2a61180350b07", IpV4 UDP Socket: Some(134.199.237.161:9009), IpV6 UDP Socket: None, IpV4 TCP Socket: None, IpV6 TCP Socket: None, Other Pairs: [("c", "89742063336366663033"), ("secp256k1", "a103f2e7b84165eeaa7dfff5b31260f6a9006ea19088ebda696d86abc577d1c8b8c5")], .. }, Peer Client Type: "t c3cff03" }) }
2025-03-23T04:13:09.382702Z TRACE uTP{send=13831 recv=13830}: utp_rs::conn: Received packet cid=13830 packetType=ST_STATE seqNr=46610 ackNr=6018 timestamp=3029149564 timestampDiff=3284 remoteWindow=1048576
2025-03-23T04:13:09.383013Z TRACE uTP{send=13831 recv=13830}: utp_rs::conn: Received packet cid=13830 packetType=ST_DATA seqNr=46612 ackNr=6018 timestamp=3029149861 timestampDiff=3284 remoteWindow=1048576
2025-03-23T04:13:09.383103Z TRACE uTP{send=13831 recv=13830}: utp_rs::conn: Received packet cid=13830 packetType=ST_DATA seqNr=46610 ackNr=6018 timestamp=3029149861 timestampDiff=3284 remoteWindow=1048576
2025-03-23T04:13:09.383502Z TRACE uTP{send=13831 recv=13830}: utp_rs::conn: Received packet cid=13830 packetType=ST_DATA seqNr=46611 ackNr=6018 timestamp=3029149861 timestampDiff=3284 remoteWindow=1048576
2025-03-23T04:13:09.407944Z TRACE uTP{send=13831 recv=13830}: utp_rs::conn: Received packet cid=13830 packetType=ST_FIN seqNr=46617 ackNr=6018 timestamp=3029173767 timestampDiff=0 remoteWindow=1048576
2025-03-23T04:13:09.408040Z DEBUG uTP{send=13831 recv=13830}: utp_rs::conn: received FIN seq=46617
2025-03-23T04:13:09.408108Z TRACE uTP{send=13831 recv=13830}: utp_rs::conn: Received packet cid=13830 packetType=ST_DATA seqNr=46614 ackNr=6018 timestamp=3029173727 timestampDiff=11343 remoteWindow=1048576
2025-03-23T04:13:09.408141Z TRACE uTP{send=13831 recv=13830}: utp_rs::conn: Received packet cid=13830 packetType=ST_DATA seqNr=46615 ackNr=6018 timestamp=3029173727 timestampDiff=11343 remoteWindow=1048576
2025-03-23T04:13:09.408168Z TRACE uTP{send=13831 recv=13830}: utp_rs::conn: Received packet cid=13830 packetType=ST_DATA seqNr=46616 ackNr=6018 timestamp=3029173727 timestampDiff=11343 remoteWindow=1048576
2025-03-23T04:13:09.408196Z TRACE uTP{send=13831 recv=13830}: utp_rs::conn: Received packet cid=13830 packetType=ST_DATA seqNr=46613 ackNr=6018 timestamp=3029173727 timestampDiff=11343 remoteWindow=1048576
2025-03-23T04:13:09.408255Z DEBUG uTP{send=13831 recv=13830}: utp_rs::conn: uTP conn closing... err=None
2025-03-23T04:13:09.408361Z  WARN portalnet::utp::controller: Read 5102 bytes from uTP stream
2025-03-23T04:13:09.408479Z DEBUG utp_rs::socket: uTP conn shutdown cid.send=13831 cid.recv=13830
2025-03-23T04:13:09.448248Z DEBUG utp_rs::socket: received uTP packet for non-existing conn cid=13830 packet=Data seq=46613 ack=6018 peer_init_cid=ConnectionId { send: 13830, recv: 13829, peer_id: 0x05207bda395575c2134e5b22c11483a8969cf7004b3fbbc562b22e0eb2347b21 } we_init_cid=ConnectionId { send: 13831, recv: 13830, peer_id: 0x05207bda395575c2134e5b22c11483a8969cf7004b3fbbc562b22e0eb2347b21 } acc_cid=ConnectionId { send: 13829, recv: 13830, peer_id: 0x05207bda395575c2134e5b22c11483a8969cf7004b3fbbc562b22e0eb2347b21 }
2025-03-23T04:13:09.773343Z DEBUG uTP{send=33774 recv=33773}: utp_rs::conn: uTP conn starting... Peer { id: 0x05207bda395575c2134e5b22c11483a8969cf7004b3fbbc562b22e0eb2347b21, peer: Some(UtpPeer { enr: Enr { id: Some("v4"), seq: 1742595518, NodeId: 0x05207bda395575c2134e5b22c11483a8969cf7004b3fbbc562b22e0eb2347b21, signature: "9773b0fd5dab7ef676dc8bd11f190afbea02ffe0e28df8d9677aa40f58304e4e050e9a4e3fa7f71a442d9b04ba778e1be87a2b53dd7e390870b2a61180350b07", IpV4 UDP Socket: Some(134.199.237.161:9009), IpV6 UDP Socket: None, IpV4 TCP Socket: None, IpV6 TCP Socket: None, Other Pairs: [("c", "89742063336366663033"), ("secp256k1", "a103f2e7b84165eeaa7dfff5b31260f6a9006ea19088ebda696d86abc577d1c8b8c5")], .. }, Peer Client Type: "t c3cff03" }) }
2025-03-23T04:13:09.791309Z TRACE uTP{send=33774 recv=33773}: utp_rs::conn: Received packet cid=33773 packetType=ST_STATE seqNr=14429 ackNr=25852 timestamp=3029557683 timestampDiff=5369 remoteWindow=1048576
2025-03-23T04:13:09.791792Z TRACE uTP{send=33774 recv=33773}: utp_rs::conn: Received packet cid=33773 packetType=ST_DATA seqNr=14431 ackNr=25852 timestamp=3029558045 timestampDiff=5369 remoteWindow=1048576
2025-03-23T04:13:09.794987Z TRACE uTP{send=33774 recv=33773}: utp_rs::conn: Received packet cid=33773 packetType=ST_DATA seqNr=14429 ackNr=25852 timestamp=3029558045 timestampDiff=5369 remoteWindow=1048576
2025-03-23T04:13:09.795111Z TRACE uTP{send=33774 recv=33773}: utp_rs::conn: Received packet cid=33773 packetType=ST_DATA seqNr=14430 ackNr=25852 timestamp=3029558045 timestampDiff=5369 remoteWindow=1048576
2025-03-23T04:13:09.816543Z TRACE uTP{send=33774 recv=33773}: utp_rs::conn: Received packet cid=33773 packetType=ST_DATA seqNr=14432 ackNr=25852 timestamp=3029582300 timestampDiff=8413 remoteWindow=1048576
2025-03-23T04:13:09.816647Z TRACE uTP{send=33774 recv=33773}: utp_rs::conn: Received packet cid=33773 packetType=ST_FIN seqNr=14436 ackNr=25852 timestamp=3029582327 timestampDiff=0 remoteWindow=1048576
2025-03-23T04:13:09.816669Z DEBUG uTP{send=33774 recv=33773}: utp_rs::conn: received FIN seq=14436
2025-03-23T04:13:09.816714Z TRACE uTP{send=33774 recv=33773}: utp_rs::conn: Received packet cid=33773 packetType=ST_DATA seqNr=14433 ackNr=25852 timestamp=3029582300 timestampDiff=8413 remoteWindow=1048576
2025-03-23T04:13:09.825545Z TRACE uTP{send=33774 recv=33773}: utp_rs::conn: Received packet cid=33773 packetType=ST_DATA seqNr=14434 ackNr=25852 timestamp=3029582300 timestampDiff=8413 remoteWindow=1048576
2025-03-23T04:13:09.825644Z TRACE uTP{send=33774 recv=33773}: utp_rs::conn: Received packet cid=33773 packetType=ST_DATA seqNr=14435 ackNr=25852 timestamp=3029582300 timestampDiff=8413 remoteWindow=1048576
2025-03-23T04:13:09.825697Z DEBUG uTP{send=33774 recv=33773}: utp_rs::conn: uTP conn closing... err=None
2025-03-23T04:13:09.825795Z  WARN portalnet::utp::controller: Read 5102 bytes from uTP stream
2025-03-23T04:13:09.825795Z DEBUG utp_rs::socket: uTP conn shutdown cid.send=33774 cid.recv=33773
2025-03-23T04:13:10.208428Z DEBUG uTP{send=40139 recv=40138}: utp_rs::conn: uTP conn starting... Peer { id: 0x05207bda395575c2134e5b22c11483a8969cf7004b3fbbc562b22e0eb2347b21, peer: Some(UtpPeer { enr: Enr { id: Some("v4"), seq: 1742595518, NodeId: 0x05207bda395575c2134e5b22c11483a8969cf7004b3fbbc562b22e0eb2347b21, signature: "9773b0fd5dab7ef676dc8bd11f190afbea02ffe0e28df8d9677aa40f58304e4e050e9a4e3fa7f71a442d9b04ba778e1be87a2b53dd7e390870b2a61180350b07", IpV4 UDP Socket: Some(134.199.237.161:9009), IpV6 UDP Socket: None, IpV4 TCP Socket: None, IpV6 TCP Socket: None, Other Pairs: [("c", "89742063336366663033"), ("secp256k1", "a103f2e7b84165eeaa7dfff5b31260f6a9006ea19088ebda696d86abc577d1c8b8c5")], .. }, Peer Client Type: "t c3cff03" }) }
2025-03-23T04:13:11.210259Z DEBUG uTP{send=40139 recv=40138}: utp_rs::conn: timeout seq=34855 ack=0 packet=Syn
2025-03-23T04:13:11.210338Z TRACE uTP{send=40139 recv=40138}: utp_rs::conn: log_msg="retrying connection, after 1 attempts"
2025-03-23T04:13:11.224665Z TRACE uTP{send=40139 recv=40138}: utp_rs::conn: Received packet cid=40138 packetType=ST_DATA seqNr=37743 ackNr=34855 timestamp=3030993551 timestampDiff=3100 remoteWindow=1048576
2025-03-23T04:13:11.269349Z TRACE uTP{send=40139 recv=40138}: utp_rs::conn: Received packet cid=40138 packetType=ST_DATA seqNr=37741 ackNr=34855 timestamp=3030993522 timestampDiff=3100 remoteWindow=1048576
2025-03-23T04:13:11.269426Z TRACE uTP{send=40139 recv=40138}: utp_rs::conn: Received packet cid=40138 packetType=ST_DATA seqNr=37742 ackNr=34855 timestamp=3030993544 timestampDiff=3100 remoteWindow=1048576
2025-03-23T04:13:11.278322Z TRACE uTP{send=40139 recv=40138}: utp_rs::conn: Received packet cid=40138 packetType=ST_STATE seqNr=37744 ackNr=34855 timestamp=3030993792 timestampDiff=4606 remoteWindow=1048576
2025-03-23T04:13:13.263205Z TRACE uTP{send=40139 recv=40138}: utp_rs::conn: Received packet cid=40138 packetType=ST_DATA seqNr=37743 ackNr=34855 timestamp=3032995688 timestampDiff=4606 remoteWindow=1048576
2025-03-23T04:13:13.263304Z TRACE uTP{send=40139 recv=40138}: utp_rs::conn: Received packet cid=40138 packetType=ST_DATA seqNr=37741 ackNr=34855 timestamp=3032995668 timestampDiff=4606 remoteWindow=1048576
2025-03-23T04:13:13.264041Z TRACE uTP{send=40139 recv=40138}: utp_rs::conn: Received packet cid=40138 packetType=ST_DATA seqNr=37742 ackNr=34855 timestamp=3032995682 timestampDiff=4606 remoteWindow=1048576
2025-03-23T04:13:13.318475Z TRACE uTP{send=40139 recv=40138}: utp_rs::conn: Received packet cid=40138 packetType=ST_DATA seqNr=37746 ackNr=34855 timestamp=3033053932 timestampDiff=11050 remoteWindow=1048576
2025-03-23T04:13:13.319312Z TRACE uTP{send=40139 recv=40138}: utp_rs::conn: Received packet cid=40138 packetType=ST_DATA seqNr=37745 ackNr=34855 timestamp=3033053932 timestampDiff=11050 remoteWindow=1048576
2025-03-23T04:13:13.320043Z TRACE uTP{send=40139 recv=40138}: utp_rs::conn: Received packet cid=40138 packetType=ST_DATA seqNr=37744 ackNr=34855 timestamp=3033053932 timestampDiff=11050 remoteWindow=1048576
2025-03-23T04:13:13.320750Z TRACE uTP{send=40139 recv=40138}: utp_rs::conn: Received packet cid=40138 packetType=ST_DATA seqNr=37747 ackNr=34855 timestamp=3033053932 timestampDiff=11050 remoteWindow=1048576
2025-03-23T04:13:13.321456Z TRACE uTP{send=40139 recv=40138}: utp_rs::conn: Received packet cid=40138 packetType=ST_FIN seqNr=37748 ackNr=34855 timestamp=3033053963 timestampDiff=0 remoteWindow=1048576
2025-03-23T04:13:13.321475Z DEBUG uTP{send=40139 recv=40138}: utp_rs::conn: received FIN seq=37748
2025-03-23T04:13:13.322213Z DEBUG uTP{send=40139 recv=40138}: utp_rs::conn: uTP conn closing... err=None
2025-03-23T04:13:13.322325Z  WARN portalnet::utp::controller: Read 3054 bytes from uTP stream
2025-03-23T04:13:13.322547Z DEBUG utp_rs::socket: uTP conn shutdown cid.send=40139 cid.recv=40138

At first glance, it looks like the failure pattern is something like:

  1. recipient initiates connection with SYN
  2. sender sends STATE packet to acknowledge connection
  3. [STATE packet gets lost for whatever reason?]
  4. sender starts sending DATA packets
  5. recipient re-attempts to initiate connection with SYN
  6. sender repeats STATE packet to acknowledge connection, but with a later sequence number (subsequent to the DATA packets)
  7. recipient starts receiving the content from the later sequence number, ignoring all data sent with a sequence number lower than the 2nd STATE packet

@ogenev
Copy link
Member

ogenev commented Mar 23, 2025

@carver could you run your test with this potential "fix" and see if there is any improvements?

ethereum/utp#151

@carver
Copy link
Collaborator

carver commented Mar 23, 2025

I'm running a test right now with this patch:

diff --git a/src/conn.rs b/src/conn.rs
index d2336e0..23bc8e0 100644
--- a/src/conn.rs
+++ b/src/conn.rs
@@ -184,6 +184,7 @@ pub struct Connection<const N: usize, P: ConnectionPeer> {
     pending_writes: VecDeque<QueuedWrite>,
     writable: Notify,
     latest_timeout: Option<Instant>,
+    syn_ack: Option<Packet>,
 }
 
 impl<const N: usize, P: ConnectionPeer> Connection<N, P> {
@@ -228,6 +229,7 @@ impl<const N: usize, P: ConnectionPeer> Connection<N, P> {
             pending_writes: VecDeque::new(),
             writable: Notify::new(),
             latest_timeout: None,
+            syn_ack: None,
         }
     }
 
@@ -254,6 +256,7 @@ impl<const N: usize, P: ConnectionPeer> Connection<N, P> {
             }
             Endpoint::Acceptor((syn, syn_ack)) => {
                 let state = self.state_packet().unwrap();
+                self.syn_ack = Some(state.clone());
                 self.socket_events
                     .send(SocketEvent::Outgoing((state, self.peer.clone())))
                     .unwrap();
@@ -787,7 +790,16 @@ impl<const N: usize, P: ConnectionPeer> Connection<N, P> {
         // NOTE: We should probably move this so that multiple incoming packets can be processed
         // before we send a STATE.
         match packet.packet_type() {
-            PacketType::Syn | PacketType::Fin | PacketType::Data => {
+            PacketType::Syn => {
+                if let Some(state) = &self.syn_ack {
+                    let event = SocketEvent::Outgoing((state.clone(), self.peer.clone()));
+                    if self.socket_events.send(event).is_err() {
+                        tracing::warn!("Cannot re-transmit syn ack packet: socket closed channel");
+                        return;
+                    }
+                }
+            }
+            PacketType::Fin | PacketType::Data => {
                 if let Some(state) = self.state_packet() {
                     let event = SocketEvent::Outgoing((state, self.peer.clone()));
                     if self.socket_events.send(event).is_err() {

If that doesn't solve it, I'll try 151 next. I don't have intuition for how 151 would be related, but I'm happy to try it anyway :)

@carver
Copy link
Collaborator

carver commented Mar 23, 2025

With the above patch I am not able to reproduce the bug, but... I am also not seeing any line like this from the log:

2025-03-23T20:46:51.730893Z DEBUG uTP{send=30634 recv=30633}: utp_rs::conn: timeout seq=47151 ack=0 packet=Syn

My understanding is that the issue starts with a timeout. I'm not seeing any timeout with this patch (even with 5x 15x more attempts than usual). So I can't rule out that the patch is making the SYN timeout less likely, rather than fixing the actual problem of resending the correct STATE response to the SYN.

I think the next step is to build a more targeted utp test, rather than just relying on randomness to reproduce it. I don't think I'll be able to do that in my random weekend downtime, so I'll pick it back up tomorrow.

@carver
Copy link
Collaborator

carver commented Mar 23, 2025

Got one! I don't know why it took >20k attempts, instead of the usual 1k, but here is a SYN timeout without the partially-dropped content bug:

2025-03-23T22:58:16.492880Z DEBUG uTP{send=26222 recv=26221}: utp_rs::conn: uTP conn starting... Peer { id: 0x05207bda395575c2134e5b22c11483a8969cf7004b3fbbc562b22e0eb2347b21, peer: Some(UtpPeer { enr: Enr { id: Some("v4"), seq: 1742595518, NodeId: 0x05207bda395575c2134e5b22c11483a8969cf7004b3fbbc562b22e0eb2347b21, signature: "9773b0fd5dab7ef676dc8bd11f190afbea02ffe0e28df8d9677aa40f58304e4e050e9a4e3fa7f71a442d9b04ba778e1be87a2b53dd7e390870b2a61180350b07", IpV4 UDP Socket: Some(134.199.237.161:9009), IpV6 UDP Socket: None, IpV4 TCP Socket: None, IpV6 TCP Socket: None, Other Pairs: [("c", "89742063336366663033"), ("secp256k1", "a103f2e7b84165eeaa7dfff5b31260f6a9006ea19088ebda696d86abc577d1c8b8c5")], .. }, Peer Client Type: "t c3cff03" }) }
2025-03-23T22:58:16.509400Z TRACE uTP{send=26222 recv=26221}: utp_rs::conn: Received packet cid=26221 packetType=ST_DATA seqNr=27274 ackNr=26094 timestamp=1816804078 timestampDiff=8570 remoteWindow=1048576
2025-03-23T22:58:17.494533Z DEBUG uTP{send=26222 recv=26221}: utp_rs::conn: timeout seq=26094 ack=0 packet=Syn
2025-03-23T22:58:17.494602Z TRACE uTP{send=26222 recv=26221}: utp_rs::conn: log_msg="retrying connection, after 1 attempts"
2025-03-23T22:58:17.511962Z TRACE uTP{send=26222 recv=26221}: utp_rs::conn: Received packet cid=26221 packetType=ST_DATA seqNr=27275 ackNr=26094 timestamp=1817805483 timestampDiff=8570 remoteWindow=1048576
2025-03-23T22:58:17.512060Z TRACE uTP{send=26222 recv=26221}: utp_rs::conn: Received packet cid=26221 packetType=ST_DATA seqNr=27273 ackNr=26094 timestamp=1817805455 timestampDiff=8570 remoteWindow=1048576
2025-03-23T22:58:17.512105Z TRACE uTP{send=26222 recv=26221}: utp_rs::conn: Received packet cid=26221 packetType=ST_DATA seqNr=27274 ackNr=26094 timestamp=1817805477 timestampDiff=8570 remoteWindow=1048576
2025-03-23T22:58:17.512805Z TRACE uTP{send=26222 recv=26221}: utp_rs::conn: Received packet cid=26221 packetType=ST_STATE seqNr=27273 ackNr=26094 timestamp=1816803719 timestampDiff=8570 remoteWindow=1048576
2025-03-23T22:58:19.512381Z TRACE uTP{send=26222 recv=26221}: utp_rs::conn: Received packet cid=26221 packetType=ST_DATA seqNr=27275 ackNr=26094 timestamp=1819806898 timestampDiff=10954 remoteWindow=1048576
2025-03-23T22:58:19.512483Z TRACE uTP{send=26222 recv=26221}: utp_rs::conn: Received packet cid=26221 packetType=ST_DATA seqNr=27273 ackNr=26094 timestamp=1819806880 timestampDiff=10954 remoteWindow=1048576
2025-03-23T22:58:19.512514Z TRACE uTP{send=26222 recv=26221}: utp_rs::conn: Received packet cid=26221 packetType=ST_DATA seqNr=27274 ackNr=26094 timestamp=1819806893 timestampDiff=10954 remoteWindow=1048576
2025-03-23T22:58:19.529930Z TRACE uTP{send=26222 recv=26221}: utp_rs::conn: Received packet cid=26221 packetType=ST_DATA seqNr=27276 ackNr=26094 timestamp=1819824004 timestampDiff=9417 remoteWindow=1048576
2025-03-23T22:58:19.530015Z TRACE uTP{send=26222 recv=26221}: utp_rs::conn: Received packet cid=26221 packetType=ST_DATA seqNr=27278 ackNr=26094 timestamp=1819824004 timestampDiff=9417 remoteWindow=1048576
2025-03-23T22:58:19.530052Z TRACE uTP{send=26222 recv=26221}: utp_rs::conn: Received packet cid=26221 packetType=ST_DATA seqNr=27277 ackNr=26094 timestamp=1819824004 timestampDiff=9417 remoteWindow=1048576
2025-03-23T22:58:19.530071Z TRACE uTP{send=26222 recv=26221}: utp_rs::conn: Received packet cid=26221 packetType=ST_FIN seqNr=27280 ackNr=26094 timestamp=1819824050 timestampDiff=0 remoteWindow=1048576
2025-03-23T22:58:19.530083Z DEBUG uTP{send=26222 recv=26221}: utp_rs::conn: received FIN seq=27280
2025-03-23T22:58:19.530102Z TRACE uTP{send=26222 recv=26221}: utp_rs::conn: Received packet cid=26221 packetType=ST_DATA seqNr=27279 ackNr=26094 timestamp=1819824004 timestampDiff=9417 remoteWindow=1048576
2025-03-23T22:58:19.530141Z DEBUG uTP{send=26222 recv=26221}: utp_rs::conn: uTP conn closing... err=None
2025-03-23T22:58:19.530243Z  WARN portalnet::utp::controller: Read 5102 bytes from uTP stream
2025-03-23T22:58:19.530299Z DEBUG utp_rs::socket: uTP conn shutdown cid.send=26222 cid.recv=26221

This seems to validate the diagnosis, and basic shape of the solution. I would still like to add a test that reproduces the failure, to prevent regressions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants