[BUG] Buggy network conditions cause permanent TCP connection exhaustion #13823
Comments
I tried reproducing this with SYN floods and eventually managed to, but not reliably: when it finally happened, 1 connection had entered FIN_WAIT_1 while the others were in the states that I had observed previously. It also was not a simple matter of sending N SYN packets; no amount of SYN packets alone triggered this. It took a mixed workload: after my own attempts had failed, a colleague tried talking to the serial line over the network, and the stack entered that state (although not entirely, as 1 connection was in FIN_WAIT_1).

Unfortunately, forces beyond my control mandate that the development hardware running the beta software is sent into a production environment, so I will not be able to debug this any further until sometime next week, when a colleague assembles a new unit. There is no way that I can work fast enough to solve this problem, so I wrote a hack. I removed the static keyword from
I put
This is a band-aid (and I know I am restarting NuttX in a non-portable way, but to get something working quickly, I copied and pasted code from an OTA update mechanism that I have under development). Consider the source code I posted to be licensed under your choice of OSI-approved license. Any OSI-approved license is acceptable to me, even CC-0. When we have fixed the problem, the development hardware running this will be replaced with production hardware, so thankfully, things will not stay like this in production indefinitely. Fingers crossed.
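The general shape of such a stopgap (notice that the stack can no longer allocate connections, then restart) can be sketched as follows. This is an illustration only, not the code posted in this comment: `net_watchdog`, `net_watchdog_feed`, and `FAIL_THRESHOLD` are hypothetical names, and how the probe result is obtained and how the board is actually reset are left to the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical threshold: consecutive failed probes before giving up. */
#define FAIL_THRESHOLD 3

/* Hypothetical watchdog state; not a NuttX structure. */
struct net_watchdog
{
  unsigned int consecutive_failures;
};

/* Feed the result of one periodic probe (for example, "could a TCP
 * connection structure be allocated?").  Returns true once the probe has
 * failed FAIL_THRESHOLD times in a row, meaning the caller should restart
 * the system.  A successful probe resets the counter.
 */
bool net_watchdog_feed(struct net_watchdog *wd, bool probe_ok)
{
  assert(wd != NULL);

  if (probe_ok)
    {
      wd->consecutive_failures = 0;
      return false;
    }

  wd->consecutive_failures++;
  return wd->consecutive_failures >= FAIL_THRESHOLD;
}
```

Requiring several consecutive failures avoids rebooting on a momentary shortage, such as a burst of connections that are about to be released normally.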
Description / Steps to reproduce the issue
We have a simple application running on NuttX on the RP2040 that allows network access to a serial port, and we were stress testing it. The NIC of the workstation used for stress testing has some kind of issue that causes connections to fail periodically. The motherboard was already replaced once without solving it, but that is off topic. Anyway, after a period of failures, NuttX got into a strange state where it would respond to pings, but attempts to connect to listening TCP sockets would fail. Additionally, ifconfig prints no output.

I attached OpenOCD and started debugging with gdb, and found that tcp_alloc() is returning 0x0, which causes the TCP packets to be dropped:
Apparently, we ran out of tcp_conn_s connection structures:
I decided to look at the states of the TCP connections and found 5 are in TCP_CLOSED and 3 are in TCP_ALLOCATED:
We did not build with CONFIG_NET_SOLINGER (or NET_TCP_WRITE_BUFFERS/NET_UDP_WRITE_BUFFERS for that matter), so I wondered why the code for recycling TCP connections did not do anything. Apparently, all of the structures are marked as having references:
We have three daemons running that have open sockets. The first is telnetd, and ps shows no open telnet sessions. The second is a really simple web server that accepts a connection, returns either a webpage or a 404 depending on the request, and then closes the connection. The third is the serial bridge, which only ever maintains 1 open connection and will close it if a new connection arrives. I do not understand how we got into this state.
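To make the failure mode concrete, here is a minimal, self-contained model of a fixed connection pool in which a closed entry can only be recycled once nothing references it. It is a sketch, not NuttX code: the `model_*` names, the simplified states, and the `refs` field are all hypothetical, and the pool size of 8 is chosen only because 5 + 3 connections were counted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define POOL_SIZE 8 /* 5 + 3 entries were counted in the report */

/* Simplified stand-ins for connection states; not the NuttX definitions. */
enum model_state
{
  MODEL_FREE = 0,
  MODEL_ALLOCATED,
  MODEL_CLOSED
};

struct model_conn
{
  enum model_state state;
  uint8_t refs; /* Number of sockets still referencing this entry */
};

static struct model_conn g_pool[POOL_SIZE];

/* Put every entry back into the free state. */
void model_reset(void)
{
  memset(g_pool, 0, sizeof(g_pool));
}

/* Hand out a free entry, or recycle a closed entry that nobody references.
 * Returns NULL when every entry is either in use or still referenced.
 */
struct model_conn *model_alloc(void)
{
  for (size_t i = 0; i < POOL_SIZE; i++)
    {
      struct model_conn *conn = &g_pool[i];

      if (conn->state == MODEL_FREE ||
          (conn->state == MODEL_CLOSED && conn->refs == 0))
        {
          conn->state = MODEL_ALLOCATED;
          conn->refs = 1;
          return conn;
        }
    }

  return NULL;
}

/* Well-behaved close: drop the reference and mark the entry closed. */
void model_close(struct model_conn *conn)
{
  assert(conn != NULL && conn->refs > 0);
  conn->refs--;
  conn->state = MODEL_CLOSED;
}

/* Faulty close: the entry is marked closed but the reference is leaked. */
void model_close_leaky(struct model_conn *conn)
{
  assert(conn != NULL);
  conn->state = MODEL_CLOSED; /* refs intentionally left untouched */
}

/* Count entries that are closed yet still referenced, so never reusable. */
size_t model_count_stuck(void)
{
  size_t stuck = 0;

  for (size_t i = 0; i < POOL_SIZE; i++)
    {
      if (g_pool[i].state == MODEL_CLOSED && g_pool[i].refs != 0)
        {
          stuck++;
        }
    }

  return stuck;
}
```

In this model, calling `model_close_leaky()` on `POOL_SIZE` connections in a row leaves every entry closed but referenced, after which `model_alloc()` returns NULL indefinitely. That matches the observed symptom; whether a leaked reference is the actual cause in NuttX is not established here.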
I have not yet confirmed that the issue is reproducible on either the current master or the latest stable release, but I looked through the commits to net/ since our snapshot of master was taken and I do not see anything that would address this. Here is a copy of the build's .config:
config.txt
I have so far refrained from trying to reproduce it, since I did not want to lose the ability to poke around the RP2040's memory to understand what is going wrong. Given that this was caused by flaky hardware at the client machine talking to NuttX over the network, I am not sure that I can reproduce the exact sequence that caused this, although I have a few ideas on how to produce similar conditions. I will try them after filing this, which is meant to give others a heads-up that there is an issue in the TCP stack. Also, we are using the ENCX24J600 driver on the RP2040, which is not yet supported on master. I have patches for enabling that, which I plan to upstream once I am sure that I did not make any mistakes in them.
On which OS does this issue occur?
[OS: Linux]
What is the version of your OS?
Ubuntu 20.04
NuttX Version
09bfaa7
Issue Architecture
[Arch: arm]
Issue Area
[Area: Networking]
Verification