Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Buggy network conditions cause permanent TCP connection exhaustion #13823

Open
1 task done
ryao opened this issue Oct 4, 2024 · 1 comment
Open
1 task done
Labels
Arch: arm Issues related to ARM (32-bit) architecture Area: Networking Effects networking subsystem

Comments

@ryao
Copy link

ryao commented Oct 4, 2024

Description / Steps to reproduce the issue

We have a simple application running on nuttx on the RP2040 that allows network access to a serial port and were doing stress testing of it. The NIC of the workstation that was used for stress testing has some kind of issue that causes connections to fail periodically. The motherboard was already replaced once without solving it, but that is offtopic. Anyway, after a moment of failures, NuttX got into a strange state where it would respond to pings, but attempts at connecting to listening TCP sockets would fail. Additionally, ifconfig prints no output.

I attached OpenOCD and started debugging with gdb, and found that tcp_alloc() is returning 0x0, which causes the TCP packets to be dropped:

(gdb) bt
#0  tcp_alloc (domain=domain@entry=2 '\002') at tcp/tcp_conn.c:693
#1  0x10013f1c in tcp_alloc_accept (dev=dev@entry=0x20001710 <g_encx24j600+356>, tcp=0x0, tcp@entry=0x1003d210 <__stack_chk_guard>, listener=listener@entry=0x200040bc <g_tcp_connections+1240>) at tcp/tcp_conn.c:1083
#2  0x10016c3c in tcp_input (domain=2 '\002', iplen=<optimized out>, dev=0x20001710 <g_encx24j600+356>) at tcp/tcp_input.c:815
#3  tcp_ipv4_input (dev=dev@entry=0x20001710 <g_encx24j600+356>) at tcp/tcp_input.c:1761
#4  0x10015b6e in ipv4_in (dev=0x20001710 <g_encx24j600+356>) at devif/ipv4_input.c:401
#5  0x1001645c in netdev_input (dev=0x20001710 <g_encx24j600+356>, callback=0x20007188, callback@entry=0x10015acd <ipv4_in>, reply=reply@entry=true) at netdev/netdev_input.c:90
#6  0x10015bd2 in ipv4_input (dev=dev@entry=0x20001710 <g_encx24j600+356>) at devif/ipv4_input.c:510
#7  0x10006e0e in enc_rxdispatch (priv=0x200015ac <g_encx24j600>) at net/encx24j600.c:1437
#8  enc_pktif (priv=0x200015ac <g_encx24j600>) at net/encx24j600.c:1636
#9  enc_irqworker (arg=0x200015ac <g_encx24j600>) at net/encx24j600.c:1835
#10 0x1000312e in work_thread (argc=<optimized out>, argv=<optimized out>) at wqueue/kwork_thread.c:186
#11 0x10003f0c in nxtask_start () at task/task_start.c:107
#12 0x00000000 in ?? ()

Apparently, we ran out of tcp_conn_s connection structures:

(gdb) print g_free_tcp_connections                                                                                                                                                                                                                                            
$36 = {head = 0x0, tail = 0x0}

I decided to look at the states of the TCP connections and found 5 are in TCP_CLOSED and 3 are in TCP_ALLOCATED:

(gdb) set $n = sizeof(g_tcp_connections) / sizeof(g_tcp_connections[0])                                                                                                                                                                                                       
(gdb) set $i = 0
(gdb) while ($i < $n)                                                                                                                                                                                                                                                         
 >p g_tcp_connections[$i].tcpstateflags                                                                                                                                                                                                                                       
 >set $i = $i + 1                                                                                                                                                                                                                                                             
 >end                                                                                                                                                                                                                                                                         
$47 = 0 '\000'                                                                                                                                                                                                                                                                
$48 = 0 '\000'                                                                                                                                                                                                                                                                
$49 = 0 '\000'                                                                                                                                                                                                                                                                
$50 = 0 '\000'                                                                                                                                                                                                                                                                
$51 = 1 '\001'                                                                                                                                                                                                                                                                
$52 = 1 '\001'                                                                                                                                                                                                                                                                
$53 = 1 '\001'                                                                                                                                                                                                                                                                
$54 = 0 '\000

We did not build with CONFIG_NET_SOLINGER (or NET_TCP_WRITE_BUFFERS/NET_UDP_WRITE_BUFFERS for that matter), so I wondered why the code for recycling TCP connections did not do anything. Apparently, all of the structures are marked as having references:

(gdb) while ($i < $n)                                                                                                                                                                                                                                                         
 >p g_tcp_connections[$i].crefs                                                                                                                                                                                                                                               
 >set $i = $i + 1                                                                                                                                                                                                                                                             
 >end                                                                                                                                                                                                                                                                         
$55 = 1 '\001'                                                                                                                                                                                                                                                                
$56 = 1 '\001'                                                                                                                                                                                                                                                                
$57 = 1 '\001'                                                                                                                                                                                                                                                                
$58 = 1 '\001'                                                                                                                                                                                                                                                                
$59 = 1 '\001'                                                                                                                                                                                                                                                                
$60 = 1 '\001'                                                                                                                                                                                                                                                                
$61 = 1 '\001'                                                                                                                                                                                                                                                                
$62 = 1 '\001'

We have three daemons running that have open sockets. One is telnetd and ps shows no open telnet sessions. The other two are a really simple web server that accepts a connection and returns either a webpage or a 404 depending on the request, only to close the connection afterward. The final one is the serial bridge, which only ever maintains 1 open connection and will close it if a new connection occurs. I do not understand how we got into this state.

I have not yet confirmed that the issue is producible on either the current master or the latest stable release, but I looked through the commits to net/ since our snapshot of master was taken and I do not see anything that would address this. Here is a copy of the build's .config:

config.txt

I have so far refrained from trying to reproduce it since I did not want to lose the ability to poke around the RP2040's memory to understand what is going wrong. Given that this was caused by flaky hardware at the client machine talking to nuttx over the network, I am not sure if I can reproduce the exact sequence that caused this, although I have a few ideas on how to produce similar conditions that I will try after filing this to give others a heads up that there is an issue in the TCP stack. Also, we are using the ENCX24J600 driver on the RP2040, which is not yet supported on master. I have patches for enabling that which I plan to upstream after I am sure that I did not make any mistakes on them.

On which OS does this issue occur?

[OS: Linux]

What is the version of your OS?

Ubuntu 20.04

NuttX Version

09bfaa7

Issue Architecture

[Arch: arm]

Issue Area

[Area: Networking]

Verification

  • I have verified before submitting the report.
@github-actions github-actions bot added Arch: arm Issues related to ARM (32-bit) architecture Area: Networking Effects networking subsystem labels Oct 4, 2024
@ryao
Copy link
Author

ryao commented Oct 4, 2024

I tried reproducing this with SYN floods and somehow managed to reproduce it, but it was not a 100% reproducible thing because when I finally managed to do it, 1 connection entered FIN_WAIT_1 while the others were in the states that I had observed previously. It also wasn't a simple matter of just sending N SYN packets. No amount of sending SYN packets alone triggered this. It was a mixed workload where a colleague tried talking to the serial line over the network after I had failed and it entered that state (although not entirely as 1 connection was in FIN_WAIT_1).

Unfortunately, forces beyond my control mandate that the development hardware running the beta software is sent into a production environment, so I will not be able to debug this any further until sometime next week when a new unit is assembled by a colleague. There is no way that I can work fast enough to solve this problem, so I wrote a hack. I removed the static keyword from g_free_tcp_connections and wrote a watchdog that I have being started by tcp_watchdog >/dev/log & in /etc/init.d/rc.sysinit:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <nuttx/net/net.h>
#include <nuttx/queue.h>
#include "arm_internal.h"
#include "nvic.h"

#define CHECK_INTERVAL 1

int check_tcp_connections(void) {
    // Access the global queue for free TCP connections
    extern dq_queue_t g_free_tcp_connections;

    // If g_free_tcp_connections is empty, it means no available connections
    if (dq_empty(&g_free_tcp_connections)) {
        return 0; // No available connections
    }

    return 1; // Connections available
}

void restart_machine(void) {
    // Reset the machine
    uint32_t regval = ((0x5fa << SYSCON_AIRCR_VECTKEY_SHIFT) | 
                       SYSCON_AIRCR_SYSRESETREQ);
    putreg32(regval, ARMV6M_SYSCON_AIRCR);

    // Ensure completion of memory accesses
    __asm volatile ("dsb");

    // Wait for the reset
    for (;;);
}

int main(int argc, char *argv[]) {
    while (1) {
        // Check TCP connections
        if (!check_tcp_connections()) {
            restart_machine();
        }

        // Sleep for the specified interval
        sleep(CHECK_INTERVAL);
    }

    return EXIT_SUCCESS;
}

I put CFLAGS += -I../../../nuttx/arch/arm/src/common -I../../../nuttx/arch/arm/src/armv6-m in the makefile so that the appropriate headers were accessible. I then used this python script to exhaust the TCP connection list to fool the watch dog daemon into thinking a reboot is needed and it rebooted:

import socket
import threading

def keep_connection_open(ip, port):
    try:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.connect((ip, port))
        print(f"Connected to {ip}:{port}")

        while True:
            pass

    except Exception as e:
        print(f"Connection to {ip}:{port} failed: {e}")
    finally:
        sock.close()

def main():
    ip = "10.158.8.136"
    port = 80
    threads = []

    # Create 8 threads for 8 connections
    for i in range(8):
        thread = threading.Thread(target=keep_connection_open, args=(ip, port))
        threads.append(thread)
        thread.start()
    
    # Wait for all threads to complete (they won't, but this keeps the main thread alive)
    for thread in threads:
        thread.join()

if __name__ == "__main__":
    main()

This is a bandaid (and I know I am restarting nuttx in a non-portable way, but to get something working quickly, I copy and pasted code from an OTA update mechanism that I have under development). Consider the source code I posted to be licensed under your choice of OSI-approved license. Any OSI-approved license is acceptable to me, even CC-0.

When we have fixed the problem, the development hardware running this will be replaced with production hardware, so thankfully, things will not stay like this in production indefinitely. fingers crossed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Arch: arm Issues related to ARM (32-bit) architecture Area: Networking Effects networking subsystem
Projects
None yet
Development

No branches or pull requests

1 participant