99% CPU utilization after 8 Hours #2103

Open
Blitzritze opened this issue Oct 31, 2024 · 12 comments

@Blitzritze

Versions

Core
Version is v5.18.3-550-g112b961 (Latest: null)
Branch is development
Hash is 112b9617 (Latest: 112b9617)
Web
Version is v5.21-1012-gc689fdc7 (Latest: null)
Branch is development
Hash is c689fdc7 (Latest: c689fdc7)
FTL
Version is vDev-3dd756d (Latest: null)
Branch is development
Hash is 3dd756d (Latest: 3dd756d)

Platform

  • OS and version: Debian 12
  • Platform: Proxmox, unprivileged container with 1 vCore, 1 GiB RAM, 5 GiB disk
  • Hardware: MB/CPU: ASRock N100M, RAM: Team Electronic Elite 32GB DDR4 3200

Expected behavior

Normal DNS blocking operation, without CPU usage climbing toward 100%

Actual behavior / bug

Every 8 hours, Proxmox shows 99% CPU utilization for the Pi-hole container. Inside the container, htop shows 400% CPU utilization, and it comes from FTL.

Steps to reproduce

Wait 8 hours and see

Debug Token

Screenshots

Proxmox:
Image

Pihole Container:
Image

@DL6ER
Member

DL6ER commented Oct 31, 2024

this comes from FTL

It'd be very helpful if you could enable tree view in htop and also enable custom thread names, like this:
Image
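For reference (menu names may differ slightly between htop versions), these options usually live under:

htop → F2 (Setup) → Display options → [x] Tree view
                                      [x] Show custom thread names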

Thereafter, you will be able to see which part of Pi-hole is consuming the CPU, which will help us a lot in finding the root cause:

Image

DL6ER transferred this issue from pi-hole/pi-hole on Oct 31, 2024
@Blitzritze
Author

I changed the settings in htop.
Strangely enough, nothing happened last night; I'll keep an eye on it...
I installed Pi-hole v6 the day before yesterday and it has happened several times since then.

@Blitzritze
Author

It happened again.
I ran pihole -d while the problem was present.
URL: https://tricorder.pi-hole.net/as41yWRB/
Image

@DL6ER
Member

DL6ER commented Nov 3, 2024

Oh, this is interesting. So 400% CPU is consumed in civetweb-worker, which is one of the worker threads of the integrated web API.

My next suggestion would be trying to identify which API endpoint is causing this. For this, please run

sudo pihole-FTL --config debug.api true
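Assuming the same config syntax works in reverse, you can later switch this extra logging off again with

sudo pihole-FTL --config debug.api false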

Once it happens again, take another look at htop and identify the PID of the thread causing this - it's 18524 in your example above:
Image

We can then search the FTL log file (with debug.api enabled) to identify which API request is taking so long by running

grep "T18524" /var/log/pihole/FTL.log | tail

You'll obviously have to replace the numeric part of T18524 with the PID you identified above. What does this output show?
You can, of course, remove any IP addresses from the output before posting it here; just make it clear when you do so, e.g. ----removed----.
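As a possible shortcut (a sketch, assuming procps ps and pidof are available in the container), you can also list FTL's threads sorted by CPU usage from the shell and take the LWP of the busiest one for the grep above:

ps -L -o lwp,pcpu,comm -p "$(pidof pihole-FTL)" --sort=-pcpu | head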

@Blitzritze
Author

Blitzritze commented Nov 4, 2024

I have noticed that when the problem occurs, the web interface can no longer be accessed.

FTL Log:
Image

Edit:
It happened again. 10.0.1.1 is my Docker VM. On this VM I run Nginx Proxy Manager.
Image

Nginx settings: no custom locations or advanced settings.
Image
Image

@DL6ER
Member

DL6ER commented Nov 4, 2024

Sorry, I didn't see that you made a later edit adding more content. This is still (or even more) puzzling, as the logs themselves do not show anything being processed at the time of these massive CPU spikes, which makes me think this may be some strange ill interaction between FTL and your nginx proxy. Does the latter have any logs for us? Is there also notable CPU load on nginx when this is happening?

Also check the lines of /var/log/pihole/webserver.log around that time; it becomes an access log when API debug logging is enabled. This, together with the nginx logs, will hopefully shed some light on the matter. My current hypothesis is that there is some massive redirection going on here, keeping the proxy and FTL busy.

Edit: No need to enable any additional logging; debug.api already enables the access log, so you already have the log from the time when it happened.
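To pull out the relevant window from that access log, something along these lines should do (the timestamp pattern is only a placeholder; match it to the format your webserver.log actually uses):

grep "<timestamp around the spike>" /var/log/pihole/webserver.log | tail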

@Blitzritze
Author

Blitzritze commented Nov 7, 2024

I tried to gather all the information; I hope this is everything you need.
Today, 07.11.24 at 06:31, the error occurred.

Webserver log:
Image

Nginx proxy host 14 error log:
Image

Nginx proxy host 14 access log:
Image

10.0.0.22 -> Notebook
10.0.2.1 -> Pi-hole
10.5.0.1 -> Mobile phone
10.0.1.1 -> Docker (with nginx etc.)
10.10.0.4 -> Amazon Echo Dot

@DL6ER
Member

DL6ER commented Nov 7, 2024

Today, 07.11.24 at 06:31, the error occurred.

What timezone is this? The logs are in UTC but nothing extends to later than 05:03 UTC there.

@Blitzritze
Author

Today, 07.11.24 at 06:31, the error occurred.

What timezone is this? The logs are in UTC but nothing extends to later than 05:03 UTC there.

Sorry, it's "Timezone in Berlin (GMT+1), Central European Standard Time".

@DL6ER
Member

DL6ER commented Nov 9, 2024

I'm still somewhat confused - 06:31 Berlin should be 05:31 UTC, but neither of your logs shows anything after 05:22 UTC... How is your particular setup configured? Could you put a tcpdump process on the traffic between nginx and pihole-FTL on the HTTP port to see what is really happening? I wonder if the issue may be something stupid like an endless loop of null bytes being transferred, ultimately keeping the webserver busy the whole time doing nonsense.
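Something like the following should capture that leg of the traffic (a sketch; the any interface, port 80 and the nginx address 10.0.1.1 are assumptions based on your description):

sudo tcpdump -i any -w /tmp/ftl-nginx.pcap 'host 10.0.1.1 and tcp port 80'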

@Blitzritze
Author

I ran a tcpdump in both containers. When I specified port 80 or 443, nothing was recorded, so I recorded all the traffic from the respective “partners”, but not much came out of it.
nginx.dmp
pihole.dmp

@DL6ER
Member

DL6ER commented Nov 11, 2024

Okay, so let me collect what we already know:

  • There is an nginx proxy interacting with the FTL webserver; the FTL webserver is not contacted by any client directly
  • Both webserver threads are busy at 200% CPU each
  • The webserver is not busy processing anything in the FTL API or serving any file
  • There seems to be no traffic to or from the FTL or nginx servers related to this issue.

Currently, my best guess is that the external webserver we are embedding (civetweb) has a bug causing some endless loop somewhere.

I'd like you to attach your pihole-FTL process to the gdb debugger; see the instructions for v6 here while we are still in the beta. This will allow you to interrupt pihole-FTL when you notice the issue (using Ctrl + C) and then see what the two webserver threads are doing at that very moment.
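Attaching would typically look something like this (assuming gdb is installed inside the container):

sudo gdb -p "$(pidof pihole-FTL)"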

Once you have interrupted pihole-FTL, please enter info threads. It should look like:

^C
Thread 1 "pihole-FTL" received signal SIGINT, Interrupt.
0x000070ef9191b4cd in __GI___poll (fds=0x61d3cc680870, nfds=7, timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
29      in ../sysdeps/unix/sysv/linux/poll.c

(gdb) info threads <-----------------

to get a listing of the currently active threads.

The output will look like

  Id   Target Id                                           Frame 
* 1    Thread 0x70ef91a77740 (LWP 31708) "pihole-FTL"      0x000070ef9191b4cd in __GI___poll (fds=0x61d3cc680870, nfds=7, timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
  2    Thread 0x70ef874006c0 (LWP 31761) "civetweb-worker" 0x000070ef91898d61 in __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x70ef916150c8) at ./nptl/futex-internal.c:57
  3    Thread 0x70ef87e006c0 (LWP 31756) "civetweb-worker" 0x000070ef91898d61 in __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x70ef916150cc) at ./nptl/futex-internal.c:57
[...]

Look for the threads where you see the massive CPU spike; these would be IDs 2 and 3 in the example above.

You can then switch context from ID 1 (notice the * in the output above) using thread followed by the ID, like

(gdb) thread 2

and run backtrace to see where the code is currently busy:

(gdb) backtrace
#0  0x000070ef91898d61 in __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x70ef916150c8) at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=0, abstime=0x0, clockid=0, expected=0, futex_word=0x70ef916150c8) at ./nptl/futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x70ef916150c8, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, 
    private=private@entry=0) at ./nptl/futex-internal.c:139
#3  0x000070ef9189b7dd in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x70ef91615040, cond=0x70ef916150a0) at ./nptl/pthread_cond_wait.c:503
#4  ___pthread_cond_wait (cond=cond@entry=0x70ef916150a0, mutex=mutex@entry=0x70ef91615040) at ./nptl/pthread_cond_wait.c:627
#5  0x000061d3cb6c54db in consume_socket (thread_index=<optimized out>, counter_was_preincremented=<optimized out>, sp=<optimized out>, ctx=0x70ef91615010)
    at /workspace/FTL/src/webserver/civetweb/civetweb.c:20022
#6  worker_thread_run (conn=0x61d3d1456bb0) at /workspace/FTL/src/webserver/civetweb/civetweb.c:20170
#7  worker_thread (thread_func_param=0x61d3d1456bb0) at /workspace/FTL/src/webserver/civetweb/civetweb.c:20345
#8  0x000070ef9189ca94 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#9  0x000070ef91929c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

It is this backtrace output, or, even better, the entire output you see in the debugger, that we need in order to proceed here. Hopefully this will give us an idea of what is going on. You can leave the debugger using quit at the end.
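As a shortcut, gdb can also dump the backtraces of all threads in one go; that output would work just as well:

(gdb) thread apply all backtrace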
