Unbound process sporadically returns TOO MANY Servfail and Read/Write errors at different load levels. #1105

Open
maintain3r opened this issue Jul 12, 2024 · 4 comments


@maintain3r

maintain3r commented Jul 12, 2024

Unbound version installed: 1.13.1-1ubuntu5.5
unbound runs as a regular service (not as a Docker container)
no packet drops are detected on the unbound host
verbosity level is set to 5

The tool to test unbound: dnspyre
The command used to test unbound: dnspyre -c 100 -d 60s --max=20ms -s 172.31.28.217 https://raw.githubusercontent.com/Tantalor93/dnspyre/master/data/10000-domains

Interestingly, when I take the domain names that were failing and try to resolve them while the testing tool is not running, they resolve properly without an issue.

unbound.conf:
server:
verbosity: 5
statistics-cumulative: yes
extended-statistics: yes
num-threads: 4
interface: 0.0.0.0
port: 53
prefer-ip6: no
outgoing-range: 8192
outgoing-port-permit: 5354
so-rcvbuf: 8m
so-sndbuf: 8m
so-reuseport: yes
ip-transparent: no
ip-freebind: yes
max-udp-size: 4096
msg-cache-size: 256m
msg-cache-slabs: 8
num-queries-per-thread: 4096
rrset-cache-size: 640m
rrset-cache-slabs: 8
cache-min-ttl: 300
cache-max-ttl: 86400
cache-max-negative-ttl: 300
infra-host-ttl: 60
infra-cache-slabs: 8
infra-cache-numhosts: 100000
do-ip4: yes
do-ip6: no
do-udp: yes
do-tcp: yes
use-systemd: no
do-daemonize: no
access-control: 192.168.0.0/16 allow
access-control: 172.16.0.0/12 allow
access-control: 10.0.0.0/8 allow
access-control: 127.0.0.0/8 allow
username: "unbound"
directory: "/etc/unbound"
use-syslog: no
log-identity: "unbound"
log-time-ascii: yes
log-queries: no
log-replies: yes
log-tag-queryreply: yes
pidfile: "/var/run/unbound.pid"
root-hints: "/var/lib/unbound/root.hints"
hide-identity: yes
hide-version: yes
hide-trustanchor: yes
identity: ""
version: ""
harden-glue: yes
qname-minimisation: yes
use-caps-for-id: yes
do-not-query-localhost: no
prefetch: yes
deny-any: yes
rrset-roundrobin: yes
minimal-responses: yes
val-clean-additional: yes
serve-expired: yes
val-log-level: 2
key-cache-size: 10m
key-cache-slabs: 8
neg-cache-size: 1m
ratelimit: 0
ip-ratelimit: 0

remote-control:
control-enable: yes
control-use-cert: no
control-interface: 127.0.0.1
control-port: 8953
server-key-file: "/etc/unbound/unbound_server.key"
server-cert-file: "/etc/unbound/unbound_server.pem"
control-key-file: "/etc/unbound/unbound_control.key"
control-cert-file: "/etc/unbound/unbound_control.pem"

forward-zone:
name: "."
forward-first: yes
forward-addr: 169.254.169.253@53 # aws provided vpc dns server
forward-addr: 1.1.1.1@53
forward-addr: 8.8.8.8@53

Testing results
Total requests: 280881
Read/Write errors: 244061
DNS success responses: 34141
DNS negative responses: 1900
DNS error responses: 779

DNS response codes:
NOERROR: 35141
SERVFAIL: 779
NXDOMAIN: 900

DNS question types:
A: 280881

# Running dnspyre locally against 127.0.0.1 (unbound has a listener on this IP). Using 10 concurrent requests changed almost nothing; there are still too many errors.
root@ip-172-31-28-217:/etc/unbound# dnspyre -c 10 -d 60s --max=20ms -s 127.0.0.1 https://raw.githubusercontent.com/Tantalor93/dnspyre/master/data/10000-domains
Using 10000 hostnames
Benchmarking 127.0.0.1:53 via udp with 10 concurrent requests
Total requests: 12844
Read/Write errors: 1134
DNS success responses: 10610
DNS negative responses: 950
DNS error responses: 150

DNS response codes:
NOERROR: 10960
SERVFAIL: 150
NXDOMAIN: 600

DNS question types:
A: 12844

Unbound runs on Ubuntu 22.04.4 LTS
RAM: 4GB
CPU: 2 cores
AWS t3.medium instance type
Changing the instance type does not change much!
CPU usage is ~30-40%

@maintain3r maintain3r changed the title Unbound process sporadically returns Servfail at different load levels. Unbound process sporadically returns TOO MANY Servfail at different load levels. Jul 12, 2024
@maintain3r maintain3r changed the title Unbound process sporadically returns TOO MANY Servfail at different load levels. Unbound process sporadically returns TOO MANY Servfail and Read/Write errors at different load levels. Jul 12, 2024
@wcawijngaards
Member

The setting use-caps-for-id: yes could be the issue; try use-caps-for-id: no. When it falls back, that needs a lot of additional queries, and this option is not commonly used, so I think it causes load and possibly also failures.

With log-servfail: yes it would print out what the servfails are that happen. That would give a clue that points in the direction of the cause.

num-threads is set to 4, but the host has 2 CPU cores, so I would expect num-threads: 2 to be the correct choice. I would not expect that to cause the outcome, but it may be interesting to try.

The so-rcvbuf and so-sndbuf settings of 8m are large, and I wonder if the 4G host runs out of memory on the many requests that you cause it to queue up for recursion. If it is out of memory for the socket buffers, the recursor cannot make more socket buffers, and this causes failure, perhaps.
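
As a rough sketch of the first three suggestions above, the relevant lines of the server: section might be changed as follows. The values are taken from this comment; the rest of the posted unbound.conf is assumed to stay as it is, and whether this removes the failures is untested.

```
server:
    # Turn off 0x20 case randomisation to avoid its extra fallback queries.
    use-caps-for-id: no
    # Log a line with the reason for each SERVFAIL answer.
    log-servfail: yes
    # Match the thread count to the 2 CPU cores of the host.
    num-threads: 2
```

Changing one setting at a time and rerunning the same dnspyre command would make it easier to see which change, if any, affects the error counts.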

@maintain3r
Author

Thanks @wcawijngaards, I'm going to try your suggestions and will get back with the results.
For 'so-rcvbuf' and 'so-sndbuf', what value should I use, and how do I calculate a proper value? Should I create a bigger instance with more RAM?

@wcawijngaards
Member

I do not know a value calculation for them. Perhaps leave them at default. Or 64k for less buffer size but also less memory consumption, since the test involves opening thousands of sockets.
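
A minimal sketch of the smaller-buffer option mentioned above, assuming the rest of the posted configuration is unchanged. Removing both lines instead would leave the buffers at their defaults.

```
server:
    # Smaller socket buffers: less headroom for traffic spikes, but less memory use.
    so-rcvbuf: 64k
    so-sndbuf: 64k
```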

@maintain3r
Author

Taken from the official Unbound documentation page:
Set so-rcvbuf to a larger value (4m or 8m) for a busy server. This sets the kernel buffer larger so that no messages are lost in spikes in the traffic. Adds extra 9s to the reply-reliability percentage. The OS caps it at a maximum, on linux unbound needs root permission to bypass the limit, or the admin can use sysctl net.core.rmem_max. On BSD change kern.ipc.maxsockbuf in /etc/sysctl.conf.
