device: add BenchmarkAllowedIPsInsertRemove #36
Conversation
To show that RemoveByPeer is slow. Currently:

(pprof) top
Showing nodes accounting for 2.99s, 96.14% of 3.11s total
Dropped 35 nodes (cum <= 0.02s)
Showing top 10 nodes out of 36
      flat  flat%   sum%        cum   cum%
     2.72s 87.46% 87.46%      2.72s 87.46%  golang.zx2c4.com/wireguard/device.(*trieEntry).removeByPeer
     0.10s  3.22% 90.68%      0.10s  3.22%  runtime.memclrNoHeapPointers
     0.05s  1.61% 92.28%      0.06s  1.93%  runtime.scanobject
     0.03s  0.96% 93.25%      0.05s  1.61%  runtime.casgstatus
     0.02s  0.64% 93.89%      0.02s  0.64%  runtime.(*gcBitsArena).tryAlloc (inline)
     0.02s  0.64% 94.53%      0.02s  0.64%  runtime.heapBitsSetType
     0.02s  0.64% 95.18%      0.04s  1.29%  runtime.sweepone
     0.01s  0.32% 95.50%      0.02s  0.64%  golang.zx2c4.com/wireguard/device.commonBits
     0.01s  0.32% 95.82%      0.03s  0.96%  runtime.(*mheap).allocSpan
     0.01s  0.32% 96.14%      0.24s  7.72%  runtime.mallocgc

Signed-off-by: Brad Fitzpatrick <[email protected]>
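(For orientation, a rough sketch of the shape of this benchmark; it is not the PR's actual diff, and the Insert signature follows the current wireguard-go device API, which postdates this PR:)

```go
package device

import (
	"net/netip"
	"testing"
)

// Sketch: build a large AllowedIPs table with one /32 route per peer, then
// measure RemoveByPeer, which walks the whole trie on every call.
func BenchmarkAllowedIPsRemoveByPeerSketch(b *testing.B) {
	var a AllowedIPs
	const num = 1 << 15
	peers := make([]*Peer, num)
	for i := range peers {
		peers[i] = new(Peer)
		addr := netip.AddrFrom4([4]byte{byte(i >> 24), byte(i >> 16), byte(i >> 8), byte(i)})
		a.Insert(netip.PrefixFrom(addr, 32), peers[i])
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		a.RemoveByPeer(peers[i%num]) // the call that dominates the profile above
	}
}
```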
Same issue in the kernel code. That's a hard traversal to speed up without increasing the size of each node beyond a cacheline and therefore making lookups slow. Any suggestions?
At least in our case (and perhaps with others?), the overwhelming majority of routes are complete IPv4 or IPv6 addresses (cidr /32 or /128). I was planning on adding a Go map alongside the trie and using both: map for complete addresses and trie for prefixes. That does mean some lookups (for non-complete addresses) need to consult both. I'm fine with that if it means reducing the removeByPeer cost, which is eating 40% of our CPU on our big shared test node accessible to all users.
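(A rough sketch of that idea, with hypothetical names and the trie side stubbed out as a linear scan; this is not the wireguard-go API, just an illustration of the host-map fast path:)

```go
package hybrid

import "net/netip"

// hybridTable keeps exact host routes (/32 and /128) in a map for O(1)
// insert/remove, and everything else in a longest-prefix structure
// (stubbed out here as a map scan; the real thing would keep using the trie).
type hybridTable[V any] struct {
	hosts    map[netip.Addr]V
	prefixes map[netip.Prefix]V
}

func newHybridTable[V any]() *hybridTable[V] {
	return &hybridTable[V]{
		hosts:    make(map[netip.Addr]V),
		prefixes: make(map[netip.Prefix]V),
	}
}

func (t *hybridTable[V]) Insert(p netip.Prefix, v V) {
	if p.IsSingleIP() {
		t.hosts[p.Addr()] = v // the overwhelmingly common case
		return
	}
	t.prefixes[p] = v
}

// Lookup consults the host map first; a host entry is always the longest
// possible match, so only misses need the (slower) prefix walk.
func (t *hybridTable[V]) Lookup(a netip.Addr) (V, bool) {
	if v, ok := t.hosts[a]; ok {
		return v, true
	}
	bestBits := -1
	var bestV V
	for p, v := range t.prefixes {
		if p.Contains(a) && p.Bits() > bestBits {
			bestBits, bestV = p.Bits(), v
		}
	}
	return bestV, bestBits >= 0
}
```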
Instead of trying to add special cases -- whose complexity I wouldn't be so happy about having here -- what about implementing better/faster algorithms for the general case? Specifically, check out https://github.com/openbsd/src/blob/master/sys/net/art.c and https://github.com/openbsd/src/blob/master/sys/net/art.h . I would very very gladly take an implementation of this directly into wireguard-go (and would prefer it there instead of in a separate repo).
Oh, nice, I hadn't seen that. PDF from the comments there: http://www.hariguchi.org/art/art.pdf
Right. Basically it sounds like what happened is that somebody submitted a paper for a new routing table data structure, Knuth reviewed it, and during the review thought of something better. And that's ART. LC-Tries are also pretty fast, but not very fun to implement, and ART may well outperform them. Weidong Wu has a great book called "Packet Forwarding Technologies" that compares a lot of these different structures, but the latest edition I've found is from 2007, which unfortunately doesn't cover ART. However, the combination of versatility, code compactness, and simplicity makes me prefer ART over the other ones I've implemented in toys.
Benchmark LGTM
(The ART data structure is nice.)
a.RemoveByPeer(peers[(i+num/2)%num])
}

// Finally, some stats & validity checks.
This work at the end is getting added to your total benchmark time and making your numbers fuzzier. Would calling b.StopTimer() just before this work help?
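(For reference, a minimal sketch of that pattern using testing.B's timer controls; the names below are illustrative, not from this PR:)

```go
package device_test

import "testing"

// Sketch: stop the benchmark timer before any post-loop verification so that
// work does not inflate the reported ns/op.
func BenchmarkWithUntimedChecksSketch(b *testing.B) {
	sum := 0
	for i := 0; i < b.N; i++ {
		sum += i // stand-in for the measured insert/remove work
	}
	b.StopTimer() // everything after this point is excluded from the timing
	if sum < 0 {  // stand-in for the "stats & validity checks"
		b.Fatal("unexpected result")
	}
}
```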
rand.Seed(1)
rand.Shuffle(num, func(i, j int) { ips[i], ips[j] = ips[j], ips[i] })

// Then repeatedly add one and remove one that was insert 32k inserts back.
s/insert /inserted /
I think there are now https://pkg.go.dev/tailscale.com/net/art (which seems to not be used by anyone publicly) and https://pkg.go.dev/github.com/gaissmai/bart, which outperforms it.
https://pkg.go.dev/tailscale.com/net/art is a straight implementation of ART from the paper mentioned above. It works well, but even with the optimizations mentioned in the paper, it has a fairly large memory footprint. bart implements an additional optimization that reduces art's memory footprint substantially, which on modern systems constrained by memory bandwidth is a big win. It also goes harder in the implementation on trading readability for performance, e.g. with the use of CPU intrinsics, manual loop unrolling, and precomputed lookup tables. I don't say that as a negative, to be clear; I think bart's author did a very good job of balancing the two considerations.

There's more performance available in tailscale.com/net/art by doing similar low-level optimizations to what bart did... but bart came along before I got around to it, and bart is just a better algorithm with its novel storage layout. Tailscale switched to bart back when it was still a bit slower than art, because art's memory footprint was prohibitive on mobile and embedded targets for anything but trivial route tables. Bart worked everywhere, and the memory savings were worth a small performance hit vs. art. And then bart just got faster and that became moot anyway :)

So yeah, tl;dr: there should be no reason to use tailscale.com/net/art over github.com/gaissmai/bart at this point. Bart is both a better algorithm and has had more performance work put into it on top of that.
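(A quick, hedged illustration of using bart, based on the package documentation at the time of writing; the API may differ across versions, so check the current docs:)

```go
package main

import (
	"fmt"
	"net/netip"

	"github.com/gaissmai/bart"
)

func main() {
	// Table is a longest-prefix-match routing table keyed by netip.Prefix.
	tbl := new(bart.Table[string])
	tbl.Insert(netip.MustParsePrefix("10.0.0.0/8"), "peer-a")
	tbl.Insert(netip.MustParsePrefix("10.1.0.0/16"), "peer-b")

	// Lookup returns the value of the most specific matching prefix.
	if val, ok := tbl.Lookup(netip.MustParseAddr("10.1.2.3")); ok {
		fmt.Println("matched:", val) // "peer-b": the /16 wins over the /8
	}
}
```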
/cc @zx2c4 @crawshaw @danderson