device: add BenchmarkAllowedIPsInsertRemove #36
Conversation
To show that RemoveByPeer is slow. Currently:

(pprof) top
Showing nodes accounting for 2.99s, 96.14% of 3.11s total
Dropped 35 nodes (cum <= 0.02s)
Showing top 10 nodes out of 36
      flat  flat%   sum%        cum   cum%
     2.72s 87.46% 87.46%      2.72s 87.46%  golang.zx2c4.com/wireguard/device.(*trieEntry).removeByPeer
     0.10s  3.22% 90.68%      0.10s  3.22%  runtime.memclrNoHeapPointers
     0.05s  1.61% 92.28%      0.06s  1.93%  runtime.scanobject
     0.03s  0.96% 93.25%      0.05s  1.61%  runtime.casgstatus
     0.02s  0.64% 93.89%      0.02s  0.64%  runtime.(*gcBitsArena).tryAlloc (inline)
     0.02s  0.64% 94.53%      0.02s  0.64%  runtime.heapBitsSetType
     0.02s  0.64% 95.18%      0.04s  1.29%  runtime.sweepone
     0.01s  0.32% 95.50%      0.02s  0.64%  golang.zx2c4.com/wireguard/device.commonBits
     0.01s  0.32% 95.82%      0.03s  0.96%  runtime.(*mheap).allocSpan
     0.01s  0.32% 96.14%      0.24s  7.72%  runtime.mallocgc

Signed-off-by: Brad Fitzpatrick <[email protected]>
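(For orientation, a rough sketch of the shape of this benchmark; it is not the PR's actual diff, and the Insert signature follows the current wireguard-go device API, which postdates this PR:)

```go
package device

import (
	"net/netip"
	"testing"
)

// Sketch: build a large AllowedIPs table with one /32 route per peer, then
// measure RemoveByPeer, which walks the whole trie on every call.
func BenchmarkAllowedIPsRemoveByPeerSketch(b *testing.B) {
	var a AllowedIPs
	const num = 1 << 15
	peers := make([]*Peer, num)
	for i := range peers {
		peers[i] = new(Peer)
		addr := netip.AddrFrom4([4]byte{byte(i >> 24), byte(i >> 16), byte(i >> 8), byte(i)})
		a.Insert(netip.PrefixFrom(addr, 32), peers[i])
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		a.RemoveByPeer(peers[i%num]) // the call that dominates the profile above
	}
}
```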
Same issue in the kernel code. That's a hard traversal to speed up without increasing the size of each node beyond a cacheline and therefore making lookups slow. Any suggestions?
At least in our case (and perhaps with others?), the overwhelming majority of routes are complete IPv4 or IPv6 addresses (cidr /32 or /128). I was planning on adding a Go map alongside the trie and using both: map for complete addresses and trie for prefixes. That does mean some lookups (for non-complete addresses) need to consult both. I'm fine with that if it means reducing the removeByPeer cost, which is eating 40% of our CPU on our big shared test node accessible to all users.
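(A rough sketch of that idea, with hypothetical names and the trie side stubbed out as a linear scan; this is not the wireguard-go API, just an illustration of the host-map fast path:)

```go
package hybrid

import "net/netip"

// hybridTable keeps exact host routes (/32 and /128) in a map for O(1)
// insert/remove, and everything else in a longest-prefix structure
// (stubbed out here as a map scan; the real thing would keep using the trie).
type hybridTable[V any] struct {
	hosts    map[netip.Addr]V
	prefixes map[netip.Prefix]V
}

func newHybridTable[V any]() *hybridTable[V] {
	return &hybridTable[V]{
		hosts:    make(map[netip.Addr]V),
		prefixes: make(map[netip.Prefix]V),
	}
}

func (t *hybridTable[V]) Insert(p netip.Prefix, v V) {
	if p.IsSingleIP() {
		t.hosts[p.Addr()] = v // the overwhelmingly common case
		return
	}
	t.prefixes[p] = v
}

// Lookup consults the host map first; a host entry is always the longest
// possible match, so only misses need the (slower) prefix walk.
func (t *hybridTable[V]) Lookup(a netip.Addr) (V, bool) {
	if v, ok := t.hosts[a]; ok {
		return v, true
	}
	bestBits := -1
	var bestV V
	for p, v := range t.prefixes {
		if p.Contains(a) && p.Bits() > bestBits {
			bestBits, bestV = p.Bits(), v
		}
	}
	return bestV, bestBits >= 0
}
```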
Instead of trying to add special cases -- whose complexity I wouldn't be so happy about having here -- what about implementing better/faster algorithms for the general case? Specifically, check out https://github.com/openbsd/src/blob/master/sys/net/art.c and https://github.com/openbsd/src/blob/master/sys/net/art.h . I would very very gladly take an implementation of this directly into wireguard-go (and would prefer it there instead of in a separate repo).
Oh, nice, I hadn't seen that. PDF from the comments there: http://www.hariguchi.org/art/art.pdf
Right. Basically it sounds like what happened is that somebody submitted a paper for a new routing table data structure, Knuth reviewed it, and during the review thought of something better. And that's ART. LC-Tries are also pretty fast, but not very fun to implement, and ART may well outperform them. Weidong Wu has a great book called "Packet Forwarding Technologies" that compares a lot of these different structures, but the latest edition I've found is from 2007, which unfortunately doesn't cover ART. However, the combination of versatility, code compactness, and simplicity makes me prefer ART over the other ones I've implemented in toys.
Benchmark LGTM
(The ART data structure is nice.)
a.RemoveByPeer(peers[(i+num/2)%num])
}

// Finally, some stats & validity checks.
This work at the end is getting added to your total benchmark time and making your numbers fuzzier. Would calling b.StopTimer() just before this work help?
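(For reference, a minimal sketch of that pattern using testing.B's timer controls; the names below are illustrative, not from this PR:)

```go
package device_test

import "testing"

// Sketch: stop the benchmark timer before any post-loop verification so that
// work does not inflate the reported ns/op.
func BenchmarkWithUntimedChecksSketch(b *testing.B) {
	sum := 0
	for i := 0; i < b.N; i++ {
		sum += i // stand-in for the measured insert/remove work
	}
	b.StopTimer() // everything after this point is excluded from the timing
	if sum < 0 {  // stand-in for the "stats & validity checks"
		b.Fatal("unexpected result")
	}
}
```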
rand.Seed(1)
rand.Shuffle(num, func(i, j int) { ips[i], ips[j] = ips[j], ips[i] })

// Then repeatedly add one and remove one that was insert 32k inserts back.
s/insert /inserted /
I think there are now https://pkg.go.dev/tailscale.com/net/art (which seems to not be used by anyone publicly) and https://pkg.go.dev/github.com/gaissmai/bart, which outperforms it.
https://pkg.go.dev/tailscale.com/net/art is a straight implementation of ART from the paper mentioned above. It works well, but even with the optimizations mentioned in the paper, it has a fairly large memory footprint. bart implements an additional optimization that reduces art's memory footprint substantially, which on modern systems constrained by memory bandwidth is a big win. It also goes harder in the implementation on trading readability for performance, e.g. with the use of CPU intrinsics, manual loop unrolling, and precomputed lookup tables. I don't say that as a negative, to be clear; I think bart's author did a very good job of balancing the two considerations.

There's more performance available in tailscale.com/net/art by doing similar low-level optimizations to what bart did... but bart came along before I got around to it, and bart is just a better algorithm with its novel storage layout. Tailscale switched to bart back when it was still a bit slower than art, because art's memory footprint was prohibitive on mobile and embedded targets for anything but trivial route tables. Bart worked everywhere, and the memory savings were worth a small performance hit vs. art. And then bart just got faster and that became moot anyway :)

So yeah, tl;dr: there should be no reason to use tailscale.com/net/art over github.com/gaissmai/bart at this point. Bart is both a better algorithm and has had more performance work put into it on top of that.
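(A quick, hedged illustration of using bart, based on the package documentation at the time of writing; the API may differ across versions, so check the current docs:)

```go
package main

import (
	"fmt"
	"net/netip"

	"github.com/gaissmai/bart"
)

func main() {
	// Table is a longest-prefix-match routing table keyed by netip.Prefix.
	tbl := new(bart.Table[string])
	tbl.Insert(netip.MustParsePrefix("10.0.0.0/8"), "peer-a")
	tbl.Insert(netip.MustParsePrefix("10.1.0.0/16"), "peer-b")

	// Lookup returns the value of the most specific matching prefix.
	if val, ok := tbl.Lookup(netip.MustParseAddr("10.1.2.3")); ok {
		fmt.Println("matched:", val) // "peer-b": the /16 wins over the /8
	}
}
```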
/cc @zx2c4 @crawshaw @danderson