Use an arena-based BTree library #88
Conversation
Up to 1.18x faster (in cycles) on Sightglass benchmarks, with big wins in cache misses and in cache accesses generally; about 1.05x faster in instructions retired.
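For readers unfamiliar with the approach, here is a minimal sketch of the arena idea (hypothetical names, simplified to a binary node for brevity; the real arena-btree crate uses B-tree nodes and a richer API): nodes live in one contiguous arena and refer to each other by u32 index, so building and tearing down many trees avoids per-node malloc/free and keeps related nodes close together in memory.

```rust
struct Arena<T> {
    slots: Vec<T>,
}

impl<T> Arena<T> {
    fn new() -> Self {
        Arena { slots: Vec::new() }
    }

    // Allocation is just a push onto the backing Vec; the returned
    // u32 index is this node's identity.
    fn alloc(&mut self, value: T) -> u32 {
        let idx = self.slots.len() as u32;
        self.slots.push(value);
        idx
    }

    fn get(&self, idx: u32) -> &T {
        &self.slots[idx as usize]
    }
}

// A tree node holds child *indices* into the shared arena, not Box
// pointers, so many trees can share one backing allocation.
struct Node {
    key: u32,
    left: Option<u32>,
    right: Option<u32>,
}

fn main() {
    let mut arena = Arena::new();
    let leaf = arena.alloc(Node { key: 2, left: None, right: None });
    let root = arena.alloc(Node { key: 1, left: Some(leaf), right: None });
    assert_eq!(arena.get(arena.get(root).left.unwrap()).key, 2);
    // Dropping `arena` frees every node in one deallocation; no
    // per-node Drop glue runs.
}
```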
See #87 for more data on regalloc2's memory allocations. You posted this PR ~15 seconds after I opened the issue, great responsiveness!
LGTM, thanks a ton for doing this work! Just some nits below.
@@ -11,6 +11,7 @@ description = "Backtracking register allocator inspired from IonMonkey"
 repository = "https://github.com/bytecodealliance/regalloc2"

 [dependencies]
+arena-btree = { path = "../arena-btree" }
should refer to published crate here
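If/when the crate is published, the suggested change would look something like this in Cargo.toml (the version number is illustrative, not an actual release):

```toml
[dependencies]
# crates.io dependency instead of a local path (version illustrative)
arena-btree = "0.1"
```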
@@ -432,6 +447,17 @@ pub struct Env<'a, F: Function> {
     pub annotations_enabled: bool,
 }

+impl<'a, F: Function> Drop for Env<'a, F> {
+    fn drop(&mut self) {
I don't think we actually need this, given that we know our key and index types are Copy (u32s in fact)?
(Perhaps in trivial cases the empty drop impl means things optimize away, but I'd be a little skeptical of LLVM's ability to do that here in concert with the use of the drain iters etc)
+    }
+}

pub fn drop(self, arena: &mut Arena<LiveRangeKey, LiveRangeIndex>) {
Likewise here, not needed given knowledge of key/value types?
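To make the reviewer's point concrete: when keys and values are Copy, tree teardown runs no destructors, so an explicit arena-aware drop only recycles slots; skipping it wastes a slot until the arena itself is freed, but leaks no memory and breaks no safety invariant. A sketch under those assumptions (types simplified, names hypothetical):

```rust
// Simplified stand-in for the arena discussed above: slots plus a
// free list of recyclable slot indices.
struct Arena<K: Copy, V: Copy> {
    slots: Vec<(K, V)>,
    free: Vec<u32>,
}

impl<K: Copy, V: Copy> Arena<K, V> {
    // Arena-aware "drop": because K and V are Copy there are no
    // destructors to run, so all this does is make the slot available
    // for reuse. Forgetting to call it costs a slot, not memory --
    // the backing Vec still frees everything when the arena drops.
    fn release(&mut self, idx: u32) {
        self.free.push(idx);
    }
}

fn main() {
    let mut arena = Arena { slots: vec![(0u32, 0u32)], free: Vec::new() };
    arena.release(0);
    assert_eq!(arena.free, vec![0]);
}
```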
Spent some time doing some more benchmarking of this PR and #92.

Benchmark Methods
Overall Summary

Neither arena implementation performed better than main. The vec-based arena is a little slower than main. With the shuffling allocator -- which makes allocation slower, and therefore can magnify the cost of allocation-heavy code paths -- the overall picture was the same.

The Default Allocator

Summary:
main.so vs vec-arena.so
Summary:

main.so vs bumpalo-arena.so
Summary:

vec-arena.so vs bumpalo-arena.so
Summary:

Shuffling Allocator
Summary:

main-shuffling.so vs vec-arena-shuffling.so
Summary:

main-shuffling.so vs bumpalo-arena-shuffling.so
Summary:

vec-arena-shuffling.so vs bumpalo-arena-shuffling.so
Summary:
I dug some more into this with DHAT and I'm seeing that the lifetime of the bump chunks inside the arena spans essentially the whole allocation session -- blocks are rarely freed and made available for reuse. I verified this by logging free list hits (i.e. we reuse a previously allocated block) vs free list misses (i.e. we had to bump allocate a new block) in the arena free list. I got these numbers on one of the benchmarks:
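The hit/miss logging described above amounts to a pair of counters around the free-list check; a sketch of the idea (not the actual instrumentation):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Counters mirroring the logging described above: a "hit" reuses a
// block from the free list, a "miss" has to bump-allocate a new one.
static FREE_LIST_HITS: AtomicU64 = AtomicU64::new(0);
static FREE_LIST_MISSES: AtomicU64 = AtomicU64::new(0);

fn alloc_block(free_list: &mut Vec<u32>, next_fresh: &mut u32) -> u32 {
    if let Some(idx) = free_list.pop() {
        FREE_LIST_HITS.fetch_add(1, Ordering::Relaxed);
        idx
    } else {
        FREE_LIST_MISSES.fetch_add(1, Ordering::Relaxed);
        let idx = *next_fresh;
        *next_fresh += 1;
        idx
    }
}

fn main() {
    let (mut free, mut next) = (Vec::new(), 0);
    let a = alloc_block(&mut free, &mut next); // miss: fresh block
    free.push(a);                              // block released
    let _b = alloc_block(&mut free, &mut next); // hit: reuses it
    println!(
        "hits: {}, misses: {}",
        FREE_LIST_HITS.load(Ordering::Relaxed),
        FREE_LIST_MISSES.load(Ordering::Relaxed),
    );
}
```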
So only about 1/20 allocations use recycled blocks. Takeaways:
For the hell of it, I experimented with changing the tree's branching-factor parameter (builds with B = 3, 6, and 12; a sketch of the knob follows the comparisons below).

b6.so vs b12.so

b6.so vs b3.so

b6-shuffling.so vs b12-shuffling.so

b6-shuffling.so vs b3-shuffling.so
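For reference, the knob being varied above is the tree's branching factor; a sketch of its shape as a const generic (node layout hypothetical, not the crate's actual definition):

```rust
// Hypothetical node layout: B controls how many keys fit in one node,
// trading tree depth (fewer levels at larger B) against per-node scan
// cost and how many cache lines a single node spans.
struct Node<K, V, const B: usize> {
    len: usize,
    keys: [Option<K>; B],
    vals: [Option<V>; B],
}

impl<K: Copy, V: Copy, const B: usize> Node<K, V, B> {
    fn new() -> Self {
        Node { len: 0, keys: [None; B], vals: [None; B] }
    }
}

fn main() {
    // The b3/b6/b12 builds above correspond to instantiations like:
    let _b3: Node<u32, u32, 3> = Node::new();
    let _b6: Node<u32, u32, 6> = Node::new();
    let _b12: Node<u32, u32, 12> = Node::new();
}
```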
That's too bad that the tree parameter didn't have more of an effect -- thanks a bunch for looking at this though, as at least now we know!

(We've talked offline already, but) I agree with your general analysis above: it seems that the use case we have in RA2 is traversal-heavy and cache-miss-heavy on bigger workloads, but not allocator-heavy, and with little reuse. Given that, maybe it makes sense in hindsight that arena allocation would not have much impact either way here.

All of that said, thank you so much for going through the exercise of actually building it and measuring it here -- this has been really valuable exploration.
FWIW, I did this experiment and it was a wash. No real gains or slowdowns.

I also did an experiment where I didn't pin to one thread on one core, and allowed Cranelift to use all available parallelism. My thought was that maybe there was contention on locks in the malloc implementation, and that therefore by moving to bump/arena allocation we would see speedups in parallel benchmarks but not necessarily in serial benchmarks. This was also a wash, with no real gains or slowdowns (even when built with the shuffling allocator, which doesn't have thread-local caches and has a single lock for the whole allocator!).

So at this point, I don't think it is worth following this line of investigation any further. I think we can conclude that mallocs/frees aren't really a bottleneck for regalloc2. Based on my analysis of regalloc2's malloc usage and the lifetimes of its allocations, regalloc2 isn't thrashing the allocator; it just allocates some stuff, does a lot of work on that stuff, and then frees it. The allocation and freeing just don't register relative to the actual work. And I believe the work is cache-bound, so future efforts should focus on shrinking data structures to fit more of the working set in cache at the same time, making the working set smaller (e.g. less splitting), and general data-oriented engineering of the code base.

Closing this PR and the other one.
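As one concrete illustration of the "shrink the data structures" direction (types invented for the example; regalloc2 already uses u32 indices in many places): replacing pointer-sized fields with u32 indices into central tables halves such structs on 64-bit targets, fitting more of the working set in cache.

```rust
// Invented types, purely to illustrate the size difference; the
// asserts assume a 64-bit target.
struct FatEntry {
    target: *const u8, // 8-byte pointer into another structure
    pos: u64,          // 8-byte position
}

struct SlimEntry {
    target: u32, // 4-byte index into a central table
    pos: u32,    // 4-byte position
}

fn main() {
    assert_eq!(std::mem::size_of::<FatEntry>(), 16);
    assert_eq!(std::mem::size_of::<SlimEntry>(), 8);
}
```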
Depends on bytecodealliance/arena-btree#1