Use an arena-based BTree library #88

Closed

fitzgen
Member

@fitzgen fitzgen commented Sep 23, 2022

@fitzgen fitzgen requested a review from cfallin September 23, 2022 17:39
@fitzgen
Member Author

fitzgen commented Sep 23, 2022

Up to 1.18x faster in cycles on the sightglass benchmarks, with big wins in cache misses (1.5x to 2.2x fewer) and cache accesses (1.2x to 1.3x fewer); about 1.05x fewer instructions retired.

bz2

$ cargo run --release -- benchmark -e ~/scratch/arena.so -e ~/scratch/main.so -m perf-counters --stop-after compilation --processes 1 --iterations-per-process 40 --engine-flags="--disable-parallel-compilation --disable-cache" -- benchmarks/bz2/benchmark.wasm
    Finished release [optimized] target(s) in 0.07s
     Running `target/release/sightglass-cli benchmark -e /home/nick/scratch/arena.so -e /home/nick/scratch/main.so -m perf-counters --stop-after compilation --processes 1 --iterations-per-process 40 '--engine-flags=--disable-parallel-compilation --disable-cache' -- benchmarks/bz2/benchmark.wasm`

compilation :: cache-misses :: benchmarks/bz2/benchmark.wasm

  Δ = 2108648.82 ± 143597.02 (confidence = 99%)

  arena.so is 1.51x to 1.59x faster than main.so!

  [3146143 3845760.10 4176622] arena.so
  [5343061 5954408.92 6481465] main.so

compilation :: cache-accesses :: benchmarks/bz2/benchmark.wasm

  Δ = 3818308.85 ± 280059.26 (confidence = 99%)

  arena.so is 1.23x to 1.26x faster than main.so!

  [14828375 15560377.50 17216673] arena.so
  [18867462 19378686.35 20366619] main.so

compilation :: cpu-cycles :: benchmarks/bz2/benchmark.wasm

  Δ = 38941613.33 ± 19219287.85 (confidence = 99%)

  arena.so is 1.03x to 1.09x faster than main.so!

  [571722056 626451165.17 711378844] arena.so
  [621795300 665392778.50 740884727] main.so

compilation :: instructions-retired :: benchmarks/bz2/benchmark.wasm

  Δ = 49186818.10 ± 63349.50 (confidence = 99%)

  arena.so is 1.05x to 1.05x faster than main.so!

  [1077556564 1077809332.47 1078060160] arena.so
  [1126885057 1126996150.58 1127442426] main.so
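
The "1.51x to 1.59x faster" style ranges in these reports appear to follow from the Δ line and the mean of the faster engine: the bounds work out to 1 + (Δ ∓ error) / mean. The sketch below reproduces the bz2 cache-misses range from the numbers above. The function name `speedup_range` is hypothetical, and the formula is inferred from the printed figures, not taken from the sightglass source.

```rust
/// Bounds of the "Nx to Mx faster" range, inferred from the printed
/// numbers: 1 + (delta -/+ err) / mean of the faster engine.
/// (Assumption: this mirrors the report, not sightglass's actual code.)
fn speedup_range(delta: f64, err: f64, faster_mean: f64) -> (f64, f64) {
    let lo = 1.0 + (delta - err) / faster_mean;
    let hi = 1.0 + (delta + err) / faster_mean;
    (lo, hi)
}

fn main() {
    // bz2 cache-misses: delta = 2108648.82 +/- 143597.02,
    // arena.so mean = 3845760.10.
    let (lo, hi) = speedup_range(2108648.82, 143597.02, 3845760.10);
    println!("{lo:.2}x to {hi:.2}x"); // prints "1.51x to 1.59x"
}
```

The same arithmetic on the bz2 cpu-cycles figures gives 1.03x to 1.09x, which matches the report. A wide error term relative to Δ therefore shows up directly as a wide range.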

pulldown-cmark

$ cargo run --release -- benchmark -e ~/scratch/arena.so -e ~/scratch/main.so -m perf-counters --stop-after compilation --processes 1 --iterations-per-process 40 --engine-flags="--disable-parallel-compilation --disable-cache" -- benchmarks/pulldown-cmark/benchmark.wasm
    Finished release [optimized] target(s) in 0.07s
     Running `target/release/sightglass-cli benchmark -e /home/nick/scratch/arena.so -e /home/nick/scratch/main.so -m perf-counters --stop-after compilation --processes 1 --iterations-per-process 40 '--engine-flags=--disable-parallel-compilation --disable-cache' -- benchmarks/pulldown-cmark/benchmark.wasm`

compilation :: cache-misses :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 4770703.25 ± 196253.80 (confidence = 99%)

  arena.so is 1.80x to 1.87x faster than main.so!

  [4882433 5738295.45 6247242] arena.so
  [9574980 10508998.70 11563720] main.so

compilation :: cache-accesses :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 7220528.20 ± 455671.53 (confidence = 99%)

  arena.so is 1.22x to 1.25x faster than main.so!

  [29428460 30815515.30 33251042] arena.so
  [36929377 38036043.50 39868781] main.so

compilation :: cpu-cycles :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 152986047.12 ± 28594412.22 (confidence = 99%)

  arena.so is 1.11x to 1.16x faster than main.so!

  [1036551451 1141864121.40 1279073542] arena.so
  [1219401319 1294850168.53 1388134349] main.so

compilation :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 87259946.75 ± 57401.58 (confidence = 99%)

  arena.so is 1.05x to 1.05x faster than main.so!

  [1696639546 1696725575.50 1696907384] arena.so
  [1783778021 1783985522.25 1784597199] main.so

SpiderMonkey

$ cargo run --release -- benchmark -e ~/scratch/arena.so -e ~/scratch/main.so -m perf-counters --stop-after compilation --processes 1 --iterations-per-process 40 --engine-flags="--disable-parallel-compilation --disable-cache" -- benchmarks/spidermonkey/benchmark.wasm
    Finished release [optimized] target(s) in 0.07s
     Running `target/release/sightglass-cli benchmark -e /home/nick/scratch/arena.so -e /home/nick/scratch/main.so -m perf-counters --stop-after compilation --processes 1 --iterations-per-process 40 '--engine-flags=--disable-parallel-compilation --disable-cache' -- benchmarks/spidermonkey/benchmark.wasm`

compilation :: cache-misses :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 123745162.35 ± 19885865.40 (confidence = 99%)

  arena.so is 1.88x to 2.21x faster than main.so!

  [115759499 118334138.50 121153493] arena.so
  [82176442 242079300.85 261169472] main.so

compilation :: cache-accesses :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 202389509.40 ± 8891265.06 (confidence = 99%)

  arena.so is 1.28x to 1.30x faster than main.so!

  [686668405 702508605.20 714454288] arena.so
  [836680298 904898114.60 927760538] main.so

compilation :: cpu-cycles :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 3915119666.83 ± 805385256.91 (confidence = 99%)

  arena.so is 1.12x to 1.18x faster than main.so!

  [25074362631 25904269384.80 26765528578] arena.so
  [23709398637 29819389051.62 31702914091] main.so

compilation :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 2144950737.78 ± 1023862.38 (confidence = 99%)

  arena.so is 1.06x to 1.06x faster than main.so!

  [36455455691 36456713471.25 36457799041] arena.so
  [38589345943 38601664209.03 38604828198] main.so

@Amanieu
Contributor

Amanieu commented Sep 23, 2022

See #87 for more data on regalloc2's memory allocations. You posted this PR ~15 seconds after I opened the issue, great responsiveness!

Member

@cfallin cfallin left a comment

LGTM, thanks a ton for doing this work! Just some nits below.

@@ -11,6 +11,7 @@ description = "Backtracking register allocator inspired from IonMonkey"
repository = "https://github.com/bytecodealliance/regalloc2"

[dependencies]
arena-btree = { path = "../arena-btree" }

should refer to published crate here

@@ -432,6 +447,17 @@ pub struct Env<'a, F: Function> {
pub annotations_enabled: bool,
}

impl<'a, F: Function> Drop for Env<'a, F> {
fn drop(&mut self) {

I don't think we actually need this, given that we know our key and index types are Copy (u32s in fact)?

(Perhaps in trivial cases the empty drop impl means things optimize away, but I'd be a little skeptical of LLVM's ability to do that here in concert with the use of the drain iters etc)

}
}

pub fn drop(self, arena: &mut Arena<LiveRangeKey, LiveRangeIndex>) {

Likewise here, not needed given knowledge of key/value types?
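
The point about `Copy` key and value types can be checked with `std::mem::needs_drop`: a `Copy` type cannot implement `Drop`, so a container of such values has no per-element destructor work to do. A minimal sketch follows; the two structs are hypothetical stand-ins for the types named in the diff, and `pair_is_trivially_droppable` is an illustrative helper, not part of regalloc2 or the arena library.

```rust
use std::mem::needs_drop;

// Hypothetical stand-ins for the key and index types named in the diff;
// the real definitions live in regalloc2 and may differ.
#[allow(dead_code)]
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct LiveRangeKey {
    from: u32,
    to: u32,
}

#[allow(dead_code)]
#[derive(Clone, Copy, PartialEq, Eq)]
struct LiveRangeIndex(u32);

/// True when dropping a (K, V) pair runs no destructor code, so a tree of
/// such pairs has nothing to clean up beyond releasing its node storage.
fn pair_is_trivially_droppable<K, V>() -> bool {
    !needs_drop::<K>() && !needs_drop::<V>()
}

fn main() {
    // Copy types can never implement Drop, so both checks hold.
    assert!(pair_is_trivially_droppable::<LiveRangeKey, LiveRangeIndex>());
    // A String owns heap memory, so it does need drop glue.
    assert!(!pair_is_trivially_droppable::<u32, String>());
}
```

Whether the explicit drop paths can actually be deleted still depends on how the arena reclaims node storage, which this check does not cover.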

@fitzgen
Member Author

fitzgen commented Oct 13, 2022

Spent some time doing some more benchmarking of this PR and #92.

Benchmark Methods

  • Each run took 100 samples: 10 processes x 10 iterations per process
  • Parallel compilation disabled
  • Pinned to one logical CPU
  • Used the perf-counters measure
    • the cache measurements are for last-level caches, which are shared across
      all cores, and should be taken with a grain of salt
  • Ran the default sightglass suite:
    • bz2.wasm
    • pulldown-cmark.wasm
    • spidermonkey.wasm
  • Comparing 6 engine variants:
    • Three regalloc2 implementations:
      1. main
      2. vec-arena
      3. bumpalo-arena
    • Two allocator configurations:
      1. default
      2. shuffling
    • However, we only do head-to-head comparisons between implementations with
      the same allocator configuration.

Overall Summary

Neither arena implementation performed better than main. I was not able to
reproduce the earlier wins I saw with the vec-based arena.

The vec-based arena is a little slower than main with the default allocator.

With the shuffling allocator -- which makes allocation slower, and therefore can
penalize workloads that involve lots of allocation -- the bumpalo-based arena
is a little slower than main. This is a bit surprising and confusing since the
whole point of the arena is to avoid allocations, and regalloc2's b-trees make
many allocations.

The bumpalo-based arena seems to be a little faster than the vec-based arena.
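
For readers unfamiliar with the term, a vec-based arena generally means storing all nodes in one growable buffer and handing out small integer handles in place of pointers. The sketch below illustrates that general idea only; `VecArena` and `NodeId` are hypothetical names and this is not the code benchmarked in this PR.

```rust
/// A typed index into the arena: 4 bytes instead of an 8-byte pointer.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct NodeId(u32);

/// Minimal Vec-backed arena: nodes live contiguously in one buffer and are
/// all released together when the arena is dropped.
struct VecArena<T> {
    nodes: Vec<T>,
}

impl<T> VecArena<T> {
    fn with_capacity(cap: usize) -> Self {
        VecArena { nodes: Vec::with_capacity(cap) }
    }

    /// Appends a node and returns its handle.
    fn alloc(&mut self, value: T) -> NodeId {
        let id = NodeId(self.nodes.len() as u32);
        self.nodes.push(value);
        id
    }

    fn get(&self, id: NodeId) -> &T {
        &self.nodes[id.0 as usize]
    }

    fn get_mut(&mut self, id: NodeId) -> &mut T {
        &mut self.nodes[id.0 as usize]
    }
}

fn main() {
    let mut arena = VecArena::with_capacity(4);
    let a = arena.alloc(10u32);
    let b = arena.alloc(20u32);
    *arena.get_mut(a) += 1;
    println!("{} {}", arena.get(a), arena.get(b)); // prints "11 20"
}
```

One design difference worth keeping in mind: a `Vec` that outgrows its capacity copies every node to a new buffer, whereas a chunked bump allocator adds a new chunk and leaves existing nodes in place.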

Default Allocator

Summary:

  • vec-arena.so is slower than main.so on spidermonkey.wasm (1-4% slower in
    terms of cycles, 0.1% slower in terms of instructions)
  • bumpalo-arena.so seems to be on par with main.so (has 1-2% fewer last-level
    cache accesses on spidermonkey.wasm, but everything else is no-diff)

main.so vs vec-arena.so

Summary:

  • cycles on spidermonkey.wasm: main.so is 1.01x to 1.04x faster than vec-arena.so!
  • instructions retired on spidermonkey.wasm: main.so is 1.001x faster than vec-arena.so!
  • no other diffs

compilation :: cpu-cycles :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 3321226.64 ± 2347139.37 (confidence = 99%)

  main.so is 1.01x to 1.04x faster than vec-arena.so!

  [118968342 126593088.21 141535311] main.so
  [120578278 129914314.85 193223707] vec-arena.so

compilation :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 250377.03 ± 182714.10 (confidence = 99%)

  main.so is 1.00x to 1.00x faster than vec-arena.so!

  [166111363 166962295.96 167944220] main.so
  [166183812 167212672.99 167940135] vec-arena.so

compilation :: cache-misses :: benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [35557 74432.76 136085] main.so
  [37301 76460.87 100611] vec-arena.so

compilation :: cache-misses :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [23911 32620.57 38657] main.so
  [24397 33095.29 37056] vec-arena.so

compilation :: cache-misses :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [3441361 4046630.81 4532648] main.so
  [3409879 4007373.69 4861943] vec-arena.so

compilation :: cpu-cycles :: benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [4352335 4539535.79 6350143] main.so
  [4358350 4565799.14 5390065] vec-arena.so

compilation :: cache-accesses :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [7167659 7423189.00 7669007] main.so
  [7121794 7389363.06 8169958] vec-arena.so

compilation :: cache-accesses :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [96588 105038.96 109717] main.so
  [96219 104685.30 108858] vec-arena.so

compilation :: cache-accesses :: benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [277554 319404.66 340129] main.so
  [280054 318627.63 332594] vec-arena.so

compilation :: cpu-cycles :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [1440360 1509933.85 1749914] main.so
  [1449254 1512692.79 1756652] vec-arena.so

compilation :: instructions-retired :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [1816213 1838084.56 1858426] main.so
  [1821831 1835870.22 1856665] vec-arena.so

compilation :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [6945086 6974918.32 6992623] main.so
  [6951260 6972738.84 6992571] vec-arena.so

main.so vs bumpalo-arena.so

Summary:

  • cache accesses on spidermonkey.wasm: bumpalo-arena.so is 1.01x to 1.02x
    faster than main.so!
  • no other diffs

compilation :: cache-accesses :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 93886.94 ± 46912.21 (confidence = 99%)

  bumpalo-arena.so is 1.01x to 1.02x faster than main.so!

  [7015995 7323577.94 7557538] bumpalo-arena.so
  [7157855 7417464.88 7775004] main.so

compilation :: cache-misses :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [24253 32925.54 44973] bumpalo-arena.so
  [23792 33240.28 39935] main.so

compilation :: cache-accesses :: benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [288822 314983.61 331855] bumpalo-arena.so
  [275457 317565.22 338781] main.so

compilation :: cache-accesses :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [99976 104755.13 109641] bumpalo-arena.so
  [97170 105480.38 111072] main.so

compilation :: cpu-cycles :: benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [4334228 4510719.29 5113323] bumpalo-arena.so
  [4328428 4536406.38 6980371] main.so

compilation :: cpu-cycles :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [1443724 1518182.79 2177999] bumpalo-arena.so
  [1446833 1513733.51 1740785] main.so

compilation :: cpu-cycles :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [118397198 125623004.95 137038816] bumpalo-arena.so
  [119272696 125973215.25 139140952] main.so

compilation :: cache-misses :: benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [36886 74293.20 98685] bumpalo-arena.so
  [35082 74112.08 110603] main.so

compilation :: instructions-retired :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [1820167 1835752.18 1859170] bumpalo-arena.so
  [1818809 1838485.17 1858404] main.so

compilation :: cache-misses :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [3400231 4030084.47 4330616] bumpalo-arena.so
  [3443736 4034117.91 4309666] main.so

compilation :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [166201099 167036304.03 167964214] bumpalo-arena.so
  [166122777 166911409.42 167944317] main.so

compilation :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [6905697 6972474.39 6989491] bumpalo-arena.so
  [6954046 6975921.27 6992029] main.so

vec-arena.so vs bumpalo-arena.so

Summary:

  • cycles on spidermonkey.wasm: bumpalo-arena.so is 1.01x to 1.04x faster than
    vec-arena.so!
  • no other diffs

compilation :: cpu-cycles :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 3024230.65 ± 1579214.08 (confidence = 99%)

  bumpalo-arena.so is 1.01x to 1.04x faster than vec-arena.so!

  [118288070 124440635.57 131586209] bumpalo-arena.so
  [119976484 127464866.22 133721530] vec-arena.so

compilation :: cache-misses :: benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [36001 74734.03 137556] bumpalo-arena.so
  [36649 75578.58 120600] vec-arena.so

compilation :: cpu-cycles :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [1437801 1509094.21 1746979] bumpalo-arena.so
  [1449127 1520580.50 1762474] vec-arena.so

compilation :: cache-accesses :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [95687 104512.06 110325] bumpalo-arena.so
  [99566 105233.99 111077] vec-arena.so

compilation :: cache-accesses :: benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [277034 317410.86 364021] bumpalo-arena.so
  [290459 315464.64 332740] vec-arena.so

compilation :: cache-misses :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [3408984 4047884.95 4338857] bumpalo-arena.so
  [3412731 4039401.61 4614208] vec-arena.so

compilation :: cache-accesses :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [7071805 7327152.92 7575432] bumpalo-arena.so
  [7100365 7340224.74 7693277] vec-arena.so

compilation :: cpu-cycles :: benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [4353436 4557541.32 7536193] bumpalo-arena.so
  [4351769 4550991.17 5017868] vec-arena.so

compilation :: instructions-retired :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [1818977 1837099.83 1858868] bumpalo-arena.so
  [1819322 1838583.15 1856402] vec-arena.so

compilation :: cache-misses :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [23943 33113.44 46715] bumpalo-arena.so
  [24857 33126.08 38986] vec-arena.so

compilation :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [166150811 167010415.61 167937627] bumpalo-arena.so
  [166213517 167063192.94 167936966] vec-arena.so

compilation :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [6941943 6974230.83 6991006] bumpalo-arena.so
  [6950997 6974730.65 6998607] vec-arena.so

Shuffling Allocator

Summary:

  • main-shuffling.so is 0-2% faster than bumpalo-arena-shuffling.so on
    spidermonkey.wasm in terms of cycles
  • A few other statistically significant results regarding cache misses and
    accesses point to main-shuffling.so being faster than
    bumpalo-arena-shuffling.so and vec-arena-shuffling.so.

main-shuffling.so vs vec-arena-shuffling.so

Summary:

  • cache accesses on spidermonkey.wasm: main-shuffling.so is 1.00x to 1.01x
    faster than vec-arena-shuffling.so!
  • no other diffs

compilation :: cache-accesses :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 35540.93 ± 35312.03 (confidence = 99%)

  main-shuffling.so is 1.00x to 1.01x faster than vec-arena-shuffling.so!

  [9686659 9906567.47 10188952] main-shuffling.so
  [9668973 9942108.40 10212537] vec-arena-shuffling.so

compilation :: cache-misses :: benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [110724 139268.65 244985] main-shuffling.so
  [110932 135682.47 196493] vec-arena-shuffling.so

compilation :: cache-misses :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [51073 53992.76 72498] main-shuffling.so
  [51435 55212.08 78213] vec-arena-shuffling.so

compilation :: cpu-cycles :: benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [6546476 7032716.86 10570950] main-shuffling.so
  [6529655 6923396.83 9983800] vec-arena-shuffling.so

compilation :: cache-misses :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [4506515 4847335.75 6030634] main-shuffling.so
  [4512752 4894127.40 5739325] vec-arena-shuffling.so

compilation :: cache-accesses :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [136627 142074.75 159310] main-shuffling.so
  [135381 140979.39 157394] vec-arena-shuffling.so

compilation :: cpu-cycles :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [2092662 2293254.58 3444273] main-shuffling.so
  [2095181 2306138.83 3699533] vec-arena-shuffling.so

compilation :: cpu-cycles :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [176905719 187776237.75 214312547] main-shuffling.so
  [179386382 188312915.54 212210798] vec-arena-shuffling.so

compilation :: cache-accesses :: benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [413910 424070.53 481481] main-shuffling.so
  [414768 423157.76 447841] vec-arena-shuffling.so

compilation :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [8809879 8904520.47 9404248] main-shuffling.so
  [8797464 8907856.34 9413422] vec-arena-shuffling.so

compilation :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [209795765 210313371.76 211289903] main-shuffling.so
  [209772903 210278904.12 211254625] vec-arena-shuffling.so

compilation :: instructions-retired :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [2367464 2442552.56 2981399] main-shuffling.so
  [2365319 2442629.86 2981515] vec-arena-shuffling.so

main-shuffling.so vs bumpalo-arena-shuffling.so

Summary:

  • cache misses on pulldown-cmark.wasm: main-shuffling.so is 1.01x to 1.07x
    faster than bumpalo-arena-shuffling.so!
  • cycles on spidermonkey.wasm: main-shuffling.so is 1.00x to 1.02x faster than
    bumpalo-arena-shuffling.so!
  • no other diffs

compilation :: cache-misses :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 5701.45 ± 3948.42 (confidence = 99%)

  main-shuffling.so is 1.01x to 1.07x faster than bumpalo-arena-shuffling.so!

  [110326 134801.81 211303] bumpalo-arena-shuffling.so
  [115747 129100.36 145934] main-shuffling.so

compilation :: cpu-cycles :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 1833544.37 ± 1757183.72 (confidence = 99%)

  main-shuffling.so is 1.00x to 1.02x faster than bumpalo-arena-shuffling.so!

  [183299354 188604134.99 211892685] bumpalo-arena-shuffling.so
  [180928873 186770590.62 224929823] main-shuffling.so

compilation :: cache-misses :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [51847 55262.35 76081] bumpalo-arena-shuffling.so
  [50319 54217.59 75533] main-shuffling.so

compilation :: cpu-cycles :: benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [6508297 6881884.73 8548606] bumpalo-arena-shuffling.so
  [6526515 6783486.67 7986817] main-shuffling.so

compilation :: cpu-cycles :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [2094021 2308424.04 3768184] bumpalo-arena-shuffling.so
  [2083530 2290711.54 3598033] main-shuffling.so

compilation :: cache-misses :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [4515444 4834020.16 5571233] bumpalo-arena-shuffling.so
  [4486568 4800394.03 5148230] main-shuffling.so

compilation :: cache-accesses :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [135943 141308.61 157652] bumpalo-arena-shuffling.so
  [136630 141849.29 159224] main-shuffling.so

compilation :: cache-accesses :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [9663023 9912456.56 10162737] bumpalo-arena-shuffling.so
  [9617621 9901626.51 10125293] main-shuffling.so

compilation :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [8799736 8899505.30 9398204] bumpalo-arena-shuffling.so
  [8803072 8908807.28 9398923] main-shuffling.so

compilation :: instructions-retired :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [2367865 2444061.18 3078864] bumpalo-arena-shuffling.so
  [2365628 2441645.17 2980835] main-shuffling.so

compilation :: cache-accesses :: benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [413862 422908.24 437771] bumpalo-arena-shuffling.so
  [413025 422714.98 438622] main-shuffling.so

compilation :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [209732724 210295471.03 211245867] bumpalo-arena-shuffling.so
  [209519192 210279863.05 211216763] main-shuffling.so

vec-arena-shuffling.so vs bumpalo-arena-shuffling.so

Summary:

  • cache misses on spidermonkey.wasm: bumpalo-arena-shuffling.so is 1.00x to
    1.02x faster than vec-arena-shuffling.so!
  • no other diffs

compilation :: cache-misses :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 50513.07 ± 44807.18 (confidence = 99%)

  bumpalo-arena-shuffling.so is 1.00x to 1.02x faster than vec-arena-shuffling.so!

  [4513232 4800770.56 5177318] bumpalo-arena-shuffling.so
  [4526523 4851283.63 5377497] vec-arena-shuffling.so

compilation :: cache-misses :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [49204 54875.29 75681] bumpalo-arena-shuffling.so
  [51084 54490.75 64178] vec-arena-shuffling.so

compilation :: cpu-cycles :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [2100143 2305499.94 3499585] bumpalo-arena-shuffling.so
  [2082348 2292218.30 3529604] vec-arena-shuffling.so

compilation :: cache-misses :: benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [114637 129217.36 170460] bumpalo-arena-shuffling.so
  [114029 129626.69 169862] vec-arena-shuffling.so

compilation :: cpu-cycles :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [181294649 187670127.97 201030910] bumpalo-arena-shuffling.so
  [180946160 187098249.37 200894222] vec-arena-shuffling.so

compilation :: cache-accesses :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [9556818 9896018.57 10080262] bumpalo-arena-shuffling.so
  [9675638 9921761.02 10129442] vec-arena-shuffling.so

compilation :: cache-accesses :: benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [414151 421603.52 437250] bumpalo-arena-shuffling.so
  [413527 422408.12 451752] vec-arena-shuffling.so

compilation :: cpu-cycles :: benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [6537817 6795377.93 7946347] bumpalo-arena-shuffling.so
  [6536294 6804174.71 8137456] vec-arena-shuffling.so

compilation :: instructions-retired :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [2366055 2444235.25 3024281] bumpalo-arena-shuffling.so
  [2362953 2441385.69 2990922] vec-arena-shuffling.so

compilation :: cache-accesses :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [133287 141207.09 156049] bumpalo-arena-shuffling.so
  [135268 141054.50 162852] vec-arena-shuffling.so

compilation :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [8799647 8901409.73 9404684] bumpalo-arena-shuffling.so
  [8806424 8903136.24 9406136] vec-arena-shuffling.so

compilation :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [209563075 210289015.23 211232105] bumpalo-arena-shuffling.so
  [209614252 210283314.68 211324794] vec-arena-shuffling.so

@fitzgen
Member Author

fitzgen commented Oct 14, 2022

I dug some more into this with DHAT and I'm seeing that the lifetimes of the bump chunks inside bumpalo roughly match the lifetimes of the b-tree nodes in main. This suggests that regalloc2 isn't thrashing the allocator with a bunch of repeated malloc/free pairs so much as it is allocating a bunch of stuff, doing a bunch of work with that stuff, and then freeing it.

I verified this by logging free list hits (i.e. we reuse a previously allocated block) vs free list misses (i.e. we had to bump allocate a new block) in the arena free list. I got these numbers on spidermonkey.wasm compilation:

total:  622502 (100%)
hits:    28351 (4.6%)
misses: 594151 (95.4%)

So only about 1 in 20 allocations uses a recycled block.
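
The hit and miss counting described above can be sketched with a small free-list arena. This is an illustration under assumptions: `CountingArena` and its fields are hypothetical, and the slots are stored in a `Vec` for simplicity, not in bumpalo chunks as in the measured build.

```rust
/// Arena that recycles freed slots through a free list and counts how
/// often an allocation is served from it.
struct CountingArena<T> {
    slots: Vec<Option<T>>,
    free_list: Vec<u32>, // indices of freed slots available for reuse
    hits: u64,           // allocations that reused a freed slot
    misses: u64,         // allocations that had to grow the arena
}

impl<T> CountingArena<T> {
    fn new() -> Self {
        CountingArena { slots: Vec::new(), free_list: Vec::new(), hits: 0, misses: 0 }
    }

    fn alloc(&mut self, value: T) -> u32 {
        if let Some(idx) = self.free_list.pop() {
            self.hits += 1;
            self.slots[idx as usize] = Some(value);
            idx
        } else {
            self.misses += 1;
            self.slots.push(Some(value));
            (self.slots.len() - 1) as u32
        }
    }

    /// Drops the value in `idx` and remembers the slot for reuse.
    fn free(&mut self, idx: u32) {
        if self.slots[idx as usize].take().is_some() {
            self.free_list.push(idx);
        }
    }
}

/// Share of `part` in `total`, as a percentage.
fn percent(part: u64, total: u64) -> f64 {
    100.0 * part as f64 / total as f64
}

fn main() {
    let mut arena = CountingArena::new();
    // Allocate many, free late: almost nothing is served from the free list.
    let ids: Vec<u32> = (0..20).map(|i| arena.alloc(i)).collect();
    arena.free(ids[0]);
    arena.alloc(99); // the only allocation that reuses a slot
    println!("hits={} misses={}", arena.hits, arena.misses); // hits=1 misses=20

    // The spidermonkey.wasm counts reported above:
    println!("{:.1}%", percent(28351, 622502)); // prints "4.6%"
}
```

A workload that frees and reallocates in a tight loop would show the opposite ratio, which is the pattern the measurements rule out here.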

Takeaways:

  • It is possibly worth exploring using bumpalo directly without a freelist overlay on top of it. We get about 5% memory overhead but potentially much faster allocation.

  • If the cost of allocating and freeing pales in comparison to the work done using these blocks, then we aren't going to see much of a speedup from making allocation faster at all. (I suspect this is probably the case.)
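
The second takeaway is an instance of Amdahl's law: if allocation accounts for a fraction p of compile time and becomes s times faster, the overall speedup is 1 / ((1 - p) + p / s). A small sketch follows; the 2% figure is purely illustrative and was not measured in this PR.

```rust
/// Overall speedup when a fraction `p` of the run time is made `s` times
/// faster and the rest is unchanged (Amdahl's law).
fn overall_speedup(p: f64, s: f64) -> f64 {
    1.0 / ((1.0 - p) + p / s)
}

fn main() {
    // Hypothetical: allocation is 2% of compile time and becomes 10x faster.
    println!("{:.4}x", overall_speedup(0.02, 10.0));
    // Even free allocation (s -> infinity) is capped at 1 / (1 - p).
    println!("{:.4}x", overall_speedup(0.02, f64::INFINITY));
}
```

Under that assumption the ceiling is about 1.02x, which is within the noise of the cycle measurements above.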

@fitzgen
Member Author

fitzgen commented Oct 18, 2022

For the hell of it, I experimented with changing the B value for the b-trees. std's choice of B is 6 and I compared that against 3 and 12. Neither was an improvement over 6.
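
For context on what B controls: in Rust's std B-tree a node holds at most 2 * B - 1 keys, so B = 6 gives 11 keys per node. Larger B means fewer levels to walk but larger nodes to search and shift. The sketch below computes node capacity and the minimum number of levels for the three values tried; the key count of 100,000 is an arbitrary illustration, not a number from regalloc2.

```rust
/// Maximum keys per node for branching parameter `b`. Rust's std B-tree
/// defines its node capacity as 2 * B - 1.
fn node_capacity(b: usize) -> usize {
    2 * b - 1
}

/// Fewest levels a B-tree needs to hold `n` keys when every node is full:
/// a tree with `h` levels holds at most (2b)^h - 1 keys.
fn min_levels(b: usize, n: usize) -> u32 {
    let fanout = 2 * b;
    let mut levels = 0;
    let mut reach = 1usize; // (2b)^levels
    while reach - 1 < n {
        reach *= fanout;
        levels += 1;
    }
    levels
}

fn main() {
    let n = 100_000; // arbitrary key count for illustration
    for b in [3, 6, 12] {
        println!(
            "B={b}: up to {} keys per node, at least {} levels for {n} keys",
            node_capacity(b),
            min_levels(b, n)
        );
    }
}
```

For this key count the minimum depth is 7, 5, and 4 levels respectively, so going from 6 to 12 saves at most one level while doubling the per-node work, which is consistent with neither change helping.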

b6.so vs b12.so

    Finished release [optimized] target(s) in 0.18s
     Running `target/release/sightglass-cli benchmark -e /home/nick/scratch/b6.so -e /home/nick/scratch/b12.so --stop-after compilation --measure perf-counters --engine-flags=--disable-parallel-compilation --processes 10 --iterations-per-process 10 -- benchmarks/default.suite`

compilation :: cache-misses :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 5557.71 ± 5446.26 (confidence = 99%)

  6.so is 1.00x to 1.12x faster than 12.so!

  [43981 96226.24 132861] 12.so
  [44164 90668.53 125011] 6.so

compilation :: cache-accesses :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 93049.73 ± 45615.31 (confidence = 99%)

  12.so is 1.01x to 1.02x faster than 6.so!

  [7234109 7431053.49 7705827] 12.so
  [7364752 7524103.22 7979479] 6.so

compilation :: cache-misses :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [25753 36478.87 46780] 12.so
  [25344 37306.48 48519] 6.so

compilation :: cache-misses :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [3636748 4195650.62 4526189] 12.so
  [3672734 4162251.82 4507069] 6.so

compilation :: cpu-cycles :: benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [4565402 4859106.66 6102903] 12.so
  [4563173 4830891.87 7071373] 6.so

compilation :: cache-accesses :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [101317 107829.71 114491] 12.so
  [102925 108395.29 113922] 6.so

compilation :: cpu-cycles :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [1495157 1584916.62 1822714] 12.so
  [1507141 1590195.96 1839162] 6.so

compilation :: cache-accesses :: benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [295013 327156.18 358831] 12.so
  [296138 327610.76 387968] 6.so

compilation :: cpu-cycles :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [122757125 132182186.49 143967744] 12.so
  [123945449 132269100.76 142992622] 6.so

compilation :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [7025265 7057156.69 7074688] 12.so
  [7029664 7058630.99 7075939] 6.so

compilation :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [167807444 168879625.58 169876586] 12.so
  [167824494 168849081.24 170167519] 6.so

compilation :: instructions-retired :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [1853225 1869092.52 1884520] 12.so
  [1851866 1868990.47 1888014] 6.so

b6.so vs b3.so

    Finished release [optimized] target(s) in 0.09s
     Running `target/release/sightglass-cli benchmark -e /home/nick/scratch/b6.so -e /home/nick/scratch/b3.so --stop-after compilation --measure perf-counters --engine-flags=--disable-parallel-compilation --processes 10 --iterations-per-process 10 -- benchmarks/default.suite`

compilation :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 286491.05 ± 217111.61 (confidence = 99%)

  6.so is 1.00x to 1.00x faster than 3.so!

  [167743806 169187035.52 170235819] 3.so
  [167656121 168900544.47 170060760] 6.so

compilation :: cache-misses :: benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [45392 89402.99 141951] 3.so
  [39191 84709.30 144211] 6.so

compilation :: cpu-cycles :: benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [4543732 4814182.07 5862717] 3.so
  [4455692 4738084.71 5832006] 6.so

compilation :: cache-misses :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [25708 36415.94 48313] 3.so
  [24799 36005.49 54648] 6.so

compilation :: cpu-cycles :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [123191729 133769474.88 140952121] 3.so
  [123490027 132678198.72 148115215] 6.so

compilation :: cache-misses :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [3621993 4135165.02 4444120] 3.so
  [3669601 4167520.70 4605631] 6.so

compilation :: cache-accesses :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [101067 108712.55 125222] 3.so
  [102032 107935.43 115247] 6.so

compilation :: cache-accesses :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [7220683 7492162.95 7956672] 3.so
  [7369634 7529938.66 7932766] 6.so

compilation :: cache-accesses :: benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [296509 328627.73 351518] 3.so
  [294600 327488.44 351862] 6.so

compilation :: cpu-cycles :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [1512337 1586877.78 2222255] 3.so
  [1478149 1582202.11 2060300] 6.so

compilation :: instructions-retired :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [1856499 1872122.66 1893422] 3.so
  [1851833 1869705.23 1885830] 6.so

compilation :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [7029178 7060809.79 7080450] 3.so
  [7020553 7056232.96 7088388] 6.so

b6-shuffling.so vs b12-shuffling.so

    Finished release [optimized] target(s) in 0.09s
     Running `target/release/sightglass-cli benchmark -e /home/nick/scratch/b6-shuffling.so -e /home/nick/scratch/b12-shuffling.so --stop-after compilation --measure perf-counters --engine-flags=--disable-parallel-compilation --processes 10 --iterations-per-process 10 -- benchmarks/default.suite`

compilation :: cache-misses :: benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [118198 143929.27 235355] 12-shuffling.so
  [112703 140060.81 182074] 6-shuffling.so

compilation :: cpu-cycles :: benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [6722954 7110159.57 10262750] 12-shuffling.so
  [6705822 6991075.06 8317079] 6-shuffling.so

compilation :: cpu-cycles :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [182803705 191904395.17 210052481] 12-shuffling.so
  [181852363 190901135.61 265123169] 6-shuffling.so

compilation :: cache-misses :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [49633 58255.01 85670] 12-shuffling.so
  [51839 58515.35 72205] 6-shuffling.so

compilation :: cpu-cycles :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [2142454 2363043.73 3644493] 12-shuffling.so
  [2154495 2354381.06 3865439] 6-shuffling.so

compilation :: cache-misses :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [4635104 5012783.94 5716531] 12-shuffling.so
  [4672922 5021577.63 5375136] 6-shuffling.so

compilation :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [8889700 8987221.59 9469601] 12-shuffling.so
  [8891355 9000867.68 9485770] 6-shuffling.so

compilation :: cache-accesses :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [137269 145720.57 161339] 12-shuffling.so
  [136487 145597.99 163855] 6-shuffling.so

compilation :: cache-accesses :: benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [405853 432202.68 449273] 12-shuffling.so
  [393012 432456.71 448034] 6-shuffling.so

compilation :: instructions-retired :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [2398178 2476377.16 3111283] 12-shuffling.so
  [2399882 2475155.84 3051547] 6-shuffling.so

compilation :: cache-accesses :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [9809896 10123119.41 10387930] 12-shuffling.so
  [9856813 10125532.23 10909998] 6-shuffling.so

compilation :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [211968459 212551572.61 213444544] 12-shuffling.so
  [211612404 212545372.90 213436823] 6-shuffling.so

b6-shuffling.so vs b3-shuffling.so

    Finished release [optimized] target(s) in 0.09s
     Running `target/release/sightglass-cli benchmark -e /home/nick/scratch/b6-shuffling.so -e /home/nick/scratch/b3-shuffling.so --stop-after compilation --measure perf-counters --engine-flags=--disable-parallel-compilation --processes 10 --iterations-per-process 10 -- benchmarks/default.suite`

compilation :: cache-misses :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 4094.28 ± 3472.27 (confidence = 99%)

  6-shuffling.so is 1.00x to 1.05x faster than 3-shuffling.so!

  [119542 143857.73 191653] 3-shuffling.so
  [114082 139763.45 160456] 6-shuffling.so

compilation :: cache-accesses :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 1738.30 ± 1627.85 (confidence = 99%)

  6-shuffling.so is 1.00x to 1.01x faster than 3-shuffling.so!

  [423400 433981.29 447389] 3-shuffling.so
  [423852 432242.99 442607] 6-shuffling.so

compilation :: cpu-cycles :: benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [6719327 7084758.68 10064428] 3-shuffling.so
  [6701177 6956602.41 8325881] 6-shuffling.so

compilation :: cache-misses :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [52643 59154.94 73684] 3-shuffling.so
  [52708 58473.94 71230] 6-shuffling.so

compilation :: cpu-cycles :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [2159963 2376615.55 3625000] 3-shuffling.so
  [2159300 2354941.58 3618767] 6-shuffling.so

compilation :: cpu-cycles :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [183893238 191929395.85 213397247] 3-shuffling.so
  [183212367 190555168.62 209106642] 6-shuffling.so

compilation :: cache-misses :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [4657148 5025717.68 5850160] 3-shuffling.so
  [4678347 5034496.39 5531142] 6-shuffling.so

compilation :: cache-accesses :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [138475 146437.48 159005] 3-shuffling.so
  [139572 146521.74 161769] 6-shuffling.so

compilation :: cache-accesses :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [9876374 10125535.32 10363441] 3-shuffling.so
  [9897430 10130085.48 10367089] 6-shuffling.so

compilation :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [8884280 8991722.15 9493670] 3-shuffling.so
  [8888503 8990360.07 9489875] 6-shuffling.so

compilation :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [211996662 212589286.38 213410203] 3-shuffling.so
  [212195132 212613703.88 213350104] 6-shuffling.so

compilation :: instructions-retired :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [2402153 2476723.66 3022032] 3-shuffling.so
  [2398418 2476822.40 3103512] 6-shuffling.so
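
For reading the few non-"No difference" entries above: the reported ranges appear to follow from the difference of means and its confidence half-width. This is inferred from the numbers in this thread, not taken from sightglass's source, so treat the formula as an assumption. The helper name `speedup_range` is hypothetical.

```rust
/// Sketch of how an "A is X to Y faster than B" range can be reproduced
/// from a summary block. ASSUMPTION (inferred from the printed numbers,
/// not from sightglass's implementation):
///   delta = slower_mean - faster_mean
///   range = 1 + (delta -/+ ci_half_width) / faster_mean
fn speedup_range(faster_mean: f64, slower_mean: f64, ci_half_width: f64) -> (f64, f64) {
    let delta = slower_mean - faster_mean;
    let lo = 1.0 + (delta - ci_half_width) / faster_mean;
    let hi = 1.0 + (delta + ci_half_width) / faster_mean;
    (lo, hi)
}
```

Plugging in the pulldown-cmark cache-misses entry from the b6-shuffling vs b3-shuffling run (means 139763.45 and 143857.73, half-width 3472.27) gives a range that rounds to the printed "1.00x to 1.05x". The lower bound sitting at 1.00x is why these entries are best read as barely distinguishable from noise.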

@cfallin
Member

cfallin commented Oct 18, 2022

That's too bad that the tree parameter didn't have more of an effect -- thanks a bunch for looking at this though, as at least now we know!

(We've talked offline already but) I agree with your general analysis above: it seems that the use-case we have in RA2 is traversal-heavy, cache-miss-heavy on bigger workloads, but not allocator-heavy, and with little reuse. Given that, maybe it makes sense in hindsight that arena allocation would not have much impact either way here. All of that said, thank you so much for going through the exercise of actually building it and measuring it here -- this has been really valuable exploration.

@fitzgen
Member Author

fitzgen commented Oct 19, 2022

  • It is possibly worth exploring using bumpalo directly without a freelist overlay on top of it. We get about 5% memory overhead but potentially much faster allocation.

FWIW, I did this experiment and it was a wash. No real gains or slowdowns.
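
To make the trade-off in that experiment concrete, here is a minimal sketch of the two strategies. It is a simplified, index-based model over a `Vec`, not bumpalo's API and not the code in this PR; the type names `FreeListArena` and `BumpArena` are hypothetical.

```rust
/// Arena with a free-list overlay: freed slots are recycled, so memory
/// stays compact, at the cost of a free-list check on every allocation.
struct FreeListArena<T> {
    slots: Vec<Option<T>>,
    free_list: Vec<u32>,
}

impl<T> FreeListArena<T> {
    fn new() -> Self {
        Self { slots: Vec::new(), free_list: Vec::new() }
    }

    fn alloc(&mut self, value: T) -> u32 {
        if let Some(idx) = self.free_list.pop() {
            // Reuse a previously freed slot.
            self.slots[idx as usize] = Some(value);
            idx
        } else {
            self.slots.push(Some(value));
            (self.slots.len() - 1) as u32
        }
    }

    fn free(&mut self, idx: u32) {
        // Only record the slot once, even if `free` is called twice.
        if self.slots[idx as usize].take().is_some() {
            self.free_list.push(idx);
        }
    }

    fn slots_used(&self) -> usize {
        self.slots.len()
    }
}

/// Pure bump arena: allocation is always a push. Freed slots are never
/// reused, which is where the extra memory overhead comes from.
struct BumpArena<T> {
    slots: Vec<Option<T>>,
}

impl<T> BumpArena<T> {
    fn new() -> Self {
        Self { slots: Vec::new() }
    }

    fn alloc(&mut self, value: T) -> u32 {
        self.slots.push(Some(value));
        (self.slots.len() - 1) as u32
    }

    fn free(&mut self, idx: u32) {
        // Drop the value; the slot itself stays dead until the arena is dropped.
        self.slots[idx as usize] = None;
    }

    fn slots_used(&self) -> usize {
        self.slots.len()
    }
}
```

With an allocate, free, allocate sequence, the free-list version hands back the same index while the bump version keeps growing. When few allocations are freed before the whole arena is dropped, the two behave almost identically, which is consistent with the experiment being a wash.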

I also did an experiment where I didn't pin to one thread on one core and instead allowed Cranelift to use all available parallelism. My thought was that there might be contention on locks in the malloc implementation, so moving to bump/arena allocation might show speedups in parallel benchmarks but not necessarily in serial ones. This was also a wash, with no real gains or slowdowns (even when built with the shuffling allocator, which doesn't have thread-local caches and has a single lock for the whole allocator!).

So at this point, I don't think it is worth following this line of investigation any further. I think we can conclude that mallocs/frees aren't really a bottleneck for regalloc2. Based on my analysis of regalloc2's malloc usage and the lifetimes of the allocations, regalloc2 isn't thrashing the allocator; it just allocates some stuff, does a lot of work on that stuff, and then frees it. The allocation and freeing just don't register relative to the actual work. And I believe the work is cache bound, so future efforts should focus on shrinking data structures to fit more of the working set in cache at the same time, making the working set smaller (e.g. less splitting), and general data-oriented engineering of the code base.
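
As a rough illustration of what "shrinking data structures" can buy, the sketch below compares a record with 64-bit fields against one with 32-bit fields. Both structs are hypothetical and are not regalloc2's actual types, and the 64-byte cache line is an assumption about typical hardware.

```rust
use std::mem::size_of;

/// Hypothetical wide record: three 64-bit fields, 24 bytes.
#[allow(dead_code)]
struct WideRange {
    from: u64,
    to: u64,
    vreg: u64,
}

/// Hypothetical compact record: the same fields as 32-bit indices, 12 bytes.
#[allow(dead_code)]
struct CompactRange {
    from: u32,
    to: u32,
    vreg: u32,
}

/// ASSUMPTION: 64-byte cache lines, which is common but not universal.
const CACHE_LINE_BYTES: usize = 64;

/// How many whole values of `T` fit in one cache line.
fn entries_per_cache_line<T>() -> usize {
    CACHE_LINE_BYTES / size_of::<T>()
}
```

Under these assumptions, a traversal over `CompactRange` values touches five entries per cache line instead of two, so the same scan pulls in fewer than half as many lines. Whether 32-bit fields are wide enough depends on the real value ranges, which this sketch does not model.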

Closing this PR and the other one.

@fitzgen fitzgen closed this Oct 19, 2022