
feat: Runtime detection, take 2 #86

Open · wants to merge 18 commits into base: main
Conversation

@aumetra commented May 25, 2024

What type of PR is this?

feat: A new feature

Check the PR title.

  • This PR title matches the format: <type>(optional scope): <description>
  • The description of this PR title is user-oriented and clear enough for others to understand.
  • Attach the PR updating the user documentation if the current PR requires user awareness at the usage level. User docs repo

(Optional) More detailed description for this PR(en: English/zh: Chinese).

en:

This PR adds runtime detection of SIMD features but, unlike in #55, not on the level of SIMD instructions, but instead implements enum dispatch over multiple inner parsers that each either use AVX2, SSE2, or NEON (or the scalar fallback).
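As a rough sketch (all type names here are hypothetical, not the actual code in this PR), the enum-dispatch approach looks like this: detect the CPU features once, store the matching backend as an enum variant, and route every call through a single match.

```rust
// Hypothetical sketch of enum dispatch over backend-specific parsers.
// The variant is chosen once via runtime CPU feature detection; every
// subsequent call goes through one match instead of per-call detection.

struct ScalarParser;
struct Sse2Parser;

impl ScalarParser {
    fn parse(&self, input: &[u8]) -> usize {
        input.len() // stand-in for the real scalar parsing logic
    }
}

impl Sse2Parser {
    fn parse(&self, input: &[u8]) -> usize {
        input.len() // stand-in for the SSE2-accelerated logic
    }
}

enum ParserDispatch {
    Sse2(Sse2Parser),
    Scalar(ScalarParser),
}

impl ParserDispatch {
    fn detect() -> Self {
        #[cfg(target_arch = "x86_64")]
        {
            if std::arch::is_x86_feature_detected!("sse2") {
                return ParserDispatch::Sse2(Sse2Parser);
            }
        }
        ParserDispatch::Scalar(ScalarParser)
    }

    fn parse(&self, input: &[u8]) -> usize {
        match self {
            ParserDispatch::Sse2(p) => p.parse(input),
            ParserDispatch::Scalar(p) => p.parse(input),
        }
    }
}
```

The real PR additionally covers AVX2 and NEON variants; the shape of the dispatch is the same.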

(Optional) Which issue(s) this PR fixes:

Closes #14

(optional) The PR that updates user documentation:

@aumetra (Author) commented May 25, 2024

This doesn't yet conditionally apply the improvements for the NEON backend via the NeonBits struct. I'm thinking about adding support for that as well.

@liuq19 (Collaborator) commented May 30, 2024

Thanks a lot, I need some time to review this.

@aumetra (Author) commented Jun 1, 2024

That commit should get rid of a bunch of compile issues related to adding generic types to structs that don't take them.
I was testing this on an x86 system without passing any compiler options that would enable architecture-specific optimizations, meaning it only compiled the SSE2 path; I didn't even get to see the errors.

@aumetra (Author) commented Jun 3, 2024

I'm still trying to figure out the best way forward to make to_bitmask64 work on NEON CPUs.
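For context, here is a portable scalar model of what such a `to_bitmask64` computes (function name hypothetical). The NEON difficulty is that there is no direct `movemask` equivalent; the usual workaround is the `vshrn` narrowing-shift trick, which packs a 16-byte comparison result into a 64-bit value with 4 bits per lane instead of 1.

```rust
// Portable scalar model of a 64-byte `to_bitmask64`: one bit per input
// byte, set when that byte equals `needle`. SIMD backends compute the
// same mask from vector compares; on x86 `movemask` does the packing,
// while on NEON the `vshrn` narrowing-shift trick is the usual stand-in.
fn to_bitmask64_scalar(bytes: &[u8; 64], needle: u8) -> u64 {
    bytes.iter().enumerate().fold(0u64, |acc, (i, &b)| {
        acc | (u64::from(b == needle) << i)
    })
}
```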

@liuq19 (Collaborator) commented Jun 25, 2024

Thanks a lot, I will review the PR this week. I will benchmark the performance first.

@aumetra (Author) commented Aug 20, 2024

@liuq19 Would you be up for benchmarking an implementation for NEON that runtime-dispatches the bitmask creation? We could cache whether NEON (or any other feature, really) is supported in a global. That way the performance loss shouldn't be too bad.
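A sketch of the global-caching idea (static and function names hypothetical, not code from this PR):

```rust
use std::sync::OnceLock;

// Hypothetical sketch: cache the feature-detection result in a global so
// per-call dispatch is a cheap load plus a well-predicted branch.
// (std's detection macros cache internally too, but an explicit global
// keeps the hot path free of any macro machinery.)
static SIMD_SUPPORTED: OnceLock<bool> = OnceLock::new();

fn simd_supported() -> bool {
    *SIMD_SUPPORTED.get_or_init(|| {
        #[cfg(target_arch = "aarch64")]
        let supported = std::arch::is_aarch64_feature_detected!("neon");
        #[cfg(target_arch = "x86_64")]
        let supported = std::arch::is_x86_feature_detected!("avx2");
        #[cfg(not(any(target_arch = "aarch64", target_arch = "x86_64")))]
        let supported = false;
        supported
    })
}
```

The first call pays for the detection; every later call is a single initialized read.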

Because hacking in runtime dispatch for bitmask creation otherwise is really tricky.

@aumetra (Author) commented Aug 20, 2024

I hacked in the version that dispatches on each bitmask call. Maybe the performance hit is too severe to justify it.

@aumetra (Author) commented Aug 21, 2024

Okay, I'm not sure why this is broken. It's on ARM64, right?
I guess I'll have to whip out cross-compilation for now; I don't own a suitable ARM machine for testing.

@aumetra (Author) commented Aug 21, 2024

Now I just need to find a way to properly express this in trait form, preferably something very generic.

@liuq19 (Collaborator) commented Aug 21, 2024

I benchmarked on x86, and maybe SIMD is not working under runtime detection. Some functions may have been missed and are not running through SIMD.

     Running benches/deserialize_struct.rs (target/release/deps/deserialize_struct-016910da72a3ea13)
twitter/sonic_rs::from_slice_unchecked
                        time:   [861.50 µs 862.23 µs 863.02 µs]
                        change: [+76.254% +76.499% +76.735%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe
twitter/sonic_rs::from_slice
                        time:   [885.21 µs 885.86 µs 886.62 µs]
                        change: [+73.328% +73.607% +73.868%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

citm_catalog/sonic_rs::from_slice_unchecked
                        time:   [1.6269 ms 1.6298 ms 1.6327 ms]
                        change: [+71.170% +71.493% +71.800%] (p = 0.00 < 0.05)
                        Performance has regressed.
citm_catalog/sonic_rs::from_slice
                        time:   [1.6559 ms 1.6572 ms 1.6587 ms]
                        change: [+67.728% +67.938% +68.127%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

canada/sonic_rs::from_slice_unchecked
                        time:   [4.5318 ms 4.5332 ms 4.5349 ms]
                        change: [+20.602% +20.659% +20.716%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe
canada/sonic_rs::from_slice
                        time:   [4.5934 ms 4.5951 ms 4.5970 ms]
                        change: [+20.546% +20.602% +20.659%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) high mild
  4 (4.00%) high severe

@aumetra (Author) commented Aug 21, 2024

That's weird. On my local machine, the change is somewhere in the ballpark of 3-4%, which is acceptable. (I'd need to profile it to get a better idea of where the performance is lost; maybe there are optimization opportunities that are too opaque for the compiler with all the generics.)

     Running benches/deserialize_struct.rs (target/release/deps/deserialize_struct-b15a6d4de21d32b1)
Gnuplot not found, using plotters backend
twitter/sonic_rs::from_slice_unchecked
                        time:   [438.48 µs 440.11 µs 442.07 µs]
                        change: [+4.0017% +4.6062% +5.2032%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
  3 (3.00%) high mild
  10 (10.00%) high severe
twitter/sonic_rs::from_slice
                        time:   [445.74 µs 447.52 µs 449.94 µs]
                        change: [+3.0931% +3.6056% +4.2171%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) high mild
  8 (8.00%) high severe

citm_catalog/sonic_rs::from_slice_unchecked
                        time:   [856.18 µs 856.77 µs 857.50 µs]
                        change: [+1.2295% +1.3723% +1.5158%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe

@liuq19 (Collaborator) commented Aug 21, 2024

Could you remove or comment out the config in .cargo/config.toml?

@aumetra (Author) commented Aug 21, 2024

It already wasn't active due to my global Cargo config. But for the benches above I set -C target-cpu=native. I can somewhat reproduce your findings when I enable -C target-cpu=native for the main branch and disable it for the runtime-detection branch. But that ignores all the optimizations the Rust compiler skips by default when it isn't aware of the target CPU model.

I added debug statements and the runtime correctly detects that my CPU supports AVX2, with and without target-cpu=native.

@aumetra (Author) commented Aug 21, 2024

So it is much slower without target-cpu=native, but that is to be expected given all the CPU-specific optimizations LLVM can do. The runtime detection itself only sets performance back by under 5%, which is IMO acceptable for an opt-in feature.

@aumetra (Author) commented Aug 21, 2024

Never mind, I get what you mean. Let me look into it.

@liuq19 (Collaborator) commented Aug 22, 2024

Maybe we can try comparing more benchmarks.

};

use super::{Mask, Simd};
use crate::impl_lanes;

#[inline]
liuq19 (Collaborator):

We don't need this optimization; the asm generated from std::arch::is_x86_feature_detected is already optimized.
https://rust.godbolt.org/z/sdqefTPxW

aumetra (Author):

Ah, that's nice to know! I'll revert that, then.

@@ -81,7 +81,8 @@ name = "value_operator"
harness = false

[features]
-default = []
+default = ["runtime-detection"]
@liuq19 (Collaborator) commented Aug 30, 2024

Runtime detection always carries some overhead, so I think it is better not to enable the feature by default.
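On the consumer side, keeping it opt-in would look something like this (version number illustrative; the feature name is taken from the diff above):

```toml
[dependencies]
sonic-rs = { version = "0.3", features = ["runtime-detection"] }
```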

@liuq19 (Collaborator) commented Sep 3, 2024

Any updates?

@aumetra (Author) commented Sep 3, 2024

Sorry, I've been busy for the last two weeks, but hopefully I can get some work done today at the airport.

Successfully merging this pull request may close these issues.

support CPU feature detection and dispatch in runtime
2 participants