scrypt: SSE2/simd128 RoMix data layout optimization #622

Open: wants to merge 8 commits into master

Conversation

@eternal-flame-AD commented Aug 1, 2025

Prearranged the data into 128-bit lanes so we don't have to transpose back and forth in the BlockMix Salsa20 kernel on SSE2.

The permute constants are the same as https://github.com/RustCrypto/stream-ciphers/blob/07ee501ac9067abe0679a596aa771a575baec68e/salsa20/src/backends/soft.rs#L54-L57 read column-wise.

After:

> cargo bench 
test scrypt_15_8_1 ... bench: 180,070,625.10 ns/iter (+/- 4,549,929.06)
> RUSTFLAGS="-Ctarget-feature=+simd128" cargo bench --target wasm32-wasip1    
test scrypt_15_8_1 ... bench: 118,944,571.20 ns/iter (+/- 3,098,151.70)
> ssh cheap_vps cargo bench
test scrypt_15_8_1 ... bench: 304,886,161.00 ns/iter (+/- 6,625,867.19)

Before:

> cargo bench
test scrypt_15_8_1 ... bench: 230,760,302.00 ns/iter (+/- 8,838,571.54)
> RUSTFLAGS="-Ctarget-feature=+simd128" cargo bench --target wasm32-wasip1    
test scrypt_15_8_1 ... bench: 190,474,545.40 ns/iter (+/- 5,895,216.01)
> ssh cheap_vps cargo bench
test scrypt_15_8_1 ... bench: 409,880,353.40 ns/iter (+/- 17,629,444.54)

Picked from my own performance oriented implementation: https://github.com/eternal-flame-AD/scrypt-opt

Signed-off-by: eternal-flame-AD <[email protected]>
@eternal-flame-AD changed the title from "scrypt: SSE2 RoMix data layout optimization" to "scrypt: SSE2/simd128 RoMix data layout optimization" on Aug 1, 2025
@eternal-flame-AD (Author)

I tested on wasm32-wasip1 (wasmtime), aarch64-unknown-linux-musl (QEMU), and x86_64, plus a unit test to make sure the new kernels yield the same result as the original code (now moved into block_mix::soft). So it should cover all the code paths I added.

@tarcieri (Member) commented Aug 1, 2025

Sidebar: huh, interesting. I wasn't aware of that wasmtime target, but it's very cool that you can run WASM benchmarks from the CLI like that. I assume tests work too? If so, we should make use of that.

@eternal-flame-AD (Author)

@tarcieri it should be just:

[target.wasm32-wasip1]
runner = "wasmtime"

Then cargo test and cargo bench just work. Setting CARGO_TARGET_WASM32_WASIP1_RUNNER=wasmtime is equivalent.
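The full flow might look like this (a sketch, assuming wasmtime is installed and on PATH, and that the runner config above is in .cargo/config.toml or exported as the environment variable):

```shell
# One-time setup: add the WASI target (wasmtime itself is assumed installed).
rustup target add wasm32-wasip1

# Tell cargo to execute the produced .wasm binaries under wasmtime.
export CARGO_TARGET_WASM32_WASIP1_RUNNER=wasmtime

# Tests and benchmarks then run transparently inside the WASM runtime.
cargo test --target wasm32-wasip1

# With simd128 enabled, as in the benchmark numbers above.
RUSTFLAGS="-Ctarget-feature=+simd128" cargo bench --target wasm32-wasip1
```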

You can definitely set it up for CI (in another PR probably).

@newpavlov (Member) left a comment

Note that we usually prefer the module.rs style instead of module/mod.rs.

Signed-off-by: eternal-flame-AD <[email protected]>
@eternal-flame-AD (Author) commented Aug 1, 2025

I am benchmarking on real A64 hardware and noticed there is something wrong with the Arm NEON/ASIMD performance: almost +100% runtime compared to the soft backend on both a Raspberry Pi and my phone.

I can't immediately see what's wrong with it, except that the loads/stores are unaligned (which shouldn't be that bad); trying some equivalent ARX and load/store sequences didn't fix it either.

The assembly looks correct as well, with no obvious reason why it should be this slow. Unless I am missing something, it is probably just a bad interaction with the hardware and we ought to just remove it; bad luck.

The current A64 assembly: https://gist.github.com/eternal-flame-AD/540d80d33e1ac596740744fe8cd6c18f

If someone has a MacBook with Apple Silicon the results might be different; I suspect the ASIMD instruction latency (usually 2x-4x the SSE2 equivalent) is too high for deliberately serial algorithms like this.
