scrypt: SSE2/simd128 RoMix data layout optimization #622
Signed-off-by: eternal-flame-AD <[email protected]>
I tested on:
Sidebar: huh, interesting. I wasn't aware of that wasmtime target, but it's very cool that you can run WASM benchmarks from the CLI like that. I assume tests work too? If so, we should make use of that.
@tarcieri it should be just:
[target.wasm32-wasip1]
runner = "wasmtime"
Then you can definitely set it up for CI (in another PR probably).
Note that we usually prefer the module.rs style instead of module/mod.rs.
I am benchmarking on real A64 hardware and noticed that something is wrong with the Arm NEON/ASIMD performance: almost +100% runtime compared to soft, on both a Raspberry Pi and my phone. I can't immediately see what's wrong with it, except that the loads/stores are unaligned (which shouldn't be that bad). Trying some equivalent ARX and load/store sequences didn't fix it either. The assembly looks correct as well, and there is no obvious reason why it should be that slow... Unless I am missing something, it's probably just a bad interaction with the hardware and we ought to just remove it, bad luck. The current A64 assembly: https://gist.github.com/eternal-flame-AD/540d80d33e1ac596740744fe8cd6c18f If someone has a MacBook with Apple Silicon the results might be different. I suspect the ASIMD instruction latency (usually 2x-4x the SSE2 equivalent) is too high for deliberately serial algorithms like this.
Prearrange the data into 128-bit lanes so we don't have to transpose back and forth in the BlockMix Salsa20 kernel on SSE2.
The permute constants are the same as https://github.com/RustCrypto/stream-ciphers/blob/07ee501ac9067abe0679a596aa771a575baec68e/salsa20/src/backends/soft.rs#L54-L57 read column-wise.
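As a rough illustration of the layout idea (not the code in this PR), here is a minimal sketch in plain Rust. It assumes the linked lines are the four column-round quarter-round calls with indices (0, 4, 8, 12), (5, 9, 13, 1), (10, 14, 2, 6), (15, 3, 7, 11), so that reading them column-wise gives the lanes [0, 5, 10, 15], [4, 9, 14, 3], [8, 13, 2, 7], [12, 1, 6, 11]. The names `LANE_ORDER`, `Lanes`, `to_lanes`, `from_lanes`, `quarter_lanes` and `salsa20_8_lanes` are hypothetical, and `[u32; 4]` arrays stand in for 128-bit SIMD registers.

```rust
/// Lane k holds words LANE_ORDER[4*k..4*k+4] of the block.
/// Assumption: column-round indices of the soft backend, read column-wise.
const LANE_ORDER: [usize; 16] = [0, 5, 10, 15, 4, 9, 14, 3, 8, 13, 2, 7, 12, 1, 6, 11];

/// Four 128-bit lanes, emulated with plain arrays.
type Lanes = [[u32; 4]; 4];

/// Permute a block from natural word order into the lane layout.
fn to_lanes(x: &[u32; 16]) -> Lanes {
    let mut v = [[0u32; 4]; 4];
    for (i, &src) in LANE_ORDER.iter().enumerate() {
        v[i / 4][i % 4] = x[src];
    }
    v
}

/// Inverse of `to_lanes`.
fn from_lanes(v: &Lanes) -> [u32; 16] {
    let mut x = [0u32; 16];
    for (i, &dst) in LANE_ORDER.iter().enumerate() {
        x[dst] = v[i / 4][i % 4];
    }
    x
}

/// Four quarter-rounds at once, one per lane position.
fn quarter_lanes(a: &mut [u32; 4], b: &mut [u32; 4], c: &mut [u32; 4], d: &mut [u32; 4]) {
    for i in 0..4 {
        b[i] ^= a[i].wrapping_add(d[i]).rotate_left(7);
        c[i] ^= b[i].wrapping_add(a[i]).rotate_left(9);
        d[i] ^= c[i].wrapping_add(b[i]).rotate_left(13);
        a[i] ^= d[i].wrapping_add(c[i]).rotate_left(18);
    }
}

/// Salsa20/8 core with feed-forward, operating entirely in the lane layout.
fn salsa20_8_lanes(v: &mut Lanes) {
    let input = *v;
    let [a, b, c, d] = &mut *v;
    for _ in 0..4 {
        // Column round: the lanes already line up, no shuffle needed.
        quarter_lanes(a, b, c, d);
        // Row round: rotate elements within lanes, then swap the roles of b and d.
        d.rotate_left(1);
        c.rotate_left(2);
        b.rotate_left(3);
        quarter_lanes(a, d, c, b);
        // Undo the rotations to restore the lane layout.
        d.rotate_right(1);
        c.rotate_right(2);
        b.rotate_right(3);
    }
    for k in 0..4 {
        for i in 0..4 {
            v[k][i] = v[k][i].wrapping_add(input[k][i]);
        }
    }
}

/// One scalar quarter-round on a block in natural word order.
fn quarter(s: &mut [u32; 16], a: usize, b: usize, c: usize, d: usize) {
    s[b] ^= s[a].wrapping_add(s[d]).rotate_left(7);
    s[c] ^= s[b].wrapping_add(s[a]).rotate_left(9);
    s[d] ^= s[c].wrapping_add(s[b]).rotate_left(13);
    s[a] ^= s[d].wrapping_add(s[c]).rotate_left(18);
}

/// Scalar reference Salsa20/8 core with feed-forward.
fn salsa20_8_scalar(x: &mut [u32; 16]) {
    let input = *x;
    for _ in 0..4 {
        quarter(x, 0, 4, 8, 12);
        quarter(x, 5, 9, 13, 1);
        quarter(x, 10, 14, 2, 6);
        quarter(x, 15, 3, 7, 11);
        quarter(x, 0, 1, 2, 3);
        quarter(x, 5, 6, 7, 4);
        quarter(x, 10, 11, 8, 9);
        quarter(x, 15, 12, 13, 14);
    }
    for i in 0..16 {
        x[i] = x[i].wrapping_add(input[i]);
    }
}
```

Because the XOR and feed-forward steps are element-wise, they give the same result in either word order. In a sketch like this, a block would therefore only need `to_lanes` once on the way in and `from_lanes` once on the way out, rather than around every Salsa20 call.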
After:
Before:
Picked from my own performance-oriented implementation: https://github.com/eternal-flame-AD/scrypt-opt