
Conversation

@no-defun-allowed
Collaborator

This PR introduces an algorithm for computing the offset vector in the Compressor that uses the carryless multiply instruction, based on the branch-free, bit-parallel algorithm in https://branchfree.org/2019/03/06/code-fragment-finding-quote-pairs-with-carry-less-multiply-pclmulqdq/
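
For context, here is a minimal sketch of the core trick from the linked post (not the code in this PR), assuming x86_64 with the `pclmulqdq` feature; the function name `prefix_xor_clmul` is illustrative. Carrylessly multiplying a word by an all-ones operand makes bit i of the low half of the product the xor of bits 0..=i of the word, which is the bit-parallel prefix-xor primitive the post builds on.

```rust
// A minimal sketch of the prefix-xor-via-carryless-multiply trick (not the
// PR's code), assuming x86_64 with the `pclmulqdq` feature available.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "pclmulqdq")]
unsafe fn prefix_xor_clmul(word: u64) -> u64 {
    use core::arch::x86_64::*;
    let a = _mm_set_epi64x(0, word as i64); // word in the low 64-bit lane
    let ones = _mm_set1_epi8(-1);           // all-ones multiplicand
    // Carryless multiply of the low qwords: bit i of the low 64 bits of the
    // product is the xor of bits 0..=i of `word`.
    let product = _mm_clmulepi64_si128::<0>(a, ones);
    _mm_cvtsi128_si64(product) as u64       // keep the low 64 bits
}
```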

@no-defun-allowed
Collaborator Author

no-defun-allowed commented Jan 8, 2026

Some issues come to mind.

  • I've now introduced CPU-specific code to MMTk, and I don't know if we have a plan for how to organise that in the codebase. My current approach of dumping #[cfg(target_arch = "x86_64")] in the Compressor directory is rather undisciplined.
  • ranges::break_byte_range is patterned after ranges::break_bit_range; I don't know if that's a good approach, and it's currently undocumented and untested.
  • I have a portable prefix sum in forwarding::prefix_sum (strictly an xor-scan according to the Wikipedia article, though xor is addition over ℤ₂ if you're so inclined; see the sketch after this list), which appears to be about as fast as the original branchy algorithm. Is it worthwhile to keep the original branchy code, or should the fallback use that prefix sum algorithm?
  • I'm still waiting for properly-measured results, but so far the carryless multiply algorithm is either within the noise or up to 5% faster in STW time than the branchy algorithm. This might not be a large enough speedup to justify the complexity.
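
For reference, the portable xor-scan mentioned above can be written as the usual log-shift scan; this is a sketch of the technique under that assumption (the name `prefix_xor` is illustrative, not the PR's `forwarding::prefix_sum` itself):

```rust
/// Portable prefix-xor over a 64-bit word: afterwards, bit i of the result is
/// the xor of bits 0..=i of the input. Six branch-free shift/xor steps; this
/// is the "xor is addition over ℤ₂" reading of a prefix sum.
fn prefix_xor(mut word: u64) -> u64 {
    word ^= word << 1;
    word ^= word << 2;
    word ^= word << 4;
    word ^= word << 8;
    word ^= word << 16;
    word ^= word << 32;
    word
}
```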

@no-defun-allowed
Collaborator Author

Plotty, and here are the geomean results for clmul-enabled relative to clmul-disabled:

| Heap factor | Total time | STW time |
|------------:|-----------:|---------:|
| 1000 | 0.982 | 0.970 |
| 1416 | 0.997 | 0.984 |
| 1892 | 0.989 | 0.970 |
| 2428 | 0.997 | 0.983 |
| 3023 | 0.996 | 0.980 |
| 3678 | 0.996 | 0.977 |
| 4392 | 0.994 | 0.970 |
| 5166 | 0.995 | 0.981 |
| 6000 | 1.001 | 0.965 |

@qinsoon
Member

qinsoon commented Jan 11, 2026

> Plotty, and here are the geomean results for clmul-enabled relative to clmul-disabled:
>
> | Heap factor | Total time | STW time |
> |------------:|-----------:|---------:|
> | 1000 | 0.982 | 0.970 |
> | 1416 | 0.997 | 0.984 |
> | 1892 | 0.989 | 0.970 |
> | 2428 | 0.997 | 0.983 |
> | 3023 | 0.996 | 0.980 |
> | 3678 | 0.996 | 0.977 |
> | 4392 | 0.994 | 0.970 |
> | 5166 | 0.995 | 0.981 |
> | 6000 | 1.001 | 0.965 |

To clarify, do both builds include the change for computing multiple regions in one CalculateOffsetVector package? Just asking, as the PR has two changes (clmul and the work package change), and the performance impact of the work package change was not mentioned.

@no-defun-allowed
Collaborator Author

> To clarify, do both builds include the change for computing multiple regions in one CalculateOffsetVector package? Just asking, as the PR has two changes (clmul and the work package change), and the performance impact of the work package change was not mentioned.

Yes, both are with this pull request, with different values for MMTK_COMPRESSOR_USE_CLMUL. I haven't run systematic tests for the new work packets, but I found that one region* per work packet was too small, and scheduling overhead dominated the offset vector phase. Here is a trace for lusearch with the old work packets, for example:

[Screenshot: trace of lusearch with the old work packets]

and with the new work packets:

[Screenshot: trace of lusearch with the new work packets]

*I also neglected to mention that I shrank regions from 1 MiB to 256 KiB to improve work balancing. The original 1 MiB size was more-or-less arbitrary, and I found that the regions near the start of the heap would often accumulate long-lived objects, so the Compact phase would spend dramatically more time on those regions. (The offset vector phase would previously have spent more time on those regions too, but the running time of the new branch-free algorithm is proportional to the region size, not to the amount of live data.)

@no-defun-allowed
Collaborator Author

> I also neglected to mention that I shrank regions from 1 MiB to 256 KiB to improve work balancing.

I tested this PR against upstream; Plotty, and a tally of changes to stop-the-world times:

| Speedup | Count |
|------------------|------:|
| +20–25.1% | 3 |
| +10–20% | 2 |
| +5–10% | 2 |
| +2–5% | 1 |
| Too small to say | 3 |
| Too kafka to say | 1 |
| -2–5% | 5 |
| -5–5.4% | 2 |

Some benchmarks see large improvements due to the better work balancing, and some see small regressions due to worse locality with the smaller region size. (I only found the mutator time to be consistently worse on tradesoap, which also has the worst STW regression, at 5.4% slower.)
