
Conversation

@no-defun-allowed
Collaborator

This PR introduces an algorithm for computing the offset vector in the Compressor that uses the carryless multiply instruction, based on the branch-free, bit-parallel algorithm in https://branchfree.org/2019/03/06/code-fragment-finding-quote-pairs-with-carry-less-multiply-pclmulqdq/
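
For context, here is a minimal sketch of the core trick from the linked post (not the code in this PR), assuming x86_64 with the `pclmulqdq` feature; the function name `prefix_xor_clmul` is illustrative. Carrylessly multiplying a word by an all-ones operand makes bit i of the low half of the product the xor of bits 0..=i of the word, which is the bit-parallel prefix-xor primitive the post builds on.

```rust
// A minimal sketch of the prefix-xor-via-carryless-multiply trick (not the
// PR's code), assuming x86_64 with the `pclmulqdq` feature available.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "pclmulqdq")]
unsafe fn prefix_xor_clmul(word: u64) -> u64 {
    use core::arch::x86_64::*;
    let a = _mm_set_epi64x(0, word as i64); // word in the low 64-bit lane
    let ones = _mm_set1_epi8(-1);           // all-ones multiplicand
    // Carryless multiply of the low qwords: bit i of the low 64 bits of the
    // product is the xor of bits 0..=i of `word`.
    let product = _mm_clmulepi64_si128::<0>(a, ones);
    _mm_cvtsi128_si64(product) as u64       // keep the low 64 bits
}
```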

@no-defun-allowed
Collaborator Author

no-defun-allowed commented Jan 8, 2026

Some issues come to mind.

  • I've now introduced CPU-specific code to MMTk, and I don't know if we have a plan for how to organise that in the codebase. My current approach of dumping #[cfg(target_arch = "x86_64")] in the Compressor directory is rather undisciplined.
  • ranges::break_byte_range is patterned after ranges::break_bit_range; I don't know if that's a good approach, and it's currently undocumented and untested.
  • I have a portable prefix sum in forwarding::prefix_sum (strictly an xor-scan according to the Wikipedia article, though xor is addition over ℤ₂ if you're so inclined; see the sketch after this list), which appears to be about as fast as the original branchy algorithm. Is it worthwhile to keep the original branchy code, or should the fallback use that prefix sum algorithm?
  • I'm still waiting for properly-measured results, but so far the carryless multiply algorithm is either within the noise or up to 5% faster in STW time than the branchy algorithm. This might not be a large enough speedup to justify the complexity.
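
For reference, the portable xor-scan mentioned above can be written as the usual log-shift scan; this is a sketch of the technique under that assumption (the name `prefix_xor` is illustrative, not the PR's `forwarding::prefix_sum` itself):

```rust
/// Portable prefix-xor over a 64-bit word: afterwards, bit i of the result is
/// the xor of bits 0..=i of the input. Six branch-free shift/xor steps; this
/// is the "xor is addition over ℤ₂" reading of a prefix sum.
fn prefix_xor(mut word: u64) -> u64 {
    word ^= word << 1;
    word ^= word << 2;
    word ^= word << 4;
    word ^= word << 8;
    word ^= word << 16;
    word ^= word << 32;
    word
}
```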

@no-defun-allowed
Collaborator Author

Plotty, and here are the geomean results for clmul-enabled relative to clmul-disabled:

| Heap factor | Total time | STW time |
|------------:|-----------:|---------:|
| 1000 | 0.982 | 0.970 |
| 1416 | 0.997 | 0.984 |
| 1892 | 0.989 | 0.970 |
| 2428 | 0.997 | 0.983 |
| 3023 | 0.996 | 0.980 |
| 3678 | 0.996 | 0.977 |
| 4392 | 0.994 | 0.970 |
| 5166 | 0.995 | 0.981 |
| 6000 | 1.001 | 0.965 |

@qinsoon
Member

qinsoon commented Jan 11, 2026

> Plotty, and here are the geomean results for clmul-enabled relative to clmul-disabled:
>
> | Heap factor | Total time | STW time |
> |------------:|-----------:|---------:|
> | 1000 | 0.982 | 0.970 |
> | 1416 | 0.997 | 0.984 |
> | 1892 | 0.989 | 0.970 |
> | 2428 | 0.997 | 0.983 |
> | 3023 | 0.996 | 0.980 |
> | 3678 | 0.996 | 0.977 |
> | 4392 | 0.994 | 0.970 |
> | 5166 | 0.995 | 0.981 |
> | 6000 | 1.001 | 0.965 |

To clarify, do both builds include the change for computing multiple regions in one CalculateOffsetVector package? Just asking, as the PR has two changes (clmul and the work package change), and the performance impact of the work package change was not mentioned.

@no-defun-allowed
Collaborator Author

> To clarify, do both builds include the change for computing multiple regions in one CalculateOffsetVector package? Just asking, as the PR has two changes (clmul and the work package change), and the performance impact of the work package change was not mentioned.

Yes, both are with this pull request, with different values for MMTK_COMPRESSOR_USE_CLMUL. I haven't run systematic tests for the new work packets, but I found that one region* per work packet was too small, and scheduling overhead dominated the offset vector phase. Here is a trace for lusearch with the old work packets, for example:

[Screenshot: trace of lusearch with the old work packets]

and with the new work packets:

[Screenshot: trace of lusearch with the new work packets]

*I also neglected to mention that I shrank regions from 1 MiB to 256 KiB to improve work balancing. The original 1 MiB size was more-or-less arbitrary, and I found that the regions near the start of the heap would often accumulate long-lived objects, so the Compact phase would spend dramatically more time on those regions. (The offset vector phase would previously have spent more time on those regions too, but the running time of the new branch-free algorithm is proportional to the region size, not to the amount of live data.)

@no-defun-allowed
Collaborator Author

> I also neglected to mention that I shrank regions from 1 MiB to 256 KiB to improve work balancing.

I tested this PR against upstream; Plotty, and a tally of changes to stop-the-world times:

| Speedup | Count |
|------------------|------:|
| +20–25.1% | 3 |
| +10–20% | 2 |
| +5–10% | 2 |
| +2–5% | 1 |
| Too small to say | 3 |
| Too kafka to say | 1 |
| -2–5% | 5 |
| -5–5.4% | 2 |

Some benchmarks see large improvements due to the better work balancing, and some see small regressions due to worse locality with the smaller region size. (I only found the mutator time to be consistently worse on tradesoap, which also has the worst STW regression, at 5.4% slower.)
