
Cross Module Quickening Paper #696

Open · Fidget-Spinner opened this issue Sep 16, 2024 · 7 comments
Labels: research-interaction (notes from researchers, papers etc.)

@Fidget-Spinner (Collaborator) commented Sep 16, 2024

I found this interesting paper on Cross-Module Quickening by Felix Berlakovich and Stefan Brunthaler, presented at ECOOP 2024. It builds on CPython 3.12.0's specializing interpreter to quicken across C extensions; so far it seems they introduce additional specializations to achieve this. I'm still reading the paper to learn more.

The key idea is to provide the C extension with an interface to register specialized, extension-specific interpreter instructions

https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ECOOP.2024.6
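To make the target concrete, here is a Python-level sketch (my illustration, not code from the paper) of the kind of type-stable NumPy hot loop such extension-specific specializations would exploit: after warm-up, the `BINARY_OP` for `total + b` always sees two ndarrays, so a NumPy-registered instruction could skip the generic dispatch.

```python
# Illustrative only, not code from the paper: a type-stable hot loop of the
# kind CMQ targets. Assumes NumPy is installed.
import numpy as np

a = np.arange(1_000.0)
b = np.arange(1_000.0)

def hot_loop(total, n=100):
    for _ in range(n):
        total = total + b  # always ndarray + ndarray here
    return total

print(hot_loop(a)[:3])
```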

@Fidget-Spinner added the research-interaction label on Sep 16, 2024
@mdboom (Contributor) commented Sep 18, 2024

This is really similar to my idea in #432, though they have obviously taken it much further. I'm intrigued.

@mdboom (Contributor) commented Sep 18, 2024

My thoughts after reading this paper:

The results are impressive, and having written a lot of NumPy-using code in the past, I'm not surprised that the type stability is so exploitable. As with most optimizations on the Python side of NumPy, this benefits most when the data is small, and the benefit shrinks as the data gets larger. Still, it seems like it could be a way for the Faster CPython work to have more of an impact on numeric computing. There is of course a small penalty for the extra work in the non-specialized instructions to make this work, even for those who never load an optimization-aware C extension. I would want to run this against some pure-Python benchmarks, such as those in pyperformance, to better understand what that overhead is -- and maybe there are clever ways to negate it when we know it won't be needed.
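A rough way to see the small-vs-large effect (my illustration, not a benchmark from the paper): fixed per-operation interpreter and dispatch overhead dominates for small arrays and is amortized away for large ones.

```python
# Rough illustration (not from the paper): per-operation overhead dominates
# for small arrays, so interpreter-side savings matter most there.
# Assumes NumPy is installed; numbers will vary by machine.
import timeit
import numpy as np

for n in (10, 1_000_000):
    a = np.ones(n)
    t = timeit.timeit(lambda: a + a, number=1_000)
    print(f"n={n:>9}: {t:.4f}s for 1000 adds")
```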

I agree with the authors that NumPy is an important and interesting use case. It's not immediately clear how well the approach generalizes, but within the stdlib it could perhaps improve the performance of decimal and fractions (though maybe their work-planning/type-dispatch overhead is already so much lower than NumPy's that it matters less).
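For illustration (my sketch, not from the paper), the same type-stable pattern shows up with stdlib numeric types, which is what makes them plausible candidates:

```python
# Hedged sketch, not from the paper: a type-stable loop over
# fractions.Fraction, analogous to the NumPy case. The BINARY_OP for
# `total += f` always sees Fraction + Fraction, the shape a similar
# specialization could target.
from fractions import Fraction

def mean(fracs):
    total = Fraction(0)
    for f in fracs:
        total += f  # always Fraction + Fraction
    return total / len(fracs)

print(mean([Fraction(1, 3), Fraction(1, 6), Fraction(1, 2)]))
```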

The paper mentions using the inline cache to store the subscript of the expression:

```python
array[1:, 2:]
```

which is syntactic sugar for:

```python
array[(slice(1, None, None), slice(2, None, None))]
```
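One quick way to confirm that desugaring from Python itself (my snippet, not from the paper):

```python
# Capture the key passed to __getitem__ to see the desugared subscript.
class KeyProbe:
    def __getitem__(self, key):
        return key

print(KeyProbe()[1:, 2:])
# -> (slice(1, None, None), slice(2, None, None))
```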

Would the constant-propagation work we already have planned handle this case? (@Fidget-Spinner?) If so, at least for this fairly common NumPy idiom, that optimization might be handled by the interpreter's own optimizer in the near future.
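As a concrete data point, `dis` shows the runtime work that constant propagation of the slice tuple could fold away:

```python
# The bytecode for a constant-subscript expression builds the slices and the
# tuple at runtime (BUILD_SLICE, BUILD_TUPLE) before the subscript operation;
# that is the work a constant-propagation pass could fold into one constant.
import dis

def f(a):
    return a[1:, 2:]

dis.dis(f)
```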

For reference, see the code listing in the paper.

@fberlakovich, @sbrunthaler: Thanks for the great paper!

@Fidget-Spinner (Collaborator, Author) commented

> Would the constant-propagation work we already have planned handle this case?

If @brandtbucher's work on constant pools for Tier 2 is merged, then yes, we can do constant evaluation of slices in Tier 2 in the future with the partial evaluation pass.

@brandtbucher (Member) commented

> If @brandtbucher's work on constant pools for Tier 2 is merged, then yes, we can do constant evaluation of slices in Tier 2 in the future with the partial evaluation pass.

Not merged yet, but I'll open a PR soon.

@mdboom (Contributor) commented Sep 19, 2024

> > If @brandtbucher's work on constant pools for Tier 2 is merged, then yes, we can do constant evaluation of slices in Tier 2 in the future with the partial evaluation pass.
>
> Not merged yet, but I'll open a PR soon.

That may reduce the need for the "extension-delimited superinstructions" (replacing a range of instructions with a single specialized instruction) mentioned in the paper, and avoid that complexity.

@mdboom (Contributor) commented Sep 19, 2024

The authors' patch seems to have a performance penalty of about 3% on pyperformance. Let me stress that this shouldn't be an argument against it: I doubt they specifically optimized for low overhead when the optimization isn't in use. This is just for information as I start to look at and understand the patches.

EDIT: I think the base commit was selected incorrectly (it's an unusual case because the work is on the 3.12 branch, not main). I'm going to rerun against the correct base and regenerate the results.

@mdboom (Contributor) commented Sep 19, 2024

The authors' patch has a performance penalty below the noise threshold. This is for benchmarks that don't take advantage of the added CMQ APIs, of course.
