
Cross Module Quickening Paper #696

Open · Fidget-Spinner opened this issue Sep 16, 2024 · 7 comments
Labels: research-interaction (notes from researchers, papers etc.)

@Fidget-Spinner (Collaborator) commented Sep 16, 2024

I found this interesting paper on Cross-Module Quickening by Felix Berlakovich and Stefan Brunthaler, presented at ECOOP 2024. It builds on CPython 3.12.0's specializing interpreter to quicken across C extensions; so far it seems they introduce additional specializations to achieve this. I'm still reading the paper to learn more.

The key idea is to provide the C extension with an interface to register specialized, extension-specific interpreter instructions

https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ECOOP.2024.6
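To make the target concrete, here is a Python-level sketch (my illustration, not code from the paper) of the kind of type-stable NumPy hot loop such extension-specific specializations would exploit: after warm-up, the `BINARY_OP` for `total + b` always sees two ndarrays, so a NumPy-registered instruction could skip the generic dispatch.

```python
# Illustrative only, not code from the paper: a type-stable hot loop of the
# kind CMQ targets. Assumes NumPy is installed.
import numpy as np

a = np.arange(1_000.0)
b = np.arange(1_000.0)

def hot_loop(total, n=100):
    for _ in range(n):
        total = total + b  # always ndarray + ndarray here
    return total

print(hot_loop(a)[:3])
```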

@Fidget-Spinner added the research-interaction label on Sep 16, 2024
@mdboom (Contributor) commented Sep 18, 2024

This is really similar to my idea in #432, though they have obviously taken it much further. I'm intrigued.

@mdboom (Contributor) commented Sep 18, 2024

My thoughts after reading this paper:

The results are impressive, and having written a lot of NumPy-using code in the past, I'm not surprised that the type stability is so exploitable. As with most optimizations on the Python side of NumPy, this benefits most when the data is small, and the benefit shrinks as the data gets larger. Still, it seems like it could be a way for the Faster CPython work to have more of an impact on numeric computing. There is of course a small penalty for the extra work in the non-specialized instructions to make this work, even for those who never load an optimization-aware C extension. I would want to run this against some pure-Python benchmarks, such as those in pyperformance, to better understand what that overhead is -- and maybe there are clever ways to negate it when we know it won't be needed.
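A rough way to see the small-vs-large effect (my illustration, not a benchmark from the paper): fixed per-operation interpreter and dispatch overhead dominates for small arrays and is amortized away for large ones.

```python
# Rough illustration (not from the paper): per-operation overhead dominates
# for small arrays, so interpreter-side savings matter most there.
# Assumes NumPy is installed; numbers will vary by machine.
import timeit
import numpy as np

for n in (10, 1_000_000):
    a = np.ones(n)
    t = timeit.timeit(lambda: a + a, number=1_000)
    print(f"n={n:>9}: {t:.4f}s for 1000 adds")
```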

I agree with the authors that NumPy is an important and interesting use case. It's not immediately clear how well the approach generalizes, but within the stdlib it could perhaps improve the performance of decimal and fractions (though maybe their work-planning/type-dispatch overhead is already so much lower than NumPy's that it matters less).
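For illustration (my sketch, not from the paper), the same type-stable pattern shows up with stdlib numeric types, which is what makes them plausible candidates:

```python
# Hedged sketch, not from the paper: a type-stable loop over
# fractions.Fraction, analogous to the NumPy case. The BINARY_OP for
# `total += f` always sees Fraction + Fraction, the shape a similar
# specialization could target.
from fractions import Fraction

def mean(fracs):
    total = Fraction(0)
    for f in fracs:
        total += f  # always Fraction + Fraction
    return total / len(fracs)

print(mean([Fraction(1, 3), Fraction(1, 6), Fraction(1, 2)]))
```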

The paper mentions using the inline cache to store the subscript of the expression:

```python
array[1:, 2:]
```

which is syntactic sugar for:

```python
array[(slice(1, None, None), slice(2, None, None))]
```
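One quick way to confirm that desugaring from Python itself (my snippet, not from the paper):

```python
# Capture the key passed to __getitem__ to see the desugared subscript.
class KeyProbe:
    def __getitem__(self, key):
        return key

print(KeyProbe()[1:, 2:])
# -> (slice(1, None, None), slice(2, None, None))
```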

Would the constant-propagation work we already have planned handle this case? (@Fidget-Spinner?) If so, at least for this fairly common NumPy idiom, that optimization might be handled by the interpreter's own optimizer in the near future.
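As a concrete data point, `dis` shows the runtime work that constant propagation of the slice tuple could fold away:

```python
# The bytecode for a constant-subscript expression builds the slices and the
# tuple at runtime (BUILD_SLICE, BUILD_TUPLE) before the subscript operation;
# that is the work a constant-propagation pass could fold into one constant.
import dis

def f(a):
    return a[1:, 2:]

dis.dis(f)
```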

For reference, see the code listing in the paper.

@fberlakovich, @sbrunthaler: Thanks for the great paper!

@Fidget-Spinner (Collaborator, Author) commented

> Would the constant-propagation work we already have planned handle this case?

If @brandtbucher's work on constant pools for Tier 2 is merged, then yes, we can do constant evaluation of slices in Tier 2 in the future with the partial evaluation pass.

@brandtbucher (Member) commented

> If @brandtbucher's work on constant pools for Tier 2 is merged, then yes, we can do constant evaluation of slices in Tier 2 in the future with the partial evaluation pass.

Not merged yet, but I'll open a PR soon.

@mdboom (Contributor) commented Sep 19, 2024

> > If @brandtbucher's work on constant pools for Tier 2 is merged, then yes, we can do constant evaluation of slices in Tier 2 in the future with the partial evaluation pass.
>
> Not merged yet, but I'll open a PR soon.

That may reduce the need for the "extension-delimited superinstructions" (replacing a range of instructions with a single specialized instruction) mentioned in the paper, and avoid that complexity.

@mdboom (Contributor) commented Sep 19, 2024

The authors' patch seems to have a performance penalty of about 3% on pyperformance. Let me stress that this shouldn't be an argument against it: I doubt they specifically optimized for low overhead when the optimization isn't in use. This is just for information as I start to look at and understand the patches.

EDIT: I think the base commit was selected incorrectly (it's an unusual case because the work is on the 3.12 branch, not main). I'm going to rerun against the correct base and regenerate the results.

@mdboom (Contributor) commented Sep 19, 2024

The authors' patch has a performance penalty below the noise threshold. This is for benchmarks that don't take advantage of the added CMQ APIs, of course.
