So this came from a discussion I had with @tekknolagi and Brandt at PyCon US. It depends on arbitrary-length superinstructions.
The main idea is that startup runs a lot of Python. There are two orthogonal ways to speed up startup: reduce the work done at startup, or speed up Python itself. Ideally we should do both. In the spirit of wacky ideas, I will suggest a moonshot idea to significantly speed up Python at startup only:
Assume startup code is mostly static, apart from fetching the system locale, codecs, encoding, etc. At build time, we collect the traces formed by the JIT during startup only. We then pass each entire trace as a single stencil to clang to compile (still respecting the tree structure, of course). At runtime, every startup will then find the new "startup superinstructions", and the jitted code will be extremely efficient. The main reason this will be significantly faster than simply turning on the JIT is that the entire trace becomes a single instruction, allowing clang to perform whole-of-trace optimizations.
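To make the whole-of-trace point concrete, here's a minimal C sketch of what a fused startup superinstruction could look like. Everything in it (the uop names, the frame/stack layout) is an illustrative stand-in rather than CPython's actual tier-two interface; the point is only that once several uop bodies share one compilation unit, clang can optimize across them instead of treating each stencil as an opaque boundary.

```c
/* Hypothetical sketch only: the uop names and the frame/stack types
 * below are illustrative stand-ins, not CPython's real tier-two API. */

typedef struct {
    void **stack_pointer;   /* top of the evaluation stack */
    void **locals;          /* fast-locals array           */
} frame_t;

/* Two uop bodies, each of which today would be compiled in isolation. */
static inline void uop_load_fast(frame_t *f, int oparg) {
    *f->stack_pointer++ = f->locals[oparg];
}

static inline void uop_store_fast(frame_t *f, int oparg) {
    f->locals[oparg] = *--f->stack_pointer;
}

/* A fused "startup superinstruction": both bodies now live in one
 * compilation unit, so clang can keep the value in a register and
 * delete the push/pop pair entirely -- the fused body collapses to
 * roughly `f->locals[dst] = f->locals[src]`, with no stack traffic.
 * This is the whole-of-trace optimization the proposal relies on. */
void super_load_then_store(frame_t *f, int src, int dst) {
    uop_load_fast(f, src);
    uop_store_fast(f, dst);
}
```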
This is somewhat similar to Stefan Brunthaler's multi-level quickening paper, which does a sort of "PGO" driven by benchmarks. However, since benchmarks are not a reliable sample of real-world code, this proposal limits the specialization to startup.
Look, I love Futamura projections as much as the next compiler engineer, but... I think that an idea like this probably needs at least a proof-of-concept to proceed much further. Things that jump out to me as potential issues that will need to be tackled early on:
Handling the thousands of potential deopt events correctly.
Handling internal loops and other control flow in the superinstruction.
Staying on trace in the wide range of possible startup paths.
How to encode thousands of opargs, operands, etc.
That's not even counting the wrinkles of raising and catching exceptions, performing calls through C code into more Python code, etc. It likely makes more sense to just add some more reasonably-sized-but-maybe-a-little-longer superinstructions that don't require deep surgery on the tier two instruction format itself. That seems quite a bit easier to experiment with and more likely to succeed.
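To make the deopt concern concrete, here is a hedged sketch in the same illustrative style as above (none of these types or names are CPython's real API): the moment a guard sits in the middle of a fused trace, the superinstruction needs a side exit that rebuilds the exact interpreter state at that point, and a whole startup trace would contain thousands of them.

```c
/* Hypothetical sketch only -- illustrative types, not CPython's real API. */

typedef struct {
    void **stack_pointer;   /* top of the evaluation stack */
    void **locals;          /* fast-locals array           */
} frame_t;

typedef enum { EXIT_NONE, EXIT_DEOPT } exit_kind;

typedef struct {
    exit_kind kind;
    int       resume_offset;   /* where the tier-one interpreter resumes */
} exit_info;

/* Stand-in for a real type guard such as "is this an exact int?". */
static inline int guard_passes(void *obj) {
    return obj != NULL;
}

/* Fused LOAD_FAST; GUARD; STORE_FAST.  The deopt path has to
 * materialize exactly the stack/locals state the interpreter expects
 * at the resume point; a long startup trace would contain thousands
 * of such exits, each with its own resume state to get right. */
exit_info super_guarded_copy(frame_t *f, int src, int dst, int resume_offset) {
    void *value = f->locals[src];
    if (!guard_passes(value)) {
        *f->stack_pointer++ = value;     /* rebuild the interpreter stack */
        return (exit_info){ EXIT_DEOPT, resume_offset };
    }
    f->locals[dst] = value;              /* fast path, no stack traffic   */
    return (exit_info){ EXIT_NONE, 0 };
}
```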