Currently we're not doing constant and float propagation in the tier 2 optimizer on main, because we don't want to create new objects or hold owned references in the trace.
I propose a way to circumvent this and do constant and float propagation. I will outline the design here. The idea is to pack the raw values into the operand itself.
We then introduce two new instructions:

```
LOAD_INT
LOAD_FLOAT
```

These construct a value from the immediate stored in the operand by calling `PyLong_FromLong` or `PyFloat_FromDouble`, respectively.
To keep the optimizer peepholes simple, we then introduce variants of the above instructions:

```
POP_TWO_LOAD_INT
POP_TWO_LOAD_FLOAT
```

Binary operations on known constants thus get propagated into these instructions, which a peephole pass then cleans up as follows:
Before:

```
LOAD_FAST x
LOAD_FAST y
POP_TWO_LOAD_INT
LOAD_FAST z
POP_TWO_LOAD_INT
```

After:

```
LOAD_INT (x + y + z)
```
Here's what must happen for this to be a win:
For floats:
The cost of loading two values from locals/registers, writing them to the stack, adding the dereferenced doubles, and then doing the specialized decref must be high. Because of the `Py_REFCNT(lhs) == 1` trick that we use to reuse float objects, I am not super optimistic about this being a big win for small runs of constants, so for now it should not be prioritised. My own experiments with the tier 2 interpreter from the lazy block versioning work suggest that floats are not that expensive, thanks to float reuse. In fact, we might even slow things down, because `PyFloat_FromDouble` always has to allocate.
For ints:
This is almost always guaranteed to be faster whenever we can constant propagate. There are no refcounting tricks for ints, and a boxed int is actually very expensive compared to a float: ints always allocate unless they come from the small ints pool, and even then we still save going through the `_PyLong_Add` subroutine, which is expensive because it operates on an array of digits. Note that in the specialized interpreter, float addition is a machine `double` addition, while int addition is an expensive subroutine. We should prioritize ints.
*Fidget-Spinner changed the title on Mar 30, 2024, from "Tier 2 optimizer: Constant and Float propagation, without the reference problems" to "Tier 2 optimizer: Constant and Float propagation, without the refcount problems", and again the same day to "Tier 2 optimizer: Constant Int and Float propagation, without the refcount problems".*
Why? Do you have evidence that this will make a large enough difference to prioritize over other work?
I'm not saying we shouldn't do this. I'm just a bit skeptical that it will make much of a difference.
> Why? Do you have evidence that this will make a large enough difference to prioritize over other work?
> I'm not saying we shouldn't do this. I'm just a bit skeptical that it will make much of a difference.
I don't. I meant that we should initially prioritize ints over floats, because my past experiments with the tier 2 interpreter suggest float operations are quite cheap. I don't think any of this will necessarily show up on a benchmark.