Investigation: Group benchmarks based on their profiling "characterization" #664
Comments
Drilling down on the individual benchmarks that are "heavily interpreter", these are the ones that get a lot slower with the JIT: unpack_sequence, hexiom, go, raytrace, chaos. Maybe looking at these more closely (with the pystats, perhaps) will reveal ways to improve the JIT.
Did those get slower with the JIT, or Tier 2 in general?
This is all with the JIT vs. Tier 1. Here are those same top 5 slowdowns (on JIT) for Tier 2 and JIT:
So that's interesting. For completeness, I did a full comparison of JIT vs. Tier 2 (not something we usually generate), and the only interpreter-heavy benchmark where the JIT actually makes things slower than Tier 2 is …
Another interesting observation: the …
We might also want to shrink the size of the abstract interpreter -- probably 1024 for the max size and 3× the trace length for the arena are more than sufficient for everything.
That's not really going to make it faster though, right? It saves some C stack space, but these buffers aren't all that big (by modern standards -- 30 years ago I'd be horrified :-) and we don't ever iterate over the entire buffer. (Or do we? In which case I'd take it back.)
Optimization attempts that fail because the trace is too short don't take up much time though -- they only process a few bytecodes and skip all of the expensive work (peephole, abstract interpreter, executor construction). Is the 99.5% rejected across all benchmarks or specific to those two? I'd almost think that if a benchmark spends a lot of time in …
The 99.5% rejected for being too short is across all benchmarks, but individually 99.8% of …

I'll see if I can get some more detail out of the profiler to see if there's anything in particular in …
Did these numbers change recently? I wonder if side-exits (a relatively new feature) might be involved. Maybe we should count optimization attempts from side-exits separately from those initiated by …
I had a talk with @brandtbucher and we thought that it might be possible that it's making stack-spilled variables extremely slow to access. Imagine having …

Also, Brandt mentioned that maybe the small helper functions we have in the uops symbols are not being inlined, because they are in a separate compilation unit, so they won't automatically be inlined unless LTO is on (and even then, will the compiler always do it?). Maybe we want to mark them static inline and put them in a header file to hint to the C compiler?
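A minimal sketch of that idea, with made-up file and symbol names rather than the actual CPython ones: putting a tiny helper in a header as static inline makes its body visible in every translation unit that includes it, so the compiler can inline the call without relying on LTO.

```c
/* optimizer_sym_inline.h -- hypothetical header; the struct and helper names
 * are illustrative, not the real CPython uops-symbol code. */
#ifndef OPTIMIZER_SYM_INLINE_H
#define OPTIMIZER_SYM_INLINE_H

#include <stdbool.h>

typedef struct _sym {
    int flags;  /* placeholder for the real symbol state */
} _sym;

/* Because the definition lives in the header, every caller sees the body
 * and the compiler can inline this cheap flag test at each call site. */
static inline bool
sym_has_no_info(const _sym *sym)
{
    return sym->flags == 0;
}

#endif /* OPTIMIZER_SYM_INLINE_H */
```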
That seems worth the effort to at least try.
The recent bug we had in the stats was that the side exit optimization attempts were not being counted (but all of the various error or success cases were). From that, we already know that the side exit stuff creates orders of magnitude more optimization attempts (1,500 traces vs. 1,267,520 side exits). As you say, that's all fine if most of them are short and bail out quickly, but it is a lot of work that ends up being thrown away; it would be nice if there were a faster way to know that we don't need to do it (I'm not sure there is...).
I looked at hexiom in detail with uops debugging on, and it looks like the optimizer keeps creating executors for line 274:

```python
possible = sum((1 if v in cell else 0) for cell in done.cells)
```

That's a generator expression called via C (the `sum()` call). The message I see repeatedly is …

Looking at the disassembly incorporated into the executor, I think it is this:

…
This is not the top of a loop, so it must be a side exit. Indeed, some more context shows:

```python
for v in range(8):
    # If there is none, remove the possibility from all tiles
    if (pos.tiles[v] > 0) and (left[v] == 0):
        if done.remove_unfixed(v):
            changed = True
        else:
            possible = sum((1 if v in cell else 0) for cell in done.cells)  # <----------- HERE
```

After turning on the right debug variable (…), …
Maybe @Fidget-Spinner can diagnose this part further?

There's more though. We should fail more loudly in this case --

```c
int32_t new_temp = -1 * tstate->interp->optimizer_side_threshold;
exit->temperature = (new_temp < INT16_MIN) ? INT16_MIN : new_temp;
```

Maybe the temperature calculation is wrong? @markshannon There definitely seems to be something that needs to be tuned better in case the abstract interpreter bails.
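A minimal sketch of what "failing more loudly" could look like here, assuming a hypothetical stats struct and debug level rather than the actual CPython counters and macros: record the bail in a dedicated side-exit counter and emit a debug message when a verbose level is enabled.

```c
/* Hypothetical sketch; none of these names are real CPython APIs. */
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint64_t side_exit_attempts;  /* optimizations triggered from side exits */
    uint64_t side_exit_bails;     /* of those, how many bailed out */
} side_exit_stats;

static void
note_side_exit_bail(side_exit_stats *stats, int debug_level, int target_opcode)
{
    stats->side_exit_bails++;
    if (debug_level >= 2) {
        fprintf(stderr,
                "optimizer: bailed while optimizing from side exit "
                "(opcode=%d, %llu bails / %llu attempts)\n",
                target_opcode,
                (unsigned long long)stats->side_exit_bails,
                (unsigned long long)stats->side_exit_attempts);
    }
}
```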
PS. The "Created a trace for ..." debug message does not indicate that an executor was successfully created. It only means that the initial step of that process, trace projection, succeeded with a not-too-short trace. We then do various other things which may fail (-1) or bail (0) before constructing the final executor. We should probably add level 1 or level 2 debug messages for the various possible bail points. (And these should get the level from ….)
The only real reason …

In general, function versions are shared across code objects, with some checks. However, I don't know if that works for generator expressions. It might be that a brand new function version is used every single time, leading it to deopt out of there. We should just tell tier 2 to never optimize that trace.
There were some stats about the new optimizer that weren't added to the output tables. Once I did that and ran it locally, I see that out of 2,674,828 optimizer attempts, only 92,441 (3.5%) are successful. I don't know if that's just expected at this point. The …
Some more lab notes: for a smaller demo, one of the issues is that we can't translate … In this small reproducer:

```python
def testfunc():
    for i in range(20):
        sum(i for i in range(20))
```

the following seems to happen: …
It seems that we should implement some kind of exponential back-off to avoid such scenarios. This is not actually unique to generator expressions -- I can reproduce it with a generator function too:

```python
def generator():
    for i in range(20):
        yield i

def testfunc():
    for i in range(20):
        x = sum(generator())
    return x
```

Are there really only 5 bits for the temperature? Or do we have a chance to implement exponential backoff here? Another approach might be that if we know for sure that a trace can't be produced for a given side exit (like in this case), we generate a different exit instruction that uses …
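A rough sketch of the back-off idea, using made-up field names rather than the real side-exit layout (the width of the actual temperature counter is whatever CPython defines): each failed attempt doubles the wait before the next one, so an exit that keeps failing to produce a trace quickly stops costing optimization work.

```c
/* Hypothetical sketch of exponential back-off for a side exit that keeps
 * failing to produce a trace; the struct and fields are illustrative. */
#include <stdint.h>

typedef struct {
    int16_t temperature;    /* executions remaining before the next attempt */
    uint8_t backoff_shift;  /* number of failed attempts so far */
} side_exit_sketch;

static void
reset_after_failed_attempt(side_exit_sketch *se)
{
    /* Double the wait each time, saturating so 1 << shift fits in int16_t. */
    if (se->backoff_shift < 14) {
        se->backoff_shift++;
    }
    se->temperature = (int16_t)(1 << se->backoff_shift);
}
```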
The other thing I learned is that, indeed, somehow the generator expression's function object isn't in the function version cache. This seems to happen more for generator expressions, but I can reproduce it with nested generator functions too. It seems that frequently calling …
That's definitely not expected. I am on the trail of several issues (see my messages above).
Oh. The difference between generator expressions and generator functions is that a generator function (the way I wrote the code, anyway) is stored in a local, and thus has a lifetime that overlaps with the previous incarnation of the same function. And the reason this matters is that when a function object is deallocated, it is evicted from the version cache.

When using a generator function, the sequence of bytecode is …, and the function object thus created stays alive in the fast locals until the next iteration. Here, … When projecting the trace, looking up the version number in the cache will then produce the correct function object, and all is well. (Except... At some later point it is also evicted. I don't understand that yet.)

OTOH, for a generator expression, the code sequence is …, i.e., the freshly created function object isn't stored in a local variable, just on the stack. When that stack entry is eventually popped, the function object is decrefed, which evicts it from the version cache, successfully. When projecting the trace, we therefore find nothing in the cache!

I'm not sure what to do about this. We may have to talk about it in person Monday. Could we perhaps make the version cache store code objects instead of function objects? The trace projection process ultimately needs a code object anyway. (But I understand there's a subtle issue of changes to the function object's ….)

Separately, there's the mystery that after some number of successful iterations, the variant using a generator function also ends up finding nothing in the cache. I haven't looked into this yet -- I thought the issue with generator expressions was more pressing. (EDIT: See next comment; this is explainable.)

In any case I believe that the things I've found so far (generator expressions, and the lack of exponential backoff for side exits) explain the problem with the "hexiom" benchmark spending a lot of time in …
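A hypothetical way to see the difference (not taken from the issue; the exact opcodes vary by CPython version) is to disassemble two variants side by side -- the generator function is bound to a name and stays alive between iterations, while the generator expression's function object only ever lives on the evaluation stack of the `sum()` call:

```python
# Illustrative only: the names here are made up for the comparison, and the
# bytecode printed by dis differs across CPython versions.
import dis

def with_generator_function():
    def generator():
        for i in range(20):
            yield i
    for i in range(20):
        x = sum(generator())  # the function object survives in a fast local
    return x

def with_generator_expression():
    for i in range(20):
        x = sum(j for j in range(20))  # the function object lives only on the stack
    return x

dis.dis(with_generator_function)    # MAKE_FUNCTION is followed by a store to a local
dis.dis(with_generator_expression)  # the freshly made function is consumed from the stack
```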
It looks like the issue with generator functions finding nothing in the cache is separate -- it's a loop in the module being run through the Tier 2 optimizer, tracing into …
@mdboom, this issue is yours again -- I have created various cpython issues to deal with the things I found.
By "characterization", I mean what category of functions dominate the runtime of each benchmark.
If we organize them by the top category in each benchmark, we get the following:
Benchmark by top profiling category
If you refine this to only include a benchmark in a category if that category represents more than 50% of the runtime:
Benchmarks that are heavily (over 50%) in a particular category
Interestingly, this doesn't seem to reveal too much related to profiling. (Admittedly, the only category where we would expect significant change is "interpreter"). The following results are for JIT (main) vs. Tier 1 (same commit), HPT at the 99th percentile:
Using only benchmarks where 50% of time is in a single category:
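As a rough sketch of the grouping described above (the data layout is assumed, not the actual pyperformance/profiling tooling, and the numbers are made up), bucketing each benchmark under its top category -- optionally requiring that category to cover more than half the runtime -- could look like this:

```python
# Hypothetical sketch: profiles maps benchmark -> {category: fraction of runtime}.
from collections import defaultdict

def group_by_top_category(profiles, min_share=0.0):
    """Bucket each benchmark under its dominant profiling category."""
    groups = defaultdict(list)
    for benchmark, categories in profiles.items():
        top_category, share = max(categories.items(), key=lambda kv: kv[1])
        if share > min_share:
            groups[top_category].append(benchmark)
    return dict(groups)

# Made-up example data, just to show the shape of the output.
profiles = {
    "hexiom": {"interpreter": 0.62, "gc": 0.10, "memory": 0.28},
    "raytrace": {"interpreter": 0.55, "math": 0.45},
    "json_loads": {"library": 0.70, "interpreter": 0.30},
}
print(group_by_top_category(profiles))                 # top category per benchmark
print(group_by_top_category(profiles, min_share=0.5))  # only the "heavy" ones
```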