Remove the commutation cache from the commutation checker #15988
mtreinish wants to merge 1 commit into Qiskit:main
Conversation
An early performance optimization we made in the commutation analysis pass in Qiskit#3878 was to cache the commutation relations between gates. Back then, commutation was only checked via the matrix multiplication method, where we would compose a pair of gates' unitary matrices forwards and backwards and determine whether the product of those forward and backward compositions was an identity. This was a fairly costly operation at the time for various reasons, and the cache enabled a large speedup by avoiding unnecessary repeated computation. However, in the more than six years since then the code base has evolved substantially, including no longer relying on matrix-multiplication-based commutation determination as the default. We now have a precomputed library of commuting gates, as well as special handling for cases where we can determine very cheaply whether gates commute. Additionally, all of this code has been ported to Rust, so the matrix multiplication approach is not nearly as expensive (although it's not free either). In this new world the cache is actually doing more harm than good, because maintaining it adds extra lookups and hashing which, while constant time, are not free. We've reached a point where the complexity of the cache is no longer worth it, so this commit removes the commutation cache.

One caveat is that some aspects of this internal cache have leaked into the public API. Specifically, the CommutationChecker class is public, and it includes a documented init argument `cache_max_entries` as well as two public methods, `clear_cached_commutations` and `num_cached_entries`. I believe the original intent of these methods was either to verify that the cache logic was correct (they were used in tests) or to let users manage the cache size manually if they wished. The `cache_max_entries` argument bounded the total memory used by the cache, to avoid consuming too much memory in certain applications. Since these are part of the public documented API, we can't remove them without violating our stability guidelines. Instead, this commit makes them no-ops; `num_cached_entries` now always returns 0 (since there are no longer any cached entries). All of them are deprecated in this PR to mark them for removal in 3.0.

In practice this change makes minimal performance difference, which means we no longer need the extra code, although some very specific benchmarks (those dominated by commutation checking) may see a small speedup. There is one case where running without a cache can be slower: when there are many repeated pairs of the same gates, involving a gate whose commutation we can only determine via matrix multiplication (i.e. it's not a known Pauli, a rotation gate, or in the library). In practice this doesn't come up very frequently, because in a preset pass manager's workflow we have typically already lowered to 1q and 2q standard gates that the checker knows how to work with. The only edge case is a circuit with a large number of custom 1q or 2q gates that have matrix definitions (which is not common). The asv benchmark for commutation analysis, however, will likely show a roughly 5x slowdown with this commit. That benchmark is hitting exactly this edge case because it runs the pass on a random circuit with gates up to 3 qubits in width, which involves multiple repeated checks via matrix multiplication that do not normally come up in practice. Additionally, unlike in Qiskit#3878, the regression being flagged, while roughly 5x, is on the scale of tens of milliseconds; back in 2020 we could only have dreamed of the CommutationAnalysis pass executing in under 100ms in asv, let alone a world where a 5x regression flagged by asv would still be so quick. The entire pass was at least two orders of magnitude slower at the point we introduced the cache.
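The matrix-multiplication check described above can be sketched with NumPy. This is a minimal illustration of the idea (two gates commute when their forward and backward compositions agree, i.e. AB(BA)⁻¹ = I); it ignores qubit-overlap handling and global-phase subtleties, and the function name and tolerance are illustrative, not Qiskit's actual implementation:

```python
import numpy as np

def commute_via_matmul(a: np.ndarray, b: np.ndarray, tol: float = 1e-8) -> bool:
    """Return True if a @ b equals b @ a, i.e. a*b*(b*a)^-1 is the identity."""
    forward = a @ b    # compose the pair one way...
    backward = b @ a   # ...and the other way
    return np.allclose(forward, backward, atol=tol)

# Pauli matrices as a quick sanity check
z = np.array([[1, 0], [0, -1]], dtype=complex)
x = np.array([[0, 1], [1, 0]], dtype=complex)
print(commute_via_matmul(z, z))  # True: any gate commutes with itself
print(commute_via_matmul(z, x))  # False: Z and X anticommute
```

The precomputed library and the special-case handling mentioned above exist precisely to avoid paying for these matrix products in the common cases.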
I'm strongly in favor of removing the custom hash logic we've implemented here too. But given that asv will flag a regression, do you have other benchmarks (I'm assuming you ran benchpress for this?) to support that we can drop the cache?
```python
        This method will always return 0 because there is no longer an
        internal cache.
        """
        return 0
```
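The no-op pattern in the diff above can be sketched as follows. This is a hedged illustration of keeping a cache-related method as a deprecated no-op for API stability; the warning text and class body here are simplified stand-ins, not Qiskit's exact implementation:

```python
import warnings

class CommutationChecker:
    """Simplified stand-in for the public class; only the no-op is shown."""

    def num_cached_entries(self) -> int:
        """Always returns 0 because there is no longer an internal cache."""
        warnings.warn(
            "num_cached_entries is deprecated and will be removed in 3.0; "
            "the internal commutation cache no longer exists.",
            DeprecationWarning,
            stacklevel=2,
        )
        return 0

checker = CommutationChecker()
with warnings.catch_warnings():
    warnings.simplefilter("ignore", DeprecationWarning)
    print(checker.num_cached_entries())  # prints 0
```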
So this is a backward-compatible API change, since the commutation checker still gives the same results, it just might take longer, but we are dropping functionality users might have relied upon. I think it would be nice to point this out more explicitly in the release notes, what do you think?
I can reword the release note to make it more explicit. Do you want me to add another note to call it out more?
Functionally, the caching only did anything in a very specific case: you had a pair of StandardGates whose parameters were all floats (which includes having no parameters), at least one of which we didn't account for in the library or another mechanism, and you were then repeating the same checks with these gates over and over. Outside of this specific case we never cached anything, so I'm not sure how people could be relying on it in practice.
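The narrow eligibility condition described in this comment can be sketched as a predicate. This is a hypothetical illustration of the gating logic, not the checker's real data model; the dict-based gate representation and function names are invented for the example:

```python
def _hashable_standard(gate: dict) -> bool:
    """A gate could be cached only if it was a standard gate with all-float
    (hence hashable) parameters; no parameters counts as all-float."""
    return gate.get("standard", False) and all(
        isinstance(p, float) for p in gate.get("params", [])
    )

def cache_eligible(gate_a: dict, gate_b: dict, resolved_by_library: bool) -> bool:
    """The old cache only engaged when the library (or another fast path)
    could not resolve the pair and both gates were cache-key friendly."""
    if resolved_by_library:
        return False
    return _hashable_standard(gate_a) and _hashable_standard(gate_b)

rzz = {"standard": True, "params": [0.5]}
custom = {"standard": False, "params": []}
print(cache_eligible(rzz, rzz, resolved_by_library=False))    # True
print(cache_eligible(rzz, custom, resolved_by_library=False)) # False
print(cache_eligible(rzz, rzz, resolved_by_library=True))     # False
```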
Rewording is good enough, just to point out users would have to do their own caching if they relied on that 👍🏻 I also don't think it's a commonly used thing, but the past has shown that every minuscule feature is used somehow by someone 😄
Asv only flags a regression on the … benchmark. We also don't run the standalone pass anymore, only … Here is the asv output of the benchmarks I ran: …

What I forgot to put in the commit message is that this PR is actually important for enabling subsequent optimizations. Specifically, the caching is why the commutation checker needs to be mutable everywhere it is used right now, which prevents parallelism in the commutation analysis (without copying it everywhere).
Ok, thanks!