Batch-optimization #3313
Conversation
Force-pushed from 2f7214a to 63b5512
That sounds really promising, and this already looks super clean! 🤩 Is this ready for review, or are you doing continued work here?
Not quite yet! I wrote more tests that try it with different trie settings, and not all of them are passing yet. Close, though!
Force-pushed from be5f023 to d1d9ee2
OK, ready for review! 👍
Hi there, I have now gone roughly through the commits and gotten a broad overview of the changes! This looks super clean regarding the execution, with a great and easy-to-review separation into clear commits; generally I am really excited about this work! 🤩
Let's please find a way to do this with substantially fewer code additions, though. +588 lines is far too much for a PR with a one-feature change, and this introduces too much code redundancy and generally new code to care about. There should be a way to do it with less, e.g. by adding additional optional parameters.
I do not have the full picture yet of what takes what, and where, and why 😋; we can discuss later in the day on the call or after that asynchronously.
Force-pushed from f42329e to 18d053a
This already looks a lot better, or rather "more compact". Let me know when this should get another look! 🙂
Force-pushed from cec3ce0 to 964b0ec
If this feels small enough to you, then it is ready for another review.
Force-pushed from 3455a3a to e075991
Rebased this via the UI.
🙏 🙏 🙏 🙏 🙏 🙏
This looks absolutely fabulous (without exaggeration)!!! 👍 🙂
Really dramatically clean now, super reduced, and great to review.
One longer basic comment follows on the overall API design, or rather on the options needed to switch the new functionality on and off.
Let me know if you have questions! 🙂
```ts
}
const sortedOps = orderBatch(ops, keyTransform)
let stack: TrieNode[] = []
const stackPathCache: Map<string, TrieNode> = new Map()
```
We need to make both the ordering and the caching optional for the `batch()` method, and therefore provide extra configuration parameters:
- Ordering: The reason is that in certain cases the ordering has an effect on the outcome. If there is first a put for a key and then a delete for the same key, this is obviously different than having it the other way around; the same holds for two puts for the same key with different values. So the ordering needs to be deactivated by default (also for backwards compatibility), and the user would need to activate it manually.
- Caching: The reason caching also needs an option, and also needs to be deactivated by default, is that one does not know how big a `batch()` operation will be (only the user knows, at best). So the cache might get "overloaded", i.e. there is a risk of out-of-memory situations.
I would suggest that we add a dedicated options dictionary with its own type `BatchOpts` in `types.ts` for this (seems worth it), and that we then add boolean flags `sortOperations` (or so) and `activatePathCache` (or so).
My tendency actually would be to also include `skipKeyTransform`, be a bit hacky, and change the function signature to `async batch(ops: BatchDBOp[], skipKeyTransformOrOptsDict?: boolean | BatchOpts)`, so that we can do a switch, i.e. a type check, on the parameter and either assign `skipKeyTransform` directly or use the options dict.
Then we can use the method directly with `trie.batch(ops, opts)` (or `trie.batch(ops, { sortOperations: true })`) and do not always need to channel in the `skipKeyTransform` default.
(Let me know if this has serious downsides I am overlooking, though. I would think this should still be backwards compatible for people.)
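A minimal sketch of how that signature handling could look (the flag names come from the suggestion above; the op shape is simplified and the exact runtime check is an assumption, not the final API):

```ts
// Simplified op shape for the sketch; the real BatchDBOp lives in the util package
type BatchDBOp =
  | { type: 'put'; key: Uint8Array; value: Uint8Array }
  | { type: 'del'; key: Uint8Array }

export interface BatchOpts {
  sortOperations?: boolean // opt-in: sort ops by key before applying
  activatePathCache?: boolean // opt-in: cache traversed node paths
  skipKeyTransform?: boolean // carried over from the current signature
}

async function batch(
  ops: BatchDBOp[],
  skipKeyTransformOrOptsDict?: boolean | BatchOpts,
): Promise<void> {
  // A plain typeof check distinguishes the legacy boolean argument
  // from the new options dict, keeping old call sites working
  const opts: BatchOpts =
    typeof skipKeyTransformOrOptsDict === 'boolean'
      ? { skipKeyTransform: skipKeyTransformOrOptsDict }
      : skipKeyTransformOrOptsDict ?? {}
  // ... apply ops using opts.sortOperations / opts.activatePathCache / opts.skipKeyTransform
}
```

Old call sites like `trie.batch(ops, true)` would then behave exactly as before.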
I did some further research into the effects of this in the context of other modules, and I agree about tweaking the options for how batch works based on the situation.
There are definitely circumstances in which it would be faster to simply iterate through the batch like we currently do, instead of sorting and caching. I also wonder how the results would change if we used different sorting and trie-walking algorithms. If there's something useful there, it could be another option in `batch()`.
I think any user with a bit of dyslexia will get dizzy trying to decipher that signature, though.
Thinking out loud here about the cache memory-overload issue.
The way we currently do the sorting, multiple batch ops for the same key remain in their original order relative to each other. So the order of multiple put or del operations for the same key should not be affected, and the end result should be the same.
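For illustration, this is just the stability property of the sort; a minimal sketch (op shape simplified, comparing raw key bytes):

```ts
type BatchDBOp =
  | { type: 'put'; key: Uint8Array; value: Uint8Array }
  | { type: 'del'; key: Uint8Array }

// Lexicographic byte comparison for keys
function compareBytes(a: Uint8Array, b: Uint8Array): number {
  const len = Math.min(a.length, b.length)
  for (let i = 0; i < len; i++) {
    if (a[i] !== b[i]) return a[i] - b[i]
  }
  return a.length - b.length
}

// Array.prototype.sort is required to be stable since ES2019, so two
// ops with an identical key keep their submitted relative order:
// [put k1, del k1] stays [put k1, del k1] after sorting
function orderBatchSketch(ops: BatchDBOp[]): BatchDBOp[] {
  return [...ops].sort((a, b) => compareBytes(a.key, b.key))
}
```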
This would for sure only be temporary until the next breaking releases (maybe autumn or so?); then we would remove the boolean variant again. This would have the advantage, though, that people could already move to the new API now and would not need to adjust again along the breaking releases, so I think I would still be a fan of taking this version. 🙂 (We can also pick these things up on the call.)
I am unsure, TBH. But it might very well be that it's a non-issue. I would be OK with trying it and, if things don't hold up for users in certain scenarios, eventually pushing another release with an additional flag.
I haven't looked at the sorting algorithm yet, but at the moment I find it hard to imagine a sorting algorithm so non-deterministic (in some sense; maybe "dynamic" might be a better word) that not one of the cases A, B and B, A is sorted in a way that changes behavior. 🤔 Not sure, maybe I'll have some additional time for a closer look at the algorithm later on.
Can this then please get an update, or otherwise please let me know where we still need to clarify things! 🙂 So, I would assume we now do:
Let me know if there is further need to discuss certain points; you can also already independently push the things which are unambiguous.
Continues work on optimizations described by @holgerd77 in #3293
`trie.batch()`

1. Externalized to `trie/batch.ts`

`trie.batch()` will now call an externalized function called `batch`, imported from `batch.ts`. `batch.ts` also contains helper functions and modified trie operations related to `batch`.

2. Return updated stack

`trie._updateNode` and `trie._createInitialNode` have been modified to return the stack of nodes after updating. This will be utilized in the batch optimization.
3. `orderBatch()`

`orderBatch()` is a helper function which sorts a batch of trie operations by key nibbles. This sorting allows for an optimized approach to updating trie nodes.
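A sketch of the nibble ordering idea (illustrative only, not the actual `orderBatch` code):

```ts
// Each key byte splits into two 4-bit nibbles, the alphabet of trie paths
function bytesToNibbles(key: Uint8Array): number[] {
  const nibbles: number[] = []
  for (const byte of key) {
    nibbles.push(byte >> 4, byte & 0x0f)
  }
  return nibbles
}

// Sorting keys lexicographically by nibbles means consecutive ops tend
// to share long path prefixes, which the path cache can then reuse
function compareNibbles(a: number[], b: number[]): number {
  const len = Math.min(a.length, b.length)
  for (let i = 0; i < len; i++) {
    if (a[i] !== b[i]) return a[i] - b[i]
  }
  return a.length - b.length
}
```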
4. Custom `_put` and `_del` functions

Modified versions of the trie operations `_put` and `_del` were added to `batch.ts`. These functions accept additional `path` parameters and return the modified `stack` of nodes following a trie update. This allows for caching of recently visited nodes within the `batch` operation, which reduces the amount of trie walking (`findPath()` calls) necessary for updates.
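In signature terms, the shape could look roughly like this (hypothetical declarations for illustration; the real functions in `batch.ts` operate on the trie's internal node types):

```ts
type Nibbles = number[]
// Stand-in for the trie's internal node union (branch/extension/leaf)
type TrieNode = unknown

// Unlike the regular trie methods, these variants receive the already
// traversed path and hand back the updated node stack for caching
declare function _put(
  key: Uint8Array,
  value: Uint8Array,
  path: { stack: TrieNode[]; remainingNibbles: Nibbles },
): Promise<TrieNode[]>

declare function _del(
  key: Uint8Array,
  path: { stack: TrieNode[]; remainingNibbles: Nibbles },
): Promise<TrieNode[]>
```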
5. `batchPut`

`batchPut` is a modified version of `trie.put` which acts as an intermediary between the `batch` method and the modified `_put` and `_del` methods.
6. `batch`

The refactored `batch` method keeps the same parameters, so no `batch` calls need to be modified elsewhere in the codebase. Using a `stackPathCache: Map<string, TrieNode>`, `_batch` will keep track of traversed node paths and nodes: with each trie update, the updated path from the new root to the new leaf will be stored in a map by its path nibbles. `_batch` will call `batchPut` for each `op` with the added parameters of the cached stack nodes and the nibble remainder of the next key.
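Putting the pieces together, a sketch of the cached batch loop (the helper signatures are assumptions based on the description above, not the actual `_batch` code):

```ts
type TrieNode = unknown
type BatchDBOp =
  | { type: 'put'; key: Uint8Array; value: Uint8Array }
  | { type: 'del'; key: Uint8Array }

// Assumed helpers, see the sketches above
declare function bytesToNibbles(key: Uint8Array): number[]
declare function batchPut(op: BatchDBOp, stack: TrieNode[]): Promise<TrieNode[]>

async function batchSketch(sortedOps: BatchDBOp[]): Promise<void> {
  // Updated nodes keyed by their nibble path from the root
  const stackPathCache: Map<string, TrieNode> = new Map()

  for (const op of sortedOps) {
    const nibbles = bytesToNibbles(op.key)

    // Reassemble the deepest cached stack along this key's path; since
    // ops are sorted, adjacent keys share prefixes and cache hits are likely
    const stack: TrieNode[] = []
    for (let i = 1; i <= nibbles.length; i++) {
      const node = stackPathCache.get(nibbles.slice(0, i).join(','))
      if (node === undefined) break
      stack.push(node)
    }

    // Continue the walk from the cached stack instead of the root,
    // saving findPath() calls, then cache the updated path
    // (prefix bookkeeping simplified; real paths advance by node key lengths)
    const updatedStack = await batchPut(op, stack)
    for (let i = 0; i < updatedStack.length; i++) {
      stackPathCache.set(nibbles.slice(0, i + 1).join(','), updatedStack[i])
    }
  }
}
```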
Benchmarks:

This refactored `batch` process completes 33% faster than before!