Add checkpoint/resume functionality to VespaNNParameterOptimizer
#1173
base: master
Conversation
Pull Request Overview
This PR adds checkpoint/resume functionality to the VespaNNParameterOptimizer class, enabling users to save intermediate optimization results and resume long-running optimization runs. The implementation stores query bucket indices and stage completion status to disk in JSON format.
Key changes:
- Checkpoint mechanism saves optimization progress at each stage (bucket distribution, filterFirstExploration, filterFirstThreshold, approximateThreshold, postFilterThreshold)
- Buckets now store query indices instead of full query objects to reduce state file size
- New parameters `run_name` and `resume` control checkpoint naming and resume behavior
- Replaced print statements with logger calls throughout the optimizer
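For context, a checkpoint of this kind could look roughly like the following. This is a minimal sketch with hypothetical helper names and a made-up state layout; the actual format lives in `vespa/evaluation.py`.

```python
import json
from pathlib import Path
from typing import Optional

# Hypothetical sketch of JSON checkpointing for stage results and bucket
# indices; names and layout are illustrative, not the PR's actual API.
def save_checkpoint(path: Path, state: dict) -> None:
    path.write_text(json.dumps(state, indent=2))

def load_checkpoint(path: Path) -> Optional[dict]:
    return json.loads(path.read_text()) if path.exists() else None

state = {
    "completed_stages": ["bucket_distribution", "filterFirstExploration"],
    # query indices per hit-ratio bucket instead of full query objects
    "buckets": {"0.0-0.1": [3, 17, 42], "0.1-0.2": [5, 8]},
    "results": {"filterFirstExploration": {"best_value": 0.25}},
}
save_checkpoint(Path("my_run_checkpoint.json"), state)
print(load_checkpoint(Path("my_run_checkpoint.json"))["completed_stages"])
```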
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| vespa/evaluation.py | Added checkpoint/resume functionality with state management methods, modified optimizer to store query indices instead of full queries in buckets, replaced print statements with logging, and refactored run() method into staged execution with checkpoint saving |
| tests/unit/test_evaluator.py | Added comprehensive tests for checkpoint save/load functionality and resume behavior validation |
Not sure; maybe one should just add the option to specify values for the parameters instead. 🤔 Then, if a value is specified, that stage is skipped and the given value is used. This achieves roughly the same and is much simpler, since it is the benchmarks that take time while the initial hit-ratio determination is comparatively short. If the tool takes so long that it becomes unbearable, one should probably make it faster rather than work around the slowness. Since it only needs to be run once, I think it is OK for it to be slow, but it should not be unbearably slow. For example, one could limit the number of queries in a bucket: if a bucket contains more queries than the limit, one could just choose a random subset.
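A minimal sketch of that alternative, with hypothetical parameter and function names (the real optimizer stages are in `vespa/evaluation.py`): a stage runs only when no value is supplied for it.

```python
# Hypothetical sketch: pass a value to skip the corresponding optimization
# stage; otherwise the (slow) benchmark for that stage runs as usual.
def _benchmark_filter_first_threshold(queries):
    # stand-in for the expensive benchmark that normally determines this value
    return 0.05

def optimize(queries, filter_first_threshold=None):
    if filter_first_threshold is None:
        filter_first_threshold = _benchmark_filter_first_threshold(queries)
    return {"filterFirstThreshold": filter_first_threshold}

# The caller already knows the value, so the benchmark stage is skipped:
print(optimize(queries=[], filter_first_threshold=0.1))
```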
Added
Don't we want to sample? Ideally, we would want to get the same number of queries, say 100, in each bucket. This then allows a reasonable choice for the number of benchmark repetitions. For example, 10 repetitions would then mean 10 * 100 = 1,000 samples, which would be constant across all buckets.
I did consider that, but isn't it much easier for the user to relate to total queries? And don't we want the distribution of queries to resemble their real query load? Considering that many queries may have the same hit ratio but look completely different and have different response times, it seems reasonable to try to match the distribution for more transferable results. I do agree that this may leave very few queries in some buckets, but as long as users are warned and aware of it, isn't that OK?
Limiting the number of queries per bucket still limits the total number of queries to `number_queries * number_of_buckets`, so this still lets you control the run time. But the goal isn't to control the run time; the goal is to get a reasonable result in a reasonable amount of time, isn't it?
What would be the advantage of that? It means we would let buckets with few queries "starve", which in turn means we would get nonsensical results for those buckets, and that is something we really don't want. When analyzing the behavior across hit ratios, queries with different hit ratios are "independent" in some sense, since they lead to different strategies in Vespa. And that is also what we are doing in the tool and in the plots: we look at all these buckets independently, not at the average response time and recall across all hit ratios.
That's the point of sampling a subset of the queries when a bucket is very large: we want reasonable information about that bucket without spending too much time on it. If a bucket contains only a few queries, we can just use all of them to get as much information as possible about the behavior at that hit ratio. Again, since we are looking at all buckets independently, we probably do not want a distribution that resembles the real load; that would make sense if we were taking the average over all buckets.
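A sketch of that per-bucket capping, using hypothetical names (not the PR's code): buckets above the cap are randomly subsampled, smaller buckets are used in full.

```python
import random
from typing import Dict, List

def cap_bucket(query_indices: List[int], max_per_bucket: int = 100, seed: int = 42) -> List[int]:
    # Small buckets keep every query; large buckets contribute a random subset,
    # so the benchmarking cost per bucket is bounded and roughly uniform.
    if len(query_indices) <= max_per_bucket:
        return query_indices
    return random.Random(seed).sample(query_indices, max_per_bucket)

buckets: Dict[str, List[int]] = {"0.0-0.1": list(range(12)), "0.1-0.2": list(range(5000))}
capped = {hit_ratio: cap_bucket(indices) for hit_ratio, indices in buckets.items()}
print({k: len(v) for k, v in capped.items()})  # {'0.0-0.1': 12, '0.1-0.2': 100}
```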
Think I got it.
Good discussion 😄 Still need some input before making the changes:
Yes, but I think we also need to make it usable.
Exactly. I don't think we would benefit from approximating the total distribution at this point. You could use this to get the overall average response time/recall, but we do not compute this anywhere.
Yes. Having a single inaccurate bucket could throw the results off (suggest a latency spike where there is none, for example). So one has to try to avoid that.
We could do that. Here, one could either do this in general (leave out some buckets) or try this in a more targeted way. Here, one should try to do this with the

Edit: Another meaningful optimization could be to cache the exact results in recall computations.

Edit 2: And run the
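The caching idea from the first edit could look roughly like this (a sketch with hypothetical names, not the repository's API): the exact result set per query is computed once and reused across all candidate parameter values when computing recall.

```python
# Hypothetical sketch: memoize exact (ground-truth) result sets per query so
# recall against the exact baseline is not recomputed for every benchmark run.
_exact_cache = {}

def exact_results(query_idx, fetch_exact):
    if query_idx not in _exact_cache:
        _exact_cache[query_idx] = frozenset(fetch_exact(query_idx))
    return _exact_cache[query_idx]

def recall(approx_hits, query_idx, fetch_exact):
    exact = exact_results(query_idx, fetch_exact)
    return len(exact & set(approx_hits)) / max(len(exact), 1)

def _fake_fetch_exact(query_idx):
    # stand-in for an exact (non-approximate) query against the index
    return range(10)

print(recall(approx_hits=[0, 1, 2, 99], query_idx=7, fetch_exact=_fake_fetch_exact))  # 0.3
```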
I confirm that this contribution is made under the terms of the license found in the root directory of this repository's source tree and that I have the authority necessary to make this contribution on behalf of its copyright owner.
Closing #1170