Conversation

@thomasht86
Collaborator

@thomasht86 thomasht86 commented Nov 20, 2025

I confirm that this contribution is made under the terms of the license found in the root directory of this repository's source tree and that I have the authority necessary to make this contribution on behalf of its copyright owner.

Closing #1170

Key changes:

  • Checkpoint mechanism saves optimization progress at each stage (bucket distribution, filterFirstExploration, filterFirstThreshold, approximateThreshold, postFilterThreshold)
  • Buckets now store query indices instead of full query objects to reduce state file size
  • New parameters run_name and resume control checkpoint naming and resume behavior
  • Replaced print statements with logger calls throughout the optimizer
  • Added validation of the saved state against the current parameters to avoid resuming a run that was started with different input

@thomasht86 thomasht86 requested a review from Copilot November 20, 2025 12:34
Copilot finished reviewing on behalf of thomasht86 November 20, 2025 12:38
Contributor

Copilot AI left a comment


Pull Request Overview

This PR adds checkpoint/resume functionality to the VespaNNParameterOptimizer class, enabling users to save intermediate optimization results and resume long-running optimization runs. The implementation stores query bucket indices and stage completion status to disk in JSON format.

Key changes:

  • Checkpoint mechanism saves optimization progress at each stage (bucket distribution, filterFirstExploration, filterFirstThreshold, approximateThreshold, postFilterThreshold)
  • Buckets now store query indices instead of full query objects to reduce state file size
  • New parameters run_name and resume control checkpoint naming and resume behavior
  • Replaced print statements with logger calls throughout the optimizer

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 13 comments.

File Description
vespa/evaluation.py Added checkpoint/resume functionality with state management methods, modified optimizer to store query indices instead of full queries in buckets, replaced print statements with logging, and refactored run() method into staged execution with checkpoint saving
tests/unit/test_evaluator.py Added comprehensive tests for checkpoint save/load functionality and resume behavior validation

@thomasht86 thomasht86 requested a review from boeker November 20, 2025 14:19
@thomasht86 thomasht86 marked this pull request as ready for review November 20, 2025 14:26
@boeker
Contributor

boeker commented Nov 21, 2025

I'm not sure; maybe one should just add the option to specify values for the parameters instead. 🤔 Then, if a value is specified, that stage is skipped and the given value is used. This achieves roughly the same result and is much simpler, since it is the benchmarks that take time while the initial hit-ratio determination is comparatively short.

If the tool takes so long that it becomes unbearable, one should probably make it work faster instead of working around the slowness. Since it should only be run once, it is ok to be slow I think, but it should not be unbearably slow. For example, one could limit the number of queries in a bucket. If there are more queries in a bucket than the limit specifies, one could just choose a random subset.

@thomasht86
Collaborator Author

Added max_queries_per_benchmark for default sampling of queries.
PTAL @boeker

@boeker
Contributor

boeker commented Nov 26, 2025

Don't we want to sample max_queries_per_benchmark queries per bucket? If we have 20,000 queries in a bucket, then we should sample 100 from these, but if we only have 90 in a bucket, then we should use all 90 queries. Otherwise, we might get very few queries in a bucket benchmark, which means that that specific benchmark result will be unreliable.

Ideally, we would want to get the same number of queries, say 100, in each bucket. This then allows us to make a reasonable choice for the number of benchmark repetitions. For example, 10 repetitions would then mean that we take 10 * 100 samples, which would be constant across all buckets.
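The per-bucket sampling proposed here can be sketched with a small hypothetical helper (the function name is invented for illustration): use every query when a bucket is at or under the limit, otherwise draw a fixed-size random subset.

```python
import random


def sample_bucket(queries, limit, rng=None):
    # Small bucket: keep all queries to get as much signal as possible.
    # Large bucket: cap the benchmark cost with a random subset.
    rng = rng or random.Random(0)
    if len(queries) <= limit:
        return list(queries)
    return rng.sample(queries, limit)


big = sample_bucket(list(range(20_000)), 100)
small = sample_bucket(list(range(90)), 100)
print(len(big), len(small))  # 100 90
```

This caps total benchmark work at roughly limit * number_of_buckets queries while never starving a small bucket.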

@thomasht86
Collaborator Author

Don't we want to sample max_queries_per_benchmark queries per bucket? If we have 20,000 queries in a bucket, then we should sample 100 from these, but if we only have 90 in a bucket, then we should use all 90 queries. Otherwise, we might get very few queries in a bucket benchmark, which means that that specific benchmark result will be unreliable.

Ideally, we would want to get the same number of queries, say 100, in each bucket. This then allows us to make a reasonable choice for the number of benchmark repetitions. For example, 10 repetitions would then mean that we take 10 * 100 samples, which would be constant across all buckets.

I did consider that, but isn't it much easier for users to relate to the total number of queries?
In the case you're suggesting, the user can't control the total number of queries (and thus the runtime) directly, right?

And don't we want the distribution of queries to resemble their real query load?

Considering that many queries may have the same hit ratios but look completely different and have different response times, it seems reasonable to try to match the distribution for more transferable results? Though I do agree that this may cause very few queries for some buckets, but as long as users are warned and aware, isn't that OK?

@boeker
Contributor

boeker commented Nov 26, 2025

I did consider that, but isn't it much easier for users to relate to the total number of queries? In the case you're suggesting, the user can't control the total number of queries (and thus the runtime) directly, right?

Limiting the number of queries per bucket still limits the total number of queries to number_queries * number_of_buckets, so this still lets you control the run time. But the goal isn't to control the run time; the goal is to get a reasonable result in a reasonable amount of time, isn't it?

And don't we want the distribution of queries to resemble their real query load?

What would be the advantage of that? This means that we are letting buckets with few queries "starve", which in turn means that we get nonsensical results for these buckets. And that is something we really don't want. When analyzing the behavior across hit ratios, queries with different hit ratios are "independent" in some sense since they lead to different strategies in Vespa. And that's also what we are doing in the tool and in the plots: we are looking at all these buckets independently and not at the average response time and recall across all hit ratios.

Considering that many queries may have same hit ratios, but look completely different and have different response times, it seems reasonable to try to match the distribution for more transferrable results? Though I do agree that this may cause very few queries for some buckets, but as long as they are warned and aware, isn't that OK?

That's the point of sampling a subset of the queries in a bucket if the number of queries in a bucket is very high. We want to get reasonable information about that bucket without spending too much time on it if the number of queries is high. If we have a bucket with only a few queries, we can just use all queries in that bucket to get as much information as possible about the behavior at that hit ratio. Again, since we are looking at all buckets independently, we probably do not want a distribution that resembles the real load. That would make sense when taking the average of all buckets.

@thomasht86
Collaborator Author

Think I got it.
So you're basically saying that:

  1. It is more important to have enough queries per bucket than to approximate the total distribution.
  2. More stable results for all buckets are more important than more accurate results for the few buckets with many queries.

Good discussion 😄

Still need some input before making the changes:

Limiting the number of queries per bucket still limits the total number of queries to number_queries * number_of_buckets, so this still lets you control the run time. But the goal isn't to control the run time; the goal is to get a reasonable result in a reasonable amount of time, isn't it?

Yes, but I think we also need to make it usable.
Since our buckets_per_percent parameter is an int, we get at least 100 buckets.
Even setting queries_per_bucket to 1 would then give 100 queries for benchmarking, which may mean a 3+ hour runtime.
Should we consider not testing all percentages?

@boeker
Contributor

boeker commented Nov 26, 2025

  1. It is more important to have enough queries per bucket than to approximate the total distribution.

Exactly. I don't think we would benefit from approximating the total distribution at this point. You could use this to get the overall average response time/recall, but we do not compute this anywhere.

  2. More stable results for all buckets are more important than more accurate results for the few buckets with many queries.

Yes. Having a single inaccurate bucket could throw the results off (suggest a latency spike where there is none, for example). So one has to try to avoid that.

Should we consider not testing all percentages?

We could do that. One could either do this in general (leave out some buckets) or in a more targeted way. One should try this with the filterFirstExploration suggestion first, since that stage takes the most time and has the most optimization potential. (Only the results for buckets with low hit ratios matter, so we only have to benchmark these.)

Edit: Another meaningful optimization could be to cache the exact results in recall computations.

Edit 2: And run the approximateThreshold benchmark only on hit ratios up to, say, 10%.
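The targeted optimizations suggested here amount to benchmarking a stage only on the buckets whose hit ratio falls below some cutoff. A minimal sketch, with hypothetical bucket data and a 10% cutoff as in the approximateThreshold example:

```python
# Hypothetical bucket map: hit ratio -> queries assigned to that bucket.
buckets = {0.01: ["q1", "q2"], 0.05: ["q3"], 0.25: ["q4"], 0.80: ["q5"]}

# Benchmark only low-hit-ratio buckets; skip the rest of the stage.
cutoff = 0.10
to_benchmark = {hr: qs for hr, qs in buckets.items() if hr <= cutoff}

print(sorted(to_benchmark))  # [0.01, 0.05]
```

The same filter, with different cutoffs, would cover both the filterFirstExploration case and the approximateThreshold case.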
