Conversation

@thomasht86
Collaborator

@thomasht86 thomasht86 commented Nov 20, 2025

I confirm that this contribution is made under the terms of the license found in the root directory of this repository's source tree and that I have the authority necessary to make this contribution on behalf of its copyright owner.

Closing #1170

Key changes:

  • Checkpoint mechanism saves optimization progress at each stage (bucket distribution, filterFirstExploration, filterFirstThreshold, approximateThreshold, postFilterThreshold)
  • Buckets now store query indices instead of full query objects to reduce state file size
  • New parameters run_name and resume control checkpoint naming and resume behavior
  • Replaced print statements with logger calls throughout the optimizer
  • Added validation of the saved state against the current parameters to avoid resuming a run that was started with different input

@thomasht86 thomasht86 requested a review from Copilot November 20, 2025 12:34
Copilot finished reviewing on behalf of thomasht86 November 20, 2025 12:38
Contributor

Copilot AI left a comment


Pull Request Overview

This PR adds checkpoint/resume functionality to the VespaNNParameterOptimizer class, enabling users to save intermediate optimization results and resume long-running optimization runs. The implementation stores query bucket indices and stage completion status to disk in JSON format.

Key changes:

  • Checkpoint mechanism saves optimization progress at each stage (bucket distribution, filterFirstExploration, filterFirstThreshold, approximateThreshold, postFilterThreshold)
  • Buckets now store query indices instead of full query objects to reduce state file size
  • New parameters run_name and resume control checkpoint naming and resume behavior
  • Replaced print statements with logger calls throughout the optimizer

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 13 comments.

File Description
vespa/evaluation.py Added checkpoint/resume functionality with state management methods, modified optimizer to store query indices instead of full queries in buckets, replaced print statements with logging, and refactored run() method into staged execution with checkpoint saving
tests/unit/test_evaluator.py Added comprehensive tests for checkpoint save/load functionality and resume behavior validation

@thomasht86 thomasht86 requested a review from boeker November 20, 2025 14:19
@thomasht86 thomasht86 marked this pull request as ready for review November 20, 2025 14:26
@boeker
Contributor

boeker commented Nov 21, 2025

I'm not sure; maybe one should just add the option to specify values for the parameters instead. 🤔 Then, if a value is specified, that stage is skipped and the given value is used. This achieves roughly the same result and is much simpler, since it is the benchmarks that take time while the initial hit-ratio determination is comparatively short.

If the tool takes so long that it becomes unbearable, one should probably make it work faster instead of working around the slowness. Since it should only be run once, it is ok to be slow I think, but it should not be unbearably slow. For example, one could limit the number of queries in a bucket. If there are more queries in a bucket than the limit specifies, one could just choose a random subset.

@thomasht86
Collaborator Author

Added max_queries_per_benchmark for default sampling of queries.
PTAL @boeker

@boeker
Contributor

boeker commented Nov 26, 2025

Don't we want to sample max_queries_per_benchmark queries per bucket? If we have 20,000 queries in a bucket, then we should sample 100 from these, but if we only have 90 in a bucket, then we should use all 90 queries. Otherwise, we might get very few queries in a bucket benchmark, which means that that specific benchmark result will be unreliable.

Ideally, we would want to get the same number of queries, say 100, in each bucket. This then allows us to make a reasonable choice for the number of benchmark repetitions. For example, 10 repetitions would then mean that we take 10 * 100 samples, which would be constant across all buckets.
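The per-bucket sampling proposed here can be sketched with a small hypothetical helper (the function name is invented for illustration): use every query when a bucket is at or under the limit, otherwise draw a fixed-size random subset.

```python
import random


def sample_bucket(queries, limit, rng=None):
    # Small bucket: keep all queries to get as much signal as possible.
    # Large bucket: cap the benchmark cost with a random subset.
    rng = rng or random.Random(0)
    if len(queries) <= limit:
        return list(queries)
    return rng.sample(queries, limit)


big = sample_bucket(list(range(20_000)), 100)
small = sample_bucket(list(range(90)), 100)
print(len(big), len(small))  # 100 90
```

This caps total benchmark work at roughly limit * number_of_buckets queries while never starving a small bucket.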

@thomasht86
Collaborator Author

Don't we want to sample max_queries_per_benchmark queries per bucket? If we have 20,000 queries in a bucket, then we should sample 100 from these, but if we only have 90 in a bucket, then we should use all 90 queries. Otherwise, we might get very few queries in a bucket benchmark, which means that that specific benchmark result will be unreliable.

Ideally, we would want to get the same number of queries, say 100, in each bucket. This then allows us to make a reasonable choice for the number of benchmark repetitions. For example, 10 repetitions would then mean that we take 10 * 100 samples, which would be constant across all buckets.

I did consider that, but isn't it much easier for users to relate to the total number of queries?
In the case you're suggesting, the user can't control the total number of queries (and thus the runtime) directly, right?

And don't we want the distribution of queries to resemble their real query load?

Considering that many queries may have the same hit ratios but look completely different and have different response times, it seems reasonable to try to match the distribution for more transferable results? Though I do agree that this may cause very few queries for some buckets, but as long as users are warned and aware, isn't that OK?

@boeker
Contributor

boeker commented Nov 26, 2025

I did consider that, but isn't it much easier for users to relate to the total number of queries? In the case you're suggesting, the user can't control the total number of queries (and thus the runtime) directly, right?

Limiting the number of queries per bucket still limits the total number of queries to number_queries * number_of_buckets, so this still lets you control the run time. But the goal isn't to control the run time; the goal is to get a reasonable result in a reasonable amount of time, isn't it?

And don't we want the distribution of queries to resemble their real query load?

What would be the advantage of that? This means that we are letting buckets with few queries "starve", which in turn means that we get nonsensical results for these buckets. And that is something we really don't want. When analyzing the behavior across hit ratios, queries with different hit ratios are "independent" in some sense since they lead to different strategies in Vespa. And that's also what we are doing in the tool and in the plots: we are looking at all these buckets independently and not at the average response time and recall across all hit ratios.

Considering that many queries may have same hit ratios, but look completely different and have different response times, it seems reasonable to try to match the distribution for more transferrable results? Though I do agree that this may cause very few queries for some buckets, but as long as they are warned and aware, isn't that OK?

That's the point of sampling a subset of the queries in a bucket if the number of queries in a bucket is very high. We want to get reasonable information about that bucket without spending too much time on it if the number of queries is high. If we have a bucket with only a few queries, we can just use all queries in that bucket to get as much information as possible about the behavior at that hit ratio. Again, since we are looking at all buckets independently, we probably do not want a distribution that resembles the real load. That would make sense when taking the average of all buckets.

@thomasht86
Collaborator Author

Think I got it.
So you're basically saying that:

  1. It is more important to have enough queries per bucket than to approximate the total distribution.
  2. More stable results for all buckets are more important than more accurate results for the few buckets with many queries.

Good discussion 😄

Still need some input before making the changes:

Limiting the number of queries per bucket still limits the total number of queries to number_queries * number_of_buckets, so this still lets you control the run time. But the goal isn't to control the run time; the goal is to get a reasonable result in a reasonable amount of time, isn't it?

Yes, but I think we also need to make it usable.
Since our buckets_per_percent parameter is an int, we get at least 100 buckets.
Even setting queries_per_bucket to 1 would then give 100 queries for benchmarking, which may mean a 3+ hour runtime.
Should we consider not testing all percentages?

@boeker
Contributor

boeker commented Nov 26, 2025

  1. It is more important to have enough queries per bucket than to approximate the total distribution.

Exactly. I don't think we would benefit from approximating the total distribution at this point. You could use this to get the overall average response time/recall, but we do not compute this anywhere.

  2. More stable results for all buckets are more important than more accurate results for the few buckets with many queries.

Yes. Having a single inaccurate bucket could throw the results off (suggest a latency spike where there is none, for example). So one has to try to avoid that.

Should we consider not testing all percentages?

We could do that. One could either do this in general (leave out some buckets) or in a more targeted way. One should try this with the filterFirstExploration suggestion first, since that stage takes the most time and has the most optimization potential. (Only the results for buckets with low hit ratios matter, so we only have to benchmark these.)

Edit: Another meaningful optimization could be to cache the exact results in recall computations.

Edit 2: And run the approximateThreshold benchmark only on hit ratios up to, say, 10%.
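The targeted optimizations suggested here amount to benchmarking a stage only on the buckets whose hit ratio falls below some cutoff. A minimal sketch, with hypothetical bucket data and a 10% cutoff as in the approximateThreshold example:

```python
# Hypothetical bucket map: hit ratio -> queries assigned to that bucket.
buckets = {0.01: ["q1", "q2"], 0.05: ["q3"], 0.25: ["q4"], 0.80: ["q5"]}

# Benchmark only low-hit-ratio buckets; skip the rest of the stage.
cutoff = 0.10
to_benchmark = {hr: qs for hr, qs in buckets.items() if hr <= cutoff}

print(sorted(to_benchmark))  # [0.01, 0.05]
```

The same filter, with different cutoffs, would cover both the filterFirstExploration case and the approximateThreshold case.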
