Skip to content

Conversation

@tmi
Copy link
Collaborator

@tmi tmi commented Jan 26, 2026

NOTE: CI fails because of the agent failure, but locally all passes. I will re-run again before merge

The restore part of the checkpoints -- meaning that if you run a cascade job with checkpoint config, and there is some checkpoint data, we now load them and modify the graph so that instead of computing from scratch it utilizes those

For the review: this is a complex biggie. Best review as follows:
1/ recall the checkpoint spec interface from the previous PRs: https://github.com/ecmwf/earthkit-workflows/blob/develop/src/cascade/low/core.py#L204

2/ there is code which deals with the low level persist/restore -- consult the added tests that deal with it https://github.com/ecmwf/earthkit-workflows/pull/160/changes#diff-64909731b802bdb1b07934081dc30763f03655685eb7dc6b8ebcd7891bf09148 as well as the logic itself , thats relatively easy to grasp

3/ there is code which deals with the "which parts of the graph are not needed now that we have some checkpoints". The workings is demonstrated in this test https://github.com/ecmwf/earthkit-workflows/pull/160/changes#diff-3472d7b27ddfe2e710b3fd5ad31f579805fbe9cbfe812d77f2fa5fa27ef32e1d, and the implementation is primarily this function https://github.com/ecmwf/earthkit-workflows/pull/160/changes#diff-eb0acf634594a88c44c85cda2d95ff6784023662a8bf6ff3d8a7d221319350b5R105

4/ we add an end-to-end pytest which runs a cascade job with persist config, twice. The first run saves the checkpoint, the second reads it. That demonstrates that the new thing "works" -- see https://github.com/ecmwf/earthkit-workflows/pull/160/changes#diff-447ac6920314e599bc9cc7c3a6ee4c8a1f832bafe3575415560e7bc0a7007f3c

5/ there is a ton of small changes all around which are I fear hard to understand without context. Its mostly about "original code assumed some invariants, but now the completion of tasks via checkpoint reading is breaking them, so we need to fix here and there". No need to worry about detailed review, sadly

For the record, my original detailed todo list which mirrors the changes

        [✓] util list persisted datasets
        [✓] scheduler trim the preschedule/context:
            a/ start from sink and topologically recursive -- mark sinks 'must' unless they are checkpointed, propagate 'must'
            b/ cornercases -- outputs and persist targets, drop or recompute etc
            if something went wrong 78753825b175ad8ffc64770293747cb02a581af3 was the original commit
        [✓] controller introduce a fake host with datasets only but no workers
        [✓] controller invoke 'task_done' for each stored -- probably just DatasetPublished to controller.notify.notify
        [✓] controller handle the invariants in schedule originally set by plan/assign for these 'task_done' virtual invocations
        [✓] bridge implement transmit-restore
        [✓] executor dataserver handle the restore
        [✓] executor handle the restore->publish
        [✓] controller handle restore and purge/fetch/persist
        [✓] controller progress probably ok as is, since operates on execution context

@tmi tmi force-pushed the feat/scheduling/restore branch from 3677f9f to e1aaecb Compare January 26, 2026 13:34
@tmi tmi force-pushed the feat/scheduling/restore branch from f7cd6d9 to 6fe2369 Compare January 28, 2026 15:58
@tmi tmi force-pushed the feat/scheduling/restore branch from 6fe2369 to 4d0c4f6 Compare January 28, 2026 16:02
@tmi tmi marked this pull request as ready for review January 29, 2026 11:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants