feat: add checkpoint management to TrainingClient#29

Merged
void-main merged 10 commits into main from agent/issue-28
Mar 9, 2026
@void-main void-main commented Mar 6, 2026

Summary

  • Add save_state(), load_state(), load_state_with_optimizer(), and list_checkpoints() methods to TrainingClient
  • Add Checkpoint dataclass type in weaver/types/checkpoint.py
  • 14 unit tests + integration test

Details

Implements the checkpoint management API specified in #28:

| Method | Server Endpoint | Description |
| --- | --- | --- |
| save_state() | POST /models/:id/checkpoints | Save current weights as checkpoint (synchronous) |
| load_state(checkpoint) | POST /models/:id/load | Restore weights only (async operation) |
| load_state_with_optimizer(checkpoint) | POST /models/:id/load | Restore weights + optimizer (async operation) |
| list_checkpoints() | GET /models/:id/checkpoints | List all checkpoints for model |

API notes (discovered during integration testing)

  • save_state() is synchronous (returns Checkpoint directly, not an OperationHandle)
  • load_state() accepts a Checkpoint object or raw checkpoint ID string
  • Server uses type field (not checkpoint_type) and checkpoint_id (not path) for load
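A minimal sketch of how the `type` note above could look in `Checkpoint.from_payload`. The field set is an assumption: only the `type` key and the `checkpoint_id` name are mentioned in this PR, and the real dataclass in weaver/types/checkpoint.py may carry more fields or different ones.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class Checkpoint:
    """Illustrative stand-in; the real type lives in weaver/types/checkpoint.py."""

    checkpoint_id: str  # assumed field name
    type: str  # mirrors the server's "type" key

    @classmethod
    def from_payload(cls, payload: Mapping[str, Any]) -> "Checkpoint":
        # Read "type" (the key the server uses), not "checkpoint_type".
        # Keys this sketch does not know about are simply ignored.
        return cls(
            checkpoint_id=str(payload["checkpoint_id"]),
            type=str(payload["type"]),
        )
```

Ignoring unknown keys is a choice made for this sketch so that extra server fields do not break parsing; it is not confirmed behavior of the SDK.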

Integration test results

| Step | Result |
| --- | --- |
| Training (6 steps, loss 4.25→0.71) | |
| save_state(name="after-3-steps") | 201 Created |
| list_checkpoints() | ✅ Found saved checkpoint |
| load_state(checkpoint) | ⚠️ Server accepts request (202) but load_weights operation stays pending |

The load_weights operation not completing is a server-side/trainer-side gap, consistent with the issue noting dependency on weaver-server#74. The SDK correctly sends {checkpoint_id, include_optimizer} to POST /models/:id/load and polls the operation.
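For reference, the send-then-poll behavior described above can be sketched roughly as follows. This is not the SDK code: `build_load_payload`, `poll_operation`, and the `fetch_status` callback are hypothetical names, and every status string other than "pending" is an assumption. The timeout illustrates why a load that stays pending shows up as a blocked test rather than a hang.

```python
import time
from typing import Any, Callable, Dict


def build_load_payload(checkpoint_id: str, include_optimizer: bool) -> Dict[str, Any]:
    # Body sent to POST /models/:id/load at this point in the PR.
    return {"checkpoint_id": checkpoint_id, "include_optimizer": include_optimizer}


def poll_operation(
    fetch_status: Callable[[], str],
    timeout_s: float = 5.0,
    interval_s: float = 0.1,
) -> str:
    """Call fetch_status until it stops returning "pending" or the timeout hits."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status != "pending":
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("operation still pending after timeout")
        time.sleep(interval_s)
```

With a status source that never leaves "pending", `poll_operation` raises `TimeoutError`, which matches the ⚠️ row in the results table.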

Test plan

  • make ci passes (lint + mypy + 56 unit tests)
  • License headers on all new files
  • Integration test: save_state + list_checkpoints verified against live server
  • Integration test: load_state sends correct payload (blocked by server-side)

Fixes #28

Add save_state, load_state, load_state_with_optimizer, and
list_checkpoints methods to TrainingClient, enabling checkpoint-based
resume training through the SDK.

Fixes #28

- save_state: synchronous POST (not operation), sends {type, path}
- load_state: sends checkpoint_id (not path), returns OperationHandle
- Checkpoint.from_payload: handle server's "type" field
- Add integration test (save/list verified; load blocked by server)

The server should generate the checkpoint path (including model_id),
not the SDK. Changed save_state() to send {"name": ...} instead of
{"path": ...} so the server can construct proper namespaced paths
like weaver://{model_id}/checkpoints/{name}.
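A rough sketch of the split described above: the SDK sends only a name, and the namespaced path follows the weaver://{model_id}/checkpoints/{name} layout. The helper names are hypothetical, and `format_checkpoint_path` only mimics what the server is expected to do; it is included to show the layout, not to suggest the SDK builds paths.

```python
from typing import Dict, Tuple
from urllib.parse import urlsplit

_PREFIX = "/checkpoints/"


def build_save_payload(name: str) -> Dict[str, str]:
    # The SDK sends only the name; the server owns path construction.
    return {"name": name}


def format_checkpoint_path(model_id: str, name: str) -> str:
    # Server-side layout, reproduced here purely for illustration.
    return f"weaver://{model_id}/checkpoints/{name}"


def parse_checkpoint_path(path: str) -> Tuple[str, str]:
    """Split a weaver:// checkpoint URI into (model_id, name)."""
    parts = urlsplit(path)
    if parts.scheme != "weaver" or not parts.netloc or not parts.path.startswith(_PREFIX):
        raise ValueError(f"not a weaver checkpoint path: {path!r}")
    name = parts.path[len(_PREFIX):]
    if not name:
        raise ValueError(f"checkpoint name missing in: {path!r}")
    return parts.netloc, name
```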

Also added checkpoint test scripts for LoRA, FullFT, and baseline.

Refs: china-qijizhifeng/weaver-server#106

save_state takes a name as input and returns a Checkpoint with a server-generated path.
load_state/load_state_with_optimizer now accept a path string (weaver://
URI) or Checkpoint object (extracts .path), and send {"path": ...} to the
server instead of {"checkpoint_id": ...}.
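The argument handling described in this commit might look roughly like the sketch below. `Checkpoint` here is a stand-in with only a `path` field, and `build_load_payload` is a hypothetical helper name. How the optimizer variant is distinguished in the request body is not stated in this commit message, so the sketch leaves it out.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass(frozen=True)
class Checkpoint:
    """Stand-in with only the field this sketch needs."""

    path: str  # server-generated weaver:// URI


def build_load_payload(checkpoint: Union[Checkpoint, str]) -> Dict[str, str]:
    # Accept either a Checkpoint (use its .path) or a raw weaver:// string.
    path = checkpoint.path if isinstance(checkpoint, Checkpoint) else checkpoint
    return {"path": path}
```

Either call style yields the same body, so a path copied from `list_checkpoints()` output and a `Checkpoint` returned by `save_state()` are interchangeable.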

Refs: china-qijizhifeng/weaver-server#106

Replace "training" with "weight" and "training_with_optimizer" with
"weight_and_optimizer" for clearer checkpoint type semantics.

save_state now dispatches an async operation via enqueue_operation
(same pattern as load_state) instead of a synchronous POST. The
server will dispatch a save task to the trainer, which writes weight
files to disk. Adds wait parameter with overloads: returns Checkpoint
when wait=True (default), OperationHandle when wait=False.
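The wait overloads can be expressed with `typing.overload` and `Literal`, as in this sketch. `TrainingClientSketch`, the stub `OperationHandle`, and its `result()` method are hypothetical stand-ins; the real client enqueues a server operation instead of fabricating a result locally.

```python
from dataclasses import dataclass
from typing import Literal, Union, overload


@dataclass(frozen=True)
class Checkpoint:
    name: str


@dataclass
class OperationHandle:
    """Stub handle; result() is an assumed method name."""

    name: str

    def result(self) -> Checkpoint:
        # The real handle would poll the server until the save completes.
        return Checkpoint(name=self.name)


class TrainingClientSketch:
    @overload
    def save_state(self, name: str, wait: Literal[True] = ...) -> Checkpoint: ...

    @overload
    def save_state(self, name: str, wait: Literal[False]) -> OperationHandle: ...

    def save_state(self, name: str, wait: bool = True) -> Union[Checkpoint, OperationHandle]:
        handle = OperationHandle(name=name)  # stands in for enqueue_operation
        return handle.result() if wait else handle
```

With these overloads a type checker infers `Checkpoint` for `client.save_state("a")` and `OperationHandle` for `client.save_state("a", wait=False)`, so callers do not need casts.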

Refs: china-qijizhifeng/weaver-server#109

Server returns {"checkpoint": {...}, "operation": {...}} instead of
a flat operation response. Extract operation for polling and checkpoint
data for the result.

…onse

Server should return a flat Operation (like all other async endpoints)
instead of nested {"checkpoint": ..., "operation": ...}. Revert to the
standard enqueue_operation pattern. Checkpoint data will come from the
operation's response field after completion.
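A small sketch of reading checkpoint data from a flat operation once it has completed. Only the `response` field name comes from the commit message above; the helper name, the other keys in the example, and the status value are assumptions.

```python
from typing import Any, Dict, Mapping


def checkpoint_data_from_operation(operation: Mapping[str, Any]) -> Dict[str, Any]:
    """Return the checkpoint payload carried in a completed operation."""
    response = operation.get("response")
    if not isinstance(response, Mapping):
        # Still pending, or the operation failed without a response.
        raise ValueError("operation has no response to read a checkpoint from")
    return dict(response)
```

For example, passing `{"id": "op-1", "status": "succeeded", "response": {"name": "after-3-steps"}}` (a hypothetical shape) returns `{"name": "after-3-steps"}`, which can then be handed to `Checkpoint.from_payload`.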

Refs: china-qijizhifeng/weaver-server#111
@CriusT CriusT left a comment

LGTM!

@void-main void-main merged commit 6d11774 into main Mar 9, 2026
5 checks passed
@void-main void-main deleted the agent/issue-28 branch March 9, 2026 03:17

Development

Successfully merging this pull request may close these issues.

[FR] TrainingClient is missing checkpoint management methods: save_state / load_state / load_state_with_optimizer

2 participants