What's Changed
- chore(docker): add openclaw, ironclaw, zeroclaw, and nanobot-ai to runtime image by @garrett4wade in #1051
- feat(agent-service): add Agent Service microservice infrastructure by @CormickKneey in #1048
- feat(gateway): Add rollout gateway infrastructure with controller, router, and data proxy by @nuzant in #1043
- feat: add estimators for KL divergence by @NicolasArias in #1060
- test(infra): speed up inference service integration tests by @nuzant in #1068
- fix(infra): simplify RTensor serialization in data proxy by @garrett4wade in #1067
- fix(rpc): resolve connection reset during RTensor fetch with large payloads by @pratyush618 in #1075
- docs: add gitcgr code graph badge by @vitali87 in #1073
- fix(openai): handle streaming responses in chat/completions endpoint by @Zijun9 in #1053
- fix: add PIL image and processor serialization for VLM RPC by @Adiactive in #1070
- refactor(api): migrate allocation_mode to per-engine backend fields by @garrett4wade in #1044
- chore(agents): add Codex harness and align AI workflows by @rchardx in #1082
- feat(platform): add NUMA CPU affinity binding for training engines by @HT-Yuan in #1083
- feat(commands): add fork workflow support to create-pr skill by @guozhihao-224 in #1092
- fix(rpc): batch HTTP RTensor fetches for large multimodal batches by @Wangxiaoxiaoa in #1077
- fix(fsdp): stabilize Qwen-VL rope-index argument binding and dtype by @Adiactive in #1094
- refactor(vllm): use pause_generation from vllm instead of abort_all_req in areal_vllm_server by @HwVanICI in #1091
- feat: add BailingMoeV2.5 support with Lightning Attention + MLA + MoE + CP by @dingzhiqiang in #1079
- feat: support model training in an IPv6-only environment by @TaoZex in #1072
- fix: fix pad_packed_tensor_dict by @HKAB in #1104
- feat: megatron bridge adaptation by @gursimar in #1056
- fix(engine): remove duplicate trust_remote_code kwarg in MegatronBridge init by @rchardx in #1107
- fix(dataloader): prevent data drop and padding during validation for accurate metrics by @Anguo-star in #1100
- fix(archon): add missing POST /data/batch endpoint to data proxy by @rchardx in #1105
- refactor(engine): abstract CUDA calls via current_platform in PerLayerOptimWrapper by @guozhihao-224 in #1108
- perf(fsdp): pipeline distributed weight sync with a single pending bucket by @HT-Yuan in #1074
- fix(engine): restore SGLang VLM training by @garrett4wade in #1098
- feat(archon): add FP8 blockwise training support by @rchardx in #1087
- chore(ci): update GCP CI image by @garrett4wade in #1115
- feat(inference-service): complete vLLM backend support in inference service by @garrett4wade in #1112
- fix(archon): harden FP8 blockwise training for TP and MoE scenarios by @rchardx in #1118
- feat(inference_service): add VLM image input support to OpenAI-compatible API by @garrett4wade in #1119
- feat(utils): add Trackio experiment tracking backend by @guozhihao-224 in #1113
- refactor(infra): decompose rpc_server into shared guard + blueprints by @garrett4wade in #1126
- refactor(agents): redesign review-pr taxonomy and sync flow by @rchardx in #1124
- feat(infra): add client-side fetch buffer for RTensor by @guozhihao-224 in #1122
- chore: pin GCP CI image to latest by @nuzant in #1130
- docs: add Trackio configuration to CLI reference by @rchardx in #1131
- feat(service): support online inference service by @nuzant in #1121
- fix(engine): repair tree training broken by bad indentation in PR #1056 by @gursimar in #1135
- feat(service): add vllm backend support for inference service demo by @nuzant in #1136
- fix(api): add mode validation for WandBConfig and SwanlabConfig by @guozhihao-224 in #1134
- feat: enable LoRA RL-training in Megatron via megatron-bridge by @gursimar in #1123
- fix: harden padded distributed eval across training engines by @rchardx in #1109
- feat(ci): separate vllm and sglang pyproject.toml by @garrett4wade in #1141
- fix(vllm_ext): clear multimodal caches after generation pause by @Adiactive in #1144
- fix(ci): sync uv.vllm.lock with the current pyproject.vllm.toml by @garrett4wade in #1146
- fix(vllm_ext): fix XCCL LoRA weight updates when PP>1 by buffering and merging PP shards by @gursimar in #1145
- chore: fix pre-commit by @garrett4wade in #1148
- fix(archon): fix bugs in Archon LoRA backend (#1040) by @JiwaniZakir in #1139
- feat(infra): add distributed data loading service by @garrett4wade in #1120
- refactor(infra): standardize list-first trajectory batch dispatch by @garrett4wade in #1150
- feat(infra): allow colocation with offloading and disk weight updates by @garrett4wade in #1157
- refactor: replace manual JSON parsing with Pydantic models by @koladefaj in #1154
- fix(engine): FSDP compute_logp fails for Qwen3.5 with dict attention_mask by @pratyush618 in #1153
- chore: update readme and enforce license by @garrett4wade in #1170
- chore: ensure SPDX license header in python source files by @garrett4wade in #1171
- fix: add missing pre-commit check file by @garrett4wade in #1173
- chore: add project governance for PyTorch ecosystem by @garrett4wade in #1174
- feat(infra): add microservice-based training service (controller v2) by @garrett4wade in #1169
- chore: renew qrcode by @garrett4wade in #1184
- feat(archon): support multi-node inference in gateway controller by @guozhihao-224 in #1178
- feat(agent-service): add Controller, Guard, and Claude Agent SDK example by @CormickKneey in #1177
- fix: prevent training from getting stuck in _update_weights_from_disk by @asif07hossain in #1181
- refactor: mount data blueprint via WSGI and adopt Pydantic in engine blueprint by @koladefaj in #1179
- fix(engine): use meta device for non-rank-0 in FSDP memory_efficient_load by @yulangz in #1182
- ci: parallelize unit and integration tests across 4 GPU instances by @nuzant in #1185
- chore: bump to v1.0.3 by @garrett4wade in #1191
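For context on the KL-divergence estimators added in #1060: the standard per-sample estimators (often called k1, k2, and k3, after Schulman's notes on approximating KL) can be sketched as below. The function name and structure here are illustrative only and do not reflect the project's actual API.

```python
import math

def kl_estimators(logp: float, logq: float):
    """Per-sample estimators of KL(p||q) from the log-probs of one draw x ~ p.

    k1 is unbiased but high-variance; k2 is biased but low-variance;
    k3 is unbiased with lower variance than k1 and is always non-negative.
    """
    logr = logq - logp           # log ratio: log(q(x) / p(x))
    r = math.exp(logr)
    k1 = -logr                   # -log r
    k2 = 0.5 * logr ** 2         # (log r)^2 / 2
    k3 = r - 1.0 - logr          # (r - 1) - log r
    return k1, k2, k3
```

When the two distributions agree on a sample (logp == logq), all three estimators return 0; in RL fine-tuning loops, k3 is a common default because it never goes negative.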
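For context on the NUMA CPU affinity binding added in #1083: on Linux, pinning a training worker to a CPU set boils down to the sched_setaffinity syscall. The sketch below is a minimal, generic illustration (Linux-only); the actual node-to-core mapping logic in the PR is not reproduced here, and the function name is hypothetical.

```python
import os

def bind_worker_to_cpus(cpus: set[int]) -> set[int]:
    """Pin the calling process to the given CPU set (Linux-only).

    A NUMA-aware launcher would choose `cpus` as the cores of the NUMA
    node nearest the worker's GPU; discovering that mapping (e.g. via
    hwloc or /sys/devices/system/node) is omitted here.
    """
    os.sched_setaffinity(0, cpus)    # pid 0 = the current process
    return os.sched_getaffinity(0)   # return the effective mask to confirm
```

Binding each training process near its GPU's NUMA node avoids cross-node memory traffic during host-side data loading and weight staging, which is the usual motivation for this kind of feature.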
New Contributors
- @pratyush618 made their first contribution in #1075
- @vitali87 made their first contribution in #1073
- @Adiactive made their first contribution in #1070
- @guozhihao-224 made their first contribution in #1092
- @TaoZex made their first contribution in #1072
- @HKAB made their first contribution in #1104
- @Anguo-star made their first contribution in #1100
- @JiwaniZakir made their first contribution in #1139
- @koladefaj made their first contribution in #1154
- @asif07hossain made their first contribution in #1181
Full Changelog: v1.0.2...v1.0.3