v1.0.3

@garrett4wade released this 16 Apr 12:00
· 13 commits to main since this release
376ecbb

What's Changed

  • chore(docker): add openclaw, ironclaw, zeroclaw, and nanobot-ai to runtime image by @garrett4wade in #1051
  • feat(agent-service): add Agent Service microservice infrastructure by @CormickKneey in #1048
  • feat(gateway): Add rollout gateway infrastructure with controller, router, and data proxy by @nuzant in #1043
  • feat: estimators for kl divergence by @NicolasArias in #1060
  • test(infra): speed up inference service integration tests by @nuzant in #1068
  • fix(infra): simplify RTensor serialization in data proxy by @garrett4wade in #1067
  • fix(rpc): resolve connection reset during RTensor fetch with large payloads by @pratyush618 in #1075
  • docs: add gitcgr code graph badge by @vitali87 in #1073
  • fix(openai): handle streaming responses in chat/completions endpoint by @Zijun9 in #1053
  • fix: add PIL image and processor serialization for VLM RPC by @Adiactive in #1070
  • refactor(api): migrate allocation_mode to per-engine backend fields by @garrett4wade in #1044
  • chore(agents): add Codex harness and align AI workflows by @rchardx in #1082
  • feat(platform): add NUMA CPU affinity binding for training engines by @HT-Yuan in #1083
  • feat(commands): add fork workflow support to create-pr skill by @guozhihao-224 in #1092
  • fix(rpc): batch HTTP RTensor fetches for large multimodal batches by @Wangxiaoxiaoa in #1077
  • fix(fsdp): stabilize Qwen-VL rope-index argument binding and dtype by @Adiactive in #1094
  • Refactor(vllm): use pause_generation from vllm instead of abort_all_req in areal_vllm_server by @HwVanICI in #1091
  • feat: add BailingMoeV2.5 support with Lightning Attention + MLA + MoE + CP by @dingzhiqiang in #1079
  • feat: support model training in IPv6-only environment by @TaoZex in #1072
  • fix: fix pad_packed_tensor_dict by @HKAB in #1104
  • feat: megatron bridge adaptation by @gursimar in #1056
  • fix(engine): remove duplicate trust_remote_code kwarg in MegatronBridge init by @rchardx in #1107
  • fix(dataloader): prevent data drop and padding during validation for accurate metrics by @Anguo-star in #1100
  • fix(archon): add missing POST /data/batch endpoint to data proxy by @rchardx in #1105
  • refactor(engine): abstract CUDA calls via current_platform in PerLayerOptimWrapper by @guozhihao-224 in #1108
  • perf(fsdp): pipeline distributed weight sync with a single pending bucket by @HT-Yuan in #1074
  • fix(engine): restore SGLang VLM training by @garrett4wade in #1098
  • feat(archon): add FP8 blockwise training support by @rchardx in #1087
  • chore(ci): update GCP CI image by @garrett4wade in #1115
  • feat(inference-service): complete vLLM backend support in inference service by @garrett4wade in #1112
  • fix(archon): harden FP8 blockwise training for TP and MoE scenarios by @rchardx in #1118
  • feat(inference_service): add VLM image input support to OpenAI-compatible API by @garrett4wade in #1119
  • feat(utils): add Trackio experiment tracking backend by @guozhihao-224 in #1113
  • refactor(infra): decompose rpc_server into shared guard + blueprints by @garrett4wade in #1126
  • refactor(agents): redesign review-pr taxonomy and sync flow by @rchardx in #1124
  • feat(infra): add client-side fetch buffer for RTensor by @guozhihao-224 in #1122
  • chore: fix gcp image to latest by @nuzant in #1130
  • docs: add Trackio configuration to CLI reference by @rchardx in #1131
  • feat(service): support online inference service by @nuzant in #1121
  • fix(engine): for broken tree training due to bad indent in PR #1056 by @gursimar in #1135
  • feat(service): add vllm backend support for inference service demo by @nuzant in #1136
  • fix(api): add mode validation for WandBConfig and SwanlabConfig by @guozhihao-224 in #1134
  • feat: enable LoRA RL-training in Megatron via megatron-bridge by @gursimar in #1123
  • fix: harden padded distributed eval across training engines by @rchardx in #1109
  • feat(ci): separate vllm and sglang pyproject.toml by @garrett4wade in #1141
  • fix(vllm_ext): clear multimodal caches after generation pause by @Adiactive in #1144
  • fix(ci): sync uv.vllm.lock with the current pyproject.vllm.toml by @garrett4wade in #1146
  • fix(vllm_ext): XCCL lora weights update when PP>1 by buffering and merging PP shards by @gursimar in #1145
  • chore: fix pre-commit by @garrett4wade in #1148
  • Fix #1040: [Feature] Fixed bugs in Archon LoRA Backend by @JiwaniZakir in #1139
  • feat(infra): add distributed data loading service by @garrett4wade in #1120
  • refactor(infra): standardize list-first trajectory batch dispatch by @garrett4wade in #1150
  • feat(infra): allow colocation with offloading and disk weight updates by @garrett4wade in #1157
  • refactor: replace manual JSON parsing with Pydantic models by @koladefaj in #1154
  • fix(engine): FSDP compute_logp fails for Qwen3.5 with dict attention_mask by @pratyush618 in #1153
  • chore: update readme and enforce license by @garrett4wade in #1170
  • chore: ensure SPDX license header in python source files by @garrett4wade in #1171
  • fix: add missing pre-commit check file by @garrett4wade in #1173
  • chore: add project governance for PyTorch ecosystem by @garrett4wade in #1174
  • feat(infra): add microservice-based training service (controller v2) by @garrett4wade in #1169
  • chore: renew qrcode by @garrett4wade in #1184
  • feat(archon): support multi-node inference in gateway controller by @guozhihao-224 in #1178
  • feat(agent-service): add Controller, Guard, and Claude Agent SDK example by @CormickKneey in #1177
  • fix _update_weights_from_disk function to prevent training to be stuck by @asif07hossain in #1181
  • refactor: mount data blueprint via WSGI and adopt Pydantic in engine blueprint by @koladefaj in #1179
  • fix(engine): use meta device for non-rank-0 in FSDP memory_efficient_load by @yulangz in #1182
  • ci: parallelize unit and integration tests across 4 GPU instances by @nuzant in #1185
  • chore: dump v1.0.3 by @garrett4wade in #1191

Full Changelog: v1.0.2...v1.0.3