feat: support infer slice tensor at dim > 0 and optimize memory #106
chaokunyang merged 1 commit into inclusionAI:main
Conversation
Code Review
This pull request refactors the NCCL and HCCL weight transfer logic to support non-contiguous tensors and introduces a one-by-one communication mode for debugging and memory efficiency. It also updates the SGLang converter to handle shared experts in MoE models and adds configuration for HCCL buffer sizes. Review feedback suggests removing redundant del statements for local list references to improve code clarity.
    non_contiguous_tensor_pairs.clear()
    del non_contiguous_tensor_pairs
The del non_contiguous_tensor_pairs statement is redundant. The list is cleared on the previous line, and the local reference will be garbage collected when the function returns. Removing this line will make the code cleaner without affecting functionality.
Suggested change:
    - non_contiguous_tensor_pairs.clear()
    - del non_contiguous_tensor_pairs
    + non_contiguous_tensor_pairs.clear()
    non_contiguous_tensor_pairs.clear()
    del non_contiguous_tensor_pairs
The del non_contiguous_tensor_pairs statement is redundant here. The list is cleared on the previous line, and the local reference will be garbage collected automatically. Removing this line would improve code clarity.
Suggested change:
    - non_contiguous_tensor_pairs.clear()
    - del non_contiguous_tensor_pairs
    + non_contiguous_tensor_pairs.clear()
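For context on the two review comments above, a minimal sketch of the pattern being discussed; the surrounding function and the tensors argument are made up for illustration, only non_contiguous_tensor_pairs comes from the diff:

    def _transfer_weights(tensors):
        non_contiguous_tensor_pairs = []
        for t in tensors:
            if not t.is_contiguous():
                # Keep (view, contiguous copy) pairs only for the duration of the send.
                non_contiguous_tensor_pairs.append((t, t.contiguous()))
        # ... send the contiguous copies over p2p ...

        # clear() already drops the references to the staging copies, so they can
        # be freed immediately; a trailing `del non_contiguous_tensor_pairs` only
        # removes the local name, which goes out of scope at return anyway.
        non_contiguous_tensor_pairs.clear()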
What does this PR do?
Support slicing inference tensors at dim > 0, which can make the p2p tensors non-contiguous in NCCL or HCCL, and optimize memory usage:
1. Support inference-side tensor slices at dim > 0, such as attention.dense.weight when the inference TP is smaller than the training TP, which can make the p2p tensor non-contiguous in NCCL or HCCL. Another case is vLLM-Ascend, where the experts have been transposed but the shared_experts have not (a sketch of the non-contiguous slice follows this list).
2. Optimize memory by supporting one-by-one p2p send/receive. This is useful for debugging, or when memory is tight because inference runs with sleep mode disabled, and also when the hardware differs between the two sides (such as 910B2 vs. 910B1), where batch_send_recv would fail. Compared with batched send/receive, one-by-one send/receive increases time consumption by less than 10% but reduces peak memory by 25% for Qwen3-30B (a sketch of both send modes follows this list).
3. Optimize memory by using a local-part process group and destroying the weight-exchange process group in NPU HCCL. HCCL_BUFFERSIZE actually occupies twice its configured memory, and it must also be kept consistent between inference and training. Using a local-part process group saves this portion of memory (a sketch of the group lifecycle follows this list).
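A rough sketch of the non-contiguity issue from point 1; the shapes, TP size, and peer rank are illustrative, not taken from the PR:

    import torch
    import torch.distributed as dist

    full_weight = torch.randn(4096, 8192)   # e.g. attention.dense.weight on the train side
    infer_tp = 4                             # inference TP smaller than training TP
    cols_per_rank = full_weight.shape[1] // infer_tp

    # Slicing along dim 1 keeps the parent tensor's strides, so the view is
    # non-contiguous and cannot be handed to NCCL/HCCL p2p ops directly.
    shard = full_weight[:, :cols_per_rank]
    assert not shard.is_contiguous()

    send_buf = shard.contiguous()            # materialize a dense staging copy first
    # dist.send(send_buf, dst=peer_rank)     # the p2p transfer now sees a contiguous buffer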
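To illustrate the trade-off in point 2, a hedged sketch of the two modes using plain torch.distributed primitives; the helper names and arguments are placeholders, not the PR's API:

    import torch.distributed as dist

    def send_batched(tensors, dst, group=None):
        # Batched mode: every staging buffer is alive at the same time, so peak
        # memory is higher, but all transfers are posted in one shot.
        ops = [dist.P2POp(dist.isend, t.contiguous(), dst, group) for t in tensors]
        for req in dist.batch_isend_irecv(ops):
            req.wait()

    def send_one_by_one(tensors, dst, group=None):
        # One-by-one mode: only a single staging buffer exists at any moment,
        # trading a small latency increase for a lower peak-memory footprint.
        for t in tensors:
            buf = t.contiguous()
            dist.send(buf, dst, group=group)
            del buf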
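And for point 3, a sketch of the local-part process group lifecycle; the group membership and the send loop are assumptions, not the PR's implementation:

    import torch.distributed as dist

    def exchange_weights(weight_pairs, exchange_ranks):
        # Build a group covering only the ranks involved in the weight exchange,
        # so the HCCL communicator buffer is allocated just for those ranks.
        group = dist.new_group(ranks=exchange_ranks)
        try:
            for tensor, peer in weight_pairs:
                dist.send(tensor.contiguous(), peer, group=group)
        finally:
            # Destroying the group releases its communicator and buffer memory.
            dist.destroy_process_group(group)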
Related issues
No
Does this PR introduce any user-facing change?
No