
Commit 2859b11

wconstab authored and pytorchmergebot committed
[pytorch/ncclx] Remove Alltoallv specialization for PTD all_to_all (pytorch#145045)
Summary: PTD all_to_all uses a list of tensors, while ncclAllToAllv (provided by NCCLX and RCCL) assumes a single contiguous buffer. These are fundamentally mismatched: the list of tensors might not be contiguous, or even ordered (buffer addresses might not be in increasing order). This patch removes the ncclAllToAllv specialization for PTD all_to_all and instead lets it call ncclSend/ncclRecv directly.

Co-authored-by: @pavanbalaji

Pull Request resolved: pytorch#145045
Approved by: https://github.com/pavanbalaji, https://github.com/d4l3k, https://github.com/fduwjj, https://github.com/ezyang
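For context, after this patch the collective always takes the grouped pairwise send/recv path that previously served as the non-ncclAllToAllv fallback. A minimal sketch of that path, reconstructed from the hunk context in the diff below and the summary above (only the first lines of the loop are visible in this diff, so the exact shape of the ncclSend/ncclRecv calls is an assumption):

    // Sketch of the grouped pairwise exchange PTD all_to_all now always uses.
    // Reconstructed from the diff context below; not a verbatim copy of
    // torch/csrc/cuda/nccl.cpp.
    NCCL_CHECK(ncclGroupStart());
    for (const int r : c10::irange(static_cast<int>(outputTensors.size()))) {
      at::Tensor& input = inputTensors[r];
      at::Tensor& output = outputTensors[r];
      if (input.numel() != 0) {
        // Each peer's slice is addressed through its own data_ptr(), so no
        // single-buffer contiguity or address ordering is assumed.
        NCCL_CHECK(ncclSend(
            input.data_ptr(), input.numel(), to_nccl_data_type(input),
            r, comm, stream.stream()));
      }
      if (output.numel() != 0) {
        NCCL_CHECK(ncclRecv(
            output.data_ptr(), output.numel(), to_nccl_data_type(output),
            r, comm, stream.stream()));
      }
    }
    NCCL_CHECK(ncclGroupEnd());

Because every slice keeps its own base pointer, this path sidesteps exactly the contiguity and ordering assumptions that the removed ncclAllToAllv specialization depended on.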
1 parent 07669ed · commit 2859b11

File tree: 1 file changed, +0 −43 lines


torch/csrc/cuda/nccl.cpp

Lines changed: 0 additions & 43 deletions
@@ -957,48 +957,6 @@ void all2all(
   using namespace torch::cuda::nccl::detail;
   auto comm = to_nccl_comm(_comm);
 
-#ifdef NCCL_ALLTOALLV_SUPPORTED
-  // NCCL_ALLTOALLV_SUPPORTED is used so NCCL can differentiate send/recv
-  // operations issued as a part of the collective (e.g. alltoallv) vs those
-  // inside traditional p2p operations.
-  TORCH_INTERNAL_ASSERT(
-      outputTensors.size() == inputTensors.size(),
-      "number of input tensors is not equal to number of output tensors");
-  std::vector<size_t> sendCounts(inputTensors.size());
-  std::vector<size_t> sendDisps(inputTensors.size());
-  std::vector<size_t> recvCounts(outputTensors.size());
-  std::vector<size_t> recvDisps(outputTensors.size());
-  uintptr_t sendBase = reinterpret_cast<uintptr_t>(inputTensors[0].data_ptr());
-  uintptr_t recvBase = reinterpret_cast<uintptr_t>(outputTensors[0].data_ptr());
-  size_t dtypeSize = inputTensors.front().element_size();
-
-  for (const int r : c10::irange(outputTensors.size())) {
-    sendCounts[r] = inputTensors[r].numel();
-    auto sendOffset =
-        reinterpret_cast<uintptr_t>(inputTensors[r].data_ptr()) - sendBase;
-    TORCH_INTERNAL_ASSERT(
-        sendOffset % dtypeSize == 0,
-        "sendOffset is not divisible by dtypeSize");
-    sendDisps[r] = sendOffset / dtypeSize;
-    recvCounts[r] = outputTensors[r].numel();
-    auto recvOffset =
-        reinterpret_cast<uintptr_t>(outputTensors[r].data_ptr()) - recvBase;
-    TORCH_INTERNAL_ASSERT(
-        recvOffset % dtypeSize == 0,
-        "recvOffset is not divisible by dtypeSize");
-    recvDisps[r] = recvOffset / dtypeSize;
-  }
-  NCCL_CHECK(ncclAllToAllv(
-      inputTensors[0].data_ptr(),
-      sendCounts.data(),
-      sendDisps.data(),
-      outputTensors[0].data_ptr(),
-      recvCounts.data(),
-      recvDisps.data(),
-      to_nccl_data_type(inputTensors.front()),
-      comm,
-      stream.stream()));
-#else
   NCCL_CHECK(ncclGroupStart());
   for (const int r : c10::irange(static_cast<int>(outputTensors.size()))) {
     at::Tensor& input = inputTensors[r];
@@ -1028,7 +986,6 @@ void all2all(
 #else
   NCCL_CHECK_TIMEOUT(ncclGroupEnd(), _comm);
 #endif
-#endif
 #else
   TORCH_CHECK(false, "all2all is only supported for NCCL lib version >= 2.7.0");
 #endif
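As an aside, the removed code's displacement computation is where the ordering assumption bites: sendOffset is an unsigned difference from inputTensors[0]'s base address, so a buffer that happens to sit below that base wraps around instead of going negative. A tiny self-contained illustration with made-up (hypothetical) pointer values:

    #include <cstdint>
    #include <cstdio>

    int main() {
      // Hypothetical addresses; only their relative order matters.
      uintptr_t sendBase = 0x7f0000002000; // stands in for inputTensors[0].data_ptr()
      uintptr_t addr1    = 0x7f0000001000; // stands in for inputTensors[1].data_ptr()
      // The removed code performed exactly this unsigned subtraction. When
      // addr1 < sendBase the result wraps to a huge value rather than going
      // negative, so the computed element displacement is garbage.
      uintptr_t sendOffset = addr1 - sendBase;
      std::printf("sendOffset = %llu\n",
                  static_cast<unsigned long long>(sendOffset));
      return 0;
    }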
