- [2025/09] Released v0.5:
  - Adds AMD support (hipAdaptor and rcclAdaptor).
  - Introduces flagcxNetAdaptor to unify network backends, currently supporting SOCKET, IBRC, UCX, and IBUC (experimental).
  - Enables zero-copy device-buffer RDMA (user-buffer RDMA) to boost small-message performance.
  - Supports automatic tuning in homogeneous scenarios via flagcxTuner.
  - Integrates automated PyTorch API tests into CI/CD.
 
- [2025/08] Released v0.4:
  - Supports heterogeneous training of ERNIE4.5 on NVIDIA and Iluvatar GPUs with Paddle + FlagCX.
  - Enables more robust and flexible deployments with full support for heterogeneous communication across arbitrary NIC configurations (bug fixes).
  - Introduces an early experimental net plugin interface supporting both IBRC and SOCKET, along with the ability to register device buffers via DMA-BUF.
  - Adds an InterOp-level DSL that allows users to design customized C2C algorithms.
  - Provides usage documentation under docs/.
 
- [2025/07] Released v0.3:
  - Integrates three additional native communication libraries: HCCL, MUSACCL, and MPI.
  - Enhances heterogeneous collective communication operations with pipeline optimizations.
  - Introduces a device-side function mechanism to enable device-buffer RDMA, complementing the original host-side function mechanism.
  - Delivers a full-stack open-source solution, FlagScale + FlagCX, for efficient heterogeneous prefilling-decoding disaggregation.
 
- [2025/05] Released v0.2:
  - Integrates three additional native communication libraries: MCCL, XCCL, and DUCCL.
  - Improves 11 heterogeneous collective communication operations with automatic topology detection, fully supporting both single-NIC and multi-NIC environments.
 
- [2025/04] Released v0.1:
  - Integrates five native communication libraries: NCCL, IXCCL, CNCL, BOOTSTRAP, and GLOO.
  - Supports 11 heterogeneous collective communication operations using the originally proposed C2C (Cluster-to-Cluster) algorithm.
  - Provides a full-stack open-source solution, FlagScale + FlagCX, for efficient heterogeneous training.
  - Natively integrated into PaddlePaddle v3.0.0, with support for both dynamic and static graphs.
 
FlagCX is a scalable and adaptive cross-chip communication library developed with the backing of the Beijing Academy of Artificial Intelligence (BAAI).
FlagCX is also a part of FlagAI-Open, an open-source initiative by BAAI that aims to foster an open-source ecosystem for AI technologies. It serves as a platform where developers, researchers, and AI enthusiasts can collaborate on various AI projects, contribute to the development of cutting-edge AI solutions, and share their work with the global community.
FlagCX leverages native collective communication libraries to provide full support for single-chip communication on different platforms. In addition to its native x-CCL support, FlagCX provides an original device-buffer RDMA design that offers advanced support for high-performance cross-chip send/recv operations; this design can also be combined with native x-CCL backends to enable optimized cross-chip collective communications. The currently supported communication backends and their capabilities are listed below:
| Backend | NCCL | IXCCL | CNCL | MCCL | XCCL | DUCCL | HCCL | MUSACCL | RCCL | 
|---|---|---|---|---|---|---|---|---|---|
| Mode | Homo/Hetero | Homo/Hetero | Homo/Hetero | Homo/Hetero | Homo/Hetero | Homo/Hetero | Homo/Hetero | Homo/Hetero | Homo/Hetero | 
| send | ✓/✓ | ✓/✓ | ✓/✓ | ✓/✓ | ✓/✓ | ✓/☓ | ✓/☓ | ✓/✓ | ✓/✓ | 
| recv | ✓/✓ | ✓/✓ | ✓/✓ | ✓/✓ | ✓/✓ | ✓/☓ | ✓/☓ | ✓/✓ | ✓/✓ | 
| broadcast | ✓/✓ | ✓/✓ | ✓/✓ | ✓/✓ | ✓/✓ | ✓/☓ | ✓/☓ | ✓/✓ | ✓/✓ | 
| gather | ✓/✓ | ✓/✓ | ✓/✓ | ✓/✓ | ☓/☓ | ✓/☓ | ✓/☓ | ✓/✓ | ✓/✓ | 
| scatter | ✓/✓ | ✓/✓ | ✓/✓ | ✓/✓ | ✓/☓ | ✓/☓ | ✓/☓ | ✓/✓ | ✓/✓ | 
| reduce | ✓/✓ | ✓/✓ | ✓/✓ | ✓/✓ | ✓/✓ | ✓/☓ | ✓/☓ | ✓/✓ | ✓/✓ | 
| allreduce | ✓/✓ | ✓/✓ | ✓/✓ | ✓/✓ | ✓/✓ | ✓/☓ | ✓/☓ | ✓/✓ | ✓/✓ | 
| allgather | ✓/✓ | ✓/✓ | ✓/✓ | ✓/✓ | ✓/✓ | ✓/☓ | ✓/☓ | ✓/✓ | ✓/✓ | 
| reducescatter | ✓/✓ | ✓/✓ | ✓/✓ | ✓/✓ | ✓/✓ | ✓/☓ | ✓/☓ | ✓/✓ | ✓/✓ | 
| alltoall | ✓/✓ | ✓/✓ | ✓/✓ | ✓/✓ | ✓/✓ | ✓/☓ | ✓/☓ | ✓/✓ | ✓/✓ | 
| alltoallv | ✓/✓ | ✓/✓ | ✓/✓ | ✓/✓ | ✓/☓ | ✓/☓ | ✓/☓ | ✓/✓ | ✓/✓ | 
| group ops | ✓/✓ | ✓/✓ | ✓/✓ | ✓/✓ | ✓/✓ | ✓/☓ | ✓/☓ | ✓/✓ | ✓/✓ | 
Note that the Homo and Hetero modes refer to communication within homogeneous clusters and across heterogeneous clusters, respectively. The native collective communication libraries are listed below:
- NCCL, NVIDIA Collective Communications Library.
- IXCCL, Iluvatar Corex Collective Communications Library.
- CNCL, Cambricon Communications Library.
- MCCL, Metax Collective Communications Library.
- XCCL, XPU Collective Communications Library.
- DUCCL, DU Collective Communications Library.
- HCCL, Ascend Communications Library.
- MUSACCL, Musa Collective Communications Library.
- RCCL, ROCm Communication Collectives Library.
Additionally, FlagCX supports three collective communication libraries for host-side communication: BOOTSTRAP, GLOO, and MPI. Besides BOOTSTRAP, which is built on the FlagCX bootstrap component, the other two libraries are:
- GLOO, a collective communications library developed by Meta (Facebook).
- MPI, the Message Passing Interface standard.
FlagCX also integrates with upper-layer applications such as PyTorch and PaddlePaddle through its unified APIs. The table below lists the frameworks supported by FlagCX and the related communication operations, where the batch_XXX and XXX_coalesced ops correspond to the use of group primitives; a minimal PyTorch usage sketch follows the table.
| Framework | PyTorch | PaddlePaddle | 
|---|---|---|
| send | ✓ | ✓ | 
| recv | ✓ | ✓ | 
| batch_isend_irecv | ✓ | ✓ | 
| broadcast | ✓ | ✓ | 
| all_reduce | ✓ | ✓ | 
| all_reduce_coalesced | ✓ (in order, no aggregation) | ✘ | 
| reduce | ✓ | ✓ | 
| all_gather | ✓ | ✓ | 
| all_gather_into_tensor_coalesced | ✓ (in order, no aggregation) | ✘ | 
| gather | ✓ | ✓ | 
| scatter | ✓ | ✓ | 
| reduce_scatter | ✓ | ✓ | 
| reduce_scatter_tensor_coalesced | ✓ (in order, no aggregation) | ✘ | 
| all_to_all | ✓ | ✓ | 
| all_to_all_single | ✓ | ✓ | 
| barrier | ✓ | ✓ | 
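
The sketch below illustrates how a PyTorch job can run collectives over FlagCX once the PyTorch plugin has been built and installed. It assumes the plugin exposes a Python module named `flagcx` and registers a process-group backend named `"flagcx"`; both names are assumptions here, not taken from this README, so adjust them to match your installation.

```python
# Minimal sketch of a PyTorch all_reduce over FlagCX.
# Assumptions (not from this README): the FlagCX PyTorch plugin is
# importable as `flagcx` and registers a backend named "flagcx".
# Launch with, e.g.: torchrun --nproc_per_node=8 allreduce_demo.py
import os

import torch
import torch.distributed as dist

import flagcx  # hypothetical module name; importing it registers the backend


def main() -> None:
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Select FlagCX instead of NCCL/Gloo as the process-group backend.
    dist.init_process_group(backend="flagcx")

    # Each rank contributes a tensor filled with its rank id; after
    # all_reduce, every rank holds the sum over all ranks.
    x = torch.full((1024,), float(rank), device="cuda")
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    print(f"rank {rank}: all_reduce result = {x[0].item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

The same pattern applies to the other operations in the table; group primitives such as batch_isend_irecv map onto FlagCX group operations, and, as noted above, the *_coalesced variants are executed in order without aggregation.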
- Clone the repository:
  ```bash
  git clone https://github.com/FlagOpen/FlagCX.git
  ```
- Build the library with the flag targeting your platform:
  ```bash
  cd FlagCX
  make [USE_NVIDIA/USE_ILUVATAR_COREX/USE_CAMBRICON/USE_GLOO/USE_MPI/USE_METAX/USE_MUSA/USE_KUNLUNXIN/USE_DU/USE_ASCEND/USE_AMD]=1
  ```
  The default install path is set to `build/`; you can set `BUILDDIR` to specify a different build path. You may also define `DEVICE_HOME` and `CCL_HOME` to indicate the install paths of the device runtime and the communication library.
Tests for FlagCX are maintained in `test/perf`:
```bash
cd test/perf
make [USE_NVIDIA/USE_ILUVATAR_COREX/USE_CAMBRICON/USE_METAX/USE_MUSA/USE_KUNLUNXIN/USE_DU/USE_ASCEND]=1
mpirun --allow-run-as-root -np 8 ./test_allreduce -b 128K -e 4G -f 2
```
Note that the default MPI install path is set to `/usr/local/mpi`; you may specify a different MPI path with:
```bash
make MPI_HOME=<path to mpi install>
```
All tests support the same set of arguments:
- Sizes to scan
  - `-b <min size in bytes>`: minimum size to start with. Default: 1M.
  - `-e <max size in bytes>`: maximum size to end at. Default: 1G.
  - `-f <increment factor>`: multiplication factor between sizes. Default: 2.

- Performance
  - `-w <warmup iteration count>`: number of warmup iterations (not timed). Default: 5.
  - `-n <iteration count>`: number of iterations. Default: 20.

- Test Operation
  - `-R <0/1>`: enable local buffer registration on send/recv buffers. Default: 0.
  - `-s <OCT/DEC/HEX>`: specify the MPI communication split mode. Default: 0.

- Utils
  - `-p <0/1>`: print buffer info. Default: 0.
  - `-h`: print the help message. Default: disabled.
 
After building and testing FlagCX, you can start training models with upper-layer deep learning frameworks such as PyTorch or PaddlePaddle, using FlagCX as the communication backend; a minimal sketch is shown below. We provide detailed user guides for both homogeneous and heterogeneous training across different hardware platforms; please refer to the documentation under docs/.
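
As a concrete, hedged illustration of the training flow, the sketch below wraps a small model in PyTorch DistributedDataParallel on top of a FlagCX process group. As in the earlier example, the `flagcx` module and the `"flagcx"` backend name are assumptions about the plugin and may need to be adapted; the user guides under docs/ remain the authoritative reference.

```python
# Minimal DDP training sketch over a FlagCX process group.
# Assumptions (not from this README): the FlagCX PyTorch plugin is
# importable as `flagcx` and registers the "flagcx" backend.
# Launch with, e.g.: torchrun --nproc_per_node=8 ddp_demo.py
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

import flagcx  # hypothetical module name; importing it registers the backend

local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="flagcx")

# Toy model and optimizer; DDP all-reduces gradients through the FlagCX backend.
model = DDP(torch.nn.Linear(1024, 1024).cuda(), device_ids=[local_rank])
opt = torch.optim.SGD(model.parameters(), lr=1e-3)

for step in range(10):
    x = torch.randn(32, 1024, device="cuda")
    loss = model(x).square().mean()
    opt.zero_grad()
    loss.backward()  # gradient all-reduce happens here via the flagcx backend
    opt.step()

dist.destroy_process_group()
```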
This project is licensed under the Apache License (Version 2.0).
