The ROCm™ Data Center Tool (RDC) simplifies administration and addresses key infrastructure challenges for AMD GPUs in cluster and datacenter environments. RDC offers a suite of features to enhance your GPU management and monitoring.
- GPU Telemetry
- GPU Statistics for Jobs
- Integration with Third-Party Tools
- Open Source
Note
The published documentation is available at ROCm Data Center Tool in an organized, easy-to-read format, with search and a table of contents. The documentation source files reside in the `rdc/docs` folder of this repository. As with all ROCm projects, the documentation is open source. For more information on contributing to the documentation, see Contribute to ROCm documentation.
Before setting up RDC, ensure your system meets the following requirements:
- Supported Platforms: RDC runs on AMD ROCm-supported platforms. Refer to the List of Supported Operating Systems for details.
- Dependencies: for certificate generation, refer to the RDC Developer Handbook (Generate Files for Authentication) or consult the concise guide at `authentication/readme.txt`.
RDC supports two primary modes of operation: Standalone and Embedded. Choose the mode that best fits your deployment needs.
Standalone mode allows RDC to run independently with all its components installed.
- Start RDCD with Authentication (Monitor-Only Capabilities):

  ```bash
  /opt/rocm/bin/rdcd
  ```

- Start RDCD with Authentication (Full Capabilities):

  ```bash
  sudo /opt/rocm/bin/rdcd
  ```

- Start RDCD without Authentication (Monitor-Only Capabilities):

  ```bash
  /opt/rocm/bin/rdcd -u
  ```

- Start RDCD without Authentication (Full Capabilities):

  ```bash
  sudo /opt/rocm/bin/rdcd -u
  ```
Embedded mode integrates RDC directly into your existing management tools using its library format.
- Run RDC in Embedded Mode:

  ```bash
  python your_management_tool.py --rdc_embedded
  ```

  Note: Ensure that the `rdcd` daemon is not running separately when using embedded mode.
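If your management tool is written in Python, the `RdcReader` class shown later in this document can start RDC embedded in-process. A minimal sketch, assuming the `python_binding` folder is on `PYTHONPATH`: passing `ip_port=None` selects embedded mode (as in the telemetry example below), while an address such as `"localhost:50051"` would instead connect to a standalone `rdcd`.

```python
# Minimal sketch of embedded-mode startup via RdcReader (see the telemetry
# example later in this document). Assumes the RDC python_binding folder is
# on PYTHONPATH and the RDC shared libraries are on the loader path.
from RdcReader import RdcReader
from rdc_bootstrap import rdc_field_t

reader = RdcReader(
    ip_port=None,                              # None = embedded, no separate rdcd;
                                               # "localhost:50051" would use a standalone rdcd
    field_ids=[rdc_field_t.RDC_FI_GPU_UTIL],   # fields to collect
    update_freq=1000000,                       # update interval in microseconds
)
reader.process()                               # fetch and dispatch the latest values
```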
- Copy the Service File:

  ```bash
  sudo cp /opt/rocm/libexec/rdc/rdc.service /etc/systemd/system/
  ```

- Configure Capabilities:

  - Full Capabilities: Ensure the following lines are uncommented in `/etc/systemd/system/rdc.service`:

    ```
    CapabilityBoundingSet=CAP_DAC_OVERRIDE
    AmbientCapabilities=CAP_DAC_OVERRIDE
    ```

  - Monitor-Only Capabilities: Comment out the above lines to restrict RDCD to monitoring.

- Start the Service:

  ```bash
  sudo systemctl start rdc
  sudo systemctl status rdc
  ```
- Modify RDCD Options:

  Edit `/opt/rocm/share/rdc/conf/rdc_options.conf` to append any additional RDCD parameters:

  ```bash
  sudo nano /opt/rocm/share/rdc/conf/rdc_options.conf
  ```

  Example Configuration:

  ```
  RDC_OPTS="-p 50051 -u -d"
  ```

  Flags:
  - `-p 50051`: Use port 50051
  - `-u`: Unauthenticated mode
  - `-d`: Enable debug messages
If you prefer to build RDC from source, follow the steps below.
Important: RDC requires gRPC and protoc to be built from source, as pre-built packages are not available.
- Install Required Tools:

  ```bash
  sudo apt-get update
  sudo apt-get install automake make cmake g++ unzip build-essential autoconf libtool pkg-config libgflags-dev libgtest-dev clang libc++-dev curl
  ```
- Clone and Build gRPC:

  ```bash
  git clone -b v1.67.1 https://github.com/grpc/grpc --depth=1 --shallow-submodules --recurse-submodules
  cd grpc
  export GRPC_ROOT=/opt/grpc
  cmake -B build \
      -DgRPC_INSTALL=ON \
      -DgRPC_BUILD_TESTS=OFF \
      -DBUILD_SHARED_LIBS=ON \
      -DCMAKE_SHARED_LINKER_FLAGS_INIT=-Wl,--enable-new-dtags,--build-id=sha1,--rpath,'$ORIGIN' \
      -DCMAKE_INSTALL_PREFIX="$GRPC_ROOT" \
      -DCMAKE_INSTALL_LIBDIR=lib \
      -DCMAKE_BUILD_TYPE=Release
  make -C build -j $(nproc)
  sudo make -C build install
  echo "$GRPC_ROOT" | sudo tee /etc/ld.so.conf.d/grpc.conf
  sudo ldconfig
  cd ..
  ```
- Clone the RDC Repository:

  ```bash
  git clone https://github.com/ROCm/rdc
  cd rdc
  ```
- Configure the Build:

  ```bash
  cmake -B build -DGRPC_ROOT="$GRPC_ROOT"
  ```
  Optional Features:

  - Enable ROCm Profiler:

    ```bash
    cmake -B build -DBUILD_PROFILER=ON
    ```

  - Enable RVS:

    ```bash
    cmake -B build -DBUILD_RVS=ON
    ```

  - Build RDC Library Only (without rdci and rdcd):

    ```bash
    cmake -B build -DBUILD_STANDALONE=OFF
    ```

  - Build RDC Library Without ROCm Run-time:

    ```bash
    cmake -B build -DBUILD_RUNTIME=OFF
    ```
- Build and Install:

  ```bash
  make -C build -j $(nproc)
  sudo make -C build install
  ```
- Update System Library Path:

  ```bash
  export RDC_LIB_DIR=/opt/rocm/lib/rdc
  export GRPC_LIB_DIR="/opt/grpc/lib"
  echo "${RDC_LIB_DIR}" | sudo tee /etc/ld.so.conf.d/x86_64-librdc_client.conf
  echo "${GRPC_LIB_DIR}" | sudo tee -a /etc/ld.so.conf.d/x86_64-librdc_client.conf
  sudo ldconfig
  ```
Locate and display information about GPUs present in a compute node.
Example:

```bash
rdci discovery <host_name> -l
```

Output:

```
2 GPUs found
+-----------+----------------------------------------------+
| GPU Index | Device Information                          |
+-----------+----------------------------------------------+
| 0         | Name: AMD Radeon Instinct MI50 Accelerator  |
| 1         | Name: AMD Radeon Instinct MI50 Accelerator  |
+-----------+----------------------------------------------+
```
Create, delete, and list logical groups of GPUs.
Create a Group:

```bash
rdci group -c GPU_GROUP
```

Add GPUs to Group:

```bash
rdci group -g 1 -a 0,1
```

List Groups:

```bash
rdci group -l
```

Delete a Group:

```bash
rdci group -d 1
```
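Groups can also be managed programmatically when RDC runs embedded. A rough Python sketch, assuming the ctypes bindings in `rdc_bootstrap` expose the documented C API (`rdc_init`, `rdc_start_embedded`, `rdc_group_gpu_create`, `rdc_group_gpu_add`) one-to-one; the exact call style is an assumption, so verify against `rdc_bootstrap.py` in the `python_binding` folder:

```python
# Hypothetical sketch: create a GPU group through the RDC C API using the
# rdc_bootstrap ctypes bindings. Function names follow the documented C API;
# the Python call style is an assumption -- verify against rdc_bootstrap.py.
from ctypes import byref
from rdc_bootstrap import *

rdc.rdc_init(0)                                # initialize the RDC library
handle = rdc_handle_t()
rdc.rdc_start_embedded(rdc_operation_mode_t.RDC_OPERATION_MODE_AUTO,
                       byref(handle))          # embedded mode, automatic updates

group_id = rdc_gpu_group_t()
rdc.rdc_group_gpu_create(handle, rdc_group_type_t.RDC_GROUP_DEFAULT,
                         b"GPU_GROUP", byref(group_id))   # like: rdci group -c GPU_GROUP
rdc.rdc_group_gpu_add(handle, group_id, 0)     # like: rdci group -g <id> -a 0

rdc.rdc_stop_embedded(handle)
rdc.rdc_shutdown()
```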
Manage field groups to monitor specific GPU metrics.
Create a Field Group:

```bash
rdci fieldgroup -c <fgroup> -f 150,155
```

List Field Groups:

```bash
rdci fieldgroup -l
```

Delete a Field Group:

```bash
rdci fieldgroup -d 1
```
Important

Define fields to monitor RAS ECC counters:

- Correctable ECC Errors: field `312` (`RDC_FI_ECC_CORRECT_TOTAL`)
- Uncorrectable ECC Errors: field `313` (`RDC_FI_ECC_UNCORRECT_TOTAL`)
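These counters can also be collected programmatically. A minimal sketch using the `RdcReader` Python class shown later in this document; the field names come from the list above, and everything else follows the `SimpleRdcReader` example:

```python
# Sketch: watch RAS ECC counters with RdcReader (see the telemetry example
# later in this document for the full pattern). Assumes the python_binding
# folder is on PYTHONPATH.
import time
from RdcReader import RdcReader
from rdc_bootstrap import rdc_field_t

ecc_fields = [
    rdc_field_t.RDC_FI_ECC_CORRECT_TOTAL,     # field 312
    rdc_field_t.RDC_FI_ECC_UNCORRECT_TOTAL,   # field 313
]

class EccReader(RdcReader):
    def __init__(self):
        super().__init__(ip_port=None, field_ids=ecc_fields, update_freq=1000000)

    def handle_field(self, gpu_index, value):
        name = self.rdc_util.field_id_string(value.field_id).lower()
        print(f"GPU {gpu_index} {name}: {value.value.l_int}")

if __name__ == "__main__":
    reader = EccReader()
    while True:
        time.sleep(1)
        reader.process()
```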
Monitor GPU fields such as temperature, power usage, and utilization.
Command:

```bash
rdci dmon -f <field_group> -g <gpu_group> -c 5 -d 1000
```

Sample Output:

```
1 group found
+-----------+-------------+--------------+
| GPU Index | TEMP (m°C)  | POWER (µW)   |
+-----------+-------------+--------------+
| 0         | 25000       | 520500       |
+-----------+-------------+--------------+
```
Display GPU statistics for any given workload.
Start Recording Stats:

```bash
rdci stats -s 2 -g 1
```

Stop Recording Stats:

```bash
rdci stats -x 2
```

Display Job Stats:

```bash
rdci stats -j 2
```
Sample Output:

```
Summary:
Executive Status:
Start time: 1586795401
End time: 1586795445
Total execution time: 44

Energy Consumed (Joules): 21682
Power Usage (Watts): Max: 49 Min: 13 Avg: 34
GPU Clock (MHz): Max: 1000 Min: 300 Avg: 903
GPU Utilization (%): Max: 69 Min: 0 Avg: 2
Max GPU Memory Used (bytes): 524320768
Memory Utilization (%): Max: 12 Min: 11 Avg: 12
```
Run diagnostics on a GPU group to ensure system health.
Command:

```bash
rdci diag -g <gpu_group>
```

Sample Output:

```
No compute process:    Pass
Node topology check:   Pass
GPU parameters check:  Pass
Compute Queue ready:   Pass
System memory check:   Pass
=============== Diagnostic Details ==================
No compute process:    No processes running on any devices.
Node topology check:   No link detected.
GPU parameters check:  GPU 0 Critical Edge temperature in range.
Compute Queue ready:   Run binary search task on GPU 0 Pass.
System memory check:   Max Single Allocation Memory Test for GPU 0 Pass.
                       CPUAccessToGPUMemoryTest for GPU 0 Pass.
                       GPUAccessToCPUMemoryTest for GPU 0 Pass.
```
RDC integrates seamlessly with tools like Prometheus, Grafana, and Reliability, Availability, and Serviceability (RAS) to enhance monitoring and visualization.
RDC provides a generic Python class `RdcReader` to simplify telemetry gathering.
Sample Program:

```python
from RdcReader import RdcReader
from RdcUtil import RdcUtil
from rdc_bootstrap import *
import time

# Fields to collect: power draw and GPU utilization.
default_field_ids = [
    rdc_field_t.RDC_FI_POWER_USAGE,
    rdc_field_t.RDC_FI_GPU_UTIL
]

class SimpleRdcReader(RdcReader):
    def __init__(self):
        # ip_port=None runs RDC embedded; update_freq is in microseconds.
        super().__init__(ip_port=None, field_ids=default_field_ids, update_freq=1000000)

    def handle_field(self, gpu_index, value):
        # Called once per GPU per field on every process() pass.
        field_name = self.rdc_util.field_id_string(value.field_id).lower()
        print(f"{value.ts} {gpu_index}:{field_name} {value.value.l_int}")

if __name__ == '__main__':
    reader = SimpleRdcReader()
    while True:
        time.sleep(1)
        reader.process()
```
Running the Example:

```bash
# Ensure RDC shared libraries are in the library path and RdcReader.py is in PYTHONPATH
python SimpleReader.py
```
The Prometheus plugin allows you to monitor events and send alerts.
Installation:
- Install the Prometheus Client:

  ```bash
  pip install prometheus_client
  ```

- Run the Prometheus Plugin:

  ```bash
  python rdc_prometheus.py
  ```

- Verify the Plugin:

  ```bash
  curl localhost:5000
  ```
Integration Steps:
- Download and Install Prometheus.

- Configure Prometheus Targets: modify `prometheus_targets.json` to point to your compute nodes.

  ```json
  [
    {
      "targets": [
        "rdc_test1.amd.com:5000",
        "rdc_test2.amd.com:5000"
      ]
    }
  ]
  ```

- Start Prometheus with the Configuration File (a sketch of what this file can contain follows these steps):

  ```bash
  prometheus --config.file=/path/to/rdc_prometheus_example.yml
  ```

- Access the Prometheus UI: open http://localhost:9090 in your browser.
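The repository provides `rdc_prometheus_example.yml` for this step. As a rough, hypothetical sketch of the shape such a file can take (the `file_sd_configs` wiring to `prometheus_targets.json` is an assumption; defer to the shipped example):

```yaml
# Hypothetical sketch of a Prometheus scrape configuration for the RDC
# plugin; verify against the rdc_prometheus_example.yml shipped with RDC.
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: rdc
    file_sd_configs:
      - files:
          - /path/to/prometheus_targets.json
```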
Grafana provides advanced visualization capabilities for RDC metrics.
Installation:
- Download Grafana.

- Install Grafana: follow the Installation Instructions.

- Start the Grafana Server:

  ```bash
  sudo systemctl start grafana-server
  sudo systemctl status grafana-server
  ```

- Access Grafana: open http://localhost:3000 in your browser and log in with the default credentials (`admin`/`admin`).
Configuration Steps:
- Add the Prometheus Data Source:
  - Navigate to Configuration → Data Sources → Add data source → Prometheus.
  - Set the URL to http://localhost:9090 and save.

- Import the RDC Dashboard:
  - Click the + icon and select Import.
  - Upload `rdc_grafana_dashboard_example.json` from the `python_binding` folder.
  - Select the desired compute node for visualization.
The RAS plugin enables monitoring and counting of ECC (Error-Correcting Code) errors.
Installation:
- Ensure the GPU Supports RAS: the GPU must support RAS features.

- The RDC Installation Includes the RAS Library: `librdc_ras.so` is located in `/opt/rocm-4.2.0/rdc/lib`.
Usage:
- Monitor ECC Errors:

  ```bash
  rdci dmon -i 0 -e 600,601
  ```

  Sample Output:

  ```
  GPU  ECC_CORRECT  ECC_UNCORRECT
  0    0            0
  ```
Important
- Missing Libraries:
  - Verify that `/opt/rocm/lib/rdc/librdc_*.so` exists.
  - Ensure all related libraries (rocprofiler, rocruntime, etc.) are present.

- Unsupported GPU:
  - Most metrics work on MI300 and newer.
  - Limited metrics are available on MI200.
  - Consumer GPUs (e.g., RX6800) have fewer supported metrics.
Solution:

Set the `HSA_TOOLS_LIB` environment variable before running a compute job:

```bash
export HSA_TOOLS_LIB=/opt/rocm/lib/librocprofiler64.so.1
```
Example:

```bash
# Terminal 1
rdcd -u

# Terminal 2
export HSA_TOOLS_LIB=/opt/rocm/lib/librocprofiler64.so.1
gpu-burn

# Terminal 3
rdci dmon -u -e 800,801 -i 0 -c 1
# Output:
# GPU  OCCUPANCY_PERCENT  ACTIVE_WAVES
# 0    001.000            32640.000
```
Error Message:

```
terminate called after throwing an instance of 'std::runtime_error'
  what():  hsa error code: 4104 HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.
Aborted (core dumped)
```
Solution:

- Missing Groups: ensure the `video` and `render` groups exist and that your user belongs to them:

  ```bash
  sudo usermod -aG video,render $USER
  ```

  Log out and log back in to apply the group changes.
- View RDCD Logs:

  ```bash
  sudo journalctl -u rdc
  ```

- Run RDCD with Debug Logs:

  ```bash
  RDC_LOG=DEBUG /opt/rocm/bin/rdcd
  ```

  Supported logging levels: ERROR, INFO, DEBUG.
- Enable Additional Logging Messages:

  ```bash
  export RSMI_LOGGING=3
  ```
RDC is open-source and available under the MIT License.
For support and further inquiries, please refer to the ROCm Documentation or contact the maintainers through the repository's issue tracker.