I'm trying to run the sample fine-tuning test from the README, but no matter how I configure the accelerate tool, it keeps trying to use two GPUs. My most recent config is:
```yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: 'NO'
downcast_bf16: 'no'
enable_cpu_affinity: false
gpu_ids: '0'
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
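For what it's worth, the symptom looks consistent with a launcher that indexes into the configured device list by local rank. The sketch below is hypothetical (it is not accelerate's actual code, and `pick_device` is an invented helper), but it shows how a stray `LOCAL_RANK=1` would end up requesting device 1 even with `gpu_ids: '0'`:

```python
import os

def pick_device(gpu_ids: str, default_rank: int = 0) -> int:
    """Pick a CUDA device index the way a naive launcher might:
    split the configured gpu_ids list and index into it by LOCAL_RANK.
    Hypothetical helper, for illustration only."""
    visible = [int(g) for g in gpu_ids.split(",")]
    rank = int(os.environ.get("LOCAL_RANK", default_rank))
    if rank >= len(visible):
        # With gpu_ids='0' and LOCAL_RANK=1 this is exactly the
        # "requesting GPU 1 on a single-GPU box" failure mode.
        raise IndexError(f"local rank {rank} has no entry in gpu_ids={gpu_ids!r}")
    return visible[rank]

# With the env var unset, gpu_ids='0' resolves to device 0:
os.environ.pop("LOCAL_RANK", None)
print(pick_device("0"))  # 0

# A stale LOCAL_RANK=1 makes the same config fail:
os.environ["LOCAL_RANK"] = "1"
try:
    pick_device("0")
except IndexError as e:
    print("failed:", e)
```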
I'm running on Ubuntu 24.04 LTS, Python 3.12.11, with the following installed devices and CUDA versions:

```
$ nvidia-smi
Tue Dec 2 02:05:45 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.57.08        Driver Version: 575.57.08        CUDA Version: 12.9         |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name             Persistence-M     | Bus-Id        Disp.A   | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap          | Memory-Usage           | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        On  | 00000000:65:00.0  Off  |                  Off |
|  0%  40C  P8  19W / 450W                | 1MiB / 24564MiB        |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI   PID   Type   Process name                                  GPU Memory  |
|        ID   ID                                                              Usage       |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
```
The tail of the log is:

```
============================================================
sft/train_sft.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-12-02_02:06:19
  host      : beast
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 31116)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
```
The full log can be found here: https://gist.github.com/arpieb/d03351dbc74212b1e92de77962e61926
Stepping through the exception in a debugger shows it is requesting a GPU with an ID of 1, which would be the second GPU if one were installed, even though I have explicitly set the GPU device list to "0" in the config.
It looks like the LOCAL_RANK environment variable may be involved: it defaults to 0 but is getting set to 1 for some reason...?
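As a sanity check before launching, one workaround is to pin device visibility at the driver level and clear any rank variables left over from a previous distributed run. `CUDA_VISIBLE_DEVICES`, `LOCAL_RANK`, `RANK`, and `WORLD_SIZE` are standard CUDA/PyTorch environment variables; clearing them here is only a suggested diagnostic, not a confirmed fix:

```python
import os

# Mask all but the first GPU at the driver level; any device the
# launcher requests is then remapped within this visible set.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Remove rank variables that a previous distributed launch may have
# left behind, so single-process code falls back to rank 0.
for var in ("LOCAL_RANK", "RANK", "WORLD_SIZE"):
    os.environ.pop(var, None)

print(os.environ.get("LOCAL_RANK", "0"))  # falls back to "0"
```

The same check can be done from the shell by prefixing the launch command with `CUDA_VISIBLE_DEVICES=0` and inspecting `env | grep RANK` first.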