
Unable to get TinyTrainer to run on a single-GPU setup #2

@arpieb

Description


I'm trying to run the sample finetuning test in the README, but no matter how I configure the accelerate tool it keeps trying to use two GPUs. My most recent config is:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: 'NO'
downcast_bf16: 'no'
enable_cpu_affinity: false
gpu_ids: '0'
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
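To confirm what the launcher is actually exporting, I dropped a quick diagnostic at the top of the training script. This is just my own sketch (the helper name `dump_dist_env` is mine); it prints the distributed-training environment variables that accelerate/torchrun conventionally set, so I can see whether more than one worker is being spawned despite `num_processes: 1`:

```python
import os

def dump_dist_env() -> dict:
    """Print the distributed-related env vars so we can see what the
    launcher exported to this process (e.g. whether LOCAL_RANK is 1
    even though the config requests a single process on GPU 0)."""
    keys = ("LOCAL_RANK", "RANK", "WORLD_SIZE", "CUDA_VISIBLE_DEVICES")
    env = {k: os.environ.get(k, "<unset>") for k in keys}
    for k, v in env.items():
        print(f"{k}={v}")
    return env

if __name__ == "__main__":
    dump_dist_env()
```

On a correctly configured single-process launch I'd expect `LOCAL_RANK=0`, `RANK=0`, `WORLD_SIZE=1`.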

I'm running on Ubuntu 24.04 LTS, Python 3.12.11, with the following installed devices and CUDA versions:

$ nvidia-smi
Tue Dec  2 02:05:45 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.57.08              Driver Version: 575.57.08      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        On  |   00000000:65:00.0 Off |                  Off |
|  0%   40C    P8             19W /  450W |       1MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

The tail of the log is:

============================================================
sft/train_sft.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-12-02_02:06:19
  host      : beast
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 31116)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Full log can be found here: https://gist.github.com/arpieb/d03351dbc74212b1e92de77962e61926

Debugging through the exception shows it's trying to request a GPU with an ID of 1, which would be the second GPU if one were installed, even though I have explicitly set the GPU device list to "0" in the config.

It looks like the LOCAL_RANK env var may be in play here: it defaults to 0, but it's getting set to 1 for some reason...?
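To illustrate what I think is happening, here is a minimal sketch (the helper `resolve_device_index` is hypothetical, not TinyTrainer's actual code) of the usual pattern where a worker picks its CUDA device from LOCAL_RANK, defaulting to 0 when the variable is unset. If something upstream exports `LOCAL_RANK=1`, the process will ask for `cuda:1` even on a single-GPU box:

```python
import os

def resolve_device_index() -> int:
    """Sketch of the common launcher convention: the per-node device
    index comes from LOCAL_RANK, falling back to 0 when unset."""
    return int(os.environ.get("LOCAL_RANK", "0"))

os.environ.pop("LOCAL_RANK", None)
print(resolve_device_index())   # unset -> device 0, as expected

os.environ["LOCAL_RANK"] = "1"
print(resolve_device_index())   # exported as 1 -> requests the phantom second GPU
```

So the question is what is exporting `LOCAL_RANK=1` when the config asks for a single process on GPU 0.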
