-
@tok99bc, thanks for the question. It is best to open a ticket where you can share helpful debugging information such as loss values and log snippets. In the meantime, could you try comparing loss curves with and without NVMe offloading? You might need to run with a smaller model and, if possible, on a single GPU. By the way, please note that not all gradient values are swapped out; some are retained in memory and merged with the swapped values after swapping in. See below:
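To make the "retained and merged" behavior concrete, here is a toy sketch (NumPy, not DeepSpeed's actual swap code) of a swapper that writes only part of a gradient buffer to disk, keeps the tail resident in memory, and merges the two on swap-in. The function names and the temp-file path are illustrative assumptions. Note the consequence for debugging: the file on disk never contains the whole tensor, so byte-comparing the file against the full gradient will always show a mismatch even when everything is working.

```python
import os
import tempfile
import numpy as np

def swap_out(grad, path, retain_count):
    """Write all but the last `retain_count` elements to `path`;
    return the retained tail, which stays resident in memory."""
    swapped, retained = grad[:-retain_count], grad[-retain_count:].copy()
    swapped.tofile(path)  # only the swapped portion reaches the NVMe file
    return retained

def swap_in(path, retained, dtype=np.float32):
    """Read the swapped portion back and merge it with the retained tail."""
    swapped = np.fromfile(path, dtype=dtype)
    return np.concatenate([swapped, retained])

# Demo: round-trip an 8-element gradient, retaining 3 elements in memory.
path = os.path.join(tempfile.gettempdir(), "grad_demo.bin")
grad = np.arange(8, dtype=np.float32)
retained = swap_out(grad, path, retain_count=3)
restored = swap_in(path, retained)
assert np.array_equal(grad, restored)  # merge reproduces the full gradient
```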
-
Hi DeepSpeed-Team, hi everyone,
in my setup, gradient offloading does not seem to work properly.
Gradient files, or more precisely files that contain the word
"gradient" in their file name, apparently differ between
swap-out and swap-in. However, for other files, for example the
parameter files, the data is completely identical.
I have looked at this a bit more closely by printing the first
few bytes of the tensor at different points of the read and write
process (swap-in and swap-out). I observed that when swapping in gradient files, the first few
bytes never change; when loading parameter files, they do.
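The check I am doing can be sketched like this (the helper name and the demo file are placeholders, not the actual DeepSpeed swap paths):

```python
import os
import tempfile

def first_bytes(path, n=16):
    """Preview the first n bytes of a swap file as a hex string."""
    with open(path, "rb") as f:
        return f.read(n).hex()

# Self-contained demo with a temp file standing in for a swap file.
path = os.path.join(tempfile.gettempdir(), "gradient_demo.swp")
with open(path, "wb") as f:
    f.write(bytes(range(32)))

# Printing this preview right after swap-out and again at swap-in shows
# whether the on-disk contents changed between the two points.
preview = first_bytes(path)
print(preview)  # -> 000102030405060708090a0b0c0d0e0f
```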
The only explanation I can think of is a misconfiguration on our side.
Does anyone have a clue what we could do? Did I misconfigure DeepSpeed?
I would really appreciate any hint that helps me to move forward.
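For context, a ZeRO stage-3 NVMe offload config generally has the following shape (the values below are placeholders following the documented pattern, not our exact settings):

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "nvme",
      "nvme_path": "/local_nvme",
      "pin_memory": true
    },
    "offload_param": {
      "device": "nvme",
      "nvme_path": "/local_nvme",
      "pin_memory": true
    }
  }
}
```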
Best regards!
P.S.: I would also be open to a short Zoom discussion, if someone is interested.