-
@tok99bc, thanks for the question. It is best to open a ticket where you can share helpful debugging information such as loss values and log snippets. In the meantime, could you try comparing loss curves with and without NVMe offloading? You might need to run with a smaller model and, if possible, on a single GPU. By the way, please note that not all gradient values are swapped out; some are retained in memory and merged with the swapped values after swapping in. See below:
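To make the "retained and merged" behavior concrete, here is a toy sketch (NumPy, not DeepSpeed's actual swap code) of a swapper that writes only part of a gradient buffer to disk, keeps the tail resident in memory, and merges the two on swap-in. The function names and the temp-file path are illustrative assumptions. Note the consequence for debugging: the file on disk never contains the whole tensor, so byte-comparing the file against the full gradient will always show a mismatch even when everything is working.

```python
import os
import tempfile
import numpy as np

def swap_out(grad, path, retain_count):
    """Write all but the last `retain_count` elements to `path`;
    return the retained tail, which stays resident in memory."""
    swapped, retained = grad[:-retain_count], grad[-retain_count:].copy()
    swapped.tofile(path)  # only the swapped portion reaches the NVMe file
    return retained

def swap_in(path, retained, dtype=np.float32):
    """Read the swapped portion back and merge it with the retained tail."""
    swapped = np.fromfile(path, dtype=dtype)
    return np.concatenate([swapped, retained])

# Demo: round-trip an 8-element gradient, retaining 3 elements in memory.
path = os.path.join(tempfile.gettempdir(), "grad_demo.bin")
grad = np.arange(8, dtype=np.float32)
retained = swap_out(grad, path, retain_count=3)
restored = swap_in(path, retained)
assert np.array_equal(grad, restored)  # merge reproduces the full gradient
```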
-
Hi DeepSpeed-Team, hi everyone,
in my setup, gradient offloading does not seem to work properly.
Gradient files, or more precisely files that contain the word
"gradient" in their file name, apparently differ between
swap-out and swap-in. However, for other files, for example the
parameter files, the data is completely identical.
I have looked at this a bit more closely by printing the first
few bytes of the tensor at different points of the read and write
process (swap-in and swap-out). I observed that when swapping in gradient files, the first few
bytes never change; when loading parameter files, they do.
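The check I am doing can be sketched like this (the helper name and the demo file are placeholders, not the actual DeepSpeed swap paths):

```python
import os
import tempfile

def first_bytes(path, n=16):
    """Preview the first n bytes of a swap file as a hex string."""
    with open(path, "rb") as f:
        return f.read(n).hex()

# Self-contained demo with a temp file standing in for a swap file.
path = os.path.join(tempfile.gettempdir(), "gradient_demo.swp")
with open(path, "wb") as f:
    f.write(bytes(range(32)))

# Printing this preview right after swap-out and again at swap-in shows
# whether the on-disk contents changed between the two points.
preview = first_bytes(path)
print(preview)  # -> 000102030405060708090a0b0c0d0e0f
```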
The only explanation I can think of is a misconfiguration on our side.
Does anyone have a clue what we could do? Did I misconfigure DeepSpeed?
I would really appreciate any hint that helps me to move forward.
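For context, a ZeRO stage-3 NVMe offload config generally has the following shape (the values below are placeholders following the documented pattern, not our exact settings):

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "nvme",
      "nvme_path": "/local_nvme",
      "pin_memory": true
    },
    "offload_param": {
      "device": "nvme",
      "nvme_path": "/local_nvme",
      "pin_memory": true
    }
  }
}
```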
Best regards!
P.S.: I would also be open to a short Zoom discussion, if someone is interested.