GPUDirect Storage prototype tutorial #3317

Merged

svekars merged 7 commits into main from gds_tutorial

May 2, 2025

Contributor

mikaylagawarecki commented Apr 8, 2025

No description provided.

pytorch-bot bot commented Apr 8, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/tutorials/3317

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot added the cla signed label

mikaylagawarecki requested a review from albanD

April 8, 2025 14:24

mikaylagawarecki marked this pull request as draft

April 8, 2025 14:24

albanD reviewed

.jenkins/validate_tutorials_built.py

    
                  "prototype_source/vmap_recipe",

                  "prototype_source/torchscript_freezing",

                  "prototype_source/nestedtensor",

                  "prototype_source/gpu_direct_storage", # requires specific filesystem + GPUDirect Storage to be set up

Contributor

albanD Apr 8, 2025

Doesn't it run in compat mode on a random machine?

Contributor Author

mikaylagawarecki Apr 8, 2025

You need a specific filesystem

prototype_source/gpu_direct_storage.py (3 resolved threads)

prototype_source/gpu_direct_storage.py Outdated

    
              # The loading flow is the inverse: we can ``torch.load`` under the ``torch.serialization.skip_data`` context

              # manager to load everything except the storage bytes. This means that any tensors in the checkpoint will be

              # created but their storages will be empty (i.e. the tensors will be created via ``torch.empty``). If the

              # tensors to be loaded to are persistent, one can use the ``torch.cuda.gds.gds_register_buffer`` API to register

Contributor

albanD Apr 8, 2025

The register API is not used here?
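
For reference, a minimal sketch of the loading flow described in the hunk above, with the optional buffer registration made explicit. It assumes a flat state dict of CUDA tensors and a machine with GPUDirect Storage set up; the function name `load_state_dict_with_gds` and the `offsets` dict (state-dict key to byte offset in the file) are hypothetical, not part of the tutorial.

```python
import os


def load_state_dict_with_gds(path, offsets, register_buffers=False):
    """Sketch: rebuild a checkpoint's tensors, then fill their storages via GDS.

    Assumptions: `path` was written with torch.save under skip_data and its
    storage bytes were later written with GdsFile.save_storage; `offsets` is a
    hypothetical dict mapping each state-dict key to its byte offset in `path`.
    """
    # Imported lazily so that merely defining this function does not require
    # a CUDA build of PyTorch with GPUDirect Storage support.
    import torch
    from torch.cuda.gds import GdsFile, gds_register_buffer

    # Creates every tensor in the checkpoint but leaves the storages unfilled.
    with torch.serialization.skip_data():
        state_dict = torch.load(path)

    f = GdsFile(path, os.O_RDWR)
    for key, tensor in state_dict.items():
        storage = tensor.untyped_storage()
        if register_buffers:
            # Optional: only worthwhile when the same storage is loaded into
            # repeatedly. The caller should later call
            # torch.cuda.gds.gds_deregister_buffer on each registered storage.
            gds_register_buffer(storage)
        # load_storage reads the bytes at the given offset into the storage.
        f.load_storage(storage, offsets[key])
    del f
    return state_dict
```

Registration is left off by default here because it only pays for itself when the same tensors are reloaded many times.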

prototype_source/gpu_direct_storage.py

    
                  f.load_storage(v.untyped_storage(), offset)

                  assert torch.equal(v, sd[k])

              del f

Contributor

albanD Apr 8, 2025

Similar synchronization question as above

Contributor Author

mikaylagawarecki Apr 8, 2025

I don't think synchronization is needed after the call, since cuFileRead/cuFileWrite are blocking operations that return only once the I/O is complete: https://docs.nvidia.com/gpudirect-storage/api-reference-guide/index.html#cufileread. You might need to synchronize before these ops (rather than after), though.
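
A rough sketch of what synchronizing before the write could look like, assuming a machine with GPUDirect Storage set up. The function name `save_storages_with_gds` and the `offsets` dict are hypothetical, and whether the synchronize call is strictly required is the open question in this thread, not something this sketch settles.

```python
import os


def save_storages_with_gds(state_dict, offsets, path):
    """Sketch: write each CUDA tensor's storage into `path` at a known offset.

    Assumptions: `path` already exists (for example, written by torch.save
    under skip_data) and `offsets` is a hypothetical dict mapping each
    state-dict key to the byte offset reserved for its storage.
    """
    # Imported lazily so that defining this function does not require a
    # CUDA build of PyTorch with GPUDirect Storage support.
    import torch
    from torch.cuda.gds import GdsFile

    # Kernels that produce the tensor values may still be in flight, so wait
    # for them before handing the storages to the blocking write.
    torch.cuda.synchronize()

    f = GdsFile(path, os.O_RDWR)
    for key, tensor in state_dict.items():
        # save_storage blocks until the write completes, so no
        # synchronization is added after it.
        f.save_storage(tensor.untyped_storage(), offsets[key])
    del f
```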

prototype_source/gpu_direct_storage.py Outdated (resolved)

mikaylagawarecki commented

prototype_source/gpu_direct_storage.py Outdated

Comment on lines 68 to 71

    
              # If you are continuously saving the same state dictionary during training, you

              # only need to obtain the offsets once, and the same offsets can be reused. Similarly, if a tensor is going to

              # be loaded into repeatedly, one can use ``torch.cuda.gds.gds_register_buffer``, which wraps

              # ``cuFileBufRegister``, to register the storages as GDS buffers.

Contributor Author

mikaylagawarecki Apr 8, 2025

@albanD is this better?
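
The offset-reuse idea in the comment above can be illustrated without a GPU. This is only a stand-in: plain `seek`/`write` calls replace `cuFileWrite`, and the file layout (a 64-byte header followed by two fixed-size storages), the key names, and the helper names are all hypothetical.

```python
import os
import tempfile


def write_at_offsets(path, blobs, offsets):
    """Overwrite each blob in place at its previously recorded offset."""
    with open(path, "r+b") as f:
        for key, data in blobs.items():
            f.seek(offsets[key])
            f.write(data)


def read_at_offsets(path, sizes, offsets):
    """Read each blob back from its recorded offset."""
    out = {}
    with open(path, "rb") as f:
        for key, size in sizes.items():
            f.seek(offsets[key])
            out[key] = f.read(size)
    return out


# Hypothetical layout: a 64-byte header followed by two fixed-size storages.
sizes = {"weight": 200, "bias": 40}
offsets = {"weight": 64, "bias": 64 + 200}

path = os.path.join(tempfile.mkdtemp(), "checkpoint.bin")
with open(path, "wb") as f:
    f.write(b"\0" * (64 + 200 + 40))

# Stand-in for a training loop: the offsets are computed once and reused
# for every save, because the sizes of the storages never change.
for step in range(3):
    blobs = {key: bytes([step + 1]) * size for key, size in sizes.items()}
    write_at_offsets(path, blobs, offsets)

restored = read_at_offsets(path, sizes, offsets)
```

Because every save writes the same number of bytes at the same positions, the file size stays constant and only the payload changes between steps.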

mikaylagawarecki added the 2.7 label

svekars reviewed

Contributor

svekars left a comment

Just a few minor nits

prototype_source/gpu_direct_storage.py Outdated (10 resolved threads)

malfet reviewed

prototype_source/gpu_direct_storage.py Outdated (resolved)

svekars reviewed

prototype_source/gpu_direct_storage.py

    
              del f

              # Conclusion

Contributor

svekars Apr 16, 2025

This part is rendered as a code comment: https://docs-preview.pytorch.org/pytorch/tutorials/3317/prototype/gpu_direct_storage.html#using-gpudirect-storage-with-torch-save-and-torch-load

The suggestion above should fix it.

svekars reviewed

prototype_source/gpu_direct_storage.py Outdated (resolved)

mikaylagawarecki requested a review from albanD

April 23, 2025 17:42

albanD approved these changes

Contributor

albanD left a comment

OK!

svekars reviewed

prototype_source/gpu_direct_storage.py Outdated (resolved)

mikaylagawarecki marked this pull request as ready for review

April 29, 2025 15:40

mikaylagawarecki and others added 6 commits

May 1, 2025 11:20


6e2bd60 GPUDirect Storage prototype tutorial

8a9fb9a address comments

a1bc4fc one more fix

85c0c94 address comments

dfa94ad Fix formatting

30082cf Update prototype_source/gpu_direct_storage.py

mikaylagawarecki force-pushed the gds_tutorial branch from 46d2edc to 30082cf

May 1, 2025 18:20

bbe1ba6 Merge branch 'main' into gds_tutorial

svekars approved these changes

svekars merged commit fd981a5 into main

17 checks passed

svekars deleted the gds_tutorial branch

May 2, 2025 16:01
