
Commit 82cdb4b

initial build
1 parent 8ca04d7 commit 82cdb4b

File tree: 479 files changed (+201,083 lines added, 0 removed)


Diff for: docs/.buildinfo

+4

@@ -0,0 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: 64a5804c254257c0ad7a8dc6e19c1484
tags: 645f666f9bcd5a90fca523b33c5a78b7

Diff for: docs/.nojekyll

Whitespace-only changes.

Diff for: docs/CNAME

+1

@@ -0,0 +1 @@
docs.pytorchlightning.kr

Diff for: docs/_images/figure-parity-times.png

30.8 KB

Diff for: docs/_images/lr_finder.png

17 KB

Diff for: docs/_images/profiler.png

127 KB

Diff for: docs/_modules/index.html

+690
Large diffs are not rendered by default.

Diff for: docs/_modules/pytorch_lightning/callbacks/base.html

+1,057
Large diffs are not rendered by default.

Diff for: docs/_modules/pytorch_lightning/core/datamodule.html

+953
Large diffs are not rendered by default.

Diff for: docs/_modules/pytorch_lightning/core/lightning.html

+2,738
Large diffs are not rendered by default.

Diff for: docs/_modules/pytorch_lightning/loggers/comet.html

+1,035
Large diffs are not rendered by default.

Diff for: docs/_modules/pytorch_lightning/loggers/csv_logs.html

+937
Large diffs are not rendered by default.

Diff for: docs/_modules/pytorch_lightning/loggers/mlflow.html

+968
Large diffs are not rendered by default.

Diff for: docs/_modules/pytorch_lightning/loggers/neptune.html

+1,369
Large diffs are not rendered by default.

Diff for: docs/_modules/pytorch_lightning/loggers/tensorboard.html

+1,013
Large diffs are not rendered by default.

Diff for: docs/_modules/pytorch_lightning/loggers/wandb.html

+1,206
Large diffs are not rendered by default.

Diff for: docs/_modules/pytorch_lightning/loops/base.html

+1,042
Large diffs are not rendered by default.

Diff for: docs/_modules/pytorch_lightning/trainer/trainer.html

+3,553
Large diffs are not rendered by default.
+165

@@ -0,0 +1,165 @@
:orphan:

.. _gpu_prepare:

########################################
Hardware agnostic training (preparation)
########################################

To train on CPU/GPU/TPU without changing your code, we need to build a few good habits :)

----

*****************************
Delete .cuda() or .to() calls
*****************************

Delete any calls to .cuda() or .to(device).

.. testcode::

    # before lightning
    def forward(self, x):
        x = x.cuda(0)
        layer_1.cuda(0)
        x_hat = layer_1(x)


    # after lightning
    def forward(self, x):
        x_hat = layer_1(x)

----

**********************************************
Init tensors using type_as and register_buffer
**********************************************
When you need to create a new tensor, use ``type_as``.
This will make your code scale to any arbitrary number of GPUs or TPUs with Lightning.

.. testcode::

    # before lightning
    def forward(self, x):
        z = torch.Tensor(2, 3)
        z = z.cuda(0)


    # with lightning
    def forward(self, x):
        z = torch.Tensor(2, 3)
        z = z.type_as(x)

The :class:`~pytorch_lightning.core.lightning.LightningModule` knows what device it is on. You can access the reference via ``self.device``.
Sometimes it is necessary to store tensors as module attributes. However, if they are not parameters, they will
remain on the CPU even if the module gets moved to a new device. To prevent that and remain device agnostic,
register the tensor as a buffer in your module's ``__init__`` method with :meth:`~torch.nn.Module.register_buffer`.

.. testcode::

    class LitModel(LightningModule):
        def __init__(self):
            ...
            self.register_buffer("sigma", torch.eye(3))
            # you can now access self.sigma anywhere in your module

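For illustration, ``self.device`` can also be used directly whenever a fresh tensor is needed outside of ``forward``. A minimal sketch (the ``training_step`` body and tensor name are purely illustrative):

.. code-block:: python

    class LitModel(LightningModule):
        def training_step(self, batch, batch_idx):
            x, y = batch
            # tensors created with device=self.device land on whatever device
            # the module currently lives on (CPU, GPU or TPU)
            weights = torch.ones(x.size(0), device=self.device)
            ...
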
----

***************
Remove samplers
***************

:class:`~torch.utils.data.distributed.DistributedSampler` is automatically handled by Lightning, so there is no need to add one to your dataloaders yourself.

See :ref:`replace-sampler-ddp` for more information.

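For example, returning a plain :class:`~torch.utils.data.DataLoader` is enough; when a distributed strategy is active, Lightning wraps it in a ``DistributedSampler`` for you. A minimal sketch (``self.train_dataset`` is a hypothetical dataset attribute):

.. code-block:: python

    from torch.utils.data import DataLoader


    class LitModel(LightningModule):
        def train_dataloader(self):
            # no DistributedSampler here; Lightning injects one when needed
            return DataLoader(self.train_dataset, batch_size=32, shuffle=True)
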
----

***************************************
Synchronize validation and test logging
***************************************

When running in distributed mode, we have to ensure that the validation and test step logging calls are synchronized across processes.
This is done by adding ``sync_dist=True`` to all ``self.log`` calls in the validation and test steps.
This ensures that each GPU worker has the same behaviour when tracking model checkpoints, which is important for later downstream tasks such as testing the best checkpoint across all workers.
The ``sync_dist`` option can also be used in logging calls during the step methods, but be aware that this can lead to significant communication overhead and slow down your training.

Note that if you use any built-in metrics or custom metrics that use `TorchMetrics <https://torchmetrics.readthedocs.io/>`_, these do not need to be updated and are handled automatically for you.

.. testcode::

    def validation_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = self.loss(logits, y)
        # Add sync_dist=True to sync logging across all GPU workers (may have performance impact)
        self.log("validation_loss", loss, on_step=True, on_epoch=True, sync_dist=True)


    def test_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = self.loss(logits, y)
        # Add sync_dist=True to sync logging across all GPU workers (may have performance impact)
        self.log("test_loss", loss, on_step=True, on_epoch=True, sync_dist=True)

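In contrast, a metric object from TorchMetrics reduces its state across processes itself, so no ``sync_dist`` flag is needed when logging it. A minimal sketch (assuming ``torchmetrics`` is installed and that ``Accuracy`` can be constructed this way for your task and version):

.. code-block:: python

    import torchmetrics


    class LitModel(LightningModule):
        def __init__(self):
            super().__init__()
            self.val_accuracy = torchmetrics.Accuracy()

        def validation_step(self, batch, batch_idx):
            x, y = batch
            logits = self(x)
            # the metric object handles cross-process reduction on its own
            self.val_accuracy(logits, y)
            self.log("val_acc", self.val_accuracy, on_epoch=True)
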
It is possible to perform some computation manually and log the reduced result on rank 0 as follows:

.. testcode::

    def test_step(self, batch, batch_idx):
        x, y = batch
        tensors = self(x)
        return tensors


    def test_epoch_end(self, outputs):
        mean = torch.mean(self.all_gather(outputs))

        # When logging only on rank 0, don't forget to add
        # ``rank_zero_only=True`` to avoid deadlocks on synchronization.
        if self.trainer.is_global_zero:
            self.log("my_reduced_metric", mean, rank_zero_only=True)

----

**********************
Make models pickleable
**********************
It's very likely your code is already `pickleable <https://docs.python.org/3/library/pickle.html>`_,
in which case no change is necessary.
However, if you run a distributed model and get the following error:

.. code-block::

    self._launch(process_obj)
    File "/net/software/local/python/3.6.5/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 47,
    in _launch reduction.dump(process_obj, fp)
    File "/net/software/local/python/3.6.5/lib/python3.6/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
    _pickle.PicklingError: Can't pickle <function <lambda> at 0x2b599e088ae8>:
    attribute lookup <lambda> on __main__ failed

This means that something in your model definition, transforms, optimizer, dataloader or callbacks cannot be pickled, and the following code will fail:

.. code-block:: python

    import pickle

    # some_object stands for the object that fails to pickle
    pickle.dumps(some_object)

This is a limitation of using multiple processes for distributed training within PyTorch.
To fix this issue, find the piece of code that cannot be pickled. The end of the stacktrace
is usually helpful. For example, in the stacktrace below there is a lambda function somewhere in the code
which cannot be pickled:

.. code-block::

    self._launch(process_obj)
    File "/net/software/local/python/3.6.5/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 47,
    in _launch reduction.dump(process_obj, fp)
    File "/net/software/local/python/3.6.5/lib/python3.6/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
    _pickle.PicklingError: Can't pickle [THIS IS THE THING TO FIND AND DELETE]:
    attribute lookup <lambda> on __main__ failed
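
As a quick way to narrow down the offending attribute, you can try pickling each piece of the object individually. A minimal sketch (``model`` here stands for whichever object failed to pickle; the loop is illustrative only):

.. code-block:: python

    import pickle

    for name, value in vars(model).items():
        try:
            pickle.dumps(value)
        except Exception as err:
            print(f"Cannot pickle attribute '{name}': {err}")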

Diff for: docs/_sources/accelerators/gpu.rst.txt

+63

@@ -0,0 +1,63 @@
.. _gpu:

Accelerator: GPU training
=========================

.. raw:: html

    <div class="display-card-container">
    <div class="row">

.. Add callout items below this line

.. displayitem::
    :header: Prepare your code (Optional)
    :description: Prepare your code to run on any hardware
    :col_css: col-md-4
    :button_link: accelerator_prepare.html
    :height: 150
    :tag: basic

.. displayitem::
    :header: Basic
    :description: Learn the basics of single and multi-GPU training.
    :col_css: col-md-4
    :button_link: gpu_basic.html
    :height: 150
    :tag: basic

.. displayitem::
    :header: Intermediate
    :description: Learn about different distributed strategies, torchelastic and how to optimize communication layers.
    :col_css: col-md-4
    :button_link: gpu_intermediate.html
    :height: 150
    :tag: intermediate

.. displayitem::
    :header: Advanced
    :description: Train 1 trillion+ parameter models with these techniques.
    :col_css: col-md-4
    :button_link: gpu_advanced.html
    :height: 150
    :tag: advanced

.. displayitem::
    :header: Expert
    :description: Develop new strategies for training and deploying larger and larger models.
    :col_css: col-md-4
    :button_link: gpu_expert.html
    :height: 150
    :tag: expert

.. displayitem::
    :header: FAQ
    :description: Frequently asked questions about GPU training.
    :col_css: col-md-4
    :button_link: gpu_faq.html
    :height: 150

.. raw:: html

    </div>
    </div>

Diff for: docs/_sources/accelerators/gpu_advanced.rst.txt

+16

@@ -0,0 +1,16 @@
:orphan:

.. _gpu_advanced:

GPU training (Advanced)
=======================
**Audience:** Users looking to scale massive models (i.e., 1 trillion parameters).

----

For experts pushing the state of the art in model development, Lightning offers various techniques to enable trillion+ parameter-scale models.

----

..
    .. include:: ../advanced/model_parallel.rst

Diff for: docs/_sources/accelerators/gpu_basic.rst.txt

+97

@@ -0,0 +1,97 @@
:orphan:

.. _gpu_basic:

GPU training (Basic)
====================
**Audience:** Users looking to save money and run large models faster using single or multiple GPUs.

----

What is a GPU?
--------------
A Graphics Processing Unit (GPU) is a specialized hardware accelerator designed to speed up mathematical computations used in gaming and deep learning.

----

Train on 1 GPU
--------------

Make sure you're running on a machine with at least one GPU. There's no need to specify any NVIDIA flags
as Lightning will do it for you.

.. testcode::
    :skipif: torch.cuda.device_count() < 1

    trainer = Trainer(accelerator="gpu", devices=1)

----------------

.. _multi_gpu:

Train on multiple GPUs
----------------------

To use multiple GPUs, set the number of devices in the Trainer or the index of the GPUs.

.. code::

    trainer = Trainer(accelerator="gpu", devices=4)

Choosing GPU devices
^^^^^^^^^^^^^^^^^^^^

You can select the GPU devices using ranges, a list of indices or a string containing
a comma separated list of GPU ids:

.. testsetup::

    k = 1

.. testcode::
    :skipif: torch.cuda.device_count() < 2

    # DEFAULT (int) specifies how many GPUs to use per node
    Trainer(accelerator="gpu", devices=k)

    # Above is equivalent to
    Trainer(accelerator="gpu", devices=list(range(k)))

    # Specify which GPUs to use (don't use when running on cluster)
    Trainer(accelerator="gpu", devices=[0, 1])

    # Equivalent using a string
    Trainer(accelerator="gpu", devices="0, 1")

    # To use all available GPUs put -1 or '-1'
    # equivalent to list(range(torch.cuda.device_count()))
    Trainer(accelerator="gpu", devices=-1)

The table below lists examples of possible input formats and how they are interpreted by Lightning.

+------------------+-----------+---------------------+---------------------------------+
| ``devices``      | Type      | Parsed              | Meaning                         |
+==================+===========+=====================+=================================+
| 3                | int       | [0, 1, 2]           | first 3 GPUs                    |
+------------------+-----------+---------------------+---------------------------------+
| -1               | int       | [0, 1, 2, ...]      | all available GPUs              |
+------------------+-----------+---------------------+---------------------------------+
| [0]              | list      | [0]                 | GPU 0                           |
+------------------+-----------+---------------------+---------------------------------+
| [1, 3]           | list      | [1, 3]              | GPUs 1 and 3                    |
+------------------+-----------+---------------------+---------------------------------+
| "3"              | str       | [0, 1, 2]           | first 3 GPUs                    |
+------------------+-----------+---------------------+---------------------------------+
| "1, 3"           | str       | [1, 3]              | GPUs 1 and 3                    |
+------------------+-----------+---------------------+---------------------------------+
| "-1"             | str       | [0, 1, 2, ...]      | all available GPUs              |
+------------------+-----------+---------------------+---------------------------------+

.. note::

    When specifying the number of ``devices`` as an integer ``devices=k``, setting the trainer flag
    ``auto_select_gpus=True`` will automatically help you find ``k`` GPUs that are not
    occupied by other processes. This is especially useful when GPUs are configured
    to be in "exclusive mode", such that only one process at a time can access them.
    For more details see the :doc:`trainer guide <../common/trainer>`.
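
For illustration, a minimal sketch (assuming a Lightning version that still accepts the ``auto_select_gpus`` Trainer flag):

.. code-block:: python

    # ask Lightning to pick 2 unoccupied GPUs instead of always using indices 0 and 1
    trainer = Trainer(accelerator="gpu", devices=2, auto_select_gpus=True)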

Diff for: docs/_sources/accelerators/gpu_expert.rst.txt

+21

@@ -0,0 +1,21 @@
:orphan:

.. _gpu_expert:

GPU training (Expert)
=====================
**Audience:** Experts creating new scaling techniques such as DeepSpeed or FSDP.

----

Lightning enables experts who research new ways of optimizing distributed training and inference to create new strategies and plug them into Lightning.

For example, Lightning worked closely with the Microsoft team to develop a DeepSpeed integration and with the Facebook (Meta) team to develop an FSDP integration.

----

.. include:: ../advanced/strategy_registry.rst

----

.. include:: ../extensions/strategy.rst

0 commit comments
