Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Deploying to gh-pages from @ a9a4180 🚀
0 parents
commit a50d6b2
Showing 102 changed files with 8,650 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
# Sphinx build info version 1 | ||
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done. | ||
config: b1d4f130b437b34669c086e08878ee12 | ||
tags: 645f666f9bcd5a90fca523b33c5a78b7 |
Empty file.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,97 @@ | ||
.. _hivemind_intermediate: | ||
|
||
Training on unreliable mixed GPUs across the internet (Advanced) | ||
================================================================ | ||
|
||
Reducing Communication By Overlapping Communication | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
We can reduce the impact of communication across all machines by overlapping communication with our training iterations. In short, we enable communication to happen | ||
in the background of training. | ||
|
||
Overlap Gradient and State Averaging | ||
"""""""""""""""""""""""""""""""""""" | ||
|
||
When the target batch size is reached, all processes taking part in the step send their gradients and model states to each other. Setting the ``delay_state_averaging``, | ||
``delay_grad_averaging``, and ``delay_optimizer_step`` flags on the strategy moves this exchange into the background, so training continues (on slightly outdated weights) | ||
while communication overlaps with computation. | ||
|
||
.. warning:: | ||
Enabling overlapping communication slightly affects convergence, because training continues on slightly outdated weights. | ||
|
||
.. note:: | ||
Enabling these flags means that you must pass in a ``scheduler_fn`` to the ``HivemindStrategy`` instead of relying on a scheduler from ``configure_optimizers``. | ||
The optimizer is re-created by Hivemind, and as a result, the scheduler has to be re-created. | ||
|
||
.. code-block:: python | ||
import torch | ||
from functools import partial | ||
from pytorch_lightning import Trainer | ||
from lightning_hivemind.strategy import HivemindStrategy | ||
trainer = Trainer( | ||
strategy=HivemindStrategy( | ||
target_batch_size=8192, | ||
delay_state_averaging=True, | ||
delay_grad_averaging=True, | ||
delay_optimizer_step=True, | ||
offload_optimizer=True, # required to delay averaging | ||
scheduler_fn=partial(torch.optim.lr_scheduler.ExponentialLR, gamma=...), | ||
), | ||
accelerator="gpu", | ||
devices=1, | ||
) | ||
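To make the idea of delayed averaging concrete, here is a minimal sketch in plain Python. It is a toy model, not Hivemind's implementation: ``average`` stands in for the network exchange, ``train`` is a hypothetical training loop, and we assume the averaged result of one round is applied during the next round, so each round computes on weights that are one update behind.

```python
from concurrent.futures import ThreadPoolExecutor


def average(peer_values):
    """Stand-in for averaging gradients across peers over the network."""
    return sum(peer_values) / len(peer_values)


def train(num_rounds, peer_grads, lr=0.1):
    """Toy loop: averaging runs in the background, results arrive one round late."""
    weight = 1.0
    pending = None  # handle for an averaging round still running in the background
    history = []    # the weight each round actually computed with
    with ThreadPoolExecutor(max_workers=1) as pool:
        for _ in range(num_rounds):
            # Start averaging this round's gradients without blocking.
            future = pool.submit(average, peer_grads)
            # Meanwhile, apply the result of the previous round (one step stale).
            if pending is not None:
                weight -= lr * pending.result()
            history.append(weight)
            pending = future
        if pending is not None:
            weight -= lr * pending.result()  # flush the final round
    return weight, history


final_weight, seen_weights = train(num_rounds=3, peer_grads=[1.0, 3.0])
print(seen_weights, final_weight)
```

The first round still sees the initial weight because no averaged update has arrived yet. That one-round lag is the staleness the warning above refers to.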
Reducing GPU Memory requirements by re-using buffers & CPU offloading | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
We can also offload the optimizer state to the CPU whilst re-using gradient buffers to reduce the memory requirement for machines. | ||
|
||
Offloading Optimizer State to the CPU | ||
""""""""""""""""""""""""""""""""""""" | ||
|
||
Offloading the optimizer state to the CPU works like DeepSpeed ZeRO Stage 2 Offload: we save GPU memory by keeping all optimizer states on the CPU. | ||
|
||
.. note:: | ||
Enabling these flags means that you must pass in a ``scheduler_fn`` to the ``HivemindStrategy`` instead of relying on a scheduler from ``configure_optimizers``. | ||
The optimizer is re-created by Hivemind, and as a result, the scheduler has to be re-created. | ||
|
||
We suggest enabling offloading and overlapping communication to hide the additional overhead from having to communicate with the CPU. | ||
|
||
.. code-block:: python | ||
import torch | ||
from functools import partial | ||
from pytorch_lightning import Trainer | ||
from lightning_hivemind.strategy import HivemindStrategy | ||
trainer = Trainer( | ||
strategy=HivemindStrategy( | ||
target_batch_size=8192, | ||
offload_optimizer=True, | ||
scheduler_fn=partial(torch.optim.lr_scheduler.ExponentialLR, gamma=...), | ||
), | ||
accelerator="gpu", | ||
devices=1, | ||
) | ||
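As a rough guide to how much GPU memory offloading can free, the sketch below estimates the size of the optimizer state. It assumes an Adam-style optimizer that keeps two Float32 running averages per parameter. The helper names and the 1B-parameter model are hypothetical, and the real saving depends on your optimizer and precision.

```python
def adam_state_bytes(num_parameters, bytes_per_value=4):
    """Adam keeps two running averages (first and second moment) per parameter."""
    return 2 * num_parameters * bytes_per_value


def to_gib(num_bytes):
    """Convert a byte count to gibibytes."""
    return num_bytes / 2 ** 30


num_parameters = 1_000_000_000  # hypothetical 1B-parameter model
state_gib = to_gib(adam_state_bytes(num_parameters))
print(f"{state_gib:.2f} GiB of optimizer state moved off the GPU")
```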
Re-using Gradient Buffers | ||
""""""""""""""""""""""""" | ||
|
||
By default, Hivemind accumulates gradients in a separate buffer. This means additional GPU memory is required to store gradients. You can enable re-using the model parameter gradient buffers by passing ``reuse_grad_buffers=True`` to the ``HivemindStrategy``. | ||
|
||
.. warning:: | ||
The ``HivemindStrategy`` will override ``zero_grad`` in your ``LightningModule`` to have no effect. This is because gradients are accumulated in the model | ||
and Hivemind manages when they need to be cleared. | ||
|
||
.. code-block:: python | ||
from pytorch_lightning import Trainer | ||
from lightning_hivemind.strategy import HivemindStrategy | ||
trainer = Trainer( | ||
strategy=HivemindStrategy(target_batch_size=8192, reuse_grad_buffers=True), accelerator="gpu", devices=1 | ||
) |
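The toy example below illustrates why ``zero_grad`` has to become a no-op when gradients are accumulated directly in the model's gradient buffers. ``ToyParameter`` and ``accumulate`` are hypothetical stand-ins written in plain Python. They only mimic the PyTorch convention that a backward pass adds into ``.grad`` and do not reflect Hivemind's internals.

```python
class ToyParameter:
    """Stand-in for a model parameter with a PyTorch-style ``.grad`` buffer."""

    def __init__(self):
        self.grad = 0.0

    def backward(self, micro_batch_grad):
        # Like autograd, backward adds into the existing buffer.
        self.grad += micro_batch_grad

    def zero_grad(self):
        self.grad = 0.0


def accumulate(micro_batch_grads, call_zero_grad):
    """Accumulate gradients in the parameter's own buffer across micro-batches."""
    param = ToyParameter()
    for grad in micro_batch_grads:
        if call_zero_grad:
            param.zero_grad()  # what an ordinary training loop does every step
        param.backward(grad)
    return param.grad


grads = [0.5, 0.25, 0.25]
print(accumulate(grads, call_zero_grad=False))  # the full sum survives
print(accumulate(grads, call_zero_grad=True))   # only the last micro-batch survives
```

If the usual per-step ``zero_grad`` were left active, everything accumulated towards the target batch size would be wiped before it could be averaged.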
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,85 @@ | ||
.. _hivemind_expert: | ||
|
||
Training on unreliable mixed GPUs across the internet (Expert) | ||
============================================================== | ||
|
||
Using Compression to Optimize Communications | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
Below are some ways to reduce communication when training collaboratively. As the size of your model increases, communication bottlenecks become more apparent. | ||
|
||
Compress Gradients & State | ||
"""""""""""""""""""""""""" | ||
|
||
Hivemind allows you to compress gradients and states before sending them to other machines. This helps reduce the communication overhead substantially when training across the internet. | ||
|
||
Below, we enable Float16 compression, which casts gradients and states to Float16 before sending them to other machines. | ||
|
||
.. note:: | ||
Compressing gradients can affect convergence if you're lowering the precision (e.g., training in Float32 but compressing gradients to Float16). | ||
|
||
.. code-block:: python | ||
from hivemind import Float16Compression | ||
from pytorch_lightning import Trainer | ||
from lightning_hivemind.strategy import HivemindStrategy | ||
trainer = Trainer( | ||
strategy=HivemindStrategy( | ||
target_batch_size=8192, | ||
grad_compression=Float16Compression(), | ||
state_averaging_compression=Float16Compression(), | ||
), | ||
accelerator="gpu", | ||
devices=1, | ||
) | ||
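To see what Float16 compression trades away, the sketch below round-trips a few values through half precision using only the Python standard library. It is an illustration of the number format, not of Hivemind's ``Float16Compression``, and the helper names and sample values are made up for the example.

```python
import struct


def to_float16_bytes(values):
    """Pack Python floats as IEEE 754 half-precision values (2 bytes each)."""
    return struct.pack(f"<{len(values)}e", *values)


def from_float16_bytes(payload):
    """Unpack half-precision bytes back into Python floats."""
    return list(struct.unpack(f"<{len(payload) // 2}e", payload))


gradients = [0.1, -1.5, 3.14159, 1e-4]
fp32_payload = struct.pack(f"<{len(gradients)}f", *gradients)
fp16_payload = to_float16_bytes(gradients)
restored = from_float16_bytes(fp16_payload)

print(len(fp32_payload), len(fp16_payload))  # 16 bytes vs 8 bytes
print([abs(a - b) for a, b in zip(gradients, restored)])  # rounding error per value
```

The payload halves in size, and values that are not exactly representable in half precision (such as ``0.1``) come back slightly changed.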
A slightly more advanced scheme is dynamic compression based on tensor size. Below, we enable 8-bit quantization for large tensors and Float16 compression for small ones, reducing communication bottlenecks even further. | ||
|
||
Size Adaptive Compression has been used successfully in a variety of Hivemind applications, but 8-bit quantization loses more precision than Float16 compression. | ||
|
||
.. code-block:: python | ||
from hivemind import Float16Compression, SizeAdaptiveCompression, Uniform8BitQuantization | ||
from pytorch_lightning import Trainer | ||
from lightning_hivemind.strategy import HivemindStrategy | ||
# tensors with at least `threshold` elements use 8-bit quantization, smaller ones use Float16 | ||
compression = SizeAdaptiveCompression( | ||
threshold=2 ** 16 + 1, less=Float16Compression(), greater_equal=Uniform8BitQuantization() | ||
) | ||
trainer = Trainer( | ||
strategy=HivemindStrategy( | ||
target_batch_size=8192, | ||
grad_compression=compression, | ||
state_averaging_compression=compression, | ||
), | ||
accelerator="gpu", | ||
devices=1, | ||
) | ||
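The sketch below shows the two ideas behind this configuration in plain Python: a size-based choice between schemes, and a simple uniform 8-bit quantizer. All function names are hypothetical and the quantizer is a deliberately simplified min/max scheme, so treat it as an illustration rather than a description of Hivemind's ``Uniform8BitQuantization``.

```python
def choose_compression(num_elements, threshold=2 ** 16 + 1):
    """Hypothetical dispatcher: large tensors get 8-bit codes, small ones Float16."""
    return "uniform8bit" if num_elements >= threshold else "float16"


def uniform_8bit_quantize(values):
    """Map floats onto 256 evenly spaced levels between min and max."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 for constant input
    codes = bytes(round((v - lo) / scale) for v in values)
    return codes, lo, scale


def uniform_8bit_dequantize(codes, lo, scale):
    """Recover approximate floats from the 8-bit codes."""
    return [lo + code * scale for code in codes]


print(choose_compression(1_000), choose_compression(1_000_000))

values = [0.0, 0.5, 1.0]
codes, lo, scale = uniform_8bit_quantize(values)
print(list(codes), uniform_8bit_dequantize(codes, lo, scale))
```

Each value is sent as a single byte plus two shared floats (``lo`` and ``scale``), at the cost of an error of up to half a quantization step per value.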
PowerSGD | ||
"""""""" | ||
|
||
`PowerSGD <https://arxiv.org/abs/1905.13727>`_ is a technique to reduce distributed communication of gradients across processes. | ||
In short, PowerSGD uses a low-rank approximation to compress gradients before running an `all-reduce` step to sync gradients across all processes. | ||
|
||
.. note:: | ||
Though PowerSGD can impact convergence, it can also substantially reduce communication between processes. | ||
|
||
.. code-block:: python | ||
from pytorch_lightning import Trainer | ||
from lightning_hivemind.strategy import HivemindStrategy | ||
from functools import partial | ||
from hivemind.optim.power_sgd_averager import PowerSGDGradientAverager | ||
trainer = Trainer( | ||
strategy=HivemindStrategy( | ||
target_batch_size=8192, | ||
grad_averager_factory=partial(PowerSGDGradientAverager, averager_rank=32, min_compression_ratio=0.5), | ||
), | ||
accelerator="gpu", | ||
devices=1, | ||
) |
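A back-of-the-envelope calculation helps when picking ``averager_rank``. A rank-``r`` approximation of a ``rows x cols`` gradient matrix is sent as two factors holding ``r * (rows + cols)`` values instead of ``rows * cols``. The helpers below are hypothetical and only do this arithmetic. Reading ``min_compression_ratio=0.5`` as "only compress tensors that save at least half of their values" is our assumption, so check the Hivemind source for the exact rule.

```python
def low_rank_values(rows, cols, rank):
    """Values sent when a (rows x cols) gradient is factored into two matrices.

    One factor has shape (rows, rank) and the other (cols, rank).
    """
    return rank * (rows + cols)


def saved_fraction(rows, cols, rank):
    """Fraction of values saved relative to sending the full matrix."""
    return 1 - low_rank_values(rows, cols, rank) / (rows * cols)


for shape in [(4096, 4096), (1024, 64), (64, 64)]:
    print(shape, round(saved_fraction(*shape, rank=32), 4))
```

Large square layers benefit the most, while small or narrow layers save little or nothing at rank 32 and are better sent uncompressed.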
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,15 @@ | ||
.. Lightning-AI-Sandbox documentation master file, created by | ||
sphinx-quickstart on Wed Mar 25 21:34:07 2020. | ||
You can adapt this file completely to your liking, but it should at least | ||
contain the root `toctree` directive. | ||
.. include:: readme.rst | ||
|
||
|
||
.. toctree:: | ||
:maxdepth: 1 | ||
:name: start | ||
:caption: Start here | ||
|
||
advanced | ||
expert |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,62 @@ | ||
:orphan: | ||
|
||
################################################################ | ||
Hivemind - training on unreliable mixed GPUs across the internet | ||
################################################################ | ||
|
||
Collaborative Training tries to solve the need for top-tier multi-GPU servers by allowing you to train across unreliable machines, | ||
such as local machines or even preemptible cloud compute across the internet. | ||
|
||
Under the hood, we use `Hivemind <https://github.com/learning-at-home/hivemind>`__ which provides de-centralized training across the internet. | ||
|
||
.. warning:: This is an :ref:`experimental <versioning:Experimental API>` feature. | ||
|
||
|
||
To use Collaborative Training, you first need to install this extension. | ||
|
||
.. code-block:: bash | ||
pip install lightning-hivemind | ||
This will install both the `Hivemind <https://pypi.org/project/hivemind/>`__ package and the ``HivemindStrategy`` for the Lightning Trainer. | ||
|
||
Reducing Communication By Overlapping Communication | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
We can reduce the impact of communication across all machines by overlapping communication with our training iterations. In short, we enable communication to happen in the background of training. | ||
|
||
Overlap Gradient and State Averaging | ||
"""""""""""""""""""""""""""""""""""" | ||
|
||
When the target batch size is reached, all processes taking part in the step send their gradients and model states to each other. Setting the ``delay_state_averaging``, | ||
``delay_grad_averaging``, and ``delay_optimizer_step`` flags on the strategy moves this exchange into the background, so training continues (on slightly outdated weights) | ||
while communication overlaps with computation. | ||
|
||
.. warning:: | ||
Enabling overlapping communication slightly affects convergence, because training continues on slightly outdated weights. | ||
|
||
.. note:: | ||
Enabling these flags means that you must pass in a ``scheduler_fn`` to the ``HivemindStrategy`` instead of relying on a scheduler from ``configure_optimizers``. | ||
The optimizer is re-created by Hivemind, and as a result, the scheduler has to be re-created. | ||
|
||
.. code-block:: python | ||
import torch | ||
from functools import partial | ||
from lightning import Trainer | ||
from lightning_hivemind.strategy import HivemindStrategy | ||
trainer = Trainer( | ||
strategy=HivemindStrategy( | ||
target_batch_size=8192, | ||
delay_state_averaging=True, | ||
delay_grad_averaging=True, | ||
delay_optimizer_step=True, | ||
offload_optimizer=True, # required to delay averaging | ||
scheduler_fn=partial(torch.optim.lr_scheduler.ExponentialLR, gamma=...), | ||
), | ||
accelerator="gpu", | ||
devices=1, | ||
) | ||
For more information on the strategy capabilities, see the `lightning-hivemind <https://github.com/Lightning-Universe/lightning-hivemind>`__ repo. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,47 @@ | ||
Lightning + Hivemind | ||
==================== | ||
|
||
Collaborative Training tries to solve the need for top-tier multi-GPU | ||
servers by allowing you to train across unreliable machines, such as | ||
local machines or even preemptible cloud computing across the internet. | ||
|
||
Under the hood, we use | ||
`Hivemind <https://github.com/learning-at-home/hivemind>`__, which | ||
provides de-centralized training across the internet. | ||
|
||
To use Collaborative Training, you first need to install this extension. | ||
|
||
.. code:: bash | ||
pip install -U lightning-hivemind | ||
The ``HivemindStrategy`` accumulates gradients from all collaborating | ||
processes until they reach a ``target_batch_size``. By default, we use | ||
the batch size of the first batch to determine what each local machine | ||
batch contributes towards the ``target_batch_size``. Once the | ||
``target_batch_size`` is reached, an optimizer step is made on all | ||
processes. | ||
|
||
When using ``HivemindStrategy``, note that you cannot use gradient | ||
accumulation (``accumulate_grad_batches``). This is because Hivemind | ||
manages accumulation internally. | ||
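
To make the accumulation rule concrete, the sketch below simulates peers contributing local batches until the global ``target_batch_size`` is reached. It is a toy model in plain Python, not Hivemind's implementation: ``simulate_global_step``, the peer batch sizes, and the round-robin order are assumptions made for illustration.

```python
from itertools import cycle


def simulate_global_step(local_batch_sizes, target_batch_size):
    """Count how many local batches each peer contributes before one global step.

    Toy model: peers take turns (round-robin), and each turn adds that peer's
    local batch size to a shared sample counter.
    """
    contributed = [0] * len(local_batch_sizes)
    samples = 0
    for peer in cycle(range(len(local_batch_sizes))):
        if samples >= target_batch_size:
            break
        samples += local_batch_sizes[peer]
        contributed[peer] += 1
    return contributed, samples


batches_per_peer, total_samples = simulate_global_step([32, 64, 16], target_batch_size=8192)
print(batches_per_peer, total_samples)
```

Peers with different local batch sizes contribute different numbers of samples, and the optimizer step fires once the shared counter crosses the target.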
|
||
.. code:: py | ||
from lightning import Trainer | ||
from lightning_hivemind.strategy import HivemindStrategy | ||
trainer = Trainer(strategy=HivemindStrategy(target_batch_size=8192), accelerator="gpu", devices=1) | ||
Followed by: | ||
|
||
.. code:: bash | ||
python train.py | ||
# Other machines can connect by running the same command: | ||
# INITIAL_PEERS=... python train.py | ||
# or passing the peers to the strategy: | ||
# HivemindStrategy(initial_peers=...) | ||
A helper message is printed once your training begins, showing you how | ||
to train on other machines using the same code. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,26 @@ | ||
:orphan: | ||
|
||
.. warning:: | ||
|
||
This file is a placeholder for the real versioning policy page in the main | ||
documentation. It is here so we can cross link to it from the changelog. | ||
|
||
.. _versioning: | ||
|
||
Versioning Policy | ||
################# | ||
|
||
API Stability | ||
************* | ||
|
||
Stable API | ||
---------- | ||
|
||
Experimental API | ||
---------------- | ||
|
||
API Evolution | ||
************* | ||
|
||
Compatibility matrix | ||
******************** |