
Releases: huggingface/accelerate

v0.13.0 Launcher update (multinode and GPU selection) and multiple bug fixes

05 Oct 18:47
a54cd0a

Better multinode support in the launcher

The accelerate launch command did not work well for distributed training across several machines. This is fixed in this version.

Launch training on specific GPUs only

Instead of prefixing your launch command with CUDA_VISIBLE_DEVICES=xxx you can now specify the GPUs you want to use in your Accelerate config.

Better tracebacks and rich support

The tracebacks are now cleaned up to avoid printing the same error several times, and rich is integrated as an optional dependency.

v0.12.0 New doc, gather_for_metrics, balanced device map and M1 support

04 Aug 13:14

New documentation

The whole documentation has been revamped. Go take a look!

New gather_for_metrics method

When doing distributed evaluation, the dataloader loops back to the beginning of the dataset to make batches whose size is a round multiple of the number of processes. As a result, the gathered predictions are slightly longer than the dataset, which used to require manual truncation. This is now all done behind the scenes if you replace the gather you did in evaluation with gather_for_metrics.
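
To make the padding concrete, here is a minimal pure-Python sketch of the behavior described above. It is not Accelerate's implementation: the helper names (shard_with_wraparound, plain_gather, gather_and_truncate) are hypothetical, and plain integers stand in for samples and predictions.

```python
def shard_with_wraparound(num_samples, num_processes):
    """Assign sample indices to processes, wrapping around to the start of the
    dataset so that every process receives the same number of samples."""
    padded = -(-num_samples // num_processes) * num_processes  # round up to a multiple
    indices = [i % num_samples for i in range(padded)]
    return [indices[rank::num_processes] for rank in range(num_processes)]


def plain_gather(shards):
    """Interleave per-process results back into global order, duplicates included."""
    return [item for step in zip(*shards) for item in step]


def gather_and_truncate(shards, num_samples):
    """Drop the wrapped-around duplicates (the step that used to be manual)."""
    return plain_gather(shards)[:num_samples]


shards = shard_with_wraparound(num_samples=10, num_processes=4)
print(len(plain_gather(shards)))        # 12: two samples were seen twice
print(gather_and_truncate(shards, 10))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

With 10 samples and 4 processes, a plain gather returns 12 items. Truncating to the dataset length recovers exactly one prediction per sample.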

Balanced device maps

When loading big models for inference, device_map="auto" used to fill the GPUs sequentially, making it hard to use a batch size > 1. It now balances the weights evenly on the GPUs so if you have more GPU space than the model size, you can do predictions with a bigger batch size!
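
The difference between the two strategies can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not the algorithm Accelerate uses: the function names and the layer sizes are made up, and any overflow simply lands on the last GPU.

```python
def sequential_fill(layer_sizes, gpu_capacity, num_gpus):
    """Fill GPU 0 completely before moving on to GPU 1, and so on."""
    device_map, device, used = {}, 0, 0
    for name, size in layer_sizes.items():
        # Switch to the next GPU when this layer no longer fits (if one is left).
        if used + size > gpu_capacity and device < num_gpus - 1:
            device, used = device + 1, 0
        device_map[name] = device
        used += size
    return device_map


def balanced_fill(layer_sizes, gpu_capacity, num_gpus):
    """Cap each GPU at an even share of the model so the weights are spread out."""
    share = -(-sum(layer_sizes.values()) // num_gpus)  # ceiling division
    return sequential_fill(layer_sizes, min(gpu_capacity, share), num_gpus)


layers = {f"block_{i}": 2 for i in range(8)}  # 8 layers of 2 GB each (made-up sizes)
print(sequential_fill(layers, gpu_capacity=16, num_gpus=2))  # every block on GPU 0
print(balanced_fill(layers, gpu_capacity=16, num_gpus=2))    # blocks 0-3 on GPU 0, 4-7 on GPU 1
```

In the sequential case GPU 0 is completely full and has no room left for activations, while the balanced case leaves half of each GPU free for larger batches.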

M1 GPU support

Accelerate now supports M1 GPUs. To learn more about how to set up your environment, see the documentation.

Significant community contributions

The following contributors have made significant changes to the library over the last release:

  • @sywangyi
    • ccl version check and import different module according to version (#567)
    • set default num_cpu_threads_per_process to improve oob performance (#562)
    • fix some parameter setting does not work for CPU DDP and bf16 fail in… (#527)
  • @ZhiyuanChen
    • add on_main_process decorators (#488)

v0.11.0 Gradient Accumulation and SageMaker Data Parallelism

18 Jul 13:02
eebeb59

Gradient Accumulation

Accelerate now handles gradient accumulation if you want it to: just pass gradient_accumulation_steps=xxx when instantiating the Accelerator and put your whole training loop step under a with accelerator.accumulate(model): block. Accelerate will then handle the loss re-scaling and gradient accumulation for you (avoiding slowdowns in distributed training, since gradients only need to be synced when you actually step). More details in the documentation.

  • Add gradient accumulation doc by @muellerzr in #511
  • Make gradient accumulation work with dispatched dataloaders by @muellerzr in #510
  • Introduce automatic gradient accumulation wrapper + fix a few test issues by @muellerzr in #484
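
Why the loss needs re-scaling can be checked numerically. The sketch below is plain Python and does not use Accelerate: it assumes a one-parameter model y = w * x with a mean squared error loss, and the function names are hypothetical. It shows that summing micro-batch gradients, each scaled by 1 / accumulation_steps, reproduces the gradient of the full batch.

```python
def mse_grad(w, xs, ys):
    """Gradient of the mean squared error of the model y = w * x with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, accumulation_steps):
    """Sum the gradients of equally sized micro-batches, each scaled by
    1 / accumulation_steps (the loss re-scaling)."""
    size = len(xs) // accumulation_steps
    grad = 0.0
    for step in range(accumulation_steps):
        chunk = slice(step * size, (step + 1) * size)
        grad += mse_grad(w, xs[chunk], ys[chunk]) / accumulation_steps
    return grad  # an optimizer step would happen here, once per accumulation cycle


xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
full = mse_grad(0.5, xs, ys)
accumulated = accumulated_grad(0.5, xs, ys, accumulation_steps=2)
print(abs(full - accumulated) < 1e-9)  # True
```

Without the division by accumulation_steps, the accumulated gradient would be accumulation_steps times too large, which is the bookkeeping the accumulate context manager is described as taking over.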

Support for SageMaker Data parallelism

Accelerate now supports SageMaker's specific brand of data parallelism.

  • SageMaker enhancements to allow custom docker image, input channels referring to s3/remote data locations and metrics logging by @pacman100 in #504
  • SageMaker DP Support by @pacman100 in #494

v0.10.0 DeepSpeed integration revamp and TPU speedup

15 Jun 18:07

This release adds two major new features: the DeepSpeed integration has been revamped to match the one in Transformers Trainer, with multiple new options unlocked, and the TPU integration has been sped up.

This version also officially stops supporting Python 3.6 and requires Python 3.7+

DeepSpeed integration revamp

Users can now specify a DeepSpeed config file when they want to use DeepSpeed, which unlocks many new options. More details in the new documentation.

TPU speedup

If you're using TPUs we have sped up the dataloaders and models quite a bit, on top of a few bug fixes.

  • Revamp TPU internals to be more efficient + enable mixed precision types by @muellerzr in #441

v0.9.0: Refactor utils to use in Transformers

20 May 17:54

This release offers no significant new API; it is only needed to give Transformers access to some utils.

v0.8.0: Big model inference

12 May 15:01

Big model inference

To handle very large models, new functionality has been added in Accelerate:

  • a context manager to initialize empty models
  • a function to load a sharded checkpoint directly on the right devices
  • a set of custom hooks that allow execution of a model split on different devices, as well as CPU or disk offload
  • a magic method that auto-determines a device map for a given model, maximizing the use of GPU space first, then available RAM, before using disk offload as a last resort.
  • a function that wraps the last three blocks in one simple call (load_checkpoint_and_dispatch)

See more in the documentation
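
The "GPU first, then RAM, then disk" priority can be illustrated with a short pure-Python sketch. It is a simplified stand-in, not the library's algorithm: the function name, the layer sizes, and the memory budgets are assumptions chosen for the example.

```python
def infer_tiered_device_map(layer_sizes, gpu_memory, cpu_memory):
    """Place layers in order: fill each GPU, then CPU RAM, then fall back to disk."""
    tiers = [(gpu, capacity) for gpu, capacity in enumerate(gpu_memory)]
    tiers.append(("cpu", cpu_memory))
    device_map, tier, used = {}, 0, 0
    for name, size in layer_sizes.items():
        # Move to the next tier while the current one cannot hold this layer.
        while tier < len(tiers) and used + size > tiers[tier][1]:
            tier, used = tier + 1, 0
        if tier < len(tiers):
            device_map[name] = tiers[tier][0]
            used += size
        else:
            device_map[name] = "disk"  # last resort: no memory budget left
    return device_map


layers = {f"block_{i}": 4 for i in range(6)}  # six layers of 4 GB each (made-up sizes)
print(infer_tiered_device_map(layers, gpu_memory=[8], cpu_memory=8))
# blocks 0-1 go to GPU 0, blocks 2-3 to "cpu", blocks 4-5 to "disk"
```

Layers are kept in order so that consecutive blocks stay on the same device where possible, which keeps the number of transfers between devices small during a forward pass.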

v0.7.1 Patch release

29 Apr 13:16

  • Fix FSDP config in cluster (#331)
  • Add guards for batch size finder (#334)
  • Patchfix infinite loop (#335)

v0.7.0: Logging API, FSDP, batch size finder and examples revamp

28 Apr 17:14

Logging API

Use any of your favorite logging libraries (TensorBoard, Wandb, CometML...) with just a few lines of code inside your training scripts with Accelerate. All details are in the documentation.

Support for FSDP (fully sharded DataParallel)

PyTorch recently released a new model wrapper for sharded DDP training called FSDP. This release adds support for it (note that it doesn't work with mixed precision yet). See all caveats in the documentation.

Batch size finder

Say goodbye to the CUDA OOM errors with the new find_executable_batch_size decorator. Just decorate your training function and pick a starting batch size, then let Accelerate do the rest.

  • Add a memory-aware decorator for CUDA OOM avoidance by @muellerzr in #324
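
The idea behind such a decorator can be sketched without a GPU. The code below is an illustration only, not the library's implementation: retry_with_smaller_batch is a hypothetical name, MemoryError stands in for a CUDA OOM error, and halving the batch size on each failure is an assumption made for the example.

```python
import functools


def retry_with_smaller_batch(starting_batch_size=128):
    """Hypothetical decorator: call the function with a batch size that is halved
    every time it raises MemoryError (a stand-in for a CUDA OOM error)."""
    def decorator(function):
        @functools.wraps(function)
        def wrapper(*args, **kwargs):
            batch_size = starting_batch_size
            while batch_size > 0:
                try:
                    return function(batch_size, *args, **kwargs)
                except MemoryError:
                    batch_size //= 2  # try again with half the batch
            raise RuntimeError("No executable batch size found.")
        return wrapper
    return decorator


@retry_with_smaller_batch(starting_batch_size=128)
def train(batch_size):
    if batch_size > 16:  # pretend anything above 16 does not fit in memory
        raise MemoryError
    return batch_size


print(train())  # 16
```

The decorated function receives the batch size as its first argument, so everything that depends on it (dataloaders, model, optimizer) should be created inside the function and rebuilt on each retry.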

Examples revamp

The Accelerate examples are now split in two: in the base folder you can find very simple NLP and computer vision examples, as well as complete versions incorporating all features. You can also browse the examples in the by_feature subfolder, which show exactly what code to add for each given feature (checkpointing, tracking, cross-validation, etc.).

Full Changelog: v0.6.0...v0.7.0

v0.6.2: Fix launcher with mixed precision

31 Mar 13:28

The launcher was ignoring the mixed precision attribute of the config since v0.6.0. This patch fixes that.

v0.6.1: Hot fix

18 Mar 21:47

Patches an issue with mixed precision (see #286)