Releases: huggingface/accelerate
v0.13.0 Launcher update (multinode and GPU selection) and multiple bug fixes
Better multinode support in the launcher
The `accelerate launch` command did not work well for distributed training using several machines. This is fixed in this version.
- Use torchrun for multinode by @muellerzr in #631
- Fix multi-node issues from launch by @muellerzr in #672
Launch training on specific GPUs only
Instead of prefixing your launch command with `CUDA_VISIBLE_DEVICES=xxx`, you can now specify the GPUs you want to use in your Accelerate config (an example follows the PR link below).
- Allow for GPU-ID specification on CLI by @muellerzr in #732
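For example, assuming the `--gpu_ids` option added in #732, you can run something like `accelerate launch --gpu_ids 0,2 train.py`, or set the equivalent `gpu_ids` entry when answering the questions in `accelerate config`.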
Better tracebacks and rich support
The tracebacks are now cleaned up to avoid printing the same error several times, and rich is integrated as an optional dependency.
- Integrate Rich into Accelerate by @muellerzr in #613
- Make rich an optional dep by @muellerzr in #673
What's new?
- Fix typo in docs/index.mdx by @mishig25 in #610
- Fix DeepSpeed CI by @muellerzr in #612
- Added GANs example to examples by @EyalMichaeli in #619
- Fix example by @muellerzr in #620
- Update README.md by @ezhang7423 in #622
- Fully remove `subprocess` from the multi-gpu launcher by @muellerzr in #623
- M1 mps fixes by @pacman100 in #625
- Fix multi-node issues and simplify param logic by @muellerzr in #627
- update MPS support docs by @pacman100 in #629
- minor tracker fixes for complete* examples by @pacman100 in #630
- Put back in place the guard by @muellerzr in #634
- make init_trackers to launch on main process by @Gladiator07 in #642
- remove check for main process for trackers initialization by @Gladiator07 in #643
- fix link by @philschmid in #645
- Add static_graph arg to DistributedDataParallelKwargs. by @rom1504 in #637
- Small nits to grad accum docs by @muellerzr in #656
- Saving hyperparams in yaml file for Tensorboard for #521 by @Shreyz-max in #657
- Use debug for loggers by @muellerzr in #655
- Improve docstrings more by @muellerzr in #666
- accelerate bibtex by @pacman100 in #660
- Cache torch_tpu check by @muellerzr in #670
- Manim animation of big model inference by @muellerzr in #671
- Add aim tracker for accelerate by @muellerzr in #649
- Specify local network on multinode by @muellerzr in #674
- Test for min torch version + fix all issues by @muellerzr in #638
- deepspeed enhancements and fixes by @pacman100 in #676
- DeepSpeed launcher related changes by @pacman100 in #626
- adding torchrun elastic params by @pacman100 in #680
- 🐛 fix by @pacman100 in #683
- Fix skip in dispatch dataloaders by @sgugger in #682
- Clean up DispatchDataloader a bit more by @sgugger in #686
- rng state sync for FSDP by @pacman100 in #688
- Fix DataLoader with samplers that are batch samplers by @sgugger in #687
- fixing support for Apple Silicon GPU in `notebook_launcher` by @pacman100 in #695
- fixing rng sync when using custom sampler and batch_sampler by @pacman100 in #696
- Improve `init_empty_weights` to override tensor constructor by @thomasw21 in #699
- override DeepSpeed `grad_acc_steps` from `accelerator` obj by @pacman100 in #698
- [doc] Fix 404'd link in memory usage guides by @tomaarsen in #702
- Add in report generation for test failures and make fail-fast false by @muellerzr in #703
- Update runners with report structure, adjust env variable by @muellerzr in #704
- docs: examples readability improvements by @ryanrussell in #709
- docs: `utils` readability fixups by @ryanrussell in #711
- refactor(test_tracking): `key_occurrence` readability fixup by @ryanrussell in #710
- docs: `hooks` readability improvements by @ryanrussell in #712
- sagemaker fixes and improvements by @pacman100 in #708
- refactor(accelerate): readability improvements by @ryanrussell in #713
- More docstring nits by @muellerzr in #715
- Allow custom device placements for different objects by @sgugger in #716
- Specify gradients in model preparation by @muellerzr in #722
- Fix regression issue by @muellerzr in #724
- Fix default for num processes by @sgugger in #726
- Build and Release docker images on a release by @muellerzr in #725
- Make running tests more efficient by @muellerzr in #611
- Fix old naming by @muellerzr in #727
- Fix issue with one-cycle logic by @muellerzr in #728
- Remove auto-bug label in issue template by @sgugger in #735
- Add a tutorial on proper benchmarking by @muellerzr in #734
- Add an example zoo to the documentation by @muellerzr in #737
- trlx by @muellerzr in #738
- Fix memory leak by @muellerzr in #739
- Include examples for CI by @muellerzr in #740
- Auto grad accum example by @muellerzr in #742
v0.12.0 New doc, gather_for_metrics, balanced device map and M1 support
New documentation
The whole documentation has been revamped; just go look at it here!
- Complete revamp of the docs by @muellerzr in #495
New gather_for_metrics method
When doing distributed evaluation, the dataloader loops back to the beginning of the dataset so that the total number of samples is a round multiple of the number of processes. This makes the predictions slightly longer than the dataset, which used to require some truncating. This is now all done behind the scenes if you replace the `gather` you did in evaluation with `gather_for_metrics`. A short sketch follows the PR links below.
- Reenable Gather for Metrics by @muellerzr in #590
- Fix gather_for_metrics by @muellerzr in #578
- Add a gather_for_metrics capability by @muellerzr in #540
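A minimal sketch of the new pattern (the tiny model and random data here are placeholders, not from the release):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()
model = torch.nn.Linear(4, 2)  # stand-in model
dataset = TensorDataset(torch.randn(10, 4), torch.randint(0, 2, (10,)))
eval_dataloader = DataLoader(dataset, batch_size=4)
model, eval_dataloader = accelerator.prepare(model, eval_dataloader)

all_preds, all_labels = [], []
model.eval()
for inputs, labels in eval_dataloader:
    with torch.no_grad():
        preds = model(inputs).argmax(dim=-1)
    # gather_for_metrics drops the duplicated samples that distributed
    # sampling adds to round out the last batch, so the gathered totals
    # match len(dataset) exactly; plain gather() would keep the extras.
    preds, labels = accelerator.gather_for_metrics((preds, labels))
    all_preds.append(preds)
    all_labels.append(labels)

accuracy = (torch.cat(all_preds) == torch.cat(all_labels)).float().mean()
```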
Balanced device maps
When loading big models for inference, `device_map="auto"` used to fill the GPUs sequentially, making it hard to use a batch size > 1. It now balances the weights evenly on the GPUs, so if you have more GPU space than the model size, you can run predictions with a bigger batch size!
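A hedged sketch of computing such a map by hand with Accelerate's utilities (`get_balanced_memory` and `infer_auto_device_map`); the toy model is a placeholder and this assumes a multi-GPU machine:

```python
import torch
from accelerate import init_empty_weights, infer_auto_device_map
from accelerate.utils import get_balanced_memory

# Toy stand-in for a large checkpoint; created on the meta device so no
# weight memory is allocated while the map is computed.
with init_empty_weights():
    model = torch.nn.Sequential(*[torch.nn.Linear(1024, 1024) for _ in range(48)])

# Per-device budgets that spread the weights evenly across the GPUs
# (instead of filling GPU 0 first), leaving headroom for batch size > 1.
max_memory = get_balanced_memory(model)
device_map = infer_auto_device_map(model, max_memory=max_memory)
```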
M1 GPU support
Accelerate now supports M1 GPUs; to learn more about how to set up your environment, see the documentation. A quick check is sketched below the PR link.
- M1 GPU `mps` device integration by @pacman100 in #596
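A quick, hedged way to verify that the `mps` backend is picked up (assumes a recent PyTorch build with MPS support):

```python
import torch
from accelerate import Accelerator

# Verify the Apple Silicon backend is visible to PyTorch...
print("MPS available:", torch.backends.mps.is_available())

# ...and that Accelerate selects it (after configuring via `accelerate config`).
accelerator = Accelerator()
print("Accelerate is using device:", accelerator.device)
```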
What's new?
- Small fixed for balanced device maps by @sgugger in #583
- Add balanced option for auto device map creation by @sgugger in #534
- fixing deepspeed slow tests issue by @pacman100 in #604
- add more conditions on casting by @younesbelkada in #606
- Remove redundant `.run` in `WandBTracker` by @zh-plus in #605
- Fix some typos + wordings by @muellerzr in #603
- reorg of test scripts and minor changes to tests by @pacman100 in #602
- Move warning by @muellerzr in #598
- Shorthand way to grab a tracker by @muellerzr in #594
- Pin deepspeed by @muellerzr in #595
- Improve docstring by @muellerzr in #591
- TESTS! by @muellerzr in #589
- Fix DispatchDataloader by @sgugger in #588
- Use main_process_first in the examples by @muellerzr in #581
- Skip and raise NotImplementedError for gather_for_metrics for now by @muellerzr in #580
- minor FSDP launcher fix by @pacman100 in #579
- Refine test in set_module_tensor_to_device by @sgugger in #577
- Fix `set_module_tensor_to_device` by @sgugger in #576
- Add 8 bit support - chapter II by @younesbelkada in #539
- Fix tests, add wandb to gitignore by @muellerzr in #573
- Fix step by @muellerzr in #572
- Speed up main CI by @muellerzr in #571
- ccl version check and import different module according to version by @sywangyi in #567
- set default num_cpu_threads_per_process to improve oob performance by @sywangyi in #562
- Add a tqdm helper by @muellerzr in #564
- Rename actions to be a bit more accurate by @muellerzr in #568
- Fix clean by @muellerzr in #569
- enhancements and fixes for FSDP and DeepSpeed by @pacman100 in #532
- fix: saving model weights by @csarron in #556
- add on_main_process decorators by @ZhiyuanChen in #488
- Update imports.py by @KimBioInfoStudio in #554
- unpin `datasets` by @lhoestq in #563
- Create good defaults in `accelerate launch` by @muellerzr in #553
- Fix a few minor issues with example code in docs by @BenjaminBossan in #551
- deepspeed version `0.6.7` fix by @pacman100 in #544
- Rename test extras to testing by @muellerzr in #545
- Add production testing + fix failing CI by @muellerzr in #547
- Add a gather_for_metrics capability by @muellerzr in #540
- Allow for kwargs to be passed to trackers by @muellerzr in #542
- Add support for downcasting bf16 on TPUs by @muellerzr in #523
- Add more documentation for device maps computations by @sgugger in #530
- Restyle prepare one by @muellerzr in #531
- Pick a better default for offload_state_dict by @sgugger in #529
- fix some parameter setting does not work for CPU DDP and bf16 fail in… by @sywangyi in #527
- Fix accelerate tests command by @sgugger in #528
v0.11.0 Gradient Accumulation and SageMaker Data Parallelism
Gradient Accumulation
Accelerate now handles gradient accumulation if you want: just pass along `gradient_accumulation_steps=xxx` when instantiating the `Accelerator` and put your training loop step under a `with accelerator.accumulate(model):` block. Accelerate will then handle the loss re-scaling and gradient accumulation for you (avoiding slowdowns in distributed training by only syncing the gradients when you actually want to step). More details in the documentation; a short sketch follows the PR links below.
- Add gradient accumulation doc by @muellerzr in #511
- Make gradient accumulation work with dispatched dataloaders by @muellerzr in #510
- Introduce automatic gradient accumulation wrapper + fix a few test issues by @muellerzr in #484
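A minimal sketch of the pattern (the tiny model and random data are placeholders):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# Accumulate gradients over 4 batches before each optimizer update.
accelerator = Accelerator(gradient_accumulation_steps=4)

model = torch.nn.Linear(8, 1)  # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
dataset = TensorDataset(torch.randn(64, 8), torch.randn(64, 1))
dataloader = DataLoader(dataset, batch_size=8)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for inputs, targets in dataloader:
    # Inside accumulate(), Accelerate rescales the loss and only syncs
    # the gradients (and actually steps) at each accumulation boundary.
    with accelerator.accumulate(model):
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
```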
Support for SageMaker Data parallelism
Accelerate now supports SageMaker's specific brand of data parallelism.
- SageMaker enhancements to allow custom docker image, input channels referring to s3/remote data locations and metrics logging by @pacman100 in #504
- SageMaker DP Support by @pacman100 in #494
What's new?
- Fix accelerate tests command by @sgugger in #528
- FSDP integration enhancements and fixes by @pacman100 in #522
- Warn user if no trackers are installed by @muellerzr in #524
- Fixup all example CI tests and properly fail by @muellerzr in #517
- fixing deepspeed multi-node launcher by @pacman100 in #514
- Add special Parameters modules support by @younesbelkada in #519
- Don't unwrap in save_state() by @cccntu in #489
- Fix a bug when reduce a tensor. by @wwhio in #513
- Add benchmarks by @sgugger in #506
- Fix DispatchDataLoader length when `split_batches=True` by @sgugger in #509
- Fix scheduler in gradient accumulation example by @muellerzr in #500
- update dataloader wrappers to have `total_batch_size` attribute by @pacman100 in #493
- Introduce automatic gradient accumulation wrapper + fix a few test issues by @muellerzr in #484
- add use_distributed property by @ZhiyuanChen in #487
- fixing fsdp autowrap functionality by @pacman100 in #475
- Use datasets 2.2.0 for now by @muellerzr in #481
- Rm gradient accumulation on TPU by @muellerzr in #479
- Revert "Pin datasets for now by @muellerzr in #477)"
- Pin datasets for now by @muellerzr in #477
- Some typos and cosmetic fixes by @douwekiela in #472
- Fix when TPU device check is ran by @muellerzr in #469
- Refactor Utility Documentation by @muellerzr in #467
- Add docbuilder to quality by @muellerzr in #468
- Expose some is_*_available utils in docs by @muellerzr in #466
- Cleanup CI Warnings by @muellerzr in #465
- Link CI slow runners to the commit by @muellerzr in #464
- Fix subtle bug in BF16 by @muellerzr in #463
- Include bf16 support for TPUs and CPUs, and a better check for if a CUDA device supports BF16 by @muellerzr in #462
- Handle bfloat16 weights in disk offload without adding memory overhead by @noamwies in #460
- Handle bfloat16 weights in disk offload by @sgugger in #460
- Raise a clear warning if a user tries to modify the AcceleratorState by @muellerzr in #458
- Right step point by @muellerzr in #459
- Better checks for if a TPU device exists by @muellerzr in #456
- Offload and modules with unused submodules by @sgugger in #442
v0.10.0 DeepSpeed integration revamp and TPU speedup
This release adds two major new features: the DeepSpeed integration has been revamped to match the one in Transformers Trainer, with multiple new options unlocked, and the TPU integration has been sped up.
This version also officially stops supporting Python 3.6 and requires Python 3.7+.
DeepSpeed integration revamp
Users can now specify a DeepSpeed config file when they want to use DeepSpeed, which unlocks many new options. More details in the new documentation.
- Migrate HFDeepSpeedConfig from trfrs to accelerate by @pacman100 in #432
- DeepSpeed Revamp by @pacman100 in #405
TPU speedup
If you're using TPUs, we have sped up the dataloaders and models quite a bit, on top of a few bug fixes.
- Revamp TPU internals to be more efficient + enable mixed precision types by @muellerzr in #441
What's new?
- Fix docstring by @muellerzr in #447
- Add psutil as dependency by @sgugger in #445
- fix fsdp torch version dependency by @pacman100 in #437
- Create Gradient Accumulation Example by @muellerzr in #431
- init by @muellerzr in #429
- Introduce `no_sync` context wrapper + clean up some more warnings for DDP by @muellerzr in #428
- updating tests to resolve runner failures wrt deepspeed revamp by @pacman100 in #427
- Fix secrets in Docker workflow by @muellerzr in #426
- Introduce a Dependency Checker to trigger new Docker Builds on main by @muellerzr in #424
- Enable slow tests nightly by @muellerzr in #421
- Push out python 3.6 + fix all tests related to the upgrade by @muellerzr in #420
- Speedup main CI by @muellerzr in #419
- Switch to evaluate for metrics by @sgugger in #417
- Create an issue template for Accelerate by @muellerzr in #415
- Introduce post-merge runners by @muellerzr in #416
- Fix debug_launcher issues by @muellerzr in #413
- Use main egg by @muellerzr in #414
- Introduce nightly runners by @muellerzr in #410
- Update requirements to pin tensorboard and include psutil by @muellerzr in #408
- Fix CUDA examples tests by @muellerzr in #407
- Move datasets and transformers to under func by @muellerzr in #411
- Fix CUDA Dockerfile by @muellerzr in #409
- Hotfix all failing GPU tests by @muellerzr in #401
- improve metrics logged in examples by @pacman100 in #399
- Refactor offload_state_dict and fix in offload_weight by @sgugger in #398
- Refactor version checking into a utility by @muellerzr in #395
- Include fastai in frameworks by @muellerzr in #396
- Add packaging to requirements by @muellerzr in #394
- Better dispatch for submodules by @sgugger in #392
- Build Docker Images nightly by @muellerzr in #391
- Small bugfix for the stalebot workflow by @muellerzr in #390
- Introduce stalebot by @muellerzr in #387
- Create Dockerfiles for Accelerate by @muellerzr in #377
- Mix precision -> Mixed precision by @muellerzr in #388
- Fix OneCycle step length when in multiprocess by @muellerzr in #385
v0.9.0: Refactor utils to use in Transformers
This release offers no significant new API; it is just needed to have access to some utils in Transformers.
- Handle deprecation errors in launch by @muellerzr in #360
- Update launchers.py by @tmabraham in #363
- fix tracking by @pacman100 in #361
- Remove tensor call by @muellerzr in #365
- Add a utility for writing a barebones config file by @muellerzr in #371
- fix deepspeed model saving by @pacman100 in #370
- deepspeed save model temp fix by @pacman100 in #374
- Refactor tests to use accelerate launch by @muellerzr in #373
- fix zero stage-1 by @pacman100 in #378
- fix shuffling for ShufflerIterDataPipe instances by @loubnabnl in #376
- Better check for deepspeed availability by @sgugger in #379
- Refactor some parts in utils by @sgugger in #380
v0.8.0: Big model inference
Big model inference
To handle very large models, new functionality has been added in Accelerate:
- a context manager to initialize empty models
- a function to load a sharded checkpoint directly on the right devices
- a set of custom hooks that allow execution of a model split on different devices, as well as CPU or disk offload
- a magic method that auto-determines a device map for a given model, maximizing the available GPU space and RAM before using disk offload as a last resort
- a function that wraps the last three blocks in one simple call (`load_checkpoint_and_dispatch`)
See more in the documentation.
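A hedged sketch tying these pieces together (the architecture and checkpoint path are placeholders):

```python
import torch
from accelerate import init_empty_weights, load_checkpoint_and_dispatch

# 1. Instantiate the architecture without allocating any weight memory
#    (the model lives on the meta device).
with init_empty_weights():
    model = torch.nn.Sequential(*[torch.nn.Linear(1024, 1024) for _ in range(48)])

# 2. Load a (possibly sharded) checkpoint directly onto the right devices,
#    spilling to CPU RAM and then disk only once the GPUs are full.
model = load_checkpoint_and_dispatch(
    model,
    "path/to/sharded-checkpoint",  # hypothetical checkpoint location
    device_map="auto",
    offload_folder="offload",      # used only if disk offload is needed
)
```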
What's new
- Create peak_memory_uasge_tracker.py by @pacman100 in #336
- Fixed a typo to enable running accelerate correctly by @Idodox in #339
- Introduce multiprocess logger by @muellerzr in #337
- Refactor utils into its own module by @muellerzr in #340
- Improve num_processes question in CLI by @muellerzr in #343
- Handle Manual Wrapping in FSDP. Minor fix of fsdp example. by @pacman100 in #342
- Better prompt for number of training devices by @muellerzr in #344
- Fix prompt for num_processes by @pacman100 in #347
- Fix sample calculation in examples by @muellerzr in #352
- Fixing metric eval in distributed setup by @pacman100 in #355
- DeepSpeed and FSDP plugin support through script by @pacman100 in #356
v0.7.1 Patch release
v0.7.0: Logging API, FSDP, batch size finder and examples revamp
Logging API
Use any of your favorite logging libraries (TensorBoard, Wandb, CometML...) with just a few lines of code inside your training scripts with Accelerate. All details are in the documentation.
- Add logging capabilities by @muellerzr in #293
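A minimal, hedged sketch of the API (assumes `tensorboard` is installed; the project name, directory, and logged values are placeholders):

```python
from accelerate import Accelerator

# log_with selects the tracker backend; logging_dir is where TensorBoard
# event files are written.
accelerator = Accelerator(log_with="tensorboard", logging_dir="runs")
accelerator.init_trackers("my_project", config={"learning_rate": 1e-4})

for step in range(100):
    loss = 1.0 / (step + 1)  # placeholder for a real training loss
    accelerator.log({"train_loss": loss}, step=step)

accelerator.end_training()  # lets all trackers finish properly
```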
Support for FSDP (fully sharded DataParallel)
PyTorch recently released a new model wrapper for sharded DDP training called FSDP. This release adds support for it (note that it doesn't work with mixed precision yet). See all caveats in the documentation.
- PyTorch FSDP Feature Incorporation by @pacman100 in #321
Batch size finder
Say goodbye to CUDA OOM errors with the new `find_executable_batch_size` decorator. Just decorate your training function and pick a starting batch size, then let Accelerate do the rest (a sketch follows the PR link below).
- Add a memory-aware decorator for CUDA OOM avoidance by @muellerzr in #324
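A hedged sketch of the decorator in use (the training body is a placeholder; the import path shown is the current one, and at the time of this release the helper was also exposed via `accelerate.memory_utils`):

```python
from accelerate import Accelerator
from accelerate.utils import find_executable_batch_size

accelerator = Accelerator()

@find_executable_batch_size(starting_batch_size=128)
def train(batch_size):
    # On CUDA OOM, the decorator frees memory, halves batch_size, and
    # calls this function again until the training loop fits.
    print(f"Trying batch size {batch_size}")
    # ... build dataloaders with `batch_size` and run training here ...

train()
```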
Examples revamp
The Accelerate examples are now split in two: in the base folder you can find very simple NLP and computer vision examples, as well as complete versions incorporating all features. You can also browse the examples in the `by_feature` subfolder, which shows you exactly what code to add for each given feature (checkpointing, tracking, cross-validation, etc.).
- Refactor Examples by Feature by @muellerzr in #312
What's Changed
- Document save/load state by @muellerzr in #290
- Refactor precisions to its own enum by @muellerzr in #292
- Load model and optimizer states on CPU to avoid OOMs by @sgugger in #299
- Fix example for datasets v2 by @sgugger in #298
- Leave default as None in `mixed_precision` for launch command by @sgugger in #300
- Pass `lr_scheduler` to `Accelerator.prepare` by @sgugger in #301
- Create new TestCase classes and clean up W&B tests by @muellerzr in #304
- Have custom trackers work with the API by @muellerzr in #305
- Write tests for comet_ml by @muellerzr in #306
- Fix training in DeepSpeed by @sgugger in #308
- Update example scripts by @muellerzr in #307
- Use --no_local_rank for DeepSpeed launch by @sgugger in #309
- Fix Accelerate CLI CPU option + small fix for W&B tests by @muellerzr in #311
- Fix DataLoader sharding for deepspeed in accelerate by @m3rlin45 in #315
- Create a testing framework for example scripts and fix current ones by @muellerzr in #313
- Refactor Tracker logic and write guards for logging_dir by @muellerzr in #316
- Create Cross-Validation example by @muellerzr in #317
- Create alias for Accelerator.free_memory by @muellerzr in #318
- fix typo in docs of accelerate tracking by @loubnabnl in #320
- Update examples to show how to deal with extra validation copies by @muellerzr in #319
- Fixup all checkpointing examples by @muellerzr in #323
- Introduce reduce operator by @muellerzr in #326
New Contributors
- @m3rlin45 made their first contribution in #315
- @loubnabnl made their first contribution in #320
- @pacman100 made their first contribution in #321
Full Changelog: v0.6.0...v0.7.0
v0.6.2: Fix launcher with mixed precision
The launcher was ignoring the mixed precision attribute of the config since v0.6.0. This patch fixes that.
v0.6.1: Hot fix
Patches an issue with mixed precision (see #286)