Releases: huggingface/accelerate
v0.13.0 Launcher update (multinode and GPU selection) and multiple bug fixes
Better multinode support in the launcher
The `accelerate launch` command did not work well for distributed training using several machines. This is fixed in this version.
- Use torchrun for multinode by @muellerzr in #631
- Fix multi-node issues from launch by @muellerzr in #672
Launch training on specific GPUs only
Instead of prefixing your launch command with `CUDA_VISIBLE_DEVICES=xxx`, you can now specify the GPUs you want to use in your Accelerate config (an example follows the PR link below).
- Allow for GPU-ID specification on CLI by @muellerzr in #732
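For example, assuming the `--gpu_ids` option added in #732, you can run something like `accelerate launch --gpu_ids 0,2 train.py`, or set the equivalent `gpu_ids` entry when answering the questions in `accelerate config`.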
Better tracebacks and rich support
The tracebacks are now cleaned up to avoid printing the same error several times, and rich is integrated as an optional dependency.
- Integrate Rich into Accelerate by @muellerzr in #613
- Make rich an optional dep by @muellerzr in #673
What's new?
- Fix typo in docs/index.mdx by @mishig25 in #610
- Fix DeepSpeed CI by @muellerzr in #612
- Added GANs example to examples by @EyalMichaeli in #619
- Fix example by @muellerzr in #620
- Update README.md by @ezhang7423 in #622
- Fully remove `subprocess` from the multi-gpu launcher by @muellerzr in #623
- M1 mps fixes by @pacman100 in #625
- Fix multi-node issues and simplify param logic by @muellerzr in #627
- update MPS support docs by @pacman100 in #629
- minor tracker fixes for complete* examples by @pacman100 in #630
- Put back in place the guard by @muellerzr in #634
- make init_trackers to launch on main process by @Gladiator07 in #642
- remove check for main process for trackers initialization by @Gladiator07 in #643
- fix link by @philschmid in #645
- Add static_graph arg to DistributedDataParallelKwargs. by @rom1504 in #637
- Small nits to grad accum docs by @muellerzr in #656
- Saving hyperparams in yaml file for Tensorboard for #521 by @Shreyz-max in #657
- Use debug for loggers by @muellerzr in #655
- Improve docstrings more by @muellerzr in #666
- accelerate bibtex by @pacman100 in #660
- Cache torch_tpu check by @muellerzr in #670
- Manim animation of big model inference by @muellerzr in #671
- Add aim tracker for accelerate by @muellerzr in #649
- Specify local network on multinode by @muellerzr in #674
- Test for min torch version + fix all issues by @muellerzr in #638
- deepspeed enhancements and fixes by @pacman100 in #676
- DeepSpeed launcher related changes by @pacman100 in #626
- adding torchrun elastic params by @pacman100 in #680
- 🐛 fix by @pacman100 in #683
- Fix skip in dispatch dataloaders by @sgugger in #682
- Clean up DispatchDataloader a bit more by @sgugger in #686
- rng state sync for FSDP by @pacman100 in #688
- Fix DataLoader with samplers that are batch samplers by @sgugger in #687
- fixing support for Apple Silicon GPU in `notebook_launcher` by @pacman100 in #695
- fixing rng sync when using custom sampler and batch_sampler by @pacman100 in #696
- Improve `init_empty_weights` to override tensor constructor by @thomasw21 in #699
- override DeepSpeed `grad_acc_steps` from `accelerator` obj by @pacman100 in #698
- [doc] Fix 404'd link in memory usage guides by @tomaarsen in #702
- Add in report generation for test failures and make fail-fast false by @muellerzr in #703
- Update runners with report structure, adjust env variable by @muellerzr in #704
- docs: examples readability improvements by @ryanrussell in #709
- docs: `utils` readability fixups by @ryanrussell in #711
- refactor(test_tracking): `key_occurrence` readability fixup by @ryanrussell in #710
- docs: `hooks` readability improvements by @ryanrussell in #712
- sagemaker fixes and improvements by @pacman100 in #708
- refactor(accelerate): readability improvements by @ryanrussell in #713
- More docstring nits by @muellerzr in #715
- Allow custom device placements for different objects by @sgugger in #716
- Specify gradients in model preparation by @muellerzr in #722
- Fix regression issue by @muellerzr in #724
- Fix default for num processes by @sgugger in #726
- Build and Release docker images on a release by @muellerzr in #725
- Make running tests more efficient by @muellerzr in #611
- Fix old naming by @muellerzr in #727
- Fix issue with one-cycle logic by @muellerzr in #728
- Remove auto-bug label in issue template by @sgugger in #735
- Add a tutorial on proper benchmarking by @muellerzr in #734
- Add an example zoo to the documentation by @muellerzr in #737
- trlx by @muellerzr in #738
- Fix memory leak by @muellerzr in #739
- Include examples for CI by @muellerzr in #740
- Auto grad accum example by @muellerzr in #742
v0.12.0 New doc, gather_for_metrics, balanced device map and M1 support
New documentation
The whole documentation has been revamped; just go look at it here!
- Complete revamp of the docs by @muellerzr in #495
New gather_for_metrics method
When doing distributed evaluation, the dataloader loops back to the beginning of the dataset so that the total number of samples is a round multiple of the number of processes. This makes the predictions slightly longer than the dataset, which used to require some truncating. This is now all done behind the scenes if you replace the `gather` you did in evaluation with `gather_for_metrics`. A short sketch follows the PR links below.
- Reenable Gather for Metrics by @muellerzr in #590
- Fix gather_for_metrics by @muellerzr in #578
- Add a gather_for_metrics capability by @muellerzr in #540
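A minimal sketch of the new pattern (the tiny model and random data here are placeholders, not from the release):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()
model = torch.nn.Linear(4, 2)  # stand-in model
dataset = TensorDataset(torch.randn(10, 4), torch.randint(0, 2, (10,)))
eval_dataloader = DataLoader(dataset, batch_size=4)
model, eval_dataloader = accelerator.prepare(model, eval_dataloader)

all_preds, all_labels = [], []
model.eval()
for inputs, labels in eval_dataloader:
    with torch.no_grad():
        preds = model(inputs).argmax(dim=-1)
    # gather_for_metrics drops the duplicated samples that distributed
    # sampling adds to round out the last batch, so the gathered totals
    # match len(dataset) exactly; plain gather() would keep the extras.
    preds, labels = accelerator.gather_for_metrics((preds, labels))
    all_preds.append(preds)
    all_labels.append(labels)

accuracy = (torch.cat(all_preds) == torch.cat(all_labels)).float().mean()
```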
Balanced device maps
When loading big models for inference, `device_map="auto"` used to fill the GPUs sequentially, making it hard to use a batch size > 1. It now balances the weights evenly on the GPUs, so if you have more GPU space than the model size, you can run predictions with a bigger batch size!
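A hedged sketch of computing such a map by hand with Accelerate's utilities (`get_balanced_memory` and `infer_auto_device_map`); the toy model is a placeholder and this assumes a multi-GPU machine:

```python
import torch
from accelerate import init_empty_weights, infer_auto_device_map
from accelerate.utils import get_balanced_memory

# Toy stand-in for a large checkpoint; created on the meta device so no
# weight memory is allocated while the map is computed.
with init_empty_weights():
    model = torch.nn.Sequential(*[torch.nn.Linear(1024, 1024) for _ in range(48)])

# Per-device budgets that spread the weights evenly across the GPUs
# (instead of filling GPU 0 first), leaving headroom for batch size > 1.
max_memory = get_balanced_memory(model)
device_map = infer_auto_device_map(model, max_memory=max_memory)
```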
M1 GPU support
Accelerate now supports M1 GPUs; to learn more about how to set up your environment, see the documentation. A quick check is sketched below the PR link.
- M1 GPU `mps` device integration by @pacman100 in #596
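A quick, hedged way to verify that the `mps` backend is picked up (assumes a recent PyTorch build with MPS support):

```python
import torch
from accelerate import Accelerator

# Verify the Apple Silicon backend is visible to PyTorch...
print("MPS available:", torch.backends.mps.is_available())

# ...and that Accelerate selects it (after configuring via `accelerate config`).
accelerator = Accelerator()
print("Accelerate is using device:", accelerator.device)
```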
What's new?
- Small fixed for balanced device maps by @sgugger in #583
- Add balanced option for auto device map creation by @sgugger in #534
- fixing deepspeed slow tests issue by @pacman100 in #604
- add more conditions on casting by @younesbelkada in #606
- Remove redundant `.run` in `WandBTracker` by @zh-plus in #605
- Fix some typos + wordings by @muellerzr in #603
- reorg of test scripts and minor changes to tests by @pacman100 in #602
- Move warning by @muellerzr in #598
- Shorthand way to grab a tracker by @muellerzr in #594
- Pin deepspeed by @muellerzr in #595
- Improve docstring by @muellerzr in #591
- TESTS! by @muellerzr in #589
- Fix DispatchDataloader by @sgugger in #588
- Use main_process_first in the examples by @muellerzr in #581
- Skip and raise NotImplementedError for gather_for_metrics for now by @muellerzr in #580
- minor FSDP launcher fix by @pacman100 in #579
- Refine test in set_module_tensor_to_device by @sgugger in #577
- Fix `set_module_tensor_to_device` by @sgugger in #576
- Add 8 bit support - chapter II by @younesbelkada in #539
- Fix tests, add wandb to gitignore by @muellerzr in #573
- Fix step by @muellerzr in #572
- Speed up main CI by @muellerzr in #571
- ccl version check and import different module according to version by @sywangyi in #567
- set default num_cpu_threads_per_process to improve oob performance by @sywangyi in #562
- Add a tqdm helper by @muellerzr in #564
- Rename actions to be a bit more accurate by @muellerzr in #568
- Fix clean by @muellerzr in #569
- enhancements and fixes for FSDP and DeepSpeed by @pacman100 in #532
- fix: saving model weights by @csarron in #556
- add on_main_process decorators by @ZhiyuanChen in #488
- Update imports.py by @KimBioInfoStudio in #554
- unpin `datasets` by @lhoestq in #563
- Create good defaults in `accelerate launch` by @muellerzr in #553
- Fix a few minor issues with example code in docs by @BenjaminBossan in #551
- deepspeed version `0.6.7` fix by @pacman100 in #544
- Rename test extras to testing by @muellerzr in #545
- Add production testing + fix failing CI by @muellerzr in #547
- Add a gather_for_metrics capability by @muellerzr in #540
- Allow for kwargs to be passed to trackers by @muellerzr in #542
- Add support for downcasting bf16 on TPUs by @muellerzr in #523
- Add more documentation for device maps computations by @sgugger in #530
- Restyle prepare one by @muellerzr in #531
- Pick a better default for offload_state_dict by @sgugger in #529
- fix some parameter setting does not work for CPU DDP and bf16 fail in… by @sywangyi in #527
- Fix accelerate tests command by @sgugger in #528
v0.11.0 Gradient Accumulation and SageMaker Data Parallelism
Gradient Accumulation
Accelerate now handles gradient accumulation if you want: just pass along `gradient_accumulation_steps=xxx` when instantiating the `Accelerator` and put your training loop step under a `with accelerator.accumulate(model):` block. Accelerate will then handle the loss re-scaling and gradient accumulation for you (avoiding slowdowns in distributed training by only syncing the gradients when you actually want to step). More details in the documentation; a short sketch follows the PR links below.
- Add gradient accumulation doc by @muellerzr in #511
- Make gradient accumulation work with dispatched dataloaders by @muellerzr in #510
- Introduce automatic gradient accumulation wrapper + fix a few test issues by @muellerzr in #484
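A minimal sketch of the pattern (the tiny model and random data are placeholders):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# Accumulate gradients over 4 batches before each optimizer update.
accelerator = Accelerator(gradient_accumulation_steps=4)

model = torch.nn.Linear(8, 1)  # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
dataset = TensorDataset(torch.randn(64, 8), torch.randn(64, 1))
dataloader = DataLoader(dataset, batch_size=8)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for inputs, targets in dataloader:
    # Inside accumulate(), Accelerate rescales the loss and only syncs
    # the gradients (and actually steps) at each accumulation boundary.
    with accelerator.accumulate(model):
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
```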
Support for SageMaker Data parallelism
Accelerate now supports SageMaker's specific brand of data parallelism.
- SageMaker enhancements to allow custom docker image, input channels referring to s3/remote data locations and metrics logging by @pacman100 in #504
- SageMaker DP Support by @pacman100 in #494
What's new?
- Fix accelerate tests command by @sgugger in #528
- FSDP integration enhancements and fixes by @pacman100 in #522
- Warn user if no trackers are installed by @muellerzr in #524
- Fixup all example CI tests and properly fail by @muellerzr in #517
- fixing deepspeed multi-node launcher by @pacman100 in #514
- Add special Parameters modules support by @younesbelkada in #519
- Don't unwrap in save_state() by @cccntu in #489
- Fix a bug when reduce a tensor. by @wwhio in #513
- Add benchmarks by @sgugger in #506
- Fix DispatchDataLoader length when `split_batches=True` by @sgugger in #509
- Fix scheduler in gradient accumulation example by @muellerzr in #500
- update dataloader wrappers to have `total_batch_size` attribute by @pacman100 in #493
- Introduce automatic gradient accumulation wrapper + fix a few test issues by @muellerzr in #484
- add use_distributed property by @ZhiyuanChen in #487
- fixing fsdp autowrap functionality by @pacman100 in #475
- Use datasets 2.2.0 for now by @muellerzr in #481
- Rm gradient accumulation on TPU by @muellerzr in #479
- Revert "Pin datasets for now by @muellerzr in #477)"
- Pin datasets for now by @muellerzr in #477
- Some typos and cosmetic fixes by @douwekiela in #472
- Fix when TPU device check is ran by @muellerzr in #469
- Refactor Utility Documentation by @muellerzr in #467
- Add docbuilder to quality by @muellerzr in #468
- Expose some is_*_available utils in docs by @muellerzr in #466
- Cleanup CI Warnings by @muellerzr in #465
- Link CI slow runners to the commit by @muellerzr in #464
- Fix subtle bug in BF16 by @muellerzr in #463
- Include bf16 support for TPUs and CPUs, and a better check for if a CUDA device supports BF16 by @muellerzr in #462
- Handle bfloat16 weights in disk offload without adding memory overhead by @noamwies in #460
- Handle bfloat16 weights in disk offload by @sgugger in #460
- Raise a clear warning if a user tries to modify the AcceleratorState by @muellerzr in #458
- Right step point by @muellerzr in #459
- Better checks for if a TPU device exists by @muellerzr in #456
- Offload and modules with unused submodules by @sgugger in #442
v0.10.0 DeepSpeed integration revamp and TPU speedup
This release adds two major new features: the DeepSpeed integration has been revamped to match the one in Transformers Trainer, with multiple new options unlocked, and the TPU integration has been sped up.
This version also officially stops supporting Python 3.6 and requires Python 3.7+.
DeepSpeed integration revamp
Users can now specify a DeepSpeed config file when they want to use DeepSpeed, which unlocks many new options. More details in the new documentation.
- Migrate HFDeepSpeedConfig from trfrs to accelerate by @pacman100 in #432
- DeepSpeed Revamp by @pacman100 in #405
TPU speedup
If you're using TPUs, we have sped up the dataloaders and models quite a bit, on top of a few bug fixes.
- Revamp TPU internals to be more efficient + enable mixed precision types by @muellerzr in #441
What's new?
- Fix docstring by @muellerzr in #447
- Add psutil as dependency by @sgugger in #445
- fix fsdp torch version dependency by @pacman100 in #437
- Create Gradient Accumulation Example by @muellerzr in #431
- init by @muellerzr in #429
- Introduce `no_sync` context wrapper + clean up some more warnings for DDP by @muellerzr in #428
- updating tests to resolve runner failures wrt deepspeed revamp by @pacman100 in #427
- Fix secrets in Docker workflow by @muellerzr in #426
- Introduce a Dependency Checker to trigger new Docker Builds on main by @muellerzr in #424
- Enable slow tests nightly by @muellerzr in #421
- Push out python 3.6 + fix all tests related to the upgrade by @muellerzr in #420
- Speedup main CI by @muellerzr in #419
- Switch to evaluate for metrics by @sgugger in #417
- Create an issue template for Accelerate by @muellerzr in #415
- Introduce post-merge runners by @muellerzr in #416
- Fix debug_launcher issues by @muellerzr in #413
- Use main egg by @muellerzr in #414
- Introduce nightly runners by @muellerzr in #410
- Update requirements to pin tensorboard and include psutil by @muellerzr in #408
- Fix CUDA examples tests by @muellerzr in #407
- Move datasets and transformers to under func by @muellerzr in #411
- Fix CUDA Dockerfile by @muellerzr in #409
- Hotfix all failing GPU tests by @muellerzr in #401
- improve metrics logged in examples by @pacman100 in #399
- Refactor offload_state_dict and fix in offload_weight by @sgugger in #398
- Refactor version checking into a utility by @muellerzr in #395
- Include fastai in frameworks by @muellerzr in #396
- Add packaging to requirements by @muellerzr in #394
- Better dispatch for submodules by @sgugger in #392
- Build Docker Images nightly by @muellerzr in #391
- Small bugfix for the stalebot workflow by @muellerzr in #390
- Introduce stalebot by @muellerzr in #387
- Create Dockerfiles for Accelerate by @muellerzr in #377
- Mix precision -> Mixed precision by @muellerzr in #388
- Fix OneCycle step length when in multiprocess by @muellerzr in #385
v0.9.0: Refactor utils to use in Transformers
This release offers no significant new API; it is just needed to have access to some utils in Transformers.
- Handle deprecation errors in launch by @muellerzr in #360
- Update launchers.py by @tmabraham in #363
- fix tracking by @pacman100 in #361
- Remove tensor call by @muellerzr in #365
- Add a utility for writing a barebones config file by @muellerzr in #371
- fix deepspeed model saving by @pacman100 in #370
- deepspeed save model temp fix by @pacman100 in #374
- Refactor tests to use accelerate launch by @muellerzr in #373
- fix zero stage-1 by @pacman100 in #378
- fix shuffling for ShufflerIterDataPipe instances by @loubnabnl in #376
- Better check for deepspeed availability by @sgugger in #379
- Refactor some parts in utils by @sgugger in #380
v0.8.0: Big model inference
Big model inference
To handle very large models, new functionality has been added in Accelerate:
- a context manager to initialize empty models
- a function to load a sharded checkpoint directly on the right devices
- a set of custom hooks that allow execution of a model split on different devices, as well as CPU or disk offload
- a magic method that auto-determines a device map for a given model, maximizing the available GPU space and RAM before using disk offload as a last resort
- a function that wraps the last three blocks in one simple call (`load_checkpoint_and_dispatch`)
See more in the documentation.
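A hedged sketch tying these pieces together (the architecture and checkpoint path are placeholders):

```python
import torch
from accelerate import init_empty_weights, load_checkpoint_and_dispatch

# 1. Instantiate the architecture without allocating any weight memory
#    (the model lives on the meta device).
with init_empty_weights():
    model = torch.nn.Sequential(*[torch.nn.Linear(1024, 1024) for _ in range(48)])

# 2. Load a (possibly sharded) checkpoint directly onto the right devices,
#    spilling to CPU RAM and then disk only once the GPUs are full.
model = load_checkpoint_and_dispatch(
    model,
    "path/to/sharded-checkpoint",  # hypothetical checkpoint location
    device_map="auto",
    offload_folder="offload",      # used only if disk offload is needed
)
```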
What's new
- Create peak_memory_uasge_tracker.py by @pacman100 in #336
- Fixed a typo to enable running accelerate correctly by @Idodox in #339
- Introduce multiprocess logger by @muellerzr in #337
- Refactor utils into its own module by @muellerzr in #340
- Improve num_processes question in CLI by @muellerzr in #343
- Handle Manual Wrapping in FSDP. Minor fix of fsdp example. by @pacman100 in #342
- Better prompt for number of training devices by @muellerzr in #344
- Fix prompt for num_processes by @pacman100 in #347
- Fix sample calculation in examples by @muellerzr in #352
- Fixing metric eval in distributed setup by @pacman100 in #355
- DeepSpeed and FSDP plugin support through script by @pacman100 in #356
v0.7.1 Patch release
v0.7.0: Logging API, FSDP, batch size finder and examples revamp
Logging API
Use any of your favorite logging libraries (TensorBoard, Wandb, CometML...) with just a few lines of code inside your training scripts with Accelerate. All details are in the documentation.
- Add logging capabilities by @muellerzr in #293
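A minimal, hedged sketch of the API (assumes `tensorboard` is installed; the project name, directory, and logged values are placeholders):

```python
from accelerate import Accelerator

# log_with selects the tracker backend; logging_dir is where TensorBoard
# event files are written.
accelerator = Accelerator(log_with="tensorboard", logging_dir="runs")
accelerator.init_trackers("my_project", config={"learning_rate": 1e-4})

for step in range(100):
    loss = 1.0 / (step + 1)  # placeholder for a real training loss
    accelerator.log({"train_loss": loss}, step=step)

accelerator.end_training()  # lets all trackers finish properly
```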
Support for FSDP (fully sharded DataParallel)
PyTorch recently released a new model wrapper for sharded DDP training called FSDP. This release adds support for it (note that it doesn't work with mixed precision yet). See all caveats in the documentation.
- PyTorch FSDP Feature Incorporation by @pacman100 in #321
Batch size finder
Say goodbye to CUDA OOM errors with the new `find_executable_batch_size` decorator. Just decorate your training function and pick a starting batch size, then let Accelerate do the rest (a sketch follows the PR link below).
- Add a memory-aware decorator for CUDA OOM avoidance by @muellerzr in #324
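A hedged sketch of the decorator in use (the training body is a placeholder; the import path shown is the current one, and at the time of this release the helper was also exposed via `accelerate.memory_utils`):

```python
from accelerate import Accelerator
from accelerate.utils import find_executable_batch_size

accelerator = Accelerator()

@find_executable_batch_size(starting_batch_size=128)
def train(batch_size):
    # On CUDA OOM, the decorator frees memory, halves batch_size, and
    # calls this function again until the training loop fits.
    print(f"Trying batch size {batch_size}")
    # ... build dataloaders with `batch_size` and run training here ...

train()
```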
Examples revamp
The Accelerate examples are now split in two: in the base folder you can find very simple NLP and computer vision examples, as well as complete versions incorporating all features. You can also browse the examples in the `by_feature` subfolder, which shows you exactly what code to add for each given feature (checkpointing, tracking, cross-validation, etc.).
- Refactor Examples by Feature by @muellerzr in #312
What's Changed
- Document save/load state by @muellerzr in #290
- Refactor precisions to its own enum by @muellerzr in #292
- Load model and optimizer states on CPU to avoid OOMs by @sgugger in #299
- Fix example for datasets v2 by @sgugger in #298
- Leave default as None in `mixed_precision` for launch command by @sgugger in #300
- Pass `lr_scheduler` to `Accelerator.prepare` by @sgugger in #301
- Create new TestCase classes and clean up W&B tests by @muellerzr in #304
- Have custom trackers work with the API by @muellerzr in #305
- Write tests for comet_ml by @muellerzr in #306
- Fix training in DeepSpeed by @sgugger in #308
- Update example scripts by @muellerzr in #307
- Use --no_local_rank for DeepSpeed launch by @sgugger in #309
- Fix Accelerate CLI CPU option + small fix for W&B tests by @muellerzr in #311
- Fix DataLoader sharding for deepspeed in accelerate by @m3rlin45 in #315
- Create a testing framework for example scripts and fix current ones by @muellerzr in #313
- Refactor Tracker logic and write guards for logging_dir by @muellerzr in #316
- Create Cross-Validation example by @muellerzr in #317
- Create alias for Accelerator.free_memory by @muellerzr in #318
- fix typo in docs of accelerate tracking by @loubnabnl in #320
- Update examples to show how to deal with extra validation copies by @muellerzr in #319
- Fixup all checkpointing examples by @muellerzr in #323
- Introduce reduce operator by @muellerzr in #326
New Contributors
- @m3rlin45 made their first contribution in #315
- @loubnabnl made their first contribution in #320
- @pacman100 made their first contribution in #321
Full Changelog: v0.6.0...v0.7.0
v0.6.2: Fix launcher with mixed precision
The launcher was ignoring the mixed precision attribute of the config since v0.6.0. This patch fixes that.
v0.6.1: Hot fix
Patches an issue with mixed precision (see #286)