Releases: huggingface/pytorch-image-models
Swin Transformer V2 (CR) weights and experiments
This release holds weights for timm's variant of Swin V2 (from @ChristophReich1996 impl, https://github.com/ChristophReich1996/Swin-Transformer-V2)
NOTE: the ns variants of the models have extra norms on the main branch at the end of each stage, which appears to help training. The current small model is not using this, but one is currently training. A non-ns tiny will follow soon, along with a comparison. in21k and 1k base models are also in the works...
Small checkpoints were trained on TPU-VM instances via the TPU Research Cloud (https://sites.research.google/trc/about/).
- swin_v2_tiny_ns_224 - 81.80 top-1
- swin_v2_small_224 - 83.13 top-1
- swin_v2_small_ns_224 - 83.5 top-1
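A minimal usage sketch: load one of the weights listed above and run a dummy forward pass. The model name is taken from the list and assumed to be registered as-is in this timm version (names may differ in later releases).

```python
import torch
import timm

# Create one of the Swin V2 (CR) variants listed above with its pretrained weights.
# Name assumed to be registered as listed in this timm version.
model = timm.create_model('swin_v2_small_ns_224', pretrained=True)
model.eval()

with torch.no_grad():
    logits = model(torch.randn(1, 3, 224, 224))
print(logits.shape)  # expect (1, 1000) for an ImageNet-1k head
```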
TPU VM trained weight release w/ PyTorch XLA
A wide range of mid to large sized models trained with PyTorch XLA on TPU VM instances, demonstrating the viability of the TPU + PyTorch combination for excellent image model results. All models were trained with the bits_and_tpu branch of this codebase.
A big thanks to the TPU Research Cloud (https://sites.research.google/trc/about/) for the compute used in these experiments.
This set includes several novel weights, including EvoNorm-S RegNetZ (C/D timm variants) and ResNet-V2 model experiments, as well as custom pre-activation model variants of RegNet-Y (called RegNet-V) and Xception (Xception-P) models.
Many, if not all, of the included RegNet weights surpass the original paper results by a wide margin and remain above other known results (e.g. recent torchvision updates) on ImageNet-1k validation, and especially on OOD test set / robustness performance and scaling to higher resolutions.
RegNets
- regnety_040 - 82.3 @ 224, 82.96 @ 288
- regnety_064 - 83.0 @ 224, 83.65 @ 288
- regnety_080 - 83.17 @ 224, 83.86 @ 288
- regnetv_040 - 82.44 @ 224, 83.18 @ 288 (timm pre-act)
- regnetv_064 - 83.1 @ 224, 83.71 @ 288 (timm pre-act)
- regnetz_040 - 83.67 @ 256, 84.25 @ 320
- regnetz_040h - 83.77 @ 256, 84.5 @ 320 (w/ extra fc in head)
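A quick sketch of what the "@ 224" vs "@ 288" numbers above mean: the same weights are evaluated at two test resolutions. The convolutional RegNets accept either input size directly; only the eval transform's resize/crop size changes. regnety_040 is used here as an example.

```python
import torch
import timm

# Same pretrained weights, two eval resolutions; only the input size changes.
model = timm.create_model('regnety_040', pretrained=True).eval()
with torch.no_grad():
    for size in (224, 288):
        logits = model(torch.randn(1, 3, size, size))
        print(size, tuple(logits.shape))
```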
Alternative norm layers (no BN!)
- resnetv2_50d_gn - 80.8 @ 224, 81.96 @ 288 (pre-act GroupNorm)
- resnetv2_50d_evos - 80.77 @ 224, 82.04 @ 288 (pre-act EvoNormS)
- regnetz_c16_evos - 81.9 @ 256, 82.64 @ 320 (EvoNormS)
- regnetz_d8_evos - 83.42 @ 256, 84.04 @ 320 (EvoNormS)
Xception redux
- xception41p - 82 @ 299 (timm pre-act)
- xception65 - 83.17 @ 299
- xception65p - 83.14 @ 299 (timm pre-act)
ResNets (w/ SE and/or NeXT)
- resnext101_64x4d - 82.46 @ 224, 83.16 @ 288
- seresnext101_32x8d - 83.57 @ 224, 84.27 @ 288
- seresnext101d_32x8d - 83.69 @ 224, 84.35 @ 288
- seresnextaa101d_32x8d - 83.85 @ 224, 84.57 @ 288
- resnetrs200 - 83.85 @ 256, 84.44 @ 320
Vision transformer experiments -- relpos, residual-post-norm, layer-scale, fc-norm, and GAP
- vit_relpos_base_patch32_plus_rpn_256 - 79.5 @ 256, 80.6 @ 320 -- rel pos + extended width + res-post-norm, no class token, avg pool
- vit_relpos_small_patch16_224 - 81.5 @ 224, 82.5 @ 320 -- rel pos, layer scale, no class token, avg pool
- vit_relpos_medium_patch16_rpn_224 - 82.3 @ 224, 83.1 @ 320 -- rel pos + res-post-norm, no class token, avg pool
- vit_base_patch16_rpn_224 - 82.3 @ 224 -- rel pos + res-post-norm, no class token, avg pool
- vit_relpos_medium_patch16_224 - 82.5 @ 224, 83.3 @ 320 -- rel pos, layer scale, no class token, avg pool
- vit_relpos_base_patch16_224 - 82.5 @ 224, 83.6 @ 320 -- rel pos, layer scale, no class token, avg pool
- vit_relpos_base_patch16_gapcls_224 - 82.8 @ 224, 83.9 @ 320 -- rel pos, layer scale, class token, avg pool (by mistake)
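A small sketch of the "no class token, avg pool" point above: with the classifier removed (num_classes=0), these models return globally average pooled features rather than a class-token embedding. The model name is assumed to be registered as listed.

```python
import torch
import timm

# Remove the classifier head; the output is the GAP-pooled feature vector,
# since these relpos ViTs have no class token.
model = timm.create_model('vit_relpos_base_patch16_224', pretrained=True, num_classes=0).eval()
with torch.no_grad():
    feats = model(torch.randn(1, 3, 224, 224))
print(feats.shape)  # (1, embed_dim) pooled features
```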
MobileViT weights
Pretrained weights for MobileViT and MobileViT-V2 adapted from Apple impl at https://github.com/apple/ml-cvnets
Checkpoints are remapped to the timm impl of the model, with BGR corrected to RGB (for V1).
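For illustration only (this is not the actual conversion script): correcting a BGR-trained checkpoint for RGB input amounts to flipping the input-channel dimension of the first convolution's weights. The state dict layout and the 'stem.conv.weight' key below are hypothetical.

```python
import torch

def bgr_stem_to_rgb(state_dict, stem_key='stem.conv.weight'):
    # Hypothetical key name; the real remap script targets the actual stem conv.
    w = state_dict[stem_key]          # shape (out_ch, 3, kH, kW), channels in BGR order
    state_dict[stem_key] = w.flip(1)  # reverse the 3 input channels -> RGB order
    return state_dict
```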
v0.5.4 - More weights, models. ResNet strikes back, self-attn - convnet hybrids, optimizers and more
Default conv_mlp to False across the board for ConvNeXt, causing issu…
v0.1-rsb-weights
Weights for ResNet Strikes Back
Paper: https://arxiv.org/abs/2110.00476
More details on weights and hparams to come...
v0.1-attn-weights
A collection of weights I've trained comparing various types of SE-like blocks (SE, ECA, GC, etc.), self-attention blocks (bottleneck, halo, lambda), and related non-attn baselines.
ResNet-26-T series
- [2, 2, 2, 2] repeat Bottleneck block ResNet architecture
- ReLU activations
- 3 layer stem with 24, 32, 64 chs, max-pool
- avg pool in shortcut downsample
- self-attn blocks replace the 3x3 conv in both blocks of the last stage, and in the second block of the penultimate stage
| model | top1 | top1_err | top5 | top5_err | param_count | img_size | crop_pct | interpolation |
|---|---|---|---|---|---|---|---|---|
| botnet26t_256 | 79.246 | 20.754 | 94.53 | 5.47 | 12.49 | 256 | 0.95 | bicubic |
| halonet26t | 79.13 | 20.87 | 94.314 | 5.686 | 12.48 | 256 | 0.95 | bicubic |
| lambda_resnet26t | 79.112 | 20.888 | 94.59 | 5.41 | 10.96 | 256 | 0.94 | bicubic |
| lambda_resnet26rpt_256 | 78.964 | 21.036 | 94.428 | 5.572 | 10.99 | 256 | 0.94 | bicubic |
| resnet26t | 77.872 | 22.128 | 93.834 | 6.166 | 16.01 | 256 | 0.94 | bicubic |
Details:
- HaloNet - 8 pixel block size, 2 pixel halo (overlap), relative position embedding
- BotNet - relative position embedding
- Lambda-ResNet-26-T - 3d lambda conv, kernel = 9
- Lambda-ResNet-26-RPT - relative position embedding
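The img_size / crop_pct / interpolation columns in the table above correspond to the eval preprocessing carried in each model's pretrained config, which timm can resolve into a matching transform. A minimal sketch, using halonet26t as an example:

```python
import timm
from timm.data import resolve_data_config, create_transform

# Resolve the eval preprocessing from the model's pretrained config.
model = timm.create_model('halonet26t', pretrained=True)
config = resolve_data_config({}, model=model)
print(config)  # expect input_size (3, 256, 256), crop_pct ~0.95, bicubic interpolation
transform = create_transform(**config, is_training=False)
```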
Benchmark - RTX 3090 - AMP - NCHW - NGC 21.09
| model | infer_samples_per_sec | infer_step_time | infer_batch_size | infer_img_size | train_samples_per_sec | train_step_time | train_batch_size | train_img_size | param_count |
|---|---|---|---|---|---|---|---|---|---|
| resnet26t | 2967.55 | 86.252 | 256 | 256 | 857.62 | 297.984 | 256 | 256 | 16.01 |
| botnet26t_256 | 2642.08 | 96.879 | 256 | 256 | 809.41 | 315.706 | 256 | 256 | 12.49 |
| halonet26t | 2601.91 | 98.375 | 256 | 256 | 783.92 | 325.976 | 256 | 256 | 12.48 |
| lambda_resnet26t | 2354.1 | 108.732 | 256 | 256 | 697.28 | 366.521 | 256 | 256 | 10.96 |
| lambda_resnet26rpt_256 | 1847.34 | 138.563 | 256 | 256 | 644.84 | 197.892 | 128 | 256 | 10.99 |
Benchmark - RTX 3090 - AMP - NHWC - NGC 21.09
| model | infer_samples_per_sec | infer_step_time | infer_batch_size | infer_img_size | train_samples_per_sec | train_step_time | train_batch_size | train_img_size | param_count |
|---|---|---|---|---|---|---|---|---|---|
| resnet26t | 3691.94 | 69.327 | 256 | 256 | 1188.17 | 214.96 | 256 | 256 | 16.01 |
| botnet26t_256 | 3291.63 | 77.76 | 256 | 256 | 1126.68 | 226.653 | 256 | 256 | 12.49 |
| halonet26t | 3230.5 | 79.232 | 256 | 256 | 1077.82 | 236.934 | 256 | 256 | 12.48 |
| lambda_resnet26rpt_256 | 2324.15 | 110.133 | 256 | 256 | 864.42 | 147.485 | 128 | 256 | 10.99 |
| lambda_resnet26t | Not Supported |
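A minimal sketch of the AMP + NHWC (channels-last) setup the benchmark tables refer to; the numbers themselves were presumably produced with the repo's benchmark script, this just shows the underlying PyTorch knobs (batch size reduced here so it runs on smaller GPUs).

```python
import torch
import timm

# Convert model and input to channels-last memory format, run under AMP autocast.
model = timm.create_model('botnet26t_256').cuda().eval()
model = model.to(memory_format=torch.channels_last)

x = torch.randn(32, 3, 256, 256, device='cuda').to(memory_format=torch.channels_last)
with torch.no_grad(), torch.cuda.amp.autocast():
    y = model(x)
print(y.dtype, y.shape)
```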
ResNeXt-26-T series
- [2, 2, 2, 2] repeat Bottleneck block ResNeXt architectures
- SiLU activations
- grouped 3x3 convolutions in bottleneck, 32 channels per group
- 3 layer stem with 24, 32, 64 chs, max-pool
- avg pool in shortcut downsample
- channel attn (active in non self-attn blocks) between 3x3 and last 1x1 conv
- when active, self-attn blocks replace the 3x3 conv in both blocks of the last stage, and in the second block of the penultimate stage
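Before the results, a small sketch for locating the channel attention (ECA) and halo self-attention modules in eca_halonext26ts to see which blocks they occupy; matching on module class names (EcaModule, HaloAttn) is an assumption about this timm version's naming.

```python
import timm

# Print where ECA channel attention and halo self-attention modules appear.
model = timm.create_model('eca_halonext26ts')
for name, module in model.named_modules():
    cls = type(module).__name__
    if 'Eca' in cls or 'Halo' in cls:
        print(name, cls)
```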
| model | top1 | top1_err | top5 | top5_err | param_count | img_size | crop_pct | interpolation |
|---|---|---|---|---|---|---|---|---|
| eca_halonext26ts | 79.484 | 20.516 | 94.600 | 5.400 | 10.76 | 256 | 0.94 | bicubic |
| eca_botnext26ts_256 | 79.270 | 20.730 | 94.594 | 5.406 | 10.59 | 256 | 0.95 | bicubic |
| bat_resnext26ts | 78.268 | 21.732 | 94.1 | 5.9 | 10.73 | 256 | 0.9 | bicubic |
| seresnext26ts | 77.852 | 22.148 | 93.784 | 6.216 | 10.39 | 256 | 0.9 | bicubic |
| gcresnext26ts | 77.804 | 22.196 | 93.824 | 6.176 | 10.48 | 256 | 0.9 | bicubic |
| eca_resnext26ts | 77.446 | 22.554 | 93.57 | 6.43 | 10.3 | 256 | 0.9 | bicubic |
| resnext26ts | 76.764 | 23.236 | 93.136 | 6.864 | 10.3 | 256 | 0.9 | bicubic |
Benchmark - RTX 3090 - AMP - NCHW - NGC 21.09
| model | infer_samples_per_sec | infer_step_time | infer_batch_size | infer_img_size | train_samples_per_sec | train_step_time | train_batch_size | train_img_size | param_count |
|---|---|---|---|---|---|---|---|---|---|
| resnext26ts | 3006.57 | 85.134 | 256 | 256 | 864.4 | 295.646 | 256 | 256 | 10.3 |
| seresnext26ts | 2931.27 | 87.321 | 256 | 256 | 836.92 | 305.193 | 256 | 256 | 10.39 |
| eca_resnext26ts | 2925.47 | 87.495 | 256 | 256 | 837.78 | 305.003 | 256 | 256 | 10.3 |
| gcresnext26ts | 2870.01 | 89.186 | 256 | 256 | 818.35 | 311.97 | 256 | 256 | 10.48 |
| eca_botnext26ts_256 | 2652.03 | 96.513 | 256 | 256 | 790.43 | 323.257 | 256 | 256 | 10.59 |
| eca_halonext26ts | 2593.03 | 98.705 | 256 | 256 | 766.07 | 333.541 | 256 | 256 | 10.76 |
| bat_resnext26ts | 2469.78 | 103.64 | 256 | 256 | 697.21 | 365.964 | 256 | 256 | 10.73 |
Benchmark - RTX 3090 - AMP - NHWC - NGC 21.09
NOTE: there are performance issues with certain grouped conv configs in the channels-last layout; the backwards pass in particular is very slow. This also causes issues for RegNet and NFNet networks (see the sketch after the table below).
| model | infer_samples_per_sec | infer_step_time | infer_batch_size | infer_img_size | train_samples_per_sec | train_step_time | train_batch_size | train_img_size | param_count |
|---|---|---|---|---|---|---|---|---|---|
| resnext26ts | 3952.37 | 64.755 | 256 | 256 | 608.67 | 420.049 | 256 | 256 | 10.3 |
| eca_resnext26ts | 3815.77 | 67.074 | 256 | 256 | 594.35 | 430.146 | 256 | 256 | 10.3 |
| seresnext26ts | 3802.75 | 67.304 | 256 | 256 | 592.82 | 431.14 | 256 | 256 | 10.39 |
| gcresnext26ts | 3626.97 | 70.57 | 256 | 256 | 581.83 | 439.119 | 256 | 256 | 10.48 |
| eca_botnext26ts_256 | 3515.84 | 72.8 | 256 | 256 | 611.71 | 417.862 | 256 | 256 | 10.59 |
| eca_halonext26ts | 3410.12 | 75.057 | 256 | 256 | 597.52 | 427.789 | 256 | 256 | 10.76 |
| bat_resnext26ts | 3053.83 | 83.811 | 256 | 256 | 533.23 | 478.839 | 256 | 256 | 10.73 |
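A rough, standalone illustration of the grouped-conv channels-last issue noted above the table; this is not the repo's benchmark script, and the actual slowdown depends heavily on GPU, cuDNN, and PyTorch/NGC version.

```python
import time
import torch
import torch.nn as nn

def time_grouped_conv(memory_format, iters=50):
    # Grouped 3x3 conv, forward + backward, timed in the given memory format.
    conv = nn.Conv2d(256, 256, 3, padding=1, groups=32).cuda().to(memory_format=memory_format)
    x = torch.randn(64, 256, 32, 32, device='cuda').to(memory_format=memory_format).requires_grad_()
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        conv(x).sum().backward()
    torch.cuda.synchronize()
    return (time.time() - start) / iters

if torch.cuda.is_available():
    print('contiguous    :', time_grouped_conv(torch.contiguous_format))
    print('channels_last :', time_grouped_conv(torch.channels_last))
```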
ResNet-33-T series
- [2, 3, 3, 2] repeat Bottleneck block ResNet architecture
- SiLU activations
- 3 layer stem with 24, 32, 64 chs, no max-pool, 1st and 3rd conv stride 2
- avg pool in shortcut downsample
- channel attn (active in non self-attn blocks) between 3x3 and last 1x1 conv
- when active, self-attn blocks replace the 3x3 conv in the last block of stages 2 and 3, and in both blocks of the final stage
- FC 1x1 conv between last block and classifier
The 33-layer models have an extra 1x1 FC layer between the last conv block and the classifier. There is both a non-attention 33-layer baseline and a 32-layer baseline without the extra FC.
| model | top1 | top1_err | top5 | top5_err | param_count | img_size | crop_pct | interpolation |
|---|---|---|---|---|---|---|---|---|
| sehalonet33ts | 80.986 | 19.014 | 95.272 | 4.728 | 13.69 | 256 | 0.94 | bicubic |
| seresnet33ts | 80.388 | 19.612 | 95.108 | 4.892 | 19.78 | 256 | 0.94 | bicubic |
| eca_resnet33ts | 80.132 | 19.868 | 95.054 | 4.946 | 19.68 | 256 | 0.94 | bicubic |
| gcresnet33ts | 79.99 | 20.01 | 94.988 | 5.012 | 19.88 | 256 | 0.94 | bicubic |
| resnet33ts | 79.352 | 20.648 | 94.596 | 5.404 | 19.68 | 256 | 0.94 | bicubic |
| resnet32ts | 79.028... |
v0.4.12. Vision Transformer AugReg support and more
- Vision Transformer AugReg weights and model defs (https://arxiv.org/abs/2106.10270)
- ResMLP official weights
- ECA-NFNet-L2 weights
- gMLP-S weights
- ResNet51-Q
- Visformer, LeViT, ConViT, Twins
- Many fixes, improvements, better test coverage
3rd Party Vision Transformer Weights
A catch-all (ish) release for storing vision transformer weights adapted/rehosted from 3rd parties. Too many incoming models for one release per source...
Containing weights from:
- Twins - https://github.com/Meituan-AutoML/Twins
- Visformer - danczs/Visformer#2
- NesT (Aggregated Nested Transformer) - weights converted from https://github.com/google-research/nested-transformer with @alexander-soare's script
v0.4.9. EfficientNetV2. MLP-Mixer. ResNet-RS. More vision transformers.
Fix drop/drop_path arg on MLP-Mixer model. Fix #641
EfficientNet-V2 weights ported from the TensorFlow impl
Weights from https://github.com/google/automl/tree/master/efficientnetv2
Paper: EfficientNetV2: Smaller Models and Faster Training - https://arxiv.org/abs/2104.00298
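A minimal usage sketch, assuming the ported weights are registered under tf_-prefixed names (e.g. tf_efficientnetv2_s) reflecting their TensorFlow origin; the exact names available depend on the timm version.

```python
import timm

# Create one of the TF-ported EfficientNetV2 models and inspect its default config.
model = timm.create_model('tf_efficientnetv2_s', pretrained=True)
print(model.default_cfg)
```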