 # [MASK] is All You Need
 
-
-This repository represents the official implementation of the paper titled "[MASK] is All You Need".
-
-[](https://compvis.github.io/mask)
-[](https://arxiv.org/abs/2412.06787)
-[](https://huggingface.co/collections/taohu/mask-is-all-you-need-6749a2ca0be7c4c5c055c122)
-[](https://github.com/CompVis/mask)
-[](https://github.com/CompVis/mask/issues?q=is%3Aissue+is%3Aclosed)
-[](https://www.apache.org/licenses/LICENSE-2.0)
-
-
-[Vincent Tao Hu](http://taohu.me),
-[Björn Ommer](https://ommer-lab.com/people/ommer/)
-
-## TLDR
-
-We present Discrete Interpolants to bridge Diffusion Models and Masked Generative Models in the discrete-state setting, and scale the approach up in the vision domain.
-
-
-
-
-
-## 🎓 Citation
-
-Please cite our paper:
-
-```bibtex
-@InProceedings{hu2024mask,
-  title={[MASK] is All You Need},
-  author={Vincent Tao Hu and Björn Ommer},
-  booktitle = {Arxiv},
-  year={2024}
-}
-```
-
-## :white_check_mark: Updates
-* **`Feb. 4th, 2025`**: Training code released.
-* **`Dec. 10th, 2024`**: arXiv preprint released.
-
-## 📦 Training
-
-
-#### COCO training (DeepSpeed)
-
-```bash
-CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch --num_processes 4 --num_machines 1 --main_process_ip 127.0.0.1 --main_process_port 8868 train_ds_vq.py model=uvit_s2deep_it data=coco14_cond_indices dynamic=linear dynamic.mask_ce=1 input_tensor_type=bwh tokenizer=sd_vq_f8 optim.wd=0.00 "optim.betas=[0.9, 0.9]" data.train_steps=1_000_000 ckpt_every=20_000 data.sample_fid_every=100_000 data.sample_fid_n=20_000 data.batch_size=64 optim.name=adam optim.lr=2e-4 lrschedule.warmup_steps=5000 dstep_num=500 mixed_precision=bf16 accum=4
-```
-
-#### ImageNet training (accelerator, bs256)
-
-```bash
-CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch --num_processes 4 --num_machines 1 --main_process_ip 127.0.0.1 --main_process_port 8868 train_acc_vq.py model=uvit_h2_it dynamic=linear input_tensor_type=bwh tokenizer=sd_vq_f8 data=imagenet256_cond_indices data.batch_size=64 data.sample_vis_n=16 data.sample_fid_every=50_000 ckpt_every=20_000 data.train_steps=1500_000 data.sample_fid_n=5_000 optim.name=adamw optim.lr=1e-4 optim.wd=0.0 lrschedule.warmup_steps=1 mixed_precision=bf16 accum=1
-```
-
-#### FaceForensics training (accelerator, bs64)
-
-
-```bash
-CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch --num_processes 4 --num_machines 1 --main_process_ip 127.0.0.1 --main_process_port 8868 train_acc_vq.py model=dlattte_xl2_uncond_it dynamic=linear input_tensor_type=btwh tokenizer=sd_vq_f8 data=ffs_indices data.sample_fid_every=10_000 data.batch_size=2 data.sample_fid_bs=1 data.sample_fid_n=10_00 data.train_steps=400_000 data.sample_vis_n=1 ckpt_latte=pretrained_ckpt/dit/DiT-XL-2-256x256.pt accum=8 mixed_precision=bf16
-```
-
-
-## Evaluation
-
-#### ImageNet
-
-```bash
-CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch --num_processes 4 --num_machines 1 --main_process_ip 127.0.0.1 --main_process_port 8868 sample_ds_vq.py model=dit_xl2_it dynamic=linear input_tensor_type=bwh tokenizer=sd_vq_f8 data=imagenet256_cond_indices data.batch_size=64 data.sample_vis_n=16 data.sample_fid_every=40_000 data.sample_fid_n=5_000 optim.name=adamw optim.lr=1e-4 optim.wd=0.0 lrschedule.warmup_steps=0 data.train_steps=1_400_000 ckpt_every=20_000 mixed_precision=bf16 accum=1 num_fid_samples=50000 offline.lbs=100 dynamic.disint.scheduler=linear dynamic.disint.sampler=maskgit maskgit_randomize=linear top_k=0 top_p=0 offline.save_samples_to_disk=1 sm_t=1.3 use_cfg=1 cfg_scale=2 dstep_num=20 ckpt="in256_ditxl2_it_1220000.pt"
-```
-
-#### COCO
-
-```bash
-CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch --num_processes 4 --num_machines 1 --main_process_ip 127.0.0.1 --main_process_port 8868 sample_acc_vq.py model=uvit_s2deep_it data=coco14_cond_indices dynamic=linear dynamic.mask_ce=1 input_tensor_type=bwh tokenizer=sd_vq_f8 optim.wd=0.00 "optim.betas=[0.9, 0.9]" data.train_steps=1_000_000 ckpt_every=20_000 data.sample_fid_every=100_000 data.sample_fid_n=20_000 data.batch_size=64 optim.name=adam optim.lr=2e-4 lrschedule.warmup_steps=5000 dstep_num=500 mixed_precision=bf16 num_fid_samples=50000 offline.lbs=100 dynamic.disint.scheduler=linear dynamic.disint.sampler=maskgit maskgit_randomize=linear top_k=0 top_p=0 offline.save_samples_to_disk=1 sm_t=1.3 use_cfg=1 cfg_scale=2 dstep_num=20 ckpt="coco14_uvit_s2deep_it_1600000.pt"
-```
-
-#### FaceForensics
-
-```bash
-TODO
-```
-
-## Weights
-
-| Dataset | Model | FID $\downarrow$ | HF weights🤗 |
-|:----------:|:-----:|:-------:|:------------------------------------------------------------------------------------|
-| ImageNet $256\times 256$, latents: $32\times 32$| DiT_XL2_IT | 8.26 | [weight.pth](https://huggingface.co/CompVis/discrete_interpolants/blob/main/in256_ditxl2_it_1220000.pt) |
-| COCO $256\times 256$, latents: $32\times 32$| DiT_S2Deep_IT | - | [weight.pth](https://huggingface.co/CompVis/discrete_interpolants/blob/main/coco14_uvit_s2deep_it_1600000.pt) |
-
-
-
-## Dataset Preparation
-
-TODO
-
-## Star History
-
-[](https://star-history.com/#CompVis/discrete-interpolants&Date)
-
-## 🎫 License
-
-This work is licensed under the Apache License, Version 2.0 (as defined in the [LICENSE](LICENSE.txt)).
-
-By downloading and using the code and model you agree to the terms in the [LICENSE](LICENSE.txt).
-
-[](https://www.apache.org/licenses/LICENSE-2.0)
-
+New Repo: [https://github.com/CompVis/discrete-interpolants](https://github.com/CompVis/discrete-interpolants)