fp8 #266

Open · wants to merge 46 commits into main

Conversation

@xrsrke (Member) commented Dec 18, 2024

This PR contains the implementation of the 2nd FP8 pretraining recipe, except for the FP8 optimizer states and update clipping. For the experimental implementation of recipe 1, please check out [this pull request].

Convergence

Found two stable FP8 pretraining recipes that pretrain a LLaMA 2 architecture in FP8 for both the forward and backward passes, as well as both optimizer momentums (50% memory reduction), while matching the standard BF16 mixed-precision baseline after 100B tokens [link] [pull request]

  • Recipe 1 with architectural and optimizer changes [link]

[image]

  • Ablated recipe 2 without architectural changes (the better recipe) [link]: remove all the architectural changes from recipe 1 and add gradient clipping (omitting it earlier was a silly mistake); a minimal sketch of the clipping step is shown below
[image]
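For reference, the gradient clipping added in recipe 2 is standard global-norm clipping applied before the optimizer step. A minimal sketch of where it sits in a training step; the `model`, `optimizer`, `batch` objects and `max_norm=1.0` are placeholders, not the exact settings used in this PR:

```python
import torch

def training_step(model, optimizer, batch, max_norm: float = 1.0):
    loss = model(batch)          # forward pass (FP8 GEMMs happen inside the model); placeholder API
    loss.backward()              # backward pass

    # Global-norm gradient clipping: this is the extra step recipe 2 adds on top of recipe 1.
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)

    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    return loss.detach(), grad_norm
```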

Loss curve of a 1B LLaMA 2 trained for 100B tokens

[image]

Loss curve of a 7B LLaMA 2 trained for 24k steps with a 300k batch size (except the 2nd momentum, which is kept in bfloat16)

[image]
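For orientation on the optimizer-state side of the memory claim, here is the per-parameter byte cost of Adam's two momentums in the dtype combinations used above. This is only a sizing aid under the usual dtype sizes; which baseline the "50% memory reduction" is measured against is not restated here.

```python
BYTES = {"fp32": 4, "bf16": 2, "fp8": 1}

def momentum_bytes_per_param(exp_avg: str, exp_avg_sq: str) -> int:
    """Bytes of Adam momentum state stored per model parameter."""
    return BYTES[exp_avg] + BYTES[exp_avg_sq]

print(momentum_bytes_per_param("bf16", "bf16"))  # 4 bytes/param with bf16 momentums
print(momentum_bytes_per_param("fp8", "fp8"))    # 2 bytes/param: both momentums in FP8 (1B run)
print(momentum_bytes_per_param("fp8", "bf16"))   # 3 bytes/param: 2nd momentum kept in bf16 (7B run)
```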

Speed

  • Got 1033 FLOPs for an FP8 tensor parallel linear [link]

[image]
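For context on how a number like that is typically measured: time the GEMM behind the linear layer and divide 2·M·N·K by the elapsed time. The sketch below uses a plain bf16 matmul as a stand-in for the FP8 kernel, and the shapes are arbitrary rather than those of the linked benchmark:

```python
import time
import torch

def measured_tflops(m: int, n: int, k: int, iters: int = 50) -> float:
    """Time an (m, k) @ (k, n) matmul and report achieved TFLOPS (2*m*n*k FLOPs per GEMM)."""
    a = torch.randn(m, k, device="cuda", dtype=torch.bfloat16)
    b = torch.randn(k, n, device="cuda", dtype=torch.bfloat16)
    for _ in range(5):                     # warmup
        a @ b
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return 2 * m * n * k * iters / elapsed / 1e12

# Example: measured_tflops(8192, 8192, 8192)
```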

Failed Experiments

  • Smooth quantization (the paper shows it works well for inference): without gradient clipping it solves some divergence issues, but with gradient clipping it hurts model performance
  • AdamW_atan2: it blew up even earlier than AdamW
  • Syncing the amax of the same FP8 tensor across TP ranks (weights, input gradients, weight gradients, output gradients) → doesn't make much difference in performance
  • Weight decay without learning rate decay
  • QKV clipping
  • CohereLayerNorm
  • Tune Triton RMS norm
  • Tune gradient clipping's epsilon factor (as expected, at some scale this factor does influence training stability)
  • Delayed quantization: not much difference (see the delayed-scaling sketch after this list)
  • Tuned the quantization interval: 1, 2, 16, 32 → not much of a difference
  • Warm-up quantization: not much difference (warm up the amax before computing the amax at each interval)
  • Try a truncated normal distribution (timm's trunc_normal_) for initializing FP8 weights → it didn't fix the divergence in recipe 1's setup
    • trunc_normal_(weight, std=0.02)
    • trunc_normal_(weight, std=math.sqrt(1 / 64))
    • trunc_normal_(weight, std=math.sqrt(1 / 64 * 4))
    • trunc_normal_(weight, std=1)
  • In recipe 1's setup:
    • The model's loss got stuck at 8 when using PyTorch's default random weight initialization
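Several of the items above (delayed quantization, the quantization interval, amax warm-up) revolve around the standard delayed-scaling bookkeeping: track each tensor's amax over recent steps and only refresh its FP8 scale every `interval` steps. A minimal sketch of that idea, assuming E4M3 with a max representable value of 448; the class and constants are illustrative, not this PR's implementation:

```python
import torch

E4M3_MAX = 448.0  # largest finite value representable in fp8 E4M3

class DelayedScaler:
    """Track amax history and refresh the quantization scale every `interval` steps."""

    def __init__(self, interval: int = 16, history: int = 16):
        self.interval = interval
        self.history = history
        self.amax_history = []
        self.scale = 1.0
        self.step = 0

    def update(self, tensor: torch.Tensor) -> float:
        # Record the current amax, keeping only the most recent `history` values.
        self.amax_history = (self.amax_history + [tensor.abs().max().item()])[-self.history:]
        if self.step % self.interval == 0:
            # Delayed scaling: recompute the scale from the amax window only at the interval.
            amax = max(self.amax_history)
            self.scale = E4M3_MAX / amax if amax > 0 else 1.0
        self.step += 1
        return self.scale

    def quantize(self, tensor: torch.Tensor) -> torch.Tensor:
        scale = self.update(tensor)
        # Scale into the E4M3 range and clamp; a real kernel would also cast to an fp8 dtype.
        return (tensor * scale).clamp_(-E4M3_MAX, E4M3_MAX)
```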
