
Support Z Loss in CE #239

Open · wants to merge 21 commits into main
Conversation

@Tcc0403 (Contributor) commented Sep 10, 2024

Summary

This PR aims to resolve #197

Implemented z loss in LigerCrossEntropy.

Note: lse_square_scale is not exposed in flce yet; still having issues passing the tests there.

Details

For loss:

$$\begin{align} L_{total} &= L_{ce} + z\_loss\\ z\_loss &= lse\_square\_scale \cdot lse^2\\ lse &= \log \sum e^{X_i} \end{align}$$

We can use $m = \max(X_i)$ and $d = \sum e^{X_i - m}$, both obtained from the online softmax algorithm, to calculate $lse$ directly.

$$\begin{align} lse &= \log \sum e^{X_i}\\ &= \log \sum e^{X_i - m + m} = \log \sum e^{X_i - m} \cdot e^m\\ &= \log\left(e^m\sum e^{X_i - m}\right) = m + \log d \end{align}$$
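
A quick PyTorch sanity check of this identity (just a sketch, not the Triton kernel; the function name `z_loss_reference` and the 2D logits shape are assumptions for illustration):

```python
import torch

def z_loss_reference(logits: torch.Tensor, lse_square_scale: float) -> torch.Tensor:
    # m and d are exactly what an online-softmax pass produces:
    # the running max and the sum of exp(x - m)
    m = logits.max(dim=-1, keepdim=True).values
    d = torch.exp(logits - m).sum(dim=-1, keepdim=True)
    lse = (m + torch.log(d)).squeeze(-1)          # lse = m + log d
    # agrees with the direct computation
    assert torch.allclose(lse, torch.logsumexp(logits, dim=-1), atol=1e-5)
    return lse_square_scale * lse ** 2            # z_loss = lse_square_scale * lse^2
```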

For gradients:

First, we calculate the derivative of $lse$:

$$\begin{align} \frac{\partial}{\partial x_i}(lse) &= \frac{\partial}{\partial x_i}\left(\log \sum_j e^{x_j}\right) \\ &= \frac{1}{\sum_j e^{x_j}} \cdot \frac{\partial}{\partial x_i} \sum_j e^{x_j}\\ &= \frac{e^{x_i}}{\sum_j e^{x_j}} = softmax(x_i). \end{align}$$

Then we can obtain the derivative of z_loss by the chain rule:

$$\frac{\partial z\_loss}{\partial x_i} = \frac{\partial}{\partial x_i}\left( lse\_square\_scale \cdot lse^2\right) = 2\cdot lse\_square\_scale \cdot lse \cdot softmax(x_i),$$

and we have the derivative of the cross entropy loss with label smoothing:

$$\frac{\partial L_{ce}}{\partial x_i} = softmax(x_i) - (1 - \epsilon)\delta_{i,y} - \frac{\epsilon}{K}= \begin{cases} softmax(x_i) - \frac{\epsilon}{K}, & i \neq y \\ softmax(x_i) - \frac{\epsilon}{K} - (1 - \epsilon), & i = y \end{cases}$$

where $\epsilon$ is label_smoothing and $K$ is the total number of classes.
Thus, the derivative of the total loss is

$$\begin{align} \frac{\partial}{\partial x_i}L_{total} &= \frac{\partial}{\partial x_i}L_{ce} + \frac{\partial}{\partial x_i}z\_loss\\ &= softmax(x_i) - \frac{\epsilon}{K} - (1 - \epsilon)\delta_{i,y} + 2\cdot lse\_square\_scale \cdot lse \cdot softmax(x_i)\\ &=\begin{cases} (1 + 2\cdot lse\_square\_scale \cdot lse)\ softmax(x_i) - \frac{\epsilon}{K}, & i \neq y\\ (1 + 2\cdot lse\_square\_scale \cdot lse)\ softmax(x_i) - \frac{\epsilon}{K} - (1 - \epsilon), & i = y \end{cases} \end{align}$$
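
As a sanity check on this formula, here is a rough PyTorch sketch (not part of the PR; the names `total_loss_reference` and `analytic_grad` are illustrative, and mean reduction over the batch is assumed) comparing the analytic gradient above against autograd:

```python
import torch
import torch.nn.functional as F

def total_loss_reference(logits, target, label_smoothing, lse_square_scale):
    # L_total = L_ce (with label smoothing) + mean(lse_square_scale * lse^2)
    lse = torch.logsumexp(logits, dim=-1)
    ce = F.cross_entropy(logits, target, label_smoothing=label_smoothing)
    return ce + (lse_square_scale * lse ** 2).mean()

def analytic_grad(logits, target, label_smoothing, lse_square_scale):
    n, K = logits.shape
    p = torch.softmax(logits, dim=-1)
    lse = torch.logsumexp(logits, dim=-1, keepdim=True)
    # (1 + 2 * lse_square_scale * lse) * softmax(x_i) - eps/K, minus (1 - eps) at the target
    grad = (1.0 + 2.0 * lse_square_scale * lse) * p - label_smoothing / K
    grad[torch.arange(n), target] -= 1.0 - label_smoothing
    return grad / n  # mean reduction over the batch

logits = torch.randn(8, 32, dtype=torch.double, requires_grad=True)
target = torch.randint(0, 32, (8,))
total_loss_reference(logits, target, 0.1, 1e-4).backward()
assert torch.allclose(logits.grad, analytic_grad(logits.detach(), target, 0.1, 1e-4))
```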

References

PaLM: Scaling Language Modeling with Pathways
Chameleon: Mixed-Modal Early-Fusion Foundation Models

Testing Done

benchmark gist
Negligible difference in the speed benchmark.

This benchmark was run on my machine, so it is probably not very accurate.

liger ce: 66.123ms
Peak mem:  8.66200832

liger ce with zloss: 65.991ms
Peak mem:  8.66200832

liger ce with zloss with return zloss: 65.951ms
Peak mem:  8.662073856
  • Hardware Type:
  • run make test to ensure correctness
  • run make checkstyle to ensure code style
  • run make test-convergence to ensure convergence

@Tcc0403 changed the title from "Ce z loss" to "Support Z Loss in CE" on Sep 10, 2024

@Tcc0403 (Contributor, Author) commented Sep 10, 2024

Passed all tests. Ready for review!

loss_stride,  # stride of the loss tensor
n_cols,  # number of classes (vocab size)
n_non_ignore,  # number of targets not equal to ignore_index
ignore_index,  # target value excluded from the loss
label_smoothing: tl.constexpr,
lse_square_scale: tl.constexpr,

@Tcc0403 (Contributor, Author) commented on the code above:

I'm not sure whether making label_smoothing and lse_square_scale tl.constexpr is the right move.
I'm not familiar with model training. Are these two parameters often changed in practice? I'm worried that it might cause the same issue as #146.

Flash-attention's implementation instead derives a new constexpr in triton.heuristics to solve the branching issue.
I wonder what the difference is between

  1. declaring label_smoothing itself as a constexpr, and
  2. computing a flag in triton.heuristics and assigning it to the constexpr HAS_SMOOTHING.

My assumption is that:
in case 1, the kernel is re-JIT-compiled every time label_smoothing changes;
in case 2, it is re-JIT-compiled only when HAS_SMOOTHING changes, since the heuristic collapses label_smoothing to a boolean.

If so, I will go with flash-attn's approach (see the sketch below).
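
For reference, a minimal sketch of the flash-attn style pattern (the kernel `demo_kernel` and its body are purely illustrative, not this PR's kernel): `label_smoothing` stays a plain runtime argument, and `@triton.heuristics` derives a boolean constexpr `HAS_SMOOTHING` from it, so a recompile should only happen when the flag flips:

```python
import torch
import triton
import triton.language as tl


@triton.heuristics({"HAS_SMOOTHING": lambda args: args["label_smoothing"] > 0.0})
@triton.jit
def demo_kernel(
    X_ptr,
    Y_ptr,
    n_cols,
    label_smoothing,              # runtime scalar: changing it does not force a re-JIT
    HAS_SMOOTHING: tl.constexpr,  # derived flag: re-JIT only when it flips True/False
    BLOCK_SIZE: tl.constexpr,
):
    row = tl.program_id(0)
    offs = tl.arange(0, BLOCK_SIZE)
    mask = offs < n_cols
    x = tl.load(X_ptr + row * n_cols + offs, mask=mask, other=0.0)
    if HAS_SMOOTHING:
        # resolved at compile time; the branch is compiled out when smoothing is off
        x = x * (1.0 - label_smoothing)
    tl.store(Y_ptr + row * n_cols + offs, x, mask=mask)


def run(x: torch.Tensor, label_smoothing: float) -> torch.Tensor:
    y = torch.empty_like(x)
    n_rows, n_cols = x.shape
    BLOCK_SIZE = triton.next_power_of_2(n_cols)
    demo_kernel[(n_rows,)](x, y, n_cols, label_smoothing, BLOCK_SIZE=BLOCK_SIZE)
    return y
```

With this pattern, calling run(x, 0.0) and run(x, 0.1) should compile two variants, but moving from 0.1 to 0.2 should reuse the same binary since HAS_SMOOTHING stays True and floats are not specialized.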

@Tcc0403 (Contributor, Author) commented Sep 14, 2024

Ignore the OOM errors. The current custom CrossEntropyWithZLoss (torch.nn.Module), used as the ground-truth implementation, has precision issues in its gradient calculations with bfloat16 and reduction="sum".

LigerCrossEntropyLoss in this PR passes the tests without issue when compared against flash-attn's CrossEntropyLoss.
(gist)

The current goal is to make the custom torch implementation match flash-attn's.
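
One common fix (just a sketch of the idea, not necessarily what this PR ends up doing; the class name is made up) is to upcast inside the reference so the bf16 ground truth does all its math in float32:

```python
import torch
import torch.nn.functional as F

class CrossEntropyWithZLossFP32(torch.nn.Module):
    """Reference CE + z-loss that computes in float32 to reduce bf16 error.
    Only 'mean' and 'sum' reductions are handled in this sketch."""

    def __init__(self, lse_square_scale=0.0, reduction="mean", ignore_index=-100):
        super().__init__()
        self.lse_square_scale = lse_square_scale
        self.reduction = reduction
        self.ignore_index = ignore_index

    def forward(self, logits, target):
        logits_fp32 = logits.float()  # upcast once, do all math in fp32
        ce = F.cross_entropy(
            logits_fp32, target,
            reduction=self.reduction, ignore_index=self.ignore_index,
        )
        lse = torch.logsumexp(logits_fp32, dim=-1)
        z = self.lse_square_scale * lse ** 2
        z = z[target != self.ignore_index]  # z-loss only over non-ignored tokens
        z = z.sum() if self.reduction == "sum" else z.mean()
        return (ce + z).to(logits.dtype)
```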

Update: problems solved

@Tcc0403 (Contributor, Author) commented Sep 14, 2024

All passed
