cache var used by each iteration in grid persistent kernel, e.g. weight in layer norm backward

### 🚀 The feature, motivation and pitch

In layer norm backward:
For input DataType::Half, the persistent buffers are projected to three inputs (dy, x, weight), total size is 3 * sizeof(half) * dim1
For input DataType::Float the persistent buffers are NOT projected, they are xhat and d_xhat, the total size is 2 * sizeof(float) * dim1
If I enforce projection for input DataType::Float, there is a significiant speedup, e.g. for case 2048 x 10240 the time is reduced from 274 us to 207 us, for case 2048 x 1024 the time is reduced from 39 us to 36 us. The reason is because weight is shared across different rows. If we keep it persistent, we don't need to reload it in the iteration over different rows. The projected version needs more registers per thread but it doesn't reduce the occupancy ratio as the all the blocks must be active at the same time for this grid persistent kernel.

### Alternatives

_No response_

### Additional context

_No response_

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cache var used by each iteration in grid persistent kernel, e.g. weight in layer norm backward #2525

🚀 The feature, motivation and pitch

Alternatives

Additional context

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

cache var used by each iteration in grid persistent kernel, e.g. weight in layer norm backward #2525

Description

🚀 The feature, motivation and pitch

Alternatives

Additional context

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions