ND Sharding #24243
ayerofieiev-tt started this conversation in General
Replies: 1 comment 1 reply
The visualization linked makes it even easier to follow. The link to the tech report returns a 404 error, could you please fix it?
ND Sharding
Background
There are multiple ways to distribute data on hardware today: Interleaved and Height/Width/Block Sharded.
Switching from Interleaved to Sharded distribution is one of the important optimization strategies on our hardware.
This transition is hard for humans and compilers, because operations have very limited and poorly documented support for different distributions. Limited support also means shorter chains of sharded operations and more re-distribution calls.
The effort to increase generality would require implementing Interleaved/Height/Width/Block sharded kernels for all operations. This is a huge effort, and it is unclear when, or if, it would be scheduled or completed.
Besides that, the current distribution options are not expressive enough. They don't allow distributing data in ways that could be most efficient for some operations. For example, with the existing APIs it is not possible to distribute a [B, C, H, W] tensor across cores in the way that would be best for reduction operations over the batch dimension.
Proposal
We propose a path forward that promises both generality and performance.
This is how we want to achieve it:
After this change, a single kernel supports any possible distribution. Its performance is on par with interleaved, but when the distribution is well aligned with the operation (e.g. a reduction over batch on a batch-sharded tensor), the performance gets closer to that of specialized sharded kernels.
To get the top performance, developers can still leverage knowledge of the tensor distribution and write specialized, optimized kernels, as they do today.
ND Sharding
ND Sharding allows splitting a tensor of any dimensionality into N-dimensional chunks and distributing them over a set of cores in a round-robin fashion, similar to interleaved.

Visualization tool used ✨
Given Tensor Shape [3, 7, 5], Shard Shape [2, 2, 2], and Core Grid [2, 3]:
In the interleaved case, a page is a tile or a row, depending on the layout. With ND sharding, we distribute multiple pages that belong to a chunk of a tensor. ND sharding can represent the interleaved and 2D sharded cases, which makes it a universal way to specify a distribution; it also means that kernels that support ND sharding work for any of those cases.
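To make the round-robin layout concrete, here is a minimal sketch (not the actual TT-NN/Metal API; `nd_shard_layout`, its row-major shard ordering, and the flat core numbering are illustrative assumptions) of how shards map to cores for the example above: ⌈3/2⌉ × ⌈7/2⌉ × ⌈5/2⌉ = 2 × 4 × 3 = 24 shards dealt out to the 6 cores of the [2, 3] grid, i.e. 4 shards per core.

```python
# Illustrative sketch only, not the real TT-NN/Metal API.
import math
from itertools import product

def nd_shard_layout(tensor_shape, shard_shape, num_cores):
    """Return (shard_index, core_id) pairs for a round-robin ND distribution."""
    # Number of shards along each dimension; the last shard in a dimension may be partial.
    shards_per_dim = [math.ceil(t / s) for t, s in zip(tensor_shape, shard_shape)]
    # Enumerate shards in row-major order and deal them out to cores round-robin.
    return [(idx, i % num_cores)
            for i, idx in enumerate(product(*(range(n) for n in shards_per_dim)))]

# Tensor [3, 7, 5], shard [2, 2, 2], core grid [2, 3] -> 6 cores.
layout = nd_shard_layout([3, 7, 5], [2, 2, 2], num_cores=6)
print(len(layout))   # 2 * 4 * 3 = 24 shards, 4 per core
print(layout[:3])    # [((0, 0, 0), 0), ((0, 0, 1), 1), ((0, 0, 2), 2)]
```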
Tensor Shape [10, 10, 10], Shard Shape [1, 10, 10], Core Grid [2, 3]
In this case each shard is going to be a whole "channel".


And when they are distributed to cores, see how some cores get more shards than others.
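Reusing the hypothetical `nd_shard_layout` helper from the sketch above, the uneven assignment is easy to check: 10 channel shards land on 6 cores, so four cores hold two shards and two cores hold only one.

```python
from collections import Counter

# Tensor [10, 10, 10], shard [1, 10, 10], core grid [2, 3] -> 6 cores.
# (nd_shard_layout is the illustrative helper defined in the sketch above.)
layout = nd_shard_layout([10, 10, 10], [1, 10, 10], num_cores=6)
print(len(layout))                          # 10 shards, one per "channel"
print(Counter(core for _, core in layout))  # cores 0-3 hold 2 shards, cores 4-5 hold 1
```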
Tensor Shape [10, 10, 10], Shard Shape [10, 1, 1], Core Grid [2, 3]
See how in this case each shard is a cut across all channels.



Each such cut lands on a core in a round-robin manner.
Such a distribution is particularly friendly if an operation needs the whole stick to be local.
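The same hypothetical helper confirms the picture for the stick-sharded case: 1 × 10 × 10 = 100 stick shards are dealt out to 6 cores, so each core ends up with 16 or 17 of them.

```python
from collections import Counter

# Tensor [10, 10, 10], shard [10, 1, 1], core grid [2, 3] -> 6 cores.
# (nd_shard_layout is the illustrative helper defined in the earlier sketch.)
layout = nd_shard_layout([10, 10, 10], [10, 1, 1], num_cores=6)
print(len(layout))                          # 1 * 10 * 10 = 100 stick shards
print(Counter(core for _, core in layout))  # cores 0-3 hold 17 sticks, cores 4-5 hold 16
```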
Hopefully this helps to imagine how it works in higher dimensions.
Accessor
Please read more about the accessor in the Tensor Accessor tech report.
We are currently working on providing an accessor class that unifies the ND Sharded Accessor and the Interleaved Address Generator.
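To give an intuition for why a single accessor can cover both cases, here is a deliberately simplified sketch (hypothetical names and logic, not the actual tt-metal Tensor Accessor; it also assumes pages are numbered shard by shard, whereas the real accessor has to translate tensor-order page indices into shard coordinates first):

```python
# Hypothetical sketch, not the real tt-metal Tensor Accessor API.

def page_to_core(page_id: int, pages_per_shard: int, num_cores: int):
    """Map a logical page id to (core, page slot on that core) for a round-robin layout.

    With pages_per_shard == 1 this reduces to the classic interleaved address
    generator; with larger shards it models an ND-sharded distribution.
    """
    shard_id, page_in_shard = divmod(page_id, pages_per_shard)
    core = shard_id % num_cores
    slot = (shard_id // num_cores) * pages_per_shard + page_in_shard
    return core, slot

print(page_to_core(7, pages_per_shard=1, num_cores=6))  # interleaved: (1, 1)
print(page_to_core(7, pages_per_shard=4, num_cores=6))  # ND-sharded:  (1, 3)
```

The point of unifying the two is that a kernel written against this one interface does not need to know which distribution it is running on: the same address computation covers the interleaved special case and any ND-sharded layout.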