ND Sharding #24243
ayerofieiev-tt started this conversation in General
Replies: 1 comment 1 reply
The visualization linked makes it even easier to follow. The link to the tech report returns a 404 error, could you please fix it?
ND Sharding
Background
There are multiple ways to distribute data on hardware today: Interleaved and Height/Width/Block Sharded.
Switching from Interleaved to Sharded distribution is one of the important optimization strategies on our hardware.
This transition is hard for humans and compilers, because operations have very limited and poorly documented support for different distributions. Limited support also means shorter chains of sharded operations and more re-distribution calls.
The effort to increase generality would require implementing Interleaved/Height/Width/Block sharded kernels for all operations. This is a huge effort, and it is unclear when, or if, it would be scheduled or completed.
Besides that, the current distribution options are not expressive enough. They don't allow distributing data in ways that could be most efficient for some operations. For example, with the existing APIs it is not possible to distribute a [B, C, H, W] tensor across cores in the way that would be best for reduction operations over the batch dimension.
Proposal
We propose a path forward that promises both generality and performance.
This is how we want to achieve it:
After this change, a single kernel supports any possible distribution. Its performance is on par with interleaved, but when the distribution is well aligned with the operation (e.g. a reduction over batch on a batch-sharded tensor), the performance gets closer to that of specialized sharded kernels.
To get the top performance, developers can still leverage knowledge of the tensor distribution and write specialized, optimized kernels, as they do today.
ND Sharding
ND Sharding allows splitting a tensor of any dimensionality into N-dimensional chunks and distributing them over a set of cores in a round-robin fashion, similar to interleaved.

Visualization tool used ✨
Given Tensor Shape [3, 7, 5], Shard Shape [2, 2, 2], and Core Grid [2, 3]:
In the interleaved case, a page is a tile or a row, depending on the layout. With ND sharding, we distribute multiple pages that belong to a chunk of a tensor. ND sharding can represent the interleaved and 2D sharded cases, which makes it a universal way to specify a distribution; it also means that kernels that support ND sharding work for any of those cases.
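To make the round-robin layout concrete, here is a minimal sketch (not the actual TT-NN/Metal API; `nd_shard_layout`, its row-major shard ordering, and the flat core numbering are illustrative assumptions) of how shards map to cores for the example above: ⌈3/2⌉ × ⌈7/2⌉ × ⌈5/2⌉ = 2 × 4 × 3 = 24 shards dealt out to the 6 cores of the [2, 3] grid, i.e. 4 shards per core.

```python
# Illustrative sketch only, not the real TT-NN/Metal API.
import math
from itertools import product

def nd_shard_layout(tensor_shape, shard_shape, num_cores):
    """Return (shard_index, core_id) pairs for a round-robin ND distribution."""
    # Number of shards along each dimension; the last shard in a dimension may be partial.
    shards_per_dim = [math.ceil(t / s) for t, s in zip(tensor_shape, shard_shape)]
    # Enumerate shards in row-major order and deal them out to cores round-robin.
    return [(idx, i % num_cores)
            for i, idx in enumerate(product(*(range(n) for n in shards_per_dim)))]

# Tensor [3, 7, 5], shard [2, 2, 2], core grid [2, 3] -> 6 cores.
layout = nd_shard_layout([3, 7, 5], [2, 2, 2], num_cores=6)
print(len(layout))   # 2 * 4 * 3 = 24 shards, 4 per core
print(layout[:3])    # [((0, 0, 0), 0), ((0, 0, 1), 1), ((0, 0, 2), 2)]
```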
Tensor Shape [10, 10, 10], Shard Shape [1, 10, 10], Core Grid [2, 3]
In this case each shard is going to be a whole "channel".


And when they are distributed to cores, see how some cores get more shards than others.
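Reusing the hypothetical `nd_shard_layout` helper from the sketch above, the uneven assignment is easy to check: 10 channel shards land on 6 cores, so four cores hold two shards and two cores hold only one.

```python
from collections import Counter

# Tensor [10, 10, 10], shard [1, 10, 10], core grid [2, 3] -> 6 cores.
# (nd_shard_layout is the illustrative helper defined in the sketch above.)
layout = nd_shard_layout([10, 10, 10], [1, 10, 10], num_cores=6)
print(len(layout))                          # 10 shards, one per "channel"
print(Counter(core for _, core in layout))  # cores 0-3 hold 2 shards, cores 4-5 hold 1
```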
Tensor Shape [10, 10, 10], Shard Shape [10, 1, 1], Core Grid [2, 3]
See how in this case each shard is a cut across all channels.



Each such cut lands on a core in a round-robin manner.
Such a distribution is particularly friendly if an operation needs the whole stick to be local.
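The same hypothetical helper confirms the picture for the stick-sharded case: 1 × 10 × 10 = 100 stick shards are dealt out to 6 cores, so each core ends up with 16 or 17 of them.

```python
from collections import Counter

# Tensor [10, 10, 10], shard [10, 1, 1], core grid [2, 3] -> 6 cores.
# (nd_shard_layout is the illustrative helper defined in the earlier sketch.)
layout = nd_shard_layout([10, 10, 10], [10, 1, 1], num_cores=6)
print(len(layout))                          # 1 * 10 * 10 = 100 stick shards
print(Counter(core for _, core in layout))  # cores 0-3 hold 17 sticks, cores 4-5 hold 16
```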
Hopefully this helps to imagine how it works in higher dimensions.
Accessor
Please read more about the accessor in the Tensor Accessor tech report.
We are currently working on providing an accessor class that unifies the ND Sharded Accessor and the Interleaved Address Generator.
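To give an intuition for why a single accessor can cover both cases, here is a deliberately simplified sketch (hypothetical names and logic, not the actual tt-metal Tensor Accessor; it also assumes pages are numbered shard by shard, whereas the real accessor has to translate tensor-order page indices into shard coordinates first):

```python
# Hypothetical sketch, not the real tt-metal Tensor Accessor API.

def page_to_core(page_id: int, pages_per_shard: int, num_cores: int):
    """Map a logical page id to (core, page slot on that core) for a round-robin layout.

    With pages_per_shard == 1 this reduces to the classic interleaved address
    generator; with larger shards it models an ND-sharded distribution.
    """
    shard_id, page_in_shard = divmod(page_id, pages_per_shard)
    core = shard_id % num_cores
    slot = (shard_id // num_cores) * pages_per_shard + page_in_shard
    return core, slot

print(page_to_core(7, pages_per_shard=1, num_cores=6))  # interleaved: (1, 1)
print(page_to_core(7, pages_per_shard=4, num_cores=6))  # ND-sharded:  (1, 3)
```

The point of unifying the two is that a kernel written against this one interface does not need to know which distribution it is running on: the same address computation covers the interleaved special case and any ND-sharded layout.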