Shape: [sequence_length, ceil(token_dimension / 128)]. If using_ue8m0_scale is True, the shape is [sequence_length, ceil(ceil(token_dimension / 128)/4)].
Data type: float32, or int32 when using_ue8m0_scale is True; in that case each int32 element packs four ue8m0 scaling factors.
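The packed layout can be made concrete with a small numpy sketch. Here `pack_ue8m0_scales` is a hypothetical helper, not part of the library: it assumes ue8m0 stores only the biased 8-bit exponent of a power-of-two float32 scale, and the byte order within each int32 is likewise an assumption.

```python
import math
import numpy as np

def pack_ue8m0_scales(scales: np.ndarray) -> np.ndarray:
    """Pack float32 power-of-two scales into int32, four ue8m0 bytes apiece.

    scales: float32, shape [sequence_length, ceil(token_dimension / 128)],
    one scale per 128-element block of a token. Returns int32 of shape
    [sequence_length, ceil(ceil(token_dimension / 128) / 4)].
    """
    seq_len, n_blocks = scales.shape
    # ue8m0 keeps only the biased exponent byte of each float32 scale,
    # i.e. the scale is restricted to an exact power of two.
    exp_bytes = ((scales.view(np.uint32) >> 23) & 0xFF).astype(np.uint8)
    # Pad the block dimension to a multiple of four, then reinterpret each
    # group of four bytes as one int32 (the byte order is an assumption).
    n_packed = math.ceil(n_blocks / 4)
    buf = np.zeros((seq_len, 4 * n_packed), dtype=np.uint8)
    buf[:, :n_blocks] = exp_bytes
    return buf.view(np.int32)

# token_dimension = 896 -> 7 blocks of 128 -> ceil(7 / 4) = 2 int32 per token.
scales = np.full((4, 7), 2.0, dtype=np.float32)  # 2.0 has exponent byte 128
assert pack_ue8m0_scales(scales).shape == (4, 2)
```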
expert_routemap_topk (Tensor): Tensor indicating expert assignments for each token (top-k experts).
Each value is the index of the expert the token is assigned to (-1 means unassigned).
Shape: [sequence_length, top_k_experts]
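A hypothetical routemap for four tokens and top_k = 2, only to make the shape and the -1 convention concrete (the routing values are invented for illustration):

```python
import numpy as np

# Each row holds up to top_k expert indices; -1 marks an empty slot.
expert_routemap_topk = np.array(
    [[0, 3],    # token 0 goes to experts 0 and 3
     [5, -1],   # token 1 matched only one expert
     [2, 7],
     [-1, -1]], # token 3 was not assigned at all (e.g. dropped)
    dtype=np.int32,
)
assert expert_routemap_topk.shape == (4, 2)  # [sequence_length, top_k_experts]
```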
@@ -70,6 +71,7 @@ def moe_permute(
padding_alignment (int): Token alignment requirement for expert buffers (in bytes).
Must be a power of 2. Typical values are 16, 32, or 64 for optimal memory access.
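How the kernel applies this value internally is not shown here; the usual power-of-two round-up looks like the following sketch, where `align_up` is an illustrative helper rather than library API:

```python
def align_up(size: int, alignment: int) -> int:
    """Round a per-expert buffer size up to the next multiple of alignment.

    alignment must be a power of two, matching the constraint above.
    """
    assert alignment > 0 and alignment & (alignment - 1) == 0
    return (size + alignment - 1) & ~(alignment - 1)

assert align_up(100, 32) == 128  # 100 bytes padded up to a 32-byte boundary
assert align_up(128, 32) == 128  # already-aligned sizes are left unchanged
```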
do_gather (bool): Whether to perform the actual token gather operation. Default is True. A minimal reference of this step is sketched after this parameter list.
using_ue8m0_scale (bool): Whether to use ue8m0 scaling for float8 inputs. Default is False.
name (str|None, optional): Name prefix for the operation.
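To make do_gather concrete, here is the minimal numpy reference mentioned above of what the gather step conceptually does: token copies are grouped expert by expert, with -1 slots skipped. `reference_gather` is an illustrative stand-in under that assumption; the real moe_permute kernel's internal ordering and outputs may differ.

```python
import numpy as np

def reference_gather(tokens, expert_routemap_topk, num_experts, do_gather=True):
    """Collect token copies in expert-major order, skipping -1 slots.

    Returns the gathered tokens when do_gather is True, otherwise only
    the gather order. A plain-Python sketch of the idea, not the kernel.
    """
    order = []
    for expert in range(num_experts):
        rows, _ = np.nonzero(expert_routemap_topk == expert)
        order.extend(rows.tolist())  # tokens routed to this expert
    order = np.asarray(order, dtype=np.int64)
    return tokens[order] if do_gather else order

tokens = np.arange(8, dtype=np.float32).reshape(4, 2)  # 4 tokens, dim 2
routemap = np.array([[0, 3], [5, -1], [2, 7], [-1, -1]], dtype=np.int32)
gathered = reference_gather(tokens, routemap, num_experts=8)
assert gathered.shape == (5, 2)  # five (token, expert) assignments in total
```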