I found this paper: https://arxiv.org/html/2603.19289v1 which takes an interesting approach to speculative expert loading to speed things up.
I thought this may apply to mesh, but not in terms of loading: when experts are distributed, we could speculate which node to pass to before the current one is finished (with a ~90% hit rate we can race, i.e. fire off a lot of async calls in the hope that the speculation gets it right).
This is kind of similar to speculative decoding, where a draft model is used to make up for latencies. Thought it was worth a try.
from NotebookLM (I love using that thing): https://notebooklm.google.com/notebook/6486d625-6c39-4558-8a7a-7ecc046b1a64
And implementation thoughts:
Technical Implementation Guide: Speculative Activation Forwarding in Distributed MoE Meshes
1. Architectural Transition: From Expert Sharding to Speculative Forwarding
In high-performance distributed inference, we are shifting from the legacy "Overlapping Shards" model—which relied on redundant expert replication to minimize cross-node traffic—to the Speculative Activation Forwarding paradigm. This new architecture treats the distributed mesh as a unified fabric where predictive network routing replaces local expert storage.
In single-node offloading scenarios (e.g., YALIS), the primary bottleneck is the PCIe bus. In a distributed mesh (e.g., Mesh-LLM), the inter-node network latency (NIC/RTT) becomes the functional equivalent of the PCIe bottleneck. Speculative Activation Forwarding mitigates this by overlapping the network "copy" time with GPU compute, effectively treating the interconnect as a low-latency pipeline for activation tensors.
Feature | Legacy Overlapping Shards | Speculative Activation Forwarding
-- | -- | --
Storage Strategy | Redundant expert replication (core experts stored on all nodes). | Optimized distribution; removal of redundant storage in favor of predictive routing.
Communication | Zero cross-node inference traffic (solo node execution). | Predictive network routing; asynchronous activation forwarding over the fabric.
Inference Logic | Serial: Wait for local routing completion before execution. | Speculative: "Early Fire" dispatches activations based on hidden state signals.
Primary Bottleneck | Local CPU-to-GPU weight transfer (PCIe-bound). | Inter-node network latency/RTT (NIC-bound).
2. Theoretical Foundation: Adapting the Quasi-Hidden State for Network Routing
The core signal for peer prediction in the mesh is the Quasi-Hidden State (q_l). In a distributed environment, q_l serves as the metadata signal that is serialized into the header of an "Early Fire" packet, allowing a node to identify its target peer before the local MoE block has finished its forward pass.
The q_l state is synthesized by approximating the input the next layer’s router will receive: q_l = LN_{l+1}(d_l + r_l)
Where:
- d_l (Layer-level Default Representation): The weighted combination of "default vectors" (offline-aggregated average activations) for the experts selected at layer l. This captures the expert-conditioned contribution to the residual stream.
- r_l (Post-attention Residual): The representation generated immediately after the multi-head attention (MHA) mechanism, but prior to the completion of the layer l MoE computation.
- LN_{l+1}: The normalization function applied to the residual stream before routing at layer l+1.
By calculating q_l mid-computation, the node speculates the expert indices for the subsequent layer. This allows the system to determine the physical mesh address of the peer node hosting those experts while the current GPU kernels are still active.
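To make this concrete, here is a minimal sketch of the q_l computation and the speculative expert lookup, assuming PyTorch and an offline-aggregated table of per-expert default vectors (`default_vectors`, `next_layer_norm`, and `next_router` are illustrative names, not Mesh-LLM API):

```python
import torch

def quasi_hidden_state(r_l: torch.Tensor,
                       expert_ids: torch.Tensor,
                       expert_weights: torch.Tensor,
                       default_vectors: torch.Tensor,
                       next_layer_norm: torch.nn.Module) -> torch.Tensor:
    """Approximate the input the next layer's router will receive.

    r_l:             post-attention residual at layer l, shape (hidden_dim,)
    expert_ids:      experts selected at layer l, shape (k,)
    expert_weights:  routing weights for those experts, shape (k,)
    default_vectors: offline-averaged expert activations, (n_experts, hidden_dim)
    """
    # d_l: weighted combination of the selected experts' default vectors
    d_l = (expert_weights.unsqueeze(-1) * default_vectors[expert_ids]).sum(dim=0)
    # q_l = LN_{l+1}(d_l + r_l)
    return next_layer_norm(d_l + r_l)

def predict_next_experts(q_l: torch.Tensor, next_router, top_k: int = 2):
    # Run layer l+1's router on the quasi-hidden state; the resulting
    # indices are then resolved to a peer address via the placement table.
    return torch.topk(next_router(q_l), k=top_k).indices
```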
3. The 'Early Fire' Protocol: Peer-to-Peer Activation Forwarding
The 'Early Fire' protocol is the operational mechanism designed to mask interconnect latency through asynchronous DMA-like transfers over the fabric.
Step-by-Step Execution Sequence
- Signal Generation: During the execution of layer l, the node calculates q_l using the post-attention residuals.
- Predictive Mesh Mapping: The node maps the predicted expert indices from q_l to a specific peer node’s ID.
- Asynchronous Dispatch: The node initiates an asynchronous transmission of the activation tensors to the predicted peer.
Consistent with Mesh-LLM's architecture, these are direct server-to-server transfers via TCP or QUIC. This bypasses the primary coordinator node, eliminating the O(N) communication bottleneck that occurs when a coordinator must relay tensors between all mesh participants. This asynchronicity is made possible by the "Double Buffering" implementation in the receiving node's memory pool.
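As a sketch of how the dispatch path might look (plain asyncio TCP with length-prefixed pickle framing; `placement_table` and the framing are illustrative assumptions, not Mesh-LLM's actual wire format):

```python
import asyncio
import pickle

async def early_fire(q_l, activations, placement_table, next_router):
    # Predictive mesh mapping: speculated expert indices -> hosting peer.
    expert_ids = predict_next_experts(q_l, next_router)
    host, port = placement_table[int(expert_ids[0])]  # dominant expert picks the target

    # Asynchronous dispatch over a direct server-to-server connection,
    # bypassing the coordinator. q_l rides along so the receiver can
    # route locally without waiting for the sender's final router.
    payload = pickle.dumps({"q_l": q_l, "activations": activations})
    _, writer = await asyncio.open_connection(host, port)
    writer.write(len(payload).to_bytes(8, "big") + payload)
    await writer.drain()
    writer.close()
    await writer.wait_closed()
```

Fired as `asyncio.create_task(early_fire(...))`, the transfer proceeds while the layer-l kernels are still running, which is the whole point of the overlap.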
4. The 'No-Stop' Execution Policy: Speculative Peer Processing
To maximize interconnect throughput and eliminate synchronization barriers, receiving nodes operate under a 'No-Stop' execution policy. When an 'Early Fire' activation arrives, the peer node does not wait for a routing verification handshake or a "correctness" signal from the sender. It immediately triggers the FFN computation using its local experts.
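A corresponding receive-side sketch under the same assumed framing; note there is no acknowledgement or verification exchange before the FFN launches:

```python
import asyncio
import pickle

async def handle_early_fire(reader, writer, local_router, run_ffn):
    # Read one length-prefixed 'Early Fire' packet.
    length = int.from_bytes(await reader.readexactly(8), "big")
    packet = pickle.loads(await reader.readexactly(length))

    # 'No-Stop': route locally off the speculated q_l and fire the FFN
    # immediately; no handshake or correctness signal from the sender.
    expert_ids = predict_next_experts(packet["q_l"], local_router)
    loop = asyncio.get_running_loop()
    loop.run_in_executor(None, run_ffn, expert_ids, packet["activations"])
    writer.close()
```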
Note on Accuracy and Expert Rank Alignment
This policy is justified by the concept of Expert Rank Alignment. Data indicates that the "dominant experts" (those with the highest routing weights, which contribute the most to the activation) are predicted with high reliability: for dominant experts, the hit rate remains ~90%. Mispredictions typically occur only in lower-ranked experts with negligible weighting mass. By keeping the inference path free of verification round-trips, the 'No-Stop' policy preserves the network pipeline and downstream task accuracy on reasoning-heavy benchmarks, whereas pausing for verification would collapse the TPOT gains.
5. Performance Metrics and Hit Rate Analysis
The efficacy of speculative forwarding is highly sensitive to representational drift, particularly in the initial stages of the model.
- Early Layer Drift (Uncertainty Regime): In models like Qwen3-30B-A3B, representational drift is concentrated in Layers 1–2. In this regime, the standard Router-PF (fixed router speculation) hit rate is significantly lower. Accuracy in these layers is critical; using a "Neural Estimator" (Est-PF) can improve the hit rate by ~25% over the fixed router (a per-layer gating sketch follows this list).
- Steady-State Accuracy: Beyond the initial layers, the hidden state stabilizes. The hit rate reaches a steady ~90%, which sustains the efficiency of the 'No-Stop' policy.
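Given that the failure mode is concentrated in the early layers, one natural policy (my sketch, not something the paper prescribes) is to gate speculation per layer on a running hit-rate estimate, so drift-heavy layers fall back to the serial wait-for-router path:

```python
class SpeculationGate:
    """Gate 'Early Fire' per layer using an EMA of the observed hit rate."""

    def __init__(self, num_layers: int, threshold: float = 0.8,
                 alpha: float = 0.05):
        self.hit_rate = [1.0] * num_layers  # optimistic prior
        self.threshold = threshold
        self.alpha = alpha

    def record(self, layer: int, hit: bool) -> None:
        # Update the exponential moving average for this layer.
        self.hit_rate[layer] = ((1 - self.alpha) * self.hit_rate[layer]
                                + self.alpha * float(hit))

    def should_speculate(self, layer: int) -> bool:
        # Drift-heavy early layers drop below the threshold and run the
        # serial path; steady-state layers (~90% hits) keep speculating.
        return self.hit_rate[layer] >= self.threshold
```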
Time-per-Output-Token (TPOT) Improvements: In distributed meshes where interconnect and memory transfers account for 84–88% of TPOT, Speculative Activation Forwarding yields the following improvements:
- High-Bandwidth Interconnects (GH200/NVLink): ~5% TPOT reduction.
- Standard Interconnects (A6000/PCIe 4.0): ~14% TPOT reduction. The gains are more pronounced on "weaker" hardware or slower fabrics, where the compute-to-transfer ratio is lower (a back-of-envelope model follows this list).
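A back-of-envelope check on why those numbers are plausible (the 84–88% transfer share comes from the text above; the rest is an idealized overlap model, not a measurement):

```python
# Fractions of baseline (serial) TPOT per token.
t_transfer = 0.86             # interconnect/memory transfer share
t_compute = 1.0 - t_transfer  # GPU compute share

# With perfect overlap the slower phase hides the faster one entirely,
# so the reduction ceiling is the smaller of the two shares.
t_overlapped = max(t_transfer, t_compute)
print(f"ideal TPOT: {t_overlapped:.2f}x baseline "
      f"-> {(1 - t_overlapped) * 100:.0f}% reduction")
# ~14%, in line with the A6000/PCIe 4.0 figure above.
```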
6. Implementation Constraints and Error Handling
A stable Speculative Forwarding mesh must operate within these HPC constraints:
- Latency Caps: Participating nodes must adhere to an 80ms RTT hard cap. If a node's network latency exceeds this, the speculative forward will arrive after the compute block finishes, rendering the overlap useless. Such nodes are relegated to API-client status.
- Buffer Management: The mesh utilizes Double Buffering to alternate activation targets. This allows the GPU to compute using Buffer A while Buffer B simultaneously receives the 'Early Fire' prefetch for the subsequent layer (see the sketch after this list).
- Misprediction Handling: Under the 'No-Stop' policy, if the sender's finalized router eventually selects a different expert than the one speculated, the mesh does not re-fetch. The cost of a network stall and a broken pipeline is significantly higher than the marginal accuracy loss of executing a high-rank (though not top-rank) speculative expert. This "fail-forward" approach is essential for maintaining the target TPOT in distributed environments.
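A minimal double-buffering sketch for the receive path (the pinned host buffers and swap discipline here are illustrative assumptions):

```python
import torch

class DoubleBuffer:
    """Two preallocated activation buffers: the GPU computes from one
    while the NIC fills the other with the next 'Early Fire' prefetch."""

    def __init__(self, shape, dtype=torch.float16):
        # Pinned host memory so the eventual host-to-device copy can
        # run asynchronously alongside compute.
        self.buffers = [torch.empty(shape, dtype=dtype).pin_memory()
                        for _ in range(2)]
        self.active = 0  # index of the buffer the GPU is computing from

    @property
    def compute_buffer(self) -> torch.Tensor:
        return self.buffers[self.active]

    @property
    def receive_buffer(self) -> torch.Tensor:
        return self.buffers[1 - self.active]

    def swap(self) -> None:
        # Flip roles once the in-flight receive has landed and the
        # current compute pass is done; no copy, just an index swap.
        self.active = 1 - self.active
```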