Description
Hi team,
I am experimenting with enabling Expert Parallelism (EP) for the Qwen MoE model during the decoding / token generation stage. While EP works as expected during the prefill stage, I encountered issues when enabling EP for token generation.
This issue is mainly about missing configuration options in the official main.py and a compilation failure caused by selective-loading constraints during decoding.
Environment
- Model: Qwen3-30B-A3B
- Instance type: AWS EC2 trn2.3xlarge
- Neuron cores: 4
- Framework: NeuronX / NXDI
- EP configuration:
moe-tp-degree = 1, moe-ep-degree = 4
- Stage: Token generation (decoding)
Issue 1: Missing Expert Parallelism arguments in main.py
Currently, main.py does not expose any command-line arguments for configuring expert parallelism.
Based on the latest NXDI parallelism configuration for MoE models (here), there are two additional arguments:
--moe-tp-degree and --moe-ep-degree
These options are necessary to control tensor parallelism and expert parallelism independently.
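For reference, here is a minimal sketch of how these flags could be exposed, assuming main.py uses argparse; how the degrees are wired into the model's Neuron configuration is an assumption on my side and may differ from the official script:

```python
import argparse

# Minimal sketch: exposing the EP flags on the command line.
parser = argparse.ArgumentParser()
parser.add_argument("--moe-tp-degree", type=int, default=None,
                    help="Tensor-parallel degree for the routed expert MLPs.")
parser.add_argument("--moe-ep-degree", type=int, default=None,
                    help="Expert-parallel degree for the routed expert MLPs.")
args, _ = parser.parse_known_args()

# Assumed wiring: forward the degrees into the Neuron config kwargs.
# The exact field names used by NXDI may differ.
neuron_config_kwargs = {}
if args.moe_tp_degree is not None:
    neuron_config_kwargs["moe_tp_degree"] = args.moe_tp_degree
if args.moe_ep_degree is not None:
    neuron_config_kwargs["moe_ep_degree"] = args.moe_ep_degree
```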
Question
In future updates of main.py, is it planned to officially add --moe-tp-degree and --moe-ep-degree as supported arguments?
Or are participants expected to modify main.py themselves to add expert-parallelism support?
Issue 2: Compilation failure when enabling EP for token generation
After manually adding support for expert parallelism in main.py and running with:
python3 main.py \
--mode evaluate_all \
--model-path ~/qwen-30b-a3b/hf_model \
--compiled-model-path ~/qwen-30b-a3b/traced_model \
--prompt "What is the capital of France?" \
--moe-tp-degree 1 \
--moe-ep-degree 4
the compilation fails during token generation with the following error:
NotImplementedError: Selective Loading with Expert parallelism is not supported in token generation.
Root Cause Analysis
The failure appears to be caused by the batch size configuration during decoding.
In main.py, the batch size is always set as args.batch_size = len(args.prompts) (here), which is always 1 when a single prompt is passed. As a result, perc_experts_loaded is always lower than DEFAULT_SELECTIVE_LOADING_THRESHOLD (here), which triggers the NotImplementedError once EP is enabled.
According to the definition of perc_experts_loaded (here), to satisfy: perc_experts_loaded >= DEFAULT_SELECTIVE_LOADING_THRESHOLD, the minimum required total tokens for Qwen3-30B-A3B (which equals batch size during decoding) should be:
routed_experts_mlp_config.num_experts (128) // routed_experts_mlp_config.top_k (8) = 16
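To make the arithmetic concrete, here is a small sketch of the check as I understand it; the exact formula and threshold live in the NXDI code linked above, and the expression and threshold value below are assumptions consistent with the numbers quoted in this issue:

```python
# Assumed (not verbatim) form of the selective-loading check, reconstructed
# from the numbers above; see the linked NXDI source for the real definition.
num_experts = 128   # routed_experts_mlp_config.num_experts for Qwen3-30B-A3B
top_k = 8           # routed_experts_mlp_config.top_k
SELECTIVE_LOADING_THRESHOLD = 1.0  # assumed value of DEFAULT_SELECTIVE_LOADING_THRESHOLD

def perc_experts_loaded(total_tokens: int) -> float:
    # Upper bound on the fraction of experts touched by this many tokens.
    return min(total_tokens * top_k / num_experts, 1.0)

# During decoding each sequence contributes one token, so total_tokens == batch size.
print(perc_experts_loaded(1))   # 0.0625 -> below threshold, selective loading path taken
print(perc_experts_loaded(16))  # 1.0    -> threshold satisfied, EP path can compile
```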
After changing args.batch_size = len(args.prompts) in main.py to the minimum value (args.batch_size = 16), the model compiles successfully with expert parallelism.
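For completeness, the one-line workaround I applied (illustrative only; everything else in main.py is unchanged):

```python
# In main.py, replace the original assignment
#     args.batch_size = len(args.prompts)
# with the minimum value derived above (num_experts // top_k for Qwen3-30B-A3B):
args.batch_size = 16
```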
Question
In future updates of main.py, is it expected that args.batch_size = len(args.prompts) will be changed to args.batch_size = 16, or will participants be required to implement Selective Loading with Expert Parallelism in token generation (here) themselves?
Thank you!