[feat] Track entropy and MI of routing distribution for topk MoE #188
base: main
Conversation
idea is good, thanks @oleksost.
bit weird that all these metrics are appearing as losses. that name should be reserved for things for which gradients are computed. just call this dict metrics?
Yes @tscholak, addressed. Using a metrics dict instead.
Looks good, got some comments on the structure.
@@ -135,3 +135,7 @@ def get_tied_weights(self) -> dict[str, tuple[ParameterMeta, tuple[int, ...]]]:
    @abc.abstractmethod
    def loss_defs(self) -> list[LossDef]:
        pass

    @property
This loss/metric split is way more complicated than needed. How about having a single entry and using an `is_metric` flag in `LossDef` (or a derived class) to distinguish? Then no change is needed other than extracting metrics from the context before returning from `run_step`.
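A minimal sketch of what this single-registry approach could look like; apart from `is_metric`, the field names are assumptions, not the actual `LossDef` definition in fast_llm:

```python
import dataclasses


@dataclasses.dataclass(frozen=True)
class LossDef:
    # Hypothetical fields; only is_metric is the suggested addition.
    name: str
    formatted_name: str
    count: int = 1
    # Metrics are logged but never contribute to the backward pass.
    is_metric: bool = False


def split_losses_and_metrics(defs: list[LossDef]) -> tuple[list[LossDef], list[LossDef]]:
    """Separate metric definitions from loss definitions, e.g. before returning from run_step."""
    return (
        [d for d in defs if not d.is_metric],
        [d for d in defs if d.is_metric],
    )
```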
This would be nice!
Maybe better to leave it for a separate PR? It would make this one larger, since it would also require changing the interfaces of the models' forward functions (which expect losses and metrics) and making sure that metrics are only calculated when `return_metrics` is True.
There isn't much change needed actually, just need to add `kwargs["return_metrics"]`. I would prefer doing this here so we don't grow `ScheduleRunner` too much.
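A rough sketch of the `kwargs["return_metrics"]` gating being suggested; the flag name comes from the comment, while the layer function and the inline entropy computation are purely illustrative:

```python
import math

import torch


def route(logits: torch.Tensor, metrics: dict | None = None, **kwargs) -> torch.Tensor:
    """Illustrative router forward pass, not the actual fast_llm MoE layer."""
    probs = torch.softmax(logits, dim=-1)
    if kwargs.get("return_metrics") and metrics is not None:
        # Only pay for the extra statistics when the caller asked for them.
        entropy = -(probs * torch.log(probs.clamp_min(1e-9))).sum(dim=-1)
        metrics["normalized_average_entropy"] = entropy.mean() / math.log(probs.size(-1))
    return probs
```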
fast_llm/engine/schedule/runner.py (outdated)
@@ -289,6 +312,19 @@ def _reduce_losses(self, context: BatchContext) -> dict[str, float | int]:
            for name, reduced_loss in reduced_losses.items()
        }

    def _is_reduced_metric(self, metric_name: str) -> bool:
        """Check if a metric should be reduced (is defined in a TransformerReducedMetrics subclass)."""
        from fast_llm.layers.transformer.config import TransformerReducedMetrics
We can't use hard-coded values here. Suggestion above would fix it, or there are a few other ways to get this dynamically.
Simplified the setup so that all metrics that come back from a forward pass are reduced automatically, so this function is no longer needed.
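For reference, a hedged sketch of what "reduce all returned metrics automatically" could mean in practice, averaging each metric over data-parallel ranks; the actual `ScheduleRunner` code may differ:

```python
import torch
import torch.distributed as dist


def reduce_metrics(metrics: dict[str, torch.Tensor], group: dist.ProcessGroup | None = None) -> dict[str, float]:
    """Average every metric across data-parallel ranks, with no special-casing per metric name."""
    reduced = {}
    for name, value in metrics.items():
        value = value.detach().clone()
        if dist.is_initialized():
            dist.all_reduce(value, op=dist.ReduceOp.SUM, group=group)
            value = value / dist.get_world_size(group)
        reduced[name] = value.item()
    return reduced
```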
    # Store these metrics
    if metrics is not None:
Given the extra computation involved, this should be enabled through a config parameter.
how much compute are we talking about for these metrics? likely this won't be noticeable.
This is already controlled by the `training.logs.interval` parameter afaiu. Do you think we need a separate parameter for MoE stats logging?
Ideally we'd like to calculate the bare minimum by default (and the computation isn't optimized), so I think yes.
If we really want it on by default, I guess we could have a parameter that defaults to true but can be disabled for performance, e.g. for benchmarks.
addressed
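For illustration, the kind of opt-out flag being discussed, defaulting to enabled; this is a plain dataclass with a hypothetical field name, not the actual fast_llm config API:

```python
import dataclasses


@dataclasses.dataclass
class RouterMetricsConfig:
    # Compute the entropy / mutual-information routing statistics during the
    # forward pass. Can be disabled for performance, e.g. for benchmarks.
    compute_routing_metrics: bool = True
```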
tests/test_moe_metrics.py (outdated)
    assert 0.0 < mutual_info < 1.0, f"Expected value between 0 and 1, got {mutual_info}"


def test_edge_cases():
More explicit name?
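One possible reading of the naming suggestion, splitting the generic edge-case test into names that state the scenario; the import path of the helper is an assumption:

```python
import pytest
import torch

# Assumed location of the PR's helper; adjust to wherever it actually lives.
from fast_llm.layers.transformer.mixture_of_experts import calculate_normalized_average_entropy


def test_normalized_entropy_is_one_for_uniform_routing():
    # Uniform routing over 4 experts is maximally uncertain, so the normalized entropy is 1.
    probs = torch.full((2, 8, 4), 0.25)
    assert calculate_normalized_average_entropy(probs).item() == pytest.approx(1.0, abs=1e-5)


def test_normalized_entropy_is_near_zero_for_peaked_routing():
    # Nearly one-hot routing is almost deterministic, so the normalized entropy is close to 0.
    probs = torch.tensor([[0.997, 0.001, 0.001, 0.001]])
    assert calculate_normalized_average_entropy(probs).item() < 0.05
```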
tests/test_moe_metrics.py (outdated)
@pytest.fixture
def setup_runner():
These don't belong here. How about `test_runner.py`?
Why don't they belong here? This fixture is only useful for the tests in this suite.
Maybe we can move it to common.py in the future; it may be reused by other tests (e.g. the SSM tests).
I'm talking about the associated tests, not the fixture. They test a feature of `ScheduleRunner` and have nothing to do with MoE other than the loss names, which aren't relevant to the tests.
I see, makes sense, will move it to a new test file.
tests/test_moe_metrics.py (outdated)


if __name__ == "__main__":
Not needed
@@ -26,6 +27,35 @@
logger = logging.getLogger(__name__)


def calculate_normalized_average_entropy(probs: torch.Tensor) -> torch.Tensor:
Could try `@torch.compile` on these for a free performance boost.
    average_entropy = entropy_values.mean()  # Average over batch and tokens
    return average_entropy / torch.log(torch.tensor(n_experts, dtype=probs.dtype, device=probs.device))


def entropy(probs: torch.Tensor) -> torch.Tensor:
`calculate_entropy`
@oleksost Are you still working on this?
@jlamypoirier yes, will address your comments today. Sorry, it was deprioritised in favour of mamba.
@jlamypoirier I think I addressed all the comments.
    n_experts = probs.size(-1)
    entropy_values = calculate_entropy(probs)
    average_entropy = entropy_values.mean()  # Average over batch and tokens
    return average_entropy / torch.log(torch.tensor(n_experts, dtype=probs.dtype, device=probs.device))
`average_entropy / math.log(n_experts)` (same elsewhere)
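Putting the review suggestions together (the `calculate_*` naming, `math.log` for the constant normalizer, and `@torch.compile`), the helpers could end up looking roughly like this; whether the PR also normalizes the mutual information by `log(n_experts)` is an assumption based on the test's 0-to-1 bound:

```python
import math

import torch


@torch.compile
def calculate_entropy(probs: torch.Tensor) -> torch.Tensor:
    """Per-token entropy of routing probabilities, shape (..., n_experts) -> (...)."""
    return -(probs * torch.log(probs.clamp_min(1e-9))).sum(dim=-1)


@torch.compile
def calculate_normalized_average_entropy(probs: torch.Tensor) -> torch.Tensor:
    """Average per-token entropy, normalized to [0, 1] by log(n_experts)."""
    average_entropy = calculate_entropy(probs).mean()  # average over batch and tokens
    return average_entropy / math.log(probs.size(-1))


@torch.compile
def calculate_mutual_information(probs: torch.Tensor) -> torch.Tensor:
    """Entropy of the batch-averaged routing distribution minus the average per-token entropy."""
    n_experts = probs.size(-1)
    marginal_entropy = calculate_entropy(probs.reshape(-1, n_experts).mean(dim=0))
    average_entropy = calculate_entropy(probs).mean()
    # Normalizing keeps the value in [0, 1]; the PR may or may not do this.
    return (marginal_entropy - average_entropy) / math.log(n_experts)
```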
✨ Description
To better detect potential routing collapse and get a better understanding of the routing distribution, we can track the average entropy and the mutual information of the routing probabilities.
Collapsed routing would have low entropy and low mutual information. A healthy, specialised router would have low entropy and high mutual information, meaning that routing is specialised and differs considerably across tokens.
More specifically, mutual information measures the difference between the entropy of the batch-averaged routing distribution and the average entropy of the per-token routing distributions.
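Concretely, writing $p_{t,e}$ for the routing probability of token $t$ over expert $e$, with $T$ tokens and $E$ experts (the exact normalization in the code may differ slightly):

$$
\bar H_{\text{norm}} = \frac{1}{T \log E}\sum_{t=1}^{T} H(p_{t,\cdot}),
\qquad
I = H\!\left(\frac{1}{T}\sum_{t=1}^{T} p_{t,\cdot}\right) - \frac{1}{T}\sum_{t=1}^{T} H(p_{t,\cdot}),
\qquad
H(p) = -\sum_{e=1}^{E} p_e \log p_e .
$$

Collapsed routing makes both terms of $I$ small and nearly equal (every token prefers the same expert), while specialised routing keeps the second term small but the first one large.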
🔍 Type of change
Select all that apply:
📝 Changes
The entropy and mutual-information metrics are added in `mixture_of_experts.py`; they are calculated only for the topk routing type.
✅ Checklist
General
Testing
Performance Impact
📊 Performance Impact Details
I am not 100% sure there is no performance impact, since we are calculating the stats at each forward pass through the router.