
Conversation

@kevssim (Collaborator) commented on Dec 17, 2025

PR type

  • Bug Fix
  • New Feature
  • Document Updates
  • More Models or Datasets Support

PR information

Summary

Integrates Tiled MLP into the Swift training framework for memory-efficient long-sequence training. Tiled MLP splits the MLP computation into shards along the sequence dimension, trading extra compute time for significant memory savings at long sequence lengths.
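
As a rough back-of-the-envelope check (not from the PR; it uses the benchmark settings hidden_size=4096, intermediate_size=12288, bf16), the gate/up projection activations dominate MLP memory at long sequence lengths, and tiling keeps only one shard of them alive at a time:

# Rough activation-memory arithmetic for a SwiGLU MLP at seq_len=131072.
# This ignores weights, attention, and other activations, so it will not match
# the benchmark tables exactly; it only explains the trend.
seq_len = 131072
intermediate_size = 12288
bytes_per_elem = 2          # bf16
num_shards = 4

# gate_proj and up_proj each produce a (seq_len, intermediate_size) activation.
full_mib = 2 * seq_len * intermediate_size * bytes_per_elem / 2**20   # ~6144 MiB
tiled_mib = full_mib / num_shards                                     # ~1536 MiB alive per shard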

Features

  • FSDP2: Uses custom TiledMLP implementation
  • DeepSpeed/Single GPU: Uses liger_kernel's LigerTiledSwiGLUMLP
  • DeepSpeed/Single NPU: Uses LigerTiledSwiGLUMLP with native PyTorch _mlp_forward

Usage

Add the following arguments to enable tiled MLP:

--use_tiled_mlp true
--tiled_mlp_num_shards 4  # optional, default is 4
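
For example, a full invocation might look like the following (hypothetical; everything except the two new flags is the usual swift sft setup, and the placeholders should be adapted to your model, data, and environment):

swift sft \
    --model <model-id-or-path> \
    --dataset <dataset> \
    --max_length 65536 \
    --torch_dtype bfloat16 \
    --use_tiled_mlp true \
    --tiled_mlp_num_shards 4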

Experiment results

1. FSDP2 + GPU (2x NVIDIA H800)

Env: 2x NVIDIA H800, FSDP2, bf16, batch_size=1, num_shards=4, hidden_size=4096, intermediate_size=12288

| Seq Length | Regular (ms) | Tiled (ms) | Speed Δ | Regular (MB) | Tiled (MB) | Memory Δ |
| --- | --- | --- | --- | --- | --- | --- |
| 2048 | 18.59 | 23.51 | +26% | 1216.00 | 1444.00 | +19% |
| 4096 | 22.43 | 31.57 | +41% | 1504.00 | 1576.00 | +5% |
| 8192 | 33.11 | 46.01 | +39% | 2080.00 | 1840.00 | -12% |
| 16384 | 54.06 | 76.76 | +42% | 3424.00 | 2368.00 | -31% |
| 32768 | 103.93 | 136.67 | +31% | 6112.00 | 3456.00 | -43% |
| 65536 | 190.89 | 258.67 | +36% | 11488.00 | 5696.00 | -50% |
| 131072 | 391.50 | 501.15 | +28% | 22240.00 | 10176.00 | -54% |

2. FSDP2 + NPU (910B)

Env: FSDP2, bf16, NPU

| Seq Length | Regular (ms) | Tiled (ms) | Speed Δ | Regular (MB) | Tiled (MB) | Memory Δ |
| --- | --- | --- | --- | --- | --- | --- |
| 2048 | 54.41 | 46.26 | -15% | 1152.01 | 1460.02 | +27% |
| 4096 | 65.69 | 81.05 | +23% | 1460.01 | 1576.01 | +8% |
| 8192 | 80.46 | 98.85 | +23% | 2036.01 | 1808.01 | -11% |
| 16384 | 108.22 | 141.93 | +31% | 3360.01 | 2324.01 | -31% |
| 32768 | 177.25 | 228.90 | +29% | 6048.01 | 3392.01 | -44% |
| 65536 | 313.80 | 419.55 | +34% | 11424.01 | 5632.01 | -51% |
| 131072 | 601.27 | 791.40 | +32% | 22176.01 | 10112.01 | -54% |

3. DeepSpeed ZeRO-3 + GPU

Env: DeepSpeed ZeRO-3, bf16, GPU

| Seq Length | Standard (ms) | Tiled (ms) | Speed Δ | Standard (MB) | Tiled (MB) | Memory Δ |
| --- | --- | --- | --- | --- | --- | --- |
| 2048 | 9.75 | 20.73 | +113% | 1920.96 | 3009.91 | +57% |
| 4096 | 12.63 | 21.60 | +71% | 4546.87 | 5451.82 | +20% |
| 8192 | 17.55 | 26.45 | +51% | 7524.78 | 8053.73 | +7% |
| 16384 | 29.66 | 40.83 | +38% | 11238.69 | 10951.64 | -3% |
| 32768 | 53.93 | 71.41 | +32% | 16360.60 | 14441.55 | -12% |
| 65536 | 100.70 | 132.27 | +31% | 24298.51 | 19115.46 | -21% |
| 131072 | 194.67 | 254.97 | +31% | 37868.42 | 26157.37 | -31% |

4. DeepSpeed ZeRO-3 + NPU

Env: DeepSpeed ZeRO-3, bf16, NPU

| Seq Length | Standard (ms) | Tiled (ms) | Speed Δ | Standard (MB) | Tiled (MB) | Memory Δ |
| --- | --- | --- | --- | --- | --- | --- |
| 2048 | 34.77 | 77.10 | +122% | 1888.97 | 3029.93 | +60% |
| 4096 | 40.82 | 83.04 | +103% | 4502.88 | 5483.84 | +22% |
| 8192 | 56.68 | 103.27 | +82% | 7460.80 | 8085.76 | +8% |
| 16384 | 92.51 | 143.88 | +56% | 11174.71 | 11003.67 | -2% |
| 32768 | 157.39 | 229.23 | +46% | 16296.63 | 14665.59 | -10% |
| 65536 | 299.29 | 418.76 | +40% | 24234.54 | 19723.50 | -19% |
| 131072 | 592.62 | 804.98 | +36% | 37804.46 | 27533.42 | -27% |

Note: a positive Speed Δ means tiled is slower than the regular/standard MLP; a negative Memory Δ means tiled uses less memory.

@gemini-code-assist (Contributor) commented:

Summary of Changes

Hello @kevssim, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the training capabilities by integrating Tiled MLP, a technique crucial for optimizing memory usage, especially when working with very long sequences in large language models. By providing specific support for FSDP2 and leveraging existing solutions for DeepSpeed and single-GPU environments, it allows users to train larger models or longer contexts more efficiently, preventing out-of-memory errors and improving scalability.

Highlights

  • Tiled MLP Integration: Introduces Tiled MLP for memory-efficient training, supporting FSDP2 with a custom implementation and DeepSpeed/single-GPU with liger_kernel.
  • FSDP2 Compatibility: Provides a dedicated TiledSwiGLUMLP implementation and an autograd function (TiledMLPFunction) specifically designed for FSDP2, including a GradientAccumulator for efficient gradient handling (a minimal sketch of this pattern follows the list).
  • Dynamic MLP Replacement: Implements a mechanism to dynamically replace standard MLP layers with their tiled counterparts based on the detected distributed training environment (FSDP2, DeepSpeed, or single GPU), with FSDP1 explicitly not supported.
  • New Training Arguments: Adds --use_tiled_mlp and --tiled_mlp_num_shards to the training arguments, allowing users to easily enable and configure Tiled MLP.
  • Example Scripts: Includes new example scripts (train_deepspeed.sh, train_fsdp2.sh, fsdp2.json) to demonstrate Tiled MLP usage with DeepSpeed and FSDP2.
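
To make the shard-and-recompute pattern concrete, here is a minimal, self-contained sketch of the idea behind a tiled MLP autograd function. It is illustrative only, not this PR's TiledMLPFunction: it chunks on dim=0 as in the quoted diff, recomputes each shard in backward, and lets weight gradients accumulate into .grad (the part the PR's GradientAccumulator manages under FSDP2).

import torch

class TiledMLPSketch(torch.autograd.Function):
    # Forward: run the wrapped MLP shard by shard under no_grad, so only one
    # shard's intermediate activations exist at a time and none are saved.
    # Backward: recompute each shard with grad enabled, trading compute for memory.
    @staticmethod
    def forward(ctx, x, mlp_forward, num_shards):
        ctx.mlp_forward = mlp_forward
        ctx.num_shards = num_shards
        ctx.save_for_backward(x)
        with torch.no_grad():
            shards = torch.chunk(x, num_shards, dim=0)
            return torch.cat([mlp_forward(s) for s in shards], dim=0)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        x_shards = torch.chunk(x, ctx.num_shards, dim=0)
        grad_shards = torch.chunk(grad_out, ctx.num_shards, dim=0)
        grad_x = []
        for xs, gs in zip(x_shards, grad_shards):
            xs = xs.detach().requires_grad_(True)
            with torch.enable_grad():
                ys = ctx.mlp_forward(xs)
            # Weight grads accumulate into param.grad across shards; the input
            # grad for this shard is collected and concatenated below.
            torch.autograd.backward(ys, gs)
            grad_x.append(xs.grad)
        return torch.cat(grad_x, dim=0), None, None

Usage would be something like out = TiledMLPSketch.apply(x, mlp_block.forward, 4) on a flattened (tokens, hidden) input; the PR's real implementation additionally handles FSDP2-sharded parameters and gradient dtype via its GradientAccumulator.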

@gemini-code-assist (bot) left a review:

Code Review

This pull request introduces support for Tiled MLP to reduce memory usage during training, with a custom implementation for FSDP2 and a fallback to liger_kernel for DeepSpeed and other setups. The changes include new training arguments, example scripts, and the core Tiled MLP implementation. The code is well-structured, particularly the environment detection logic and the dynamic MLP replacement. The custom autograd function for FSDP2 compatibility is a sophisticated piece of engineering. I have a couple of suggestions to improve robustness regarding gradient accumulation precision and exception handling.

# Split on dim=0
x_shards = list(torch.chunk(x, chunks=shards, dim=0))

grad_accumulator = GradientAccumulator(compute_params, shards, dtype=x.dtype)
Review comment (severity: medium):

The GradientAccumulator is initialized with dtype=x.dtype. This means that if x is a low-precision type like bfloat16 or float16, the gradients for the MLP weights will be accumulated in that same low precision. While this might be acceptable for bfloat16, it can lead to precision loss and numerical instability with float16. It is generally safer to perform gradient accumulation in float32 to maintain precision. The GradientAccumulator already defaults to torch.float32 if dtype is not provided. I suggest removing dtype=x.dtype to use the safer default.

Suggested change:

- grad_accumulator = GradientAccumulator(compute_params, shards, dtype=x.dtype)
+ grad_accumulator = GradientAccumulator(compute_params, shards)
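
To illustrate the precision concern with a toy example (not the PR's code): accumulating many small per-shard gradient contributions in float16 can stall once the running sum grows large relative to the addend, while a float32 accumulator keeps them.

import torch

# 10,000 identical small contributions of 1e-4; the true sum is 1.0.
addend = torch.full((1,), 1e-4)

acc_fp16 = torch.zeros(1, dtype=torch.float16)
acc_fp32 = torch.zeros(1, dtype=torch.float32)
for _ in range(10_000):
    acc_fp16 += addend.to(torch.float16)
    acc_fp32 += addend

print(acc_fp16.item())  # stalls well below 1.0: additions round away once the sum grows
print(acc_fp32.item())  # ~1.0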

Comment on lines +214 to +222
try:
    from accelerate import PartialState
    state = PartialState()
    if hasattr(state, 'fsdp_plugin') and state.fsdp_plugin is not None:
        # Check if fsdp_version is 2 in the plugin
        if hasattr(state.fsdp_plugin, 'fsdp_version'):
            return state.fsdp_plugin.fsdp_version == 2
except Exception:
    pass
Review comment (severity: medium):

Using a broad except Exception: clause can hide unexpected errors and make debugging more difficult. It's better to catch specific exceptions that you expect to handle, such as ImportError or RuntimeError if accelerate is not fully configured. This will make the code more robust and maintainable. The same applies to the is_fsdp1_enabled function.

Suggested change:

  try:
      from accelerate import PartialState
      state = PartialState()
      if hasattr(state, 'fsdp_plugin') and state.fsdp_plugin is not None:
          # Check if fsdp_version is 2 in the plugin
          if hasattr(state.fsdp_plugin, 'fsdp_version'):
              return state.fsdp_plugin.fsdp_version == 2
- except Exception:
+ except (ImportError, RuntimeError):
+     # It's possible that accelerate is not fully initialized or available.
+     # In such cases, we can safely ignore the exception and rely on env vars.
      pass

Comment on lines +233 to +240
try:
    from accelerate import PartialState
    state = PartialState()
    if hasattr(state, 'fsdp_plugin') and state.fsdp_plugin is not None:
        if hasattr(state.fsdp_plugin, 'fsdp_version'):
            return state.fsdp_plugin.fsdp_version != 2
except Exception:
    pass
Review comment (severity: medium):

Similar to the is_fsdp2_enabled function, using a broad except Exception: is not ideal as it can suppress unexpected errors. It's better to catch specific exceptions like ImportError and RuntimeError to avoid masking other potential issues and improve code robustness.

Suggested change:

  try:
      from accelerate import PartialState
      state = PartialState()
      if hasattr(state, 'fsdp_plugin') and state.fsdp_plugin is not None:
          if hasattr(state.fsdp_plugin, 'fsdp_version'):
              return state.fsdp_plugin.fsdp_version != 2
- except Exception:
+ except (ImportError, RuntimeError):
+     # It's possible that accelerate is not fully initialized or available.
+     # In such cases, we can safely ignore the exception and rely on env vars.
      pass


This module provides a tiled MLP implementation that is compatible with FSDP2.
- FSDP2: Uses custom TiledMLP implementation (this file)
- DeepSpeed/Single GPU: Uses liger_kernel's LigerTiledSwiGLUMLP
A collaborator commented:

@kevssim I'm not sure if LigerTiledSwiGLUMLP is available on an NPU. Is it possible to provide an NPU-compatible implementation?

@kevssim (author) replied:

LigerTiledSwiGLUMLP is implemented using native PyTorch and theoretically supports NPU, but further testing and verification are needed.

A collaborator commented:

Overall, this PR LGTM. Just one small request: could you test its gain on the NPU? Thanks.

@kevssim (author) replied:

I have added NPU support to LigerTiledSwiGLUMLP by replacing the LigerSiLUMulFunction with a native PyTorch implementation.

I have also added the NPU benchmark results to the PR description.
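
For reference, the device-agnostic replacement is essentially the standard SwiGLU forward written in plain PyTorch; the projection names below follow the common LLaMA-style MLP layout and are an assumption, not code copied from this PR.

import torch.nn.functional as F

def native_swiglu_forward(mlp, x):
    # silu(gate_proj(x)) * up_proj(x), then down_proj -- pure PyTorch ops,
    # so it runs on any backend PyTorch supports (CUDA, NPU, CPU), unlike a
    # Triton-fused kernel such as LigerSiLUMulFunction.
    return mlp.down_proj(F.silu(mlp.gate_proj(x)) * mlp.up_proj(x))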

@kevssim marked this pull request as draft on December 18, 2025 at 02:42
@kevssim marked this pull request as ready for review on December 18, 2025 at 07:17