add multi-stage guide #234

tscholak · 2025-04-16T17:43:48Z

✨ Description

add short multi-stage guide

🔍 Type of change

Select all that apply:

🐛 Bug fix (non-breaking change that addresses a specific issue)
🚀 New feature (non-breaking change that adds functionality)
⚠️ Breaking change (a change that could affect existing functionality)
📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
📝 Documentation change (updates documentation, including new content or typo fixes)
🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

jlamypoirier

Thanks for the guide, I have some minor comments

docs/user_guide/multi-stage.md

jlamypoirier

Thanks for the changes! This looks good, though I have some minor suggestions.

jlamypoirier · 2025-04-17T16:28:47Z

docs/user_guide/multi-stage.md

+| `2`           | Replicated | Sharded    | Sharded          | Moderate[^1]              |
+| `3`           | Sharded    | Sharded    | Sharded          | High[^2]                  |
+
+[^1]: Communication overhead for ZeRO Stage 2 is similar to Stage 1, except during (depth-first) gradient accumulation when additional all-reduce operations occur.


Technically, it's a reduce-scatter, not an all-reduce.
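
To make the distinction concrete, here is a minimal sketch in plain Python. It does not use any real collective library: `all_reduce` and `reduce_scatter` are illustrative helpers that simulate ranks as lists, and the even split across ranks is an assumption.

```python
# Toy simulation: each "rank" is a list of gradient values; nothing is actually communicated.

def all_reduce(per_rank):
    """Every rank ends up with the full element-wise sum."""
    total = [sum(vals) for vals in zip(*per_rank)]
    return [list(total) for _ in per_rank]


def reduce_scatter(per_rank):
    """Each rank ends up with only its own contiguous shard of the sum."""
    world_size = len(per_rank)
    total = [sum(vals) for vals in zip(*per_rank)]
    shard = len(total) // world_size  # assumes the length divides evenly
    return [total[r * shard:(r + 1) * shard] for r in range(world_size)]


grads = [[1, 2, 3, 4], [10, 20, 30, 40]]  # two ranks, four gradient elements each
full = all_reduce(grads)        # both ranks hold the complete reduced gradient
shards = reduce_scatter(grads)  # rank 0 holds the first half, rank 1 the second
```

With sharded gradients, each rank only needs its own shard of the reduced result, which is why reduce-scatter is the more precise term here.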

jlamypoirier · 2025-04-17T16:35:15Z

docs/user_guide/multi-stage.md

+
+### Buffers
+
+When gradients or weights are sharded, Fast-LLM accumulates partial results in shared *buffers* during forward and backward passes, separately for gradients and weights. These buffers reduce communication overhead by batching gradient or weight updates across GPUs or nodes. The options `num_grad_buffers` and `num_weight_buffers` control the number of buffers used for gradients and weights, respectively.


Might be useful to state explicitly how this relates to ZeRO stages:

- `num_layers` buffers: Store all layers in memory, as in traditional (non-ZeRO) DP.
- 2 buffers: Keep weights/gradients for one layer at a time, i.e. ZeRO stage 2/3. The second buffer is there for network overlap.
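
A rough way to picture the memory side of this, as a toy model only: `peak_buffer_elements` is a hypothetical helper, not Fast-LLM's allocation logic, and it assumes each buffer holds one full (unsharded) layer.

```python
def peak_buffer_elements(layer_sizes, num_buffers):
    """Upper bound on elements held in full-layer buffers at once (toy model)."""
    if not 1 <= num_buffers <= len(layer_sizes):
        raise ValueError("num_buffers must be between 1 and the number of layers")
    # Worst case: the buffers hold the largest layers at the same time.
    return sum(sorted(layer_sizes, reverse=True)[:num_buffers])


layer_sizes = [100, 100, 100, 100]  # hypothetical parameter counts per layer

# One buffer per layer: everything stays resident, as in non-ZeRO DP.
all_resident = peak_buffer_elements(layer_sizes, len(layer_sizes))

# Two buffers: one layer in use, one being prefetched or reduced.
two_buffers = peak_buffer_elements(layer_sizes, 2)
```

In this model the two-buffer setting holds half as many elements as the one-buffer-per-layer setting, and the gap grows with the number of layers.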

jlamypoirier · 2025-04-17T16:37:09Z

docs/user_guide/multi-stage.md

+
+By default, Fast-LLM assigns one gradient and weight buffer per stage, where the number of stages equals the total number of logical partitions (stages) of the model. This enables overlapping communication (e.g., data transfers between GPUs or nodes) with computation (actual processing done by each GPU or node). Lower values (e.g., 1) reduce this overlap, potentially increasing communication waiting times.
+
+Increasing `num_grad_buffers` or `num_weight_buffers` provides more room for overlapping communication with compute. This can help in some setups, especially when stages are imbalanced, but generally isn't necessary. Note that this does not reduce total communication; it just shifts when it happens.


There's a missing transition from the previous paragraph, which makes it look like we're going higher than `num_layers`. Reducing the count (to 1) is also an option, sacrificing network overlap for lower memory usage.
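
The trade-off can be sketched with a simple timing model. This is an illustration under stated assumptions, not a measurement: `step_time` is a hypothetical helper, every layer is assumed to have the same compute and communication cost, and a second buffer is assumed to hide communication fully behind compute whenever compute is the longer of the two.

```python
def step_time(num_layers, compute, comm, num_buffers):
    """Toy estimate of the time to process all layers once."""
    if num_buffers < 1:
        raise ValueError("need at least one buffer")
    if num_buffers == 1:
        # No overlap: each layer waits for its communication, then computes.
        return num_layers * (compute + comm)
    # With a spare buffer, communication for the next layer runs while the
    # current layer computes; only the first transfer is fully exposed.
    return comm + (num_layers - 1) * max(compute, comm) + compute


serial = step_time(num_layers=4, compute=3, comm=2, num_buffers=1)
overlapped = step_time(num_layers=4, compute=3, comm=2, num_buffers=2)
```

Under these assumptions a single buffer costs extra waiting time but needs the least memory, which is the option described above.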

jlamypoirier · 2025-04-17T16:37:57Z

docs/user_guide/multi-stage.md

+- **`stages_per_pipeline_stage`**: Intended to specify how many stages run per pipeline worker when pipeline parallelism is active.
+
+    !!! warning
+        This feature is currently **not implemented**. Changing this value has no effect.


Technically, validation will fail if the value is changed.

add multi-stage guide

9ed2286

tscholak requested a review from jlamypoirier April 16, 2025 17:44

Merge branch 'main' of github.com:ServiceNow/Fast-LLM into tscholak/m…

1d711d0

…ulti-stage-guide

jlamypoirier reviewed Apr 16, 2025

address comments

120df73

tscholak requested a review from jlamypoirier April 17, 2025 00:46

jlamypoirier approved these changes Apr 17, 2025
