
Conversation

@ftgreat ftgreat (Collaborator) commented Oct 13, 2025

Support smollm2 tokenizer, training and checkpoint convert.

Signed-off-by: ldwang <[email protected]>
@gemini-code-assist (Contributor)

Summary of Changes

Hello @ftgreat, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces full support for the smollm2 model within the system. It provides the necessary configuration for training the SmolLM2-135m variant, integrates its specific tokenizer, and extends the checkpoint conversion tools to handle smollm2 models. This comprehensive update ensures that users can effectively train, manage, and convert smollm2 models using the existing framework.

Highlights

  • SmolLM2 Model Configuration: New YAML configuration files have been added to support the training of the SmolLM2-135m model, defining experiment settings, system parameters, model architecture, training hyperparameters, and data specifics.
  • Tokenizer Integration: The tokenizer factory has been updated to include SmolLM2TokenizerFS, allowing the system to correctly identify and utilize the SmolLM2 tokenizer for processing text data.
  • Checkpoint Conversion Support: The checkpoint conversion utility now officially supports smollm2 models, enabling seamless conversion of model checkpoints between different formats.
  • Argument and Checkpoint Utilities: New Python modules (args.py and ckpt.py) have been introduced under tools/checkpoint/smollm2 to handle the conversion of model arguments between Hugging Face and Megatron formats, and to manage the loading and saving of attention and MLP layer checkpoints, including considerations for tensor parallelism (a minimal sharding sketch is shown below).
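
To make the tensor-parallelism handling mentioned above concrete, here is a minimal sketch (a hypothetical helper, not code from this PR) of how a column-parallel linear weight is typically sharded across tensor-parallel ranks in Megatron-style checkpoints:

import torch

def split_column_parallel(weight: torch.Tensor, tp_size: int) -> list[torch.Tensor]:
    # Megatron-style column-parallel layout: a [out_features, in_features]
    # weight is split along dim 0, one contiguous shard per tensor-parallel rank.
    assert weight.size(0) % tp_size == 0, "out_features must divide evenly across TP ranks"
    return list(torch.chunk(weight, tp_size, dim=0))

# Example: shard a 576x576 projection across 2 tensor-parallel ranks.
shards = split_column_parallel(torch.randn(576, 576), tp_size=2)
assert all(s.shape == (288, 576) for s in shards)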
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in sharing feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check its output and use code with caution.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for the smollm2 model, including configuration files, tokenizer integration, and checkpoint conversion scripts. The overall structure is consistent with how other models are supported in this repository. However, I've found several issues, including hardcoded paths in configuration files which harm reproducibility, and what appear to be copy-paste errors in the checkpoint conversion scripts that could lead to bugs. I've also noted some opportunities for code cleanup and improvement regarding style and efficiency. Please see my detailed comments below.

args.consumed_train_samples = 0
args.consumed_valid_samples = 0
args.norm_has_bias = False
args.tokenizer_type = "Llama3TokenizerFS"

critical

The tokenizer_type is hardcoded to Llama3TokenizerFS. Since this script is for smollm2, it should be set to SmolLM2TokenizerFS. This is a critical bug that will cause the checkpoint conversion to fail or produce an incorrect result.

Suggested change
args.tokenizer_type = "Llama3TokenizerFS"
args.tokenizer_type = "SmolLM2TokenizerFS"

Comment on lines +25 to +27
args.rotary_seq_len_interpolation_factor = (
None if llama_args["rope_scaling"] == "null" else llama_args["rope_scaling"]
)

high

The check llama_args["rope_scaling"] == "null" is likely incorrect. In JSON, null is a keyword, not a string. When loaded by Python's json library, it becomes None. The check should be llama_args["rope_scaling"] is None.

Suggested change
args.rotary_seq_len_interpolation_factor = (
None if llama_args["rope_scaling"] == "null" else llama_args["rope_scaling"]
)
args.rotary_seq_len_interpolation_factor = (
None if llama_args.get("rope_scaling") is None else llama_args["rope_scaling"]
)
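
For reference, a quick illustration of the parsing behavior (a minimal sketch, assuming the Hugging Face config.json is read with Python's json module):

import json

cfg = json.loads('{"rope_scaling": null}')
# JSON null deserializes to Python None, not to the string "null",
# so an equality check against the string never matches.
print(cfg["rope_scaling"] is None)    # True
print(cfg["rope_scaling"] == "null")  # False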

no_mmap_bin_files: true
tokenizer:
tokenizer_type: SmolLM2TokenizerFS
tokenizer_path: /share/projset/ldwang/models/HuggingFaceTB/SmolLM2-135M

high

The tokenizer_path is a hardcoded absolute path. This will prevent the training script from running on any machine other than the one where this path is valid. Please replace it with a placeholder or a relative path to ensure the example is reproducible.

Comment on lines +30 to +31
args.bf16 = llama_args["dtype"] == "bfloat16"
args.fp16 = llama_args["dtype"] == "float16"

high

The Hugging Face config.json for Llama-like models typically uses the key torch_dtype, not dtype. Using dtype here will likely result in a KeyError. This seems to be a copy-paste error. Please verify the key in the smollm2 model's config.json and use the correct one, which is probably torch_dtype.

Suggested change
args.bf16 = llama_args["dtype"] == "bfloat16"
args.fp16 = llama_args["dtype"] == "float16"
args.bf16 = llama_args["torch_dtype"] == "bfloat16"
args.fp16 = llama_args["torch_dtype"] == "float16"
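
If both spellings need to be tolerated, a more defensive sketch could be used (a hypothetical helper, assuming llama_args is the parsed config dict; the fallback to "dtype" is only a guess in case some configs use that key):

def set_precision_flags(args, llama_args: dict) -> None:
    # Prefer the standard Hugging Face key "torch_dtype"; fall back to "dtype"
    # only if a config happens to use that spelling.
    dtype = llama_args.get("torch_dtype", llama_args.get("dtype"))
    args.bf16 = dtype == "bfloat16"
    args.fp16 = dtype == "float16"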



def save_args_mg2hf(args):
from transformers import LlamaConfig

high

The script imports and uses LlamaConfig. This is misleading and potentially incorrect for the smollm2 model. If smollm2 has its own config class, that should be used. If not, it's better to use a more generic AutoConfig to avoid issues if the architectures are not perfectly compatible.
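
A minimal sketch of the AutoConfig route (the checkpoint path below is a placeholder; in the conversion script it would come from the command-line arguments):

from transformers import AutoConfig

# AutoConfig resolves the concrete config class from the checkpoint's own
# config.json (for SmolLM2 this resolves to a Llama-style config), so the
# script does not need to hardcode LlamaConfig.
hf_config = AutoConfig.from_pretrained("path/to/SmolLM2-135M")
print(type(hf_config).__name__)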

# nnodes: 1
nproc_per_node: 1
cmds:
before_start: ulimit -n 1048576 && source /root/miniconda3/bin/activate flagscale-train

medium

The path /root/miniconda3/bin/activate is hardcoded. This makes the configuration less portable and might fail in environments where miniconda is installed elsewhere. Consider using an environment variable or documenting this as a prerequisite with a clear setup instruction.

NCCL_P2P_LL_THRESHOLD: 0
IXCCL_MIX_NV: 1
IXCCL_FUSED_ENABLE: 0
NCCL_IB_DISABLE: 0

medium

The configuration key NCCL_IB_DISABLE is defined here, but it's also present (though commented out) on line 30. This duplication can be confusing. Please remove the redundant entry to improve clarity.


import torch

sys.path.append("..")

medium

Modifying sys.path is generally considered bad practice as it can lead to unexpected behavior and makes the code harder to maintain. You can use a relative import here instead.

from ..mixtral.ckpt import (
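
One caveat worth noting for the relative-import form (a sketch, assuming the checkpoint tools are importable as a package, e.g. run via python -m rather than as a standalone script):

# Relative imports only work when this file is loaded as part of its package;
# running it directly raises "attempted relative import with no known parent
# package". Importing the sibling module explicitly keeps the namespace clean
# without touching sys.path.
from ..mixtral import ckpt as mixtral_ckpt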

@@ -0,0 +1,4 @@
import sys

sys.path.append("..")

medium

Modifying sys.path is generally considered bad practice. A relative import would be a better solution here.

import sys

sys.path.append("..")
from mixtral.model import *

medium

Using wildcard imports (*) is discouraged because it can pollute the namespace and make it unclear where functions or classes are coming from. It's better to import the specific names you need.

Suggested change
from mixtral.model import *
from ..mixtral.model import get_hf_model, get_mg_model
