
Conversation

@ftgreat ftgreat (Collaborator) commented Oct 13, 2025

Support smollm2 tokenizer, training and checkpoint convert.

Signed-off-by: ldwang <[email protected]>
@gemini-code-assist (Contributor)

Summary of Changes

Hello @ftgreat, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces full support for the smollm2 model within the system. It provides the necessary configuration for training the SmolLM2-135m variant, integrates its specific tokenizer, and extends the checkpoint conversion tools to handle smollm2 models. This comprehensive update ensures that users can effectively train, manage, and convert smollm2 models using the existing framework.

Highlights

  • SmolLM2 Model Configuration: New YAML configuration files have been added to support the training of the SmolLM2-135m model, defining experiment settings, system parameters, model architecture, training hyperparameters, and data specifics.
  • Tokenizer Integration: The tokenizer factory has been updated to include SmolLM2TokenizerFS, allowing the system to correctly identify and utilize the SmolLM2 tokenizer for processing text data.
  • Checkpoint Conversion Support: The checkpoint conversion utility now officially supports smollm2 models, enabling seamless conversion of model checkpoints between different formats.
  • Argument and Checkpoint Utilities: New Python modules (args.py and ckpt.py) have been introduced under tools/checkpoint/smollm2 to handle the conversion of model arguments between Hugging Face and Megatron formats, and to manage the loading and saving of attention and MLP layer checkpoints, including considerations for tensor parallelism (a minimal sharding sketch is shown below).
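
To make the tensor-parallelism handling mentioned above concrete, here is a minimal sketch (a hypothetical helper, not code from this PR) of how a column-parallel linear weight is typically sharded across tensor-parallel ranks in Megatron-style checkpoints:

import torch

def split_column_parallel(weight: torch.Tensor, tp_size: int) -> list[torch.Tensor]:
    # Megatron-style column-parallel layout: a [out_features, in_features]
    # weight is split along dim 0, one contiguous shard per tensor-parallel rank.
    assert weight.size(0) % tp_size == 0, "out_features must divide evenly across TP ranks"
    return list(torch.chunk(weight, tp_size, dim=0))

# Example: shard a 576x576 projection across 2 tensor-parallel ranks.
shards = split_column_parallel(torch.randn(576, 576), tp_size=2)
assert all(s.shape == (288, 576) for s in shards)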
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in sharing feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check its output and use code with caution.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for the smollm2 model, including configuration files, tokenizer integration, and checkpoint conversion scripts. The overall structure is consistent with how other models are supported in this repository. However, I've found several issues, including hardcoded paths in configuration files which harm reproducibility, and what appear to be copy-paste errors in the checkpoint conversion scripts that could lead to bugs. I've also noted some opportunities for code cleanup and improvement regarding style and efficiency. Please see my detailed comments below.

args.consumed_train_samples = 0
args.consumed_valid_samples = 0
args.norm_has_bias = False
args.tokenizer_type = "Llama3TokenizerFS"

critical

The tokenizer_type is hardcoded to Llama3TokenizerFS. Since this script is for smollm2, it should be set to SmolLM2TokenizerFS. This is a critical bug that will cause the checkpoint conversion to fail or produce an incorrect result.

Suggested change
args.tokenizer_type = "Llama3TokenizerFS"
args.tokenizer_type = "SmolLM2TokenizerFS"

Comment on lines +25 to +27
args.rotary_seq_len_interpolation_factor = (
None if llama_args["rope_scaling"] == "null" else llama_args["rope_scaling"]
)

high

The check llama_args["rope_scaling"] == "null" is likely incorrect. In JSON, null is a keyword, not a string. When loaded by Python's json library, it becomes None. The check should be llama_args["rope_scaling"] is None.

Suggested change
args.rotary_seq_len_interpolation_factor = (
None if llama_args["rope_scaling"] == "null" else llama_args["rope_scaling"]
)
args.rotary_seq_len_interpolation_factor = (
None if llama_args.get("rope_scaling") is None else llama_args["rope_scaling"]
)
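
For reference, a quick illustration of the parsing behavior (a minimal sketch, assuming the Hugging Face config.json is read with Python's json module):

import json

cfg = json.loads('{"rope_scaling": null}')
# JSON null deserializes to Python None, not to the string "null",
# so an equality check against the string never matches.
print(cfg["rope_scaling"] is None)    # True
print(cfg["rope_scaling"] == "null")  # False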

no_mmap_bin_files: true
tokenizer:
tokenizer_type: SmolLM2TokenizerFS
tokenizer_path: /share/projset/ldwang/models/HuggingFaceTB/SmolLM2-135M

high

The tokenizer_path is a hardcoded absolute path. This will prevent the training script from running on any machine other than the one where this path is valid. Please replace it with a placeholder or a relative path to ensure the example is reproducible.

Comment on lines +30 to +31
args.bf16 = llama_args["dtype"] == "bfloat16"
args.fp16 = llama_args["dtype"] == "float16"

high

The Hugging Face config.json for Llama-like models typically uses the key torch_dtype, not dtype. Using dtype here will likely result in a KeyError. This seems to be a copy-paste error. Please verify the key in the smollm2 model's config.json and use the correct one, which is probably torch_dtype.

Suggested change
args.bf16 = llama_args["dtype"] == "bfloat16"
args.fp16 = llama_args["dtype"] == "float16"
args.bf16 = llama_args["torch_dtype"] == "bfloat16"
args.fp16 = llama_args["torch_dtype"] == "float16"
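
If both spellings need to be tolerated, a more defensive sketch could be used (a hypothetical helper, assuming llama_args is the parsed config dict; the fallback to "dtype" is only a guess in case some configs use that key):

def set_precision_flags(args, llama_args: dict) -> None:
    # Prefer the standard Hugging Face key "torch_dtype"; fall back to "dtype"
    # only if a config happens to use that spelling.
    dtype = llama_args.get("torch_dtype", llama_args.get("dtype"))
    args.bf16 = dtype == "bfloat16"
    args.fp16 = dtype == "float16"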



def save_args_mg2hf(args):
from transformers import LlamaConfig

high

The script imports and uses LlamaConfig. This is misleading and potentially incorrect for the smollm2 model. If smollm2 has its own config class, that should be used. If not, it's better to use a more generic AutoConfig to avoid issues if the architectures are not perfectly compatible.
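
A minimal sketch of the AutoConfig route (the checkpoint path below is a placeholder; in the conversion script it would come from the command-line arguments):

from transformers import AutoConfig

# AutoConfig resolves the concrete config class from the checkpoint's own
# config.json (for SmolLM2 this resolves to a Llama-style config), so the
# script does not need to hardcode LlamaConfig.
hf_config = AutoConfig.from_pretrained("path/to/SmolLM2-135M")
print(type(hf_config).__name__)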

# nnodes: 1
nproc_per_node: 1
cmds:
before_start: ulimit -n 1048576 && source /root/miniconda3/bin/activate flagscale-train

medium

The path /root/miniconda3/bin/activate is hardcoded. This makes the configuration less portable and might fail in environments where miniconda is installed elsewhere. Consider using an environment variable or documenting this as a prerequisite with a clear setup instruction.

NCCL_P2P_LL_THRESHOLD: 0
IXCCL_MIX_NV: 1
IXCCL_FUSED_ENABLE: 0
NCCL_IB_DISABLE: 0

medium

The configuration key NCCL_IB_DISABLE is defined here, but it's also present (though commented out) on line 30. This duplication can be confusing. Please remove the redundant entry to improve clarity.


import torch

sys.path.append("..")

medium

Modifying sys.path is generally considered bad practice as it can lead to unexpected behavior and makes the code harder to maintain. You can use a relative import here instead.

from ..mixtral.ckpt import (
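
One caveat worth noting for the relative-import form (a sketch, assuming the checkpoint tools are importable as a package, e.g. run via python -m rather than as a standalone script):

# Relative imports only work when this file is loaded as part of its package;
# running it directly raises "attempted relative import with no known parent
# package". Importing the sibling module explicitly keeps the namespace clean
# without touching sys.path.
from ..mixtral import ckpt as mixtral_ckpt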

@@ -0,0 +1,4 @@
import sys

sys.path.append("..")

medium

Modifying sys.path is generally considered bad practice. A relative import would be a better solution here.

import sys

sys.path.append("..")
from mixtral.model import *

medium

Using wildcard imports (*) is discouraged because it can pollute the namespace and make it unclear where functions or classes are coming from. It's better to import the specific names you need.

Suggested change
from mixtral.model import *
from ..mixtral.model import get_hf_model, get_mg_model
