
[feat] Added script to create fast_llm_config.yaml files for datasets prepared with older fast llm version #173


Merged
merged 6 commits into main on Mar 14, 2025

Conversation

oleksost
Contributor

@oleksost oleksost commented Mar 6, 2025

✨ Description

Added a script that creates the fast_llm_config.yaml file for datasets prepared with an older fast llm version.

I think this is affected by the same problem as #172.

🔍 Type of change

Select all that apply:

  • 🐛 Bug fix (non-breaking change that addresses a specific issue)
  • 🚀 New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
  • 📝 Documentation change (updates documentation, including new content or typo fixes)
  • 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

📝 Changes

List the key changes introduced in this PR:

  1. Created the function generate_config_yaml_for_sharded_dst in prepare.py
  2. Added the generate_config_yaml_for_sharded_dst.py script
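The script body itself isn't quoted in this thread. As a rough sketch of what such a conversion helper could look like, assuming each shard left by the older prepare command has a "<prefix>.idx" index file next to it (the shard-discovery glob and the blended/memmap config schema below are assumptions for illustration, not this PR's actual implementation):

```python
import pathlib


def generate_config_yaml_for_sharded_dst(dataset_dir: str) -> str:
    """Write a minimal fast_llm_config.yaml for shards prepared by an
    older fast llm version.

    Hypothetical sketch: shard discovery via "*.idx" and the
    blended/memmap schema are assumptions, not the PR's real code.
    """
    dataset_path = pathlib.Path(dataset_dir)
    # Assume each old-format shard left a "<prefix>.idx" index file;
    # use the prefixes as relative dataset paths in a blended config.
    prefixes = sorted(p.stem for p in dataset_path.glob("*.idx"))
    lines = ["type: blended", "datasets:"]
    for prefix in prefixes:
        lines.append("  - type: memmap")
        lines.append(f"    path: {prefix}")
    body = "\n".join(lines) + "\n"
    (dataset_path / "fast_llm_config.yaml").write_text(body)
    return body
```

The YAML is emitted with plain string formatting here to keep the sketch dependency-free; a real version would more likely build a config dict and dump it with a YAML library.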

✅ Checklist

General

  • 📜 I have read and followed the contributing guidelines.
  • 🏷️ I am using a clear and descriptive PR title that summarizes the key change or feature introduced.
  • 🎉 The functionality is complete, and I have tested the changes.
  • 📝 I have updated the documentation if needed.
  • ⚠️ The change does not introduce any new issues (e.g., runtime warnings, type checker errors, linting problems, unhandled edge cases).
  • 🧩 I have commented my code, especially in hard-to-understand areas.

Dependencies and Configuration

  • 🐋 I have updated the Docker configuration or dependencies, if applicable.
  • 🔄 I have ensured compatibility with the existing setup after dependency changes.

Testing

  • 🧪 I have added or updated tests to cover my changes.
  • ✔️ New and existing tests pass locally with my changes.
  • 🚦 I have tested these changes on GPUs and verified training stability.
  • 🏋️ I have tested the changes on realistic training workloads, if applicable.

@oleksost oleksost marked this pull request as draft March 6, 2025 21:42
Collaborator

@jlamypoirier jlamypoirier left a comment


Looks good, but please make sure the file is formatted properly with pre-commit.

"dataset": {"path": "unknown"},
"tokenizer": {"path": "no_tokenizer"},
}
main(config_dict)
Collaborator


Missing pre-commit?

@@ -0,0 +1,86 @@
import pathlib
Collaborator


I'd add a disclaimer docstring saying that this is only for older datasets and not really intended for anything else.

@oleksost
Contributor Author

oleksost commented Mar 7, 2025

@jlamypoirier I suspect this script is not really needed if we are backward compatible with the json files that were produced by older versions of the prepare command. Could you give an example of how to use those older json files in the current version?

@jlamypoirier
Collaborator

jlamypoirier commented Mar 7, 2025

@oleksost the only supported way to use the json format is the old way of defining datasets: using a single json file as-is to define the whole dataset. You can't really use it to mix datasets or combine it with the file format, so you do need this script for a proper solution to #25. (Or we could add more backward compatibility for the json format, but I think that would need more work.)

.gitignore Outdated
@@ -36,3 +36,6 @@ devenv.*

# direnv
.direnv

# private folders
__*
Collaborator


Is this a common convention? Also missing newline.

Contributor Author


It's not common; I removed it.

@oleksost oleksost merged commit 331ff06 into main Mar 14, 2025
4 checks passed
@oleksost oleksost deleted the dataset_convertion_script branch March 14, 2025 18:58