
[feat] Added script to create fast_llm_config.yaml files for datasets prepared with older fast llm version #173


Merged
merged 6 commits into main on Mar 14, 2025

Conversation

oleksost
Contributor

@oleksost oleksost commented Mar 6, 2025

✨ Description

Added a script that creates the fast_llm_config.yaml file for datasets prepared with an older fast llm version.

I think this is affected by the same problem as #172.

🔍 Type of change

Select all that apply:

  • 🐛 Bug fix (non-breaking change that addresses a specific issue)
  • 🚀 New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
  • 📝 Documentation change (updates documentation, including new content or typo fixes)
  • 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

📝 Changes

List the key changes introduced in this PR:

  1. Created the function generate_config_yaml_for_sharded_dst in prepare.py
  2. Added the generate_config_yaml_for_sharded_dst.py script
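The script body itself isn't quoted in this thread. As a rough sketch of what such a conversion helper could look like, assuming each shard left by the older prepare command has a "<prefix>.idx" index file next to it (the shard-discovery glob and the blended/memmap config schema below are assumptions for illustration, not this PR's actual implementation):

```python
import pathlib


def generate_config_yaml_for_sharded_dst(dataset_dir: str) -> str:
    """Write a minimal fast_llm_config.yaml for shards prepared by an
    older fast llm version.

    Hypothetical sketch: shard discovery via "*.idx" and the
    blended/memmap schema are assumptions, not the PR's real code.
    """
    dataset_path = pathlib.Path(dataset_dir)
    # Assume each old-format shard left a "<prefix>.idx" index file;
    # use the prefixes as relative dataset paths in a blended config.
    prefixes = sorted(p.stem for p in dataset_path.glob("*.idx"))
    lines = ["type: blended", "datasets:"]
    for prefix in prefixes:
        lines.append("  - type: memmap")
        lines.append(f"    path: {prefix}")
    body = "\n".join(lines) + "\n"
    (dataset_path / "fast_llm_config.yaml").write_text(body)
    return body
```

The YAML is emitted with plain string formatting here to keep the sketch dependency-free; a real version would more likely build a config dict and dump it with a YAML library.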

✅ Checklist

General

  • 📜 I have read and followed the contributing guidelines.
  • 🏷️ I am using a clear and descriptive PR title that summarizes the key change or feature introduced.
  • 🎉 The functionality is complete, and I have tested the changes.
  • 📝 I have updated the documentation if needed.
  • ⚠️ The change does not introduce any new issues (e.g., runtime warnings, type checker errors, linting problems, unhandled edge cases).
  • 🧩 I have commented my code, especially in hard-to-understand areas.

Dependencies and Configuration

  • 🐋 I have updated the Docker configuration or dependencies, if applicable.
  • 🔄 I have ensured compatibility with the existing setup after dependency changes.

Testing

  • 🧪 I have added or updated tests to cover my changes.
  • ✔️ New and existing tests pass locally with my changes.
  • 🚦 I have tested these changes on GPUs and verified training stability.
  • 🏋️ I have tested the changes on realistic training workloads, if applicable.

@oleksost oleksost marked this pull request as draft March 6, 2025 21:42
Collaborator

@jlamypoirier jlamypoirier left a comment


Looks good, but please make sure the file is formatted properly with pre-commit.

"dataset": {"path": "unknown"},
"tokenizer": {"path": "no_tokenizer"},
}
main(config_dict)
Collaborator


Missing pre-commit?

@@ -0,0 +1,86 @@
import pathlib
Collaborator


I'd add a disclaimer docstring saying that this is only for older datasets and not really intended for anything else.

@oleksost
Contributor Author

oleksost commented Mar 7, 2025

@jlamypoirier I suspect this script is not really needed if we are backward compatible with the json files that were produced by older versions of the prepare command. Could you give an example of how to use those older json files in the current version?

@jlamypoirier
Collaborator

jlamypoirier commented Mar 7, 2025

@oleksost the only supported way to use the json format is the old way of defining datasets: using a single json file as-is to define the whole dataset. You can't really use it to mix datasets or combine it with the file format, so you do need this script for a proper solution to #25. (Or we could add more backward compatibility for the json format, but I think that would need more work.)

.gitignore Outdated
@@ -36,3 +36,6 @@ devenv.*

# direnv
.direnv

# private folders
__*
Collaborator


Is this a common convention? Also missing newline.

Contributor Author


It's not common; I removed it.

@oleksost oleksost merged commit 331ff06 into main Mar 14, 2025
4 checks passed
@oleksost oleksost deleted the dataset_convertion_script branch March 14, 2025 18:58