[feat] Added script to create fast_llm_config.yaml files for datasets prepared with older fast llm version
#173
Looks good, but please make sure the file is formatted properly with pre-commit.
    "dataset": {"path": "unknown"},
    "tokenizer": {"path": "no_tokenizer"},
}
main(config_dict)
Missing pre-commit?
@@ -0,0 +1,86 @@
import pathlib
I'd add a disclaimer docstring saying that this is only for older datasets, and not really intended for anything else
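A disclaimer along the lines suggested above might look like the following sketch. The wording of the notice, the `LEGACY_ONLY_NOTICE` constant, the `build_parser` helper, and the `dataset_dir` argument are all hypothetical; the actual script in this PR may be structured differently.

```python
"""Generate fast_llm_config.yaml for datasets prepared with an older Fast-LLM version.

Disclaimer: this script only targets datasets produced by older versions of the
prepare command. It is not intended for any other use.
"""
import argparse

# Hypothetical notice text, reused for the command-line help so the
# disclaimer is visible both in the source and when running with --help.
LEGACY_ONLY_NOTICE = (
    "Only for datasets prepared with an older Fast-LLM version; "
    "not intended for anything else."
)


def build_parser() -> argparse.ArgumentParser:
    """Build a command-line parser that surfaces the disclaimer in --help."""
    parser = argparse.ArgumentParser(description=LEGACY_ONLY_NOTICE)
    # Hypothetical argument: directory holding the legacy dataset files.
    parser.add_argument("dataset_dir", help="directory of the legacy dataset")
    return parser
```

Putting the notice in both the module docstring and the parser description means a user sees it whether they read the file or run it.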
@jlamypoirier I suspect this script is not really needed if we are backward compatible with the json files that were produced by older versions of the prepare command. Could you give an example of how to use those older json files in the current version?
@oleksost the only supported way to use the json format is with the old way of defining datasets, using a single json file as-is to define the whole dataset. You can't really use it to mix datasets or combine with the
.gitignore
Outdated
@@ -36,3 +36,6 @@ devenv.*

# direnv
.direnv

# private folders
__*
Is this a common convention? Also missing newline.
It's not common; I removed it.
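To see why a `__*` ignore pattern is broad, here is a rough sketch using Python's `fnmatch` as an approximation. Git's own matching rules are richer than `fnmatch`, but a slash-free pattern like this one is matched against names at any directory level, so the approximation is reasonable for plain names. The candidate names are illustrative only.

```python
from fnmatch import fnmatchcase

# The pattern that was proposed for .gitignore and later removed.
PATTERN = "__*"

# Illustrative file and directory names, not taken from the repository.
candidates = ["__private", "__pycache__", "__init__.py", "fast_llm", "_single"]

# Keep every name the glob would match (case-sensitive comparison).
ignored = [name for name in candidates if fnmatchcase(name, PATTERN)]

print(ignored)
```

Under this approximation the pattern also catches `__init__.py`, which suggests it could hide files that should be tracked.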
✨ Description
Added a script that is meant to create the fast_llm_config.yaml file for datasets prepared with an older Fast-LLM version. I think this is affected by the same problem as #172.
🔍 Type of change
Select all that apply:
📝 Changes
List the key changes introduced in this PR:
generate_config_yaml_for_sharded_dst in prepare.py
generate_config_yaml_for_sharded_dst.py script
✅ Checklist
General
Dependencies and Configuration
Testing