Community-driven data-juicer recipes and best practices for various pre-training/fine-tuning tasks.
Detailed documentation about the recipes can be found here.
There are plenty of prepared recipes for data processing on different tasks. To use one, clone this repo and set `--config` to the local path of the target recipe file:
# clone this repo to somewhere on your local machine
git clone https://github.com/datajuicer/data-juicer-hub.git
# run with the actual local path to the target recipe
dj-process --config <root-of-data-juicer-hub>/demo/process.yaml --dataset_path <your-dataset-path>

This is a community-driven repo, so feel free to upload your own recipes to this repo! 😄
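
To pick a value for `--config`, it can help to see which recipe files a local clone contains. The sketch below is only an illustration: `list_recipes` is a hypothetical helper, not part of data-juicer, and it assumes recipes are stored as `.yaml` files, as the `demo/process.yaml` path above suggests.

```shell
# Hypothetical helper (not part of data-juicer): list candidate recipe files
# under a local clone of this repo.
# Assumption: recipes are stored as .yaml files, like demo/process.yaml.
list_recipes() {
  hub_root="$1"
  # Print every .yaml file below the given root, in sorted order.
  find "$hub_root" -type f -name '*.yaml' | sort
}

# Usage: list_recipes <root-of-data-juicer-hub>
```

Any path it prints can then be passed to `dj-process --config`.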