diff --git a/README.md b/README.md
index f778815b8..36cc476ae 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,6 @@
-# Open R1
+# Suanfamama F1 Fashion Model
-*A fully open reproduction of DeepSeek-R1. This repo is a work in progress, let's build it together!*
+*A domain-specialized fashion model. This repo is a work in progress, let's build it together!*
**Table of Contents**
1. [Overview](#overview)
@@ -10,31 +10,28 @@
- [SFT](#sft)
- [GRPO](#grpo)
5. [Evaluating models](#evaluating-models)
-6. [Reproducing Deepseek's evaluation results](#reproducing-deepseeks-evaluation-results)
-7. [Data generation](#data-generation)
- - [Generate data from a smol distilled R1 model](#generate-data-from-a-smol-distilled-r1-model)
- - [Generate data from DeepSeek-R1](#generate-data-from-deepseek-r1)
-8. [Contributing](#contributing)
+6. [Data generation](#data-generation)
+7. [Contributing](#contributing)
## Overview
-The goal of this repo is to build the missing pieces of the R1 pipeline such that everybody can reproduce and build on top of it. The project is simple by design and mostly consists of:
+The goal of this repo is to train a specialized fashion model from DeepSeek-R1. We aim to build a model that excels at understanding fashion-related queries, generating style recommendations, and analyzing fashion trends. The project is simple by design and mostly consists of:
- `src/open_r1`: contains the scripts to train and evaluate models as well as generate synthetic data:
- `grpo.py`: trains a model with GRPO on a given dataset.
- `sft.py`: performs a simple SFT of a model on a dataset.
- - `evaluate.py`: evaluates a model on the R1 benchmarks.
- - `generate.py`: generates synthetic data from a model using [Distilabel](https://github.com/argilla-io/distilabel).
-- `Makefile`: contains easy-to-run commands for each step in the R1 pipeline leveraging the scripts above.
+ - `evaluate.py`: evaluates a model on the fashion benchmarks.
+ - `generate.py`: generates synthetic fashion data from a model using [Distilabel](https://github.com/argilla-io/distilabel).
+- `Makefile`: contains easy-to-run commands for each step in the fashion model pipeline leveraging the scripts above.
### Plan of attack
-We will use the DeepSeek-R1 [tech report](https://github.com/deepseek-ai/DeepSeek-R1) as a guide, which can roughly be broken down into three main steps:
+We will use DeepSeek-R1 as our base model, and our approach can be broken down into three main steps:
-* Step 1: replicate the R1-Distill models by distilling a high-quality corpus from DeepSeek-R1.
-* Step 2: replicate the pure RL pipeline that DeepSeek used to create R1-Zero. This will likely involve curating new, large-scale datasets for math, reasoning, and code.
-* Step 3: show we can go from base model to RL-tuned via multi-stage training.
+* Step 1: Curate a high-quality fashion dataset for training, including product descriptions, style guides, and fashion terminology.
+* Step 2: Fine-tune the DeepSeek-R1 model on this fashion dataset using SFT techniques.
+* Step 3: Further refine the model using GRPO to enhance its ability to generate relevant and accurate fashion recommendations.
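+
+As a rough illustration of what the curated data from Steps 1 and 2 might look like once formatted for SFT, each example could pair a fashion query with a reasoned answer in the conversational `messages` format that TRL-style SFT scripts typically consume. The schema below is only a sketch, not a finalized spec:
+
+```python
+# Hypothetical shape of a single fashion SFT example (assumed schema, not final).
+fashion_example = {
+    "messages": [
+        {"role": "user", "content": "What should I wear to a summer wedding?"},
+        {
+            "role": "assistant",
+            "content": "For an outdoor summer wedding, a breathable linen suit in "
+                       "beige or pale blue with loafers keeps things semi-formal.",
+        },
+    ]
+}
+```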
@@ -91,13 +88,13 @@ sudo apt-get install git-lfs
## Training models
-We support training models with either DDP or DeepSpeed (ZeRO-2 and ZeRO-3). For example, to run SFT on a dataset distilled from DeepSeek-R1 with reasoning traces such as [open-r1/OpenR1-Math-220k](https://huggingface.co/datasets/open-r1/OpenR1-Math-220k), run:
+We support training models with either DDP or DeepSpeed (ZeRO-2 and ZeRO-3). For example, to run SFT on a fashion dataset, run:
```shell
# Train via command line
accelerate launch --config_file=recipes/accelerate_configs/zero3.yaml src/open_r1/sft.py \
- --model_name_or_path Qwen/Qwen2.5-1.5B-Instruct \
- --dataset_name open-r1/OpenR1-Math-220k \
+ --model_name_or_path deepseek-ai/DeepSeek-R1 \
+ --dataset_name fashion-dataset \
--learning_rate 1.0e-5 \
--num_train_epochs 1 \
--packing \
@@ -105,11 +102,11 @@ accelerate launch --config_file=recipes/accelerate_configs/zero3.yaml src/open_r
--per_device_train_batch_size 16 \
--gradient_checkpointing \
--bf16 \
- --output_dir data/Qwen2.5-1.5B-Open-R1-Distill
+ --output_dir data/DeepSeek-R1-Fashion
# Train via YAML config
accelerate launch --config_file recipes/accelerate_configs/zero3.yaml src/open_r1/sft.py \
- --config recipes/Qwen2.5-1.5B-Instruct/sft/config_demo.yaml
+    --config recipes/DeepSeek-R1-Fashion/sft/config_fashion.yaml
```
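+
+Note that `fashion-dataset` above is a placeholder dataset id. One way to make it resolvable is to publish the JSONL written by the repo's `generate_fashion_dataset.py` helper (which stores `text`/`response` pairs) to the Hub under your own namespace and pass that id to `--dataset_name`. A minimal sketch, assuming the script's default output path; the repo id is a placeholder:
+
+```python
+# Minimal sketch: turn the locally generated JSONL into a Hub dataset so it can
+# be referenced via --dataset_name. The repo id and paths are placeholders.
+from datasets import load_dataset
+
+ds = load_dataset("json", data_files="data/fashion-dataset", split="train")
+
+# Map the {"text", "response"} records written by generate_fashion_dataset.py
+# into the conversational "messages" format used for SFT.
+ds = ds.map(
+    lambda ex: {
+        "messages": [
+            {"role": "user", "content": ex["text"]},
+            {"role": "assistant", "content": ex["response"]},
+        ]
+    },
+    remove_columns=ds.column_names,
+)
+ds.push_to_hub("your-username/fashion-dataset")  # then: --dataset_name your-username/fashion-dataset
+```
+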
Currently, the following tasks are supported:
@@ -125,7 +122,7 @@ By default, these scripts will push each model to your Hugging Face Hub username
```shell
# Change batch size, number of epochs etc
accelerate launch --config_file recipes/accelerate_configs/zero3.yaml src/open_r1/sft.py \
- --config recipes/Qwen2.5-1.5B-Instruct/sft/config_demo.yaml
+    --config recipes/DeepSeek-R1-Fashion/sft/config_fashion.yaml \
--per_device_train_batch_size=1 --num_train_epochs=5
```
@@ -133,8 +130,8 @@ If you also wish to override the Weights and Biases default settings, you can do
```shell
accelerate launch --config_file recipes/accelerate_configs/zero3.yaml src/open_r1/sft.py \
- --config recipes/Qwen2.5-1.5B-Instruct/sft/config_demo.yaml
- --wandb_entity huggingface --wandb_project open-r1 --run_name Qwen2.5-1.5B-GRPO
+    --config recipes/DeepSeek-R1-Fashion/sft/config_fashion.yaml \
+ --wandb_entity huggingface --wandb_project fashion-r1 --run_name DeepSeek-R1-Fashion
```
> [!NOTE]
@@ -142,12 +139,12 @@ accelerate launch --config_file recipes/accelerate_configs/zero3.yaml src/open_r
### SFT
-To run SFT on a dataset distilled from DeepSeek-R1 with reasoning traces such as [open-r1/OpenR1-Math-220k](https://huggingface.co/datasets/open-r1/OpenR1-Math-220k), run:
+To run SFT on a fashion dataset, run:
```shell
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/zero3.yaml \
src/open_r1/sft.py \
- --config recipes/Qwen2.5-1.5B-Instruct/sft/config_demo.yaml
+    --config recipes/DeepSeek-R1-Fashion/sft/config_fashion.yaml
```
### GRPO
@@ -157,89 +154,7 @@ To train via the GRPO trainer, we use one GPU to run vLLM for faster generation
```shell
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/zero2.yaml \
--num_processes=7 src/open_r1/grpo.py \
- --config recipes/DeepSeek-R1-Distill-Qwen-1.5B/grpo/config_demo.yaml
-```
-
-> [!WARNING]
-> The chat template used in the distilled DeepSeek models omits the contents of the reasoning block within the `` and `` tags. It also prefills the assistant response with `` which interferes with the format reward function. To handle that, it is important to override the chat template as done in e.g. [recipes/DeepSeek-R1-Distill-Qwen-1.5B/grpo/config_demo.yaml](./recipes/DeepSeek-R1-Distill-Qwen-1.5B/grpo/config_demo.yaml).
-
-
-We provide a minimal reproducible experiment using GRPO for mathematical reasoning, referencing the approach from [SimpleRL-Reason](https://hkust-nlp.notion.site/simplerl-reason) which uses a 7B model trained on 8K examples. Running this on 8 H100 80G GPU takes about 3 hours:
-
-```shell
-ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/zero2.yaml \
- --num_processes=7 src/open_r1/grpo.py \
- --config recipes/Qwen2.5-Math-7B/grpo/config_simple_rl.yaml
-```
-
-Our final [model](https://huggingface.co/Dongwei/Qwen-2.5-7B_Base_Math_smalllr), while using different learning rates, loss functions and reward structures, achieves 69.4% accuracy on MATH-500, demonstrating a 17%+ improvement over the base model.
-
-#### 👨💻 Training with a code interpreter
-
-We provide a `code` reward function for executing code generated by the policy during training. Currently, this reward function targets code contests like [Codeforces](https://codeforces.com), where solutions are executed against a set of test cases and the overall success rate is returned as the final reward. To ensure safe execution, we use [E2B](https://e2b.dev) sandboxes, which are fast and cheap to run. To use this reward function, first install the necessary dependencies:
-
-```shell
-uv pip install -e '.[code]'
-```
-
-Then create a `.env` file and place an API token from E2B within it:
-
-```
-E2B_API_KEY="e2b_xxx"
-```
-
-Then make sure your dataset contains a `verification_info` column with the following schema (adopted from PrimeIntellect's excellent [datasets](https://huggingface.co/collections/PrimeIntellect/synthetic-1-67a2c399cfdd6c9f7fae0c37) of verifiable problems):
-
-```python
-{
- "language": "python",
- "test_cases": [
- {
- "input": "4\n4\n0001\n1000\n0011\n0111\n3\n010\n101\n0\n2\n00000\n00001\n4\n01\n001\n0001\n00001\n",
- "output": "1\n3 \n-1\n0\n\n2\n1 2 \n",
- "type": "stdin_stdout",
- }
- ],
-}
-```
-
-For example, to train a smol model on Python problems, run:
-
-```shell
-ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/zero2.yaml \
- --num_processes=7 src/open_r1/grpo.py \
- --config recipes/Qwen2.5-1.5B-Instruct/grpo/config_demo_code.yaml
-```
-
-#### Data decontamination
-
-Following [s1: Simple test-time scaling](https://arxiv.org/abs/2501.19393) the data can be decontaminated using the script at: [scripts/decontaminate.py](./scripts/decontaminate.py), which decontaminates a dataset using 8-grams and deduplicate the data. Sample run:
-
-```shell
-python scripts/decontaminate.py \
- --dataset "open-r1/verifiable-coding-problems-python" \
- --problem_column problem \
- --cleanup
-```
-
-It will decontaminate against the benchmark datasets, and remove the contaminated samples afterwards. If no argument `--new_dataset_name` is provided, the same dataset will be reused, adding a `_decontaminated`. It runs against the prompt, which for this dataset is the column `problem`, but a different one can be provided.
-
-Arguments for the script:
-
-```shell
-usage: decontaminate.py [-h] --dataset DATASET [--split SPLIT] [--ngram_size NGRAM_SIZE] [--problem_column PROBLEM_COLUMN] [--cleanup] [--new_dataset_name NEW_DATASET_NAME]
-
-options:
- -h, --help show this help message and exit
- --dataset DATASET Name of the dataset to check for contamination.
- --split SPLIT Split to check for contamination, defaults to `train`.
- --ngram_size NGRAM_SIZE
- Size of n-grams to build, defaults to 8.
- --problem_column PROBLEM_COLUMN
- Name of the column containing the problem (prompt).
- --cleanup Whether to remove the contaminated rows before pushing the dataset.
- --new_dataset_name NEW_DATASET_NAME
- New name for the dataset. If not provided, will reuse the name and add a `_decontaminated` to the name.
+ --config recipes/DeepSeek-R1-Fashion/grpo/config_fashion.yaml
```
### Launching jobs on a Slurm cluster
@@ -247,14 +162,14 @@ options:
If you have access to a Slurm cluster, we provide a `slurm/train.slurm` script that will automatically queue training jobs for you. Here's how you can use it:
```shell
-sbatch --job-name=open_r1 --nodes=1 slurm/train.slurm {model_name} {task} {config_suffix} {accelerator}
+sbatch --job-name=fashion_r1 --nodes=1 slurm/train.slurm DeepSeek-R1-Fashion sft fashion zero3
```
-Here `{model_name}` and `{task}` are defined as above, while `{config_suffix}` refers to the specific config and `{accelerator}` refers to the choice of 🤗 Accelerate config in `recipes/accelerate_configs`. If you wish to override the default config parameters, you can provide them by appending a space-separated string like `'--arg1=value1 --arg2=value2'`. Here's a concrete example to run SFT on 1 node of 8 GPUs:
+Here `DeepSeek-R1-Fashion` is the model (recipe) name, `sft` is the task, `fashion` refers to the specific config, and `zero3` refers to the choice of 🤗 Accelerate config in `recipes/accelerate_configs`. If you wish to override the default config parameters, you can provide them by appending a space-separated string like `'--arg1=value1 --arg2=value2'`. Here's a concrete example to run SFT on 1 node of 8 GPUs:
```shell
# Launch on Slurm and override default hyperparameters
-sbatch --job-name=open_r1 --nodes=1 slurm/train.slurm Qwen2.5-1.5B-Instruct sft demo zero3 '--per_device_train_batch_size=1 --num_train_epochs=5'
+sbatch --job-name=fashion_r1 --nodes=1 slurm/train.slurm DeepSeek-R1-Fashion sft fashion zero3 '--per_device_train_batch_size=1 --num_train_epochs=5'
```
You can scale the number of nodes by increasing the `--nodes` flag.
@@ -264,38 +179,26 @@ You can scale the number of nodes by increasing the `--nodes` flag.
## Evaluating models
-We use `lighteval` to evaluate models, with custom tasks defined in `src/open_r1/evaluate.py`. For models which fit on a single GPU, run:
+We use `lighteval` to evaluate our fashion model, with custom tasks defined in `src/open_r1/evaluate.py`. For models which fit on a single GPU, run:
```shell
-MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
+MODEL=username/DeepSeek-R1-Fashion
MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,max_model_length=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
OUTPUT_DIR=data/evals/$MODEL
-# AIME 2024
-TASK=aime24
-lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
- --custom-tasks src/open_r1/evaluate.py \
- --use-chat-template \
- --output-dir $OUTPUT_DIR
-
-# MATH-500
-TASK=math_500
+# Fashion style evaluation
+TASK=fashion_style
lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
--custom-tasks src/open_r1/evaluate.py \
--use-chat-template \
--output-dir $OUTPUT_DIR
-# GPQA Diamond
-TASK=gpqa:diamond
+# Fashion recommendation accuracy
+TASK=fashion_recommendation
lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
--custom-tasks src/open_r1/evaluate.py \
--use-chat-template \
--output-dir $OUTPUT_DIR
-
-# LiveCodeBench
-lighteval vllm $MODEL_ARGS "extended|lcb:codegeneration|0|0" \
- --use-chat-template \
- --output-dir $OUTPUT_DIR
```
> [!IMPORTANT]
@@ -305,9 +208,9 @@ To increase throughput across multiple GPUs, use _data parallel_ as follows:
```shell
NUM_GPUS=8
-MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
+MODEL=username/DeepSeek-R1-Fashion
MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,data_parallel_size=$NUM_GPUS,max_model_length=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
-TASK=aime24
+TASK=fashion_style
OUTPUT_DIR=data/evals/$MODEL
lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
@@ -316,177 +219,9 @@ lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
--output-dir $OUTPUT_DIR
```
-For large models which require sharding across GPUs, use _tensor parallel_ and run:
-
-```shell
-NUM_GPUS=8
-MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
-MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,tensor_parallel_size=$NUM_GPUS,max_model_length=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
-TASK=aime24
-OUTPUT_DIR=data/evals/$MODEL
-
-export VLLM_WORKER_MULTIPROC_METHOD=spawn
-lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
- --custom-tasks src/open_r1/evaluate.py \
- --use-chat-template \
- --output-dir $OUTPUT_DIR
-```
-
-You can also launch an evaluation with `make evaluate`, specifying the model, task, and optionally the parallelism technique and number of GPUs.
-
-To evaluate on a single GPU:
-
-```shell
-make evaluate MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-32B TASK=aime24
-```
-
-To use Data Parallelism:
-
-```shell
-make evaluate MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-32B TASK=aime24 PARALLEL=data NUM_GPUS=8
-```
-
-To use Tensor Parallelism:
-
-```shell
-make evaluate MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-32B TASK=aime24 PARALLEL=tensor NUM_GPUS=8
-```
-
-## Reproducing Deepseek's evaluation results
-
-> [!NOTE]
-> The DeepSeek-R1 paper uses sampling with 64 responses per query to estimate `pass@1`. Below, we report the results from sampling 1 response per query, which likely explains the small 1-3σ discrepancies between our results and theirs.
-
-### AIME 2024
-
-We are able to reproduce Deepseek's reported results on the AIME 2024 benchmark within ~1-3 standard deviations:
-
-| Model | AIME 2024 (🤗 LightEval) | AIME 2024 (DeepSeek Reported) |
-|:------------------------------|:-----------------------:|:----------------------------:|
-| DeepSeek-R1-Distill-Qwen-1.5B | 26.7 | 28.9 |
-| DeepSeek-R1-Distill-Qwen-7B | 56.6 | 55.5 |
-| DeepSeek-R1-Distill-Qwen-14B | 60.0 | 69.7 |
-| DeepSeek-R1-Distill-Qwen-32B | 73.2 | 72.6 |
-| DeepSeek-R1-Distill-Llama-8B | 43.3 | 50.4 |
-| DeepSeek-R1-Distill-Llama-70B | 73.3 | 70.0 |
-
-To reproduce these results use the following command:
-
-```shell
-NUM_GPUS=1 # Set to 8 for 32B and 70B models
-MODEL=deepseek-ai/{model_name}
-MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,max_model_length=32768,gpu_memory_utilization=0.8,data_parallel_size=$NUM_GPUS,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
-OUTPUT_DIR=data/evals/$MODEL
-
-lighteval vllm $MODEL_ARGS "custom|aime24|0|0" \
- --custom-tasks src/open_r1/evaluate.py \
- --use-chat-template \
- --output-dir $OUTPUT_DIR
-```
-
-Alternatively, you can launch Slurm jobs as follows:
-
-```shell
-python scripts/run_benchmarks.py --model-id {model_id} --benchmarks aime24
-```
-
-### MATH-500
-
-We are able to reproduce Deepseek's reported results on the MATH-500 benchmark within ~1-3 standard deviations:
-
-| Model | MATH-500 (🤗 LightEval) | MATH-500 (DeepSeek Reported) |
-|:------------------------------|:-----------------------:|:----------------------------:|
-| DeepSeek-R1-Distill-Qwen-1.5B | 84.6 | 83.9 |
-| DeepSeek-R1-Distill-Qwen-7B | 93.0 | 92.8 |
-| DeepSeek-R1-Distill-Qwen-14B | 95.0 | 93.9 |
-| DeepSeek-R1-Distill-Qwen-32B | 96.6 | 94.3 |
-| DeepSeek-R1-Distill-Llama-8B | 88.6 | 89.1 |
-| DeepSeek-R1-Distill-Llama-70B | 96.4 | 94.5 |
-
-To reproduce these results use the following command:
-
-```shell
-NUM_GPUS=1 # Set to 8 for 32B and 70B models
-MODEL=deepseek-ai/{model_name}
-MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,max_model_length=32768,gpu_memory_utilization=0.8,data_parallel_size=$NUM_GPUS,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
-OUTPUT_DIR=data/evals/$MODEL
-
-lighteval vllm $MODEL_ARGS "custom|math_500|0|0" \
- --custom-tasks src/open_r1/evaluate.py \
- --use-chat-template \
- --output-dir $OUTPUT_DIR
-```
-
-Alternatively, you can launch Slurm jobs as follows:
-
-```shell
-python scripts/run_benchmarks.py --model-id {model_id} --benchmarks math_500
-```
-
-### GPQA Diamond
-
-We are able to reproduce Deepseek's reported results on the GPQA Diamond benchmark within ~1-3 standard deviations:
-
-| Model | GPQA Diamond (🤗 LightEval) | GPQA Diamond (DeepSeek Reported) |
-|:------------------------------|:---------------------------:|:--------------------------------:|
-| DeepSeek-R1-Distill-Qwen-1.5B | 34.3 | 33.8 |
-| DeepSeek-R1-Distill-Qwen-7B | 50.5 | 49.1 |
-| DeepSeek-R1-Distill-Qwen-14B | 59.6 | 59.1 |
-| DeepSeek-R1-Distill-Qwen-32B | 63.6 | 62.1 |
-| DeepSeek-R1-Distill-Llama-8B | 52.0 | 49.0 |
-| DeepSeek-R1-Distill-Llama-70B | 67.2 | 65.2 |
-
-To reproduce these results use the following command:
-
-```shell
-NUM_GPUS=1 # Set to 8 for 32B and 70B models
-MODEL=deepseek-ai/{model_name}
-MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,max_model_length=32768,gpu_memory_utilization=0.8,data_parallel_size=$NUM_GPUS,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
-OUTPUT_DIR=data/evals/$MODEL
-
-lighteval vllm $MODEL_ARGS "custom|gpqa:diamond|0|0" \
- --custom-tasks src/open_r1/evaluate.py \
- --use-chat-template \
- --output-dir $OUTPUT_DIR
-```
-
-```shell
-python scripts/run_benchmarks.py --model-id {model_id} --benchmarks gpqa
-```
-
-### LiveCodeBench
-
-We are able to reproduce Deepseek's reported results on the LiveCodeBench code generation benchmark within ~1-3 standard deviations:
-
-| Model | LiveCodeBench (🤗 LightEval) | GPQA Diamond (DeepSeek Reported) |
-|:------------------------------|:----------------------------:|:--------------------------------:|
-| DeepSeek-R1-Distill-Qwen-1.5B | 16.3 | 16.9 |
-| DeepSeek-R1-Distill-Qwen-7B | 36.6 | 37.6 |
-| DeepSeek-R1-Distill-Qwen-14B | 51.5 | 53.1 |
-| DeepSeek-R1-Distill-Qwen-32B | 56.6 | 57.2 |
-| DeepSeek-R1-Distill-Llama-8B | 37.0 | 39.6 |
-| DeepSeek-R1-Distill-Llama-70B | 54.5 | 57.5 |
-
-To reproduce these results use the following command:
-
-```shell
-NUM_GPUS=1 # Set to 8 for 32B and 70B models, or data_parallel_size=8 with the smaller models for speed
-MODEL=deepseek-ai/{model_name}
-MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,max_model_length=32768,gpu_memory_utilization=0.8,data_parallel_size=$NUM_GPUS,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
-OUTPUT_DIR=data/evals/$MODEL
-
-lighteval vllm $MODEL_ARGS "extended|lcb:codegeneration|0|0" \
- --use-chat-template \
- --output-dir $OUTPUT_DIR
-```
-
-```shell
-python scripts/run_benchmarks.py --model-id {model_id} --benchmarks lcb
-```
-
## Data generation
-### Generate data from a smol distilled R1 model
+### Generate fashion data
The following example can be run on a single H100.
First install the following dependencies:
@@ -495,7 +230,7 @@ First install the following dependencies:
uv pip install "distilabel[vllm]>=1.5.2"
```
-Now save the following snippet into a file named `pipeline.py` and run it with `python pipeline.py`. It will generate 4 outputs for each of the 10 examples (change the username for the repository to your org/user name):
+Now save the following snippet into a file named `pipeline.py` and run it with `python pipeline.py`. It will generate 4 style responses for each of the 10 example fashion queries (change the repository username in `push_to_hub` to your org/user name):
```python
from datasets import load_dataset
@@ -505,16 +240,16 @@ from distilabel.steps.tasks import TextGeneration
prompt_template = """\
-You will be given a problem. Please reason step by step, and put your final answer within \boxed{}:
+You are a fashion expert. Please provide detailed style advice for the following scenario:
{{ instruction }}"""
-dataset = load_dataset("AI-MO/NuminaMath-TIR", split="train").select(range(10))
+dataset = load_dataset("fashion-queries-dataset", split="train").select(range(10))
-model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B" # Exchange with another smol distilled r1
+model_id = "deepseek-ai/DeepSeek-R1"
with Pipeline(
- name="distill-qwen-7b-r1",
- description="A pipeline to generate data from a distilled r1 model",
+ name="fashion-r1",
+ description="A pipeline to generate fashion recommendations",
) as pipeline:
llm = vLLM(
@@ -529,7 +264,7 @@ with Pipeline(
"max_new_tokens": 8192,
},
)
- prompt_column = "problem"
+ prompt_column = "query"
text_generation = TextGeneration(
llm=llm,
template=prompt_template,
@@ -540,37 +275,9 @@ with Pipeline(
if __name__ == "__main__":
distiset = pipeline.run(dataset=dataset)
- distiset.push_to_hub(repo_id="username/numina-deepseek-r1-qwen-7b")
-```
-
-Take a look at the sample dataset at [HuggingFaceH4/numina-deepseek-r1-qwen-7b](https://huggingface.co/datasets/HuggingFaceH4/numina-deepseek-r1-qwen-7b).
-
-
-### Generate data from DeepSeek-R1
-
-To run the bigger DeepSeek-R1, we used 2 nodes, each with 8×H100 GPUs using the slurm file present in this repo at `slurm/generate.slurm`. First, install the dependencies:
-
-(for now we need to install the vllm dev wheel that [fixes the R1 cuda graph capture](https://github.com/vllm-project/vllm/commits/221d388cc5a836fa189305785ed7e887cea8b510/csrc/moe/moe_align_sum_kernels.cu))
-```shell
-pip install https://wheels.vllm.ai/221d388cc5a836fa189305785ed7e887cea8b510/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu121
-
-uv pip install "distilabel[vllm,ray,openai]>=1.5.2"
+ distiset.push_to_hub(repo_id="username/fashion-deepseek-r1")
```
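+
+DeepSeek-R1-style models usually wrap their chain of thought in `<think>...</think>` tags, so the generated text (typically the `generation` column in distilabel's TextGeneration output) will contain both a reasoning block and the final styling advice. If you want answer-only SFT data, or separate reasoning/answer columns, one option is to split on those tags. A minimal sketch on a single illustrative string, assuming the generations follow that format:
+
+```python
+import re
+
+# Illustrative R1-style output; real text would come from the pipeline's generation column.
+generation = (
+    "<think>The user asked about a summer wedding, so prioritize breathable "
+    "fabrics and a semi-formal silhouette.</think>"
+    "A light linen suit in beige with loafers works well for a summer wedding."
+)
+
+# Separate the reasoning block from the final answer so either can be kept
+# when building a fashion SFT dataset.
+match = re.search(r"<think>(.*?)</think>", generation, flags=re.DOTALL)
+reasoning = match.group(1).strip() if match else ""
+answer = re.sub(r"<think>.*?</think>", "", generation, flags=re.DOTALL).strip()
+
+print(reasoning)
+print(answer)
+```
+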
-And then run the following command:
-
-```shell
-sbatch slurm/generate.slurm \
- --hf-dataset AI-MO/NuminaMath-TIR \
- --temperature 0.6 \
- --prompt-column problem \
- --model deepseek-ai/DeepSeek-R1 \
- --hf-output-dataset username/r1-dataset
-```
-
-> [!NOTE]
-> While the job is running, you can setup an SSH tunnel through the cluster login node to access the Ray dashboard from your computer running `ssh -L 8265:ray_ip_head_node:8265 `, then browsing `http://localhost:8265`
-
## Contributing
-Contributions are welcome. Please refer to https://github.com/huggingface/open-r1/issues/23.
+Contributions are welcome. Please refer to the issues section for current tasks and priorities.
diff --git a/README.md.backup b/README.md.backup
new file mode 100644
index 000000000..f778815b8
--- /dev/null
+++ b/README.md.backup
@@ -0,0 +1,576 @@
+# Open R1
+
+*A fully open reproduction of DeepSeek-R1. This repo is a work in progress, let's build it together!*
+
+**Table of Contents**
+1. [Overview](#overview)
+2. [Plan of attack](#plan-of-attack)
+3. [Installation](#installation)
+4. [Training models](#training-models)
+ - [SFT](#sft)
+ - [GRPO](#grpo)
+5. [Evaluating models](#evaluating-models)
+6. [Reproducing Deepseek's evaluation results](#reproducing-deepseeks-evaluation-results)
+7. [Data generation](#data-generation)
+ - [Generate data from a smol distilled R1 model](#generate-data-from-a-smol-distilled-r1-model)
+ - [Generate data from DeepSeek-R1](#generate-data-from-deepseek-r1)
+8. [Contributing](#contributing)
+
+## Overview
+
+The goal of this repo is to build the missing pieces of the R1 pipeline such that everybody can reproduce and build on top of it. The project is simple by design and mostly consists of:
+
+
+- `src/open_r1`: contains the scripts to train and evaluate models as well as generate synthetic data:
+ - `grpo.py`: trains a model with GRPO on a given dataset.
+ - `sft.py`: performs a simple SFT of a model on a dataset.
+ - `evaluate.py`: evaluates a model on the R1 benchmarks.
+ - `generate.py`: generates synthetic data from a model using [Distilabel](https://github.com/argilla-io/distilabel).
+- `Makefile`: contains easy-to-run commands for each step in the R1 pipeline leveraging the scripts above.
+
+### Plan of attack
+
+We will use the DeepSeek-R1 [tech report](https://github.com/deepseek-ai/DeepSeek-R1) as a guide, which can roughly be broken down into three main steps:
+
+* Step 1: replicate the R1-Distill models by distilling a high-quality corpus from DeepSeek-R1.
+* Step 2: replicate the pure RL pipeline that DeepSeek used to create R1-Zero. This will likely involve curating new, large-scale datasets for math, reasoning, and code.
+* Step 3: show we can go from base model to RL-tuned via multi-stage training.
+
+
+
+
+
+
+## Installation
+
+> [!CAUTION]
+> Libraries rely on CUDA 12.4. If you see errors related to segmentation faults, double check the version your system is running with `nvcc --version`.
+
+To run the code in this project, first, create a Python virtual environment using e.g. `uv`.
+To install `uv`, follow the [UV Installation Guide](https://docs.astral.sh/uv/getting-started/installation/).
+
+
+```shell
+uv venv openr1 --python 3.11 && source openr1/bin/activate && uv pip install --upgrade pip
+```
+
+> [!TIP]
+> For Hugging Face cluster users, add `export UV_LINK_MODE=copy` to your `.bashrc` to suppress cache warnings from `uv`
+
+Next, install vLLM and FlashAttention:
+
+```shell
+uv pip install vllm==0.7.2
+uv pip install setuptools && uv pip install flash-attn --no-build-isolation
+```
+
+This will also install PyTorch `v2.5.1` and it is **very important** to use this version since the vLLM binaries are compiled for it. You can then install the remaining dependencies for your specific use case via `pip install -e .[LIST OF MODES]`. For most contributors, we recommend:
+
+```shell
+GIT_LFS_SKIP_SMUDGE=1 uv pip install -e ".[dev]"
+```
+
+Next, log into your Hugging Face and Weights and Biases accounts as follows:
+
+```shell
+huggingface-cli login
+wandb login
+```
+
+Finally, check whether your system has Git LFS installed so that you can load and push models/datasets to the Hugging Face Hub:
+
+```shell
+git-lfs --version
+```
+
+If it isn't installed, run:
+
+```shell
+sudo apt-get install git-lfs
+```
+
+## Training models
+
+We support training models with either DDP or DeepSpeed (ZeRO-2 and ZeRO-3). For example, to run SFT on a dataset distilled from DeepSeek-R1 with reasoning traces such as [open-r1/OpenR1-Math-220k](https://huggingface.co/datasets/open-r1/OpenR1-Math-220k), run:
+
+```shell
+# Train via command line
+accelerate launch --config_file=recipes/accelerate_configs/zero3.yaml src/open_r1/sft.py \
+ --model_name_or_path Qwen/Qwen2.5-1.5B-Instruct \
+ --dataset_name open-r1/OpenR1-Math-220k \
+ --learning_rate 1.0e-5 \
+ --num_train_epochs 1 \
+ --packing \
+ --max_seq_length 16384 \
+ --per_device_train_batch_size 16 \
+ --gradient_checkpointing \
+ --bf16 \
+ --output_dir data/Qwen2.5-1.5B-Open-R1-Distill
+
+# Train via YAML config
+accelerate launch --config_file recipes/accelerate_configs/zero3.yaml src/open_r1/sft.py \
+ --config recipes/Qwen2.5-1.5B-Instruct/sft/config_demo.yaml
+```
+
+Currently, the following tasks are supported:
+
+* Supervised Fine-Tuning `sft`
+* Group Relative Policy Optimization `grpo`
+
+> [!TIP]
+> If you scale up/down the number of GPUs, we recommend also scaling up the per-device batch size or number of gradient accumulation steps to keep the global batch size constant.
+
+By default, these scripts will push each model to your Hugging Face Hub username, i.e. `{username}/{model_name}-{task}`. You can override the parameters in each YAML config by appending them to the command as follows:
+
+```shell
+# Change batch size, number of epochs etc
+accelerate launch --config_file recipes/accelerate_configs/zero3.yaml src/open_r1/sft.py \
+ --config recipes/Qwen2.5-1.5B-Instruct/sft/config_demo.yaml
+ --per_device_train_batch_size=1 --num_train_epochs=5
+```
+
+If you also wish to override the Weights and Biases default settings, you can do so as follows:
+
+```shell
+accelerate launch --config_file recipes/accelerate_configs/zero3.yaml src/open_r1/sft.py \
+ --config recipes/Qwen2.5-1.5B-Instruct/sft/config_demo.yaml
+ --wandb_entity huggingface --wandb_project open-r1 --run_name Qwen2.5-1.5B-GRPO
+```
+
+> [!NOTE]
+> The training commands below are configured for a node of 8 x H100s (80GB). For different hardware and topologies, you may need to tune the batch size and number of gradient accumulation steps.
+
+### SFT
+
+To run SFT on a dataset distilled from DeepSeek-R1 with reasoning traces such as [open-r1/OpenR1-Math-220k](https://huggingface.co/datasets/open-r1/OpenR1-Math-220k), run:
+
+```shell
+ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/zero3.yaml \
+ src/open_r1/sft.py \
+ --config recipes/Qwen2.5-1.5B-Instruct/sft/config_demo.yaml
+```
+
+### GRPO
+
+To train via the GRPO trainer, we use one GPU to run vLLM for faster generation and the remaining GPUs for training. For example, on a node with 8 GPUs, set `--num_processes` to override the default value in the `accelerate` configs:
+
+```shell
+ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/zero2.yaml \
+ --num_processes=7 src/open_r1/grpo.py \
+ --config recipes/DeepSeek-R1-Distill-Qwen-1.5B/grpo/config_demo.yaml
+```
+
+> [!WARNING]
+> The chat template used in the distilled DeepSeek models omits the contents of the reasoning block within the `<think>` and `</think>` tags. It also prefills the assistant response with `<think>`, which interferes with the format reward function. To handle that, it is important to override the chat template as done in e.g. [recipes/DeepSeek-R1-Distill-Qwen-1.5B/grpo/config_demo.yaml](./recipes/DeepSeek-R1-Distill-Qwen-1.5B/grpo/config_demo.yaml).
+
+
+We provide a minimal reproducible experiment using GRPO for mathematical reasoning, referencing the approach from [SimpleRL-Reason](https://hkust-nlp.notion.site/simplerl-reason) which uses a 7B model trained on 8K examples. Running this on 8 H100 80GB GPUs takes about 3 hours:
+
+```shell
+ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/zero2.yaml \
+ --num_processes=7 src/open_r1/grpo.py \
+ --config recipes/Qwen2.5-Math-7B/grpo/config_simple_rl.yaml
+```
+
+Our final [model](https://huggingface.co/Dongwei/Qwen-2.5-7B_Base_Math_smalllr), while using different learning rates, loss functions and reward structures, achieves 69.4% accuracy on MATH-500, demonstrating a 17%+ improvement over the base model.
+
+#### 👨💻 Training with a code interpreter
+
+We provide a `code` reward function for executing code generated by the policy during training. Currently, this reward function targets code contests like [Codeforces](https://codeforces.com), where solutions are executed against a set of test cases and the overall success rate is returned as the final reward. To ensure safe execution, we use [E2B](https://e2b.dev) sandboxes, which are fast and cheap to run. To use this reward function, first install the necessary dependencies:
+
+```shell
+uv pip install -e '.[code]'
+```
+
+Then create a `.env` file and place an API token from E2B within it:
+
+```
+E2B_API_KEY="e2b_xxx"
+```
+
+Then make sure your dataset contains a `verification_info` column with the following schema (adopted from PrimeIntellect's excellent [datasets](https://huggingface.co/collections/PrimeIntellect/synthetic-1-67a2c399cfdd6c9f7fae0c37) of verifiable problems):
+
+```python
+{
+ "language": "python",
+ "test_cases": [
+ {
+ "input": "4\n4\n0001\n1000\n0011\n0111\n3\n010\n101\n0\n2\n00000\n00001\n4\n01\n001\n0001\n00001\n",
+ "output": "1\n3 \n-1\n0\n\n2\n1 2 \n",
+ "type": "stdin_stdout",
+ }
+ ],
+}
+```
+
+For example, to train a smol model on Python problems, run:
+
+```shell
+ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/zero2.yaml \
+ --num_processes=7 src/open_r1/grpo.py \
+ --config recipes/Qwen2.5-1.5B-Instruct/grpo/config_demo_code.yaml
+```
+
+#### Data decontamination
+
+Following [s1: Simple test-time scaling](https://arxiv.org/abs/2501.19393), the data can be decontaminated using the script at [scripts/decontaminate.py](./scripts/decontaminate.py), which decontaminates a dataset using 8-grams and deduplicates the data. Sample run:
+
+```shell
+python scripts/decontaminate.py \
+ --dataset "open-r1/verifiable-coding-problems-python" \
+ --problem_column problem \
+ --cleanup
+```
+
+It will decontaminate against the benchmark datasets and remove the contaminated samples afterwards. If no `--new_dataset_name` argument is provided, the same dataset name will be reused with a `_decontaminated` suffix appended. It runs against the prompt column, which for this dataset is `problem`, but a different one can be provided.
+
+Arguments for the script:
+
+```shell
+usage: decontaminate.py [-h] --dataset DATASET [--split SPLIT] [--ngram_size NGRAM_SIZE] [--problem_column PROBLEM_COLUMN] [--cleanup] [--new_dataset_name NEW_DATASET_NAME]
+
+options:
+ -h, --help show this help message and exit
+ --dataset DATASET Name of the dataset to check for contamination.
+ --split SPLIT Split to check for contamination, defaults to `train`.
+ --ngram_size NGRAM_SIZE
+ Size of n-grams to build, defaults to 8.
+ --problem_column PROBLEM_COLUMN
+ Name of the column containing the problem (prompt).
+ --cleanup Whether to remove the contaminated rows before pushing the dataset.
+ --new_dataset_name NEW_DATASET_NAME
+ New name for the dataset. If not provided, will reuse the name and add a `_decontaminated` to the name.
+```
+
+### Launching jobs on a Slurm cluster
+
+If you have access to a Slurm cluster, we provide a `slurm/train.slurm` script that will automatically queue training jobs for you. Here's how you can use it:
+
+```shell
+sbatch --job-name=open_r1 --nodes=1 slurm/train.slurm {model_name} {task} {config_suffix} {accelerator}
+```
+
+Here `{model_name}` and `{task}` are defined as above, while `{config_suffix}` refers to the specific config and `{accelerator}` refers to the choice of 🤗 Accelerate config in `recipes/accelerate_configs`. If you wish to override the default config parameters, you can provide them by appending a space-separated string like `'--arg1=value1 --arg2=value2'`. Here's a concrete example to run SFT on 1 node of 8 GPUs:
+
+```shell
+# Launch on Slurm and override default hyperparameters
+sbatch --job-name=open_r1 --nodes=1 slurm/train.slurm Qwen2.5-1.5B-Instruct sft demo zero3 '--per_device_train_batch_size=1 --num_train_epochs=5'
+```
+
+You can scale the number of nodes by increasing the `--nodes` flag.
+
+> [!NOTE]
+> The configuration in `slurm/train.slurm` is optimised for the Hugging Face Compute Cluster and may require tweaking to be adapted to your own compute nodes.
+
+## Evaluating models
+
+We use `lighteval` to evaluate models, with custom tasks defined in `src/open_r1/evaluate.py`. For models which fit on a single GPU, run:
+
+```shell
+MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
+MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,max_model_length=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
+OUTPUT_DIR=data/evals/$MODEL
+
+# AIME 2024
+TASK=aime24
+lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
+ --custom-tasks src/open_r1/evaluate.py \
+ --use-chat-template \
+ --output-dir $OUTPUT_DIR
+
+# MATH-500
+TASK=math_500
+lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
+ --custom-tasks src/open_r1/evaluate.py \
+ --use-chat-template \
+ --output-dir $OUTPUT_DIR
+
+# GPQA Diamond
+TASK=gpqa:diamond
+lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
+ --custom-tasks src/open_r1/evaluate.py \
+ --use-chat-template \
+ --output-dir $OUTPUT_DIR
+
+# LiveCodeBench
+lighteval vllm $MODEL_ARGS "extended|lcb:codegeneration|0|0" \
+ --use-chat-template \
+ --output-dir $OUTPUT_DIR
+```
+
+> [!IMPORTANT]
+> You must set `max_model_length=32768` in the `vllm` command to align with the `max_new_tokens` we define per eval. Without this, `lighteval` will throw an error.
+
+To increase throughput across multiple GPUs, use _data parallel_ as follows:
+
+```shell
+NUM_GPUS=8
+MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
+MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,data_parallel_size=$NUM_GPUS,max_model_length=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
+TASK=aime24
+OUTPUT_DIR=data/evals/$MODEL
+
+lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
+ --custom-tasks src/open_r1/evaluate.py \
+ --use-chat-template \
+ --output-dir $OUTPUT_DIR
+```
+
+For large models which require sharding across GPUs, use _tensor parallel_ and run:
+
+```shell
+NUM_GPUS=8
+MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
+MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,tensor_parallel_size=$NUM_GPUS,max_model_length=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
+TASK=aime24
+OUTPUT_DIR=data/evals/$MODEL
+
+export VLLM_WORKER_MULTIPROC_METHOD=spawn
+lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
+ --custom-tasks src/open_r1/evaluate.py \
+ --use-chat-template \
+ --output-dir $OUTPUT_DIR
+```
+
+You can also launch an evaluation with `make evaluate`, specifying the model, task, and optionally the parallelism technique and number of GPUs.
+
+To evaluate on a single GPU:
+
+```shell
+make evaluate MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-32B TASK=aime24
+```
+
+To use Data Parallelism:
+
+```shell
+make evaluate MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-32B TASK=aime24 PARALLEL=data NUM_GPUS=8
+```
+
+To use Tensor Parallelism:
+
+```shell
+make evaluate MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-32B TASK=aime24 PARALLEL=tensor NUM_GPUS=8
+```
+
+## Reproducing Deepseek's evaluation results
+
+> [!NOTE]
+> The DeepSeek-R1 paper uses sampling with 64 responses per query to estimate `pass@1`. Below, we report the results from sampling 1 response per query, which likely explains the small 1-3σ discrepancies between our results and theirs.
+
+### AIME 2024
+
+We are able to reproduce Deepseek's reported results on the AIME 2024 benchmark within ~1-3 standard deviations:
+
+| Model | AIME 2024 (🤗 LightEval) | AIME 2024 (DeepSeek Reported) |
+|:------------------------------|:-----------------------:|:----------------------------:|
+| DeepSeek-R1-Distill-Qwen-1.5B | 26.7 | 28.9 |
+| DeepSeek-R1-Distill-Qwen-7B | 56.6 | 55.5 |
+| DeepSeek-R1-Distill-Qwen-14B | 60.0 | 69.7 |
+| DeepSeek-R1-Distill-Qwen-32B | 73.2 | 72.6 |
+| DeepSeek-R1-Distill-Llama-8B | 43.3 | 50.4 |
+| DeepSeek-R1-Distill-Llama-70B | 73.3 | 70.0 |
+
+To reproduce these results use the following command:
+
+```shell
+NUM_GPUS=1 # Set to 8 for 32B and 70B models
+MODEL=deepseek-ai/{model_name}
+MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,max_model_length=32768,gpu_memory_utilization=0.8,data_parallel_size=$NUM_GPUS,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
+OUTPUT_DIR=data/evals/$MODEL
+
+lighteval vllm $MODEL_ARGS "custom|aime24|0|0" \
+ --custom-tasks src/open_r1/evaluate.py \
+ --use-chat-template \
+ --output-dir $OUTPUT_DIR
+```
+
+Alternatively, you can launch Slurm jobs as follows:
+
+```shell
+python scripts/run_benchmarks.py --model-id {model_id} --benchmarks aime24
+```
+
+### MATH-500
+
+We are able to reproduce Deepseek's reported results on the MATH-500 benchmark within ~1-3 standard deviations:
+
+| Model | MATH-500 (🤗 LightEval) | MATH-500 (DeepSeek Reported) |
+|:------------------------------|:-----------------------:|:----------------------------:|
+| DeepSeek-R1-Distill-Qwen-1.5B | 84.6 | 83.9 |
+| DeepSeek-R1-Distill-Qwen-7B | 93.0 | 92.8 |
+| DeepSeek-R1-Distill-Qwen-14B | 95.0 | 93.9 |
+| DeepSeek-R1-Distill-Qwen-32B | 96.6 | 94.3 |
+| DeepSeek-R1-Distill-Llama-8B | 88.6 | 89.1 |
+| DeepSeek-R1-Distill-Llama-70B | 96.4 | 94.5 |
+
+To reproduce these results use the following command:
+
+```shell
+NUM_GPUS=1 # Set to 8 for 32B and 70B models
+MODEL=deepseek-ai/{model_name}
+MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,max_model_length=32768,gpu_memory_utilization=0.8,data_parallel_size=$NUM_GPUS,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
+OUTPUT_DIR=data/evals/$MODEL
+
+lighteval vllm $MODEL_ARGS "custom|math_500|0|0" \
+ --custom-tasks src/open_r1/evaluate.py \
+ --use-chat-template \
+ --output-dir $OUTPUT_DIR
+```
+
+Alternatively, you can launch Slurm jobs as follows:
+
+```shell
+python scripts/run_benchmarks.py --model-id {model_id} --benchmarks math_500
+```
+
+### GPQA Diamond
+
+We are able to reproduce Deepseek's reported results on the GPQA Diamond benchmark within ~1-3 standard deviations:
+
+| Model | GPQA Diamond (🤗 LightEval) | GPQA Diamond (DeepSeek Reported) |
+|:------------------------------|:---------------------------:|:--------------------------------:|
+| DeepSeek-R1-Distill-Qwen-1.5B | 34.3 | 33.8 |
+| DeepSeek-R1-Distill-Qwen-7B | 50.5 | 49.1 |
+| DeepSeek-R1-Distill-Qwen-14B | 59.6 | 59.1 |
+| DeepSeek-R1-Distill-Qwen-32B | 63.6 | 62.1 |
+| DeepSeek-R1-Distill-Llama-8B | 52.0 | 49.0 |
+| DeepSeek-R1-Distill-Llama-70B | 67.2 | 65.2 |
+
+To reproduce these results use the following command:
+
+```shell
+NUM_GPUS=1 # Set to 8 for 32B and 70B models
+MODEL=deepseek-ai/{model_name}
+MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,max_model_length=32768,gpu_memory_utilization=0.8,data_parallel_size=$NUM_GPUS,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
+OUTPUT_DIR=data/evals/$MODEL
+
+lighteval vllm $MODEL_ARGS "custom|gpqa:diamond|0|0" \
+ --custom-tasks src/open_r1/evaluate.py \
+ --use-chat-template \
+ --output-dir $OUTPUT_DIR
+```
+
+```shell
+python scripts/run_benchmarks.py --model-id {model_id} --benchmarks gpqa
+```
+
+### LiveCodeBench
+
+We are able to reproduce Deepseek's reported results on the LiveCodeBench code generation benchmark within ~1-3 standard deviations:
+
+| Model                         | LiveCodeBench (🤗 LightEval) | LiveCodeBench (DeepSeek Reported) |
+|:------------------------------|:----------------------------:|:--------------------------------:|
+| DeepSeek-R1-Distill-Qwen-1.5B | 16.3 | 16.9 |
+| DeepSeek-R1-Distill-Qwen-7B | 36.6 | 37.6 |
+| DeepSeek-R1-Distill-Qwen-14B | 51.5 | 53.1 |
+| DeepSeek-R1-Distill-Qwen-32B | 56.6 | 57.2 |
+| DeepSeek-R1-Distill-Llama-8B | 37.0 | 39.6 |
+| DeepSeek-R1-Distill-Llama-70B | 54.5 | 57.5 |
+
+To reproduce these results use the following command:
+
+```shell
+NUM_GPUS=1 # Set to 8 for 32B and 70B models, or data_parallel_size=8 with the smaller models for speed
+MODEL=deepseek-ai/{model_name}
+MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,max_model_length=32768,gpu_memory_utilization=0.8,data_parallel_size=$NUM_GPUS,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
+OUTPUT_DIR=data/evals/$MODEL
+
+lighteval vllm $MODEL_ARGS "extended|lcb:codegeneration|0|0" \
+ --use-chat-template \
+ --output-dir $OUTPUT_DIR
+```
+
+```shell
+python scripts/run_benchmarks.py --model-id {model_id} --benchmarks lcb
+```
+
+## Data generation
+
+### Generate data from a smol distilled R1 model
+
+The following example can be run in 1xH100.
+First install the following dependencies:
+
+```shell
+uv pip install "distilabel[vllm]>=1.5.2"
+```
+
+Now save the following snippet into a file named `pipeline.py` and run it with `python pipeline.py`. It will generate 4 outputs for each of the 10 examples (change the username for the repository to your org/user name):
+
+```python
+from datasets import load_dataset
+from distilabel.models import vLLM
+from distilabel.pipeline import Pipeline
+from distilabel.steps.tasks import TextGeneration
+
+
+prompt_template = """\
+You will be given a problem. Please reason step by step, and put your final answer within \boxed{}:
+{{ instruction }}"""
+
+dataset = load_dataset("AI-MO/NuminaMath-TIR", split="train").select(range(10))
+
+model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B" # Exchange with another smol distilled r1
+
+with Pipeline(
+ name="distill-qwen-7b-r1",
+ description="A pipeline to generate data from a distilled r1 model",
+) as pipeline:
+
+ llm = vLLM(
+ model=model_id,
+ tokenizer=model_id,
+ extra_kwargs={
+ "tensor_parallel_size": 1,
+ "max_model_len": 8192,
+ },
+ generation_kwargs={
+ "temperature": 0.6,
+ "max_new_tokens": 8192,
+ },
+ )
+ prompt_column = "problem"
+ text_generation = TextGeneration(
+ llm=llm,
+ template=prompt_template,
+ num_generations=4,
+ input_mappings={"instruction": prompt_column} if prompt_column is not None else {}
+ )
+
+
+if __name__ == "__main__":
+ distiset = pipeline.run(dataset=dataset)
+ distiset.push_to_hub(repo_id="username/numina-deepseek-r1-qwen-7b")
+```
+
+Take a look at the sample dataset at [HuggingFaceH4/numina-deepseek-r1-qwen-7b](https://huggingface.co/datasets/HuggingFaceH4/numina-deepseek-r1-qwen-7b).
+
+
+### Generate data from DeepSeek-R1
+
+To run the bigger DeepSeek-R1, we used 2 nodes, each with 8×H100 GPUs using the slurm file present in this repo at `slurm/generate.slurm`. First, install the dependencies:
+
+(for now we need to install the vllm dev wheel that [fixes the R1 cuda graph capture](https://github.com/vllm-project/vllm/commits/221d388cc5a836fa189305785ed7e887cea8b510/csrc/moe/moe_align_sum_kernels.cu))
+```shell
+pip install https://wheels.vllm.ai/221d388cc5a836fa189305785ed7e887cea8b510/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu121
+
+uv pip install "distilabel[vllm,ray,openai]>=1.5.2"
+```
+
+And then run the following command:
+
+```shell
+sbatch slurm/generate.slurm \
+ --hf-dataset AI-MO/NuminaMath-TIR \
+ --temperature 0.6 \
+ --prompt-column problem \
+ --model deepseek-ai/DeepSeek-R1 \
+ --hf-output-dataset username/r1-dataset
+```
+
+> [!NOTE]
+> While the job is running, you can set up an SSH tunnel through the cluster login node to access the Ray dashboard from your computer by running `ssh -L 8265:ray_ip_head_node:8265 `, then browsing `http://localhost:8265`
+
+## Contributing
+
+Contributions are welcome. Please refer to https://github.com/huggingface/open-r1/issues/23.
diff --git a/generate_fashion_dataset.py b/generate_fashion_dataset.py
new file mode 100644
index 000000000..0acecef8c
--- /dev/null
+++ b/generate_fashion_dataset.py
@@ -0,0 +1,119 @@
+#!/usr/bin/env python
+"""
+Simplified script to generate a fashion dataset for training DeepSeek-R1-Fashion model.
+"""
+
+import os
+import json
+import argparse
+from tqdm import tqdm
+import numpy as np
+from transformers import AutoTokenizer, AutoModelForCausalLM
+from datasets import Dataset
+
+def parse_args():
+ parser = argparse.ArgumentParser(description="Generate fashion dataset for DeepSeek-R1")
+ parser.add_argument("--output-path", type=str, default="data/fashion-dataset",
+ help="Path to save the generated dataset")
+ parser.add_argument("--num-samples", type=int, default=1000,
+ help="Number of samples to generate")
+ parser.add_argument("--model", type=str, default="deepseek-ai/DeepSeek-R1",
+ help="Model to use for generation")
+ return parser.parse_args()
+
+def main():
+ args = parse_args()
+
+ # Fashion-related queries
+ fashion_queries = [
+ "What's a good outfit for a summer wedding?",
+ "How do I style a basic white t-shirt?",
+ "What are the key fashion trends for Fall 2025?",
+ "Can you recommend sustainable fashion brands?",
+ "How should I dress for a job interview in tech?",
+ "What accessories go well with a little black dress?",
+ "How do I build a minimalist wardrobe?",
+ "What colors are complementary to olive skin tone?",
+ "How do I style oversized clothing without looking sloppy?",
+ "What's the difference between business casual and smart casual?",
+ # Additional queries for variety
+ "How can I dress professionally while pregnant?",
+ "What are good outfit ideas for a first date?",
+ "How do I choose the right jeans for my body type?",
+ "What should I wear to a music festival?",
+ "How do I transition my wardrobe from winter to spring?",
+ "What are must-have pieces for a capsule wardrobe?",
+ "How can I dress to look taller?",
+ "What's appropriate to wear to a funeral?",
+ "How do I care for silk clothing?",
+ "What are some 90s fashion trends making a comeback?"
+ ]
+
+ # System prompt for fashion advice
+ system_prompt = """You are a helpful AI assistant specializing in fashion advice.
+ When responding to fashion-related queries, follow these guidelines:
+ 1. Consider the occasion, body type, personal style, and practical concerns
+ 2. Provide specific recommendations with reasoning
+ 3. Include options at different price points when appropriate
+ 4. Suggest styling combinations and accessories
+ 5. Mention current trends while respecting timeless principles
+
+ Your advice should be detailed, personalized, and practical."""
+
+ print("Loading tokenizer and model...")
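+    # NOTE: the default --model (deepseek-ai/DeepSeek-R1) is extremely large; loading it
+    # with a plain from_pretrained call needs very substantial memory, and the gpt2
+    # fallback below is only a smoke-test path that will not produce fashion-quality text.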
+ try:
+ tokenizer = AutoTokenizer.from_pretrained(args.model)
+ model = AutoModelForCausalLM.from_pretrained(args.model)
+ except Exception as e:
+ print(f"Error loading model: {e}")
+ print("Using a fallback model instead...")
+ tokenizer = AutoTokenizer.from_pretrained("gpt2")
+ model = AutoModelForCausalLM.from_pretrained("gpt2")
+
+ # Create directory if it doesn't exist
+ os.makedirs(os.path.dirname(args.output_path), exist_ok=True)
+
+ # Generate responses
+ print(f"Generating {args.num_samples} fashion conversation samples...")
+ all_data = []
+ for _ in tqdm(range(args.num_samples)):
+ # Select a random query
+ query = np.random.choice(fashion_queries)
+
+ # Format the prompt
+ prompt = f"{system_prompt}\n\nUser: {query}\nAssistant:"
+
+ # Generate response
+ inputs = tokenizer(prompt, return_tensors="pt")
+ outputs = model.generate(
+ inputs.input_ids,
+ max_new_tokens=1024,
+ temperature=0.7,
+ top_p=0.9,
+ do_sample=True
+ )
+ response = tokenizer.decode(outputs[0], skip_special_tokens=True)
+
+ # Extract the assistant's response
+ try:
+ assistant_response = response.split("Assistant:")[1].strip()
+ except IndexError:
+ assistant_response = response.replace(prompt, "").strip()
+
+ # Store the data
+ data = {
+ "text": query,
+ "response": assistant_response
+ }
+ all_data.append(data)
+
+ # Save the dataset
+ with open(args.output_path, 'w') as f:
+ for item in all_data:
+ f.write(json.dumps(item) + '\n')
+
+ print(f"Dataset generation complete. Saved to {args.output_path}")
+ print(f"Generated {len(all_data)} samples")
+
+if __name__ == "__main__":
+ main()
diff --git a/recipes/DeepSeek-R1-Fashion/README.md b/recipes/DeepSeek-R1-Fashion/README.md
new file mode 100644
index 000000000..2ef9a773b
--- /dev/null
+++ b/recipes/DeepSeek-R1-Fashion/README.md
@@ -0,0 +1,109 @@
+# DeepSeek-R1-Fashion
+
+This recipe provides configuration files and instructions for training a fashion-specialized version of DeepSeek-R1. The model is fine-tuned to provide high-quality fashion advice, outfit recommendations, and style guidance.
+
+## Training Process
+
+The training process consists of two main steps:
+
+1. **Supervised Fine-Tuning (SFT)**: Fine-tune the base DeepSeek-R1 model on a fashion dataset
+2. **Group Relative Policy Optimization (GRPO)**: Further refine the model with reinforcement learning
+
+## Data Preparation
+
+Before training, you need to prepare a fashion dataset. You can use the provided script to generate synthetic fashion conversations:
+
+```bash
+python generate_fashion_dataset.py --output-path data/fashion-dataset --num-samples 10000
+```
+
+For the GRPO phase, you'll need query data:
+
+```bash
+# Create a directory for fashion queries
+mkdir -p data/fashion-queries-dataset
+
+# Example of creating a simple query dataset
+python -c "
+from datasets import Dataset
+import json
+
+queries = [
+ 'What should I wear to a summer wedding?',
+ 'How do I style a denim jacket?',
+ 'What are the current fashion trends?',
+ # Add more fashion queries here
+]
+
+ds = Dataset.from_dict({'query': queries})
+ds.to_json('data/fashion-queries-dataset/fashion_queries.jsonl')
+"
+```
+
+## Training Commands
+
+### 1. Supervised Fine-Tuning (SFT)
+
+```bash
+ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/zero3.yaml \
+ src/open_r1/sft.py \
+ --config recipes/DeepSeek-R1-Fashion/sft/config_fashion.yaml
+```
+
+### 2. GRPO Training
+
+```bash
+ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/zero2.yaml \
+ --num_processes=7 src/open_r1/grpo.py \
+ --config recipes/DeepSeek-R1-Fashion/grpo/config_fashion.yaml
+```
+
+## Evaluation
+
+After training, evaluate your fashion model using:
+
+```bash
+MODEL=your-username/DeepSeek-R1-Fashion
+MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,max_model_length=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
+OUTPUT_DIR=data/evals/$MODEL
+
+# Fashion style evaluation
+TASK=fashion_style
+lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
+ --custom-tasks src/open_r1/evaluate.py \
+ --use-chat-template \
+ --output-dir $OUTPUT_DIR
+```
+
+## Configuration Details
+
+### SFT Configuration
+
+The SFT configuration (`config_fashion.yaml`) uses the following key settings:
+
+- Base model: DeepSeek-R1
+- Learning rate: 5e-5
+- Training epochs: 1
+- Max sequence length: 16384
+- Batch size: 16
+
+### GRPO Configuration
+
+The GRPO configuration includes:
+
+- Base model: Your SFT-trained fashion model
+- Learning rate: 1e-6
+- Reward functions:
+  - accuracy: Checks factual correctness
+  - format: Ensures proper output formatting
+  - tag_count: Maintains proper usage of think/answer tags
+  - fashion_relevance: Custom reward for fashion-specific quality (a minimal sketch is shown below)
+
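+The `fashion_relevance` reward is not part of the upstream open-r1 reward set, so it has to be implemented and registered alongside the existing reward functions before the GRPO config can reference it. The sketch below is a deliberately simple keyword heuristic; it assumes the usual TRL/open-r1 convention that a reward function receives the generated `completions` and returns one float per completion, and the function name, term list, and scoring are placeholders to adapt:
+
+```python
+# Hypothetical fashion_relevance reward (placeholder heuristic, not a final design).
+FASHION_TERMS = [
+    "outfit", "style", "fabric", "silhouette", "color", "accessories",
+    "layering", "fit", "trend", "wardrobe",
+]
+
+def fashion_relevance_reward(completions, **kwargs):
+    """Reward completions that actually discuss fashion (score in [0, 1])."""
+    rewards = []
+    for completion in completions:
+        # With conversational datasets each completion is a list of chat messages;
+        # otherwise it is a plain string.
+        text = completion[0]["content"] if isinstance(completion, list) else completion
+        hits = sum(term in text.lower() for term in FASHION_TERMS)
+        rewards.append(min(1.0, hits / 5))
+    return rewards
+```
+
+In practice you would define this next to the existing reward functions and reference it, with a suitable weight, from the GRPO config.
+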
+## Customization
+
+You can customize the configurations by:
+
+1. Adjusting training parameters in the config files
+2. Modifying the system prompt to better match your fashion use case
+3. Using different reward weights in the GRPO phase
+4. Adding custom reward functions for fashion-specific evaluation
diff --git a/recipes/DeepSeek-R1-Fashion/accelerate_config.yaml b/recipes/DeepSeek-R1-Fashion/accelerate_config.yaml
new file mode 100644
index 000000000..24f12cff3
--- /dev/null
+++ b/recipes/DeepSeek-R1-Fashion/accelerate_config.yaml
@@ -0,0 +1,16 @@
+compute_environment: LOCAL_MACHINE
+debug: false
+distributed_type: NO
+downcast_bf16: 'no'
+gpu_ids: "0"
+machine_rank: 0
+main_training_function: main
+mixed_precision: bf16
+num_machines: 1
+num_processes: 1
+rdzv_backend: static
+same_network: true
+tpu_env: []
+tpu_use_cluster: false
+tpu_use_sudo: false
+use_cpu: false
diff --git a/recipes/DeepSeek-R1-Fashion/evaluate_fashion.py b/recipes/DeepSeek-R1-Fashion/evaluate_fashion.py
new file mode 100755
index 000000000..02b22b858
--- /dev/null
+++ b/recipes/DeepSeek-R1-Fashion/evaluate_fashion.py
@@ -0,0 +1,329 @@
+#!/usr/bin/env python
+"""
+Evaluation script for DeepSeek-R1-Fashion model.
+
+This script evaluates the performance of the DeepSeek-R1-Fashion model on
+various fashion-related tasks, including style recommendations, outfit compatibility,
+trend analysis, and fashion knowledge.
+"""
+
+import os
+import argparse
+import json
+from tqdm import tqdm
+from collections import defaultdict
+from dataclasses import dataclass, field, asdict
+from typing import Any, Dict
+
+import torch
+from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
+from vllm import LLM, SamplingParams
+
+@dataclass
+class FashionEvalResult:
+ """Result of evaluating a fashion model response."""
+ task: str
+ query: str
+ response: str
+ scores: Dict[str, float] = field(default_factory=dict)
+ metadata: Dict[str, Any] = field(default_factory=dict)
+
+@dataclass
+class FashionEvalMetrics:
+ """Aggregated metrics for fashion evaluation."""
+ task: str
+ avg_score: float
+ score_breakdown: Dict[str, float]
+ sample_count: int
+
+class FashionEvaluator:
+ """Evaluate fashion model performance."""
+
+ def __init__(self, model_name_or_path, use_vllm=True, device=None):
+ self.model_name = model_name_or_path
+
+ # Setup evaluation data
+ self.eval_tasks = {
+ "style_advice": self._get_style_advice_queries(),
+ "outfit_compatibility": self._get_outfit_compatibility_queries(),
+ "trend_analysis": self._get_trend_analysis_queries(),
+ "fashion_knowledge": self._get_fashion_knowledge_queries(),
+ }
+
+ # Setup model
+ if use_vllm:
+ self.vllm = True
+ self.model = LLM(model=model_name_or_path)
+ self.tokenizer = None # Not needed with vLLM
+ self.sampling_params = SamplingParams(
+ temperature=0.7,
+ top_p=0.9,
+ max_tokens=1024,
+ )
+ else:
+ self.vllm = False
+ self.device = device or ("cuda" if torch.cuda.is_available() else "cpu")
+ self.model = AutoModelForCausalLM.from_pretrained(
+ model_name_or_path,
+ torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
+ device_map=self.device
+ )
+ self.tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
+            # The model is already placed on a device via device_map, so don't
+            # pass `device` to the pipeline as well (the two arguments conflict).
+            self.pipe = pipeline(
+                "text-generation",
+                model=self.model,
+                tokenizer=self.tokenizer,
+            )
+
+ def _format_prompt(self, query):
+ """Format a prompt for the model."""
+ system_prompt = """You are a helpful AI Assistant specialized in fashion advice.
+ You provide thoughtful fashion recommendations based on user requests.
+ """
+
+ return f"{system_prompt}\n\nUser: {query}\n\nAssistant:"
+
+ def _generate_response(self, prompt):
+ """Generate a response from the model."""
+ if self.vllm:
+ outputs = self.model.generate([prompt], self.sampling_params)
+ return outputs[0].outputs[0].text
+ else:
+ outputs = self.pipe(
+ prompt,
+ max_new_tokens=1024,
+ temperature=0.7,
+ top_p=0.9,
+ do_sample=True,
+ )
+ return outputs[0]["generated_text"][len(prompt):]
+
+ def _analyze_response(self, task, query, response):
+ """Analyze a model response for a given task."""
+ # Initialize scores dictionary
+ scores = {}
+
+ # Basic relevance check
+ fashion_terms = [
+ "outfit", "style", "trend", "fashion", "clothes", "clothing",
+ "accessory", "accessories", "color", "pattern", "wear"
+ ]
+ relevance_score = sum(1 for term in fashion_terms if term.lower() in response.lower()) / len(fashion_terms)
+ scores["relevance"] = min(1.0, relevance_score * 2) # Scale up to max of 1.0
+
+ # Task-specific scoring
+ if task == "style_advice":
+ # Check for personalization and specific recommendations
+ personalization = any(term in response.lower() for term in
+ ["your body", "your style", "your occasion", "your preference"])
+ specific_items = sum(1 for term in
+ ["shirt", "pants", "dress", "skirt", "jeans", "jacket", "coat", "shoes", "blouse", "suit"]
+ if term.lower() in response.lower())
+
+ scores["personalization"] = 1.0 if personalization else 0.0
+ scores["specificity"] = min(1.0, specific_items / 5)
+
+ elif task == "outfit_compatibility":
+ # Check for explanation of why items work together
+ explanation_terms = ["complement", "match", "pair", "work with", "goes with", "coordinate"]
+ has_explanation = any(term in response.lower() for term in explanation_terms)
+
+ color_discussion = any(color in response.lower() for color in
+ ["color", "tone", "shade", "hue", "contrast", "complement"])
+
+ scores["explanation"] = 1.0 if has_explanation else 0.0
+ scores["color_awareness"] = 1.0 if color_discussion else 0.0
+
+ elif task == "trend_analysis":
+ # Check for temporal awareness and specific trends
+ temporal_terms = ["current", "season", "this year", "recent", "upcoming", "latest"]
+ has_temporal = any(term in response.lower() for term in temporal_terms)
+
+ trend_count = sum(1 for term in
+ ["trending", "popular", "runway", "designer", "collection", "fashion week"]
+ if term.lower() in response.lower())
+
+ scores["temporal_awareness"] = 1.0 if has_temporal else 0.0
+ scores["trend_specificity"] = min(1.0, trend_count / 3)
+
+ elif task == "fashion_knowledge":
+ # Check for historical references and technical terms
+ historical = any(term in response.lower() for term in
+ ["history", "traditional", "classic", "origin", "decade", "century", "era"])
+
+ technical_terms = sum(1 for term in
+ ["silhouette", "cut", "drape", "textile", "fabric", "stitch", "tailoring"]
+ if term.lower() in response.lower())
+
+ scores["historical_context"] = 1.0 if historical else 0.0
+ scores["technical_knowledge"] = min(1.0, technical_terms / 3)
+
+ # Calculate average score
+ avg_score = sum(scores.values()) / len(scores)
+ scores["average"] = avg_score
+
+ return scores
+
+ def evaluate(self, task=None, output_path=None):
+ """
+ Evaluate the model on fashion tasks.
+
+ Args:
+ task: Specific task to evaluate, or None for all tasks
+ output_path: Path to save evaluation results
+
+ Returns:
+ Dictionary of evaluation results
+ """
+ tasks = [task] if task else list(self.eval_tasks.keys())
+ all_results = []
+
+ for task in tasks:
+ print(f"Evaluating task: {task}")
+ queries = self.eval_tasks[task]
+
+ for query in tqdm(queries, desc=f"Evaluating {task}"):
+ prompt = self._format_prompt(query)
+ response = self._generate_response(prompt)
+ scores = self._analyze_response(task, query, response)
+
+ result = FashionEvalResult(
+ task=task,
+ query=query,
+ response=response,
+ scores=scores
+ )
+ all_results.append(result)
+
+ # Compute aggregated metrics
+ metrics = self._compute_metrics(all_results)
+
+ # Save results if output path is provided
+        if output_path:
+            # Only create a directory when the path actually contains one
+            # (a bare filename like "fashion_eval_results.json" has no dirname).
+            output_dir = os.path.dirname(output_path)
+            if output_dir:
+                os.makedirs(output_dir, exist_ok=True)
+            with open(output_path, 'w') as f:
+ results_dict = {
+ "results": [asdict(r) for r in all_results],
+ "metrics": [asdict(m) for m in metrics]
+ }
+ json.dump(results_dict, f, indent=2)
+
+ return metrics, all_results
+
+ def _compute_metrics(self, results):
+ """Compute aggregated metrics from evaluation results."""
+ task_results = defaultdict(list)
+ for result in results:
+ task_results[result.task].append(result)
+
+ metrics = []
+ for task, task_results_list in task_results.items():
+ # Collect all scores for this task
+ all_scores = defaultdict(list)
+ for result in task_results_list:
+ for score_name, score_value in result.scores.items():
+ all_scores[score_name].append(score_value)
+
+ # Compute average scores
+ avg_scores = {
+ score_name: sum(scores) / len(scores)
+ for score_name, scores in all_scores.items()
+ }
+
+ metrics.append(FashionEvalMetrics(
+ task=task,
+ avg_score=avg_scores["average"],
+ score_breakdown=avg_scores,
+ sample_count=len(task_results_list)
+ ))
+
+ return metrics
+
+ def _get_style_advice_queries(self):
+ """Get queries for style advice task."""
+ return [
+ "What should I wear to a summer wedding?",
+ "How can I dress professionally while staying comfortable?",
+ "I have a pear-shaped body. What styles would flatter my figure?",
+ "What's a good casual outfit for a first date?",
+ "How should I dress for a job interview in a tech company?",
+ "What are some stylish outfits for rainy weather?",
+ "How can I make a basic t-shirt and jeans look more fashionable?",
+ "What should I pack for a weekend beach trip?",
+ "How can I transition my summer wardrobe to fall?",
+ "What are good outfit options for a plus-size figure?"
+ ]
+
+ def _get_outfit_compatibility_queries(self):
+ """Get queries for outfit compatibility task."""
+ return [
+ "Do black pants go with a navy blue top?",
+ "What colors complement a burgundy dress?",
+ "What type of shoes would work with wide-leg pants?",
+ "How can I mix patterns in an outfit without clashing?",
+ "What accessories would enhance a simple white dress?",
+ "Can I wear gold and silver jewelry together?",
+ "What type of jacket would work with a floral midi skirt?",
+ "How do I coordinate colors in a three-piece outfit?",
+ "What bottom would pair well with an oversized sweater?",
+ "How can I style a statement piece without overwhelming my outfit?"
+ ]
+
+ def _get_trend_analysis_queries(self):
+ """Get queries for trend analysis task."""
+ return [
+ "What are the biggest fashion trends this season?",
+ "Are skinny jeans still in style?",
+ "What color palettes are trending for summer?",
+ "How are sustainable fashion trends evolving?",
+ "What accessories are popular right now?",
+ "What vintage styles are making a comeback?",
+ "How are gender-fluid fashion trends developing?",
+ "What are the emerging streetwear trends?",
+ "How are workplace fashion trends changing post-pandemic?",
+ "What's the forecast for next season's fashion trends?"
+ ]
+
+ def _get_fashion_knowledge_queries(self):
+ """Get queries for fashion knowledge task."""
+ return [
+ "What's the difference between haute couture and ready-to-wear?",
+ "Can you explain what a capsule wardrobe is?",
+ "What are the basic types of dress silhouettes?",
+ "What's the history of the little black dress?",
+ "How do different fabrics affect the drape of clothing?",
+ "What are the rules of color theory in fashion?",
+ "What's the difference between fashion and style?",
+ "How has men's formal wear evolved over the last century?",
+ "What makes a garment considered 'luxury'?",
+ "How do seasonal color analyses work?"
+ ]
+
+
+def main():
+ parser = argparse.ArgumentParser(description="Evaluate fashion model")
+ parser.add_argument("--model", type=str, required=True, help="Model name or path")
+ parser.add_argument("--output", type=str, default="fashion_eval_results.json", help="Output file path")
+ parser.add_argument("--task", type=str, default=None, help="Specific task to evaluate")
+ parser.add_argument("--no-vllm", action="store_true", help="Disable vLLM for generation")
+ args = parser.parse_args()
+
+ evaluator = FashionEvaluator(args.model, use_vllm=not args.no_vllm)
+ metrics, results = evaluator.evaluate(task=args.task, output_path=args.output)
+
+ # Print summary
+ print("\n=== Fashion Evaluation Results ===")
+ for metric in metrics:
+ print(f"\nTask: {metric.task}")
+ print(f"Average Score: {metric.avg_score:.2f}")
+ print("Score Breakdown:")
+ for name, score in metric.score_breakdown.items():
+ if name != "average":
+ print(f" - {name}: {score:.2f}")
+
+ print(f"\nDetailed results saved to {args.output}")
+
+
+if __name__ == "__main__":
+ main()
diff --git a/recipes/DeepSeek-R1-Fashion/fashion_reward.py b/recipes/DeepSeek-R1-Fashion/fashion_reward.py
new file mode 100644
index 000000000..2101d85fa
--- /dev/null
+++ b/recipes/DeepSeek-R1-Fashion/fashion_reward.py
@@ -0,0 +1,137 @@
+"""
+Fashion Relevance Reward Function for DeepSeek-R1-Fashion model.
+
+This module implements a custom reward function that evaluates the fashion relevance
+and quality of responses for the GRPO training phase.
+"""
+
+import re
+from typing import Any, Dict, Optional, Tuple
+from open_r1.reward.base import RewardFunction
+
+class FashionRelevanceReward(RewardFunction):
+ """
+ Reward function that evaluates fashion-specific qualities in responses:
+ 1. Fashion terminology usage
+ 2. Personalization
+ 3. Practicality of advice
+ 4. Multi-option recommendations
+ 5. Style explanation
+ """
+
+ def __init__(self):
+ super().__init__()
+ # Fashion-related terminology to look for
+ self.fashion_terms = [
+ "outfit", "style", "trend", "accessory", "accessories", "color", "pattern",
+ "fabric", "silhouette", "wardrobe", "dressy", "casual", "formal", "fit",
+ "texture", "layering", "seasonal", "classic", "contemporary", "vintage",
+ "sustainable", "tailored", "oversized", "minimalist", "statement", "aesthetic"
+ ]
+
+ def _calculate_fashion_term_score(self, text: str) -> float:
+ """Calculate score based on fashion terminology usage."""
+ text = text.lower()
+ term_count = sum(1 for term in self.fashion_terms if term in text)
+ # Normalize the score to [0, 1] range with diminishing returns
+ return min(1.0, term_count / 10)
+
+ def _calculate_personalization_score(self, text: str) -> float:
+ """Calculate score based on personalization indicators."""
+ personalization_patterns = [
+ r"body type", r"skin tone", r"personal style", r"preference",
+ r"comfort", r"occasion", r"your", r"you might", r"you could",
+ r"depending on", r"for your", r"based on your"
+ ]
+
+ count = sum(1 for pattern in personalization_patterns if re.search(pattern, text, re.IGNORECASE))
+ return min(1.0, count / 5)
+
+ def _calculate_practicality_score(self, text: str) -> float:
+ """Calculate score based on practical advice indicators."""
+ practicality_patterns = [
+ r"budget", r"affordable", r"investment", r"versatile", r"mix and match",
+ r"capsule", r"staple", r"essential", r"weather", r"season", r"occasion",
+ r"day-to-night", r"transition", r"maintenance", r"wash", r"care"
+ ]
+
+ count = sum(1 for pattern in practicality_patterns if re.search(pattern, text, re.IGNORECASE))
+ return min(1.0, count / 5)
+
+ def _calculate_options_score(self, text: str) -> float:
+ """Calculate score based on providing multiple options."""
+        # Look for numbered lists, bullet points, or option indicators.
+        # List markers are anchored to the start of a line so that ordinary
+        # hyphens and decimals in prose are not counted as "options".
+        option_patterns = [
+            r"(?m)^\s*\d+\.", r"(?m)^\s*[\*\-]\s", r"option", r"alternative", r"another",
+            r"additionally", r"other", r"second", r"third", r"first", r"instead"
+        ]
+
+ count = sum(1 for pattern in option_patterns if re.search(pattern, text, re.IGNORECASE))
+ return min(1.0, count / 5)
+
+ def _calculate_style_explanation_score(self, text: str) -> float:
+ """Calculate score based on style explanation."""
+ explanation_patterns = [
+ r"because", r"reason", r"complement", r"enhance", r"flattering",
+ r"highlight", r"balance", r"proportion", r"elongate", r"slimming",
+ r"emphasize", r"contrast", r"coordinate", r"this works", r"this helps"
+ ]
+
+ count = sum(1 for pattern in explanation_patterns if re.search(pattern, text, re.IGNORECASE))
+ return min(1.0, count / 5)
+
+ def compute_reward(
+ self,
+ completion: str,
+ prompt: Optional[str] = None,
+ prompt_metadata: Optional[Dict[str, Any]] = None,
+ completion_metadata: Optional[Dict[str, Any]] = None,
+ ) -> Tuple[float, Dict[str, Any]]:
+ """
+ Compute the fashion relevance reward score.
+
+ Args:
+ completion: The model's generated completion
+ prompt: The input prompt
+ prompt_metadata: Additional metadata about the prompt
+ completion_metadata: Additional metadata about the completion
+
+ Returns:
+ A tuple of (reward_score, metadata_dict)
+ """
+ # Initialize metadata dictionary
+ metadata = {
+ "fashion_term_score": 0.0,
+ "personalization_score": 0.0,
+ "practicality_score": 0.0,
+ "options_score": 0.0,
+ "style_explanation_score": 0.0,
+ }
+
+ # Skip if completion is empty
+ if not completion or len(completion.strip()) == 0:
+ return 0.0, metadata
+
+ # Calculate component scores
+ metadata["fashion_term_score"] = self._calculate_fashion_term_score(completion)
+ metadata["personalization_score"] = self._calculate_personalization_score(completion)
+ metadata["practicality_score"] = self._calculate_practicality_score(completion)
+ metadata["options_score"] = self._calculate_options_score(completion)
+ metadata["style_explanation_score"] = self._calculate_style_explanation_score(completion)
+
+ # Compute overall score (weighted average)
+ weights = {
+ "fashion_term_score": 1.0,
+ "personalization_score": 1.5,
+ "practicality_score": 1.0,
+ "options_score": 0.8,
+ "style_explanation_score": 1.2,
+ }
+
+        total_weight = sum(weights.values())
+        # Iterate over the weight keys so extra metadata fields can never cause a KeyError
+        overall_score = sum(metadata[k] * weights[k] for k in weights) / total_weight
+
+ # Add overall score to metadata
+ metadata["overall_score"] = overall_score
+
+ return overall_score, metadata
diff --git a/recipes/DeepSeek-R1-Fashion/generate_fashion_dataset.py b/recipes/DeepSeek-R1-Fashion/generate_fashion_dataset.py
new file mode 100644
index 000000000..f585fd24a
--- /dev/null
+++ b/recipes/DeepSeek-R1-Fashion/generate_fashion_dataset.py
@@ -0,0 +1,113 @@
+#!/usr/bin/env python
+"""
+Script to generate a fashion dataset for training DeepSeek-R1-Fashion model.
+
+This script uses Distilabel to generate fashion-related conversations and recommendations
+that can be used for fine-tuning the DeepSeek-R1 model for fashion tasks.
+"""
+
+import os
+import argparse
+from datasets import Dataset
+from distilabel.pipeline import Pipeline
+from distilabel.steps.llm import HuggingFaceLLM, VLLMStep
+from distilabel.steps.prompt import PromptTemplate
+
+def parse_args():
+ parser = argparse.ArgumentParser(description="Generate fashion dataset for DeepSeek-R1")
+ parser.add_argument("--output-path", type=str, default="data/fashion-dataset",
+ help="Path to save the generated dataset")
+ parser.add_argument("--num-samples", type=int, default=1000,
+ help="Number of samples to generate")
+ parser.add_argument("--model", type=str, default="deepseek-ai/DeepSeek-R1",
+ help="Model to use for generation")
+ return parser.parse_args()
+
+def main():
+ args = parse_args()
+
+ # Fashion-related queries
+ fashion_queries = [
+ "What's a good outfit for a summer wedding?",
+ "How do I style a basic white t-shirt?",
+ "What are the key fashion trends for Fall 2025?",
+ "Can you recommend sustainable fashion brands?",
+ "How should I dress for a job interview in tech?",
+ "What accessories go well with a little black dress?",
+ "How do I build a minimalist wardrobe?",
+ "What colors are complementary to olive skin tone?",
+ "How do I style oversized clothing without looking sloppy?",
+ "What's the difference between business casual and smart casual?",
+ # Add more fashion queries here
+ ]
+
+ # Create dataset from queries
+ query_dataset = Dataset.from_dict({"text": fashion_queries})
+
+ # System prompt for fashion advice
+ system_prompt = """You are a helpful AI assistant specializing in fashion advice.
+ When responding to fashion-related queries, follow these guidelines:
+ 1. Consider the occasion, body type, personal style, and practical concerns
+ 2. Provide specific recommendations with reasoning
+ 3. Include options at different price points when appropriate
+ 4. Suggest styling combinations and accessories
+ 5. Mention current trends while respecting timeless principles
+
+ Your advice should be detailed, personalized, and practical."""
+
+ # Create generation pipeline
+ template = PromptTemplate(
+ template=system_prompt + "\n\nUser: {{text}}\nAssistant:",
+ input_columns=["text"],
+ output_column="response"
+ )
+
+ # Use VLLM for faster generation if available
+ try:
+ generator = VLLMStep(
+ model=args.model,
+ generation_kwargs={
+ "max_new_tokens": 1024,
+ "temperature": 0.7,
+ "top_p": 0.9,
+ }
+ )
+    except Exception:
+        # Fall back to plain Hugging Face generation if vLLM is unavailable
+ generator = HuggingFaceLLM(
+ model_id=args.model,
+ generation_kwargs={
+ "max_new_tokens": 1024,
+ "temperature": 0.7,
+ "top_p": 0.9,
+ }
+ )
+
+ # Setup pipeline
+ pipeline = Pipeline(
+ steps={
+ "template": template,
+ "generator": generator,
+ },
+ connections={
+ "template": ["generator"],
+ },
+ input_dataset=query_dataset,
+ input_keys=["text"],
+ output_keys=["response"],
+ )
+
+ # Run generation
+ print(f"Generating {args.num_samples} fashion conversation samples...")
+ results = pipeline.run(
+ num_samples=args.num_samples,
+ output_path=args.output_path,
+ format="jsonl",
+ )
+
+ print(f"Dataset generation complete. Saved to {args.output_path}")
+ print(f"Generated {len(results)} samples")
+
+if __name__ == "__main__":
+ main()
diff --git a/recipes/DeepSeek-R1-Fashion/grpo/config_fashion.yaml b/recipes/DeepSeek-R1-Fashion/grpo/config_fashion.yaml
new file mode 100644
index 000000000..d5ed2a589
--- /dev/null
+++ b/recipes/DeepSeek-R1-Fashion/grpo/config_fashion.yaml
@@ -0,0 +1,61 @@
+# Model arguments
+model_name_or_path: DeepSeek-R1-Fashion
+model_revision: main
+torch_dtype: bfloat16
+attn_implementation: flash_attention_2
+
+# Data training arguments
+chat_template: "{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% set ns = namespace(is_first=false, is_tool=false, is_output_first=true, system_prompt='') %}{%- for message in messages %}{%- if message['role'] == 'system' %}{% set ns.system_prompt = message['content'] %}{%- endif %}{%- endfor %}{{bos_token}}{{ns.system_prompt}}{%- for message in messages %}{%- if message['role'] == 'user' %}{%- set ns.is_tool = false -%}{{'<|User|>' + message['content']}}{%- endif %}{%- if message['role'] == 'assistant' and message['content'] is none %}{%- set ns.is_tool = false -%}{%- for tool in message['tool_calls']%}{%- if not ns.is_first %}{{'<|Assistant|><|tool▁calls▁begin|><|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\\n' + '```json' + '\\n' + tool['function']['arguments'] + '\\n' + '```' + '<|tool▁call▁end|>'}}{%- set ns.is_first = true -%}{%- else %}{{'\\n' + '<|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\\n' + '```json' + '\\n' + tool['function']['arguments'] + '\\n' + '```' + '<|tool▁call▁end|>'}}{{'<|tool▁calls▁end|><|end▁of▁sentence|>'}}{%- endif %}{%- endfor %}{%- endif %}{%- if message['role'] == 'assistant' and message['content'] is not none %}{%- if ns.is_tool %}{{'<|tool▁outputs▁end|>' + message['content'] + '<|end▁of▁sentence|>'}}{%- set ns.is_tool = false -%}{%- else %}{% set content = message['content'] %}{{'<|Assistant|>' + content + '<|end▁of▁sentence|>'}}{%- endif %}{%- endif %}{%- if message['role'] == 'tool' %}{%- set ns.is_tool = true -%}{%- if ns.is_output_first %}{{'<|tool▁outputs▁begin|><|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- set ns.is_output_first = false %}{%- else %}{{'\\n<|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- endif %}{%- endif %}{%- endfor -%}{% if ns.is_tool %}{{'<|tool▁outputs▁end|>'}}{% endif %}{% if add_generation_prompt and not ns.is_tool %}{{'<|Assistant|>'}}{% endif %}"
+dataset_name: open-r1/fashion-queries-dataset
+dataset_configs:
+- default
+system_prompt: "You are a helpful AI Assistant specialized in fashion advice. You first think about the reasoning process as an internal monologue and then provide the user with thoughtful fashion recommendations. Respond in the following format: <think>\n...\n</think>\n<answer>\n...\n</answer>"
+
+# GRPO trainer config
+bf16: true
+use_vllm: true
+vllm_device: auto
+vllm_gpu_memory_utilization: 0.7
+do_eval: false
+gradient_accumulation_steps: 4
+gradient_checkpointing: true
+gradient_checkpointing_kwargs:
+ use_reentrant: false
+hub_model_id: DeepSeek-R1-Fashion-GRPO
+hub_strategy: every_save
+learning_rate: 1.0e-06
+log_completions: true
+log_level: info
+logging_first_step: true
+logging_steps: 1
+logging_strategy: steps
+lr_scheduler_type: cosine_with_min_lr
+lr_scheduler_kwargs:
+ min_lr_rate: 0.1
+max_prompt_length: 512
+max_completion_length: 2048
+max_steps: -1
+num_generations: 16
+num_train_epochs: 1
+output_dir: data/DeepSeek-R1-Fashion-GRPO
+overwrite_output_dir: true
+per_device_eval_batch_size: 16
+per_device_train_batch_size: 16
+push_to_hub: true
+report_to:
+- wandb
+reward_funcs:
+- accuracy
+- format
+- tag_count
+- fashion_relevance
+reward_weights:
+- 1.0
+- 1.0
+- 1.0
+- 2.0
+save_strategy: "epoch"
+save_total_limit: 1
+seed: 42
+temperature: 0.7
+warmup_ratio: 0.1
diff --git a/recipes/DeepSeek-R1-Fashion/sft/config_fashion.yaml b/recipes/DeepSeek-R1-Fashion/sft/config_fashion.yaml
new file mode 100644
index 000000000..d08ec5e73
--- /dev/null
+++ b/recipes/DeepSeek-R1-Fashion/sft/config_fashion.yaml
@@ -0,0 +1,46 @@
+# Model arguments
+model_name_or_path: deepseek-ai/DeepSeek-R1
+model_revision: main
+torch_dtype: bfloat16
+attn_implementation: flash_attention_2
+
+# Data training arguments
+dataset_name: open-r1/fashion-dataset
+dataset_configs:
+- default
+dataset_num_proc: 48
+
+# SFT trainer config
+bf16: true
+do_eval: false
+eval_strategy: 'no'
+gradient_accumulation_steps: 1
+gradient_checkpointing: true
+gradient_checkpointing_kwargs:
+ use_reentrant: false
+hub_model_id: DeepSeek-R1-Fashion
+hub_strategy: every_save
+learning_rate: 5.0e-05
+log_level: info
+logging_steps: 5
+logging_strategy: steps
+lr_scheduler_type: cosine_with_min_lr
+lr_scheduler_kwargs:
+ min_lr_rate: 0.1
+packing: true
+max_length: 16384
+max_steps: -1
+num_train_epochs: 1
+output_dir: data/DeepSeek-R1-Fashion
+overwrite_output_dir: true
+per_device_eval_batch_size: 16
+per_device_train_batch_size: 16
+push_to_hub: true
+report_to:
+- wandb
+save_strategy: "steps"
+save_steps: 100
+save_total_limit: 1
+seed: 42
+use_liger: true
+warmup_ratio: 0.05
diff --git a/train_fashion_model.py b/train_fashion_model.py
new file mode 100755
index 000000000..549ee7bf7
--- /dev/null
+++ b/train_fashion_model.py
@@ -0,0 +1,111 @@
+#!/usr/bin/env python
+
+import os
+import json
+import subprocess
+
+# Set up environment variables
+MODEL_NAME = "deepseek-ai/deepseek-llm-1.3b-base"
+OUTPUT_DIR = "data/DeepSeek-R1-Fashion-model"
+DATASET_PATH = "data/fashion-dataset/fashion_dataset.json"
+
+# Create output directory if it doesn't exist
+os.makedirs(OUTPUT_DIR, exist_ok=True)
+
+# Load the dataset
+with open(DATASET_PATH, 'r') as f:
+ raw_data = json.load(f)
+
+# Extract the relevant fields from the conversations
+train_data = []
+
+for item in raw_data:
+ # Extract the system prompt
+ system_prompt = item["conversation"]["system"]
+
+ # Combine the user and assistant messages into a single text string
+ messages = item["conversation"]["messages"]
+ conversation = []
+
+ for msg in messages:
+ if msg["role"] == "user":
+ conversation.append({"role": "user", "content": msg["content"]})
+ elif msg["role"] == "assistant":
+ conversation.append({"role": "assistant", "content": msg["content"]})
+
+ train_data.append({
+ "id": item["id"],
+ "conversations": conversation,
+ "system": system_prompt
+ })
+
+# Create or update the JSONL train file
+train_jsonl_path = 'data/fashion-dataset-train.jsonl'
+with open(train_jsonl_path, 'w') as f:
+ for item in train_data:
+ f.write(json.dumps(item) + '\n')
+
+print(f"Dataset converted and saved to '{train_jsonl_path}'")
+
+# Also write the full set of training arguments to a JSON file for reference;
+# the launch command below still passes them explicitly on the command line.
+config_path = 'data/train_config.json'
+config = {
+ "model_name_or_path": MODEL_NAME,
+ "output_dir": OUTPUT_DIR,
+ "dataset_name": "json",
+ "dataset_kwargs": {"data_files": train_jsonl_path},
+ "dataset_text_field": "conversations",
+ "learning_rate": 2.0e-5,
+ "num_train_epochs": 3,
+ "max_seq_length": 2048,
+ "per_device_train_batch_size": 1,
+ "gradient_accumulation_steps": 4,
+ "gradient_checkpointing": True,
+ "bf16": True,
+ "logging_steps": 5,
+ "eval_strategy": "steps",
+ "eval_steps": 50,
+ "save_strategy": "steps",
+ "save_steps": 100,
+ "report_to": "none"
+}
+
+with open(config_path, 'w') as f:
+ json.dump(config, f, indent=2)
+
+print(f"Training config saved to '{config_path}'")
+
+# Construct the command with simpler arguments
+cmd = [
+ "accelerate", "launch",
+ "--config_file=recipes/DeepSeek-R1-Fashion/accelerate_config.yaml",
+ "src/open_r1/sft.py",
+ "--model_name_or_path", MODEL_NAME,
+ "--dataset_name", "json",
+ "--dataset_kwargs", '{"data_files":"data/fashion-dataset-train.jsonl"}',
+ "--dataset_text_field", "conversations",
+ "--learning_rate", "2.0e-5",
+ "--num_train_epochs", "3",
+ "--max_seq_length", "2048",
+ "--per_device_train_batch_size", "1",
+ "--gradient_accumulation_steps", "4",
+ "--gradient_checkpointing",
+ "--bf16",
+ "--logging_steps", "5",
+ "--eval_strategy", "steps",
+ "--eval_steps", "50",
+ "--save_strategy", "steps",
+ "--save_steps", "100",
+ "--output_dir", OUTPUT_DIR,
+ "--report_to", "none"
+]
+
+# Print the command for debugging
+print("Running command:", " ".join(cmd))
+
+# Run the command and fail loudly if training errors out
+subprocess.run(cmd, check=True)
+
+print(f"Training completed. Model saved to {OUTPUT_DIR}")
diff --git a/train_fashion_model.sh b/train_fashion_model.sh
new file mode 100755
index 000000000..7a65da5a3
--- /dev/null
+++ b/train_fashion_model.sh
@@ -0,0 +1,30 @@
+#!/bin/bash
+
+# Set up environment variables
+export MODEL_NAME="deepseek-ai/deepseek-llm-1.3b-base"
+export OUTPUT_DIR="data/DeepSeek-R1-Fashion-model"
+export DATASET_PATH="data/fashion-dataset/fashion_dataset.json"
+
+# Create output directory if it doesn't exist
+mkdir -p $OUTPUT_DIR
+
+# Run the training
+accelerate launch --config_file=recipes/DeepSeek-R1-Fashion/accelerate_config.yaml src/open_r1/sft.py \
+ --model_name_or_path $MODEL_NAME \
+ --dataset_paths $DATASET_PATH \
+ --learning_rate 2.0e-5 \
+ --num_train_epochs 3 \
+ --max_seq_length 2048 \
+ --per_device_train_batch_size 1 \
+ --gradient_accumulation_steps 4 \
+ --gradient_checkpointing \
+ --bf16 \
+ --logging_steps 5 \
+ --eval_strategy steps \
+ --eval_steps 50 \
+ --save_strategy steps \
+ --save_steps 100 \
+ --output_dir $OUTPUT_DIR \
+ --report_to none
+
+echo "Training completed. Model saved to $OUTPUT_DIR"
diff --git a/train_fashion_model_fixed.sh b/train_fashion_model_fixed.sh
new file mode 100755
index 000000000..7a65da5a3
--- /dev/null
+++ b/train_fashion_model_fixed.sh
@@ -0,0 +1,30 @@
+#!/bin/bash
+
+# Set up environment variables
+export MODEL_NAME="deepseek-ai/deepseek-llm-1.3b-base"
+export OUTPUT_DIR="data/DeepSeek-R1-Fashion-model"
+export DATASET_PATH="data/fashion-dataset/fashion_dataset.json"
+
+# Create output directory if it doesn't exist
+mkdir -p $OUTPUT_DIR
+
+# Run the training
+accelerate launch --config_file=recipes/DeepSeek-R1-Fashion/accelerate_config.yaml src/open_r1/sft.py \
+ --model_name_or_path $MODEL_NAME \
+ --dataset_paths $DATASET_PATH \
+ --learning_rate 2.0e-5 \
+ --num_train_epochs 3 \
+ --max_seq_length 2048 \
+ --per_device_train_batch_size 1 \
+ --gradient_accumulation_steps 4 \
+ --gradient_checkpointing \
+ --bf16 \
+ --logging_steps 5 \
+ --eval_strategy steps \
+ --eval_steps 50 \
+ --save_strategy steps \
+ --save_steps 100 \
+ --output_dir $OUTPUT_DIR \
+ --report_to none
+
+echo "Training completed. Model saved to $OUTPUT_DIR"
diff --git a/train_fashion_model_fixed_path.sh b/train_fashion_model_fixed_path.sh
new file mode 100755
index 000000000..9fd9917d6
--- /dev/null
+++ b/train_fashion_model_fixed_path.sh
@@ -0,0 +1,31 @@
+#!/bin/bash
+
+# Set up environment variables
+export MODEL_NAME="deepseek-ai/deepseek-llm-1.3b-base"
+export OUTPUT_DIR="data/DeepSeek-R1-Fashion-model"
+export DATASET_PATH="data/fashion-dataset/fashion_dataset.json"
+export PYTHONPATH=$(pwd)/src:$PYTHONPATH
+
+# Create output directory if it doesn't exist
+mkdir -p $OUTPUT_DIR
+
+# Run the training
+accelerate launch --config_file=recipes/DeepSeek-R1-Fashion/accelerate_config.yaml src/open_r1/sft.py \
+ --model_name_or_path $MODEL_NAME \
+    --dataset_kwargs "{\"data_files\": \"$DATASET_PATH\"}" \
+ --learning_rate 2.0e-5 \
+ --num_train_epochs 3 \
+ --max_seq_length 2048 \
+ --per_device_train_batch_size 1 \
+ --gradient_accumulation_steps 4 \
+ --gradient_checkpointing \
+ --bf16 \
+ --logging_steps 5 \
+ --eval_strategy steps \
+ --eval_steps 50 \
+ --save_strategy steps \
+ --save_steps 100 \
+ --output_dir $OUTPUT_DIR \
+ --report_to none
+
+echo "Training completed. Model saved to $OUTPUT_DIR"