41 commits
- `e20cc7d` feat: add perplexity scripts (lucashervier, Jun 21, 2024)
- `10a39d2` ablation: update data collection script to adapt to the new domains (Agustin-Picard, Jun 21, 2024)
- `bd8586a` ablation: introduce the domain proportions in yaml format (Agustin-Picard, Jun 21, 2024)
- `2da1c99` ablation: update script to call the right python script for the domai… (Agustin-Picard, Jun 21, 2024)
- `2bd1627` ablation: modify eval function to return results dict instead of file (Agustin-Picard, Jun 24, 2024)
- `cba8dc9` ablation: turn off DP as it was causing trouble during GPU sync (Agustin-Picard, Jun 24, 2024)
- `226e131` - Add json files with raw statistics. (Jeronymous, Jun 11, 2024)
- `a693922` Add category columns to dataset (Jeronymous, Jun 11, 2024)
- `a8d6a75` Fix Pile categories (Jeronymous, Jun 11, 2024)
- `a2eee0c` Update stats (fix CroissantAligned/detailed, update Gallica and Guten… (Jeronymous, Jun 17, 2024)
- `2526f63` Discard MathPile validation set for training (Jeronymous, Jun 17, 2024)
- `f7233a9` add counts for categories, ocr, lang (jhunter19, Jun 17, 2024)
- `5d85e05` Update token statistics, and statistics about YouTube (Jeronymous, Jun 18, 2024)
- `899745a` Breakdown Pile and OtherFr (Jeronymous, Jun 18, 2024)
- `da8f62a` Update categories (jhunter19, Jun 18, 2024)
- `7f878af` feat: update for domain and language (lucashervier, Jun 28, 2024)
- `9255142` fix: minor mistakes with paths and unused variables (lucashervier, Jun 28, 2024)
- `70e50f7` feat: change the domain proportion yaml file to integrate the language (lucashervier, Jun 28, 2024)
- `b7c3031` ablation: Fix DP in PPL evaluation (Agustin-Picard, Jun 28, 2024)
- `a3be345` feat: agus add distributing arguments fr correct computation (lucashervier, Jun 28, 2024)
- `1e0ee64` fix: the domain proportion for programming language was not normalized (lucashervier, Jul 1, 2024)
- `e015a85` feat: add a slurm file for launching ablation for the 80M model (lucashervier, Jul 1, 2024)
- `06b4483` fixup: forget a reflink to home (lucashervier, Jul 1, 2024)
- `00d2c49` fix: a broken link (lucashervier, Jul 1, 2024)
- `646247f` fix: remove a print which break the script with CulturaX (lucashervier, Jul 1, 2024)
- `e74fe67` feat: automatically set the output path to the user work directory (lucashervier, Jul 8, 2024)
- `900f230` feat: add a slurm training job for 80 M parameters model (lucashervier, Jul 8, 2024)
- `c7c1f3e` fix: wrong print rank (lucashervier, Jul 8, 2024)
- `7a779d3` feat: improve ablation study experiments management (lucashervier, Jul 8, 2024)
- `49fde71` chore: ease the launch of multiple experiments (lucashervier, Jul 8, 2024)
- `a0c50e8` fix: typo mistake (lucashervier, Jul 8, 2024)
- `4bc9cdf` fix: update the splits so there is no empty datasets (lucashervier, Jul 8, 2024)
- `9a4fade` fix: write numpy array in the csv instead of torch tensor (lucashervier, Jul 8, 2024)
- `2ecd99e` feat: avoid the drop last batch as some test samples are not large en… (lucashervier, Jul 9, 2024)
- `ec35b26` feat: change the data loader (lucashervier, Jul 9, 2024)
- `1438cd8` fix: wrong wild card to move the outputs files (lucashervier, Jul 9, 2024)
- `4c9bba5` feat: add the tokenizers parallelism variable (lucashervier, Jul 9, 2024)
- `03af559` ablation: update bash scripts to point to correct folders (for me) (Agustin-Picard, Aug 19, 2024)
- `8058464` ablation: update data config file to contain a weight for web datasets (Agustin-Picard, Aug 19, 2024)
- `f904967` ablation: initialize a script to analyze results from the experiments… (Agustin-Picard, Aug 19, 2024)
- `8b5cec0` chore: clean-up a conflict from the rebase (Agustin-Picard, Aug 20, 2024)
ablation/README.md (80 additions, 0 deletions)
# Ablation study

## Data Proportions study

The idea here is to challenge the model with different proportions of data (data mixtures). Based on previous work, one can hope that experiments on a small-scale model (e.g. 80M parameters) are a good proxy for estimating the optimal data mixture of a larger-scale model (e.g. 7B parameters).

### Validate the hypothesis

The idea is to validate the hypothesis that a small-scale model can be used to estimate the optimal data mixture for a larger-scale model. Here, we propose to train a small-scale model (80M parameters) on different data mixtures and then train a larger-scale model (410M parameters) on the same data mixtures.

We then compare the perplexity of both models on a test set and check whether the ranking of the different domains is preserved between the two models when they are trained on the same data mixture.

For this first experiment, we will train both models on 30B tokens.

#### Prerequisites

For these experiments, you will need to switch the Megatron-DeepSpeed checkout to the `perplexity` branch.

Assuming that you are in the `Lucie-Training` directory, you can do the following:

```bash
cd Megatron-DeepSpeed
git checkout perplexity
cd ..
```

#### Train the models with different mixtures

To train the models, we will use the following command:

```bash
sbatch --nodes=2 --time=04:30:00 --job-name=ablation-datamix-lucie80m-config00-30ksteps ablation/ablation_training.slurm --model_config lucie80m --data_config config00
```

The following table helps determine these options depending on the model size:

| Model size | model_config | nodes | time |
|------------|--------------|-------------|------|
| 80M | lucie80m | 2 | 04:30:00 |
| 410M | lucie410m | 4 | 16:00:00 |

For the `data_config` parameter, the following options are available:

| data_config | config_path |
|-------------|--------------|
| config00 | data_config/config00.yaml |
| config01 | data_config/config01.yaml |

Finally, once all the above are decided, set the `--job-name` parameter to `ablation-datamix-<model_config>-<data_config>-30ksteps`. This naming will be helpful later when computing the perplexities.
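
For example, here is a sketch combining the values from the two tables above to launch the 410M model on the `config01` mixture:

```bash
sbatch --nodes=4 --time=16:00:00 \
    --job-name=ablation-datamix-lucie410m-config01-30ksteps \
    ablation/ablation_training.slurm --model_config lucie410m --data_config config01
```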

#### Compute the perplexities

To compute the perplexities, we will use the following command:

```bash
sbatch --nodes=2 --time=01:30:00 --job-name=perplexity-ablation-datamix-lucie80m-config00-30ksteps ablation/ablation_perplexity.slurm --model_config lucie80m --data_config config00
```

The same options as for the training should be used, since this script looks for the checkpoints in the same directory as the training script.

Finally, the `--job-name` parameter should be set to: `perplexity-ablation-datamix-<model_config>-<data_config>-30ksteps`.
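
For instance, here is a sketch that launches the perplexity jobs for both data mixtures of the 80M model (assuming only `config00` and `config01` are defined):

```bash
for cfg in config00 config01; do
    sbatch --nodes=2 --time=01:30:00 \
        --job-name=perplexity-ablation-datamix-lucie80m-${cfg}-30ksteps \
        ablation/ablation_perplexity.slurm --model_config lucie80m --data_config ${cfg}
done
```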

#### Analyze the results
**WIP**

### Find the optimal data mixture for Lucie80M

The idea here is to find the optimal data mixture for the Lucie80M model. To do so, we train the model on different data mixtures and evaluate the perplexity on a test set. We then compare the resulting perplexities to determine the optimal data mixture.

#### The data mixtures
**WIP**

#### Train the models with different mixtures
**WIP**

#### Compute the perplexities
**WIP**

#### Analyze the results
**WIP**
ablation/ablation_perplexity.slurm (194 additions, 0 deletions)
#!/bin/bash
#SBATCH --ntasks-per-node=1 # crucial - only 1 task per dist per node!
#SBATCH --cpus-per-task=64 # number of cores per tasks
#SBATCH --hint=nomultithread # we get physical cores not logical
#SBATCH --gres=gpu:8 # number of gpus per nodes
#SBATCH --constraint=a100
#SBATCH --output=./out/%x-%j.out # STDOUT
#SBATCH --error=./out/%x-%j.err
#SBATCH --account=qgz@a100
#SBATCH --qos=qos_gpu-t3
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH [email protected]

set -x -e

echo "START TIME: $(date)"

cd $HOME/Lucie-Training

# ----- load env and variables
source training/set_env.sh

# Variables that are set with command-line arguments
MODEL_CONFIG="" # can be lucie80m, lucie410m
DATA_CONFIG="" # see the files name (without the .yml ext) in ablation/data_config for options

while [[ "$#" -gt 0 ]]; do
case $1 in
--model_config) MODEL_CONFIG="$2"; shift ;;
--data_config) DATA_CONFIG="$2"; shift ;;
*) echo "Unknown parameter passed: $1"; exit 1 ;;
esac
shift
done

echo "Model Config: $MODEL_CONFIG"
echo "Data Config: $DATA_CONFIG"

# Path variables
VARIANT=ablation_${MODEL_CONFIG}_datamix_${DATA_CONFIG}
LOGS_PATH=$OUTPUT_PATH/lucie-logs/$VARIANT/$SLURM_JOB_NAME
mkdir -p $LOGS_PATH

PERPLEXITY_RESULTS_PATH=$OUTPUT_PATH/perplexity/$VARIANT
mkdir -p $PERPLEXITY_RESULTS_PATH

# ----- data
TOKENIZER_PATH=OpenLLM-France/Lucie-tokenizer-65k
TOKENS_DIR="/gpfsssd/scratch/rech/qgz/commun/preprocessed_data/Lucie/lucie_tokens_65k_grouped"

# in this case the proportions will be ignored, but the file has the expected format
DOMAIN_PROPORTIONS_PATH=Lucie-Training/ablation/data_config/config00.yml
DATASET_ARGS=" \
--domain_proportions $HOME/$DOMAIN_PROPORTIONS_PATH \
"
DATASET="$(python `pwd`/training/collect_data_and_weights_ablation.py $TOKENS_DIR $DATASET_ARGS)"
TEST_DATA_CACHE=$OUTPUT_PATH/test_data/.cache

# so processes know who to talk to
MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
MASTER_PORT=6000

GPUS_PER_NODE=8
NNODES=$SLURM_NNODES

# ----- model
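# the sourced config is expected to define the model hyper-parameters and parallelism
# settings used below (e.g. NUM_LAYERS, HIDDEN_SIZE, GLOBAL_BATCH_SIZE, TP, PP, ZERO_STAGE)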
source ablation/models_config/${MODEL_CONFIG}.sh

# ----- checkpoint path (model should be already trained)
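# MODEL_CONFIG and DATA_CONFIG should match the values used with ablation_training.slurm,
# so that the checkpoints are found under the same VARIANT directory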
CHECKPOINT_PATH=$OUTPUT_PATH/checkpoints/$VARIANT

# ----- optimizer
TRAIN_STEPS=30_000
LR=3e-4
MIN_LR=3e-5
LR_WARMUP_STEPS=2000
WEIGHT_DECAY=0.1
GRAD_CLIP=1

SAVE_INTERVAL=5_000

DS_CONFIG_PATH=$OUTPUT_PATH/ds_configs/$VARIANT/$SLURM_JOB_NAME
mkdir -p $DS_CONFIG_PATH
config_json="$DS_CONFIG_PATH/$SLURM_JOBID.json"

cat <<EOT > $config_json
{
"train_batch_size" : $GLOBAL_BATCH_SIZE,
"train_micro_batch_size_per_gpu": $MICRO_BATCH_SIZE,
"zero_optimization": {
"stage": $ZERO_STAGE
},
"bf16": {
"enabled": true
},
"gradient_clipping": 1.0,
"steps_per_print": 4000,
"wall_clock_breakdown": false
}
EOT

DEEPSPEED_ARGS=" \
--deepspeed \
--deepspeed_config ${config_json} \
--zero-stage ${ZERO_STAGE} \
"

GPT_ARGS=" \
--num-layers $NUM_LAYERS \
--hidden-size $HIDDEN_SIZE \
--ffn-hidden-size $FFN_HIDDEN_SIZE \
--num-attention-heads $NUM_HEADS \
--seq-length $SEQ_LENGTH \
--max-position-embeddings $SEQ_LENGTH \
--attention-dropout 0 \
--hidden-dropout 0 \
--use-rotary-position-embeddings \
--untie-embeddings-and-output-weights \
--swiglu \
--normalization rmsnorm \
--disable-bias-linear \
--num-key-value-heads $NUM_KV_HEADS \
--bf16 \
"

OPTIMIZER_ARGS=" \
--lr $LR \
--lr-decay-style cosine \
--min-lr $MIN_LR \
--clip-grad $GRAD_CLIP \
--lr-warmup-iters $LR_WARMUP_STEPS \
--optimizer adam \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--adam-eps 1e-5 \
--weight-decay $WEIGHT_DECAY \
"

# do not remove or the training will hang and nodes will be lost w/o this workaround
export CUDA_LAUNCH_BLOCKING=1

# hide duplicated errors using this hack - will be properly fixed in pt-1.12
export TORCHELASTIC_ERROR_FILE=/tmp/torch-elastic-error.json

DISTRIBUTED_ARGS=" \
--nproc_per_node $GPUS_PER_NODE \
--nnodes $NNODES \
--node_rank \$SLURM_PROCID \
--rdzv_endpoint $MASTER_ADDR:$MASTER_PORT \
--rdzv_backend c10d \
--max_restarts 0 \
--tee 3 \
"

EXIT_OPTS=" \
--exit-duration-in-mins 1190 \
"

export TOKENIZERS_PARALLELISM=true

# ---- compute perplexity
export RUN="torchrun $DISTRIBUTED_ARGS \
`pwd`/ablation/perplexity.py \
--tensor-model-parallel-size $TP \
--pipeline-model-parallel-size $PP \
--micro-batch-size $MICRO_BATCH_SIZE \
--global-batch-size $GLOBAL_BATCH_SIZE \
--data-path $DATASET \
--data-impl mmap \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path $TOKENIZER_PATH \
--distributed-backend nccl \
--split 0.99,0.005,0.005 \
--use-flash-attn-v2 \
--no-query-key-layer-scaling \
--load $CHECKPOINT_PATH \
--inference \
--finetune \
--seed 42 \
--skip-warmup True \
--datatest-cache-path $TEST_DATA_CACHE \
--perplexity-results-path $PERPLEXITY_RESULTS_PATH \
$GPT_ARGS \
$OPTIMIZER_ARGS \
$DEEPSPEED_ARGS \
$EXIT_OPTS \
"

clear; srun --jobid $SLURM_JOBID bash -c "$RUN" 2>&1 | tee -a $LOGS_PATH/main_log.txt

mv ./out/$SLURM_JOB_NAME-$SLURM_JOBID.out $LOGS_PATH/$SLURM_JOB_NAME-$SLURM_JOBID.out
mv ./out/$SLURM_JOB_NAME-$SLURM_JOBID.err $LOGS_PATH/$SLURM_JOB_NAME-$SLURM_JOBID.err

echo "END TIME: $(date)"