41 commits
- `e20cc7d` feat: add perplexity scripts (lucashervier, Jun 21, 2024)
- `10a39d2` ablation: update data collection script to adapt to the new domains (Agustin-Picard, Jun 21, 2024)
- `bd8586a` ablation: introduce the domain proportions in yaml format (Agustin-Picard, Jun 21, 2024)
- `2da1c99` ablation: update script to call the right python script for the domai… (Agustin-Picard, Jun 21, 2024)
- `2bd1627` ablation: modify eval function to return results dict instead of file (Agustin-Picard, Jun 24, 2024)
- `cba8dc9` ablation: turn off DP as it was causing trouble during GPU sync (Agustin-Picard, Jun 24, 2024)
- `226e131` - Add json files with raw statistics. (Jeronymous, Jun 11, 2024)
- `a693922` Add category columns to dataset (Jeronymous, Jun 11, 2024)
- `a8d6a75` Fix Pile categories (Jeronymous, Jun 11, 2024)
- `a2eee0c` Update stats (fix CroissantAligned/detailed, update Gallica and Guten… (Jeronymous, Jun 17, 2024)
- `2526f63` Discard MathPile validation set for training (Jeronymous, Jun 17, 2024)
- `f7233a9` add counts for categories, ocr, lang (jhunter19, Jun 17, 2024)
- `5d85e05` Update token statistics, and statistics about YouTube (Jeronymous, Jun 18, 2024)
- `899745a` Breakdown Pile and OtherFr (Jeronymous, Jun 18, 2024)
- `da8f62a` Update categories (jhunter19, Jun 18, 2024)
- `7f878af` feat: update for domain and language (lucashervier, Jun 28, 2024)
- `9255142` fix: minor mistakes with paths and unused variables (lucashervier, Jun 28, 2024)
- `70e50f7` feat: change the domain proportion yaml file to integrate the language (lucashervier, Jun 28, 2024)
- `b7c3031` ablation: Fix DP in PPL evaluation (Agustin-Picard, Jun 28, 2024)
- `a3be345` feat: agus add distributing arguments fr correct computation (lucashervier, Jun 28, 2024)
- `1e0ee64` fix: the domain proportion for programming language was not normalized (lucashervier, Jul 1, 2024)
- `e015a85` feat: add a slurm file for launching ablation for the 80M model (lucashervier, Jul 1, 2024)
- `06b4483` fixup: forget a reflink to home (lucashervier, Jul 1, 2024)
- `00d2c49` fix: a broken link (lucashervier, Jul 1, 2024)
- `646247f` fix: remove a print which break the script with CulturaX (lucashervier, Jul 1, 2024)
- `e74fe67` feat: automatically set the output path to the user work directory (lucashervier, Jul 8, 2024)
- `900f230` feat: add a slurm training job for 80 M parameters model (lucashervier, Jul 8, 2024)
- `c7c1f3e` fix: wrong print rank (lucashervier, Jul 8, 2024)
- `7a779d3` feat: improve ablation study experiments management (lucashervier, Jul 8, 2024)
- `49fde71` chore: ease the launch of multiple experiments (lucashervier, Jul 8, 2024)
- `a0c50e8` fix: typo mistake (lucashervier, Jul 8, 2024)
- `4bc9cdf` fix: update the splits so there is no empty datasets (lucashervier, Jul 8, 2024)
- `9a4fade` fix: write numpy array in the csv instead of torch tensor (lucashervier, Jul 8, 2024)
- `2ecd99e` feat: avoid the drop last batch as some test samples are not large en… (lucashervier, Jul 9, 2024)
- `ec35b26` feat: change the data loader (lucashervier, Jul 9, 2024)
- `1438cd8` fix: wrong wild card to move the outputs files (lucashervier, Jul 9, 2024)
- `4c9bba5` feat: add the tokenizers parallelism variable (lucashervier, Jul 9, 2024)
- `03af559` ablation: update bash scripts to point to correct folders (for me) (Agustin-Picard, Aug 19, 2024)
- `8058464` ablation: update data config file to contain a weight for web datasets (Agustin-Picard, Aug 19, 2024)
- `f904967` ablation: initialize a script to analyze results from the experiments… (Agustin-Picard, Aug 19, 2024)
- `8b5cec0` chore: clean-up a conflict from the rebase (Agustin-Picard, Aug 20, 2024)
ablation/README.md (80 additions, 0 deletions)
# Ablation study

## Data Proportions study

The idea here is to challenge the model with different proportions of data (data mixtures). Based on previous work, one can hope that experiments on a small-scale model (e.g. 80M parameters) are a good proxy for estimating the optimal data mixture of a larger-scale model (e.g. 7B parameters).

### Validate the hypothesis

The idea is to validate the hypothesis that a small-scale model can be used to estimate the optimal data mixture for a larger-scale model. Here, we propose to train a small-scale model (80M parameters) on different data mixtures and then train a larger-scale model (410M parameters) on the same data mixtures.

We then compare the perplexity of both models on a test set and check whether the ranking of the different domains is preserved between the two models when they are trained on the same data mixture.

For this first experiment, we will train both models on 30B tokens.

#### Prerequisites

For these experiments, you will need to switch the Megatron-DeepSpeed checkout to the `perplexity` branch.

Assuming that you are in the `Lucie-Training` directory, you can do the following:

```bash
cd Megatron-DeepSpeed
git checkout perplexity
cd ..
```

#### Train the models with different mixtures

To train the models, we will use the following command:

```bash
sbatch --nodes=2 --time=04:30:00 --job-name=ablation-datamix-lucie80m-config00-30ksteps ablation/ablation_training.slurm --model_config lucie80m --data_config config00
```

The following table helps determine these options depending on the model size:

| Model size | model_config | nodes | time |
|------------|--------------|-------------|------|
| 80M | lucie80m | 2 | 04:30:00 |
| 410M | lucie410m | 4 | 16:00:00 |

For the `data_config` parameter, the following options are available:

| data_config | config_path |
|-------------|--------------|
| config00 | data_config/config00.yaml |
| config01 | data_config/config01.yaml |

Finally, once all the above are decided, set the `--job-name` parameter to `ablation-datamix-<model_config>-<data_config>-30ksteps`. This naming will be helpful later when computing the perplexities.
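
For example, here is a sketch combining the values from the two tables above to launch the 410M model on the `config01` mixture:

```bash
sbatch --nodes=4 --time=16:00:00 \
    --job-name=ablation-datamix-lucie410m-config01-30ksteps \
    ablation/ablation_training.slurm --model_config lucie410m --data_config config01
```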

#### Compute the perplexities

To compute the perplexities, we will use the following command:

```bash
sbatch --nodes=2 --time=01:30:00 --job-name=perplexity-ablation-datamix-lucie80m-config00-30ksteps ablation/ablation_perplexity.slurm --model_config lucie80m --data_config config00
```

The same options as for the training should be used, since this script looks for the checkpoints in the same directory as the training script.

Finally, the `--job-name` parameter should be set to: `perplexity-ablation-datamix-<model_config>-<data_config>-30ksteps`.
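
For instance, here is a sketch that launches the perplexity jobs for both data mixtures of the 80M model (assuming only `config00` and `config01` are defined):

```bash
for cfg in config00 config01; do
    sbatch --nodes=2 --time=01:30:00 \
        --job-name=perplexity-ablation-datamix-lucie80m-${cfg}-30ksteps \
        ablation/ablation_perplexity.slurm --model_config lucie80m --data_config ${cfg}
done
```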

#### Analyze the results
**WIP**

### Find the optimal data mixture for Lucie80M

The idea here is to find the optimal data mixture for the Lucie80M model. To do so, we train the model on different data mixtures and evaluate the perplexity on a test set. We then compare the resulting perplexities to determine the optimal data mixture.

#### The data mixtures
**WIP**

#### Train the models with different mixtures
**WIP**

#### Compute the perplexities
**WIP**

#### Analyze the results
**WIP**
ablation/ablation_perplexity.slurm (194 additions, 0 deletions)
#!/bin/bash
#SBATCH --ntasks-per-node=1 # crucial - only 1 task per dist per node!
#SBATCH --cpus-per-task=64 # number of cores per tasks
#SBATCH --hint=nomultithread # we get physical cores not logical
#SBATCH --gres=gpu:8 # number of gpus per nodes
#SBATCH --constraint=a100
#SBATCH --output=./out/%x-%j.out # STDOUT
#SBATCH --error=./out/%x-%j.err
#SBATCH --account=qgz@a100
#SBATCH --qos=qos_gpu-t3
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH [email protected]

set -x -e

echo "START TIME: $(date)"

cd $HOME/Lucie-Training

# ----- load env and variables
source training/set_env.sh

# Variables that are set with command-line arguments
MODEL_CONFIG="" # can be lucie80m, lucie410m
DATA_CONFIG="" # see the files name (without the .yml ext) in ablation/data_config for options

while [[ "$#" -gt 0 ]]; do
case $1 in
--model_config) MODEL_CONFIG="$2"; shift ;;
--data_config) DATA_CONFIG="$2"; shift ;;
*) echo "Unknown parameter passed: $1"; exit 1 ;;
esac
shift
done

echo "Model Config: $MODEL_CONFIG"
echo "Data Config: $DATA_CONFIG"

# Path variables
VARIANT=ablation_${MODEL_CONFIG}_datamix_${DATA_CONFIG}
LOGS_PATH=$OUTPUT_PATH/lucie-logs/$VARIANT/$SLURM_JOB_NAME
mkdir -p $LOGS_PATH

PERPLEXITY_RESULTS_PATH=$OUTPUT_PATH/perplexity/$VARIANT
mkdir -p $PERPLEXITY_RESULTS_PATH

# ----- data
TOKENIZER_PATH=OpenLLM-France/Lucie-tokenizer-65k
TOKENS_DIR="/gpfsssd/scratch/rech/qgz/commun/preprocessed_data/Lucie/lucie_tokens_65k_grouped"

# in this case the proportions will be ignored, but the file has the expected format
DOMAIN_PROPORTIONS_PATH=Lucie-Training/ablation/data_config/config00.yml
DATASET_ARGS=" \
--domain_proportions $HOME/$DOMAIN_PROPORTIONS_PATH \
"
DATASET="$(python `pwd`/training/collect_data_and_weights_ablation.py $TOKENS_DIR $DATASET_ARGS)"
TEST_DATA_CACHE=$OUTPUT_PATH/test_data/.cache

# so processes know who to talk to
MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
MASTER_PORT=6000

GPUS_PER_NODE=8
NNODES=$SLURM_NNODES

# ----- model
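# the sourced config is expected to define the model hyper-parameters and parallelism
# settings used below (e.g. NUM_LAYERS, HIDDEN_SIZE, GLOBAL_BATCH_SIZE, TP, PP, ZERO_STAGE)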
source ablation/models_config/${MODEL_CONFIG}.sh

# ----- checkpoint path (model should be already trained)
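# MODEL_CONFIG and DATA_CONFIG should match the values used with ablation_training.slurm,
# so that the checkpoints are found under the same VARIANT directory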
CHECKPOINT_PATH=$OUTPUT_PATH/checkpoints/$VARIANT

# ----- optimizer
TRAIN_STEPS=30_000
LR=3e-4
MIN_LR=3e-5
LR_WARMUP_STEPS=2000
WEIGHT_DECAY=0.1
GRAD_CLIP=1

SAVE_INTERVAL=5_000

DS_CONFIG_PATH=$OUTPUT_PATH/ds_configs/$VARIANT/$SLURM_JOB_NAME
mkdir -p $DS_CONFIG_PATH
config_json="$DS_CONFIG_PATH/$SLURM_JOBID.json"

cat <<EOT > $config_json
{
"train_batch_size" : $GLOBAL_BATCH_SIZE,
"train_micro_batch_size_per_gpu": $MICRO_BATCH_SIZE,
"zero_optimization": {
"stage": $ZERO_STAGE
},
"bf16": {
"enabled": true
},
"gradient_clipping": 1.0,
"steps_per_print": 4000,
"wall_clock_breakdown": false
}
EOT

DEEPSPEED_ARGS=" \
--deepspeed \
--deepspeed_config ${config_json} \
--zero-stage ${ZERO_STAGE} \
"

GPT_ARGS=" \
--num-layers $NUM_LAYERS \
--hidden-size $HIDDEN_SIZE \
--ffn-hidden-size $FFN_HIDDEN_SIZE \
--num-attention-heads $NUM_HEADS \
--seq-length $SEQ_LENGTH \
--max-position-embeddings $SEQ_LENGTH \
--attention-dropout 0 \
--hidden-dropout 0 \
--use-rotary-position-embeddings \
--untie-embeddings-and-output-weights \
--swiglu \
--normalization rmsnorm \
--disable-bias-linear \
--num-key-value-heads $NUM_KV_HEADS \
--bf16 \
"

OPTIMIZER_ARGS=" \
--lr $LR \
--lr-decay-style cosine \
--min-lr $MIN_LR \
--clip-grad $GRAD_CLIP \
--lr-warmup-iters $LR_WARMUP_STEPS \
--optimizer adam \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--adam-eps 1e-5 \
--weight-decay $WEIGHT_DECAY \
"

# do not remove or the training will hang and nodes will be lost w/o this workaround
export CUDA_LAUNCH_BLOCKING=1

# hide duplicated errors using this hack - will be properly fixed in pt-1.12
export TORCHELASTIC_ERROR_FILE=/tmp/torch-elastic-error.json

DISTRIBUTED_ARGS=" \
--nproc_per_node $GPUS_PER_NODE \
--nnodes $NNODES \
--node_rank \$SLURM_PROCID \
--rdzv_endpoint $MASTER_ADDR:$MASTER_PORT \
--rdzv_backend c10d \
--max_restarts 0 \
--tee 3 \
"

EXIT_OPTS=" \
--exit-duration-in-mins 1190 \
"

export TOKENIZERS_PARALLELISM=true

# ---- compute perplexity
export RUN="torchrun $DISTRIBUTED_ARGS \
`pwd`/ablation/perplexity.py \
--tensor-model-parallel-size $TP \
--pipeline-model-parallel-size $PP \
--micro-batch-size $MICRO_BATCH_SIZE \
--global-batch-size $GLOBAL_BATCH_SIZE \
--data-path $DATASET \
--data-impl mmap \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path $TOKENIZER_PATH \
--distributed-backend nccl \
--split 0.99,0.005,0.005 \
--use-flash-attn-v2 \
--no-query-key-layer-scaling \
--load $CHECKPOINT_PATH \
--inference \
--finetune \
--seed 42 \
--skip-warmup True \
--datatest-cache-path $TEST_DATA_CACHE \
--perplexity-results-path $PERPLEXITY_RESULTS_PATH \
$GPT_ARGS \
$OPTIMIZER_ARGS \
$DEEPSPEED_ARGS \
$EXIT_OPTS \
"

clear; srun --jobid $SLURM_JOBID bash -c "$RUN" 2>&1 | tee -a $LOGS_PATH/main_log.txt

mv ./out/$SLURM_JOB_NAME-$SLURM_JOBID.out $LOGS_PATH/$SLURM_JOB_NAME-$SLURM_JOBID.out
mv ./out/$SLURM_JOB_NAME-$SLURM_JOBID.err $LOGS_PATH/$SLURM_JOB_NAME-$SLURM_JOBID.err

echo "END TIME: $(date)"