- Install PyTorch env:

```bash
conda create -n cocktail python=3.10
conda activate cocktail
conda install -c "nvidia/label/cuda-11.8.0" cuda-toolkit
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
conda install -c conda-forge cupy nccl cudatoolkit=11.8
```

Or, if managing packages with mamba:
```bash
mamba create -n cocktail python=3.10
mamba activate cocktail
mamba install -c "nvidia/label/cuda-11.8.0" cuda-toolkit
mamba install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
mamba install -c conda-forge cupy nccl cudatoolkit=11.8
```

Then install the other requirements:
```bash
pip install -r requirements.txt
```
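Before going further, it may be worth confirming that the installed PyTorch build actually sees the GPU. This quick check is an addition to the original steps, not part of them:

```bash
# Sanity check: should print "True 11.8" (CUDA available, CUDA version)
python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"
```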
As we use wandb to manage experiments, one should also configure wandb before running the code:

```bash
wandb login
```

We provide pretrained model checkpoints that are sharded by layers:
Please download and unzip the above ckpts to fine-tune them.
The path of the unzipped model should be passed to `--model-name` and `--tokenizer-name` for fine-tuning.
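For concreteness, a hypothetical example follows; the archive name and target directory below are placeholders, not the real checkpoint names:

```bash
# Hypothetical file names; substitute the checkpoint archive you actually downloaded
unzip opt-1.3b-sharded.zip -d ./pretrained_models/opt-1.3b
# Later, pass the unzipped directory to the training script:
#   --model-name ./pretrained_models/opt-1.3b --tokenizer-name ./pretrained_models/opt-1.3b
```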
Please refer to `example_scripts/finetune_opt1.3b.sh`, which shows an example of fine-tuning OPT-1.3B on mmlu-cot data.
The script launches 8 processes with a data parallel degree of 4 and a pipeline parallel degree of 2.
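As a rough illustration (not the contents of the actual script), launching 8 such workers on a single 8-GPU machine might look like the sketch below. The flag names follow the argument reference later in this README; the paths, port, and loop structure are placeholders:

```bash
# Hypothetical launch: 8 workers = 2 (pipeline parallel) x 4 (data parallel),
# all on one machine; model path and port are placeholders
for RANK in $(seq 0 7); do
  python dist_lm_train.py \
    --model-name ./pretrained_models/opt-1.3b \
    --tokenizer-name ./pretrained_models/opt-1.3b \
    --model-type opt --num-layers 12 --embedding-dim 2048 \
    --dist-url tcp://127.0.0.1:7033 \
    --world-size 8 --pipeline-group-size 2 --data-group-size 4 \
    --cuda-id ${RANK} --rank ${RANK} &
done
wait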
In case of geo-distributed training, please first make sure that the network interface is set correctly and that the master's (rank 0 worker's) IP and port are accessible by all workers. After that, run the corresponding process on each GPU node.
```bash
# set environment vars
...
# run on each GPU node
python dist_lm_train.py ... --cuda-id 0 --rank ${GLOBAL_RANK}
```

Environment variables that should be set:
```bash
export GLOO_SOCKET_IFNAME=lo   # the correct network interface
export NCCL_SOCKET_IFNAME=lo   # the correct network interface
export WANDB_NAME=opt-test     # wandb run name
export RANDOMP_RATIO=0.1       # CocktailSGD: random sparsity ratio
export TOPK_RATIO=0.2          # CocktailSGD: TopK sparsity ratio
export QUANT_BITS=4            # CocktailSGD: quantization bits
```
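To make the geo-distributed case concrete, a hypothetical two-node run (one GPU per node, pipeline depth 2, no data parallelism) could look like the following; the interface name `eth0` and the master address `203.0.113.1:7033` are placeholders for your actual setup:

```bash
# On the master node (global rank 0):
export GLOO_SOCKET_IFNAME=eth0
export NCCL_SOCKET_IFNAME=eth0
python dist_lm_train.py ... --dist-url tcp://203.0.113.1:7033 \
  --world-size 2 --pipeline-group-size 2 --data-group-size 1 \
  --net-interface eth0 --cuda-id 0 --rank 0

# On the second node (global rank 1): same exports, same command, with --rank 1
```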
The following arguments should be carefully set:

- `--model-name`: The path of the model ckpt sharded by layers.
- `--tokenizer-name`: Usually the same as `--model-name`. You can also use an HF model name.
- `--model-type`: Indicates the model type. {opt, flash_opt, gptj, gptneox}. The `flash_` prefix uses FlashAttention to accelerate training.
- `--num-layers`: Number of Transformer layers on each GPU. E.g., OPT-1.3B has 24 layers; if we use two GPUs to form a pipeline, `--num-layers` should be 12.
- `--embedding-dim`: The hidden size of the model. OPT-1.3B is 2048, GPT-J-6B is 4096, GPT-NeoX-20B is 6144. This is used to create buffers.
- `--dist-url`: URL of the rank 0 worker (master). It is the same for all workers and must be accessible by all of them. For local training (single machine, multiple GPUs), this can be like `--dist-url tcp://127.0.0.1:7033`.
- `--world-size`: The total number of workers. `world-size == pipeline-group-size * data-group-size`.
- `--pipeline-group-size`: Number of GPU workers in each pipeline.
- `--data-group-size`: Number of data parallel workers. Also the number of pipelines.
- `--net-interface`: Network interface. Should be consistent with `GLOO_SOCKET_IFNAME` and `NCCL_SOCKET_IFNAME`.
The following arguments can be tuned / changed:
- `--optimizer`: Optimizer type. {adam, 8bit-adam} (8bit-adam requires `pip install bitsandbytes`).
- `--load-pretrained-model`: Whether to load model weights. Usually `true`.
- `--task-name`: The task name or the path of a `jsonl` file. For multi-task training, separate task names with `,`. Each task name can be followed by an optional sampling weight, separated by `:` (default is 1.0). Sampling weights will be normalized. E.g. it should be like `--task-name cot:0.1,/path_task0.jsonl:1.0,/path_task0.jsonl:1.0,/path_task0.jsonl:1.0`.
- `--checkpoint-path`: Path to save fine-tuned checkpoints.
- `--checkpoint-steps`: Save a ckpt every `checkpoint-steps` steps.
- `--total-steps`: Total number of steps for training. (This counts all `gradient-accumulate-steps`.)
- `--warmup-steps`: LR warmup steps.
- `--lr`: Learning rate.
- `--seq-length`: Sequence length.
- `--batch-size`: Batch size for each GPU device (of each gradient accumulation step).
- `--micro-batch-size`: Micro batch size for pipeline parallelism. 1 works fine.
- `--gradient-accumulate-step`: Accumulate gradients for several steps before updating parameters. This is another way to achieve large batch sizes when GPU memory is not enough.
- `--dp-backend`: {gloo, nccl}.
- `--dp-mode`: {allreduce, cocktail_sgd}. `cocktail_sgd` should always be used with `--dp-backend gloo`; `allreduce` performs better with `nccl`.
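As an illustration of how these tunable flags compose, a hypothetical excerpt of a launch command follows; the values are arbitrary placeholders, not recommendations:

```bash
# Hypothetical excerpt; combine with the required arguments above
python dist_lm_train.py ... \
  --optimizer adam --load-pretrained-model true \
  --task-name cot:0.1,/path_task0.jsonl:1.0 \
  --checkpoint-path ./checkpoints/opt-1.3b-ft --checkpoint-steps 100 \
  --total-steps 1000 --warmup-steps 100 --lr 1e-5 \
  --seq-length 2048 --batch-size 8 --micro-batch-size 1 \
  --gradient-accumulate-step 4 \
  --dp-backend gloo --dp-mode cocktail_sgd
```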
The following arguments usually do not change:
- `--fp16`: Flag to enable FP16 mixed-precision training. It should always be set with the current implementation.
- `--pp-mode`: Always `gpipe`.
- `--profiling`: {no-profiling, tidy_profiling}. `tidy_profiling` will generate profiling JSON files.
```bash
pip install bitsandbytes # optional, to use 8bit-adam
```

Install FlashAttention (https://github.com/HazyResearch/flash-attention):
```bash
export CUDA_HOME=/usr/local/cuda-11.8
git clone https://github.com/HazyResearch/flash-attention.git
cd flash-attention
git checkout tags/v1.0.4
pip install .
cd ..
```
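If the build succeeded, the module should import cleanly; this quick check is an addition, not part of the original instructions:

```bash
# Should print the installed version, e.g. 1.0.4
python -c "import flash_attn; print(flash_attn.__version__)"
```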
Install other optimized kernels:

```bash
cd flash-attention/csrc/rotary
pip install .
cd ../..
```

```bash
export CUDA_HOME=/usr/local/cuda-11.8
git clone https://github.com/facebookresearch/xformers.git
cd xformers
git submodule update --init --recursive
pip install .
cd ..
```
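A similar import check (again, an addition to the original steps) can confirm the xformers build:

```bash
# Should print the installed xformers version
python -c "import xformers; print(xformers.__version__)"
```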
```bash
export CUDA_HOME=/usr/local/cuda-11.8
git clone https://github.com/NVIDIA/apex.git
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" .
cd ..
```
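After building, a quick import check can confirm that apex's CUDA extensions compiled; the module names below are apex's compiled extension modules, and this check is an addition to the original instructions:

```bash
# amp_C and fused_layer_norm_cuda only exist when apex was built with --cuda_ext
python -c "import apex, amp_C, fused_layer_norm_cuda; print('apex CUDA extensions OK')"
```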