Hi, thank you for your work on this project!
I trained the 4B model using the provided script on 8×H100 GPUs (1 for the vLLM server, 7 for training), but DDP gives significantly lower results than Zero-1: the Zero-1 run matches the reported results, while the DDP run is much worse.
Looking at the two configs, the main differences seem to be max_grad_norm, the learning rate, and the scheduler type.
What settings should I use for DDP training to match Zero-1 performance?