Different results in DDP and ZeRO-1 when training 4B model

Hi, thank you for your work on this project!

I trained the 4B model with the provided script on 8×H100 GPUs (1 for the vLLM server, 7 for training), but DDP gives significantly lower results than ZeRO-1. The ZeRO-1 run matches the reported results, while the DDP run is much worse.

Looking at the configs, the main differences seem to be `max_grad_norm`, the learning rate, and the scheduler type.
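
For reference, this is roughly how the two configs can be compared key by key. It is only a minimal sketch: it assumes both configs have already been loaded into plain dicts, `diff_configs` is a hypothetical helper (not part of the repository), and the keys and values in the usage example are placeholders, not the actual settings from the provided scripts.

```python
def diff_configs(a, b):
    """Return {key: (a_value, b_value)} for every key whose value differs.

    A key that is missing on one side is reported with None for that side,
    so a missing key and an explicit None are treated as equal.
    """
    differences = {}
    for key in sorted(set(a) | set(b)):
        a_value, b_value = a.get(key), b.get(key)
        if a_value != b_value:
            differences[key] = (a_value, b_value)
    return differences


if __name__ == "__main__":
    # Placeholder values for illustration only; NOT the real DDP / ZeRO-1 settings.
    ddp_cfg = {"max_grad_norm": 1.0, "learning_rate": 1e-6, "lr_scheduler_type": "constant"}
    zero1_cfg = {"max_grad_norm": 0.5, "learning_rate": 2e-6, "lr_scheduler_type": "cosine"}

    for key, (ddp_value, zero1_value) in diff_configs(ddp_cfg, zero1_cfg).items():
        print(f"{key}: DDP={ddp_value!r} ZeRO-1={zero1_value!r}")
```

Any key printed by this comparison is a candidate to align first when trying to reproduce the ZeRO-1 numbers with DDP.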

What settings should I use for DDP training to match ZeRO-1 performance?
