Different results in DDP and ZeRO-1 when training 4B model #3

Description

@AiRyunn

Hi, thank you for your work on this project!

I trained the 4B model with the provided script on 8×H100 GPUs (1 for the vLLM server, 7 for training). With ZeRO-1 I match the reported results, but with DDP the results are significantly lower.

Comparing the two configs, the main differences seem to be max_grad_norm, the learning rate, and the scheduler type.
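For anyone who wants to reproduce the comparison, here is a minimal sketch of listing every setting that differs between two configs once they are loaded as plain dicts. The helper name `diff_configs`, the key names other than max_grad_norm, and all values are placeholders for illustration, not the repository's actual settings.

```python
def diff_configs(a, b):
    """Return {key: (a_value, b_value)} for every key whose values differ.

    A key present in only one config is reported with None on the other side.
    """
    missing = object()  # sentinel so a stored None still compares correctly
    keys = sorted(set(a) | set(b))
    return {
        k: (a.get(k), b.get(k))
        for k in keys
        if a.get(k, missing) != b.get(k, missing)
    }


# Placeholder values only; substitute the contents of the real config files.
ddp_config = {
    "max_grad_norm": 1.0,
    "learning_rate": 1e-6,
    "lr_scheduler_type": "constant",
    "seed": 42,
}
zero1_config = {
    "max_grad_norm": 0.5,
    "learning_rate": 2e-6,
    "lr_scheduler_type": "cosine",
    "seed": 42,
}

for key, (ddp_value, zero1_value) in diff_configs(ddp_config, zero1_config).items():
    print(f"{key}: DDP={ddp_value!r}  ZeRO-1={zero1_value!r}")
```

Settings that are identical in both configs (here, the placeholder seed) are left out, so only candidate causes of the gap are printed.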

What settings should I use for DDP training to match ZeRO-1 performance?
