
DOVE: Efficient One-Step Diffusion Model for Real-World Video Super-Resolution

Zheng Chen, Zichen Zou, Kewei Zhang, Xiongfei Su, Xin Yuan, Yong Guo, and Yulun Zhang, "DOVE: Efficient One-Step Diffusion Model for Real-World Video Super-Resolution", NeurIPS 2025

[project] [arXiv] [supplementary material] [dataset] [pretrained models]

🔥🔥🔥 News

  • 2025-10-12: Training code and the HQ-VSR dataset have been released. 🚀🚀🚀
  • 2025-10-11: The project page is online, containing more visual results. 🌈🌈🌈
  • 2025-09-18: DOVE is accepted at NeurIPS 2025. 🎉🎉🎉
  • 2025-06-09: Test datasets, inference scripts, and pretrained models are available. ⭐️⭐️⭐️
  • 2025-05-22: This repo is released.

Abstract: Diffusion models have demonstrated promising performance in real-world video super-resolution (VSR). However, the dozens of sampling steps they require make inference extremely slow. Sampling acceleration techniques, particularly single-step sampling, offer a potential solution. Nonetheless, achieving one-step VSR remains challenging due to the high training overhead on video data and stringent fidelity demands. To tackle these issues, we propose DOVE, an efficient one-step diffusion model for real-world VSR. DOVE is obtained by fine-tuning a pretrained video diffusion model (i.e., CogVideoX). To train DOVE effectively, we introduce the latent–pixel training strategy: a two-stage scheme that gradually adapts the model to the video super-resolution task. Meanwhile, we design a video processing pipeline to construct a high-quality dataset tailored for VSR, termed HQ-VSR. Fine-tuning on this dataset further enhances the restoration capability of DOVE. Extensive experiments show that DOVE performs comparably or better than multi-step diffusion-based VSR methods, while offering outstanding inference efficiency: up to a 28× speed-up over existing methods such as MGLD-VSR.


Demo videos: VideoLQ-007.mp4, RealVSR-016.mp4

Training Strategy


Video Processing Pipeline

🔖 TODO

  • Release testing code.
  • Release pre-trained models.
  • Release training code.
  • Release the video processing pipeline.
  • Release HQ-VSR dataset.
  • Release project page.
  • Provide WebUI.
  • Provide HuggingFace demo.

βš™οΈ Dependencies

  • Python 3.11
  • PyTorch>=2.5.0
  • Diffusers
# Clone the GitHub repo and enter the default directory 'DOVE'.
git clone https://github.com/zhengchen1999/DOVE.git
cd DOVE
conda create -n DOVE python=3.11
conda activate DOVE
pip install -r requirements.txt
pip install "diffusers[torch]" transformers
pip install pyiqa
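
A quick optional sanity check that the key packages import and CUDA is visible (assumes a CUDA-capable machine):

python -c "import torch, diffusers, pyiqa; print(torch.__version__, torch.cuda.is_available())"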

🔗 Contents

  1. Datasets
  2. Models
  3. Training
  4. Testing
  5. Results
  6. Acknowledgements

πŸ“ Datasets

πŸ—³οΈ Train Datasets

We use two datasets for model training: HQ-VSR and DIV2K-HR. All datasets should be placed in the directory datasets/train/.

Dataset     Type     # Videos / Images   Download
HQ-VSR      Video    2,055               Google Drive
DIV2K-HR    Image    800                 Official Link

All datasets should follow this structure:

datasets/
└── train/
    ├── HQ-VSR/
    └── DIV2K_train_HR/

💡 HQ-VSR description:

  • Constructed using our four-stage video processing pipeline.
  • Contains 2,055 videos extracted from OpenVid-1M, suitable for video super-resolution (VSR) training.
  • Detailed configuration and statistics are provided in the paper.

πŸ—³οΈ Test Datasets

We provide several real-world and synthetic test datasets for evaluation. All datasets follow a consistent directory structure:

Dataset    Type         # Videos   Download
UDM10      Synthetic    10         Google Drive
SPMCS      Synthetic    30         Google Drive
YouHQ40    Synthetic    40         Google Drive
RealVSR    Real-world   50         Google Drive
MVSR4x     Real-world   15         Google Drive
VideoLQ    Real-world   50         Google Drive

All test datasets are hosted on Google Drive (see the links above). Make sure the path (datasets/test/) is correct before running inference.

The directory structure is as follows:

datasets/
└── test/
    └── [DatasetName]/
        ├── GT/         # Ground truth: high-quality frames (one folder per clip)
        ├── GT-Video/   # Ground truth (video version): lossless MKV format
        ├── LQ/         # Low-quality input: degraded frames (one folder per clip)
        └── LQ-Video/   # Low-quality input (video version): lossless MKV format
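
Before running inference, you can confirm a downloaded test set matches this layout. Below is a minimal check; the dataset name UDM10 is just an example:

import os

root = "datasets/test/UDM10"  # substitute any downloaded test set
for sub in ("GT", "GT-Video", "LQ", "LQ-Video"):
    path = os.path.join(root, sub)
    # Report how many clips/files each subfolder holds, or flag it as missing.
    print(sub, len(os.listdir(path)) if os.path.isdir(path) else "MISSING")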

📦 Models

We provide pretrained weights for DOVE and DOVE-2B.

Model     Description                               HuggingFace   Google Drive   Baidu Disk   Visual Results
DOVE      Base version, built on CogVideoX1.5-5B    TODO          Download       Download     Download
DOVE-2B   Smaller version, based on CogVideoX-2B    TODO          TODO           TODO         TODO

Place downloaded model files into the pretrained_models/ folder, e.g., pretrained_models/DOVE.

🔧 Training

Note: Training requires 4×A100 GPUs (80 GB each). To lower the GPU memory requirements, you can reduce the number of GPUs or switch to LoRA fine-tuning.

  • Prepare Datasets and Pretrained Models. Download the following resources and place them in the specified directories:

    Type               Dataset / Model     Path
    Training           HQ-VSR, DIV2K-HR    datasets/train/
    Testing            UDM10               datasets/test/
    Pretrained model   CogVideoX1.5-5B     pretrained_models/
  • Build Dataset Statistics. Run the following commands to generate training and testing data statistics:

    # 🔹 Train datasets
    python finetune/scripts/prepare_dataset.py --dir datasets/train/HQ-VSR
    python finetune/scripts/prepare_dataset.py --dir datasets/train/DIV2K_train_HR
    # 🔹 Testing dataset
    python finetune/scripts/prepare_dataset.py --dir datasets/test/UDM10/GT-Video
    python finetune/scripts/prepare_dataset.py --dir datasets/test/UDM10/LQ-Video
  • 🔹 Stage-1 (Latent-Space): Adaptation. Enter the finetune/ directory and run the first-stage (latent-space) training:

    bash train_ddp_one_s1.sh

    This step fine-tunes the pretrained CogVideoX1.5-5B model to adapt it to the VSR task.

  • 🔹 Stage-2 (Pixel-Space): Refinement. After Stage-1 training, convert the checkpoint into a loadable SFT weight:

    python finetune/scripts/prepare_sft_ckpt.py --checkpoint_dir checkpoint/DOVE-s1/checkpoint-10000

    Then, run the second-stage fine-tuning:

    bash train_ddp_one_s2.sh

    This stage further adjusts the model in pixel space to enhance video restoration quality.

  • After Stage-2, convert the final checkpoint to a loadable format:

    python finetune/scripts/prepare_sft_ckpt.py --checkpoint_dir checkpoint/DOVE-s2/checkpoint-500

🔨 Testing

  • We provide inference commands below. Before running, make sure to download the corresponding pretrained models and test datasets.

  • For more options and usage, please refer to inference_script.py.

  • The full testing commands are provided in the shell script: inference.sh.

💡 Prompt Optimization: DOVE uses an empty prompt (""). To accelerate inference, we pre-load the empty prompt embedding from pretrained_models/prompt_embeddings. When the prompt is empty, the pre-loaded embedding is used directly, bypassing text encoding and reducing overhead.
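
A minimal sketch of this trick (the filename and helper below are illustrative, not the exact code of inference_script.py; see that script for the actual implementation):

import torch

def get_prompt_embeds(prompt, encode_fn):
    # Empty prompt: reuse the cached embedding and skip the text encoder entirely.
    if prompt == "":
        # Hypothetical filename under the directory the repo ships.
        return torch.load("pretrained_models/prompt_embeddings/empty_prompt.pt",
                          map_location="cpu")
    # Non-empty prompt: fall back to the usual tokenizer + text-encoder pass.
    return encode_fn(prompt)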

# 🔹 Demo inference
python inference_script.py \
    --input_dir datasets/demo \
    --model_path pretrained_models/DOVE \
    --output_path results/DOVE/demo \
    --is_vae_st \
    --save_format yuv420p

# 🔹 Reproduce paper results
python inference_script.py \
    --input_dir datasets/test/UDM10/LQ-Video \
    --model_path pretrained_models/DOVE \
    --output_path results/DOVE/UDM10 \
    --is_vae_st

# 🔹 Evaluate quantitative metrics
python eval_metrics.py \
    --gt datasets/test/UDM10/GT \
    --pred results/DOVE/UDM10 \
    --metrics psnr,ssim,lpips,dists,clipiqa
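
For reference, a rough per-frame sketch of the kind of computation eval_metrics.py performs, using pyiqa (the clip folder name is hypothetical; the repo script is authoritative for the exact averaging):

import os
import torch
import pyiqa

device = "cuda" if torch.cuda.is_available() else "cpu"
psnr = pyiqa.create_metric("psnr", device=device)  # full-reference metric

gt_dir = "datasets/test/UDM10/GT/000"   # hypothetical clip folder
pred_dir = "results/DOVE/UDM10/000"
scores = [
    # pyiqa metrics accept image file paths directly and return a tensor score.
    psnr(os.path.join(pred_dir, f), os.path.join(gt_dir, f)).item()
    for f in sorted(os.listdir(gt_dir))
]
print(f"PSNR over {len(scores)} frames: {sum(scores) / len(scores):.2f} dB")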

💡 If you encounter out-of-memory (OOM) issues, you can enable chunk-based testing by setting the following parameters: tile_size_hw, overlap_hw, chunk_len, and overlap_t, as in the example below.
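
For example (the tile and chunk values below are illustrative starting points, not tuned defaults; see inference_script.py for the exact argument formats):

python inference_script.py \
    --input_dir datasets/test/UDM10/LQ-Video \
    --model_path pretrained_models/DOVE \
    --output_path results/DOVE/UDM10 \
    --is_vae_st \
    --tile_size_hw 512 512 \
    --overlap_hw 64 64 \
    --chunk_len 16 \
    --overlap_t 4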

💡 The default save format is yuv444p. If playback fails, try --save_format yuv420p (this may slightly affect metrics).

TODO: Add metric computation scripts for FasterVQA, DOVER, and $E^*_{warp}$.

🔎 Results

We achieve state-of-the-art performance on real-world video super-resolution. Visual results are available at Google Drive.

Quantitative Results (click to expand)
  • Results in Tab. 2 of the main paper

  • Complexity comparison in Tab. 2 of the supplementary material

Qualitative Results (click to expand)
  • Results in Fig. 4 of the main paper

More Qualitative Results
  • More results in Fig. 3 of the supplementary material

  • More results in Fig. 4 of the supplementary material

  • More results in Fig. 5 of the supplementary material

  • More results in Fig. 6 of the supplementary material

  • More results in Fig. 7 of the supplementary material

📎 Citation

If you find the code helpful in your research or work, please cite the following paper.

@inproceedings{chen2025dove,
  title={DOVE: Efficient One-Step Diffusion Model for Real-World Video Super-Resolution},
  author={Chen, Zheng and Zou, Zichen and Zhang, Kewei and Su, Xiongfei and Yuan, Xin and Guo, Yong and Zhang, Yulun},
  booktitle={NeurIPS},
  year={2025}
}

💡 Acknowledgements

This project is based on CogVideo and Open-Sora.
