[RFC]: DiT models performance benchmark (T2I/I2I/T2V/TI2V) #344

@david6666666

Description

Motivation.

  • Define two major evaluation blocks: Text to Image and Text to Video
  • For each block, define:
    • Key performance metrics (Latency / Throughput / Memory / Utilization / Stability)
    • Key quality metrics (optional / tiered: fast proxy vs full metrics)
    • Evaluation datasets and prompt sets (including versioning / snapshot strategy)
    • Unified methodology (warmup, timing points, statistical definitions, configuration recording)
    • Standard output format (JSON schema recommended)
  • Supported comparison dimensions (examples):
    • Different attention backends
    • Different parallelization strategies
    • Different cache backends
    • Different precisions
    • Different resolutions / frame counts
    • Different batch sizes / concurrency levels
    • Different schedulers / steps

Step 1: Provide two Python scripts for benchmarking T2I and T2V; a minimal skeleton is sketched below.
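
As a point of reference, here is a minimal sketch of what the standalone T2I script could look like. A diffusers-style pipeline stands in for the actual model runner, and the CLI flags and file names are illustrative, not a committed interface:

```python
# Hypothetical skeleton of the Step 1 standalone T2I benchmark script.
# A diffusers pipeline stands in for the real runner; flags are illustrative.
import argparse
import json
import time

import torch
from diffusers import DiffusionPipeline


def main() -> None:
    parser = argparse.ArgumentParser(description="Standalone T2I benchmark (sketch)")
    parser.add_argument("--model", required=True)
    parser.add_argument("--prompt-file", required=True)
    parser.add_argument("--steps", type=int, default=30)
    parser.add_argument("--output", default="t2i_summary.json")
    args = parser.parse_args()

    with open(args.prompt_file) as f:
        prompts = [line.strip() for line in f if line.strip()]

    pipe = DiffusionPipeline.from_pretrained(
        args.model, torch_dtype=torch.bfloat16
    ).to("cuda")

    latencies_ms = []
    for prompt in prompts:
        torch.cuda.synchronize()
        start = time.perf_counter()
        pipe(prompt, num_inference_steps=args.steps)
        torch.cuda.synchronize()  # wait for all kernels before stopping the clock
        latencies_ms.append((time.perf_counter() - start) * 1e3)

    with open(args.output, "w") as f:
        json.dump({"task": "t2i", "e2e_latency_ms": latencies_ms}, f, indent=2)


if __name__ == "__main__":
    main()
```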

Step 2: Combine the two scripts with `vllm bench serve` so that the same benchmarking flow applies to all supported tasks.

Proposed Change.

1. Terminology and Definitions

  • E2E latency: end-to-end time from the input prompt to the final image/video tensor (or encoded file)
  • Throughput: number of samples completed per unit time (images/s or videos/s); for video, frames/s may also be reported
  • Peak VRAM: peak GPU memory usage during inference, in MiB (see the measurement sketch after this list)
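
A minimal sketch of how these three terms could be measured for a single batch on a CUDA device; `generate` is a hypothetical callable that runs the full pipeline from prompt to output tensor:

```python
import time

import torch


def measure_once(generate, batch_size: int) -> dict:
    """One timed run; `generate` is a hypothetical prompt-to-tensor callable."""
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()               # exclude previously queued work
    start = time.perf_counter()
    generate()                             # input prompt -> final tensor
    torch.cuda.synchronize()               # include all launched kernels
    e2e_s = time.perf_counter() - start
    return {
        "e2e_latency_ms": e2e_s * 1e3,
        "throughput_images_per_s": batch_size / e2e_s,
        # Allocator-tracked peak only; driver/context overhead is excluded.
        "peak_vram_mib": torch.cuda.max_memory_allocated() / 2**20,
    }
```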

2. Benchmark 1: Text to Image (T2I)

2.1 Workload Definition

Evaluation dimensions should cover at least:

  • Resolution: 512×512, 1024×1024 (or the actual supported set)
  • Batch size: 1 / 2 / 4, etc. (depending on available VRAM)
  • Inference steps: 20 / 30 / 50 (select at least one fixed value for normalization)
  • Guidance scale (CFG): fix a value (e.g., 7.5) and record it
  • Scheduler: fix one (or evaluate schedulers separately and record the choice)
  • Precision: fp16 / bf16 (optionally fp8 if supported)
  • Attention backend / parallelization strategy / cache backend: the core comparison items, determined by the available engineering options (see the sweep sketch after this list)
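
A sketch of how these dimensions could be encoded for a sweep; the field names and backend strings are illustrative, not a fixed schema:

```python
import itertools
from dataclasses import dataclass


@dataclass(frozen=True)
class T2IWorkload:
    resolution: tuple  # (height, width)
    batch_size: int
    steps: int
    guidance_scale: float
    scheduler: str
    precision: str
    attention_backend: str


def sweep():
    # Vary the core comparison items; fix steps / CFG / scheduler / precision
    # so that runs stay directly comparable.
    for res, bs, backend in itertools.product(
        [(512, 512), (1024, 1024)],
        [1, 2, 4],
        ["flash_attn", "torch_sdpa"],  # illustrative backend names
    ):
        yield T2IWorkload(res, bs, steps=30, guidance_scale=7.5,
                          scheduler="euler", precision="bf16",
                          attention_backend=backend)
```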

2.2 Key Performance Metrics

  • e2e_latency_ms (P50 / P95 / P99; computed as sketched after this list)
  • throughput_images_per_s (steady-state)
  • peak_vram_mib
  • oom_rate / failure_rate
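
The percentile definitions could be pinned down once, e.g. with numpy, so every script reports identical statistics:

```python
import numpy as np


def latency_percentiles(latencies_ms: list) -> dict:
    arr = np.asarray(latencies_ms, dtype=np.float64)
    return {
        "p50": float(np.percentile(arr, 50)),
        "p95": float(np.percentile(arr, 95)),
        "p99": float(np.percentile(arr, 99)),
    }
```
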
2.3 Dataset and Prompt Sets

Two categories: Reference-based and Prompt-only.

  1. Prompt-only (more suitable for performance regression; recommended as primary)
  • Fixed prompt list covering:
    • Short / long prompts, different styles, different subjects, different complexity levels, different languages (including mixed Chinese/English)
  • Versioning: prompt_set_name + version + sha256 (see the fingerprint sketch after this list)
  • Sampling / adaptation allowed from the following sources (subject to availability and license confirmation):
    • DrawBench prompts
    • DiffusionDB
    • Internally built prompt sets
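
A sketch of the versioning scheme: a prompt snapshot is identified by its name, version, and the sha256 of the exact file bytes, so any silent edit changes the fingerprint. The function name is illustrative:

```python
import hashlib


def prompt_set_fingerprint(path: str, name: str, version: str) -> dict:
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    return {"prompt_set_name": name, "version": version, "sha256": digest}
```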

3. Benchmark 2: Image to Image (I2I)


4. Benchmark 3: Text to Video (T2V)

4.1 Workload Definition

Key T2V parameters must be explicitly fixed:

  • Resolution: e.g., 480×640 (480p) (standardize on one first, then extend)
  • Number of frames (num_frames): e.g., 16 / 24 / 48 (fix at least one for normalization)
  • FPS (target generation fps): e.g., 8 / 12 / 24 (depends on model definition; record as N/A if not applicable)
  • Inference steps: e.g., 30 / 50
  • Guidance scale: fixed and recorded
  • Output format: tensor / raw frames / encoded mp4 (whether encoding is included in E2E must be specified; see the timing sketch after this list)
  • Concurrency: primarily single-request; multi-concurrency can be added for throughput and stability testing
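
To make the output-format bullet concrete, a sketch of the timing boundary: if encoded mp4 is reported, encoding must sit inside the timed region. `generate_frames` and `encode_mp4` are hypothetical helpers:

```python
import time


def timed_t2v(generate_frames, encode_mp4, include_encoding: bool) -> dict:
    start = time.perf_counter()
    frames = generate_frames()        # prompt -> raw frame tensors
    t_tensor_s = time.perf_counter() - start
    if include_encoding:
        encode_mp4(frames)            # raw frames -> encoded mp4 file
    t_total_s = time.perf_counter() - start
    # Report both, and state in the run config which one e2e_latency_ms means.
    return {"latency_tensor_s": t_tensor_s, "latency_total_s": t_total_s}
```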

4.2 Key Performance Metrics (Required)

  • e2e_latency_ms (per video)
  • throughput_videos_per_s (steady-state)
  • throughput_frames_per_s (strongly recommended for comparison across frame counts; see the normalization sketch after this list)
  • peak_vram_mib
  • oom_rate / failure_rate
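
The frames/s normalization in one line: videos/s alone is not comparable across different num_frames settings, frames/s is:

```python
def t2v_throughput(num_videos: int, num_frames: int, elapsed_s: float) -> dict:
    videos_per_s = num_videos / elapsed_s
    return {
        "throughput_videos_per_s": videos_per_s,
        # 48-frame videos take longer per video; frames/s stays comparable.
        "throughput_frames_per_s": videos_per_s * num_frames,
    }
```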

4.3 Prompt Set (Recommended)

  • Prompt-only (recommended primary for performance regression):
    • Fixed T2V prompt list (covering motion / camera language / multiple subjects / long descriptions)
    • Versioning same as T2I

5. Benchmark 4: Image to Video (I2V)


6. Result Output and Reporting Format (unified across scripts; recommended)

6.1 Summary JSON (Recommended)

  • task: t2i / i2i / t2v / i2v
  • timestamp_utc
  • git_commit
  • config (model name, precision, steps, resolution, batch, frames, scheduler, attention_backend, etc.)
  • metrics_summary
    • latency percentiles, throughput, peak_vram, failure / oom (an example instance follows this list)
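
An illustrative instance of the summary JSON; the key names follow the list above, while every concrete value is a placeholder:

```python
import json
from datetime import datetime, timezone

summary = {
    "task": "t2i",
    "timestamp_utc": datetime.now(timezone.utc).isoformat(),
    "git_commit": "<commit-sha>",  # placeholder; e.g. from `git rev-parse HEAD`
    "config": {
        "model": "example/dit-model",  # placeholder model name
        "precision": "bf16",
        "steps": 30,
        "resolution": "1024x1024",
        "batch": 1,
        "scheduler": "euler",
        "attention_backend": "flash_attn",
    },
    "metrics_summary": {
        "e2e_latency_ms": {"p50": 812.4, "p95": 855.1, "p99": 901.7},  # placeholders
        "throughput_images_per_s": 1.23,  # placeholder
        "peak_vram_mib": 18432,           # placeholder
        "failure_rate": 0.0,
        "oom_rate": 0.0,
    },
}

with open("summary.json", "w") as f:
    json.dump(summary, f, indent=2)
```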

Feedback Period.

2025/12/30

CC List.

@david6666666 @yenuo26

Any Other Things.

No response
