[RFC]: DiT models performance benchmark (T2I/I2I/T2V/TI2V) #344

@david6666666

Description

Motivation.

  • Define two major evaluation blocks: Text to Image and Text to Video
  • For each block, define:
    • Key performance metrics (Latency / Throughput / Memory / Utilization / Stability)
    • Key quality metrics (optional / tiered: fast proxy vs full metrics)
    • Evaluation datasets and prompt sets (including versioning / snapshot strategy)
    • Unified methodology (warmup, timing points, statistical definitions, configuration recording)
    • Standard output format (JSON schema recommended)
  • Supported comparison dimensions (examples):
    • Different attention backends
    • Different parallelization strategies
    • Different cache backends
    • Different precisions
    • Different resolutions / frame counts
    • Different batch sizes / concurrency levels
    • Different schedulers / steps

Step 1: Provide two Python scripts for benchmarking T2I and T2V; a minimal skeleton is sketched below.
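
As a point of reference, here is a minimal sketch of what the standalone T2I script could look like. A diffusers-style pipeline stands in for the actual model runner, and the CLI flags and file names are illustrative, not a committed interface:

```python
# Hypothetical skeleton of the Step 1 standalone T2I benchmark script.
# A diffusers pipeline stands in for the real runner; flags are illustrative.
import argparse
import json
import time

import torch
from diffusers import DiffusionPipeline


def main() -> None:
    parser = argparse.ArgumentParser(description="Standalone T2I benchmark (sketch)")
    parser.add_argument("--model", required=True)
    parser.add_argument("--prompt-file", required=True)
    parser.add_argument("--steps", type=int, default=30)
    parser.add_argument("--output", default="t2i_summary.json")
    args = parser.parse_args()

    with open(args.prompt_file) as f:
        prompts = [line.strip() for line in f if line.strip()]

    pipe = DiffusionPipeline.from_pretrained(
        args.model, torch_dtype=torch.bfloat16
    ).to("cuda")

    latencies_ms = []
    for prompt in prompts:
        torch.cuda.synchronize()
        start = time.perf_counter()
        pipe(prompt, num_inference_steps=args.steps)
        torch.cuda.synchronize()  # wait for all kernels before stopping the clock
        latencies_ms.append((time.perf_counter() - start) * 1e3)

    with open(args.output, "w") as f:
        json.dump({"task": "t2i", "e2e_latency_ms": latencies_ms}, f, indent=2)


if __name__ == "__main__":
    main()
```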

Step 2: Combine the two scripts with `vllm bench serve` so that the same benchmarking flow applies to all supported tasks.

Proposed Change.

1. Terminology and Definitions

  • E2E latency: end-to-end time from the input prompt to the final image/video tensor (or encoded file)
  • Throughput: number of samples completed per unit time (images/s or videos/s); for video, frames/s may also be reported
  • Peak VRAM: peak GPU memory usage during inference, in MiB (see the measurement sketch after this list)
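
A minimal sketch of how these three terms could be measured for a single batch on a CUDA device; `generate` is a hypothetical callable that runs the full pipeline from prompt to output tensor:

```python
import time

import torch


def measure_once(generate, batch_size: int) -> dict:
    """One timed run; `generate` is a hypothetical prompt-to-tensor callable."""
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()               # exclude previously queued work
    start = time.perf_counter()
    generate()                             # input prompt -> final tensor
    torch.cuda.synchronize()               # include all launched kernels
    e2e_s = time.perf_counter() - start
    return {
        "e2e_latency_ms": e2e_s * 1e3,
        "throughput_images_per_s": batch_size / e2e_s,
        # Allocator-tracked peak only; driver/context overhead is excluded.
        "peak_vram_mib": torch.cuda.max_memory_allocated() / 2**20,
    }
```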

2. Benchmark 1: Text to Image (T2I)

2.1 Workload Definition

Evaluation dimensions should cover at least:

  • Resolution: 512×512, 1024×1024 (or the actual supported set)
  • Batch size: 1 / 2 / 4, etc. (depending on available VRAM)
  • Inference steps: 20 / 30 / 50 (select at least one fixed value for normalization)
  • Guidance scale (CFG): fix a value (e.g., 7.5) and record it
  • Scheduler: fix one (or evaluate schedulers separately and record the choice)
  • Precision: fp16 / bf16 (optionally fp8 if supported)
  • Attention backend / parallelization strategy / cache backend: the core comparison items, determined by the available engineering options (see the sweep sketch after this list)
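
A sketch of how these dimensions could be encoded for a sweep; the field names and backend strings are illustrative, not a fixed schema:

```python
import itertools
from dataclasses import dataclass


@dataclass(frozen=True)
class T2IWorkload:
    resolution: tuple  # (height, width)
    batch_size: int
    steps: int
    guidance_scale: float
    scheduler: str
    precision: str
    attention_backend: str


def sweep():
    # Vary the core comparison items; fix steps / CFG / scheduler / precision
    # so that runs stay directly comparable.
    for res, bs, backend in itertools.product(
        [(512, 512), (1024, 1024)],
        [1, 2, 4],
        ["flash_attn", "torch_sdpa"],  # illustrative backend names
    ):
        yield T2IWorkload(res, bs, steps=30, guidance_scale=7.5,
                          scheduler="euler", precision="bf16",
                          attention_backend=backend)
```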

2.2 Key Performance Metrics

  • e2e_latency_ms (P50 / P95 / P99; computed as sketched after this list)
  • throughput_images_per_s (steady-state)
  • peak_vram_mib
  • oom_rate / failure_rate
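
The percentile definitions could be pinned down once, e.g. with numpy, so every script reports identical statistics:

```python
import numpy as np


def latency_percentiles(latencies_ms: list) -> dict:
    arr = np.asarray(latencies_ms, dtype=np.float64)
    return {
        "p50": float(np.percentile(arr, 50)),
        "p95": float(np.percentile(arr, 95)),
        "p99": float(np.percentile(arr, 99)),
    }
```
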
2.3 Dataset and Prompt Sets

Two categories: Reference-based and Prompt-only.

  1. Prompt-only (more suitable for performance regression; recommended as primary)
  • Fixed prompt list covering:
    • Short / long prompts, different styles, different subjects, different complexity levels, different languages (including mixed Chinese/English)
  • Versioning: prompt_set_name + version + sha256 (see the fingerprint sketch after this list)
  • Sampling / adaptation allowed from the following sources (subject to availability and license confirmation):
    • DrawBench prompts
    • DiffusionDB
    • Internally built prompt sets
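
A sketch of the versioning scheme: a prompt snapshot is identified by its name, version, and the sha256 of the exact file bytes, so any silent edit changes the fingerprint. The function name is illustrative:

```python
import hashlib


def prompt_set_fingerprint(path: str, name: str, version: str) -> dict:
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    return {"prompt_set_name": name, "version": version, "sha256": digest}
```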

3. Benchmark 2: Image to Image (I2I)


4. Benchmark 3: Text to Video (T2V)

4.1 Workload Definition

Key T2V parameters must be explicitly fixed:

  • Resolution: e.g., 480×640 (480p) (standardize on one first, then extend)
  • Number of frames (num_frames): e.g., 16 / 24 / 48 (fix at least one for normalization)
  • FPS (target generation fps): e.g., 8 / 12 / 24 (depends on model definition; record as N/A if not applicable)
  • Inference steps: e.g., 30 / 50
  • Guidance scale: fixed and recorded
  • Output format: tensor / raw frames / encoded mp4 (whether encoding is included in E2E must be specified; see the timing sketch after this list)
  • Concurrency: primarily single-request; multi-concurrency can be added for throughput and stability testing
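
To make the output-format bullet concrete, a sketch of the timing boundary: if encoded mp4 is reported, encoding must sit inside the timed region. `generate_frames` and `encode_mp4` are hypothetical helpers:

```python
import time


def timed_t2v(generate_frames, encode_mp4, include_encoding: bool) -> dict:
    start = time.perf_counter()
    frames = generate_frames()        # prompt -> raw frame tensors
    t_tensor_s = time.perf_counter() - start
    if include_encoding:
        encode_mp4(frames)            # raw frames -> encoded mp4 file
    t_total_s = time.perf_counter() - start
    # Report both, and state in the run config which one e2e_latency_ms means.
    return {"latency_tensor_s": t_tensor_s, "latency_total_s": t_total_s}
```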

4.2 Key Performance Metrics (Required)

  • e2e_latency_ms (per video)
  • throughput_videos_per_s (steady-state)
  • throughput_frames_per_s (strongly recommended for comparison across frame counts; see the normalization sketch after this list)
  • peak_vram_mib
  • oom_rate / failure_rate
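
The frames/s normalization in one line: videos/s alone is not comparable across different num_frames settings, frames/s is:

```python
def t2v_throughput(num_videos: int, num_frames: int, elapsed_s: float) -> dict:
    videos_per_s = num_videos / elapsed_s
    return {
        "throughput_videos_per_s": videos_per_s,
        # 48-frame videos take longer per video; frames/s stays comparable.
        "throughput_frames_per_s": videos_per_s * num_frames,
    }
```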

4.3 Prompt Set (Recommended)

  • Prompt-only (recommended primary for performance regression):
    • Fixed T2V prompt list (covering motion / camera language / multiple subjects / long descriptions)
    • Versioning same as T2I

5. Benchmark 4: Image to Video (I2V)


6. Result Output and Reporting Format (unified across scripts; recommended)

6.1 Summary JSON (Recommended)

  • task: t2i / i2i / t2v / i2v
  • timestamp_utc
  • git_commit
  • config (model name, precision, steps, resolution, batch, frames, scheduler, attention_backend, etc.)
  • metrics_summary
    • latency percentiles, throughput, peak_vram, failure / oom (an example instance follows this list)
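
An illustrative instance of the summary JSON; the key names follow the list above, while every concrete value is a placeholder:

```python
import json
from datetime import datetime, timezone

summary = {
    "task": "t2i",
    "timestamp_utc": datetime.now(timezone.utc).isoformat(),
    "git_commit": "<commit-sha>",  # placeholder; e.g. from `git rev-parse HEAD`
    "config": {
        "model": "example/dit-model",  # placeholder model name
        "precision": "bf16",
        "steps": 30,
        "resolution": "1024x1024",
        "batch": 1,
        "scheduler": "euler",
        "attention_backend": "flash_attn",
    },
    "metrics_summary": {
        "e2e_latency_ms": {"p50": 812.4, "p95": 855.1, "p99": 901.7},  # placeholders
        "throughput_images_per_s": 1.23,  # placeholder
        "peak_vram_mib": 18432,           # placeholder
        "failure_rate": 0.0,
        "oom_rate": 0.0,
    },
}

with open("summary.json", "w") as f:
    json.dump(summary, f, indent=2)
```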

Feedback Period.

2025/12/30

CC List.

@david6666666 @yenuo26

Any Other Things.

No response
