
Conversation

@princepride (Contributor) commented Dec 15, 2025

Purpose

Resolves #203

This PR introduces support for the Bagel model (BAGEL-7B-MoT) in vllm-omni.
Specifically, it implements the txt2img inference capability using the BagelPipeline.

Subsequently, I will implement Bagel within the Model Executor. I plan to decompose the model into two stages, AR and DiT: the AR stage will directly reuse the implementation from the main repository, while the DiT stage will use the Model Executor's implementation. This approach will enable text2text, text2img, img2text, and img2img capabilities. A rough sketch of the intended decomposition is below.
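All names in the sketch are illustrative placeholders, not the final vllm-omni API:

# Hypothetical sketch of the planned AR + DiT decomposition.
# StageSpec and the backend/task names are illustrative only.
from dataclasses import dataclass

@dataclass
class StageSpec:
    name: str                # "ar" or "dit"
    backend: str             # where the stage implementation lives
    tasks: tuple[str, ...]   # tasks routed through this stage

BAGEL_STAGES = [
    # The AR stage reuses the language-model implementation from the main repo.
    StageSpec("ar", "vllm", ("text2text", "img2text", "text2img", "img2img")),
    # The DiT stage uses the Model Executor implementation and serves image outputs.
    StageSpec("dit", "model_executor", ("text2img", "img2img")),
]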

Test Plan

To verify the correctness of the implementation, a reproduction script was created to initialize the model and perform a simple text-to-image generation.

Test Script:

from vllm_omni.entrypoints.omni_diffusion import OmniDiffusion

def main():
    model_path = "../models/BAGEL-7B-MoT"
    prompt = "A futuristic city skyline at twilight, cyberpunk style"
    # Initialize the Bagel pipeline through the OmniDiffusion entrypoint.
    pipeline = OmniDiffusion(model=model_path)
    # Run text-to-image generation and save the first output image.
    result = pipeline.generate(prompt)
    output_file = "bagel_output.png"
    result.images[0].save(output_file)

if __name__ == "__main__":
    main()

Test Result

(generated image attached)

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing the test command.
  • The test results, such as pasting the results comparison before and after, or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft.

@hsliuustc0106 (Collaborator)

@natureofnature PTAL

@princepride (Contributor, Author)

Sorry, I forgot to install pre-commit on the computer I used over the weekend.😂

f"W={db_cache_config.max_warmup_steps}, "
)

transformer = pipeline.language_model.model
Collaborator (inline review comment on the diff snippet above):

If we add self.transformer = self.language_model.model in the Bagel pipeline __init__, can we just reuse the regular DiT enabler enable_cache_for_dit?
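A minimal sketch of that suggestion, assuming the attribute and function names referenced above (not verified against the final code):

# Sketch only: alias the inner transformer in the pipeline __init__ so the
# generic DiT cache enabler can find it under the attribute name it expects.
class BagelPipeline:
    def __init__(self, language_model, vae, scheduler):
        self.language_model = language_model
        self.vae = vae
        self.scheduler = scheduler
        # Expose the inner transformer where enable_cache_for_dit looks for it.
        self.transformer = self.language_model.model

# With the alias in place, the Bagel-specific wiring could become:
#   enable_cache_for_dit(pipeline, db_cache_config)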

princepride and others added 3 commits December 18, 2025 05:58
@hsliuustc0106 (Collaborator)

Hi, will the model be ready before the 12.30 release?

@princepride (Contributor, Author)

I believe we can make it!

@hsliuustc0106 (Collaborator)

@natureofnature @princepride currently, can we have an e2e example for AR+DiT?

@princepride (Contributor, Author)

Since Bagel's DiT component does not follow a traditional architecture, I am currently unable to implement the Cache DiT functionality for it. I have provided a more detailed explanation in this issue: vipshop/cache-dit#598

@princepride (Contributor, Author)

@hsliuustc0106 Can you help review it? I only kept the code for the diffusion part.

@hsliuustc0106 (Collaborator)

> @hsliuustc0106 Can you help review it? I only kept the code for the diffusion part.

Definitely. Please fix the docs and pre-commit.

od_config.model,
)
od_config.tf_model_config = TransformerConfig.from_dict(tf_config_dict)
# Diffusers-style models expose `model_index.json` with `_class_name`.
Collaborator (inline review comment on the snippet above):

@ZJY0516 PTAL

Collaborator:

If we want to support models that don't follow the standard diffusers file structure, we have to add specific handling logic here :(
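For context, the dispatch being discussed would look roughly like the sketch below; the function name and the Bagel fallback branch are hypothetical, not the actual loader code:

import json
import os

def resolve_pipeline_class(model_path: str) -> str:
    # Rough sketch of the config dispatch discussed above (not the real loader).
    index_file = os.path.join(model_path, "model_index.json")
    if os.path.exists(index_file):
        # Diffusers-style models expose model_index.json with `_class_name`.
        with open(index_file) as f:
            return json.load(f)["_class_name"]
    # Models that do not follow the standard diffusers layout (e.g. Bagel)
    # need model-specific handling here.
    return "BagelPipeline"  # hypothetical fallback for illustration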

@zhangzef

Hi! This is a great PR — I went through Bagel’s adaptation process and the code in detail, but I still have a few questions that I’m unclear about:
1. From what I’ve observed, model_executor uses stage parallelism. However, Bagel’s understanding and generation parameters don’t seem to be easily decoupled into two independent stages for parallel execution. Would it be feasible to treat Bagel as a single stage during adaptation? If so, would that reduce the overall degree of parallelism and therefore increase the inference cost?
2. Is CacheDiT not applicable because Bagel’s special MoT architecture makes it incompatible? Also, does the Bagel diffusion branch that is about to be merged already use CacheDiT?
3. When you say “batch inference” here, do you mean the model directly computes over the entire batch in one forward pass, or is it referring to something else?

@princepride (Contributor, Author)

> Hi! This is a great PR — I went through Bagel’s adaptation process and the code in detail, but I still have a few questions that I’m unclear about: 1. From what I’ve observed, model_executor uses stage parallelism. However, Bagel’s understanding and generation parameters don’t seem to be easily decoupled into two independent stages for parallel execution. Would it be feasible to treat Bagel as a single stage during adaptation? If so, would that reduce the overall degree of parallelism and therefore increase the inference cost? 2. Is CacheDiT not applicable because Bagel’s special MoT architecture makes it incompatible? Also, does the Bagel diffusion branch that is about to be merged already use CacheDiT? 3. When you say “batch inference” here, do you mean the model directly computes over the entire batch in one forward pass, or is it referring to something else?

  1. I do not intend to implement stage parallelism within model_executor, because such logic does not exist in Bagel. The current code is still quite messy 😂 as I am currently trying to integrate my code with @natureofnature's code. I will remove a lot of unnecessary content later.
  2. That is correct: MoT causes the architectural incompatibility. I reached out to the Cache-DiT team in vipshop/cache-dit#598 ("[Feature] vLLM-Omni Bagel support but don't know how to use cache-dit, I need your help🙏"), and they suggested implementing it via transformers, but we do not support that currently.
  3. Yes, I am referring to computing the entire batch in a single forward pass. I also hope that, with the future Dynamic Stage Transitions feature (#504), the entire batch can pass through the AR computation first; the I2T and T2T tasks can then exit directly from the AR stage, while the remaining batch continues through the DiT stage to complete the I2I and T2I tasks.

One concern I have is that I2I in Bagel requires computing an additional VAE KV cache during the AR stage, which also needs to be keyed on stage tags during batch inference. I suspect that relying entirely on the multi-modal support in vLLM might not be feasible, as I haven't seen any configuration for multiple vision modules in it yet. Please correct me if I'm wrong. @Isotr0py
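To illustrate point 3 and the VAE KV cache concern, here is a purely hypothetical sketch of per-request routing after the AR stage; the task tags and the needs_vae_kv_cache flag are invented for this example and are not part of this PR:

# Hypothetical sketch of dynamic stage transitions after the AR stage.
def route_after_ar(batch):
    finished, to_dit = [], []
    for req in batch:
        if req.task in ("text2text", "img2text"):
            # T2T and I2T requests exit directly after the AR stage.
            finished.append(req)
        else:
            # T2I and I2I continue to the DiT stage; I2I additionally relies on
            # the VAE KV cache produced during the AR stage.
            req.needs_vae_kv_cache = req.task == "img2img"
            to_dit.append(req)
    return finished, to_dit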

@princepride (Contributor, Author)

I have provided a more intuitive description of the Bagel DiT attention in vipshop/cache-dit#598. Specifically, Bagel computes a single step over the following layout:

<vision_token_start> (AR weights computation) <image_token> * 4096 (DiT weights computation) <vision_token_end> (AR weights computation)

with bidirectional attention over the whole span.
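In code, the layout above roughly corresponds to the schematic sketch below (not the actual qwen2_navit.py implementation); it only shows which positions would be routed to the generation (DiT) weights and that attention over the span is bidirectional:

import torch

# Schematic layout of a single Bagel generation step, as described above.
num_image_tokens = 4096
tokens = ["<vision_token_start>"] + ["<image_token>"] * num_image_tokens + ["<vision_token_end>"]

# True where the generation (DiT) expert weights apply,
# False where the AR (understanding) weights apply.
gen_mask = torch.tensor([tok == "<image_token>" for tok in tokens])

# Bidirectional attention over the whole span: every position may attend to
# every other position (no causal mask within the step).
seq_len = len(tokens)
attn_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)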

@zhangzef

I noticed that the current PR does not implement MoT, only gen's attention.

@princepride (Contributor, Author)

> I noticed that the current PR does not implement MoT, only gen's attention.

The logic is simplified here: https://github.com/princepride/vllm-omni/blob/9bf1ef49033de8df9c6edf36f9af2b7a5d67013a/vllm_omni/diffusion/models/bagel/qwen2_navit.py#L356

@zhangzef


Thank you for your answer. I am also very interested in this work at the moment. Could you add me on WeChat for further communication? If you agree, you could send your WeChat ID to my email: [email protected]

@ZJY0516 (Collaborator) left a review

overall, LGTM

@ZJY0516 requested a review from SamitHuang on December 30, 2025 at 14:36
@ZJY0516 (Collaborator) commented Dec 30, 2025

@princepride please fix the doc build error

@hsliuustc0106 added the "ready" label (label to trigger buildkite CI) on Dec 31, 2025
@hsliuustc0106 (Collaborator)

  • docs/readthedocs.org:vllm-omni

It looks like you need to add an __init__.py under the bagel folder.

@princepride requested a review from ZJY0516 on December 31, 2025 at 03:15
@hsliuustc0106 (Collaborator) left a review

lgtm, looking forward to the follow-up PRs

@hsliuustc0106 merged commit 23bf317 into vllm-project:main on Dec 31, 2025
7 checks passed
@hsliuustc0106 (Collaborator)

@princepride please also submit a PR to vllm/recipe


Development

Successfully merging this pull request may close these issues: [New Model]: ByteDance-Seed/BAGEL-7B-MoT
