[Profile] Adding profiling hooks for omni & vLLM & diffusion pipeline #340
# Profiling vLLM-Omni
## Profiling hooks for the omni & vLLM & diffusion pipeline

## 1. Usage of Log Statistics for Single-Pipeline Diffusion Scheduling

In this project, tasks such as text-to-image and text-to-video follow a single-pipeline diffusion scheduling paradigm. Each request triggers the diffusion pipeline as a whole, executing text encoding, denoising iterations, and decoding in a tightly coupled, end-to-end manner.

- The entire workflow is launched in one shot via `Omni.generate(...)` (see the sketch below).
- Execution proceeds sequentially within the diffusion engine.
- Performance and behavior can be directly inspected through:
  - diffusion-level logs (e.g., denoising steps, post-processing),
  - vLLM runtime logs (e.g., worker startup, device allocation).

> Text-to-Image / Text-to-Video → *single diffusion pipeline, single execution path*
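For orientation, here is a minimal sketch of that one-shot launch. The import path and the `generate()` keyword argument are assumptions for illustration; the `Omni` constructor argument mirrors the logging example later in this document.

```python
# Minimal sketch (assumed API surface): one call drives the whole
# diffusion pipeline: text encoding, denoising iterations, decoding.
from vllm_omni import Omni  # import path is an assumption

omni = Omni(model="path/models/Qwen-Image")  # model path is illustrative

# generate() launches the entire workflow in one shot; the keyword
# name `prompt` is assumed here for illustration.
outputs = omni.generate(prompt="A mascot dancing under the moon")
```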
### How to use the logs

### 1. Printing the vLLM features

1) Enable vLLM debug logging:

```bash
export VLLM_LOGGING_LEVEL=DEBUG
```

2) Run the script (taking image-to-image as an example; other models are used the same way):
```bash
python image_edit.py \
    --image input.png \
    --prompt "Let this mascot dance under the moon, surrounded by floating stars and poetic bubbles such as 'Be Kind'" \
    --output output_image_edit.png \
    --num_inference_steps 50 \
    --cfg_scale 4.0
```
The vLLM logs appear in the console, and the diffusion logs are written to path/omni_diffusion_stats/omni_diffusion_%Y%m%d_%H%M%S_xx_pidxxxx.jsonl.
```text
DEBUG 12-17 09:21:42 [plugins/__init__.py:28] No plugins for group vllm.platform_plugins found.
DEBUG 12-17 09:21:42 [platforms/__init__.py:34] Checking if TPU platform is available.
DEBUG 12-17 09:21:42 [platforms/__init__.py:52] TPU platform is not available because: No module named 'libtpu'
DEBUG 12-17 09:21:42 [platforms/__init__.py:58] Checking if CUDA platform is available.
DEBUG 12-17 09:21:42 [platforms/__init__.py:78] Confirmed CUDA platform is available.
DEBUG 12-17 09:21:42 [platforms/__init__.py:106] Checking if ROCm platform is available.
DEBUG 12-17 09:21:42 [platforms/__init__.py:120] ROCm platform is not available because: No module named 'amdsmi'
DEBUG 12-17 09:21:42 [platforms/__init__.py:127] Checking if XPU platform is available.
DEBUG 12-17 09:21:42 [platforms/__init__.py:146] XPU platform is not available because: No module named 'intel_extension_for_pytorch'
DEBUG 12-17 09:21:42 [platforms/__init__.py:153] Checking if CPU platform is available.
DEBUG 12-17 09:21:42 [platforms/__init__.py:58] Checking if CUDA platform is available.
DEBUG 12-17 09:21:42 [platforms/__init__.py:78] Confirmed CUDA platform is available.
INFO 12-17 09:21:42 [platforms/__init__.py:216] Automatically detected platform cuda.
DEBUG 12-17 09:21:47 [compilation/decorators.py:155] Inferred dynamic dimensions for forward method of <class 'vllm.model_executor.models.qwen3_moe.Qwen3MoeModel'>: ['input_ids', 'positions', 'intermediate_tensors', 'inputs_embeds']
WARNING 12-17 09:21:47 [mooncake_connector.py:18] Mooncake not available, MooncakeOmniConnector will not work
DEBUG 12-17 09:21:47 [factory.py:35] Registered connector: MooncakeConnector
DEBUG 12-17 09:21:47 [factory.py:35] Registered connector: SharedMemoryConnector
DEBUG 12-17 09:21:48 [distributed/device_communicators/shm_broadcast.py:313] Connecting to ipc:///tmp/5c30e5fa-26de-43e1-bd35-d551269b0fe2
DEBUG 12-17 09:21:48 [distributed/device_communicators/shm_broadcast.py:243] Binding to ipc:///tmp/7c1c23a5-2d1c-4f83-a6f2-36d8c4c71644
INFO 12-17 09:21:48 [distributed/device_communicators/shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10,
INFO 12-17 09:22:16 [diffusers_loader.py:214] Loading weights took 17.82 seconds
INFO 12-17 09:22:16 [gpu_worker.py:81] Model loading took 53.7462 GiB and 27.811149 seconds
INFO 12-17 09:22:16 [gpu_worker.py:86] Worker 0: Model loaded successfully.
INFO 12-17 09:22:16 [gpu_worker.py:237] Worker 0: Scheduler loop started.
INFO 12-17 09:22:16 [gpu_worker.py:175] Worker 0 ready to receive requests via shared memory
DEBUG 12-17 09:22:16 [diffusion_engine.py:147] All workers are ready
DEBUG 12-17 09:22:16 [distributed/device_communicators/shm_broadcast.py:313] Connecting to ipc:///tmp/7c1c23a5-2d1c-4f83-a6f2-36d8c4c71644
INFO 12-17 09:22:16 [scheduler.py:45] SyncScheduler initialized result MessageQueue
INFO 12-17 09:22:16 [omni_diffusion.py:114] OmniDiffusion initialized: model=path/models/Qwen-Image-Edit, class=QwenImageEditPipeline, init_ms=36702.19
Pipeline loaded
```
---

## omni_diffusion_%Y%m%d_%H%M%S_xx_pidxxxx.jsonl

```json
{"model": "path/models/Qwen-Image", "model_class": "QwenImagePipeline", "init_ms": 17562.917941002524, "event": "engine_load", "ts": 1766019712.405319, "pid": 18635, "host": "xxxx"}
{"n_requests": 1, "prompt_chars": 28, "height": 1024, "width": 1024, "generator": "<torch._C.Generator object at 0x7fc71d96e8f0>", "true_cfg_scale": 4.0, "num_inference_steps": 50, "num_outputs_per_prompt": 1, "event": "request_scheduled", "ts": 1766019712.405916, "pid": 18635, "host": "xxxx"}
{"n_requests": 1, "total_ms": 42437.41700099781, "diffusion_total_ms": 42437.13191100687, "denoise_avg_ms": 848.7426382201375, "input_tokens": 28, "input_tokens_per_s": 0.6597951048562086, "event": "request_finished", "ts": 1766019754.8433862, "pid": 18635, "host": "xxxx"}
```
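Because each line of the stats file is a self-contained JSON record tagged with an `event` field, per-request timings can be pulled out with a few lines of Python. A minimal sketch (the file path is illustrative; the field names match the sample records above):

```python
import json

def summarize_stats(path: str) -> None:
    """Print timing for each finished request in an omni_diffusion JSONL file."""
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            rec = json.loads(line)
            if rec.get("event") == "request_finished":
                print(f"total={rec['total_ms']:.1f} ms, "
                      f"denoise avg={rec['denoise_avg_ms']:.1f} ms, "
                      f"input tokens/s={rec['input_tokens_per_s']:.3f}")

# Example (illustrative filename):
summarize_stats("omni_diffusion_stats/omni_diffusion_20251217_092142_00_pid18635.jsonl")
```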
### 2. Without printing the vLLM features

Run the script without setting `VLLM_LOGGING_LEVEL`:
```bash
python image_edit.py \
    --image input.png \
    --prompt "Let this mascot dance under the moon, surrounded by floating stars and poetic bubbles such as 'Be Kind'" \
    --output output_image_edit.png \
    --num_inference_steps 50 \
    --cfg_scale 4.0
```
The vLLM INFO-level logs still appear in the console, and the diffusion statistics are still collected.
```text
INFO 12-17 09:28:58 [__init__.py:216] Automatically detected platform cuda.
WARNING 12-17 09:29:03 [mooncake_connector.py:18] Mooncake not available, MooncakeOmniConnector will not work
Loaded input image from input.png (size: (514, 556))
INFO 12-17 09:29:06 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_0c8120b1'), local_subscribe_addr='ipc:///tmp/7f7c25ae-cf87-4c4d-b79d-17cbb4ea00e2', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 12-17 09:29:06 [diffusion_engine.py:92] Starting server...
.......
INFO 12-17 09:29:26 [diffusion_engine.py:43] Pre-processing completed in 0.0564 seconds
INFO 12-17 09:30:26 [shm_broadcast.py:466] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation).
INFO 12-17 09:31:17 [diffusion_engine.py:48] Generation completed successfully.
INFO 12-17 09:31:17 [diffusion_engine.py:53] Post-processing completed in 0.0651 seconds
INFO 12-17 09:31:17 [omni_diffusion.py:177] request_finished: n_requests=1, total_ms=111360.70
INFO 12-17 09:31:17 [omni_diffusion.py:184] request_scheduled: n_requests=1, kwargs_keys=['pil_image', 'negative_prompt', 'generator', 'true_cfg_scale', 'num_inference_steps', 'num_outputs_per_prompt'], kwargs_detail={'generator': '<torch._C.Generator object at 0x7f0c71d328d0>', 'true_cfg_scale': 4.0, 'num_inference_steps': 50, 'num_outputs_per_prompt': 1}
INFO 12-17 09:31:17 [omni_diffusion.py:190] OMNI_DIFFUSION_METRICS {"prompt_chars": 103, "input_tokens": 103, "input_tokens_per_s": 0.9249223665998467, "num_inference_steps": 50, "diffusion_total_ms": 111360.43418200097, "denoise_avg_ms": 2227.2086836400194, "total_ms": 111360.69763200067}
Total generation time: 111.3614 seconds (111361.44 ms)
Saved edited image to path/vllm-omni/examples/offline_inference/image_to_image/output_image_edit.png
INFO 12-17 09:31:17 [gpu_worker.py:190] Worker 0: Received shutdown message
INFO 12-17 09:31:17 [gpu_worker.py:214] event loop terminated.
INFO 12-17 09:31:17 [gpu_worker.py:114] Worker 0: Destroyed process group
INFO 12-17 09:31:17 [gpu_worker.py:245] Worker 0: Shutdown complete.
```
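The `OMNI_DIFFUSION_METRICS` console line carries a JSON payload, so the same statistics can be recovered from a captured console log even without the JSONL file. A small illustrative filter (not part of the project; the log file path is an assumption):

```python
import json

MARKER = "OMNI_DIFFUSION_METRICS "

def metrics_from_console(path: str):
    """Yield the metrics dicts embedded in OMNI_DIFFUSION_METRICS log lines."""
    with open(path) as f:
        for line in f:
            idx = line.find(MARKER)
            if idx != -1:
                yield json.loads(line[idx + len(MARKER):])

for m in metrics_from_console("console.log"):  # path is illustrative
    print(m["total_ms"], m["denoise_avg_ms"])
```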
---

## 2. Usage of Log Statistics for Multi-Process, Multi-Stage Scheduling

In contrast, Qwen2.5-Omni and Qwen3-Omni adopt a multi-process, multi-stage scheduling model driven by OmniLLM. Rather than executing a single pipeline, the system decomposes the task into multiple stages, each running as an independent process.

- The core abstraction is a stage-based pipeline (see the conceptual sketch below):
  - Each stage performs a specific function (e.g., reasoning, generation, modality transformation).
  - Stages are connected via inter-process communication (IPC).
- Scheduling is pipeline-oriented:
  - Downstream stages are activated once upstream stages complete.
  - Multiple stages can overlap in time, enabling pipeline parallelism.
- System behavior can be observed through:
  - Omni-level logs (stage transitions and orchestration),
  - diffusion logs (if diffusion is involved in a stage),
  - vLLM logs (process lifecycle, execution, and resource usage).

> Qwen2.5-Omni / Qwen3-Omni → *multi-process, multi-stage pipeline with explicit scheduling*
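To make the paradigm concrete, here is a conceptual illustration only, not the project's implementation: three stage processes connected by IPC queues, where downstream stages start as soon as upstream output arrives.

```python
import multiprocessing as mp

def stage(stage_id, inbox, outbox):
    # Each stage is an independent process: pull a request from the inbox,
    # do this stage's work, and forward the result downstream.
    while (req := inbox.get()) is not None:
        result = f"{req} -> stage-{stage_id}"  # placeholder for real work
        if outbox is not None:
            outbox.put(result)
        else:
            print(result)
    if outbox is not None:
        outbox.put(None)  # propagate the shutdown sentinel downstream

if __name__ == "__main__":
    q0, q1, q2 = mp.Queue(), mp.Queue(), mp.Queue()
    procs = [mp.Process(target=stage, args=(0, q0, q1)),
             mp.Process(target=stage, args=(1, q1, q2)),
             mp.Process(target=stage, args=(2, q2, None))]
    for p in procs:
        p.start()
    q0.put("request-0")  # seed a request into stage-0, as the orchestrator does
    q0.put(None)         # shutdown sentinel
    for p in procs:
        p.join()
```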
1. Set the log switch:
```python
omni_llm = Omni(
    model=model_name,
    log_stats=args.enable_stats,  # set enable_stats=True to turn logging on
    log_file=(os.path.join(log_dir, "omni_llm_pipeline.log") if args.enable_stats else None),
)
```

or

```python
omni_llm = Omni(
    model=model_name,
    log_stats=True,
    log_file=os.path.join(log_dir, "omni_llm_pipeline.log"),
)
```
2. Run the script:

```bash
sh run_multiple_prompts.sh
```

or

```bash
sh run_single_prompt.sh
```

3. Enable vLLM debug logging:

```bash
export VLLM_LOGGING_LEVEL=DEBUG
```
The combined debug log (vLLM + omni + diffusion) can then be found in omni_llm_pipeline.log:
```text
2025-12-16 01:24:23,021 [PID:17815] DEBUG: [Orchestrator] generate() called
2025-12-16 01:24:23,021 [PID:17815] DEBUG: [Orchestrator] Seeding 1 requests into stage-0
2025-12-16 01:24:23,022 [PID:17815] DEBUG: [Orchestrator] Enqueued request 0_b3b2dcb1-4c75-42de-a073-dcef52b9e557 to stage-0
2025-12-16 01:24:23,023 [PID:17815] DEBUG: [Orchestrator] Entering scheduling loop: total_requests=1, stages=3
2025-12-16 01:24:26,527 [PID:17815] INFO: [StageMetrics] stage=0 req=0_b3b2dcb1-4c75-42de-a073-dcef52b9e557 metrics={'num_tokens_out': 52, 'stage_gen_time_ms': 3490.6439781188965, 'batch_id': 1, 'rx_decode_time_ms': 0.036716461181640625, 'rx_transfer_bytes': 339, 'rx_in_flight_time_ms': 0.0}
2025-12-16 01:24:26,527 [PID:17815] DEBUG: [Orchestrator] Stage-0 completed request 0_b3b2dcb1-4c75-42de-a073-dcef52b9e557; forwarding or finalizing
2025-12-16 01:24:26,527 [PID:17815] DEBUG: [Orchestrator] Request 0_b3b2dcb1-4c75-42de-a073-dcef52b9e557 finalized at stage-0
2025-12-16 01:24:26,780 [PID:17815] DEBUG: [Orchestrator] Forwarded request 0_b3b2dcb1-4c75-42de-a073-dcef52b9e557 to stage-1
2025-12-16 01:24:44,789 [PID:17815] INFO: [StageMetrics] stage=1 req=0_b3b2dcb1-4c75-42de-a073-dcef52b9e557 metrics={'num_tokens_out': 170, 'stage_gen_time_ms': 17991.965770721436, 'batch_id': 1, 'rx_decode_time_ms': 5.737543106079102, 'rx_transfer_bytes': 3148794, 'rx_in_flight_time_ms': 1.1227130889892578}
2025-12-16 01:24:44,789 [PID:17815] DEBUG: [Orchestrator] Stage-1 completed request 0_b3b2dcb1-4c75-42de-a073-dcef52b9e557; forwarding or finalizing
2025-12-16 01:24:44,790 [PID:17815] DEBUG: [Orchestrator] Forwarded request 0_b3b2dcb1-4c75-42de-a073-dcef52b9e557 to stage-2
2025-12-16 01:24:44,914 [PID:17815] INFO: [StageMetrics] stage=2 req=0_b3b2dcb1-4c75-42de-a073-dcef52b9e557 metrics={'num_tokens_out': 0, 'stage_gen_time_ms': 117.71297454833984, 'batch_id': 1, 'rx_decode_time_ms': 0.43487548828125, 'rx_transfer_bytes': 8393, 'rx_in_flight_time_ms': 0.5235671997070312}
2025-12-16 01:24:44,915 [PID:17815] DEBUG: [Orchestrator] Stage-2 completed request 0_b3b2dcb1-4c75-42de-a073-dcef52b9e557; forwarding or finalizing
2025-12-16 01:24:44,915 [PID:17815] DEBUG: [Orchestrator] Request 0_b3b2dcb1-4c75-42de-a073-dcef52b9e557 finalized at stage-2
2025-12-16 01:24:44,915 [PID:17815] DEBUG: [Orchestrator] Request 0_b3b2dcb1-4c75-42de-a073-dcef52b9e557 fully completed (1/1)
2025-12-16 01:24:44,915 [PID:17815] DEBUG: [Orchestrator] All requests completed
2025-12-16 01:24:44,915 [PID:17815] INFO: [Summary] {'e2e_requests': 1, 'e2e_total_time_ms': 21893.684148788452, 'e2e_sum_time_ms': 21892.935752868652, 'e2e_total_tokens': 0, 'e2e_avg_time_per_request_ms': 21892.935752868652, 'e2e_avg_tokens_per_s': 0.0, 'wall_time_ms': 21893.684148788452, 'final_stage_id': 2, 'stages': [{'stage_id': 0, 'requests': 1, 'tokens': 52, 'total_time_ms': 3505.100727081299, 'avg_time_per_request_ms': 3505.100727081299, 'avg_tokens_per_s': 14.835522299897058}, {'stage_id': 1, 'requests': 1, 'tokens': 170, 'total_time_ms': 18008.86106491089, 'avg_time_per_request_ms': 18008.86106491089, 'avg_tokens_per_s': 9.43979740791238}, {'stage_id': 2, 'requests': 1, 'tokens': 0, 'total_time_ms': 124.7246265411377, 'avg_time_per_request_ms': 124.7246265411377, 'avg_tokens_per_s': 0.0}], 'transfers': [{'from_stage': 0, 'to_stage': 1, 'samples': 1, 'total_bytes': 3148794, 'total_time_ms': 5.67626953125, 'tx_mbps': 4437.835776, 'rx_samples': 1, 'rx_total_bytes': 3148794, 'rx_total_time_ms': 5.737543106079102, 'rx_mbps': 4390.442308539705, 'total_samples': 1, 'total_transfer_time_ms': 12.53652572631836, 'total_mbps': 2009.3567029593396}, {'from_stage': 1, 'to_stage': 2, 'samples': 1, 'total_bytes': 8393, 'total_time_ms': 0.35572052001953125, 'tx_mbps': 188.7549247828418, 'rx_samples': 1, 'rx_total_bytes': 8393, 'rx_total_time_ms': 0.43487548828125, 'rx_mbps': 154.39821698245615, 'total_samples': 1, 'total_transfer_time_ms': 1.3141632080078125, 'total_mbps': 51.092588493468796}]}
```
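The trailing `[Summary]` record is the most useful profiling signal: it breaks the end-to-end time down per stage and per inter-stage transfer. Judging from the sample output above, its payload is a Python-literal dict (single-quoted keys) rather than JSON, so a sketch of extracting per-stage time shares could look like this (the parsing approach is an assumption based on that sample):

```python
import ast

def stage_shares(path: str) -> None:
    """Print each stage's share of wall time from the [Summary] log record."""
    with open(path) as f:
        for line in f:
            if "[Summary]" not in line:
                continue
            summary = ast.literal_eval(line.split("[Summary]", 1)[1].strip())
            wall = summary["wall_time_ms"]
            for s in summary["stages"]:
                share = 100.0 * s["total_time_ms"] / wall
                print(f"stage {s['stage_id']}: {s['total_time_ms']:.0f} ms "
                      f"({share:.1f}% of wall time)")

stage_shares("omni_llm_pipeline.log")
```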
If you do not need to print the vLLM features, you can run the script directly, or unset `VLLM_LOGGING_LEVEL`:

```bash
unset VLLM_LOGGING_LEVEL
```
Profiling is only intended for vLLM-Omni developers and maintainers to understand the proportion of time spent in different parts of the codebase. **vLLM-Omni end-users should never turn on profiling** as it will significantly slow down the inference.

Review comment: add some background telling users why the vLLM profiling method cannot be used directly here, and what the different scenarios are.