v0.10.1.1
We are very pleased to announce the official release of vLLM Kunlun v0.10.1.1!
Going forward, if there is demand, we will continue to release patch updates and feature enhancement versions, and will periodically share the latest features and models supported by vLLM Kunlun. Stay tuned.
0.10.1.1 Release
Highlights✨
- Comprehensive multimodal enhancements: 5+ multimodal model series are now supported, with overall inference throughput reaching up to 90% of the Axx platform.
- A major breakthrough in sampling performance eliminates the Top-K sorting bottleneck; when enabled, end-to-end throughput can improve by up to 10× compared to the native implementation.
- Quantized inference is now production-ready, with AWQ / GPTQ quantization supported for dense models. Compared to FP16:
  - GPU memory usage is significantly reduced.
  - Compute throughput is doubled.
- Support for multi-LoRA inference.
- Support for Piecewise CUDA Graph, significantly reducing scheduling and kernel launch overhead.
- Support for the vLLM V1 inference engine.
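As a rough illustration of where the memory reduction from AWQ / GPTQ comes from, the sketch below estimates weight storage only (no KV cache or activations). The helper name `weight_bytes`, the group size of 128, and the 16-bit scale plus 4-bit zero point per group are assumptions chosen for illustration, not values taken from this release.

```python
import math


def weight_bytes(num_params, bits_per_weight, group_size=None,
                 scale_bits=16, zero_bits=0):
    """Estimate bytes needed to store model weights (hypothetical helper).

    group_size, scale_bits and zero_bits model the per-group metadata that
    group-wise 4-bit schemes typically store; the defaults are assumptions.
    """
    bits = num_params * bits_per_weight
    if group_size:
        groups = math.ceil(num_params / group_size)
        bits += groups * (scale_bits + zero_bits)
    return bits / 8


# Example: a dense model with 7e9 parameters (an illustrative size).
fp16 = weight_bytes(7_000_000_000, 16)
int4 = weight_bytes(7_000_000_000, 4, group_size=128,
                    scale_bits=16, zero_bits=4)
print(f"FP16 weights: {fp16 / 1e9:.2f} GB")
print(f"INT4 weights: {int4 / 1e9:.2f} GB")
print(f"ratio: {fp16 / int4:.2f}x")
```

Under these assumptions the weights shrink by a bit less than 4×, because the per-group scales and zero points add a small overhead on top of the 4-bit values.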
Supported models
- Qwen2.5
- Qwen2.5-VL
- Qwen3
- Qwen3-MoE
- GLM4.1v
- GLM4.5
- GLM4.5Air
- GLM4.5v
- InternVL25
- InternVL35
- QiFanVL
Operator updates🚀
- KLX xtorch_ops operator library
  - Added Flash-Infer Top-K / Top-P sampling operators. Compared to the original sorting-based logic, sampling-stage performance is improved by tens to hundreds of times.
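For readers unfamiliar with the sorting-based logic that the new operators replace, here is a minimal pure-Python sketch of Top-K followed by Top-P filtering. It is a reference for the semantics only, under one common convention (renormalize after Top-K, then apply Top-P); it is not the xtorch_ops operator or the Flash-Infer kernel, and the function name `top_k_top_p_filter` is hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_top_p_filter(logits, top_k, top_p):
    """Return renormalized probabilities after Top-K, then Top-P filtering."""
    probs = softmax(logits)

    # Full sort over the vocabulary: this is the step that becomes a
    # bottleneck for large vocabularies in a sorting-based implementation.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Top-K: keep the k most likely tokens and renormalize over them.
    order = order[:top_k]
    k_mass = sum(probs[i] for i in order)

    # Top-P: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i] / k_mass
        if cumulative >= top_p:
            break

    # Renormalize over the surviving tokens; all others get probability 0.
    mass = sum(probs[i] for i in kept)
    out = [0.0] * len(probs)
    for i in kept:
        out[i] = probs[i] / mass
    return out


filtered = top_k_top_p_filter([2.0, 1.0, 0.0, -1.0], top_k=3, top_p=0.8)
print(filtered)
```

A token is then sampled from the returned distribution. The sort costs O(V log V) per request for a vocabulary of size V, which is the cost a sort-free sampling operator avoids.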
Bug fixes❤️‍🩹
- Fixed issues with YaRN positional encoding, resolving garbled outputs in some models when exceeding the native context length.
- Fixed Rotary Positional Encoding (RoPE) precision issues.
- Fixed abnormal errors when repetition_penalty > 1.
- Fixed XPU INT4 data layout issues, significantly improving the performance of AWQ / GPTQ–related operators on XPU.
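For context on the repetition_penalty fix, the sketch below shows the commonly used definition of the penalty: logits of tokens that have already appeared are divided by the penalty when positive and multiplied by it when negative, so values above 1 always make repeated tokens less likely. This is an illustration of the semantics only, not the vLLM Kunlun code path; `apply_repetition_penalty` is a hypothetical name.

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty):
    """Penalize tokens that already appeared in the prompt or output.

    logits: list of floats indexed by token id.
    seen_token_ids: iterable of token ids seen so far (duplicates allowed).
    penalty: float; 1.0 leaves logits unchanged, > 1.0 discourages repeats.
    """
    out = list(logits)
    for token_id in set(seen_token_ids):
        if out[token_id] > 0:
            out[token_id] /= penalty  # shrink positive logits toward zero
        else:
            out[token_id] *= penalty  # push negative logits further down
    return out


penalized = apply_repetition_penalty([2.0, -1.0, 0.5], [0, 1, 1], penalty=2.0)
print(penalized)
```

Each seen token is penalized once regardless of how many times it occurred, which is why the ids are deduplicated before the loop.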
Known issues⚠️
- Errors may occur when invoking xgrammar in Function Call scenarios.
  - Cause: The relevant operators are not yet supported.
  - Future: Support will be gradually added in upcoming releases.