Performance
- Improved performance on Intel Data Center GPU Flex 60 and Flex 70 for Qwen3-30B MoE model family.
- Improved multinomial algorithm performance, reducing latency for generation with temperature > 0.
- Improved model loading and pipeline initialization performance for new inference requests.
New models and hardware support
- Restored support for generative models on hosts with CPUs without AVX2 instruction set when using supported discrete GPUs.
- Added support for Xe GPUs for MoE models, including Intel Arc A770.
- Enabled execution of GPT-OSS-20b with INT8 precision and GPT-OSS-120b with INT4 precision on GPU.
- Enabled models and support for MoE for Qwen3.5, Qwen3.6, Qwen3-Coder-Next Demo
- Fixed chat template rendering for Granite models when processing non-ASCII characters.
- Added tool parsers for Gemma 4 and LFM2 models.
Deployment ease
-
Improved default performance tuning to use resource constraints in Docker containers, with default number of REST workers, OpenVINO inference streams, threads, and CPU pinning configurations avoiding quota and ulimit settings on Linux to prevent overallocation and performance degradation in Docker and Kubernetes environments.
-
Enhanced deployment capabilities with local generative model startup options and runtime parameter configuration through CLI, enabling generative model deployment from read-only filesystems with configurable runtime parameters such as target device and cache size for seamless KServe and OpenShift integration. Demo
-
Improved model pulling recovery mechanisms to resume interrupted Hugging Face model downloads from the previous checkpoint in case of failures or interruptions. Link
New or improved endpoints capabilities
-
Added initial support for
/responsesendpoint. Reference -
Fixed server readiness endpoint behavior -
/v2/health/readynow correctly reports success when all models are fully initialized and returns appropriate errors when models are not loaded. -
Added
min_psampling parameter for enhanced generation control. Link -
Added
skip_special_tokenssampling parameter - when set to False, returns raw model responses including special tokens to users. Link -
Fixed default seed parameter to use random values, ensuring non-deterministic responses from LLM models.
-
Added LoRA adapter support for image generation models. Demo
-
Added support of streaming for audio/transcriptions endpoint. Link
-
Introduced
OVMS_AUDIO_MAX_FILE_SIZE_BYTESenvironment variable that controls the upper bound on memory that a single audio request can allocate for decoded data. Link
Limitations
-
Gemma 4 and LFM2 MoE models supported without Continuous Batching.
-
/responsesendpoint doesn't include built-in tools, audio input and multimodal output. There are also no session management capabilities. -
Using prefix caching with new Linear Attention models such as Qwen3.5/Qwen3.6 consumes exceeding amount of memory which will be addressed shortly.
You can use an OpenVINO Model Server public Docker images based on Ubuntu via the following command:
docker pull openvino/model_server:2026.2- CPU device support with image based on Ubuntu 24.04docker pull openvino/model_server:2026.2-gpu- GPU, NPU and CPU device support with image based on Ubuntu 24.04
or use provided binary packages. Only packages with suffix _python_on have support for python.
There is also additional distribution channel via https://storage.openvinotoolkit.org/repositories/openvino_model_server/packages/2026.2.0/
Check the instructions how to install the binary package. The prebuilt image is available also on RedHat Ecosystem Catalog