Release OpenVINO Model Server 2026.2 · openvinotoolkit/model_server

Performance

Improved performance on Intel Data Center GPU Flex 60 and Flex 70 for the Qwen3-30B MoE model family.
Improved multinomial algorithm performance, reducing latency for generation with temperature > 0.
Improved model loading and pipeline initialization performance for new inference requests.
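For context on the multinomial sampling item above: with temperature > 0, the next token is drawn at random from the temperature-scaled softmax distribution rather than picked greedily. The sketch below illustrates that general technique in plain Python. It is not the server's implementation, and the function name sample_with_temperature is hypothetical.

```python
import math
import random


def sample_with_temperature(logits, temperature, rng=None):
    """Draw one token index from softmax(logits / temperature)."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0 for multinomial sampling")
    rng = rng or random.Random()
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    # random.choices performs the weighted (multinomial) draw;
    # the weights do not need to be normalized.
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]
```

Lower temperatures concentrate the weights on the highest logit, while higher temperatures flatten the distribution and make less likely tokens more probable.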

Restored support for generative models on hosts whose CPUs lack the AVX2 instruction set, when using supported discrete GPUs.
Added support for Xe GPUs for MoE models, including Intel Arc A770.
Enabled execution of GPT-OSS-20b with INT8 precision and GPT-OSS-120b with INT4 precision on GPU.
Enabled MoE support for the Qwen3.5, Qwen3.6, and Qwen3-Coder-Next models. Demo
Fixed chat template rendering for Granite models when processing non-ASCII characters.
Added tool parsers for Gemma 4 and LFM2 models.

Improved default performance tuning to respect resource constraints in Docker containers. The default number of REST workers, OpenVINO inference streams, and threads, as well as the CPU pinning configuration, now honor quota and ulimit settings on Linux, preventing overallocation and performance degradation in Docker and Kubernetes environments.
Enhanced deployment capabilities with local generative model startup options and runtime parameter configuration through the CLI. Generative models can now be deployed from read-only filesystems with configurable runtime parameters such as target device and cache size, which simplifies KServe and OpenShift integration. Demo
Improved model pulling recovery: interrupted Hugging Face model downloads now resume from the previous checkpoint after a failure. Link
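To illustrate the container-aware defaults described above: on Linux with cgroup v2, the CPU quota of a container is exposed in the cpu.max file as "<quota> <period>", or "max <period>" when no limit is set. The sketch below shows one way to derive an effective CPU count from it. It is an illustration under those assumptions, not the server's actual tuning logic, and the helper names parse_cpu_max and effective_cpu_count are hypothetical.

```python
import math
import os
from pathlib import Path


def parse_cpu_max(text):
    """Parse cgroup v2 cpu.max content ("<quota> <period>" or "max <period>").

    Returns the CPU limit as a float, or None when no usable quota is set.
    """
    fields = text.split()
    if len(fields) != 2 or fields[0] == "max":
        return None
    try:
        quota, period = int(fields[0]), int(fields[1])
    except ValueError:
        return None
    if quota <= 0 or period <= 0:
        return None
    return quota / period


def effective_cpu_count(cpu_max_path="/sys/fs/cgroup/cpu.max"):
    """Estimate usable CPUs as the smaller of visible CPUs and the quota."""
    try:
        visible = len(os.sched_getaffinity(0))  # available on Linux only
    except AttributeError:
        visible = os.cpu_count() or 1
    try:
        limit = parse_cpu_max(Path(cpu_max_path).read_text())
    except OSError:
        limit = None  # file missing: cgroup v1 or a non-Linux host
    if limit is None:
        return visible
    # Round fractional quotas up, but never report fewer than one CPU.
    return max(1, min(visible, math.ceil(limit)))
```

Sizing thread pools from a value like this, instead of the host's total core count, avoids creating more workers than the container quota can actually schedule.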

Added initial support for the /responses endpoint. Reference
Fixed server readiness endpoint behavior - /v2/health/ready now correctly reports success when all models are fully initialized and returns appropriate errors when models are not loaded.
Added min_p sampling parameter for enhanced generation control. Link
Added skip_special_tokens sampling parameter. When set to False, responses contain the raw model output, including special tokens. Link
Fixed default seed parameter to use random values, ensuring non-deterministic responses from LLMs.
Added LoRA adapter support for image generation models. Demo
Added streaming support for the audio/transcriptions endpoint. Link
Introduced the OVMS_AUDIO_MAX_FILE_SIZE_BYTES environment variable, which sets the upper bound on the memory that a single audio request can allocate for decoded data. Link
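As background for the min_p parameter listed above: in its commonly used definition, min_p sampling discards every token whose probability is below min_p times the probability of the most likely token, then renormalizes the rest. The sketch below assumes that definition; the release notes do not describe how the server implements it, and the function name apply_min_p is hypothetical.

```python
def apply_min_p(probs, min_p):
    """Zero out tokens below min_p * max(probs), then renormalize.

    Assumes probs is a valid distribution with at least one positive entry.
    """
    if not 0.0 <= min_p <= 1.0:
        raise ValueError("min_p must be in [0, 1]")
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)  # always > 0 because the top token is kept
    return [p / total for p in kept]
```

Because the cutoff scales with the top token's probability, the filter is strict when the model is confident and permissive when the distribution is flat.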

Gemma 4 and LFM2 MoE models are supported without Continuous Batching.
The /responses endpoint does not include built-in tools, audio input, or multimodal output. It also has no session management capabilities.
Using prefix caching with new Linear Attention models such as Qwen3.5/Qwen3.6 consumes an excessive amount of memory. This will be addressed shortly.

You can use the OpenVINO Model Server public Docker images based on Ubuntu via the following commands:

docker pull openvino/model_server:2026.2 - CPU device support with image based on Ubuntu 24.04
docker pull openvino/model_server:2026.2-gpu - GPU, NPU and CPU device support with image based on Ubuntu 24.04

or use the provided binary packages. Only packages with the suffix _python_on support Python.

Check the instructions on how to install the binary package. The prebuilt image is also available in the Red Hat Ecosystem Catalog.