Conversation

@coffeerunhobby (Contributor)

Description

Fix: Prevent BMI2 instruction crash on AVX-only CPUs

Problem

The llama-cpp-avx binary incorrectly includes BMI2 instructions despite being built for AVX-only compatibility. This causes crashes on CPUs with AVX but without BMI2 support (e.g., Intel Sandy Bridge, Ivy Bridge from 2011-2013).

Error symptoms:

  • Process terminates with rpc error: code = Unavailable desc = error reading from server: EOF
  • Crash occurs during model warmup
  • Illegal instruction on CPUs with AVX but no BMI2
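
To check whether a given host is affected, one can inspect the CPU flags directly (a Linux sketch; grep -w matches whole words, so "avx2" does not count as "avx"):

grep -ow avx /proc/cpuinfo | head -1    # prints "avx" if AVX is supported
grep -ow bmi2 /proc/cpuinfo | head -1   # prints nothing on an affected CPU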

Root Cause

llama.cpp's CMake automatically enables BMI2 when detecting x86_64 architecture, even when building AVX-only binaries. The llama-cpp-avx target is intended for older CPUs that have AVX but lack newer instruction sets.
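
One way to confirm this default is to configure llama.cpp once and list the cached options (a sketch; the source path and throwaway build directory are illustrative):

cmake -B /tmp/ggml-probe llama.cpp -LH 2>/dev/null | grep GGML_BMI2
# Expected on x86_64 (per the root cause above): GGML_BMI2:BOOL=ON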

Solution

Add -DGGML_BMI2=off (and -DGGML_BMI=off for fallback) to CMake args for:

  • llama-cpp-avx: Disable BMI2 for AVX-only CPUs
  • llama-cpp-fallback: Disable both BMI and BMI2 for maximum compatibility
  • llama-cpp-grpc: Disable both BMI and BMI2 for RPC server compatibility
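
Concretely, the AVX-only configure step ends up along these lines (a sketch; the build directory and source path are illustrative, and per the review below only GGML_BMI2 exists as a ggml option):

cmake -B build-avx llama.cpp \
  -DGGML_AVX=on -DGGML_AVX2=off -DGGML_AVX512=off \
  -DGGML_FMA=off -DGGML_F16C=off \
  -DGGML_BMI2=off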

Notes for Reviewers

Testing

Verified that the binaries no longer contain BMI2 instructions in the actual CPU inference code (lower addresses, typically the 0x400000-0xffffff range):

objdump -d /tmp/llama-backend-128/llama-cpp-avx | grep -E '\b(shlx|mulx|rorx|sarx|shrx|pdep|pext)\b' | awk '$1 < "100000:"' | head -20
# Returns nothing ✓

Also added a CUDA_DOCKER_ARCH build option to compile llama-cpp for Blackwell GPUs (CUDA Toolkit 13.x compatibility: CMake 3.31.9+ fixes toolchain detection/arch table issues).
Tested on Ubuntu 24.04 with an Intel E3-1240 v2, a GeForce RTX 5060 Ti 16GB (SM_120), NVIDIA driver 570-open, and CUDA 12.8:

# Run from the LocalAI checkout:
DOCKER_BUILDKIT=1 docker build --pull --progress=plain -f backend/Dockerfile.llama-cpp \
  --build-arg CMAKE_FROM_SOURCE=true \
  --build-arg CMAKE_VERSION=3.31.10 \
  --build-arg BUILD_TYPE=cublas \
  --build-arg CUDA_MAJOR_VERSION=12 \
  --build-arg CUDA_MINOR_VERSION=8 \
  --build-arg UBUNTU_VERSION=2204 \
  --build-arg CUDA_DOCKER_ARCH='75;86;89;120' \
  -t localai-llama-cpp-backend:cuda128 .
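
For reference, the CUDA_DOCKER_ARCH values are CUDA SM architecture numbers: 75 = Turing, 86 = Ampere, 89 = Ada, and 120 = Blackwell (the RTX 5060 Ti used for testing above).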

  • Yes, I signed my commits.

@netlify

netlify bot commented Jan 4, 2026

Deploy Preview for localai ready!

🔨 Latest commit: 90df093
🔍 Latest deploy log: https://app.netlify.com/projects/localai/deploys/695c3f6ebb947e0008a0892f
😎 Deploy Preview: https://deploy-preview-7864--localai.netlify.app

@@ -1,6 +1,6 @@
 set(TARGET grpc-server)
+# CUDA Toolkit 13.x compatibility: CMake 3.31.9+ fixes toolchain detection/arch table issues

@mudler (Owner) commented Jan 5, 2026:

Is this really needed? If we need to do that, it requires compiling CMake in the build process. Doable, but it adds to compilation and CI times. If there is no specific reason for it, I would avoid doing so for now.

@coffeerunhobby (Contributor, Author) replied:

I had trouble compiling for RTX 5060 (SM_120) and the only version that consistently worked was CMake 3.31.10. I tried multiple 4.0.x versions and lower CMake versions, but none succeeded. I’d prefer we standardize on 3.31.10 for now - it looks like the safest option, and PyTorch also uses it. Also worth noting: 3.31.9 includes a fix related to CUDA 13, which may be connected to what we’re seeing.

 ifeq ($(OS),Darwin)
 CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off" $(MAKE) VARIANT="llama-cpp-avx-build" build-llama-cpp-grpc-server
-else ifeq ($(ARCH),$(filter $(ARCH),aarch64 arm64))
+else ifneq ($(filter $(ARCH),aarch64 arm64),)

@mudler (Owner) commented:

Any specific reason? I find ifeq more readable.

@coffeerunhobby (Contributor, Author) replied:

The reason here is:

  1. Skip BMI flags on Darwin (macOS)
  2. Skip BMI flags on ARM (aarch64/arm64)
  3. Add BMI flags on x86_64 (the else case)

The ifneq checks if ARCH matches aarch64 or arm64. When the filter finds a match, the result is non-empty, so we skip the flags.

I can change to ifeq if you prefer, but the logic would need to invert:

ifeq ($(OS),Darwin)
    # No BMI flags (Darwin)
else ifeq ($(filter $(ARCH),aarch64 arm64),)
    # This is x86_64 - ADD BMI flags here
else
    # This is ARM - NO BMI flags
endif
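
As a tiny runnable sketch of the $(filter ...) idiom (the /tmp path and the messages are just for illustration):

cat > /tmp/filter-demo.mk <<'EOF'
ARCH ?= x86_64
ifneq ($(filter $(ARCH),aarch64 arm64),)
$(info $(ARCH) -> ARM: skip the BMI2-disabling flags)
else
$(info $(ARCH) -> x86: add -DGGML_BMI2=off)
endif
all: ;
EOF
make -f /tmp/filter-demo.mk ARCH=x86_64    # prints: x86_64 -> x86: add -DGGML_BMI2=off
make -f /tmp/filter-demo.mk ARCH=aarch64   # prints: aarch64 -> ARM: skip the BMI2-disabling flags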

 CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off" $(MAKE) VARIANT="llama-cpp-avx-build" build-llama-cpp-grpc-server
 else
-CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DCMAKE_C_FLAGS=-mno-bmi2 -DCMAKE_CXX_FLAGS=-mno-bmi2" $(MAKE) VARIANT="llama-cpp-avx-build" build-llama-cpp-grpc-server
+CFLAGS="-mno-bmi2" CXXFLAGS="-mno-bmi2" CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI=off -DGGML_BMI2=off" $(MAKE) VARIANT="llama-cpp-avx-build" build-llama-cpp-grpc-server

@mudler (Owner) commented:

GGML_BMI does not exist: https://github.com/ggml-org/ggml/blob/ebc3a0f4a56be1c9424a89fbec09962ac34fde85/CMakeLists.txt#L155

Do we also need CFLAGS/CXXFLAGS? If not, let's drop them; GGML_BMI2 should be enough.

@coffeerunhobby (Contributor, Author) replied:

You're right here. -DGGML_BMI2=off alone works (as I tested in the successful build) - the CFLAGS/CXXFLAGS were just me being overly cautious after fighting with compiler flags for too long.

 ifeq ($(OS),Darwin)
 CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off" $(MAKE) VARIANT="llama-cpp-fallback-build" build-llama-cpp-grpc-server
-else ifeq ($(ARCH),$(filter $(ARCH),aarch64 arm64))
+else ifneq ($(filter $(ARCH),aarch64 arm64),)

@mudler (Owner) commented:

ditto about logic inversion

 CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off" $(MAKE) VARIANT="llama-cpp-fallback-build" build-llama-cpp-grpc-server
 else
-CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DCMAKE_C_FLAGS='-mno-bmi -mno-bmi2' -DCMAKE_CXX_FLAGS='-mno-bmi -mno-bmi2'" $(MAKE) VARIANT="llama-cpp-fallback-build" build-llama-cpp-grpc-server
+CFLAGS="-mno-bmi -mno-bmi2" CXXFLAGS="-mno-bmi -mno-bmi2" CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI=off -DGGML_BMI2=off" $(MAKE) VARIANT="llama-cpp-fallback-build" build-llama-cpp-grpc-server

@mudler (Owner) commented:

ditto above

 ifeq ($(OS),Darwin)
 CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_RPC=ON -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off" TARGET="--target grpc-server --target rpc-server" $(MAKE) VARIANT="llama-cpp-grpc-build" build-llama-cpp-grpc-server
-else ifeq ($(ARCH),$(filter $(ARCH),aarch64 arm64))
+else ifneq ($(filter $(ARCH),aarch64 arm64),)

@mudler (Owner) commented:

ditto

 CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_RPC=ON -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off" TARGET="--target grpc-server --target rpc-server" $(MAKE) VARIANT="llama-cpp-grpc-build" build-llama-cpp-grpc-server
 else
-CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_RPC=ON -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DCMAKE_C_FLAGS='-mno-bmi -mno-bmi2' -DCMAKE_CXX_FLAGS='-mno-bmi -mno-bmi2'" TARGET="--target grpc-server --target rpc-server" $(MAKE) VARIANT="llama-cpp-grpc-build" build-llama-cpp-grpc-server
+CFLAGS="-mno-bmi -mno-bmi2" CXXFLAGS="-mno-bmi -mno-bmi2" CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_RPC=ON -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI=off -DGGML_BMI2=off" TARGET="--target grpc-server --target rpc-server" $(MAKE) VARIANT="llama-cpp-grpc-build" build-llama-cpp-grpc-server

@mudler (Owner) commented:

ditto

@@ -1,4 +1,5 @@
-cmake_minimum_required(VERSION 3.12)
+# CUDA Toolkit 13.x compatibility: CMake 3.31.9+ fixes toolchain detection/arch table issues
+cmake_minimum_required(VERSION 3.31.10)

@mudler (Owner) commented:

ditto

@mudler mudler enabled auto-merge (squash) January 5, 2026 22:47
@mudler mudler merged commit 5add7b4 into mudler:master Jan 6, 2026
38 of 39 checks passed