Conversation

@coffeerunhobby (Contributor)

Description

Fix: Prevent BMI2 instruction crash on AVX-only CPUs

Problem

The llama-cpp-avx binary incorrectly includes BMI2 instructions despite being built for AVX-only compatibility. This causes crashes on CPUs with AVX but without BMI2 support (e.g., Intel Sandy Bridge, Ivy Bridge from 2011-2013).

Error symptoms:

  • Process terminates with rpc error: code = Unavailable desc = error reading from server: EOF
  • Crash occurs during model warmup
  • Illegal instruction on CPUs with AVX but no BMI2
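
To check whether a given host is affected, one can inspect the CPU flags directly (a Linux sketch; grep -w matches whole words, so "avx2" does not count as "avx"):

grep -ow avx /proc/cpuinfo | head -1    # prints "avx" if AVX is supported
grep -ow bmi2 /proc/cpuinfo | head -1   # prints nothing on an affected CPU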

Root Cause

llama.cpp's CMake automatically enables BMI2 when detecting x86_64 architecture, even when building AVX-only binaries. The llama-cpp-avx target is intended for older CPUs that have AVX but lack newer instruction sets.
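
One way to confirm this default is to configure llama.cpp once and list the cached options (a sketch; the source path and throwaway build directory are illustrative):

cmake -B /tmp/ggml-probe llama.cpp -LH 2>/dev/null | grep GGML_BMI2
# Expected on x86_64 (per the root cause above): GGML_BMI2:BOOL=ON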

Solution

Add -DGGML_BMI2=off (and -DGGML_BMI=off for fallback) to CMake args for:

  • llama-cpp-avx: Disable BMI2 for AVX-only CPUs
  • llama-cpp-fallback: Disable both BMI and BMI2 for maximum compatibility
  • llama-cpp-grpc: Disable both BMI and BMI2 for RPC server compatibility
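
Concretely, the AVX-only configure step ends up along these lines (a sketch; the build directory and source path are illustrative, and per the review below only GGML_BMI2 exists as a ggml option):

cmake -B build-avx llama.cpp \
  -DGGML_AVX=on -DGGML_AVX2=off -DGGML_AVX512=off \
  -DGGML_FMA=off -DGGML_F16C=off \
  -DGGML_BMI2=off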

Notes for Reviewers

Testing

Verified that the binaries no longer contain BMI2 instructions in the actual CPU inference code (lower addresses, typically the 0x400000-0xffffff range):

objdump -d /tmp/llama-backend-128/llama-cpp-avx | grep -E '\b(shlx|mulx|rorx|sarx|shrx|pdep|pext)\b' | awk '$1 < "100000:"' | head -20
# Returns nothing ✓

Also added a CUDA_DOCKER_ARCH build option to compile llama-cpp for Blackwell GPUs (CUDA Toolkit 13.x compatibility: CMake 3.31.9+ fixes toolchain detection/arch table issues).
Tested on Ubuntu 24.04 with an Intel E3-1240 v2, a GeForce RTX 5060 Ti 16GB (SM_120), NVIDIA driver 570-open, and CUDA 12.8:

# Run from the LocalAI checkout:
DOCKER_BUILDKIT=1 docker build --pull --progress=plain -f backend/Dockerfile.llama-cpp \
  --build-arg CMAKE_FROM_SOURCE=true \
  --build-arg CMAKE_VERSION=3.31.10 \
  --build-arg BUILD_TYPE=cublas \
  --build-arg CUDA_MAJOR_VERSION=12 \
  --build-arg CUDA_MINOR_VERSION=8 \
  --build-arg UBUNTU_VERSION=2204 \
  --build-arg CUDA_DOCKER_ARCH='75;86;89;120' \
  -t localai-llama-cpp-backend:cuda128 .
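
For reference, the CUDA_DOCKER_ARCH values are CUDA SM architecture numbers: 75 = Turing, 86 = Ampere, 89 = Ada, and 120 = Blackwell (the RTX 5060 Ti used for testing above).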

  • Yes, I signed my commits.

@netlify

netlify bot commented Jan 4, 2026

Deploy Preview for localai ready!

🔨 Latest commit: 90df093
🔍 Latest deploy log: https://app.netlify.com/projects/localai/deploys/695c3f6ebb947e0008a0892f
😎 Deploy Preview: https://deploy-preview-7864--localai.netlify.app

@@ -1,6 +1,6 @@
 set(TARGET grpc-server)
+# CUDA Toolkit 13.x compatibility: CMake 3.31.9+ fixes toolchain detection/arch table issues

@mudler (Owner) commented Jan 5, 2026:

Is this really needed? If we need to do that, it requires compiling CMake in the build process. Doable, but it adds to compilation and CI times. If there is no specific reason for it, I would avoid doing so for now.

@coffeerunhobby (Contributor, Author) replied:

I had trouble compiling for RTX 5060 (SM_120) and the only version that consistently worked was CMake 3.31.10. I tried multiple 4.0.x versions and lower CMake versions, but none succeeded. I’d prefer we standardize on 3.31.10 for now - it looks like the safest option, and PyTorch also uses it. Also worth noting: 3.31.9 includes a fix related to CUDA 13, which may be connected to what we’re seeing.

 ifeq ($(OS),Darwin)
 CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off" $(MAKE) VARIANT="llama-cpp-avx-build" build-llama-cpp-grpc-server
-else ifeq ($(ARCH),$(filter $(ARCH),aarch64 arm64))
+else ifneq ($(filter $(ARCH),aarch64 arm64),)

@mudler (Owner) commented:

Any specific reason? I find ifeq more readable.

@coffeerunhobby (Contributor, Author) replied:

The reason here is:

  1. Skip BMI flags on Darwin (macOS)
  2. Skip BMI flags on ARM (aarch64/arm64)
  3. Add BMI flags on x86_64 (the else case)

The ifneq checks if ARCH matches aarch64 or arm64. When the filter finds a match, the result is non-empty, so we skip the flags.

I can change to ifeq if you prefer, but the logic would need to invert:

ifeq ($(OS),Darwin)
    # No BMI flags (Darwin)
else ifeq ($(filter $(ARCH),aarch64 arm64),)
    # This is x86_64 - ADD BMI flags here
else
    # This is ARM - NO BMI flags
endif
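
As a tiny runnable sketch of the $(filter ...) idiom (the /tmp path and the messages are just for illustration):

cat > /tmp/filter-demo.mk <<'EOF'
ARCH ?= x86_64
ifneq ($(filter $(ARCH),aarch64 arm64),)
$(info $(ARCH) -> ARM: skip the BMI2-disabling flags)
else
$(info $(ARCH) -> x86: add -DGGML_BMI2=off)
endif
all: ;
EOF
make -f /tmp/filter-demo.mk ARCH=x86_64    # prints: x86_64 -> x86: add -DGGML_BMI2=off
make -f /tmp/filter-demo.mk ARCH=aarch64   # prints: aarch64 -> ARM: skip the BMI2-disabling flags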

 CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off" $(MAKE) VARIANT="llama-cpp-avx-build" build-llama-cpp-grpc-server
 else
-CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DCMAKE_C_FLAGS=-mno-bmi2 -DCMAKE_CXX_FLAGS=-mno-bmi2" $(MAKE) VARIANT="llama-cpp-avx-build" build-llama-cpp-grpc-server
+CFLAGS="-mno-bmi2" CXXFLAGS="-mno-bmi2" CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI=off -DGGML_BMI2=off" $(MAKE) VARIANT="llama-cpp-avx-build" build-llama-cpp-grpc-server

@mudler (Owner) commented:

GGML_BMI does not exist: https://github.com/ggml-org/ggml/blob/ebc3a0f4a56be1c9424a89fbec09962ac34fde85/CMakeLists.txt#L155

Do we also need CFLAGS/CXXFLAGS? If not, let's drop them; GGML_BMI2 should be enough.

@coffeerunhobby (Contributor, Author) replied:

You're right here. -DGGML_BMI2=off alone works (as I tested in the successful build) - the CFLAGS/CXXFLAGS were just me being overly cautious after fighting with compiler flags for too long.

 ifeq ($(OS),Darwin)
 CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off" $(MAKE) VARIANT="llama-cpp-fallback-build" build-llama-cpp-grpc-server
-else ifeq ($(ARCH),$(filter $(ARCH),aarch64 arm64))
+else ifneq ($(filter $(ARCH),aarch64 arm64),)

@mudler (Owner) commented:

ditto about logic inversion

 CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off" $(MAKE) VARIANT="llama-cpp-fallback-build" build-llama-cpp-grpc-server
 else
-CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DCMAKE_C_FLAGS='-mno-bmi -mno-bmi2' -DCMAKE_CXX_FLAGS='-mno-bmi -mno-bmi2'" $(MAKE) VARIANT="llama-cpp-fallback-build" build-llama-cpp-grpc-server
+CFLAGS="-mno-bmi -mno-bmi2" CXXFLAGS="-mno-bmi -mno-bmi2" CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI=off -DGGML_BMI2=off" $(MAKE) VARIANT="llama-cpp-fallback-build" build-llama-cpp-grpc-server

@mudler (Owner) commented:

ditto above

 ifeq ($(OS),Darwin)
 CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_RPC=ON -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off" TARGET="--target grpc-server --target rpc-server" $(MAKE) VARIANT="llama-cpp-grpc-build" build-llama-cpp-grpc-server
-else ifeq ($(ARCH),$(filter $(ARCH),aarch64 arm64))
+else ifneq ($(filter $(ARCH),aarch64 arm64),)

@mudler (Owner) commented:

ditto

 CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_RPC=ON -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off" TARGET="--target grpc-server --target rpc-server" $(MAKE) VARIANT="llama-cpp-grpc-build" build-llama-cpp-grpc-server
 else
-CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_RPC=ON -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DCMAKE_C_FLAGS='-mno-bmi -mno-bmi2' -DCMAKE_CXX_FLAGS='-mno-bmi -mno-bmi2'" TARGET="--target grpc-server --target rpc-server" $(MAKE) VARIANT="llama-cpp-grpc-build" build-llama-cpp-grpc-server
+CFLAGS="-mno-bmi -mno-bmi2" CXXFLAGS="-mno-bmi -mno-bmi2" CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_RPC=ON -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI=off -DGGML_BMI2=off" TARGET="--target grpc-server --target rpc-server" $(MAKE) VARIANT="llama-cpp-grpc-build" build-llama-cpp-grpc-server

@mudler (Owner) commented:

ditto

@@ -1,4 +1,5 @@
-cmake_minimum_required(VERSION 3.12)
+# CUDA Toolkit 13.x compatibility: CMake 3.31.9+ fixes toolchain detection/arch table issues
+cmake_minimum_required(VERSION 3.31.10)

@mudler (Owner) commented:

ditto

@mudler mudler enabled auto-merge (squash) January 5, 2026 22:47
@mudler mudler merged commit 5add7b4 into mudler:master Jan 6, 2026
38 of 39 checks passed