
Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#18186

These additional codenames cover CPU feature sets not represented by the current variant list:

  • ivybridge
  • piledriver
  • cascadelake
  • cooperlake
  • zen4

Resolves: #17966

@loci-agentic-ai

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #620

Overview

PR #620 adds support for five additional x86_64 CPU backend variants (ivybridge, piledriver, cascadelake, cooperlake, zen4) when building with GGML_CPU_ALL_VARIANTS=On. The changes consolidate SIMD header includes by removing duplicate __F16C__ preprocessor checks from ggml-impl.h and simd-mappings.h, while adding __F16C__ explicitly to the conditional in ggml-cpu-impl.h. This refactoring introduces a consistent 6-10 ns degradation in parameter accessor functions across all CPU backend variants.
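For context, a build that exercises these variants could look like the following sketch. `GGML_CPU_ALL_VARIANTS` is the flag named in the PR; pairing it with `GGML_BACKEND_DL` reflects ggml's CMake setup, where per-variant CPU backends are built as dynamically loaded libraries. Treat the exact invocation as illustrative rather than the project's documented build recipe.

```shell
# Illustrative build: enable all CPU backend variants (including the new
# ivybridge/piledriver/cascadelake/cooperlake/zen4 codenames) as loadable
# backends selected at runtime by CPU feature detection.
cmake -B build \
      -DGGML_BACKEND_DL=ON \
      -DGGML_CPU_ALL_VARIANTS=ON \
      -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
```

At runtime the loader probes the host CPU and picks the best matching variant, which is why the same inline functions end up compiled once per variant.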

Key Findings

Impacted Functions in Performance-Critical Areas

The degradation affects parameter accessor functions compiled across multiple CPU backend variants:

Most Impacted Functions (by absolute change):

  • repack.cpp_ggml_set_op_params in libggml-cpu.so: +10 ns throughput (110 ns → 120 ns)
  • binary-ops.cpp_ggml_get_op_params_i32 in libggml-cpu.so: +7 ns throughput (76 ns → 83 ns)
  • ggml-backend-reg.cpp_ggml_get_op_params_i32 in libggml.so: +6 ns throughput (76 ns → 83 ns)
  • amx.cpp_ggml_get_op_params_i32 in libggml-cpu.so: +6 ns throughput (69 ns → 75 ns)
  • ggml-quants.c_ggml_get_op_params_i32 in libggml-base.so: +6 ns throughput (71 ns → 77 ns)

All affected functions are variants of ggml_get_op_params_i32 and ggml_set_op_params located in ggml-impl.h lines 151-154. These are static inline functions that extract operation parameters from tensor structures. The degradation originates from altered compiler optimization decisions when <immintrin.h> is included in different preprocessor contexts due to the addition of __F16C__ to the conditional check.

Impact on Inference Performance (Tokens per Second)

No impact on tokenization or inference throughput. The affected functions (ggml_get_op_params_i32, ggml_set_op_params) are parameter accessors used during graph construction and operation dispatch, not during the core inference loop. The functions responsible for token generation (llama_decode, llama_encode, llama_tokenize) show no performance changes in this PR.

Reference context: for the test model (ollama://smollm:135m on a 12th Gen Intel Core i7-1255U), a 2 ms slowdown in llama_decode causes a 7% reduction in tokens per second. The 6-10 ns changes observed here are roughly five orders of magnitude smaller and occur in infrastructure functions called during setup, not in per-token processing.

Power Consumption Analysis

Power consumption increases are minimal across affected binaries:

  • libggml.so: +0.054% (+2.25 nJ absolute, 4152 nJ → 4154 nJ)
  • libggml-cpu.so: +0.017% (+20 nJ absolute, 119986 nJ → 120006 nJ)
  • libggml-base.so: +0.012% (+7 nJ absolute, 59263 nJ → 59270 nJ)
  • libllama.so: +0.000% (no measurable change)
  • All other binaries: no measurable change

The power consumption changes correlate with the throughput degradations in parameter accessor functions. The increases are negligible in absolute terms and do not impact production energy consumption meaningfully.

Code Change Analysis

The PR implements legitimate functionality: expanding CPU variant support for better hardware optimization. The header refactoring consolidates SIMD intrinsic management by centralizing <immintrin.h> inclusion in ggml-cpu-impl.h. The performance regression is an unintended side effect of this consolidation, where the addition of __F16C__ to the preprocessor conditional altered the compilation context for inline functions across all CPU backend variants. The changes do not modify algorithmic logic or add computational overhead; the degradation stems from different compiler optimization decisions for the same source code when compiled with different preprocessor definitions.

@loci-dev loci-dev force-pushed the main branch 20 times, most recently from f002844 to 25154fc Compare December 21, 2025 21:07