
Conversation

taronaeo (Collaborator) commented on Sep 5, 2025

Closes #15721
Supersedes #15739

This pull request drops support for the NNPA Vector Intrinsics, as the required maintenance cost does not justify the performance improvement for FP32 ↔ FP16 conversion.

Tested with both -fa off and -fa on, and verified that inference is correct in both modes.

For future reference for IBMers who want to bring this acceleration back:

  1. The NNPA Vector Intrinsics implementation for both FP32 → FP16 and FP16 → FP32 conversion is correct.
  2. Enabling Flash Attention (-fa on, enabled by default) somehow causes tensor data to become invalid, i.e. -inf and nan. Make sure the data is clean before judging whether the conversion implementation is correct; a minimal sanity-check sketch follows this list. See: ggml-cpu: fixes instability in NNPA Vector Intrinsics #15739 (comment)
  3. The function that calls the FP32 ↔ FP16 conversion with invalid data is most likely ggml_compute_forward_dup_f32.
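
As a reference for point 2, here is a minimal sketch (not part of this PR; the helper name debug_check_f32_tensor is made up for illustration) of the kind of sanity check meant above: scan the F32 source buffer for nan/±inf before concluding that the NNPA conversion itself is wrong. It only assumes the public ggml tensor API (ggml_nelements, tensor->data, tensor->name).

```c
// Hypothetical debugging helper, not part of ggml: scan an F32 tensor for
// nan/inf before blaming the FP32 <-> FP16 conversion path. In our testing the
// invalid values were already present in the source data handed to the
// conversion, e.g. via ggml_compute_forward_dup_f32.
#include <math.h>
#include <stdio.h>

#include "ggml.h"

static int64_t debug_check_f32_tensor(const struct ggml_tensor * t) {
    if (t->type != GGML_TYPE_F32) {
        return 0; // this sketch only checks F32 tensors
    }

    const float * data = (const float *) t->data;
    const int64_t n    = ggml_nelements(t);

    int64_t n_bad = 0;
    for (int64_t i = 0; i < n; ++i) {
        if (isnan(data[i]) || isinf(data[i])) {
            if (n_bad < 8) { // report only the first few offenders
                fprintf(stderr, "%s: bad value %f at index %lld in tensor '%s'\n",
                        __func__, data[i], (long long) i, t->name);
            }
            n_bad++;
        }
    }
    return n_bad; // 0 means the source data is clean
}
```

If this returns a non-zero count for the source tensor, the problem is upstream of the conversion and the intrinsics implementation should not be blamed.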

github-actions bot added the documentation (Improvements or additions to documentation) and ggml (changes relating to the ggml tensor library for machine learning) labels on Sep 5, 2025
taronaeo merged commit 186415d into ggml-org:master on Sep 6, 2025
48 checks passed
walidbr pushed a commit to walidbr/llama.cpp that referenced this pull request Sep 7, 2025
Successfully merging this pull request may close these issues.

Eval bug: ggml-cpu Conversion FP32<->FP16 Using GGML_NNPA Stop Inferencing Correctly After b6324