
imatrix: calculate activation-based statistics for new format (GGUF) imatrices #14891


Draft · wants to merge 40 commits into master

Conversation

EAddario
Contributor

Following up from #9400 and #12718, I've started tinkering with activation-based statistics, in addition to what's currently available via --show-statistics.

At the moment, I'm exploring three options, going from easy to implement and an OK approximation, to some assembly required but fairly accurate:

  1. L2 norm of activation difference: where larger values would suggest the tensor has significantly transformed the input with respect to the previous layer (see the sketch after this list).
  2. KL Divergence reduction: a similar approach to the one described by nostalgebraist in logit lens, based on a pre-computed logit file (e.g. from a previous llama-perplexity --save-all-logits run)
  3. Given that llama-imatrix already generates the actual logits to compute PPL, use Thông T. Nguyễn's logit prism approach to calculate the exact contribution of each layer to the final logit scores
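
For option 1, a minimal sketch of the intended statistic, assuming the activations for two consecutive layers are available as (n_tokens, n_embd) arrays (names and shapes are illustrative only, not the actual llama-imatrix internals):

```python
import numpy as np

def l2_activation_shift(prev_acts: np.ndarray, curr_acts: np.ndarray):
    """Per-token L2 norm of the activation difference between two layers.

    prev_acts, curr_acts: (n_tokens, n_embd) residual-stream activations
    before and after a layer (hypothetical names/shapes).
    """
    diffs = curr_acts - prev_acts            # element-wise activation difference
    norms = np.linalg.norm(diffs, axis=-1)   # one L2 norm per token
    return norms, norms.mean(), norms.var()

# Toy usage with random data standing in for real activations
rng = np.random.default_rng(0)
prev = rng.standard_normal((512, 2048)).astype(np.float32)
curr = prev + 0.1 * rng.standard_normal((512, 2048)).astype(np.float32)
norms, mean_shift, var_shift = l2_activation_shift(prev, curr)
print(f"mean shift: {mean_shift:.4f}, variance: {var_shift:.4f}")
```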

Sharing with the readers, and in particular @compilade and @jukofyork, in case anyone's willing to double check assumptions and/or suggest alternative approaches I haven't considered.

@EAddario EAddario marked this pull request as draft July 26, 2025 16:47
@jukofyork
Collaborator

jukofyork commented Jul 29, 2025

L2 norm of activation difference: where larger values would suggest the tensor has significantly transformed the input with respect to the previous layer.

If we had access to some numerical linear algebra routines then it would likely be possible to get much more interesting stats from this.

If you think about it:

  • The L2 norm of the activation difference is just measuring the Euclidean distance between the tip of the input vector and the tip of the output vector.
  • The mean of these norms probably isn't that interesting (but could be used to test if a quant is systematically biasing or scaling the activations).
  • The variance of these norms is likely much more interesting and tells you about the "richness" of the transformation (indirectly - see below).

If instead of using the L2 norms of the differences, we construct the cross-covariance matrix of the paired samples, and then take the SVD of this:

  • The "richness" of the transformation (measured indirectly above) is actually to do with the distribution of the singular values, eg: there are many sets of activation differences with the same L2-norm, but those with a flat(er) distribution of singular values (vs a couple of large singular values) are likely to be much more important and interesting.
  • If you convert the SVD into a polar decomposition, then the scaling and rotational components will likely lead to other interesting insights, eg:

I suspect that the scaling part of the transformation is quite well handled by the current scalar quants, but the rotational component is likely not.

IIRC, some of the 1-2 bit quants use vector quantization, and if so, these will likely handle the rotational components better and/or show quite different properties.

I'm on my phone ATM so can't easily link them, but there have been several papers showing:

  1. Outlier activations in LLMs matter much more than simple rate–distortion theory would suggest/measure. This is likely related to the "flatness" of the singular values, where only rarely do some singular vector directions give a high dot-product with an input activation, but when they do, they add a significant/important contribution to the output.
  2. LLMs are much more rotational than people first realised, eg: there was [IIRC] a Microsoft paper where they constrained everything to be on the surface of a unit ball, and there are several PEFT methods that purely alter the rotational directions via orthogonal transformations.
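
To make the cross-covariance / SVD / polar-decomposition idea above concrete, a hedged numpy sketch (sample shapes are assumed; this is not existing llama.cpp code):

```python
import numpy as np

def cross_cov_analysis(x: np.ndarray, y: np.ndarray):
    """x, y: (n_samples, n_embd) paired input/output activations (assumed shapes).

    Returns the singular values of the cross-covariance matrix, the rotational
    and scaling factors of its polar decomposition, and a "flatness" score for
    the singular-value distribution (1.0 = perfectly flat spectrum).
    """
    xc = x - x.mean(axis=0)
    yc = y - y.mean(axis=0)
    cov = xc.T @ yc / (len(x) - 1)        # (n_embd, n_embd) cross-covariance

    u, s, vt = np.linalg.svd(cov, full_matrices=False)

    # Polar decomposition cov = R @ P: R orthogonal (rotational part),
    # P symmetric positive semi-definite (pure scaling part).
    rotation = u @ vt
    scaling = vt.T @ np.diag(s) @ vt

    # Flatness of the spectrum: normalised entropy of the singular values.
    p = s / s.sum()
    flatness = float(-(p * np.log(p + 1e-12)).sum() / np.log(len(p)))
    return s, rotation, scaling, flatness
```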

@jukofyork
Collaborator

jukofyork commented Jul 29, 2025

If it's any use, then there is code here to analyse the symmetrised cross-covariance matrix I used for the control vectors:

https://github.com/jukofyork/control-vectors/blob/main/direction_analyzer.py

The symmetrised version deliberately gets rid of the rotational components, as they can't be made use of if we are just looking for a single direction... You can actually do the same on the anti-symmetrised version (to look at the rotational components only), but eigendecomposition is less useful for this as it will return all complex vectors (hence why SVD makes more sense).

I should also add that, from my experiments using SVD on the tensors of LLMs (ie: ignoring the activations!), it often appears that the early/final tensors (which appear to be very important and are bumped in bits by the quant routines here!) tend to have a less flat distribution of singular values themselves! So when you ignore the distribution of input activations, they generally appear to be doing something inherently "lower dimensional" than the middle tensors!? It would be interesting to investigate this whilst also looking at the activations...
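
A small sketch of the symmetrised vs anti-symmetrised split described above (again with assumed shapes; the linked direction_analyzer.py is the actual reference implementation):

```python
import numpy as np

def split_cross_cov(x: np.ndarray, y: np.ndarray):
    """Split the cross-covariance of paired activations (assumed (n, d) arrays)
    into a symmetric part (real eigendecomposition applies) and an
    anti-symmetric part (rotational; inspected via SVD because its non-zero
    eigenvalues are purely imaginary)."""
    xc = x - x.mean(axis=0)
    yc = y - y.mean(axis=0)
    cov = xc.T @ yc / (len(x) - 1)

    sym = 0.5 * (cov + cov.T)       # "scaling-like" component
    antisym = 0.5 * (cov - cov.T)   # "rotation-like" component

    eig_sym = np.linalg.eigvalsh(sym)                       # real spectrum
    sv_antisym = np.linalg.svd(antisym, compute_uv=False)   # singular values
    return eig_sym, sv_antisym
```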

@EAddario
Contributor Author

I'd be lying if I were to claim I understand everything in there 🥴, but I think I got the gist.

Implementing the l2 norm seems straightforward without having to introduce additional 3rd party dependencies, but completely agree that a "light" BLAS lib will be a godsend.

For now, I'll focus on l2 norm, but will add activation variance as well (good shout!)

For a later version, I'd like to try the logit prism approach but that's for another day.

Thanks for the steer @jukofyork! more weekend reading 😁

@jukofyork
Collaborator

I'd be lying if I were to claim I understand everything in there 🥴, but I think I got the gist.

If you want to learn more about Linear Algebra then Gilbert Strang's video lectures are amazing:

https://www.youtube.com/playlist?list=PLE7DDD91010BC51F8

(IIRC, only the first lecture is in low resolution, so don't be put off by that!)

or if you like books:

https://www.amazon.co.uk/Practical-Linear-Algebra-Textbooks-Mathematics/dp/0367507846

(or one of the earlier editions of this same book)

gives a really solid foundation in terms of 2D and 3D.

The biggest problem breaking into it is that, for some reason, American universities decided to make it much more abstract and proof-based than it needs to be (probably to weed out potential math majors!).

If you look at some much older pre-1980s books, or books not aimed at Westerners, then it's surprising how approachable it is:

https://mirtitles.org/?s=linear+algebra

@jukofyork
Collaborator

but completely agree that a "light" BLAS lib will be a godsend.

I have tried to bring this up before:

#8831 (reply in thread)
#8831 (comment)

I think it would be fairly straightforward to port the non-complex routines and then open up all that GSL has to offer:

https://www.gnu.org/software/gsl/doc/html/linalg.html

instead of trying to rewrite numerical routines that have had thousands and thousands of hours of thought and testing put into them! :)

@compilade
Collaborator

compilade commented Aug 5, 2025

Considering the imatrix weights (in_sum2 / counts) are used by their relative amplitude (assumption from the second paragraph of #15060 (comment)), it might be useful to calculate the variance and standard deviation of the mean squared activations.

The closer the variance is to 0, the closer the effect is to neutral weights.

Although maybe it's not quite the right thing since the variance is sensitive to the scale. (normalized variance, perhaps? Does this have another name?)
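
For reference, a sketch of what that might look like; the scale-insensitive variant of the variance is, I believe, usually called the coefficient of variation (std / mean). The names in_sum2 and counts are taken from the discussion above, not the literal API:

```python
import numpy as np

def mean_sq_spread(in_sum2: np.ndarray, counts: np.ndarray):
    """Raw variance and coefficient of variation (scale-invariant spread) of
    the per-channel mean squared activations for one tensor."""
    mean_sq = in_sum2 / counts
    variance = float(mean_sq.var())
    mean = float(mean_sq.mean())
    cv = float(mean_sq.std() / mean) if mean != 0.0 else float("inf")
    return variance, cv
```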

EDIT: nevermind, it might not be much different from the Z-scores.

@EAddario
Contributor Author

EAddario commented Aug 5, 2025

With the latest set of updates, the code now switches between mean activations (activations/counts) and mean squared activations (values/counts) based on the presence/absence of X.weight.in_sum.

All the stats will be computed accordingly so that, for example, min, max, mean, std deviation, entropy, normalised entropy (E/log2(n)), etc. will use either in_sum or in_sum2.

Please note that the ZD metric (Z-score distribution) is not a proper z-score measure, but rather the proportion of weights in a tensor/layer having a z-score greater than 1, as defined by the original authors.
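
For reference, a sketch of how those per-tensor stats could be computed from a vector of per-channel means (illustrative only; the actual --show-statistics implementation may differ in detail):

```python
import numpy as np

def tensor_stats(values: np.ndarray) -> dict:
    """Summary stats over one tensor's per-channel means (in_sum/counts or
    in_sum2/counts). Entropy treats the magnitudes as a distribution; ZD is
    the proportion of channels with a z-score greater than 1."""
    mean, std = float(values.mean()), float(values.std())
    p = np.abs(values) / np.abs(values).sum()
    entropy = float(-(p * np.log2(p + 1e-12)).sum())
    z = (values - mean) / std if std > 0 else np.zeros_like(values)
    return {
        "min": float(values.min()),
        "max": float(values.max()),
        "mean": mean,
        "std": std,
        "entropy": entropy,
        "normalised_entropy": entropy / np.log2(len(values)),   # E / log2(n)
        "zd": float((z > 1.0).mean()),
    }
```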

@EAddario
Contributor Author

EAddario commented Aug 7, 2025

Just pushed a few changes to #14891; I'd be grateful if anyone has a chance to play with them.

I have introduced a new metric I'm calling Euclidean-Cosine Score (ECS), defined as $f(a, b) = K \cdot e^{-\alpha a} \cdot |b|^{\gamma}$ where a is the L2 norm, b the cosine similarity, and K = 100 (just a scaling factor), with fine-tuning parameters α set to -0.1 and γ to 10 (giving more weight to higher CosSim than to lower L2 Norm). The parameters are somewhat arbitrary and are being adjusted by trial and error as I get to do more tests.

The main idea behind it is that tensors/layers with small lengths, pointing in the same direction, will have less influence as tokens flow forward. Therefore, in theory, they should be ideal candidates to down-quantise (tensors) or to prune (layers).

It's early days, but so far I've got encouraging* PPL improvements using activation-based CosSim alone to select tensors to up/down quantise, compared to the previous method (Σ(Act²)).

So far, I've only tested on Llama 3.2 1B, but my hunch is that for different architectures, particularly MoEs, a compound metric may have better resolution. Since we now have magnitude (L2 Norm) and direction (CosSim), deriving the dot product between tensors/layers is trivial, which led to creating the ECS.

Caveat emptor: it's just a hunch, but it's fun trying 😋

(*) by encouraging I mean ~0.6 to ~0.8% improvement on 𝜌PPL. It isn't going to make your IQ1_S model perform like Q8_0, but every little helps...

p.s. for this to work, you have to generate the imatrix with this fork first. L2 Norm is only available if activations are saved as well. If not, the stats default to legacy using Σ(Act²), which is OKish but not ideal. Also, please be aware that the imatrix file size will double.
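
For clarity, the ECS described above expressed as a couple of lines of code (parameter values are the current trial-and-error settings and may change):

```python
import numpy as np

K, ALPHA, GAMMA = 100.0, -0.1, 10.0   # current (still somewhat arbitrary) settings

def ecs(l2_norm: float, cos_sim: float) -> float:
    """Euclidean-Cosine Score: f(a, b) = K * exp(-alpha * a) * |b| ** gamma,
    with a = L2 norm of the activation difference and b = cosine similarity."""
    return K * np.exp(-ALPHA * l2_norm) * abs(cos_sim) ** GAMMA

# Illustrative values only: a layer with a small activation shift (a = 0.5)
# whose output stays nearly parallel to its input (b = 0.98).
print(ecs(0.5, 0.98))
```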

@jukofyork
Collaborator

I've had an interesting idea related to this:

Unless I've missed something, the whole LLM is invariant to permutation of the hidden state indices: both the residual stream indices, which require every tensor to be permuted, and the MLP intermediate indices, which require permuting only the up/gate/down indices of the specific MLP block.

If this is the case, then considering the legacy quants use a block size of 32 and the K-quants a block size of 256 (with an internal block size of 32 or 16, IIRC?), there could be a permutation that groups indices in such a way as to create less quantisation error.

It would be quite a lot of data to store, but not infeasible and the permutations wouldn't be hard to perform.

Without actually having some data it's hard to say what we should look for, but I suspect some greedy algorithm that tries to create certain properties for the 32 (or 256) block-diagonal entries vs the off-block-diagonal entries wouldn't be too expensive.

What are the properties of a "good" block? Correlated magnitudes, uncorrelated magnitudes, something else entirely?

@compilade might be interested in this too.
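
To make the idea concrete, one naive grouping heuristic (not a proper greedy search): sort channels by some per-channel imatrix statistic and chunk the sorted order into blocks of 32, so each block's max (which sets the quantisation scale) is representative of the whole block. Whether that is the right block property is exactly the open question above; everything here is illustrative.

```python
import numpy as np

BLOCK = 32  # legacy-quant block size

def sorted_block_permutation(channel_stat: np.ndarray) -> np.ndarray:
    """Return a permutation of channel indices, sorted by channel_stat
    (e.g. in_sum2/counts), so consecutive BLOCK-sized chunks hold channels
    of similar magnitude. Applying the same permutation to every tensor
    touching the residual stream is assumed to leave the model unchanged,
    modulo the RoPE / attention-head caveats raised below."""
    return np.argsort(channel_stat)

rng = np.random.default_rng(0)
stat = rng.lognormal(size=4096)
perm = sorted_block_permutation(stat)
blocks = perm.reshape(-1, BLOCK)   # 128 blocks of 32 indices each
print(blocks.shape)
```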

@compilade
Collaborator

What are the properties of a "good" block? Correlated magnitudes, uncorrelated magnitudes, something else entirely?

@jukofyork
If we're only considering imatrix weights, then I would guess uncorrelated magnitude is better, because correlated magnitude is closer to neutral quantization, and depends more on the relative amplitude of the model weights.

A good block should quantize well. Assuming linear quantization, that means any component should be very close to an integer-fraction multiple of the max value of the block (e.g. 5/15, 4/7, etc., as long as the denominator is smaller than the max representable integer in the target quant).
If we're directly considering the weights and not the imatrix, the best permutations wouldn't necessarily be the same for different quantization algorithms (because of the different max denominators), unfortunately.

To get values close to fractional multiples of each other, it could be possible to pre-multiply all values of a row by all distinct possible fractions for a type, and then sort, and then somehow collect groups of the desired block size. Not sure how this would generalize to multiple rows, so it's probably not the correct approach.

Uncorrelated imatrix blocks would be interesting to try.
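
As a toy illustration of "a good block should quantize well": a symmetric linear quantiser that snaps every value to an integer fraction of the block max, with an imatrix-weighted round-trip error. max_int = 7 is arbitrary and real ggml quant types differ in the details; this is only a sketch.

```python
import numpy as np

def block_quant_error(block: np.ndarray, weights: np.ndarray, max_int: int = 7) -> float:
    """Imatrix-weighted squared error of a toy symmetric linear quantiser:
    each value is rounded to q * max(|block|) / max_int with |q| <= max_int."""
    amax = float(np.abs(block).max())
    if amax == 0.0:
        return 0.0
    scale = amax / max_int
    q = np.clip(np.round(block / scale), -max_int, max_int)
    return float((weights * (block - q * scale) ** 2).sum())

rng = np.random.default_rng(0)
block = rng.standard_normal(32).astype(np.float32)   # one 32-wide block
weights = rng.random(32).astype(np.float32)          # stand-in for in_sum2 / counts
print(block_quant_error(block, weights))
```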

Unless I've missed something, the whole LLM is invariant to permutation of the hidden state indices

AFAIK, RoPE is dependent on the order of the hidden state indices (at least the first rope_dims ones).

There are also attention heads, which probably limit the possible permutations of the attention tensors.

For FFN tensors, if they're right after RoPE, then the limitations of the rope_dims need to be considered. Parallel vs sequential FFN also limit what kind of permutations can be done.

There seems to be many edge cases.

@EAddario
Contributor Author

EAddario commented Aug 8, 2025

Reading this exchange (without grasping all the implications) triggered another thought along the lines of "create less quantisation error": now that llama-quantize has access to the means AND the squared means via the new imatrix format, isn't there an opening for a "dynamic" approach that automatically ("on-the-fly") selects the best quant algo per tensor instead of relying on heuristics?

Worth exploring further? Or have I pulled a classic Howard Wolowitz move?
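
A very rough sketch of what "dynamic" selection could look like: estimate an imatrix-weighted error proxy (not the true error, per the follow-up below) for a few toy quantisers and pick the cheapest one that stays under an error budget. The candidate list, error model and budget are all invented for illustration; a real tool would call the actual ggml quantisation routines per type.

```python
import numpy as np

def pick_quant_level(tensor: np.ndarray, weights: np.ndarray,
                     candidates=(3, 7, 15, 31), rel_budget=1e-3) -> int:
    """Return the smallest max integer level whose imatrix-weighted round-trip
    error stays under a relative budget (all values here are illustrative)."""
    budget = rel_budget * float((weights * tensor ** 2).sum())
    amax = float(np.abs(tensor).max())
    if amax == 0.0:
        return candidates[0]
    for max_int in candidates:              # cheapest (fewest levels) first
        scale = amax / max_int
        q = np.clip(np.round(tensor / scale), -max_int, max_int) * scale
        if float((weights * (tensor - q) ** 2).sum()) <= budget:
            return max_int
    return candidates[-1]

rng = np.random.default_rng(0)
t = rng.standard_normal(4096).astype(np.float32)
w = rng.random(4096).astype(np.float32)
print(pick_quant_level(t, w))
```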

@EAddario
Contributor Author

EAddario commented Aug 9, 2025

The more I think about the above, the more I feel it may have some legs, though not as an added function to llama-quantize or llama-imatrix but as its own stand-alone tool (llama-quant-select? 🙃).

I realised that having access to μAct (activations / counts) and μAct² (values / counts) is not sufficient to determine the true quantisation error without the actual tokens that generated those values.

I'll file this under "future projects"
