Fixed the logit lens implementation inside `ActivationCache.accumulated_resid` to match the standard definition in the literature and the behavior documented in the docstring and the docs (#1077)
## Fix incorrect logit lens implementation in `accumulated_resid`

Fixes #[issue_number]
## What was wrong
The `accumulated_resid(apply_ln=True)` method is documented for logit lens analysis but was implementing it incorrectly.

**The bug:**

- It applied the cached `ln_final.hook_scale` (computed during the original forward pass) to all intermediate states
- A partial workaround existed via `fold_ln=True`, but this workaround was not documented and still didn't fix the core problem of using cached statistics

**Why this matters:**
The cached normalization scale acts like a temperature parameter in softmax. Since the residual stream norm grows exponentially through the network, early-layer activations divided by the final layer's large cached scale become artificially small. After unembedding and softmax, this creates near-uniform distributions—not because the model is uncertain, but because of mismatched normalization statistics.
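As a toy numerical illustration of this temperature effect, here is a pure-Python sketch. All values are invented: `cached_scale` merely stands in for a large cached `ln_final.hook_scale`, and centering is omitted for brevity.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical early-layer residual state (small norm).
resid = [1.0, 2.0, 3.0]

# Fresh scale: computed from this state itself (RMS; centering omitted).
fresh_scale = math.sqrt(sum(x * x for x in resid) / len(resid))
fresh = softmax([x / fresh_scale for x in resid])

# Buggy: divide by a large scale cached from the final layer, whose
# residual norm has grown far beyond the early layers'.
cached_scale = 100.0  # invented stand-in for a cached ln_final.hook_scale
buggy = softmax([x / cached_scale for x in resid])

spread_fresh = max(fresh) - min(fresh)  # clearly peaked distribution
spread_buggy = max(buggy) - min(buggy)  # near-uniform: an artifact, not model uncertainty
```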
The logic of using cached normalization scales comes from Direct Logit Attribution (DLA), where it's correct and reasonable: you want a faithful decomposition showing how each component contributes to the final output through the lens of the final layer's normalization frame. However, for logit lens analysis, this is not what we want. Logit lens asks "what would the model predict if we stopped processing at layer L?" which requires simulating what would happen if that intermediate state were decoded immediately—meaning we must recompute fresh normalization statistics.
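The logit lens recipe described above can be sketched in a few lines of pure Python: decode each accumulated residual state through a freshly computed layer norm, then project with an unembedding matrix. The states, `W_U`, and dimensions below are made-up toy values, not TransformerLens internals.

```python
def layer_norm(x, eps=1e-5):
    # Fresh statistics computed from x itself (learned scale/bias omitted).
    mu = sum(x) / len(x)
    centered = [v - mu for v in x]
    var = sum(v * v for v in centered) / len(centered)
    return [v / (var + eps) ** 0.5 for v in centered]

def unembed(x, W_U):
    # logits[j] = sum_i x[i] * W_U[i][j]
    return [sum(x[i] * W_U[i][j] for i in range(len(x)))
            for j in range(len(W_U[0]))]

# Hypothetical accumulated residual states after layers 0..2 (d_model=3);
# note the norm grows with depth, as it does in real transformers.
accumulated_resid = [
    [0.1, 0.2, -0.1],
    [1.0, -0.5, 2.0],
    [12.0, 4.0, -8.0],
]
W_U = [[1.0, 0.0],
       [0.0, 1.0],
       [0.5, -0.5]]  # toy d_model x d_vocab unembedding

# Logit lens: "what would the model predict if it stopped at layer L?"
per_layer_logits = [unembed(layer_norm(state), W_U)
                    for state in accumulated_resid]
```

The key point is that `layer_norm` takes its statistics from each intermediate state itself, never from a cache of the final forward pass.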
Additionally, `accumulated_resid` doesn't attribute to specific components (individual heads or MLP layers) but to the complete accumulated residual stream up to layer L. The previous cached-scale approach answered a confusing hybrid question: "what is the contribution of all components up to layer L, viewed through layer L's (or the final layer's) normalization frame?" This is neither standard logit lens (which asks about intermediate predictions) nor standard DLA (which attributes to individual components).

The confusion was compounded by the `layer` parameter behavior: when `layer=L < n_layers`, it used layer L's input normalization (`ln1` or `ln2`), not the final layer norm. This makes no sense for logit lens, where you always want to decode through the final layer norm. The `return_labels` parameter further hints at the function's true intention: to map intermediate states to vocabulary space for analysis across layers, which requires consistent use of the final layer norm.

**Concretely:**
The model could encode identical concepts at intermediate and final layers—just at different scales—but the cached-scale approach would report completely different entropy and probabilities, even though the representations are functionally equivalent (all subsequent layers normalize their inputs anyway).
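A quick pure-Python sketch of this scale-invariance point (the states and the cached scale are invented for illustration):

```python
def layer_norm(x, eps=1e-12):
    # Fresh LayerNorm statistics (learned scale/bias omitted for brevity).
    mu = sum(x) / len(x)
    centered = [v - mu for v in x]
    var = sum(v * v for v in centered) / len(centered)
    return [v / (var + eps) ** 0.5 for v in centered]

early = [0.3, -0.1, 0.8]    # hypothetical intermediate-layer state
late = [30.0, -10.0, 80.0]  # same direction, 100x the norm

# Fresh normalization is scale-invariant: both states decode identically.
fresh_early = layer_norm(early)
fresh_late = layer_norm(late)

# Cached-scale division preserves the 100x gap, so the same concept would
# report wildly different entropies depending on which layer it came from.
cached_scale = 50.0  # invented stand-in for a cached ln_final.hook_scale
buggy_early = [v / cached_scale for v in early]
buggy_late = [v / cached_scale for v in late]
```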
Example (GPT-2 XL, "The capital of France is"):
## What changed
### Code (`activation_cache.py`)

Changed one line in `accumulated_resid()`. This now:

- recomputes fresh normalization statistics over the full `[num_layers, batch, pos, d_model]` stack
- decodes through the final layer norm for every `layer` parameter value

### Documentation (`accumulated_resid` docstring)

Fixed ambiguity about bias terms:
The docstring now explicitly documents two valid approaches for projecting to vocabulary space:
**With bias terms:**

- `normalized_resid @ model.W_U + model.b_U` applies both `W_U` and `b_U`
- Works with both `fold_ln=True` and `fold_ln=False`
- `fold_ln=False`: the layer norm bias is applied via `ln_final`, then the unembedding bias via `b_U`
- `fold_ln=True`: the layer norm bias is folded into `b_U`, so adding `b_U` applies both biases together

**Without bias terms:**

- `normalized_resid @ model.W_U` only
- Requires `fold_ln=True` when loading the model
- `fold_ln=True`: the layer norm has no bias parameter (folded into `b_U`), and you skip `b_U`, so no bias terms are applied
- `fold_ln=False`: the layer norm bias would still be applied via `ln_final` (usually undesired when intentionally excluding biases)

Previously, the docstring only mentioned "multiply by the unembedding matrix" without clarifying the bias term handling or its interaction with `fold_ln`. This is important because the layer norm bias and unembedding bias are applied at different points in the computation, and `fold_ln` determines whether they're kept separate or combined.

**Other documentation improvements:**

- Clarified that `apply_ln` recomputes normalization statistics to transform activations into the format expected by the unembedding
- The docstring example assumes `fold_ln=True`, with a commented alternative showing how to add bias terms
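The two documented projection options can be sketched with toy matrices. Everything here is invented for illustration: `W_U`, `b_U`, and the helper `matvec` stand in for `model.W_U`, `model.b_U`, and a matrix multiply; this is not the library's actual code.

```python
d_model, d_vocab = 2, 3
W_U = [[0.5, -1.0, 0.2],
       [1.5, 0.3, -0.7]]  # toy d_model x d_vocab unembedding
b_U = [0.1, -0.2, 0.0]    # toy unembedding bias

normalized_resid = [1.0, -2.0]  # one position's normalized residual

def matvec(x, W):
    # Equivalent of x @ W for plain nested lists.
    return [sum(x[i] * W[i][j] for i in range(len(x)))
            for j in range(len(W[0]))]

# With bias terms (valid under both fold_ln=True and fold_ln=False):
logits_with_bias = [l + b for l, b in zip(matvec(normalized_resid, W_U), b_U)]

# Without bias terms (only meaningful with fold_ln=True, where the layer
# norm bias has already been folded into b_U and is skipped along with it):
logits_no_bias = matvec(normalized_resid, W_U)
```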
### Tests (`test_activation_cache.py`)

Updated `test_accumulated_resid_with_apply_ln` to verify the correct behavior.

The key change: instead of comparing against `apply_ln_to_stack` (the buggy cached-scale behavior), the test now compares against direct application of `model.ln_final()` (the correct fresh-normalization behavior). This ensures `apply_ln=True` produces the same result as manually applying the final layer norm.

## Breaking change
This is a breaking change that fixes incorrect behavior.
Users calling `accumulated_resid(apply_ln=True)` will get different numerical outputs. The previous behavior was producing measurement artifacts, and users relying on those outputs were getting misleading interpretability results. Code will continue to run without errors, but the numerical results will be different (and correct).
## Why this doesn't affect DLA
Direct Logit Attribution (DLA) decomposes outputs into individual component contributions (specific heads, MLP layers). For that, use `decompose_resid()` to get individual components, not `accumulated_resid()`, which returns cumulative sums of all components up to each layer.

The cached normalization logic that was used before makes sense in DLA contexts, where you want to see component contributions through a consistent normalization frame. But `accumulated_resid` is designed for logit lens analysis (mapping intermediate states to predictions), not DLA decomposition. This fix only affects the logit lens use case (`apply_ln=True`), not DLA decomposition logic elsewhere in the codebase.

## Type of change
Note: While this is technically a breaking change (numerical outputs differ), existing code will continue to run without errors. The change only affects the numerical values returned, which are now correct rather than artifacts. Users won't experience crashes or API changes, just different (and more accurate) results.
## Checklist

- Tests updated (`test_accumulated_resid_with_apply_ln`, none besides that)

Notes:

- The existing test (`test_accumulated_resid_with_apply_ln`) was updated to verify the correct behavior instead of the buggy behavior
- The public API (the `accumulated_resid` signature and parameters) remains unchanged, so backward compatibility is maintained at the API level