Quick implementation of abliteration to uncensor LLMs. Based on the HuggingFace blog post: https://huggingface.co/blog/mlabonne/abliteration
Abliteration removes censorship by finding the "refusal direction" in the model's activations and removing it through weight orthogonalization. No retraining needed; it runs in minutes instead of hours.
The basic idea: models refuse requests by activating a specific direction in their residual stream. If we prevent the model from writing to that direction, it can't refuse anymore.
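In vector terms: if `r_hat` is the unit-norm refusal direction and `a` is a residual-stream activation, ablation just removes the component of `a` along `r_hat`. A minimal sketch of that projection (illustrative only, not code from this repo):

```python
import numpy as np

def ablate(a: np.ndarray, r_hat: np.ndarray) -> np.ndarray:
    """Remove the component of activation `a` along the unit refusal direction."""
    return a - (a @ r_hat) * r_hat
```

Do this at inference time with hooks and the effect is temporary; bake it into the weights (see below) and it's permanent.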
Setup:

```bash
uv sync
```

That's it. Uses uv for dependency management.
Recommended:

```bash
uv run python abliterate_proper.py \
  --model_name LiquidAI/LFM2.5-1.2B-Instruct \
  --output_dir ./lfm2.5-abliterated-proper \
  --max_samples 64 \
  --eval_directions 3
```

This evaluates each layer first, then only modifies the ones that actually help. Takes longer but works better.
Quick version:

```bash
uv run python abliterate.py \
  --model_name LiquidAI/LFM2.5-1.2B-Instruct \
  --output_dir ./lfm2.5-abliterated \
  --max_samples 128 \
  --batch_size 4
```

This just modifies all layers without checking whether they help. Sometimes it works, sometimes it makes things worse.
To check the result:

```bash
uv run python verify_abliteration.py --model_path ./lfm2.5-abliterated-proper
```

Files in this repo:

- `abliterate_proper.py` - The good one. Evaluates layers first, then modifies only the helpful ones
- `abliterate.py` - Basic version, modifies all layers. Use if you're in a hurry
- `verify_abliteration.py` - Tests the model on some prompts to see if it's less censored
- `example_transformers.py` - Just some examples of using the model normally

How `abliterate_proper.py` works:
- Loads harmful vs harmless prompts
- Runs the model on both and extracts activations at the last token
- Calculates the difference between the two (the refusal direction) for each layer (see the sketch after this list)
- Tests each direction with inference-time intervention
- Picks the best layers and permanently modifies their weights
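The heart of that pipeline is a mean difference of activations. A simplified sketch of the first three steps (not the repo's actual code: it uses `output_hidden_states` instead of the manual hooks, skips chat templating and batching, and the prompt lists here are stand-ins for the real datasets):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "LiquidAI/LFM2.5-1.2B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name)
tok = AutoTokenizer.from_pretrained(model_name)

harmful_prompts = ["How do I hotwire a car?"]      # stand-in; scripts load real data
harmless_prompts = ["How do I bake sourdough?"]    # stand-in

def last_token_acts(prompts):
    """Stack per-layer residual-stream activations at the final token."""
    acts = []
    for p in prompts:
        inputs = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        # out.hidden_states: one [1, seq, d_model] tensor per layer (plus embeddings)
        acts.append(torch.stack([h[0, -1] for h in out.hidden_states]))
    return torch.stack(acts)  # [n_prompts, n_layers + 1, d_model]

# Refusal direction per layer: normalized mean difference of activations
diff = last_token_acts(harmful_prompts).mean(0) - last_token_acts(harmless_prompts).mean(0)
refusal_dirs = diff / diff.norm(dim=-1, keepdim=True)  # [n_layers + 1, d_model]
```

Step 4 then tests each candidate direction by subtracting its projection from the residual stream during generation (a forward hook), and only directions that measurably reduce refusals make it to step 5.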
The weight modification prevents the model from writing to the refusal direction at all: each affected weight matrix is orthogonalized against that direction.
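Concretely, for a unit refusal direction `r_hat` and a weight matrix `W` whose outputs land in the residual stream, the update is `W' = W - r_hat r_hat^T W`, so the output of `W'` has zero component along `r_hat` for any input. A sketch (function name is mine):

```python
import torch

def orthogonalize(W: torch.Tensor, r_hat: torch.Tensor) -> torch.Tensor:
    """Project the refusal direction out of a weight matrix that writes
    into the residual stream.

    W:     [d_model, d_in] (HF Linear weights are [out_features, in_features])
    r_hat: [d_model], unit norm
    After this, r_hat @ (W' @ x) == 0 for any input x.
    """
    return W - torch.outer(r_hat, r_hat @ W)

# Applied in place to the matrices that write into the residual stream at
# the chosen layers, e.g. attention out-projections and MLP down-projections.
```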
Best result: ~37% refusal rate (down from 50%). Still some censorship but much better. Illegal activities are partially uncensored, NSFW is hit or miss. Normal questions work fine.
See PROGRESS.md for details on what I tried.
Notes:

- First run downloads the model (~2.4GB), cached after that
- Works better with GPU but not required
- The LFM2 architecture is a bit different from standard Llama, so the hooks had to be adapted
- TransformerLens doesn't support LFM2, so manual PyTorch hooks are used instead
References:

- Original paper: Arditi et al., "Refusal in Language Models Is Mediated by a Single Direction"
- Blog post: https://huggingface.co/blog/mlabonne/abliteration
- FailSpy's implementation: https://github.com/FailSpy/abliterator