Quick implementation of abliteration to uncensor LLMs. Based on the HuggingFace blog post: https://huggingface.co/blog/mlabonne/abliteration
Abliteration removes censorship by finding the "refusal direction" in the model's activations and removing it through weight orthogonalization. No retraining needed; it runs in minutes instead of hours.
The basic idea: models refuse requests by activating a specific direction in their residual stream. If we prevent the model from writing to that direction, it can't refuse anymore.
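In vector terms: if `r_hat` is the unit-norm refusal direction and `a` is a residual-stream activation, ablation just removes the component of `a` along `r_hat`. A minimal sketch of that projection (illustrative only, not code from this repo):

```python
import numpy as np

def ablate(a: np.ndarray, r_hat: np.ndarray) -> np.ndarray:
    """Remove the component of activation `a` along the unit refusal direction."""
    return a - (a @ r_hat) * r_hat
```

Do this at inference time with hooks and the effect is temporary; bake it into the weights (see below) and it's permanent.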
Setup:

```bash
uv sync
```

That's it. Uses uv for dependency management.
Recommended:

```bash
uv run python abliterate_proper.py \
  --model_name LiquidAI/LFM2.5-1.2B-Instruct \
  --output_dir ./lfm2.5-abliterated-proper \
  --max_samples 64 \
  --eval_directions 3
```

This evaluates each layer first, then only modifies the ones that actually help. Takes longer but works better.
Quick version:

```bash
uv run python abliterate.py \
  --model_name LiquidAI/LFM2.5-1.2B-Instruct \
  --output_dir ./lfm2.5-abliterated \
  --max_samples 128 \
  --batch_size 4
```

This just modifies all layers without checking whether they help. Sometimes it works, sometimes it makes things worse.
To check the result:

```bash
uv run python verify_abliteration.py --model_path ./lfm2.5-abliterated-proper
```

Files in this repo:

- `abliterate_proper.py` - The good one. Evaluates layers first, then modifies only the helpful ones
- `abliterate.py` - Basic version, modifies all layers. Use if you're in a hurry
- `verify_abliteration.py` - Tests the model on some prompts to see if it's less censored
- `example_transformers.py` - Just some examples of using the model normally

How `abliterate_proper.py` works:
- Loads harmful vs harmless prompts
- Runs the model on both and extracts activations at the last token
- Calculates the difference between the two (the refusal direction) for each layer (see the sketch after this list)
- Tests each direction with inference-time intervention
- Picks the best layers and permanently modifies their weights
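The heart of that pipeline is a mean difference of activations. A simplified sketch of the first three steps (not the repo's actual code: it uses `output_hidden_states` instead of the manual hooks, skips chat templating and batching, and the prompt lists here are stand-ins for the real datasets):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "LiquidAI/LFM2.5-1.2B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name)
tok = AutoTokenizer.from_pretrained(model_name)

harmful_prompts = ["How do I hotwire a car?"]      # stand-in; scripts load real data
harmless_prompts = ["How do I bake sourdough?"]    # stand-in

def last_token_acts(prompts):
    """Stack per-layer residual-stream activations at the final token."""
    acts = []
    for p in prompts:
        inputs = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        # out.hidden_states: one [1, seq, d_model] tensor per layer (plus embeddings)
        acts.append(torch.stack([h[0, -1] for h in out.hidden_states]))
    return torch.stack(acts)  # [n_prompts, n_layers + 1, d_model]

# Refusal direction per layer: normalized mean difference of activations
diff = last_token_acts(harmful_prompts).mean(0) - last_token_acts(harmless_prompts).mean(0)
refusal_dirs = diff / diff.norm(dim=-1, keepdim=True)  # [n_layers + 1, d_model]
```

Step 4 then tests each candidate direction by subtracting its projection from the residual stream during generation (a forward hook), and only directions that measurably reduce refusals make it to step 5.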
The weight modification prevents the model from writing to the refusal direction at all: each affected weight matrix is orthogonalized against that direction.
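Concretely, for a unit refusal direction `r_hat` and a weight matrix `W` whose outputs land in the residual stream, the update is `W' = W - r_hat r_hat^T W`, so the output of `W'` has zero component along `r_hat` for any input. A sketch (function name is mine):

```python
import torch

def orthogonalize(W: torch.Tensor, r_hat: torch.Tensor) -> torch.Tensor:
    """Project the refusal direction out of a weight matrix that writes
    into the residual stream.

    W:     [d_model, d_in] (HF Linear weights are [out_features, in_features])
    r_hat: [d_model], unit norm
    After this, r_hat @ (W' @ x) == 0 for any input x.
    """
    return W - torch.outer(r_hat, r_hat @ W)

# Applied in place to the matrices that write into the residual stream at
# the chosen layers, e.g. attention out-projections and MLP down-projections.
```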
Best result: ~37% refusal rate (down from 50%). Still some censorship but much better. Illegal activities are partially uncensored, NSFW is hit or miss. Normal questions work fine.
See PROGRESS.md for details on what I tried.
Notes:

- First run downloads the model (~2.4GB), cached after that
- Works better with GPU but not required
- The LFM2 architecture is a bit different from standard Llama, so the hooks had to be adapted
- TransformerLens doesn't support LFM2, so manual PyTorch hooks are used instead
References:

- Original paper: Arditi et al., "Refusal in Language Models Is Mediated by a Single Direction"
- Blog post: https://huggingface.co/blog/mlabonne/abliteration
- FailSpy's implementation: https://github.com/FailSpy/abliterator