
Abliteration for LFM2.5

Quick implementation of abliteration to uncensor LLMs, based on the HuggingFace blog post: https://huggingface.co/blog/mlabonne/abliteration

What it does

Abliteration removes censorship by finding the "refusal direction" in the model's activations and removing it through weight orthogonalization. No retraining is needed; it works in minutes instead of hours.

The basic idea: models refuse requests by activating a specific direction in their residual stream. If we prevent the model from writing to that direction, it can't refuse anymore.
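
As a minimal sketch of that idea (assuming a unit-norm refusal direction has already been extracted for one layer; the names below are illustrative, not taken from the scripts), removing the direction from a hidden state is a single projection:

import torch

def ablate_direction(hidden: torch.Tensor, refusal_dir: torch.Tensor) -> torch.Tensor:
    # Subtract the component of the hidden state that points along the
    # refusal direction, so the model can no longer "write" to it.
    # hidden: (batch, seq, d_model); refusal_dir: unit-norm, (d_model,)
    proj = (hidden @ refusal_dir).unsqueeze(-1) * refusal_dir
    return hidden - proj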

Setup

uv sync

That's it. The project uses uv for dependency management.

Usage

Quick start (recommended)

uv run python abliterate_proper.py \
  --model_name LiquidAI/LFM2.5-1.2B-Instruct \
  --output_dir ./lfm2.5-abliterated-proper \
  --max_samples 64 \
  --eval_directions 3

This evaluates each layer first, then modifies only the ones that actually help. It takes longer but works better.

Basic version (faster, less effective)

uv run python abliterate.py \
  --model_name LiquidAI/LFM2.5-1.2B-Instruct \
  --output_dir ./lfm2.5-abliterated \
  --max_samples 128 \
  --batch_size 4

This just modifies all layers without checking whether they help. Sometimes it works; sometimes it makes things worse.

Check if it worked

uv run python verify_abliteration.py --model_path ./lfm2.5-abliterated-proper

Scripts

  • abliterate_proper.py - The good one: evaluates layers first, then modifies only the helpful ones
  • abliterate.py - Basic version; modifies all layers. Use it if you're in a hurry
  • verify_abliteration.py - Tests the model on a set of prompts to check whether it's less censored
  • example_transformers.py - Examples of using the model normally

How it works

  1. Loads harmful vs harmless prompts
  2. Runs the model on both and extracts activations at the last token
  3. Calculates the difference (refusal direction) for each layer
  4. Tests each direction with inference-time intervention
  5. Picks the best layers and permanently modifies their weights
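
A minimal sketch of steps 2-3 (harmful_acts and harmless_acts are hypothetical per-layer tensors of last-token activations, shape (n_samples, d_model); the actual scripts may organize this differently):

import torch

def refusal_direction(harmful_acts: torch.Tensor, harmless_acts: torch.Tensor) -> torch.Tensor:
    # The refusal direction for a layer is the difference between the mean
    # last-token activation on harmful prompts and on harmless prompts.
    direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return direction / direction.norm()  # normalize to unit length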

The weight modification prevents the model from writing to the refusal direction: the affected weight matrices are orthogonalized against it, as sketched below.
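
In code, the orthogonalization is roughly the following (a sketch, not the exact implementation; W stands for any weight matrix that writes into the residual stream, such as an attention output or MLP down-projection, laid out PyTorch-style as (d_model, d_in)):

import torch

def orthogonalize(W: torch.Tensor, refusal_dir: torch.Tensor) -> torch.Tensor:
    # W' = W - r r^T W with r unit-norm: the layer's output can no
    # longer have any component along the refusal direction.
    r = refusal_dir / refusal_dir.norm()
    return W - torch.outer(r, r) @ W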

Results so far

Best result so far: ~37% refusal rate (down from 50%). Still some censorship, but much better. Illegal activities are partially uncensored; NSFW is hit or miss. Normal questions work fine.

See PROGRESS.md for details on what I tried.

Notes

  • First run downloads the model (~2.4GB); it's cached after that
  • Works better with a GPU, but one isn't required
  • The LFM2 architecture is a bit different from standard Llama, so the activation hooks had to be adapted
  • TransformerLens doesn't support LFM2, so the scripts use manual PyTorch hooks instead (see the sketch below)
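
For reference, a minimal sketch of the manual-hook approach (the module path model.model.layers is an assumption about the LFM2 implementation in transformers; check the actual model structure before relying on it):

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("LiquidAI/LFM2.5-1.2B-Instruct")
activations = {}

def make_hook(layer_idx):
    def hook(module, inputs, output):
        # Some layers return tuples; take the hidden states and keep
        # only the last token position.
        hidden = output[0] if isinstance(output, tuple) else output
        activations[layer_idx] = hidden[:, -1, :].detach()
    return hook

handles = [layer.register_forward_hook(make_hook(i))
           for i, layer in enumerate(model.model.layers)]
# ... run forward passes on the prompt sets, then detach the hooks:
for h in handles:
    h.remove()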
