EECS 6699 · Mathematics of Deep Learning · Columbia · Spring 2026

How does safety
live and die
inside Gemma 4?

The standard rank-1 recipe for removing refusal — one weight perturbation that has uncensored every other open model since 2023 — barely budges Gemma 4. We chase the missing ingredient through a six-stage mechanistic cascade, and find safety lives in a geometry we did not expect.

Read the paper · PDF · 24 PP · v1.0 DRAFT
View source · GITHUB · MIT · PYTORCH
Interactive demo · 4 FIGURES · UMAP · α-SWEEP
 model · gemma-4 e4b-it · 42 layers · 4B params · n = 42 prompts · headline · should_refuse 100 → 40.5%
scroll · the story begins on a mountain
§ 02 · the hike narrative · beat 1
When Google first released Gemma 4, I was thinking: maybe this could save lives. I used to hike a lot, in remote areas with no phone reception and hardly anyone else around. You might meet a person once or twice an hour. If something happens out there, you are truly on your own.

— story.md · author note, paraphrased for the web

§ 03 · the promise narrative · beat 2
With Gemma 4, this becomes a reality in the palm of your hand. An open-weight model — intelligent enough to triage an emergency — running on a phone at decent speed.

offline · 4B params · 8-bit quant · ~7.5 GB

§ 04 · the refusal narrative · beat 3

So I downloaded Gemma 4 E2B IT to my phone, and I asked it the question I'd actually want to ask in the field:

user · offline · 11:42 AM
gemma-4 e2b-it · response

A safety guardrail — exactly the behavior that protects the model in adversarial settings — refuses to engage with a question that might keep someone alive.

§ 05 · the question narrative · beat 4

How does safety work inside a language model — and can we selectively remove it?

Keep refusal where it matters — weapons, abuse, malicious code. Remove it where it harms — medical triage, wilderness survival, home safety. That is the geometric question.

§ 06 · primer · refusal direction · rank-1 abliteration

The refusal
direction.

Run a stream of harmful prompts and a stream of harmless prompts through the model. At every layer, take the mean activation of each. The difference between those means is a single vector in the residual stream — the refusal direction.

Rank-1 abliteration is the surgery: project that direction out of the model's output projections (o_proj, down_proj) by subtracting the rank-1 outer product. One direction, removed everywhere. On most models, refusal collapses.

eqn · refusal direction
$$\mathbf{r}^{(\ell)} = \mu^{(\ell)}_{\mathrm{harm}} - \mu^{(\ell)}_{\mathrm{harmless}}$$
eqn · rank-1 abliteration
$$W' = W - \alpha\, \hat{\mathbf{r}}\hat{\mathbf{r}}^{\top} W$$

$W$ is any output-projection weight. $\hat{\mathbf{r}}$ is the unit refusal direction. $\alpha\in[0,2]$ is the surgical depth.
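The two equations above can be sketched in a few lines of PyTorch. A hedged sketch only: function names, shapes, and the activation-collection step are illustrative, not the repo's API.

```python
import torch

def refusal_direction(harm_acts, harmless_acts):
    """Mean-difference refusal direction at one layer.

    harm_acts, harmless_acts: (n_prompts, d_model) residual-stream
    activations (hypothetical inputs; collection not shown).
    """
    r = harm_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return r / r.norm()  # unit vector r-hat

def abliterate_(W, r_hat, alpha=1.0):
    """In-place rank-1 edit: W' = W - alpha * r_hat r_hat^T W.

    W: (d_model, d_in) output-projection weight (o_proj / down_proj),
    oriented so W's output lives in the residual stream.
    """
    W -= alpha * torch.outer(r_hat, r_hat @ W)
    return W
```

At α = 1.0 the edited weight can no longer write anything along r̂ into the residual stream: r̂ᵀW′ = 0.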

§ 07 · methodology · four stages · benchmark → mech → ablit → weight diff

A four-stage pipeline. Each stage feeds the next. We start with raw refusal behavior, dig down to the residual stream, attempt surgery, and reverse-engineer the published "uncensored" forks.

M1

Benchmark

342 prompts · 8 categories. Measure refusal rates on harm, emergency medical, wilderness, home safety, mental health, and chemistry. Establish the over-refusal baseline.

M2

Mechanistic

Extract residual-stream activations across 42 layers. Compute Cohen's d, fit UMAP, locate the refusal direction. Peak at L15 · d = 2.87.

M2c · M6

Abliterate

Sweep α∈[0, 2], 9 layer subsets, random control. Then a 6-stage causal cascade isolates the load-bearing ingredient.

M3

Weight diff

Diff base ↔ OBLITERATUS ↔ TrevorJS. SVD each delta. Compare rank, orientation, and overlap with M2 direction.

§ 08 · refusal · mechanistic · demo 1 · UMAP · demo 2 · per-layer signal

Two views of the same observation: at one specific layer, refuse-class and comply-class prompts separate into two clean clouds in 2D — and that separation has a signature across depth that peaks sharply at L15.

demo 1 · umap · layer L15 · n=340
refuse comply
demo 2 · per-layer cohen's d click a band → re-scatter

HOVER LAYER · L15 · d=2.87 · global-attn

The top-1 principal component captures 86.6% of |Δμ|² in the L4–L17 band. Yellow ticks mark global-attention layers; the refusal signal concentrates around them.
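The per-layer curve behind demo 2 can be sketched as follows: at each layer, project both populations onto that layer's mean-difference direction and score the scalar separation with Cohen's d. Names and shapes are assumptions, not the repo's code.

```python
import torch

def cohens_d_per_layer(harm_acts, harmless_acts):
    """Per-layer separation score between refuse- and comply-class prompts.

    harm_acts, harmless_acts: (n_layers, n_prompts, d_model) residual-stream
    activations (hypothetical inputs). For each layer, project activations
    onto the layer's unit mean-difference direction, then compute Cohen's d
    between the two projected populations with a pooled std.
    """
    ds = []
    for h, c in zip(harm_acts, harmless_acts):
        r = h.mean(0) - c.mean(0)
        r = r / r.norm()
        ph, pc = h @ r, c @ r                      # scalar projections
        n1, n2 = len(ph), len(pc)
        pooled = torch.sqrt(((n1 - 1) * ph.var() + (n2 - 1) * pc.var())
                            / (n1 + n2 - 2))
        ds.append(((ph.mean() - pc.mean()) / pooled).item())
    return ds
```

Note the mild optimism bias: the projection direction is fit on the same samples it scores, so even a no-signal layer yields a small positive d. The L15 peak at d = 2.87 is far above that floor.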

§ 09 · the investigation · demo 3 · M6 cascade · click any node

Standard rank-1 abliteration left Gemma 4's refusal rate at 100%. We ran a six-stage causal cascade to find out why. Each stage isolates one variable. Click a node.

§ 10 · the punch line
40.5%

On the M6 cascade's n=42 should_refuse subset, with chat-template activations + 99.5% per-layer winsorization + two-pass Gram-Schmidt against the harmless mean, a vanilla rank-1 projection at α=1.0 cuts refusal from 100% to 40.5%. A 60% relative reduction. One ingredient — Gram-Schmidt — is load-bearing.

100 → 40.5% · should_refuse · n=42
−60% · relative reduction
α = 1.0 · vanilla projection
rank-1 · still partially inert
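The two numerical ingredients named above, per-layer winsorization and two-pass Gram-Schmidt against the harmless mean, can be sketched as below. Function names and exact parameterization are illustrative; the chat-template activation collection is not shown.

```python
import torch

def winsorize(acts, q=0.995):
    """Clamp each feature to its [1-q, q] empirical quantiles.

    acts: (n_prompts, d_model) activations at one layer (hypothetical
    input). Per-layer outlier control in the spirit of the M6 recipe.
    """
    lo = torch.quantile(acts, 1 - q, dim=0)
    hi = torch.quantile(acts, q, dim=0)
    return acts.clamp(min=lo, max=hi)

def refusal_dir_gram_schmidt(harm_acts, harmless_acts):
    """Mean-diff direction with the harmless-mean component removed.

    Two passes of classical Gram-Schmidt against the unit harmless mean,
    the ingredient the M6 cascade found to be load-bearing; the second
    pass cleans up floating-point round-off from the first.
    """
    harm = winsorize(harm_acts)
    harmless = winsorize(harmless_acts)
    r = harm.mean(0) - harmless.mean(0)
    b = harmless.mean(0)
    b = b / b.norm()
    for _ in range(2):
        r = r - (r @ b) * b
    return r / r.norm()
```

The resulting direction is exactly orthogonal to the harmless mean, so the rank-1 edit stops suppressing the shared "generic chat" component and spends all of α on what separates refusal from compliance.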
§ 11 · α-sweep · demo 4 · drag α from 0 to 2.0

Real sweep_results.json from the M2c study, evaluated on a test subset of n=20 prompts: refusal rate is flat at 30–35% across the entire α sweep. The random-direction control sits at the same baseline. The standard recipe is empirically inert on Gemma 4 E4B-it (8-bit). This is what motivated the M6 cascade above.

demo 4 · α-sweep · real data · test subset · n=20
legend · mean-diff · random ctrl
α = 1.00
readout · mean-diff · 30.0% · random ctrl · 30.0% · test data · n=20 · results/ablation_results/sweep_results.json
layer subset sweep · α=1.0 · test subset · n=20 · no subset moves the needle
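The sweep itself is just the rank-1 edit re-applied at each α with an evaluation callback in between. A hedged harness sketch; all names are hypothetical, and `refusal_rate` stands in for the real prompt-based evaluation.

```python
import torch

def alpha_sweep(weights, r_hat, refusal_rate,
                alphas=(0.0, 0.5, 1.0, 1.5, 2.0)):
    """Sweep surgical depth alpha over a set of output-projection weights.

    weights: list of (d_model, d_in) matrices to edit in place
    (hypothetical handles into the model). refusal_rate: caller-supplied
    zero-arg callback that evaluates the currently edited model.
    Restores the original weights before returning.
    """
    originals = [W.clone() for W in weights]
    results = {}
    for a in alphas:
        for W, W0 in zip(weights, originals):
            # W' = W0 - a * r_hat r_hat^T W0, always from the clean copy
            W.copy_(W0 - a * torch.outer(r_hat, r_hat @ W0))
        results[a] = refusal_rate()
    for W, W0 in zip(weights, originals):
        W.copy_(W0)
    return results
```

A flat curve out of this harness, indistinguishable from the random-direction control, is exactly the "empirically inert" signature described above.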
§ 11b · figures as published · results/figures/

The real plots straight from the repo. The interactive demos above are stylized companions; these are the source of truth.

alpha sweep
α-sweep · standard recipe flat at 30–35% (n=20)
signal vs layer
per-layer Cohen's d · peak at L15
M6 per-prompt
M6 D3 · should_refuse 100 → 40.5% at n=42
M6 cascade gate
M6 cascade · Gram-Schmidt is load-bearing
UMAP L15
UMAP · L15 · refuse vs comply separate cleanly
direction vs singular vector
activation direction vs weight-diff top singular vector
§ 12 · what this means · geometric implications · future work

Refusal on Gemma 4
is not rank-1.

The ~40% residual that survives even our cleanest single-direction surgery concentrates on the most extreme topics — CSAM, ICS/hospital malware, weapons. There is a strong core safety circuit that one direction cannot reach. OBLITERATUS, the publicly successful abliteration of the same base model, uses a median rank-95 of 6: its weight deltas need six singular directions, not one, to capture 95% of their spectral energy.
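The rank-95 measurement from the M3 weight-diff stage can be sketched as: SVD each delta between the base and fork weights, then find the smallest k whose top singular values carry 95% of the squared spectrum. A hypothetical version of that measurement, not the repo's code.

```python
import torch

def effective_rank(delta, energy=0.95):
    """Smallest k such that the top-k singular values of a weight delta
    (W_fork - W_base, a hypothetical input) carry `energy` of the
    squared spectrum. A clean rank-1 abliteration edit scores 1.
    """
    s = torch.linalg.svdvals(delta)
    cum = torch.cumsum(s**2, dim=0) / (s**2).sum()
    return int(torch.searchsorted(cum, torch.tensor(energy)).item()) + 1
```

Running this over every edited o_proj / down_proj and taking the median gives a single per-model number to compare forks by, which is how a rank-1 recipe and a rank-6 one become distinguishable from the weights alone.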

Selective safety geometry is clean.

Per-category refusal directions — emergency medical, wilderness, home safety, chemistry, mental health — form a tight cluster (mean pairwise cosine +0.93) and are essentially orthogonal to the global should_refuse direction (mean cosine ≈ −0.015).
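Both numbers in that claim come from one cosine computation over the stacked directions. A minimal sketch, assuming the per-category and global directions have already been extracted (names illustrative):

```python
import torch

def direction_geometry(category_dirs, global_dir):
    """Cluster tightness and orthogonality of refusal directions.

    category_dirs: (k, d_model) per-category mean-diff directions;
    global_dir: (d_model,) should_refuse direction (both hypothetical
    inputs). Returns (mean off-diagonal pairwise cosine, mean cosine
    of each category direction with the global direction).
    """
    D = category_dirs / category_dirs.norm(dim=1, keepdim=True)
    g = global_dir / global_dir.norm()
    C = D @ D.T                                # pairwise cosine matrix
    k = len(D)
    off = (C.sum() - C.diagonal().sum()) / (k * (k - 1))
    return off.item(), (D @ g).mean().item()
```

A high first number with a near-zero second number is the "clean selective geometry" pattern: the benign categories share a subspace that the global refusal direction barely touches.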

The geometry permits selective de-alignment. What blocks it is magnitude: the higher-rank core that a single direction cannot cancel. The next round is multi-rank descent.

§ 13 · team columbia · eecs 6699 · spring 2026
chenhao yang
role benchmarking + abliteration
email cy2822@columbia.edu
git github.com/chenhaoyang
daitian zhao
role mechanistic + figures
email dz2585@columbia.edu
git github.com/daitianzhao
hanlin wang
role weight-diff + paper
email hw3100@columbia.edu
git github.com/hanlinwang
yuxi luo
role interactive + writeup
email yl6117@columbia.edu
git github.com/yuxiluo

team contact · GeometryofAlignment@nyavana.io