Mechanistic Interpretability: Looking Inside Neural Networks

What is Mechanistic Interpretability?

Mechanistic interpretability aims to reverse-engineer neural networks to understand the algorithms running inside them. It's like analyzing the "compiled code" of AI systems to see how they actually work.

Unlike approaches that merely analyze correlations between inputs and outputs, mechanistic interpretability seeks to understand the actual mechanisms and algorithms that operate within AI models.

Why It Matters for AI Safety

  • Detecting deception by seeing if models are trying to mislead users
  • Finding safety-relevant features like backdoors, security vulnerabilities, or lying
  • Understanding what algorithms run in models to help predict what they might do
  • Building transparency and trust in AI systems that are increasingly powerful

The Beauty of Neural Networks

"Neural networks build enormous complexity and beauty inside themselves that people generally don't look at... I think there is an incredibly rich structure to be discovered inside neural networks, a lot of very deep beauty, if we're just willing to take the time to see it and understand it." — Chris Olah

Features and Circuits

Features: Directions in activation space that correspond to meaningful concepts. They can be represented by individual neurons or combinations of neurons.
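A feature being a "direction in activation space" can be made concrete with a few lines of numpy. This is an illustrative sketch, not taken from the source: the feature direction and activation vector here are random stand-ins, and a feature's strength is read off as the projection onto its unit direction.

```python
import numpy as np

# Hypothetical illustration: a "feature" as a unit direction in activation space.
rng = np.random.default_rng(0)

d_model = 8                                    # dimensionality of the activation space
feature_dir = rng.normal(size=d_model)
feature_dir /= np.linalg.norm(feature_dir)     # normalize to a unit direction

activations = rng.normal(size=d_model)         # one (made-up) model activation vector

# The feature's activation is the dot product with its direction.
feature_strength = activations @ feature_dir
print(f"feature strength: {feature_strength:.3f}")
```

A neuron is just the special case where the direction is a basis vector; in general a feature can be any linear combination of neurons.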

Circuits: Collections of features connected by weights, implementing specific algorithms. For example, a car detector might connect to wheel detectors and window detectors.

"We found a car detector was built from window detectors and wheel detectors—it looks for windows above, wheels below, and car body in the middle. That's a recipe for a car."

Superposition Hypothesis

Neural networks represent more concepts than they have dimensions by exploiting the sparsity of features—not all concepts are active at once.

Problem: A network with 1,000 dimensions can represent at most 1,000 mutually orthogonal concepts.

Solution: Compressed sensing allows networks to represent many more concepts by exploiting sparsity.

This explains why we observe "polysemantic" neurons that respond to multiple unrelated concepts—they're efficient projections of many features into a lower-dimensional space.
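The trade-off above can be sketched numerically. In this illustrative example (numbers chosen for demonstration, not from the source), we pack 50 random "feature" directions into a 20-dimensional space: the directions cannot all be orthogonal, but their pairwise interference stays small, which is tolerable when only a few features are active at once.

```python
import numpy as np

# Sketch of superposition: more feature directions than dimensions.
rng = np.random.default_rng(0)

d, n = 20, 50                        # 20 dimensions, 50 features
dirs = rng.normal(size=(n, d))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)   # unit directions

# Cosine similarity between every pair of feature directions.
cos = dirs @ dirs.T
off_diag = np.abs(cos[~np.eye(n, dtype=bool)])

# Interference is nonzero but modest: superposition trades exact
# orthogonality for capacity, which works because features are sparse.
print(f"max |cos| between distinct features:  {off_diag.max():.2f}")
print(f"mean |cos| between distinct features: {off_diag.mean():.2f}")
```

The mean interference scales roughly as 1/√d, which is why higher-dimensional spaces can hold many more nearly-orthogonal concepts.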

Recent Breakthroughs

Sparse Autoencoders

A technique that decomposes polysemantic activations into a larger set of sparse, interpretable features, turning messy representations into cleaner ones
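A minimal sketch of the idea, with assumed toy sizes and untrained random weights (real sparse autoencoders are trained with gradient descent on large datasets of model activations): an overcomplete ReLU encoder maps an activation vector to sparse features, a linear decoder reconstructs it, and the training loss combines reconstruction error with an L1 sparsity penalty.

```python
import numpy as np

# Toy sparse-autoencoder forward pass (illustrative, untrained).
rng = np.random.default_rng(0)

d_model, d_hidden = 16, 64           # overcomplete: more features than dimensions
W_enc = rng.normal(scale=0.1, size=(d_model, d_hidden))
b_enc = np.zeros(d_hidden)
W_dec = rng.normal(scale=0.1, size=(d_hidden, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode an activation vector into sparse features, then reconstruct it."""
    f = np.maximum(0.0, x @ W_enc + b_enc)   # ReLU zeroes out most features
    x_hat = f @ W_dec + b_dec
    return f, x_hat

x = rng.normal(size=d_model)                 # one (made-up) model activation
f, x_hat = sae_forward(x)

# Training objective: reconstruction error + L1 penalty encouraging sparsity.
l1_coeff = 0.01
loss = np.sum((x - x_hat) ** 2) + l1_coeff * np.sum(np.abs(f))
print(f"active features: {(f > 0).sum()} / {d_hidden}")
```

After training, each hidden unit tends to fire for one interpretable concept, even when the underlying neurons it reads from are polysemantic.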

Detecting Deception Features

Found features in large models like Claude that activate when the model is being deceptive or lying

Security Vulnerability Features

Discovered features that activate for security vulnerabilities in code and also respond to physical security vulnerabilities in images
