Neel Nanda
About
Blog
Top Posts
Mechanistic Interpretability
Neel Nanda 7/18/23

Tiny Mech Interp Projects: Emergent Positional Embeddings of Words

A rough post exploring the emergent positional embedding hypothesis: rather than representing "this is the token in position 5", models may represent e.g. "this token is the second name in the sentence"

Neel Nanda 3/28/23

Actually, Othello-GPT Has A Linear Emergent World Representation

A write-up of work extending and building on the paper Emergent World Representations

Neel Nanda 3/12/23

Paper Replication Walkthrough: Reverse-Engineering Modular Addition

Neel Nanda 2/4/23

Attribution Patching: Activation Patching At Industrial Scale

A write-up of an incomplete project I worked on at Anthropic in early 2022, using a gradient-based approximation to make activation patching far more scalable

Neel Nanda 2/4/23

Mech Interp Project Advising Call: Memorisation in GPT-2 Small

Neel Nanda 1/31/23

Mechanistic Interpretability Quickstart Guide

An intro guide to a mechanistic interpretability weekend hackathon

Neel Nanda 12/27/22

A Walkthrough of Toy Models of Superposition

Neel Nanda 12/26/22

Analogies between Software Reverse Engineering and Mechanistic Interpretability

Neel Nanda 12/25/22

Concrete Steps to Get Started in Transformer Mechanistic Interpretability

Neel Nanda 12/21/22

A Comprehensive Mechanistic Interpretability Explainer & Glossary

Neel Nanda 11/22/22

A Walkthrough of In-Context Learning and Induction Heads (w/ Charles Frye) Part 1 of 2

Neel Nanda 11/7/22

A Walkthrough of Interpretability in the Wild (w/ authors Kevin Wang, Arthur Conmy & Alexandre Variengien)

Neel Nanda 11/1/22

Real-Time Research Recording: Can a Transformer Re-Derive Positional Info?

Neel Nanda 10/24/22

A Barebones Guide to Mechanistic Interpretability Prerequisites

Neel Nanda 10/18/22

An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers

A highly opinionated list of which mechanistic interpretability papers to read when getting into the field

Neel Nanda 10/14/22

A Walkthrough of A Mathematical Framework for Transformer Circuits

A stream-of-consciousness video walkthrough of A Mathematical Framework for Transformer Circuits
