Neel Nanda 7/18/23

Tiny Mech Interp Projects: Emergent Positional Embeddings of Words

A rough post exploring the emergent positional embedding hypothesis: rather than representing "this is the token in position 5", models may represent, e.g., "this token is the second name in the sentence".

Neel Nanda 3/28/23

Actually, Othello-GPT Has A Linear Emergent World Representation

A write-up of work extending and building on the paper Emergent World Representations

Neel Nanda 3/12/23

Paper Replication Walkthrough: Reverse-Engineering Modular Addition

Neel Nanda 2/4/23

Attribution Patching: Activation Patching At Industrial Scale

A write-up of an incomplete project I worked on at Anthropic in early 2022, using a gradient-based approximation to make activation patching far more scalable

Neel Nanda 2/4/23

Mech Interp Project Advising Call: Memorisation in GPT-2 Small

Neel Nanda 1/31/23

Mechanistic Interpretability Quickstart Guide

An intro guide to a mechanistic interpretability weekend hackathon

Neel Nanda 12/27/22

A Walkthrough of Toy Models of Superposition

Neel Nanda 12/26/22

Analogies between Software Reverse Engineering and Mechanistic Interpretability

Neel Nanda 12/25/22

Concrete Steps to Get Started in Transformer Mechanistic Interpretability

Neel Nanda 12/21/22

A Comprehensive Mechanistic Interpretability Explainer & Glossary

Neel Nanda 11/22/22

A Walkthrough of In-Context Learning and Induction Heads (w/ Charles Frye) Part 1 of 2

Neel Nanda 11/7/22

A Walkthrough of Interpretability in the Wild (w/ authors Kevin Wang, Arthur Conmy & Alexandre Variengien)

Neel Nanda 11/1/22

Real-Time Research Recording: Can a Transformer Re-Derive Positional Info?

Neel Nanda 10/24/22

A Barebones Guide to Mechanistic Interpretability Prerequisites

Neel Nanda 10/18/22

An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers

A highly opinionated list of what mechanistic interpretability papers to read when getting into the field

Neel Nanda 10/14/22

A Walkthrough of A Mathematical Framework for Transformer Circuits

A stream-of-consciousness video walkthrough of A Mathematical Framework for Transformer Circuits
