Tiny Mech Interp Projects: Emergent Positional Embeddings of Words
A rough post exploring the emergent positional embedding hypothesis: rather than representing "this is the token in position 5", models may represent, e.g., "this token is the second name in the sentence"
Actually, Othello-GPT Has A Linear Emergent World Representation
A write-up of work extending and building on the paper Emergent World Representations
Attribution Patching: Activation Patching At Industrial Scale
A write-up of an incomplete project I worked on at Anthropic in early 2022, using a gradient-based approximation to make activation patching far more scalable
Mechanistic Interpretability Quickstart Guide
An intro guide to a mechanistic interpretability weekend hackathon
An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers
A highly opinionated list of which mechanistic interpretability papers to read when getting into the field
A Walkthrough of A Mathematical Framework for Transformer Circuits
A stream-of-consciousness video walkthrough of A Mathematical Framework for Transformer Circuits