A Walkthrough of A Mathematical Framework for Transformer Circuits

Oct 14

A Mathematical Framework for Transformer Circuits is, in my opinion, the coolest paper I've ever had the privilege of working on. But it's also very long, dense, and at times confusing, and this makes me sad! So I've run an experiment: I recorded myself reading through the paper and narrating a stream of consciousness as I went - which bits are particularly cool but under-appreciated, which bits are a bit of a waste of time, which bits I think do or do not replicate, and my attempts to explain the parts I find particularly confusing. You can watch it here. Sadly, it turns out I have a lot to say about Transformer Circuits, and this turned into a 3-hour monologue, but I hope it's still useful! This is an experimental format for me as a way of doing good research communication, and I'd love to hear feedback on how well it works for you! It was much easier to make than writing an entire paper, but it could easily be a total waste of time if it's not clear enough to be useful!

The views in this video are entirely my personal takes. The paper was a team effort from everyone at Anthropic, especially Chris Olah, Nelson Elhage and Catherine Olsson, and I am no longer employed by Anthropic. I do not necessarily expect that any of the other authors would agree with any specific thing I've said, but I hope an unfiltered series of takes is useful!


Neel Nanda
