Interlude: A Mechanistic Interpretability Analysis of Grokking

I left my job at Anthropic a few months ago, and since then I’ve been taking some time off and poking around at some independent research. And I’ve just published my first set of interesting results! I used mechanistic interpretability tools to investigate what’s going on with the ML phenomenon of grokking, where models trained on simple mathematical operations (like addition mod 113, given 30% of the data) will initially memorise the training data, but then, if trained for long enough, will abruptly generalise, with a sharp transition visible in the training curve.

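For concreteness, here’s a minimal sketch of the kind of dataset that setup refers to: the full table of pairs for addition mod 113, with a random 30% held out as the training set. This is written in PyTorch, is not taken from the linked Colab notebook, and all the names are illustrative.

```python
import torch

P = 113          # modulus for the addition task
TRAIN_FRAC = 0.3  # the 30% data split mentioned above

# All P*P input pairs (a, b) and their labels (a + b) mod P.
a, b = torch.meshgrid(torch.arange(P), torch.arange(P), indexing="ij")
pairs = torch.stack([a.flatten(), b.flatten()], dim=1)   # shape (P*P, 2)
labels = (pairs[:, 0] + pairs[:, 1]) % P                 # shape (P*P,)

# Random 30% / 70% train/test split of the full table.
perm = torch.randperm(P * P)
n_train = int(TRAIN_FRAC * P * P)
train_idx, test_idx = perm[:n_train], perm[n_train:]
train_pairs, train_labels = pairs[train_idx], labels[train_idx]
test_pairs, test_labels = pairs[test_idx], labels[test_idx]
```

A small model trained on the training split will first drive training loss to zero by memorisation, and only much later does test accuracy jump, which is the grokking behaviour the write-up digs into.
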
This blog isn’t really the best place to host technical write-ups, but if you’re interested you can check it out here:

Summary Tweet Thread

Technical write-up on the Alignment Forum

Colab notebook with code + technical details
