Interlude: A Mechanistic Interpretability Analysis of Grokking
I left my job at Anthropic a few months ago, and since then I’ve been taking some time off and poking around at some independent research. I’ve just published my first set of interesting results! I used mechanistic interpretability tools to investigate the ML phenomenon of grokking: models trained on a simple mathematical operation like addition mod 113, and given only 30% of the possible data, will initially memorise the training data, but if trained for long enough will abruptly generalise, with a training curve like this:
This blog isn’t really the best place to host technical write-ups, but if you’re interested you can check it out here:
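To make the setup concrete, here’s a minimal sketch of the kind of dataset involved - my own illustration rather than the code from the write-up - with every pair (a, b) for a, b in 0..112 labelled with (a + b) mod 113, and a random 30% of the pairs used for training:

```python
import numpy as np

P = 113           # modulus for the addition task
TRAIN_FRAC = 0.3  # fraction of pairs shown to the model

# Every possible input pair (a, b) and its label (a + b) mod P.
pairs = np.array([(a, b) for a in range(P) for b in range(P)])
labels = (pairs[:, 0] + pairs[:, 1]) % P

# Randomly reveal 30% of the pairs as training data; hold out the rest.
rng = np.random.default_rng(0)
perm = rng.permutation(len(pairs))
n_train = int(TRAIN_FRAC * len(pairs))
train_idx, test_idx = perm[:n_train], perm[n_train:]

train_x, train_y = pairs[train_idx], labels[train_idx]
test_x, test_y = pairs[test_idx], labels[test_idx]

print(f"{len(train_x)} training pairs, {len(test_x)} held-out pairs out of {P * P}")
```

A small model trained on the 30% split will typically reach perfect training accuracy fairly quickly, and only much later - after many more steps - start getting the held-out pairs right, which is the abrupt generalisation in the curve above.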