Mechanistic Interpretability research @DeepMind. Formerly @AnthropicAI, independent. In this to reduce AI X-risk. Neural networks can be understood, let's do it!
Neel Nanda (at ICLR) @NeelNanda5 · May 10, 2023
Great paper and elegant setup! This is another nice illustration of how it is so, so easy to trick yourself when interpreting LLMs. I would love an interpretability project distinguishing faithful from unfaithful chain of thought! Anyone know what the smallest open source model…
Neel Nanda (at ICLR) @NeelNanda5 · Apr 15, 2023
I really enjoyed @yonashav's paper on how we might create a world where all large training runs can be monitored - feels like the best AI governance proposal I can recall seeing! Makes me optimistic there are policies simple enough to be realistic, yet useful enough to matter