Mentions
Neel Nanda (at ICLR) @NeelNanda5 · May 10, 2023
Great paper and elegant setup! This is another nice illustration of how it is so, so easy to trick yourself when interpreting LLMs. I would love an interpretability project distinguishing faithful from unfaithful chain of thought! Anyone know what the smallest open source model…