Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting
- Paper
- May 7, 2023
- #NaturalLanguageProcessing
Large Language Models (LLMs) can achieve strong performance on many tasks by producing step-by-step reasoning before giving a final output, often referred to as chain-of-thought reasoning (CoT).
Mentions
Neel Nanda (at ICLR) @NeelNanda5
May 10, 2023
Great paper and elegant setup! This is another nice illustration of how it is so, so easy to trick yourself when interpreting LLMs. I would love an interpretability project distinguishing faithful from unfaithful chain of thought! Anyone know what the smallest open source model…