Thread
The OpenAI Minecraft paper is a big step toward getting AI to work in Photoshop, Figma, or any software product, using just the keyboard & mouse, like a person would.

Steps in the paper, explained 🧵

1/
openai.com/blog/vpt/
1. First, hire people who are OK at Minecraft to play it. Record their screen and their key & mouse strokes. This costs $2k for 2k hrs of video in total.

This is your small dataset.

2/
2. Train a model on this small dataset. Let the model look a little bit into the past and a little bit into the future in the videos. Let it predict the key & mouse strokes the person used, aligned to the video.

This is your small model.
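
A rough sketch of what "a little bit in the past and a little bit in the future" could mean in code. The function name, the window sizes, and the frame counts are made up for illustration; they are not taken from the paper.

```python
def visible_frames(t, n_frames, past, future):
    """Return the indices of the frames a model may look at when labeling frame t."""
    start = max(0, t - past)               # don't run off the start of the video
    end = min(n_frames - 1, t + future)    # don't run off the end of the video
    return list(range(start, end + 1))


# The labeling model peeks a little into the past AND the future.
labeler_view = visible_frames(t=10, n_frames=100, past=2, future=2)

# A model that plays the game live could only ever see the past.
player_view = visible_frames(t=10, n_frames=100, past=2, future=0)

print(labeler_view)  # [8, 9, 10, 11, 12]
print(player_view)   # [8, 9, 10]
```

Seeing what happens next makes it much easier to guess which key was just pressed, which is why the labeling model gets the wider window.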

2/
3. After it's trained, use your small model to predict key & mouse strokes on 70k hrs of video, which you scraped from the internet. You didn't hire anyone for these and don't have recorded key & mouse strokes.

This is your large dataset.
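
A toy sketch of this labeling step, assuming each video is just a list of frames. `toy_small_model` is a hypothetical stand-in for the trained small model, and the action names are invented for the example.

```python
def pseudo_label(videos, small_model):
    """Attach a predicted action to every frame of every unlabeled video."""
    labeled = []
    for frames in videos:
        actions = [small_model(frames, t) for t in range(len(frames))]
        labeled.append(list(zip(frames, actions)))
    return labeled


def toy_small_model(frames, t):
    # Hypothetical stand-in: guess "forward" if the next frame looks
    # different from the current one, otherwise "noop".
    next_frame = frames[min(t + 1, len(frames) - 1)]
    return "forward" if next_frame != frames[t] else "noop"


# Two "scraped" videos with no recorded key & mouse strokes.
scraped_videos = [["a", "a", "b"], ["c", "d"]]
large_dataset = pseudo_label(scraped_videos, toy_small_model)
print(large_dataset[0])  # [('a', 'noop'), ('a', 'forward'), ('b', 'noop')]
```

The point is that nobody is hired for this step: the labels are the small model's guesses, not ground truth.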

3/
4. Train another model on this large dataset. Train it to press the right keys and move the mouse the right way, given only what it has seen so far.

This is your large model.
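
A deliberately tiny sketch of the idea of learning to copy the labeled actions. The real large model is a neural network that looks at past video; here, as a loudly simplified assumption, it is a lookup table from the current frame to the most common labeled action. All names and data are invented.

```python
from collections import Counter, defaultdict


def train_policy(labeled_videos):
    """Learn, for each frame, the action most often paired with it."""
    counts = defaultdict(Counter)
    for video in labeled_videos:
        for frame, action in video:
            counts[frame][action] += 1
    table = {frame: c.most_common(1)[0][0] for frame, c in counts.items()}
    # Fall back to doing nothing on frames never seen in training.
    return lambda frame: table.get(frame, "noop")


toy_dataset = [
    [("tree", "attack"), ("tree", "attack"), ("grass", "forward")],
    [("tree", "forward")],
]
policy = train_policy(toy_dataset)
print(policy("tree"))   # attack
print(policy("grass"))  # forward
print(policy("lava"))   # noop
```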

4/
5. Then, watch your large model play on its own, and it does some pretty nifty things.

All of this is impressive because:
- The different moves you can make are much more open-ended than anything AI has been able to do before.
...

5/
- It's only $2k to get enough data for the smaller dataset to successfully get us here.
- The AI acts like a person playing Minecraft — there isn't some special set of keys we gave it to make this whole thing a little easier.

6/
OK wow this blew up. Let me acknowledge @jeffclune and his team for their amazing work on this! Jeff has also worked on helping wildlife using camera trap photos. He's a really kind and humble soul, and I'm so very excited to see the work he'll be brewing up next.

7/
Additional great authors, tagged to the best of my Twitter search abilities:
@bobabowen (great username)
Ilge Akkaya
Peter Zhokhov
@Joost_Huizinga
Jie Tang
@AdrienLE
Brandon Houghton
Raul Sampedro

8/
The AI Minecraft never ends. At nearly the same time, NVIDIA dropped some amazing work: Imagine if you could go from writing a sentence or two to getting the Minecraft scene you described before you.

Implication: This could be the future of how we control software platforms.

9/
To get there, NVIDIA has released a gigantic dataset of millions of written comments and 300k hours of narrated gameplay.

Wow, people play a lot of video games!

My incredibly brilliant friends @DrJimFan & @AnimaAnandkumar are behind this effort.

More:
