You may have heard of Nvidia’s VOYAGER paper. Jim Fan and team created an “LLM-powered embodied lifelong learning agent”.
In other words, they created a Minecraft bot that could navigate the world and learn new skills. They did it using GPT-4, the model that powers ChatGPT.
Sounds simple, right? Just prompt the GPT to tell the bot what to do. Alas—not that simple.
You see, GPT-4 is really slow.
I’m not talking about 10AM on a Tuesday when software developers across the US are clogging the OpenAI servers with questions about why their git push is failing.
No, GPT-4 is slow even when it’s at its fastest.
Imagine you’re a Minecraft bot, and you detect a zombie noise behind you. You send that data to GPT-4, and the reply begins streaming in:
“I’m sorry you’re experiencing a distressing situation. The noise you hear may be a mob that wishes you harm. I’d recommend turning around and…”
Too late—you reached the “Game Over” screen after “I’m sorry”. A bot operating in a real-time environment like Minecraft needs to have split-second reaction time.
Before you stop reading because you don’t care about Minecraft, notice that the same problem applies to real-life robots. Your self-driving car will need to dodge that motorcycle flying down the center line. It won’t be able to wait for GPT-4’s six bullet points on how to handle dangerous drivers.
GPT-4 sounds like the wrong tool for the job, right?
Well, the smart folks at Nvidia realized that GPT-4 has a trick up its sleeve.
First of all, it has read the internet, so it definitely has the “knowledge” required to control a Minecraft bot. It knows what Minecraft is, and it knows what a bot is.
But GPT-4 also has a superpower that allows it to control a bot at the speed of light:
GPT-4 can code.
It may not be able to instruct the bot to pivot, pull out its sword, and start hacking away at the zombie—but it can certainly write code to do all that. The LLM generates code that leverages the Mineflayer API to spawn and control bots in a Minecraft environment.
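To make the idea concrete, here's a toy sketch. The real VOYAGER skills call Mineflayer's API; the bot object below is an invented stand-in so the example is self-contained, but the key point survives: the skill is written once by the LLM, and at runtime it executes as plain code, with no round trip to the model.

```javascript
// Invented stand-in for a Mineflayer-style bot object -- purely illustrative.
// The real API is the Mineflayer library, which VOYAGER's generated code targets.
function makeBot() {
  return {
    heldItem: null,
    log: [],
    equip(item) {
      this.heldItem = item;
      this.log.push(`equip:${item}`);
    },
    attack(target) {
      this.log.push(`attack:${target}`);
    },
  };
}

// A "skill" the LLM might emit once, ahead of time. When the zombie
// actually shows up, this runs instantly -- no waiting on GPT-4.
function combatZombie(bot) {
  bot.equip("stone_sword");
  bot.attack("zombie");
}

const myBot = makeBot();
combatZombie(myBot);
console.log(myBot.log); // ["equip:stone_sword", "attack:zombie"]
```

The slow part (asking GPT-4 to write the function) happens offline; the fast part (running it) is just JavaScript.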
Library of Functions
I introduce to you the Skill Library, a collection of JavaScript functions that GPT-4 writes to direct bots in the real-time simulation environment that is Minecraft. See the diagram provided in the VOYAGER paper:
Example skills include “Mine Wood Log”, “Craft Stone Sword”, and “Combat Zombie”. If you’re wondering how “mining” is something you can do to a wood log…that’s just how it is.
Here’s the cool part: skills can build upon other skills. In the paper’s example, the combatZombie() skill leverages the skills craftStoneSword() and craftShield().
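A hypothetical sketch of that composition — the function names come from the paper, but the bodies here are invented (VOYAGER's real skills call the Mineflayer API):

```javascript
// Low-level skills: invented toy bodies that just track a shared inventory.
function craftStoneSword(inventory) {
  inventory.push("stone_sword");
  return "stone_sword";
}

function craftShield(inventory) {
  inventory.push("shield");
  return "shield";
}

// A higher-level skill composed from the two above --
// functions calling functions, i.e. software.
function combatZombie(inventory) {
  const sword = craftStoneSword(inventory);
  const shield = craftShield(inventory);
  return `fight zombie with ${sword} and ${shield}`;
}

const inventory = [];
console.log(combatZombie(inventory)); // "fight zombie with stone_sword and shield"
console.log(inventory);               // ["stone_sword", "shield"]
```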
What’s another word for a collection of code functions that all build upon one another?
Software. VOYAGER is designed to write software.
You probably already know that ChatGPT can write code snippets. It can give you that tedious Excel macro or provide python code to loop through a multi-dimensional array.
However, if you’ve ever tried to use ChatGPT to create a large, complex codebase, you’ll immediately run into a problem. Large codebases have a lot of dependencies, so if you want to update a particular function, you may need to provide many files as context in order for the LLM to understand how to correctly update that function.
Or, if you write spaghetti code like me, you might need to give ChatGPT the whole codebase as context. That’s a lot of copy-paste. Worse, ChatGPT might not be able to “pay attention” to a context that large. Can you blame it for losing interest?
Vector Database of Functions
Nvidia came up with a better way. They take each function in the codebase (the Skill Library), and they run it through GPT-3.5 to generate a description of the function in natural language. Then they take that description and convert it to a vector embedding.
While vector embeddings are very cool, and I’d love to explain in detail what they are, all you need to know is that they are mathematical representations of chunks of text that allow you to compare those chunks of texts to one another. Something like this:
Vector(King) - Vector(Man) + Vector(Woman) ~= Vector(Beyoncé)
But the text can be longer than one word. If you want more info, here’s an article from Pinecone, a platform I’ve used to host vector databases.
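The standard way to compare two embeddings is cosine similarity: close to 1 means the texts point in the same semantic direction, close to 0 means they're unrelated. Here's a tiny illustration with made-up 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions and come from a model, not from my imagination):

```javascript
// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a, b) {
  const dot = a.reduce((sum, x, i) => sum + x * b[i], 0);
  const norm = (v) => Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  return dot / (norm(a) * norm(b));
}

// Made-up toy embeddings for three skill descriptions.
const combatZombieVec  = [0.9, 0.1, 0.0];
const craftSwordVec    = [0.8, 0.2, 0.1];
const eatWatermelonVec = [0.0, 0.1, 0.9];

console.log(cosineSimilarity(combatZombieVec, craftSwordVec));    // ~0.98, very similar
console.log(cosineSimilarity(combatZombieVec, eatWatermelonVec)); // ~0.01, unrelated
```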
Back to VOYAGER.
Once VOYAGER has written a skill function, generated its description, and converted it to an embedding, it then stores the skill in a vector database. Then, when it needs to create a new skill (i.e., a new code function), it queries the vector database for the 5 most similar skills. These are the skills that GPT-4 will be able to leverage when creating the new skill function.
For instance, when GPT-4 is writing the function to fight a zombie, it will retrieve skills related to combat, like creating swords and shields, since those will be semantically similar to the task of fighting a zombie. At that moment, the LLM does not need to know about the skill to eat a watermelon.
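A minimal sketch of that retrieval step. The skill names and toy 2-D vectors are made up, and a real system queries a vector database (VOYAGER doesn't do a brute-force scan like this), but the ranking logic is the same idea:

```javascript
// Cosine similarity, as before.
function cosineSimilarity(a, b) {
  const dot = a.reduce((sum, x, i) => sum + x * b[i], 0);
  const norm = (v) => Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  return dot / (norm(a) * norm(b));
}

// Score every stored skill against the query embedding and keep the top k --
// the heart of the retrieval step in RAG.
function topKSkills(library, queryVec, k) {
  return library
    .map((skill) => ({ name: skill.name, score: cosineSimilarity(skill.vec, queryVec) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((skill) => skill.name);
}

// Made-up skill library with toy embeddings: combat-ish skills point one way,
// food-ish skills another.
const skillLibrary = [
  { name: "craftStoneSword", vec: [0.9, 0.1] },
  { name: "craftShield",     vec: [0.8, 0.2] },
  { name: "eatWatermelon",   vec: [0.1, 0.9] },
];

// Query embedding for "fight a zombie" -- combat direction.
console.log(topKSkills(skillLibrary, [1.0, 0.0], 2));
// ["craftStoneSword", "craftShield"] -- eatWatermelon stays out of the prompt
```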
This process of saving data to a vector database and then pulling out the most relevant bits for prompts to an LLM is called Retrieval-Augmented Generation (RAG).
RAG is not new, so why do we care about this whole VOYAGER thing?
We care because it happens to work.
GPT-4 Writes Software
VOYAGER uses GPT-4 to create a skill library that achieves better results than the same system without it. This means that, at least in the domain of Minecraft bots, GPT-4 is able to develop a non-trivial codebase of skills that build upon one another.
In other words, LLMs can do software development.
Now, most codebases are much more complex than the skills library that GPT-4 builds for VOYAGER. Maybe the model can’t handle additional complexity. However, I don’t see anything in the paper to suggest that they hit the limit of the capabilities of the model.
They could have done more.
For example, one interesting avenue to explore would be other objectives besides “find as many items as possible”. You could ask GPT-4 to generate a curriculum to teach a bot to “survive a night outside without sleeping or hiding”.
I’m interested in objectives that would align with survival pressures, since that is how biological systems learn in the real world. This could force the model to create a more complex and nuanced library of skills.
Why didn’t Nvidia keep going?
The only hint in the paper is that at one point they mention “budgetary considerations” as a reason for reducing GPT-4 usage. I’m guessing they used up their budget around the time VOYAGER crafted a diamond sword.
This suggests that the barrier to using GPT-4 for large scale software development may not be capabilities. It may just be usage costs.
The VOYAGER paper was written before GPT-4 Turbo, so we now have a better model available at half the price.
If only we had an Nvidia budget.