It's always a treat to watch a Carmack lecture or read anything he writes, and his notes here are no exception. He writes as an engineer, for engineers, documenting all his thought processes and missteps in exactly the detailed yet concise way you'd want from a colleague handing off some work.
One question I have about the research direction is the emphasis on realtime. If I understand correctly, he's doing online learning in realtime. It obviously makes for a cool demo and plays to his optimisation background, and no doubt some great innovations will be required to make it work. But the bitter lesson and recent history also tell us that some solutions may only emerge at compute levels beyond what is currently possible for realtime inference, let alone learning. And the only example we have of an entity solving Atari games is the human brain, whose compute capacity we don't clearly understand. In which case, why wouldn't it be better to focus purely on learning efficiency and relax the realtime requirement for now?
That's a genuine question by the way, definitely not an expert here and I'm sure there's a bunch of value to working within these constraints. I mean, jumping spiders solve reasonably complex problems with 100k neurons, so who knows.
I wish he did this with a VR environment instead, like they mention at the start of the slides: a VR environment with a JPEG camera filter, a physics sim, noise, and a robot simulation. If anyone could program that well, it's him.
Using real life robots is going to be a huge bottleneck for training hours no matter what they do.
I was really excited when I heard Carmack was focusing on AI and am really looking forward to watching this when the video is up - but just from looking at the slides it seems like he tried to build a system that can play Atari games? Seems like a fun project, but I'm curious what will come out of it, or whether there's an associated paper being released.
Why would AGI choose to be embodied? We talk about creating a superior intelligence and having it drive our cars and clean our homes. The scenario in Dan Simmons' Hyperion seems much more plausible: we invent AGI and it disappears into the cloud and largely ignores us.
I still don’t think we have a clear enough idea of what a concept is to be able to think about AGI. And then being able to use concepts from one area to translate into another area, what is the process by which the brain combines and abstracts ideas into something new?
Another thought experiment - if OpenAI AGI was right around the corner, why are they wasting time/money/energy buying a product-less vanity hardware startup run by Ive?
Why not tackle robotics if anything.
Or really just be the best AGI and everyone will be knocking on your door to license it in their hardware/software stacks, you will print infinite money.
> Fundamentally, I believe in the importance of learning from a stream of interactive experience, as humans and animals do, which is quite different from the throw-everything-in-a-blender approach of pretraining an LLM. The blender approach can still be world-changingly valuable, but there are plenty of people advancing the state of the art there.
It's a shame that the pretrained approach leads to such a good-enough result. The learning-from-experience approach, what should be the "right" one, will stagnate. I might be wrong, but it seems that aside from Carmack and a small team, the world is just not looking at or investing in that side of AI anymore.
However, I find it funny that Carmack is now researching such an approach. At the end of the day, he was the one who popularized portal rendering, a technique that circumvents the need to render the whole 3D world, making 3D games computationally feasible.
As a side note, I wonder what models are to come once we see state-of-the-art AI video training technologies in sync with the joystick movements of a real player. Maybe the results will be so astonishing that even Carmack changes his mind on the subject.
I'm surprised there's so much interest in the structure/behavior of the biological brain, and so little in the behavior of our vision system. Our brains are not CPUs, and our eyes are definitely not a grid of pixels with a fixed framerate.
Bro went his whole career and somehow managed to create a gig for himself where he gets the AI money while playing Atari. It's hard to increase your respect for someone you've already maxed out on, but there we go. Carmack is a cool guy.
Quite exciting. Without diminishing the amazing value of LLMs, I don't think that path goes all the way to AGI. No idea if Carmack has the answer, but some good things will come out of that small research group, for sure.
I'm with the OpenAI folks on this one: Atari just won't cut it for AGI. My layman intuition is that RL works well when rewards give a good signal all the time. When they don't, RL is basically random search. That's where massive data diversity like we have in text comes in handy.
In a game there might be a level with a door and a key, and because there's no reward for getting the key closer to the door, bridging this gap requires random search in a massive state space. But in the vast sea of scenarios you can find in Common Crawl, there's probably one where you are one step from the key, and the key is one step from the door, so you get the reward signal without having to search an enormous state space.
You might say, "but you have to search through the giant Common Crawl". Well yes, but while doing so you will get a reward signal not just for the key-and-door problem but for nearly every problem in the world.
The point is: pretraining teaches models to extract signal that can be used to explore solutions to hard search problems, and if you don't do that you are wasting your time enumerating giant state spaces.
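The key-and-door intuition can be sketched in a toy example. Everything below is my own illustrative construction, not anything from the talk: a random policy on a tiny corridor almost never sees a sparse end-of-episode reward, while a shaped reward (a small bonus for carrying the key toward the door) leaks signal into nearly every episode.

```python
import random

def run_episode(shaped, size=10, max_steps=200, seed=None):
    """Random walk on a 1-D corridor: agent starts at 0, key at size//2,
    door at size-1. Sparse reward: +1 only when the key reaches the door.
    Shaped reward (illustrative): small bonus while carrying the key,
    growing as the key gets closer to the door."""
    rng = random.Random(seed)
    agent, key, door = 0, size // 2, size - 1
    has_key = False
    total = 0.0
    for _ in range(max_steps):
        agent = max(0, min(size - 1, agent + rng.choice([-1, 1])))
        if agent == key:
            has_key = True
        if has_key:
            key = agent  # the agent carries the key with it
            if shaped:
                total += 0.01 / (1 + abs(door - key))  # dense signal
            if agent == door:
                return total + 1.0  # sparse success reward
    return total

# Shaping never changes the dynamics (same seeds, same trajectories),
# it only makes more episodes return a nonzero learning signal.
sparse = sum(run_episode(False, seed=s) > 0 for s in range(100))
shaped = sum(run_episode(True, seed=s) > 0 for s in range(100))
print(sparse, shaped)
```

With identical trajectories, any episode that succeeds under the sparse reward also scores under shaping, but shaping additionally rewards every episode that merely picks up the key - which is the grandparent's point about where dense signal has to come from if the environment doesn't provide it.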
I was really hoping to see something cool. Such a great team of smart people, but this is just John reverting to what he’s comfortable with and going off on some nonsense tangent. Hardware and video games. I doubt this is going to yield anything interesting at all.
I feel top level AI creation is beyond his skill set.
He’s an AAA software engineer, but the prerequisites for building cutting-edge AI involve deep formal math that is beyond his education and years at this point.
Nothing to stop him playing around with AI models though.
Honestly, having gone through the slides, it's a bit painful to see Carmack "rediscover" stuff I learned in a reinforcement learning lecture like ten years ago.
But don't get me wrong! Since this is a long-term research endeavor of his, I believe really starting from the basics is good for him and will empower him to bring something new to the table eventually.
I'm surprised, though, that he "only" got this far as of now. Maybe my slight idolization of Carmack made me kind of blind to the fact that this kind of research is a mean beast after all, and there's a reason huge research labs dump countless man-decades into this kind of stuff with no guaranteed breakthroughs.
A lot of the problems John mentioned (camera JPEG artifacts, latency, realtime decisions) have been worked on by comma.ai for many years. He could have just used their stack and built the general learning parts that comma isn't focusing on on top of it.
My own little insights and ramblings as an uninitiated quack (just spent the night asking Claude to explain machine learning to me):
It seems that we learn in layers, one of the first layers being a 2D neural net (images), augmented by other sensory data to create a 3D, if not 4D, model (neural net).
HRTFs for sound increase the spatial data we get from images.
With depth coming from sound, light, and learnt movements (touch), we seem to develop a notion of space and time (multimodality?).
It seems we can take low-dimensional inputs and correlate them to form higher-dimensional structures.
Of course, physically it comes from noticing the attenuation of visual data (focus, for example) and memorized audio data (sound frequency and amplitude, early reflections, the Doppler effect, etc.).
That should be emergent from training.
Those data sources can be imperfectly correlated. That's why we count during a lightning storm to estimate distance. It's low-dimensional.
In a sense, it's a measure of required effort perhaps (distance to somewhere).
What's funny is that this seems to go the other way from traditional training, where we move from higher-dimensional tensor spaces to lower ones. At least as a first step.
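The lightning-counting trick mentioned above is just sound's travel time; since light arrives effectively instantly, the flash-to-thunder delay times the speed of sound gives distance. A quick sketch (the constant assumes dry air at roughly 20 °C, and the function name is mine):

```python
SPEED_OF_SOUND_M_S = 343.0  # dry air at ~20 °C

def storm_distance_m(seconds_after_flash: float) -> float:
    """Light arrives (effectively) instantly; the delay is all sound travel time."""
    return seconds_after_flash * SPEED_OF_SOUND_M_S

# A three-second count puts the strike about a kilometre away.
print(round(storm_distance_m(3.0)))  # → 1029
```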
John Carmack talk at Upper Bound 2025
(twitter.com) | 399 points by tosh 13 hours ago | 262 comments
Comments
https://docs.google.com/presentation/d/1GmGe9ref1nxEX_ekDuJX...
https://docs.google.com/document/d/1-Fqc6R6FdngRlxe9gi49PRvU...
TIL JC has elite reflexes
I would argue that if he wants to do AGI through RL, an LLM could be a perfect teacher or oracle.
After all, as a human I'm not walking around without guidance. Leveraging this should, or at least could, make RL a lot faster.
My logical/RL part does need the 'database'/fact part, and my facts try to be as logical as possible, but they just aren't.
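One way to read the teacher-or-oracle idea: mix an oracle's action hints into the policy some fraction of the time, so exploration isn't purely random. This is my own hypothetical sketch; `ask_oracle` is a stand-in for an LLM call, not any real API, and the toy world is a 1-D line.

```python
import random

def ask_oracle(state):
    """Hypothetical stand-in for an LLM suggesting an action.
    Here it just points toward the goal in a 1-D toy world."""
    pos, goal = state
    return +1 if pos < goal else -1

def act(state, policy_action, oracle_weight=0.5, rng=random):
    """Follow the learned policy, but defer to the oracle's hint some of the time."""
    if rng.random() < oracle_weight:
        return ask_oracle(state)
    return policy_action

# With a strong oracle, even an untrained (random) policy drifts toward
# the goal instead of doing a pure random walk.
rng = random.Random(0)
pos, goal = 0, 10
for _ in range(100):
    pos += act((pos, goal), rng.choice([-1, 1]), oracle_weight=0.5, rng=rng)
print(pos)
```

The interesting design question is how to decay `oracle_weight` as the learned policy starts outperforming the teacher, which is roughly the student-surpasses-teacher problem any distillation-style setup faces.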