It seems that end to end neural networks for robotics are really taking off. Can someone point me towards where to learn about these, what the state of the art architectures look like, etc? Do they just convert the video into a stream of tokens, run it through a transformer, and output a stream of tokens?
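For the publicly described VLA models (RT-2, OpenVLA), that is roughly the shape of it: images are patchified into embeddings, the instruction is tokenized, a transformer fuses them, and the output is a short sequence of discretized action tokens. A minimal toy sketch of that pattern (all sizes, names, and the pooling choice are my own illustrative assumptions, not any specific published architecture):

```python
# Toy sketch of the tokenized VLA pattern; sizes and structure are illustrative only.
import torch
import torch.nn as nn

class ToyVLAPolicy(nn.Module):
    def __init__(self, vocab_size=32000, action_dims=7, action_bins=256, d_model=512):
        super().__init__()
        # Images become a sequence of patch embeddings (a stand-in for a real ViT encoder).
        self.patchify = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        self.text_embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=6)
        # One discretized token per action dimension (e.g. 6-DoF end-effector delta + gripper).
        self.action_head = nn.Linear(d_model, action_dims * action_bins)
        self.action_dims, self.action_bins = action_dims, action_bins

    def forward(self, image, text_ids):
        patches = self.patchify(image).flatten(2).transpose(1, 2)   # (B, N_patches, d)
        words = self.text_embed(text_ids)                           # (B, T, d)
        feats = self.backbone(torch.cat([patches, words], dim=1))
        logits = self.action_head(feats[:, -1])                     # crude last-token pooling
        return logits.view(-1, self.action_dims, self.action_bins)

policy = ToyVLAPolicy()
logits = policy(torch.randn(1, 3, 224, 224), torch.randint(0, 32000, (1, 12)))
print(logits.shape)  # torch.Size([1, 7, 256]); argmax over the last dim gives binned actions
```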
I don't know, there have been so many overhyped and faked demos in the humanoid robotics space over the last couple of years that it is difficult to believe what is clearly a demo released for shareholders. Would love to see some demonstration in a less controlled environment.
I'm always wondering about the safety measures on these things. How much force is in those motors?
This is basically safety-critical stuff but with LLMs. Hallucinating wrong answers in text is bad, hallucinating that your chest is a drawer to pull open is very bad.
So, there's no way you can have fully actuated control of every finger joint with just 35 degrees of freedom. Which is very reasonable! Humans can't individually control each of our finger joints either. But I'm curious how their hand setups work, which parts are actuated and which are compliant. In the videos I'm not seeing any in-hand manipulation other than just grasping, releasing, and maintaining the orientation of the object relative to the hand, and I'm curious how much it can do / they plan to have it be able to do. Do they have any plans to try to mimic OpenAI's one-handed Rubik's Cube demo?
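On the hand question: a common answer is tendon or linkage coupling, where one actuator drives several finger joints through fixed ratios, so the extra joints exist mechanically but are not independently controllable. A toy illustration of that idea (the ratios are made up; I don't know how Figure's hands are actually coupled):

```python
# Toy illustration of underactuated fingers: one actuator value drives several
# coupled joints through fixed linkage/tendon ratios. Ratios are made up; this is
# not Figure's hand design.
import numpy as np

# Rows: fingers; columns: per-joint coupling ratios (e.g. MCP, PIP, DIP).
COUPLING = np.array([[1.0, 0.7, 0.5],
                     [1.0, 0.7, 0.5]])

def finger_joints(actuator_angles):
    """2 actuator commands -> 6 joint angles; the extra joints move, but not independently."""
    return COUPLING * np.asarray(actuator_angles)[:, None]

print(finger_joints([0.6, 0.2]))   # each finger curls as a unit from a single command
```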
Until we get robots with really good hands, something I'd love in the interim is a system that uses _me_ as the hands. When it's time to put groceries away, I don't want to have to think about how to organize everything. Just figure out which grocery items I have, what storage I have available, come up with an optimized organization solution, then tell me where to put things, one at a time. I'm cautiously optimistic this will be doable in the near term with a combination of AR and AI.
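The planning half of that already seems like the easy part: once the vision/AR side has produced the item and storage lists, the "tell me where to put things, one at a time" step is a small assignment problem. A hedged toy sketch, with made-up categories and capacities:

```python
# Toy sketch of the placement step: given item and storage lists (which the
# vision/AR side would have to produce), placement is a simple assignment problem.
ITEMS = [("milk", "cold"), ("frozen peas", "frozen"), ("cereal", "dry"), ("eggs", "cold")]
STORAGE = {"fridge": ("cold", 2), "freezer": ("frozen", 3), "pantry": ("dry", 10)}

def plan(items, storage):
    remaining = {name: cap for name, (_, cap) in storage.items()}
    for item, needs in items:
        for name, (kind, _) in storage.items():
            if kind == needs and remaining[name] > 0:
                remaining[name] -= 1
                yield f"Put the {item} in the {name}."
                break

for step in plan(ITEMS, STORAGE):
    print(step)   # one instruction at a time, which is all the AR overlay would need
```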
The demo is quite interesting but I am mostly intrigued by the claim that it is running totally local to each robot. It seems to use some agentic decision making but the article doesn't touch on that. What possible combo of model types are they stringing together? Or is this something novel?
The article mentions that the system in each robot uses two AI models.
"S2 is built on a 7B-parameter open-source, open-weight VLM pretrained on internet-scale data"
It feels like, although the article is quite openly technical, they are leaving out the secret sauce? So they use an open-source VLM to identify the objects on the counter, and another model to generate the mechanical motions of the robot.
What part of this system understands 3 dimensional space of that kitchen?
How does the robot closest to the refrigerator know to pass the cookies to the robot on the left?
How is this kind of speech to text, visual identification, decision making, motor control, multi-robot coordination and navigation of 3d space possible locally?
"Figure robots, each equipped with dual low-power-consumption embedded GPUs"
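On the "how is this possible locally" question, the article's own answer is the S2/S1 split: a slow 7B VLM refreshes a latent command a few times per second, while a much smaller, high-rate visuomotor policy consumes the latest latent plus proprioception on every control tick, which would map naturally onto the dual embedded GPUs quoted above. A rough sketch of what that dual-rate loop could look like (rates, shapes, and the model/IO stubs are illustrative assumptions, not Figure's implementation):

```python
# Rough sketch of a dual-rate control loop, assuming a slow VLM ("S2") that writes
# a latent command and a fast policy ("S1") that reads it each tick. Rates, sizes,
# and the model/IO stubs are illustrative, not Figure's code.
import threading
import time
import numpy as np

latent = np.zeros(512)            # shared conditioning vector: written by S2, read by S1
latent_lock = threading.Lock()

def s2_loop(vlm, get_image, get_prompt, hz=8):
    """Slow loop: a heavy VLM forward pass a few times per second."""
    global latent
    while True:
        z = vlm(get_image(), get_prompt())
        with latent_lock:
            latent = z
        time.sleep(1.0 / hz)

def s1_loop(policy, get_proprio, send_joint_targets, hz=200):
    """Fast loop: a small visuomotor policy emitting continuous joint targets."""
    while True:
        with latent_lock:
            z = latent.copy()
        send_joint_targets(policy(z, get_proprio()))
        time.sleep(1.0 / hz)

# Each loop would run in its own thread (or process/GPU), e.g.:
# threading.Thread(target=s2_loop, args=(vlm, cam.read, mic.transcript), daemon=True).start()
# threading.Thread(target=s1_loop, args=(policy, robot.proprio, robot.command), daemon=True).start()
```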
Is anyone skeptical? How much of this is possible vs a staged tech demo to raise funding? I know they claim there's no special coding, but did they practice this task? Special training? Even if this video is totally legit, I'm burned out by all the hype videos in general.
Are they claiming these robots are also silent? They seem to have "crinkle" sounds handling packaging, which, if added in post, seems needlessly smoke-and-mirrors for what was a very impressive demonstration (of robots impersonating an extremely stoned human).
Wonder what their vision stack is like. Is depth done via sensors or purely visually? And how about distance estimation of objects and inverse kinematics/proprioception? Anyway, it looks impressive.
Imo, the Terminator movies would have been scarier if they moved like these guys - slow, careful, deliberate and measured but unstoppable. There's something uncanny about this.
Does anyone know how long they have been at this? Is this mainly a reimplementation of the Physical Intelligence paper, plus the dual size/frequency setup and the cooperative part?
This whole thread is just people who didn’t read the technical details or immediately doubt the video’s honesty.
I’m actually fairly impressed with this because it’s one neural net which is the goal, and the two system paradigm is really cool. I don’t know much about robotics but this seems like the right direction.
Seriously, what's with all of these perceived "high-end" tech companies not doing static content worth a damn?
Stop hosting your videos as MP4s on your web server. Either publish to a CDN or use a platform like YouTube. Your bandwidth cannot handle serving high-resolution MP4s.
/rant
When doing robot control, how do you model the control of the robot? Do you have tool_use / function calling at the top-level model, which then gets turned into motion-control parameters via inverse kinematics controllers?
What is the interface from the top level to the motors?
I feel it can't just be a neural network all the way down, right?
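A common pattern (an assumption about this class of system, not something the article spells out for Helix) is that the learned policy never emits torques directly: it outputs joint-space or end-effector targets at tens or hundreds of Hz, and a classical loop, IK plus PD or impedance control at roughly 1 kHz, turns those into motor commands. A toy sketch of that layering:

```python
# Toy sketch of one plausible layering (an assumption, not Figure's documented stack):
# a learned policy emits joint-position targets at a modest rate, and a classical
# PD loop turns them into torques at a much higher rate.
import numpy as np

N_JOINTS = 7

def policy_step(observation):
    # Stand-in for the neural net: a real system would return joint targets or an
    # end-effector pose that an inverse-kinematics solver converts to joint targets.
    return np.zeros(N_JOINTS)

def pd_torque(q_target, q, qd, kp=80.0, kd=2.0):
    # Innermost loop: plain PD (or impedance) control, typically around 1 kHz.
    return kp * (q_target - q) - kd * qd

q = np.random.uniform(-0.1, 0.1, N_JOINTS)   # joint positions
qd = np.zeros(N_JOINTS)                      # joint velocities
dt = 0.001
q_target = policy_step(observation=None)     # the policy runs far less often than the PD loop
for _ in range(1000):                        # simulate 1 s of inner-loop control
    tau = pd_torque(q_target, q, qd)
    qd += tau * dt                           # toy unit-inertia dynamics
    q += qd * dt
print(np.round(q, 3))                        # joints head toward the commanded target
```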
"The first time you've seen these objects" is a weird thing to say. One presumes that this is already in their training set, and that these models aren't storing a huge amount of data in their context, so what does that even mean?
At this point, this is enough autonomy to have a set of these guys man a howitzer (read: old stockpiles of weapons we already have). Kind of a scary thought. On one hand, I think the idea of moving real people out of danger in war is a good idea, and as an American I'd want Americans to have an edge... and we can't guarantee our enemies won't take it if we skip it. On the other hand, I have a visceral reaction to machines killing people.
I think we're at an inflection point now where AI and robotics can be used in warfare, and we need to start having that conversation.
"Pick up anything: Figure robots equipped with Helix can now pick up virtually any small household object, including thousands of items they have never encountered before, simply by following natural language prompts."
If they can do that, why aren't they selling picking systems to Amazon by the tens of thousands?
I get the impression there’s a language model sending high level commands to a control model? I wonder when we can have one multimodal model that controls everything.
The latest models seem to be fluidly tied in with generating voice, even singing and laughing.
It seems like it would be possible to train a multimodal model that can do that with low-level actuator commands.
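One way people make a single multimodal model emit low-level commands is to discretize them into a small action vocabulary, so actuator setpoints become just another token stream alongside text and audio (this is how RT-2/OpenVLA-style models handle actions; the bin count and range below are illustrative assumptions):

```python
# Sketch of the "actions as tokens" idea: continuous actuator commands are binned
# into a small discrete vocabulary so one multimodal transformer can emit them the
# same way it emits text or audio tokens. Bin count and range are assumptions.
import numpy as np

N_BINS = 256
LOW, HIGH = -1.0, 1.0   # normalized actuator command range

def encode_action(cmd):
    """Continuous per-actuator commands -> integer action tokens."""
    clipped = np.clip(cmd, LOW, HIGH)
    return np.round((clipped - LOW) / (HIGH - LOW) * (N_BINS - 1)).astype(int)

def decode_action(tokens):
    """Integer action tokens -> continuous commands (with quantization error)."""
    return tokens / (N_BINS - 1) * (HIGH - LOW) + LOW

cmd = np.array([0.25, -0.8, 0.0])
tokens = encode_action(cmd)
print(tokens, decode_action(tokens))   # round-trips to within ~1/255 of the range
```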
It’s funny… there are a lot of comments here asking “why would anyone pay for this, when you could learn to do the thing, or organise your time/plans yourself.”
That’s how I feel about LLMs and code.
Why make such sinister-looking robots though...?
Does anyone know if this trained model would work on a different robot at all, or would it need retraining?