It seems that end to end neural networks for robotics are really taking off. Can someone point me towards where to learn about these, what the state of the art architectures look like, etc? Do they just convert the video into a stream of tokens, run it through a transformer, and output a stream of tokens?
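For the publicly described VLA models (RT-2, OpenVLA), that is roughly the shape of it: images are patchified into embeddings, the instruction is tokenized, a transformer fuses them, and the output is a short sequence of discretized action tokens. A minimal toy sketch of that pattern (all sizes, names, and the pooling choice are my own illustrative assumptions, not any specific published architecture):

```python
# Toy sketch of the tokenized VLA pattern; sizes and structure are illustrative only.
import torch
import torch.nn as nn

class ToyVLAPolicy(nn.Module):
    def __init__(self, vocab_size=32000, action_dims=7, action_bins=256, d_model=512):
        super().__init__()
        # Images become a sequence of patch embeddings (a stand-in for a real ViT encoder).
        self.patchify = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        self.text_embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=6)
        # One discretized token per action dimension (e.g. 6-DoF end-effector delta + gripper).
        self.action_head = nn.Linear(d_model, action_dims * action_bins)
        self.action_dims, self.action_bins = action_dims, action_bins

    def forward(self, image, text_ids):
        patches = self.patchify(image).flatten(2).transpose(1, 2)   # (B, N_patches, d)
        words = self.text_embed(text_ids)                           # (B, T, d)
        feats = self.backbone(torch.cat([patches, words], dim=1))
        logits = self.action_head(feats[:, -1])                     # crude last-token pooling
        return logits.view(-1, self.action_dims, self.action_bins)

policy = ToyVLAPolicy()
logits = policy(torch.randn(1, 3, 224, 224), torch.randint(0, 32000, (1, 12)))
print(logits.shape)  # torch.Size([1, 7, 256]); argmax over the last dim gives binned actions
```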
I don't know, there have been so many overhyped and faked demos in the humanoid robotics space over the last couple of years that it is difficult to believe what is clearly a demo released for shareholders. Would love to see some demonstration in a less controlled environment.
I'm always wondering about the safety measures on these things. How much force is in those motors?
This is basically safety-critical stuff but with LLMs. Hallucinating wrong answers in text is bad, hallucinating that your chest is a drawer to pull open is very bad.
So, there's no way you can have fully actuated control of every finger joint with just 35 degrees of freedom. Which is very reasonable! Humans can't individually control each of our finger joints either. But I'm curious how their hand setups work, which parts are actuated and which are compliant. In the videos I'm not seeing any in-hand manipulation other than just grasping, releasing, and maintaining the orientation of the object relative to the hand, and I'm curious how much it can do / they plan to have it be able to do. Do they have any plans to try to mimic OpenAI's one-handed Rubik's Cube demo?
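On the hand question: a common answer is tendon or linkage coupling, where one actuator drives several finger joints through fixed ratios, so the extra joints exist mechanically but are not independently controllable. A toy illustration of that idea (the ratios are made up; I don't know how Figure's hands are actually coupled):

```python
# Toy illustration of underactuated fingers: one actuator value drives several
# coupled joints through fixed linkage/tendon ratios. Ratios are made up; this is
# not Figure's hand design.
import numpy as np

# Rows: fingers; columns: per-joint coupling ratios (e.g. MCP, PIP, DIP).
COUPLING = np.array([[1.0, 0.7, 0.5],
                     [1.0, 0.7, 0.5]])

def finger_joints(actuator_angles):
    """2 actuator commands -> 6 joint angles; the extra joints move, but not independently."""
    return COUPLING * np.asarray(actuator_angles)[:, None]

print(finger_joints([0.6, 0.2]))   # each finger curls as a unit from a single command
```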
Until we get robots with really good hands, something I'd love in the interim is a system that uses _me_ as the hands. When it's time to put groceries away, I don't want to have to think about how to organize everything. Just figure out which grocery items I have, what storage I have available, come up with an optimized organization solution, then tell me where to put things, one at a time. I'm cautiously optimistic this will be doable in the near term with a combination of AR and AI.
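The planning half of that already seems like the easy part: once the vision/AR side has produced the item and storage lists, the "tell me where to put things, one at a time" step is a small assignment problem. A hedged toy sketch, with made-up categories and capacities:

```python
# Toy sketch of the placement step: given item and storage lists (which the
# vision/AR side would have to produce), placement is a simple assignment problem.
ITEMS = [("milk", "cold"), ("frozen peas", "frozen"), ("cereal", "dry"), ("eggs", "cold")]
STORAGE = {"fridge": ("cold", 2), "freezer": ("frozen", 3), "pantry": ("dry", 10)}

def plan(items, storage):
    remaining = {name: cap for name, (_, cap) in storage.items()}
    for item, needs in items:
        for name, (kind, _) in storage.items():
            if kind == needs and remaining[name] > 0:
                remaining[name] -= 1
                yield f"Put the {item} in the {name}."
                break

for step in plan(ITEMS, STORAGE):
    print(step)   # one instruction at a time, which is all the AR overlay would need
```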
The demo is quite interesting but I am mostly intrigued by the claim that it is running totally local to each robot. It seems to use some agentic decision making but the article doesn't touch on that. What possible combo of model types are they stringing together? Or is this something novel?
The article mentions that the system in each robot uses two AI models.
"S2 is built on a 7B-parameter open-source, open-weight VLM pretrained on internet-scale data"
It feels like, although the article is quite openly technical, they are leaving out the secret sauce? So they use an open-source VLM to identify the objects on the counter, and another model to generate the mechanical motions of the robot.
What part of this system understands 3 dimensional space of that kitchen?
How does the robot closest to the refrigerator know to pass the cookies to the robot on the left?
How is this kind of speech to text, visual identification, decision making, motor control, multi-robot coordination and navigation of 3d space possible locally?
"Figure robots, each equipped with dual low-power-consumption embedded GPUs"
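On the "how is this possible locally" question, the article's own answer is the S2/S1 split: a slow 7B VLM refreshes a latent command a few times per second, while a much smaller, high-rate visuomotor policy consumes the latest latent plus proprioception on every control tick, which would map naturally onto the dual embedded GPUs quoted above. A rough sketch of what that dual-rate loop could look like (rates, shapes, and the model/IO stubs are illustrative assumptions, not Figure's implementation):

```python
# Rough sketch of a dual-rate control loop, assuming a slow VLM ("S2") that writes
# a latent command and a fast policy ("S1") that reads it each tick. Rates, sizes,
# and the model/IO stubs are illustrative, not Figure's code.
import threading
import time
import numpy as np

latent = np.zeros(512)            # shared conditioning vector: written by S2, read by S1
latent_lock = threading.Lock()

def s2_loop(vlm, get_image, get_prompt, hz=8):
    """Slow loop: a heavy VLM forward pass a few times per second."""
    global latent
    while True:
        z = vlm(get_image(), get_prompt())
        with latent_lock:
            latent = z
        time.sleep(1.0 / hz)

def s1_loop(policy, get_proprio, send_joint_targets, hz=200):
    """Fast loop: a small visuomotor policy emitting continuous joint targets."""
    while True:
        with latent_lock:
            z = latent.copy()
        send_joint_targets(policy(z, get_proprio()))
        time.sleep(1.0 / hz)

# Each loop would run in its own thread (or process/GPU), e.g.:
# threading.Thread(target=s2_loop, args=(vlm, cam.read, mic.transcript), daemon=True).start()
# threading.Thread(target=s1_loop, args=(policy, robot.proprio, robot.command), daemon=True).start()
```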
Is anyone skeptical? How much of this is possible vs a staged tech demo to raise funding? I know they claim there's no special coding, but did they practice this task? Special training? Even if this video is totally legit, I'm burned out by all the hype videos in general.
Are they claiming these robots are also silent? They seem to have "crinkle" sounds handling packaging, which, if added in post, seems needlessly smoke-and-mirrors for what was a very impressive demonstration (of robots impersonating an extremely stoned human).
Wonder what their vision stack is like. Is depth done via sensors or purely visually? And how about distance estimation of objects and inverse kinematics/proprioception? Anyway, it looks impressive.
Imo, the Terminator movies would have been scarier if they moved like these guys - slow, careful, deliberate and measured but unstoppable. There's something uncanny about this.
Does anyone know how long they have been at this? Is this mainly a reimplementation of the Physical Intelligence paper, plus the dual size/frequency setup and the cooperative part?
This whole thread is just people who didn’t read the technical details or immediately doubt the video’s honesty.
I’m actually fairly impressed with this because it’s one neural net which is the goal, and the two system paradigm is really cool. I don’t know much about robotics but this seems like the right direction.
Seriously, what's with all of these perceived "high-end" tech companies not doing static content worth a damn?
Stop hosting your videos as MP4s on your web server. Either publish to a CDN or use a platform like YouTube. Your bandwidth cannot handle serving high-resolution MP4s.
/rant
When doing robot control, how do you model the control of the robot? Do you have tool_use / function calling at the top-level model, which then gets turned into motion-control parameters via inverse kinematics controllers?
What is the interface from the top level to the motors?
I feel it can't just be a neural network all the way down, right?
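A common pattern (an assumption about this class of system, not something the article spells out for Helix) is that the learned policy never emits torques directly: it outputs joint-space or end-effector targets at tens or hundreds of Hz, and a classical loop, IK plus PD or impedance control at roughly 1 kHz, turns those into motor commands. A toy sketch of that layering:

```python
# Toy sketch of one plausible layering (an assumption, not Figure's documented stack):
# a learned policy emits joint-position targets at a modest rate, and a classical
# PD loop turns them into torques at a much higher rate.
import numpy as np

N_JOINTS = 7

def policy_step(observation):
    # Stand-in for the neural net: a real system would return joint targets or an
    # end-effector pose that an inverse-kinematics solver converts to joint targets.
    return np.zeros(N_JOINTS)

def pd_torque(q_target, q, qd, kp=80.0, kd=2.0):
    # Innermost loop: plain PD (or impedance) control, typically around 1 kHz.
    return kp * (q_target - q) - kd * qd

q = np.random.uniform(-0.1, 0.1, N_JOINTS)   # joint positions
qd = np.zeros(N_JOINTS)                      # joint velocities
dt = 0.001
q_target = policy_step(observation=None)     # the policy runs far less often than the PD loop
for _ in range(1000):                        # simulate 1 s of inner-loop control
    tau = pd_torque(q_target, q, qd)
    qd += tau * dt                           # toy unit-inertia dynamics
    q += qd * dt
print(np.round(q, 3))                        # joints head toward the commanded target
```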
"The first time you've seen these objects" is a weird thing to say. One presumes that this is already in their training set, and that these models aren't storing a huge amount of data in their context, so what does that even mean?
At this point, this is enough autonomy to have a set of these guys man a howitzer (read: old stockpiles of weapons we already have). Kind of a scary thought. On one hand, I think the idea of moving real people out of danger in war is a good idea, and as an American I'd want Americans to have an edge... and we can't guarantee our enemies won't take it if we skip it. On the other hand, I have a visceral reaction to machines killing people.
I think we're at an inflection point now where AI and robotics can be used in warfare, and we need to start having that conversation.
"Pick up anything: Figure robots equipped with Helix can now pick up virtually any small household object, including thousands of items they have never encountered before, simply by following natural language prompts."
If they can do that, why aren't they selling picking systems to Amazon by the tens of thousands?
I get the impression there’s a language model sending high level commands to a control model? I wonder when we can have one multimodal model that controls everything.
The latest models seem to be fluidly tied in with generating voice, even singing and laughing.
It seems like it would be possible to train a multimodal model that can do that with low-level actuator commands.
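One way people make a single multimodal model emit low-level commands is to discretize them into a small action vocabulary, so actuator setpoints become just another token stream alongside text and audio (this is how RT-2/OpenVLA-style models handle actions; the bin count and range below are illustrative assumptions):

```python
# Sketch of the "actions as tokens" idea: continuous actuator commands are binned
# into a small discrete vocabulary so one multimodal transformer can emit them the
# same way it emits text or audio tokens. Bin count and range are assumptions.
import numpy as np

N_BINS = 256
LOW, HIGH = -1.0, 1.0   # normalized actuator command range

def encode_action(cmd):
    """Continuous per-actuator commands -> integer action tokens."""
    clipped = np.clip(cmd, LOW, HIGH)
    return np.round((clipped - LOW) / (HIGH - LOW) * (N_BINS - 1)).astype(int)

def decode_action(tokens):
    """Integer action tokens -> continuous commands (with quantization error)."""
    return tokens / (N_BINS - 1) * (HIGH - LOW) + LOW

cmd = np.array([0.25, -0.8, 0.0])
tokens = encode_action(cmd)
print(tokens, decode_action(tokens))   # round-trips to within ~1/255 of the range
```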
It’s funny… there are a lot of comments here asking “why would anyone pay for this, when you could learn to do the thing, or organise your time/plans yourself.”
That’s how I feel about LLMs and code.
Why make such sinister-looking robots though...?
Does anyone know if this trained model would work on a different robot at all, or would it need retraining?