Discussion about this post

Wanangwa

Super interesting article, thank you for sharing it. Actually, I’m thinking of doing a PhD in robotics & AI but not too sure where I’d want to do it. I’m thinking this VLA field might be it!! I recently started my master’s in robotics and in the first week we were shown that exact diagram showing the flow of information between perception, path planning and control, and I remember thinking to myself, how can I be involved in all 3? I don’t wanna just pick one and specialise in that. This VLA work perfectly brings all of that together 👌🏿 It’s also new and looks to have lots of problems that are yet to be solved. Do you know of any research labs I could look into that are doing this work?

Neural Foundry

Solid breakdown on VLAs, especially the monolithic vs hierarchical distinction. The chef analogy worked well. What strikes me most is the embedding unification trick: turning vision, language, and actions into a common vocabulary that transformers can process end-to-end. I've seen this same idea in multimodal LLMs, but extending it to robotic control feels like crossing a different threshold entirely. The tokenization of continuous motor commands still seems like the trickiest part to me; I wonder if we're gonna hit a ceiling with discretization or if diffusion-based approaches will scale better for contact-rich tasks.
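
To make the discretization point concrete, here is a minimal sketch of the kind of uniform binning commonly used to turn continuous motor commands into discrete tokens. The bin count, normalized action range, and the 7-DoF example command are illustrative assumptions, not values taken from any particular VLA system.

```python
import numpy as np

# Illustrative assumptions: 256 bins per action dimension, actions
# normalized to [-1, 1]. Real systems pick these per robot/dataset.
NUM_BINS = 256
ACTION_LOW, ACTION_HIGH = -1.0, 1.0

def tokenize_action(action: np.ndarray) -> np.ndarray:
    """Map each continuous action dimension to an integer bin index."""
    clipped = np.clip(action, ACTION_LOW, ACTION_HIGH)
    scaled = (clipped - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW)  # -> [0, 1]
    return np.minimum((scaled * NUM_BINS).astype(int), NUM_BINS - 1)

def detokenize_action(tokens: np.ndarray) -> np.ndarray:
    """Map bin indices back to bin-center continuous values."""
    centers = (tokens + 0.5) / NUM_BINS
    return centers * (ACTION_HIGH - ACTION_LOW) + ACTION_LOW

# Hypothetical 7-DoF command: 6D end-effector delta + gripper
action = np.array([0.12, -0.40, 0.05, 0.0, 0.3, -0.9, 1.0])
tokens = tokenize_action(action)
recovered = detokenize_action(tokens)
print(tokens)     # integer "action words" a transformer can predict
print(recovered)  # quantization error is the ceiling the comment worries about
```

The round-trip through `detokenize_action` shows the resolution limit of discretization: finer control than one bin width is lost, which is part of why continuous alternatives like diffusion-based action heads come up for contact-rich tasks.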
