RoboChemist: Vision-Language-Action Models for Robotic Chemistry

CoRL 2025


Zongzheng Zhang*1,2, Chenghao Yue*1, Haobo Xu*1,
Minwen Liao1, Xianglin Qi1, Huan-ang Gao1, Ziwei Wang3, Hao Zhao^1,2

*Equal Contribution ^Corresponding Author
1Institute for AI Industry Research (AIR), Tsinghua University
2Beijing Academy of Artificial Intelligence (BAAI)
3Nanyang Technological University
Pipeline (played at 2x speed)

Abstract

Robotic chemists promise to both liberate human experts from repetitive tasks and accelerate scientific discovery, yet remain in their infancy. Chemical experiments involve long-horizon procedures over hazardous and deformable substances, where success requires not only task completion but also strict compliance with experimental norms. To address these challenges, we propose RoboChemist, a dual-loop framework that integrates Vision-Language Models (VLMs) with Vision-Language-Action (VLA) models. Unlike prior VLM-based systems (e.g., VoxPoser, ReKep) that rely on depth perception and struggle with transparent labware, and existing VLA systems (e.g., RDT, π0) that lack semantic-level feedback for complex tasks, our method leverages a VLM to serve as (1) a planner to decompose tasks into primitive actions, (2) a visual prompt generator to guide VLA models, and (3) a monitor to assess task success and regulatory compliance. Notably, we introduce a VLA interface that accepts image-based visual targets from the VLM, enabling precise, goal-conditioned control. Our system successfully executes both primitive actions and complete multi-step chemistry protocols. Results show a 23.57% higher success rate and a 0.298 increase in compliance rate over state-of-the-art VLA baselines, while also demonstrating strong generalization to novel objects and tasks.
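To make the dual-loop structure concrete, the sketch below wires a VLM planner/prompter/monitor around a goal-conditioned VLA policy. Every name in it (VLM, VLA, run_experiment, the camera.capture() interface) is a hypothetical placeholder rather than the released RoboChemist API; it only illustrates how the roles described above could fit together.

from typing import List, Tuple


class VLM:
    """Placeholder for the vision-language model (planner, visual prompter, monitor)."""

    def plan(self, task: str, image: bytes) -> List[str]:
        """Decompose a high-level experiment into primitive-action instructions."""
        raise NotImplementedError

    def visual_prompt(self, instruction: str, image: bytes) -> bytes:
        """Annotate the current image with the visual target (e.g., a grasp point or pour target)."""
        raise NotImplementedError

    def monitor(self, instruction: str, image: bytes) -> Tuple[bool, bool]:
        """Return (subtask completed, experimental norms respected)."""
        raise NotImplementedError


class VLA:
    """Placeholder for the goal-conditioned vision-language-action policy."""

    def execute(self, instruction: str, prompted_image: bytes) -> None:
        """Run low-level control conditioned on the instruction and the prompted goal image."""
        raise NotImplementedError


def run_experiment(task: str, vlm: VLM, vla: VLA, camera, max_retries: int = 2) -> bool:
    """Outer loop: plan once, then execute each subtask under monitor feedback."""
    subtasks = vlm.plan(task, camera.capture())
    for instruction in subtasks:
        for _ in range(max_retries + 1):
            frame = camera.capture()
            prompted = vlm.visual_prompt(instruction, frame)  # image-based visual target
            vla.execute(instruction, prompted)                # inner (VLA) control loop
            done, compliant = vlm.monitor(instruction, camera.capture())
            if done and compliant:
                break                                         # proceed to the next subtask
        else:
            return False                                      # subtask failed after all retries
    return True

In this reading, the VLM forms the outer loop (plan and verify at the subtask level) while the VLA forms the inner loop (continuous visuomotor control within each subtask), which is what makes the framework dual-loop.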

RoboChemist Teaser

(a) Overview of RoboChemist. The VLM in our system acts as the planner, decomposing high-level tasks into subtasks. Based on each subtask, the VLM generates prompted images through visual prompting and provides them, along with other relevant information, to the VLA models. The VLM also functions as the monitor, assessing the completion status of each subtask and thereby closing the feedback loop (a sketch of this monitoring query follows the caption).

(b) RoboChemist outperforms baselines in both primitive tasks and complete chemical experiment tasks.

(c) Some tasks performed by RoboChemist.
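The monitor role referenced in (a) can be realized as a single structured VLM query per subtask. The snippet below is a minimal sketch assuming an OpenAI-compatible vision chat endpoint; the model name, prompt wording, and JSON schema are illustrative assumptions, not the exact ones used in RoboChemist.

import base64
import json
from openai import OpenAI

client = OpenAI()

MONITOR_PROMPT = (
    "You are supervising a robotic chemistry experiment. "
    "The robot just attempted the subtask: '{subtask}'. "
    "Given the current camera image, answer in JSON as "
    '{{"success": true|false, "compliant": true|false, "reason": "..."}}, '
    "where 'compliant' means the experimental norms (e.g., no spills, "
    "a correct pouring angle) were respected."
)


def monitor_subtask(subtask: str, image_path: str) -> dict:
    """Ask the VLM whether the subtask succeeded and stayed within the experimental norms."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",  # any chat model with vision input could stand in here
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": MONITOR_PROMPT.format(subtask=subtask)},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

Because the check covers both success and compliance, a subtask that completes but violates the norms can still trigger a retry in the outer loop.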


Primitive Tasks

In this section, we present videos of the seven primitive tasks used to build the complete tasks, along with their corresponding visual prompt examples. Each task is shown with two example prompted images and a video played at 1x speed.


"Grasp the Glass Rod"
Prompted Image
Prompted Image
Video
"Heat Platium Wire"
Prompted Image
Prompted Image
Video
"Insert Into Solution"
Prompted Image
Prompted Image
Video
"Pour Liquid"
Prompted Image
Prompted Image
Video
"Stir Liquid"
Prompted Image
Prompted Image
Video
"Transfer Solid"
Prompted Image
Prompted Image
Video
"Press Button"
Prompted Image
Prompted Image
Video



Complete Tasks

In this section, we present videos of several complete tasks that our RoboChemist can perform. The videos are played at 1x speed.


Mixing NaCl and CuSO\(_4\) Solutions
Thermal Decomposition of Cu(OH)\(_2\)
Flame Test of CuSO\(_4\) Solution
Evaporation of NaCl Solution



Generalization

In this section, we present videos of several generalization tasks that our RoboChemist can perform. The videos are played at 1x speed.

Primitive Task Generalization

Place Glass Rod
Grasp Test Tube
Stir Solid Reagents
Heat Test Tube
Insert a Thermometer
Place Test Tube into Cooling Liquid

Complete Task Generalization

Combination Reaction: CaO+H\(_2\)O
Decomposition Reaction: H\(_2\)O\(_2\)
Displacement Reaction: Fe+CuSO\(_4\)
Displacement Reaction: Zn+HCl
Double Displacement Reaction: NaOH+CuSO\(_4\)
Double Displacement Reaction: NaHCO\(_3\)+HCl


Citation

We kindly request that you cite our work if you utilize the code or reference our findings in your research.

  @inproceedings{zhang2025robochemist,
    title={RoboChemist: Vision-Language-Action Models for Robotic Chemistry},
    author={Zhang, Zongzheng and Yue, Chenghao and Xu, Haobo and Liao, Minwen and Qi, Xianglin and Gao, Huan-ang and Wang, Ziwei and Zhao, Hao},
    booktitle={Conference on Robot Learning (CoRL)},
    year={2025}
  }