Pipeline (played at 2x speed)
Abstract
Robotic chemists promise to both liberate human experts from repetitive tasks and accelerate scientific discovery, yet remain in their infancy. Chemical experiments involve long-horizon procedures over hazardous and deformable substances, where success requires not only task completion but also strict compliance with experimental norms. To address these challenges, we propose RoboChemist, a dual-loop framework that integrates Vision-Language Models (VLMs) with Vision-Language-Action (VLA) models. Unlike prior VLM-based systems (e.g., VoxPoser, ReKep) that rely on depth perception and struggle with transparent labware, and existing VLA systems (e.g., RDT, π0) that lack semantic-level feedback for complex tasks, our method leverages a VLM to serve as (1) a planner to decompose tasks into primitive actions, (2) a visual prompt generator to guide VLA models, and (3) a monitor to assess task success and regulatory compliance. Notably, we introduce a VLA interface that accepts image-based visual targets from the VLM, enabling precise, goal-conditioned control. Our system successfully executes both primitive actions and complete multi-step chemistry protocols. Results show a 23.57% higher success rate and a 0.298 increase in compliance rate over state-of-the-art VLA baselines, while also demonstrating strong generalization to objects and tasks.

(a) Overview of RoboChemist. The VLM in our system acts as the planner, decomposing high-level tasks into subtasks. Based on each subtask, the VLM generates prompted images through visual prompting and provides them, along with other relevant information, to the VLA models. The VLM also functions as the monitor, assessing the completion status of subtasks, thus ensuring a complete feedback loop in the system.
(b) RoboChemist outperforms baselines in both primitive tasks and complete chemical experiment tasks.
(c) Some tasks performed by RoboChemist.
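The planner / visual-prompt / monitor roles described in (a) can be read as a nested control loop. The following is a minimal Python sketch of that reading, not the RoboChemist implementation: every name in it (`Subtask`, `RunLog`, `run_protocol`, and the `plan`, `observe`, `make_visual_prompt`, `act`, `check` callables) is hypothetical, and the retry-on-failure policy is an assumption, since the page only states that the monitor closes the feedback loop.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class Subtask:
    """Hypothetical container for one primitive action, e.g. 'stir liquid'."""
    instruction: str


@dataclass
class RunLog:
    """Hypothetical record of what happened during one protocol run."""
    completed: List[str] = field(default_factory=list)
    attempts: int = 0
    success: bool = False


def run_protocol(
    task: str,
    plan: Callable[[str], List[Subtask]],               # VLM as planner
    observe: Callable[[], Any],                         # camera image source
    make_visual_prompt: Callable[[Subtask, Any], Any],  # VLM as prompt generator
    act: Callable[[Subtask, Any], None],                # VLA policy rollout
    check: Callable[[Subtask, Any], bool],              # VLM as monitor
    max_retries: int = 2,                               # assumed retry budget
) -> RunLog:
    log = RunLog()
    # Outer loop: walk through the subtasks produced by the planner.
    for subtask in plan(task):
        # Inner loop: prompt, act, and monitor until the subtask passes
        # or the retry budget is exhausted.
        for _ in range(max_retries + 1):
            prompted_image = make_visual_prompt(subtask, observe())
            act(subtask, prompted_image)
            log.attempts += 1
            if check(subtask, observe()):
                log.completed.append(subtask.instruction)
                break
        else:
            # The monitor never accepted this subtask: stop the protocol.
            return log
    log.success = True
    return log
```

In this sketch the VLA only ever sees the subtask instruction and the prompted image, which mirrors the image-based interface described in the abstract; how the real system handles repeated failures or compliance violations is not specified here.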
Primitive Tasks
In this section, we present videos of the seven primitive tasks used to build the complete tasks, together with their corresponding visual prompt examples. The videos are played at 1x speed.
"Grasp the Glass Rod"

Prompted Image
Video
"Heat Platinum Wire"

Prompted Image
Video
"Insert Into Solution"

Prompted Image
Video
"Pour Liquid"

Prompted Image
Video
"Stir Liquid"

Prompted Image
Video
"Transfer Solid"

Prompted Image
Video
"Press Button"

Prompted Image
Video
Complete Tasks
In this section, we present videos of several complete tasks that our RoboChemist can perform. The videos are played at 1x speed.
Mixing NaCl and CuSO\(_4\) Solutions
Thermal Decomposition of Cu(OH)\(_2\)
Flame Test of CuSO\(_4\) Solution
Evaporation of NaCl Solution
Generalization
In this section, we present videos of several generalization tasks that our RoboChemist can perform. The videos are played at 1x speed.
Primitive Task Generalization
Place Glass Rod
Grasp Test Tube
Stir Solid Reagents
Heat Test Tube
Insert a Thermometer
Place Test Tube into Cooling Liquid
Complete Task Generalization
Combination Reaction: CaO+H\(_2\)O
Decomposition Reaction: H\(_2\)O\(_2\)
Displacement Reaction: Fe+CuSO\(_4\)
Displacement Reaction: Zn+HCl
Double Displacement Reaction: NaOH+CuSO\(_4\)
Double Displacement Reaction: NaHCO\(_3\)+HCl
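For reference, the six reaction types listed above correspond to the following balanced equations. These are standard textbook stoichiometries; the reagent amounts, concentrations, and conditions used in the videos are not stated on this page.

\[
\begin{aligned}
\mathrm{CaO + H_2O} &\rightarrow \mathrm{Ca(OH)_2} \\
\mathrm{2\,H_2O_2} &\rightarrow \mathrm{2\,H_2O + O_2\uparrow} \\
\mathrm{Fe + CuSO_4} &\rightarrow \mathrm{FeSO_4 + Cu} \\
\mathrm{Zn + 2\,HCl} &\rightarrow \mathrm{ZnCl_2 + H_2\uparrow} \\
\mathrm{2\,NaOH + CuSO_4} &\rightarrow \mathrm{Cu(OH)_2\downarrow + Na_2SO_4} \\
\mathrm{NaHCO_3 + HCl} &\rightarrow \mathrm{NaCl + H_2O + CO_2\uparrow}
\end{aligned}
\]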
Citation
We kindly request that you cite our work if you utilize the code or reference our findings in your research.
@inproceedings{zhang2025robochemist,
  title={RoboChemist: Vision-Language-Action Models for Robotic Chemistry},
  author={Zhang, Zongzheng and Yue, Chenghao and Xu, Haobo and Liao, Minwen and Qi, Xianglin and Gao, Huan-ang and Wang, Ziwei and Zhao, Hao},
  booktitle={Conference on Robot Learning (CoRL)},
  year={2025}
}