Many robotic manipulation tasks require sensing and responding to force signals such as torque to assess whether the task has been successfully completed and to enable closed-loop control. However, current Vision-Language-Action (VLA) models lack the ability to integrate such subtle physical feedback. In this work, we explore torque-aware VLA models, aiming to bridge this gap by systematically studying the design space for incorporating torque signals into existing VLA architectures. We identify and evaluate several strategies, leading to three key findings. First, introducing torque adapters into the decoder consistently outperforms inserting them into the encoder, because torque signals align more closely with the decoder's input and the decoder is more sensitive to variations in its input. Second, torque history proves to be a critical signal. We find that the most effective way to incorporate it is by summarizing the entire history into a single token, as this preserves the original input pattern of the decoder. Third, inspired by joint prediction and planning paradigms in autonomous driving, we propose predicting torque as an auxiliary output, which further improves performance. This strategy encourages the model to build a physically grounded internal representation of interaction dynamics. Extensive quantitative and qualitative experiments across contact-rich manipulation benchmarks validate our findings. Code, models, and datasets will be released.
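To make the three findings concrete, the following is a minimal PyTorch sketch of how such components could be wired together: a torque-history adapter that compresses the history into a single decoder-side token, and an auxiliary torque-prediction head placed next to the action head. All class names, dimensions, and the decoder interface (TorqueHistoryAdapter, TorqueAwareDecoderHead, d_model, and so on) are illustrative assumptions and do not reflect the released implementation.

# Illustrative sketch only; module names, dimensions, and the decoder
# interface are assumptions, not the paper's released implementation.
import torch
import torch.nn as nn


class TorqueHistoryAdapter(nn.Module):
    """Summarizes a torque history of shape (B, T, num_joints) into one decoder token."""

    def __init__(self, num_joints: int = 7, d_model: int = 512):
        super().__init__()
        self.encoder = nn.GRU(num_joints, d_model, batch_first=True)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, torque_history: torch.Tensor) -> torch.Tensor:
        # Take the final GRU hidden state as a summary of the whole history.
        _, h_n = self.encoder(torque_history)          # h_n: (1, B, d_model)
        return self.proj(h_n[-1]).unsqueeze(1)         # (B, 1, d_model)


class TorqueAwareDecoderHead(nn.Module):
    """Prepends the torque token to the decoder input (decoder-side injection)
    and predicts torque as an auxiliary output alongside the actions."""

    def __init__(self, d_model: int = 512, action_dim: int = 7, num_joints: int = 7):
        super().__init__()
        self.torque_adapter = TorqueHistoryAdapter(num_joints, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.action_head = nn.Linear(d_model, action_dim)
        self.torque_head = nn.Linear(d_model, num_joints)   # auxiliary torque output

    def forward(self, action_queries, vision_language_memory, torque_history):
        torque_token = self.torque_adapter(torque_history)       # (B, 1, D)
        tgt = torch.cat([torque_token, action_queries], dim=1)   # inject at the decoder input
        hidden = self.decoder(tgt, vision_language_memory)
        actions = self.action_head(hidden[:, 1:])                # predicted actions
        future_torque = self.torque_head(hidden[:, 1:])          # auxiliary torque prediction
        return actions, future_torque


# Example shapes: batch of 2, 8 action queries, 64 vision-language tokens,
# and a 50-step torque history for a 7-DoF arm.
model = TorqueAwareDecoderHead()
actions, future_torque = model(
    torch.randn(2, 8, 512),
    torch.randn(2, 64, 512),
    torch.randn(2, 50, 7),
)

Under this sketch, a training step would combine the usual action loss with an auxiliary torque loss, for example L = L_action + λ · L_torque, so that the auxiliary target encourages a physically grounded internal representation of interaction dynamics.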
(a) Torque response of the 7-DoF arm during a charger-insertion task. Shaded gray regions mark periods of no contact, where torques remain nearly flat. The orange-tinted segment shows a failed insertion attempt—contact is made but the plug does not enter the socket, producing only small torque fluctuations. The green-tinted segment highlights a successful insertion, characterized by large, distinctive torque spikes as the plug seats fully. (b) Visualization of the 7-DoF robot arm, highlighting joint torque mappings. (c) Design space of torque-based features explored in this work, spanning current, historical, and future signals.
In this section, we present videos demonstrating torque variations in real-world scenarios. The videos show a robot arm performing tasks with varying joint torques. From left to right, we present three tasks: Charger Plugging, USB Plugging, and Button Pushing. For each task, the first video provides a top-down view, the second offers a front view, and the third visualizes the torque variations. The three videos are time-synchronized and played at 1x speed.
Charger Plugging
USB Plugging
Button Pushing
In this section, we present five videos showcasing the performance of the torque-aware model in the following five contact-rich tasks: Button Pushing, Charger Plugging, USB Plugging, Door Opening, and Drawer Opening. These videos are played at 1x speed.
In this section, we present five videos showcasing the performance of the torque-aware model in the following five regular tasks: Bottle Pick and Place, Liquid Pouring, Stacking Cubes, Push-to-Position, and Opening a Drawer. These videos are played at 1x speed.
In this section, we present a video demonstrating cross-embodiment performance using the ROKAE SR robotic arm. The tasks include inserting a fast-charging connector and a slow-charging connector. The video is played at 1x speed.
Please cite our work if you use the code or reference our findings in your research.
@article{zhang2025elucidating,
  title={Elucidating the Design Space of Torque-aware Vision-Language-Action Models},
  author={Zhang, Zongzheng and Xu, Haobo and Yang, Zhuo and Yue, Chenghao and Lin, Zehao and Gao, Huan-ang and Wang, Ziwei and Zhao, Hao},
  journal={arXiv preprint arXiv:2509.07962},
  year={2025}
}