Simultaneous Tactile-Visual Perception for Learning Multimodal Robot Manipulation

Yuyang Li1,2,3,4⚖️, Yinghan Chen1,2,4,6⚖️, Zihang Zhao1,2,4, Puhao Li3,4,
Tengyu Liu3,4✉️, Siyuan Huang3,4✉️, Yixin Zhu1,2,4,5✉️
⚖️ Equal contributors    ✉️ Corresponding authors
1Peking University 2Beijing Key Lab of Behavior and Mental Health, Peking University
3Beijing Institute for General Artificial Intelligence 4State Key Lab of General Artificial Intelligence
5PKU-Wuhan Institute for Artificial Intelligence 6University of Cambridge

Robotic manipulation requires both rich multimodal perception and effective learning frameworks to handle complex real-world tasks. See-Through-Skin (STS) sensors, which combine tactile and visual perception, offer promising sensing capabilities, while modern imitation learning provides powerful tools for policy acquisition. However, existing STS designs lack simultaneous multimodal perception and suffer from unreliable tactile tracking. Furthermore, integrating these rich multimodal signals into learning-based manipulation pipelines remains an open challenge. We introduce TacThru, an STS sensor enabling simultaneous visual perception and robust tactile signal extraction, and TacThru-UMI, an imitation learning framework that leverages these multimodal signals for manipulation. Our sensor features a fully transparent elastomer, persistent illumination, novel keyline markers, and efficient tracking, while our learning system integrates these signals through a Transformer-based Diffusion Policy. Experiments on five challenging real-world tasks show that TacThru-UMI achieves an average success rate of 85.5%, significantly outperforming the baselines of alternating tactile-visual (66.3%) and vision-only (55.4%). The system excels in critical scenarios, including contact detection with thin and soft objects and precision manipulation requiring multimodal coordination. This work demonstrates that combining simultaneous multimodal perception with modern learning frameworks enables more precise, adaptable robotic manipulation.
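As a rough illustration of how simultaneous tactile and visual signals could feed a Transformer-based Diffusion Policy, the sketch below encodes a wrist-camera feature and a tactile marker-deviation feature into per-modality tokens that condition a transformer denoiser. This is a minimal sketch under stated assumptions, not the released TacThru-UMI implementation: the module names, feature dimensions, and token-level fusion are illustrative, and the diffusion timestep embedding is omitted for brevity.

```python
# Minimal sketch (assumed, not the paper's code) of multimodal conditioning
# for a Transformer-based diffusion policy over action chunks.
import torch
import torch.nn as nn

class MultimodalCondition(nn.Module):
    """Encode a wrist-camera feature and tactile marker-deviation feature
    into a per-modality token sequence for the diffusion denoiser."""
    def __init__(self, img_dim=512, tac_dim=128, d_model=256):
        super().__init__()
        # Placeholder encoders; a real system might use a pretrained vision
        # backbone and a small MLP over the marker-flow vectors.
        self.img_proj = nn.Linear(img_dim, d_model)
        self.tac_proj = nn.Linear(tac_dim, d_model)

    def forward(self, img_feat, tac_feat):
        # img_feat: (B, img_dim), tac_feat: (B, tac_dim)
        return torch.stack(
            [self.img_proj(img_feat), self.tac_proj(tac_feat)], dim=1
        )  # (B, 2, d_model): one token per modality

class DenoiserSketch(nn.Module):
    """Transformer denoiser that predicts the noise added to an action
    chunk, conditioned on the fused observation tokens.
    (Diffusion timestep embedding omitted for brevity.)"""
    def __init__(self, action_dim=7, d_model=256):
        super().__init__()
        self.act_in = nn.Linear(action_dim, d_model)
        self.act_out = nn.Linear(d_model, action_dim)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, noisy_actions, cond_tokens):
        # noisy_actions: (B, horizon, action_dim); cond_tokens: (B, T, d_model)
        x = torch.cat([cond_tokens, self.act_in(noisy_actions)], dim=1)
        x = self.encoder(x)
        return self.act_out(x[:, cond_tokens.shape[1]:])  # predicted noise
```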

Contributions:
  • TacThru: a novel STS sensor that enables efficient, robust, simultaneous tactile-visual perception.
  • TacThru-UMI: an imitation learning system built on a UMI-compatible design for data collection, processing, and policy deployment.
  • A comprehensive experimental validation demonstrating how TacThru's simultaneous multimodal perception enables superior fine-grained and contact-rich manipulation.

The TacThru Sensor


Secret Recipes for Efficient and Robust Marker Tracking

(i) A fully transparent elastomer that enables clear visual perception.
(ii) Persistent illumination that eliminates mode switching.
(iii) Novel keyline markers that maintain visibility against any background.
(iv) An efficient tracking algorithm processing marker deviations at 6.08 ms per frame.
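To make the marker-deviation signal concrete, here is a minimal tracking sketch: adaptively threshold the keyline markers (dark blobs ringed by white, so they stay detectable over any background seen through the transparent elastomer), extract centroids, and match them to an undeformed reference layout by nearest neighbor. The specific thresholding, filtering, and matching choices are assumptions for illustration and will not reproduce the paper's algorithm or the reported 6.08 ms timing exactly.

```python
# Minimal sketch (assumed, not the paper's exact algorithm) of
# reference-frame marker-deviation tracking for an STS sensor image.
import cv2
import numpy as np
from scipy.spatial import cKDTree

def detect_markers(gray):
    """Return an (N, 2) array of marker centroids in pixel coordinates."""
    # Local-contrast threshold keeps the dark inner markers visible even
    # when the scene behind the transparent elastomer changes.
    binary = cv2.adaptiveThreshold(
        gray, 255, cv2.ADAPTIVE_THRESH_MEAN_C, cv2.THRESH_BINARY_INV, 21, 10)
    n, _, stats, centroids = cv2.connectedComponentsWithStats(binary)
    # Drop the background component and blobs that are too small or large.
    areas = stats[1:, cv2.CC_STAT_AREA]
    keep = (areas > 5) & (areas < 500)
    return centroids[1:][keep]

def marker_deviations(reference, current, max_dist=15.0):
    """Match detected markers to the undeformed reference layout and
    return per-marker displacement vectors (the tactile signal)."""
    tree = cKDTree(current)
    dist, idx = tree.query(reference, k=1)
    matched = dist < max_dist
    flow = np.zeros_like(reference)
    flow[matched] = current[idx[matched]] - reference[matched]
    return flow  # (M, 2) deviations; zeros where no match was found
```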

Fabrication Details of TacThru / TacThru-UMI

(a) Keyline marker fabrication: The keyline marker elastomer is fabricated by sequentially spraying inner (black) and outer (white) markers onto the transparent elastomer using laser-cut masks.
(b) Sensor design: The keyline marker elastomer is mounted on top of a VBTS sensor body, which houses the LEDs for illumination and the camera module for image capture.
(c) Integration into TacThru-UMI: The TacThru-UMI platform includes a robot end-effector (left) and a data collector (right) that share identical body and finger designs.

Manipulation with TacThru-UMI

Tasks
All videos are played at 4x speed.

Pick Bottle

Goal: Pick up the bottle and put it into a bowl
Validates TacThru-UMI's effectiveness in basic imitation learning and real-world execution.

Pull Tissue

Goal: Grasp a tissue and pull it out
Evaluates TacThru's visual perception capability for handling thin and soft objects where tactile feedback is unreliable.

Sort Bolt

Goal: Grasp a bolt and sort it into the corresponding bowl
Assesses TacThru's capability to distinguish object shape and color through STS perception.

Hang Scissors

Goal: Grasp a pair of scissors and hang it on the hook
Evaluates whether tactile feedback can reliably distinguish task completion from missed attempts.

Insert Cap

Goal: Pick up the bottle cap and insert it onto a mount
Assesses TacThru's ability to perform visual servoing when the target is visible and to fall back on tactile guidance under occlusion.

Results

TT-M (TacThru with markers) achieves the highest overall success rate (85.5%).

Each task targets a specific sensing capability: PickBottle (basic manipulation), PullTissue (thin-and-soft object manipulation), SortBolt (visual discrimination), HangScissors (tactile discrimination), and InsertCap (multimodal fusion).

Error bars show the standard deviation across evaluation trials, and the rightmost column presents the overall performance average.

Acknowledgement

We thank Lei Yan (LeapZenith AI Research), Shengyu Guo (PKU), Yu Liu (THU), Leiyao Cui (PKU), and Zhen Chen (BIGAI) for their assistance. This work is supported in part by the National Science and Technology Innovation 2030 Major Program (2025ZD0219400), the National Natural Science Foundation of China (62376009), the Beijing Nova Program, the State Key Lab of General AI at Peking University, the PKU-BingJi Joint Laboratory for Artificial Intelligence, and the National Comprehensive Experimental Base for Governance of Intelligent Society, Wuhan East Lake High-Tech Development Zone.