GeoVLA: Empowering 3D Representations in Vision-Language-Action Models

Tianjin University, Dexmal, Tsinghua University

Framework

Abstract

Vision-Language-Action (VLA) models have emerged as a promising approach for enabling robots to follow language instructions and predict corresponding actions. However, current VLA models mainly rely on 2D visual inputs, neglecting the rich geometric information in the 3D physical world, which limits their spatial awareness and adaptability.

In this paper, we present GeoVLA, a novel VLA framework that effectively integrates 3D information to advance robotic manipulation. It uses a vision-language model (VLM) to process images and language instructions, extracting fused vision-language embeddings. In parallel, it converts depth maps into point clouds and employs a customized point encoder, called the Point Embedding Network, to generate 3D geometric embeddings independently. The resulting embeddings are then concatenated and processed by our proposed spatial-aware action expert, called the 3D-enhanced Action Expert, which combines information from the different sensor modalities to produce precise action sequences.
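To make the geometric branch concrete, the snippet below sketches the depth-to-point-cloud conversion that feeds the Point Embedding Network. It is a minimal sketch assuming a standard pinhole camera model; the function name, intrinsics variables, and validity mask are illustrative assumptions, not the GeoVLA implementation.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth map (H, W), in meters, into an (N, 3) point cloud
    in the camera frame using a standard pinhole model (hypothetical helper)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel grid (x, y)
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]  # drop pixels with invalid (zero) depth
```

The resulting camera-frame points can then be tokenized by a point encoder and fused with the vision-language embeddings downstream.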

Through extensive experiments in both simulation and real-world environments, GeoVLA demonstrates superior performance and robustness. It achieves state-of-the-art results on the LIBERO and ManiSkill2 simulation benchmarks and shows remarkable robustness in real-world tasks requiring adaptability to height and size changes as well as viewpoint invariance.

In-domain Task

We present several in-domain tasks, including stacking blocks, inserting a circle, picking up a hairclip, hanging a cup, covering a Matryoshka doll, picking up a carrot, placing a basketball, and stacking cups.

Height Change

We train the put-basketball task on the fifth layer and evaluate the models on the third through sixth layers.

We observe that the position of the basketball is difficult to perceive accurately, which often leads to empty grasps. In out-of-distribution cases, 2D-based methods fail to localize the basketball in 3D space and instead tend to grasp along the projection line from the camera to the true position of the basketball.

at the third layer

at the fourth layer

at the fifth layer (training)

at the sixth layer

Size Scale

We train the cover-Matryoshka task at the base doll size and evaluate the models by scaling the doll size.

Larger dolls tend to shift the grasping action toward the rear, while smaller dolls increase the difficulty of the task.

much larger than training

slightly larger than training

training

smaller than training

View Shift

We train the stack-block task from the base camera view and evaluate the models by shifting the camera view.

We incorporate a point cloud centered on the end-effector into our model; this representation remains invariant under changes in camera viewpoint.
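As a rough illustration of why an end-effector-centered point cloud is viewpoint-invariant, the sketch below re-expresses camera-frame points in the end-effector frame using calibrated camera extrinsics and the robot's forward kinematics. The function and matrix names are hypothetical and not taken from the GeoVLA codebase.

```python
import numpy as np

def center_on_end_effector(points_cam, T_world_cam, T_world_ee):
    """Re-express camera-frame points (N, 3) in the end-effector frame.

    T_world_cam: 4x4 camera-to-world transform (from extrinsic calibration).
    T_world_ee:  4x4 end-effector-to-world pose (from forward kinematics).
    The output coordinates no longer depend on where the camera is placed,
    as long as both transforms are accurate.
    """
    ones = np.ones((points_cam.shape[0], 1))
    points_h = np.concatenate([points_cam, ones], axis=1)       # homogeneous coords
    points_world = (T_world_cam @ points_h.T).T                 # camera -> world
    points_ee = (np.linalg.inv(T_world_ee) @ points_world.T).T  # world -> end-effector
    return points_ee[:, :3]
```

Because the same scene yields the same end-effector-frame coordinates regardless of camera placement, shifting the camera view leaves this input unchanged up to calibration error.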

training view

shifting 15 degrees

shifting 30 degrees

shifting 45 degrees

Background and Lighting Variation

We evaluate the models under changes in background and lighting.

varying background

varying lighting