What Objects Enable, Not What They Are: Functional Latent Spaces for Affordance Reasoning

Abstract

Existing robot planning systems rely on appearance-based reasoning, where visual observations are encoded into latent spaces organized around object appearances (e.g., recognizing a "cart" based on how it looks). However, planning requires reasoning about task-relevant functionalities of objects (e.g., whether an object is "movable"), which appearance-based latent spaces do not capture. As a result, existing approaches struggle to generalize to novel robot-object interactions.

We address this limited generalizability through affordance reasoning, enabling planning based on task-relevant object functionalities instead of appearance alone. We introduce A4Dance, which maps visual observations into a shared latent space structured around affordances (e.g., "movable"). By projecting visual observations into this functional latent space and measuring their proximity to affordances, A4Dance infers functionalities relevant to the observed object.

Furthermore, we introduce an affordance discovery mechanism that expands the latent space to handle unseen scenarios where existing affordances are insufficient. A4Dance uses proximity in the functional latent space to quantify uncertainty in affordance inference and triggers affordance discovery only when that uncertainty is high. We evaluate A4Dance across several planning tasks involving diverse and unseen affordances. A4Dance achieves 94% inference accuracy on existing affordances, outperforming state-of-the-art approaches by over 15 percentage points. It also raises inference accuracy on new affordances from 60% to 90% using less than 10% of the original training data, and enables 100x faster inference.

Framework Overview

Functional Latent Space for Affordance Reasoning. A4Dance maps visual observations and affordance descriptions into a shared functional latent space. For each affordance, we construct an affordance axis between an affordance and its antonym (e.g., movable ↔ fixed). Visual observations are projected onto these axes to infer task-relevant object functionalities. Projection proximity is calibrated into uncertainty estimates, enabling the system to identify when additional reasoning is required.
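The projection step can be illustrated with a minimal sketch. It assumes that observations and affordance descriptions are already embedded as vectors in the same space; the encoder itself is not shown. The function names, the sigmoid calibration, the `temperature` parameter, and the use of binary entropy as the uncertainty measure are all illustrative assumptions, not the calibration used by A4Dance.

```python
import numpy as np

def affordance_axis(pos_emb, neg_emb):
    """Build an axis from an antonym embedding (e.g., "fixed") to an
    affordance embedding (e.g., "movable").

    Returns (unit direction, midpoint, half-length of the axis).
    """
    pos = np.asarray(pos_emb, dtype=float)
    neg = np.asarray(neg_emb, dtype=float)
    diff = pos - neg
    length = np.linalg.norm(diff)
    if length == 0.0:
        raise ValueError("affordance and antonym embeddings must differ")
    return diff / length, (pos + neg) / 2.0, length / 2.0

def project(obs_emb, axis):
    """Signed position of an observation along the axis.

    +1 sits at the affordance, -1 at its antonym, 0 halfway between.
    """
    direction, midpoint, half_length = axis
    offset = np.asarray(obs_emb, dtype=float) - midpoint
    return float(np.dot(offset, direction) / half_length)

def calibrated_probability(score, temperature=0.5):
    """Hypothetical calibration: squash the projection score with a sigmoid."""
    return float(1.0 / (1.0 + np.exp(-score / temperature)))

def binary_entropy(p):
    """Uncertainty in bits: 0 when confident, 1 when p is 0.5."""
    p = float(np.clip(p, 1e-12, 1.0 - 1e-12))
    return float(-(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p)))

# Toy 2-D embeddings, chosen only to make the geometry easy to follow.
movable_axis = affordance_axis([1.0, 0.0], [-1.0, 0.0])
cart_score = project([0.8, 0.3], movable_axis)
cart_prob = calibrated_probability(cart_score)
cart_uncertainty = binary_entropy(cart_prob)
```

In this sketch an observation that projects near the midpoint of an axis yields a probability near 0.5 and therefore high entropy, which is the kind of signal that could flag the need for additional reasoning.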

A4Dance generation and discovery pipeline

Uncertainty-Aware Affordance Discovery. Given a task and visual observation, A4Dance first generates candidate affordances relevant to the current planning problem. Affordance inference is performed in the functional latent space, while calibrated uncertainty determines whether existing affordances are sufficient. When uncertainty is high or a new functionality is required, a vision-language model proposes and labels new affordances, which are incorporated through offline learning and added to the affordance memory for future deployment.
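The gating logic described above can be sketched as follows. This is a simplified illustration under stated assumptions: `AffordanceMemory`, `route`, `discover`, `finish_offline_learning`, the entropy threshold `max_entropy`, and the `propose_and_label` callback (a stand-in for the vision-language model) are hypothetical names, and the offline learning step is reduced to moving entries from a pending queue into memory.

```python
import math
from dataclasses import dataclass, field

@dataclass
class AffordanceMemory:
    known: set = field(default_factory=set)      # affordances usable at deployment
    pending: list = field(default_factory=list)  # (name, label) awaiting offline learning

def entropy(p):
    """Binary entropy in bits, used here as the uncertainty measure."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def route(candidates, probabilities, memory, max_entropy=0.8):
    """Split candidate affordances into resolved answers and discovery requests.

    A candidate goes to discovery if it is not in memory yet, or if its
    calibrated probability is too uncertain to trust.
    """
    resolved, to_discover = {}, []
    for name in candidates:
        p = probabilities.get(name)
        if name not in memory.known or p is None:
            to_discover.append(name)       # new functionality required
        elif entropy(p) > max_entropy:
            to_discover.append(name)       # existing affordance, but too uncertain
        else:
            resolved[name] = p >= 0.5      # confident latent-space inference
    return resolved, to_discover

def discover(names, memory, propose_and_label):
    """Ask the (stand-in) vision-language model for labels and queue them."""
    for name in names:
        memory.pending.append((name, propose_and_label(name)))

def finish_offline_learning(memory):
    """Placeholder for offline learning: promote queued affordances to memory."""
    for name, _label in memory.pending:
        memory.known.add(name)
    memory.pending.clear()

# Example: "movable" is known and confident; "traversable" is new.
memory = AffordanceMemory(known={"movable"})
resolved, to_discover = route(["movable", "traversable"], {"movable": 0.95}, memory)
discover(to_discover, memory, propose_and_label=lambda name: True)
finish_offline_learning(memory)
```

Keeping discovery behind the uncertainty check means the stand-in model is called only for the candidates in `to_discover`, while confident cases are answered directly from the latent-space probabilities.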

Demonstrations

A4Dance performs affordance-based decision making across diverse scenarios. In the first example, the planner selects the cart because it is the most movable object available for the task. In the second example, existing affordances are insufficient, so uncertainty-guided affordance discovery is triggered and introduces a new "traversable" affordance, which allows the task to be completed.

A4Dance transfers across robot platforms and domains. The same affordance generation, inference, and discovery framework is deployed on a tabletop robot arm, demonstrating that affordance reasoning is not tied to a specific robot platform. Because generation and labeling are conditioned on platform-specific capabilities, A4Dance adapts its affordance predictions to new robots and tasks without changes to the underlying framework.
