PAT3D teaser image

PAT3D couples text-driven 3D generation with differentiable rigid-body simulation to produce stable, physically plausible, and simulation-ready scenes.

Abstract

We introduce PAT3D, a physics-augmented text-to-3D scene generation framework that integrates vision-language models with physics-based simulation to produce physically plausible and intersection-free 3D scenes.

Given a text prompt, PAT3D generates 3D objects, infers spatial relations, and organizes them into a hierarchical scene tree. This structure is converted into initial simulation conditions, where a differentiable rigid-body simulator drives the scene toward static equilibrium under gravity.

A simulation-in-the-loop optimization step further improves physical stability and semantic consistency, yielding scenes that are directly usable for downstream simulation, editing, and robotic manipulation tasks.
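To make the pipeline concrete, here is a toy sketch (assumed structure, not the paper's actual data format) of a hierarchical scene tree and its conversion to initial simulation conditions: each child object is spawned a small clearance above its parent's top surface, so the rigid-body simulator can settle it under gravity without initial intersections.

```python
# Hypothetical scene tree: names, heights, and the nesting are illustrative.
scene_tree = {
    "floor": {"height": 0.0, "children": {
        "table": {"height": 1.0, "children": {
            "cup": {"height": 0.2, "children": {}},
        }},
    }},
}

def initial_conditions(tree, base=0.0):
    """Flatten the tree into {name: initial_z} poses, stacking each
    child a small clearance above its parent's top surface."""
    poses = {}
    for name, node in tree.items():
        z = base + 0.05  # clearance avoids interpenetration at t=0
        poses[name] = z
        poses.update(initial_conditions(node["children"], z + node["height"]))
    return poses

poses = initial_conditions(scene_tree)
```

Here `poses` places the cup above the table, which in turn sits above the floor; gravity in the forward simulation then closes the clearance gaps.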

Method Overview

PAT3D method overview

  • Given an input text, a reference image is first generated to capture spatial relations among objects; 3D assets are then generated from it using vision foundation models, and a scene tree is extracted using a VLM.
  • Assets are arranged into an initial layout using 3D priors from monocular depth estimation (left), then refined with the scene tree to produce an intersection-free configuration for simulation (right).
  • Forward simulation ensures physical plausibility but may distort semantics (left). We address this with simulation-in-the-loop optimization, enforcing semantic consistency and physical validity (right).
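The refinement loop above can be sketched with a deliberately simplified toy (assumed, not the paper's implementation): 1D boxes fall under gravity in a trivial forward "simulation", and a cup's horizontal position is updated by gradient descent on a smooth semantic alignment loss ("cup centered on the table"), re-running the simulation each step to check that the resulting pose is physically supported. The real system differentiates through a rigid-body simulator; here only the semantic term is differentiated.

```python
def drop(boxes):
    """Forward simulation: each box falls until its bottom rests on the
    ground (y = 0) or on the top of a lower box it horizontally overlaps.
    Returns the rest height of each box's bottom face."""
    order = sorted(range(len(boxes)), key=lambda i: boxes[i]["y"])
    rest = {}
    for i in order:
        support = 0.0
        for j in rest:  # boxes already settled below this one
            if abs(boxes[j]["x"] - boxes[i]["x"]) < (boxes[j]["w"] + boxes[i]["w"]) / 2:
                support = max(support, rest[j] + boxes[j]["h"])
        rest[i] = support
    return [rest[i] for i in range(len(boxes))]

table = {"x": 0.0, "y": 0.0, "w": 2.0, "h": 1.0}
cup   = {"x": 1.4, "y": 2.0, "w": 0.4, "h": 0.2}  # starts past the table edge

# Simulation-in-the-loop refinement: minimize squared misalignment between
# cup and table centers, re-simulating each step to test physical support.
x = cup["x"]
for _ in range(100):
    grad = 2.0 * (x - table["x"])  # d/dx of (x - x_table)^2
    x -= 0.1 * grad
    cup["x"] = x
    heights = drop([table, cup])
    if heights[1] == table["h"] and abs(grad) < 1e-3:
        break  # cup rests on the table top and is centered
```

After the loop, the forward drop leaves the cup supported on the table rather than falling to the ground, illustrating how re-simulation keeps semantic edits physically valid.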

Applications

Baseline Comparison

The examples below follow the asset folder structure of the baseline comparison set, showing the text prompt and the 3D output of each method. Blank GraphDreamer entries indicate out-of-memory failures, for which no result was produced.

Columns: Prompt | GraphDreamer | Blender-MCP | MIDI | Ours

BibTeX

@inproceedings{lin2026patd,
  title={{PAT}3D: Physics-Augmented Text-to-3D Scene Generation},
  author={Guying Lin and Kemeng Huang and Michael Liu and Ruihan Gao and Hanke Chen and Lyuhao Chen and Beijia Lu and Taku Komura and Yuan Liu and Jun-Yan Zhu and Minchen Li},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=iIRxFkeCuY}
}