We introduce PAT3D, a physics-augmented text-to-3D scene generation framework that integrates vision-language models with physics-based simulation to produce physically plausible and intersection-free 3D scenes.
Given a text prompt, PAT3D generates 3D objects, infers spatial relations, and organizes them into a hierarchical scene tree. This structure is converted into initial simulation conditions, where a differentiable rigid-body simulator drives the scene toward static equilibrium under gravity.
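The scene-tree stage described above can be sketched roughly as follows. This is a minimal illustration, not PAT3D's actual data structure or API: the node class, field names, and the simple stacking rule standing in for equilibrium placement are all assumptions for exposition.

```python
# Hedged sketch of a hierarchical scene tree: spatial relations inferred from
# a prompt become parent-child links, and the tree is converted into initial
# positions for the simulator (here, naive stacking along the y axis).

from dataclasses import dataclass, field

@dataclass
class SceneNode:
    name: str
    height: float                                  # bounding-box height
    children: list = field(default_factory=list)   # objects resting on this one

def place(node, base_y=0.0, out=None):
    """Assign each object a resting y (its bottom face) by stacking
    children on top of their parent."""
    if out is None:
        out = {}
    out[node.name] = base_y
    for child in node.children:
        place(child, base_y + node.height, out)
    return out

# "a book on a table, and a cup on the book"
table = SceneNode("table", 0.75,
                  [SceneNode("book", 0.125, [SceneNode("cup", 0.1)])])
positions = place(table)
print(positions)  # {'table': 0.0, 'book': 0.75, 'cup': 0.875}
```

In the actual framework these initial conditions seed a rigid-body simulation rather than being final poses; the sketch only shows how a relation hierarchy maps to a starting layout.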
A simulation-in-the-loop optimization step further improves physical stability and semantic consistency, yielding scenes that are directly usable for downstream simulation, editing, and robotic manipulation tasks.
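The idea of driving a scene toward static equilibrium with a differentiable objective can be illustrated in one dimension. This toy example is not PAT3D's optimizer: the energy (gravity potential plus a quadratic ground-penetration penalty), the constants, and the plain gradient descent loop are all stand-in assumptions.

```python
# Hedged 1D sketch: settle an object's height y by gradient descent on
# E(y) = m*g*y + 0.5*k*min(y, 0)**2, i.e. gravity plus a penalty for
# penetrating the ground at y = 0.

def energy_grad(y, m=1.0, g=9.8, k=1e4):
    """dE/dy: constant gravity term, plus penalty gradient when y < 0."""
    return m * g + (k * y if y < 0 else 0.0)

def settle(y, lr=1e-4, steps=20000):
    """Descend the energy gradient until the object rests in equilibrium."""
    for _ in range(steps):
        y -= lr * energy_grad(y)
    return y

y_star = settle(0.5)   # dropped from y = 0.5
# Equilibrium balances gravity against the penalty: y* = -m*g/k = -9.8e-4,
# a hair below the ground plane, as expected for a stiff penalty.
```

A real intersection-free formulation would also handle object-object contact and keep penetration at zero rather than trading it off against stiffness, but the fixed point of this toy descent shows the mechanism.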
The examples below follow the asset folder structure of the baseline comparison set, showing the text prompt and each method's 3D output side by side. Blank GraphDreamer entries indicate runs that failed with out-of-memory errors and produced no result.
| Prompt | GraphDreamer | Blender-MCP | MIDI | Ours |
|---|---|---|---|---|
```bibtex
@inproceedings{lin2026patd,
  title={{PAT}3D: Physics-Augmented Text-to-3D Scene Generation},
  author={Guying Lin and Kemeng Huang and Michael Liu and Ruihan Gao and Hanke Chen and Lyuhao Chen and Beijia Lu and Taku Komura and Yuan Liu and Jun-Yan Zhu and Minchen Li},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=iIRxFkeCuY}
}
```