Generative Models for Multimodal 3D Scene Generation
Yao Wei is a PhD student in the Department of Earth Observation Science. (Co)Promotors are prof.dr.ir. G. Vosselman from the Faculty of Geo-Information Science and Earth Observation, University of Twente, and prof. M. Yang, University of Bath.
Three-dimensional (3D) scene models are fundamental to how humans and intelligent systems perceive, reason about, and interact with the world. From digital avatars and product design to indoor environments and urban landscapes, 3D data provides semantic and spatial context that 2D data alone cannot fully convey. This dissertation focuses on developing novel approaches for controllable, coherent, and functionally meaningful 3D scene generation from diverse input modalities. We begin at the object level, tackling the challenging task of generating flexible and controllable 3D point clouds from a single image, especially for general and structurally complex objects such as buildings. Extending beyond individual objects, we investigate multi-object 3D scene generation with an emphasis on coherence and functional usability. Specifically, we explore how to incorporate global semantics, layout regularization, and human interaction priors into generative pipelines to ensure that the synthesized scenes are not only visually realistic but also functionally usable by humans.
First, we introduce a hybrid explicit-implicit generative modeling approach for 3D point cloud generation from a single image. This method addresses limitations of prior works, such as fixed-resolution outputs and limited geometric detail. We employ a Normalizing Flow-based generator as an explicit density estimator, allowing for variable-size point cloud generation by modeling the 3D point distribution. An implicit cross-modal discriminator further enhances shape quality via adversarial training. Evaluations on the large-scale synthetic ShapeNet dataset demonstrate the superior performance of our method.
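To make the explicit density estimation concrete, the sketch below illustrates an image-conditioned point flow under simplifying assumptions; it is not the dissertation's implementation, the layer and function names (ConditionalCoupling, sample_point_cloud) are hypothetical, and the implicit cross-modal discriminator is omitted. Because the flow models the distribution of individual 3D points given an image embedding, drawing N latent samples and inverting the flow yields a point cloud of arbitrary size N.

```python
# Minimal sketch (illustrative, not the dissertation's code): a conditional
# affine coupling layer over 3D points, conditioned on an image embedding.
import torch
import torch.nn as nn

class ConditionalCoupling(nn.Module):
    """Transforms two of the three coordinates conditioned on the remaining
    coordinate and the image embedding; invertible with tractable log-det."""
    def __init__(self, cond_dim=128, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1 + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * 2),  # scale and shift for the 2 transformed dims
        )

    def forward(self, x, cond):
        x_a, x_b = x[:, :1], x[:, 1:]                     # split coordinates (1 | 2)
        log_s, t = self.net(torch.cat([x_a, cond], dim=-1)).chunk(2, dim=-1)
        log_s = torch.tanh(log_s)                          # keep scales bounded
        y_b = x_b * torch.exp(log_s) + t
        return torch.cat([x_a, y_b], dim=-1), log_s.sum(dim=-1)  # output, log|det J|

    def inverse(self, y, cond):
        y_a, y_b = y[:, :1], y[:, 1:]
        log_s, t = self.net(torch.cat([y_a, cond], dim=-1)).chunk(2, dim=-1)
        log_s = torch.tanh(log_s)
        return torch.cat([y_a, (y_b - t) * torch.exp(-log_s)], dim=-1)

def sample_point_cloud(flow, img_embed, num_points):
    """Variable-size sampling: draw num_points Gaussian latents per shape
    and invert the flow conditioned on the image embedding (1, cond_dim)."""
    cond = img_embed.expand(num_points, -1)
    z = torch.randn(num_points, 3)
    return flow.inverse(z, cond)
```

In practice several such coupling layers would be stacked with permutations in between, and the point-wise log-likelihood from the forward pass serves as the explicit training objective.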
Second, we present BuilDiff, a method designed to generate 3D building models from single images. Unlike existing approaches that are restricted to LoD1 building models derived from nadir-view aerial imagery, BuilDiff handles general-view aerial images using a two-stage conditional diffusion process. We introduce a weighted footprint-based regularization loss to improve geometric fidelity and reduce ambiguity during the denoising process. To support this task, we construct a real-world dataset, BuildingNL3D, containing paired aerial images and airborne laser scanning (ALS) point clouds. Results show that BuilDiff can generate LoD2-level buildings with roof structures on both synthetic and real-world datasets.
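The footprint regularization can be sketched as follows, assuming the footprint is rasterized as a binary mask in the horizontal plane and that points are expressed in the same axis-aligned frame; the exact weighting scheme in BuilDiff may differ, and the function name footprint_regularization is hypothetical.

```python
# Hedged sketch of a footprint regularizer: penalize denoised points whose
# horizontal projection falls outside the building footprint, weighted more
# strongly at late (less noisy) diffusion timesteps.
import torch
import torch.nn.functional as F

def footprint_regularization(points, footprint_mask, t, num_steps, extent=1.0):
    """
    points:         (B, N, 3) current denoised estimate of the point cloud
    footprint_mask: (B, 1, H, W) binary footprint rasterized in the xy-plane
    t:              (B,) diffusion timestep of each sample
    """
    # Map xy coordinates from [-extent, extent] to normalized grid coords [-1, 1]
    grid = (points[..., :2] / extent).clamp(-1, 1).unsqueeze(2)        # (B, N, 1, 2)
    inside = F.grid_sample(footprint_mask.float(), grid,
                           align_corners=False).squeeze(1).squeeze(-1)  # (B, N)
    outside_penalty = (1.0 - inside).mean(dim=1)                        # (B,)
    # Trust the footprint constraint more as denoising progresses
    weight = 1.0 - t.float() / num_steps
    return (weight * outside_penalty).mean()
```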
Third, we address the shortcomings of current 3D indoor scene generation methods, which often focus narrowly on local semantics and produce physically implausible layouts. We present Planner3D, a controllable framework that jointly models object shapes and spatial layouts for improved scene coherence. We enhance the scene graph representation using a large language model (LLM) to extract hierarchical graph features. A layout regularization loss further enforces physical plausibility by encouraging structurally valid object arrangements. Experiments on the SG-FRONT dataset show that our method generates semantically and physically coherent 3D indoor scenes.
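One plausible form of such a layout regularizer penalizes the pairwise intersection volume of predicted object bounding boxes so that furniture does not interpenetrate; the sketch below is an illustrative approximation, not necessarily Planner3D's exact formulation.

```python
# Minimal sketch of a collision-style layout regularizer over axis-aligned boxes.
import torch

def layout_regularization(centers, sizes):
    """
    centers: (N, 3) predicted object centers
    sizes:   (N, 3) predicted box sizes (width, height, depth)
    Returns a scalar penalty that grows with pairwise box intersection volume.
    """
    mins = centers - sizes / 2                                        # (N, 3)
    maxs = centers + sizes / 2
    # Pairwise intersection extents along each axis
    inter_min = torch.maximum(mins.unsqueeze(0), mins.unsqueeze(1))   # (N, N, 3)
    inter_max = torch.minimum(maxs.unsqueeze(0), maxs.unsqueeze(1))
    inter_vol = (inter_max - inter_min).clamp(min=0).prod(dim=-1)     # (N, N)
    # Count each object pair once and ignore self-intersections
    penalty = torch.triu(inter_vol, diagonal=1).sum()
    return penalty / max(centers.shape[0], 1)
```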
Lastly, we propose a human-aware 3D indoor scene generation method that constructs environments from textual descriptions through structured reasoning and optimization. A scene graph serves as a semantic bridge between language and 3D geometry, capturing both object co-occurrence patterns and human interaction priors distilled from LLMs. Initial layouts are generated using graph-based diffusion models, followed by an optimization stage where 3D objects and virtual humans are assembled to refine spatial configurations. This approach addresses challenges in text-to-3D generation by improving semantic completeness and functional realism. Experiments on the 3D-FRONT dataset demonstrate that our method produces visually plausible and practically usable scenes.
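Under simplifying assumptions, the optimization stage can be illustrated as a gradient-based refinement that keeps the layout close to the diffusion output while maintaining clearance around virtual-human anchor positions derived from the interaction priors; the function refine_layout and its parameters are hypothetical, not the dissertation's implementation.

```python
# Hedged sketch: refine a diffusion-initialized 2D layout so that objects keep
# clearance around "human anchor" positions (e.g., space to sit at a desk).
import torch

def refine_layout(init_centers, sizes, human_anchors,
                  clearance=0.6, steps=200, lr=0.01):
    """
    init_centers:  (N, 2) initial xy positions from the graph-based diffusion stage
    sizes:         (N, 2) xy footprints of the objects
    human_anchors: (H, 2) xy positions where a virtual human must fit
    """
    centers = init_centers.clone().requires_grad_(True)
    opt = torch.optim.Adam([centers], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # Stay close to the layout proposed by the diffusion model
        fidelity = ((centers - init_centers) ** 2).sum()
        # Push object boundaries at least `clearance` away from each anchor
        dist = torch.cdist(human_anchors, centers)            # (H, N)
        radius = sizes.norm(dim=-1) / 2                       # coarse object radius
        violation = (clearance + radius - dist).clamp(min=0)
        loss = fidelity + 10.0 * violation.sum()
        loss.backward()
        opt.step()
    return centers.detach()
```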
In summary, this dissertation explores deep generative models for 3D scene synthesis from multimodal inputs. The proposed methods have broad applicability in areas such as interior design, game development, and augmented/virtual reality. By enabling realistic 3D scene generation from a single RGB image, a structured scene graph, or a simple text prompt, our work empowers non-expert users to create tailored 3D content. As generative modeling and multimodal learning continue to evolve, these contributions offer significant potential for advancing 3D scene understanding applications.