Controllable Structured Object Generation


Amruta Priyadarshan Muthal

Abstract

While text-to-image (T2I) models have demonstrated remarkable improvements in spatial consistency and the positioning of large scene components, they struggle with fine-grained structural control, particularly for anatomically complex objects with multiple interconnected parts. This limitation manifests as anatomical inconsistencies, missing or hallucinated components, and incorrect part connectivity: critical failures for applications requiring precise structural fidelity. Recent developments in controlled image generation have produced models capable of high-fidelity synthesis. However, these models rely on sophisticated conditioning inputs, such as detailed pose skeletons or segmentation masks, that require human expertise to create.

We introduce PLATO, a novel two-stage framework that bridges this gap by enabling precise, part-controlled object generation from simple, intuitive inputs: an object category and a list of constituent parts. The first stage employs PLayGen, our novel part layout generator built upon a Graph Convolutional Network Variational Autoencoder (GCN-VAE) architecture. At its core, PLayGen uses a GCN Refinement Decoder that iteratively refines part placements through graph-based message passing, capturing inter-part spatial relationships at each refinement step. To address the challenge of training with non-intersecting bounding boxes and small parts, we introduce the Dynamic Margin IoU (DyI) loss, which dynamically adjusts margins based on box dimensions to ensure meaningful gradients throughout training. Additionally, our Differential Refinement loss scheme enables the model to learn both coarse positioning and fine-grained adjustments simultaneously. The overall loss formulation also stabilizes training while improving structural coherence.
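The abstract does not spell out these components, but both ideas can be illustrated with minimal sketches. The module below is a hypothetical single refinement step in PyTorch: each part embeds its current bounding box, aggregates messages from the other parts over an assumed fully connected part graph, and predicts a box delta; stacking several such steps yields the iterative coarse-to-fine refinement described above.

```python
import torch
import torch.nn as nn

class GCNRefinementStep(nn.Module):
    """One graph-convolution refinement step over part bounding boxes (sketch)."""

    def __init__(self, hidden=64):
        super().__init__()
        self.embed = nn.Linear(4, hidden)        # box coordinates -> node feature
        self.message = nn.Linear(hidden, hidden)
        # [own feature, aggregated message] -> predicted box delta
        self.update = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4))

    def forward(self, boxes, adj):
        # boxes: (N, 4) part boxes; adj: (N, N) row-normalized adjacency.
        h = torch.relu(self.embed(boxes))
        m = adj @ self.message(h)                # graph-based message passing
        return boxes + self.update(torch.cat([h, m], dim=-1))
```

Similarly, the exact DyI formulation is not given here; a plausible sketch expands each target box by a margin proportional to its own width and height before computing IoU, so that small or initially non-intersecting boxes still yield a non-zero overlap and hence a useful gradient. The margin scale `alpha` is a hypothetical hyperparameter, and boxes are assumed to be in (x1, y1, x2, y2) format.

```python
import torch

def dyi_loss(pred, target, alpha=0.1, eps=1e-7):
    """Dynamic Margin IoU sketch; pred and target are (N, 4) boxes."""
    # Margins proportional to each target box's own dimensions.
    mx = alpha * (target[:, 2] - target[:, 0])
    my = alpha * (target[:, 3] - target[:, 1])
    expanded = torch.stack([target[:, 0] - mx, target[:, 1] - my,
                            target[:, 2] + mx, target[:, 3] + my], dim=1)

    # IoU between predicted boxes and margin-expanded targets.
    x1 = torch.max(pred[:, 0], expanded[:, 0])
    y1 = torch.max(pred[:, 1], expanded[:, 1])
    x2 = torch.min(pred[:, 2], expanded[:, 2])
    y2 = torch.min(pred[:, 3], expanded[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = ((pred[:, 2] - pred[:, 0]).clamp(min=0)
              * (pred[:, 3] - pred[:, 1]).clamp(min=0))
    area_e = (expanded[:, 2] - expanded[:, 0]) * (expanded[:, 3] - expanded[:, 1])
    iou = inter / (area_p + area_e - inter + eps)
    return (1.0 - iou).mean()
```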

In the second stage, PLayGen’s synthesized layout guides a custom-tuned ControlNet-based diffusion model through a novel visual conditioning format: color-coded inscribed ellipses combined with part-aware text prompts. This design enforces both spatial constraints and semantic consistency, yielding anatomically accurate, high-fidelity object generations that contain exactly the user-specified parts.
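The conditioning format can be sketched concretely. The abstract fixes the idea (ellipses inscribed in the part boxes, color-coded by part identity) but not the palette, resolution, or coordinate conventions, so everything below is an assumption: `render_condition`, the `PALETTE` mapping, and the example layout are all hypothetical illustrations.

```python
from PIL import Image, ImageDraw

# Hypothetical fixed palette: one color per part identity.
PALETTE = {"head": (255, 0, 0), "torso": (0, 255, 0), "tail": (0, 0, 255)}

def render_condition(layout, size=(512, 512)):
    """Render a conditioning image: each part's bounding box is filled
    with its inscribed ellipse, color-coded by part identity."""
    canvas = Image.new("RGB", size, (0, 0, 0))
    draw = ImageDraw.Draw(canvas)
    for name, box in layout:                   # box = (x1, y1, x2, y2) in pixels
        draw.ellipse(box, fill=PALETTE[name])  # PIL inscribes the ellipse in `box`
    return canvas

# Example: a layout synthesized by the first stage for a bird.
layout = [("head", (180, 60, 300, 170)),
          ("torso", (140, 160, 380, 360)),
          ("tail", (340, 300, 480, 420))]
cond_image = render_condition(layout)
prompt = "a photo of a bird with a head, a torso, and a tail"  # part-aware prompt
```

The rendered image and the part-aware prompt would then be supplied together to the ControlNet-tuned diffusion model, which is how the two signals jointly enforce the spatial and semantic constraints.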


Year of completion: December 2025
Advisor: Dr. Ravi Kiran Sarvadevabhatla
