EGGen: Image Generation with Multi-entity Prior Learning through Entity Guidance

Zhenhong Sun, Junyan Wang, Zhiyu Tan, Daoyi Dong*, Hailan Ma, Hao Li*, Dong Gong

*Corresponding author for this work

Research output: Conference contribution (chapter in conference proceedings), peer-reviewed

Abstract

Diffusion models have shown remarkable prowess in text-to-image synthesis and editing, yet they often stumble when interpreting complex prompts that describe multiple entities with specific attributes and interrelations. The generated images often exhibit inconsistent multi-entity representation (IMR), reflected as inaccurate depictions of the multiple entities and their attributes. Although spatial layout guidance improves multi-entity generation quality in existing works, it remains challenging to handle attribute leakage and avoid unnatural characteristics. To address the IMR challenge, we first conduct in-depth analyses of the diffusion process and attention operation, revealing that the IMR challenges largely stem from the cross-attention mechanism. Guided by these analyses, we introduce an entity guidance generation mechanism that maintains the integrity of the original diffusion model parameters by integrating plug-in networks. Our work advances the stable diffusion model by segmenting comprehensive prompts into distinct entity-specific prompts with bounding boxes, enabling a transition from multi-entity to single-entity generation in cross-attention layers. More importantly, we introduce entity-centric cross-attention layers that focus on individual entities to preserve their uniqueness and accuracy, alongside global entity alignment layers that refine cross-attention maps using multi-entity priors for precise positioning and attribute accuracy. Additionally, a linear attenuation module progressively reduces the influence of these layers during inference, preventing oversaturation and preserving generation fidelity. Comprehensive experiments demonstrate that this entity guidance generation enhances existing text-to-image models in generating detailed, multi-entity images.
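The abstract describes restricting entity-specific cross-attention to each entity's bounding box and linearly attenuating that guidance over inference steps. The sketch below is a minimal, hypothetical NumPy illustration of that idea, not the paper's implementation: all function and variable names (`entity_guided_attention`, `linear_attenuation`, the blending rule) are assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def linear_attenuation(step, total_steps, w0=1.0):
    # Guidance weight decays linearly from w0 to 0 over inference steps,
    # mirroring the paper's idea of progressively reducing plug-in influence.
    return w0 * (1.0 - step / total_steps)

def entity_guided_attention(q, kv_global, entity_kvs, entity_masks,
                            step, total_steps):
    """Blend global cross-attention with per-entity attention inside boxes.

    q:            (N, d) flattened image-token queries (N = H*W)
    kv_global:    (K, V) keys/values for the full prompt
    entity_kvs:   list of (K_i, V_i) for each entity-specific prompt
    entity_masks: list of (N,) boolean masks from each entity's bounding box
    """
    d = q.shape[-1]
    k_g, v_g = kv_global
    out = softmax(q @ k_g.T / np.sqrt(d)) @ v_g   # baseline cross-attention
    w = linear_attenuation(step, total_steps)
    for (k_i, v_i), m in zip(entity_kvs, entity_masks):
        ent = softmax(q @ k_i.T / np.sqrt(d)) @ v_i  # entity-centric attention
        out[m] = (1 - w) * out[m] + w * ent[m]       # blend only inside the box
    return out
```

At `step == total_steps` the attenuation weight reaches zero and the output reduces to plain global cross-attention, so the entity guidance only shapes the early denoising steps.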

Original language: English
Title of host publication: MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia
Place of publication: New York
Publisher: Association for Computing Machinery (ACM)
Pages: 6637-6645
Number of pages: 9
ISBN (Electronic): 9798400706868
DOIs
Publication status: Published - 28 Oct 2024
Event: 32nd ACM International Conference on Multimedia, MM 2024 - Melbourne, Australia
Duration: 28 Oct 2024 - 1 Nov 2024
Conference number: 32
https://dl.acm.org/doi/proceedings/10.1145/3664647

Conference

Conference: 32nd ACM International Conference on Multimedia, MM 2024
Abbreviated title: MM '24
Country/Territory: Australia
City: Melbourne
Period: 28/10/24 - 1/11/24
Other: We are delighted to welcome you to Melbourne, Australia for ACM Multimedia 2024, the 32nd ACM International Conference on Multimedia. ACM Multimedia is the premier international conference series in the area of multimedia within the field of computer science. Since 1993, ACM Multimedia has been bringing together worldwide researchers and practitioners from academia and industry to present their innovative research and to discuss recent advancements in multimedia.
Internet address
