SGEdit: Bridging LLM with Text2Image Generative Model for Scene Graph-based Image Editing
ACM Transactions on Graphics (IF 7.8). Pub Date: 2024-11-19. DOI: 10.1145/3687957
Zhiyuan Zhang, DongDong Chen, Jing Liao

Scene graphs offer a structured, hierarchical representation of images, with nodes and edges symbolizing objects and the relationships among them. This representation can serve as a natural interface for image editing, dramatically improving precision and flexibility. Leveraging this benefit, we introduce a new framework that integrates a large language model (LLM) with a Text2Image generative model for scene graph-based image editing. This integration enables precise modifications at the object level and creative recomposition of scenes without compromising overall image integrity. Our approach involves two primary stages: 1) Using an LLM-driven scene parser, we construct an image's scene graph, capturing key objects and their interrelationships, and parse fine-grained attributes such as object masks and descriptions. These annotations facilitate concept learning with a fine-tuned diffusion model, which represents each object with an optimized token and a detailed description prompt. 2) During the image editing phase, an LLM editing controller guides the edits towards specific areas. These edits are then implemented by an attention-modulated diffusion editor, which utilizes the fine-tuned model to perform object additions, deletions, replacements, and adjustments. Through extensive experiments, we demonstrate that our framework significantly outperforms existing image editing methods in editing precision and scene aesthetics. Our code is available at https://bestzzhang.github.io/SGEdit.
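The scene-graph interface described above can be sketched as a minimal data structure supporting the paper's edit operations (add, delete, replace). All class and method names below (`SceneGraph`, `add_object`, `replace_object`, the `<name>` token convention) are illustrative assumptions for exposition, not the authors' actual API; the real system attaches learned diffusion-model tokens and masks to each node.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical sketch of the scene-graph editing interface; names are
# illustrative, not taken from the SGEdit codebase.

@dataclass
class SceneGraph:
    """Nodes are objects; edges are (subject, relation, object) triples."""
    nodes: Dict[str, dict] = field(default_factory=dict)
    edges: List[Tuple[str, str, str]] = field(default_factory=list)

    def add_object(self, name: str, description: str = "", mask=None) -> None:
        # Stage 1: an LLM-driven parser would fill in the description and
        # mask; concept learning would attach an optimized token per object.
        self.nodes[name] = {"description": description, "mask": mask,
                            "token": f"<{name}>"}

    def relate(self, subj: str, rel: str, obj: str) -> None:
        self.edges.append((subj, rel, obj))

    # Stage 2: object-level edits an editing controller could invoke.
    def remove_object(self, name: str) -> None:
        self.nodes.pop(name, None)
        self.edges = [e for e in self.edges if name not in (e[0], e[2])]

    def replace_object(self, old: str, new: str, description: str = "") -> None:
        # Swap the node while preserving its relationships in the graph.
        self.nodes.pop(old, None)
        self.nodes[new] = {"description": description, "mask": None,
                           "token": f"<{new}>"}
        self.edges = [(new if s == old else s, r, new if o == old else o)
                      for s, r, o in self.edges]

# Usage: replace one object while keeping the scene's relations intact.
g = SceneGraph()
g.add_object("cat", "a tabby cat")
g.add_object("sofa", "a red sofa")
g.relate("cat", "sitting on", "sofa")
g.replace_object("cat", "dog", "a golden retriever")
```

In the full system, each graph edit would then be rendered by the attention-modulated diffusion editor rather than merely updating the data structure.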
