FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention
Guangxuan Xiao, Tianwei Yin, William T. Freeman, Frédo Durand, Song Han
International Journal of Computer Vision (IF 11.6). Published: 2024-09-19. DOI: 10.1007/s11263-024-02227-z
Diffusion models excel at text-to-image generation, especially subject-driven generation of personalized images. However, existing methods are inefficient because they rely on subject-specific fine-tuning, which is computationally intensive and hampers deployment. Moreover, existing methods struggle with multi-subject generation, often blending the identities of different subjects. We present FastComposer, which enables efficient, personalized, multi-subject text-to-image generation without fine-tuning. FastComposer uses subject embeddings extracted by an image encoder to augment the generic text conditioning in diffusion models, enabling personalized image generation from subject images and textual instructions with only forward passes. To address identity blending in multi-subject generation, FastComposer proposes cross-attention localization supervision during training, enforcing that the attention to each reference subject is localized to the correct region of the target image. Naively conditioning on subject embeddings causes subject overfitting, so FastComposer proposes delayed subject conditioning in the denoising process to maintain both identity and editability in subject-driven image generation. FastComposer generates images of multiple unseen individuals in different styles, actions, and contexts. It achieves a 300×–2500× speedup over fine-tuning-based methods and requires zero extra storage for new subjects. FastComposer paves the way for efficient, personalized, and high-quality multi-subject image creation. Code, model, and dataset are available at https://github.com/mit-han-lab/fastcomposer.
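To make the subject-embedding augmentation concrete: it amounts to replacing the text embedding at each subject's placeholder token with a fusion of that token's embedding and an image-encoder feature. Below is a minimal PyTorch sketch; the two-layer MLP fusion module and the tensor shapes are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class SubjectFusion(nn.Module):
    """Fuse a text-token embedding with an image-derived subject embedding.

    Hypothetical two-layer MLP fusion; the paper's actual module may differ.
    """
    def __init__(self, dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, text_embeds, subject_embeds, subject_token_idx):
        # text_embeds:       (batch, tokens, dim) from the text encoder
        # subject_embeds:    (batch, num_subjects, dim) from the image encoder
        # subject_token_idx: token positions holding each subject's placeholder
        out = text_embeds.clone()
        for j, tok in enumerate(subject_token_idx):
            fused = self.mlp(torch.cat([text_embeds[:, tok], subject_embeds[:, j]], dim=-1))
            out[:, tok] = fused  # swap in the subject-aware embedding
        return out
```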
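The cross-attention localization supervision can likewise be pictured with a short sketch. The snippet below illustrates a balanced-L1-style objective that rewards attention inside each subject's segmentation mask and penalizes attention outside it; the layout of the attention maps and masks is assumed for illustration.

```python
import torch

def localization_loss(attn_probs, subject_masks, subject_token_idx):
    """Encourage each subject token to attend inside its segmentation mask.

    attn_probs:        (batch, pixels, tokens) cross-attention probabilities,
                       assumed already averaged over heads.
    subject_masks:     (batch, num_subjects, pixels) binary masks per subject.
    subject_token_idx: token positions holding each subject's embedding.
    """
    eps = 1e-6
    loss = attn_probs.new_zeros(())
    for j, tok in enumerate(subject_token_idx):
        a = attn_probs[:, :, tok]          # (batch, pixels) attention for this subject
        m = subject_masks[:, j].float()    # (batch, pixels) its target region
        inside = (a * m).sum(-1) / (m.sum(-1) + eps)                # mean in-mask attention
        outside = (a * (1 - m)).sum(-1) / ((1 - m).sum(-1) + eps)   # mean out-of-mask attention
        loss = loss + (outside - inside).mean()  # pull attention into the mask
    return loss / len(subject_token_idx)
```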
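Finally, delayed subject conditioning is easy to sketch: run the first fraction of denoising steps with plain text conditioning (so the layout remains editable), then switch to the subject-augmented conditioning (so the identity is preserved). This sketch assumes a diffusers-style UNet and scheduler; the switch fraction `alpha` is an illustrative hyperparameter, not the paper's tuned value.

```python
import torch

@torch.no_grad()
def denoise_with_delayed_conditioning(unet, scheduler, latents,
                                      text_cond, augmented_cond, alpha=0.2):
    # text_cond:      text-only encoder hidden states (editability)
    # augmented_cond: text embeddings with subject embeddings fused in (identity)
    timesteps = scheduler.timesteps
    switch_step = int(alpha * len(timesteps))
    for i, t in enumerate(timesteps):
        # Early steps fix the layout with generic text conditioning;
        # later steps inject the subject identity.
        cond = text_cond if i < switch_step else augmented_cond
        noise_pred = unet(latents, t, encoder_hidden_states=cond).sample
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return latents
```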