MaskDiffusion: Boosting Text-to-Image Consistency with Conditional Mask, International Journal of Computer Vision

MaskDiffusion: Boosting Text-to-Image Consistency with Conditional Mask
International Journal of Computer Vision (IF 11.6), Pub Date: 2024-12-12, DOI: 10.1007/s11263-024-02294-2
Yupeng Zhou, Daquan Zhou, Yaxing Wang, Jiashi Feng, Qibin Hou

Recent advancements in diffusion models have showcased their impressive capacity to generate visually striking images. However, ensuring a close match between the generated image and the given prompt remains a persistent challenge. In this work, we identify that a crucial factor leading to the erroneous generation of objects and their attributes is the inadequate cross-modality relation learning between the prompt and the generated images. To better align the prompt and image content, we advance the cross-attention with an adaptive mask, which is conditioned on the attention maps and the prompt embeddings, to dynamically adjust the contribution of each text token to the image features. This mechanism explicitly diminishes the ambiguity in the semantic information embedding of the text encoder, leading to a boost of text-to-image consistency in the synthesized images. Our method, termed MaskDiffusion, is training-free and hot-pluggable for popular pre-trained diffusion models. When applied to the latent diffusion models, our MaskDiffusion can largely enhance their capability to correctly generate objects and their attributes, with negligible computation overhead compared to the original diffusion models. Our project page is https://github.com/HVision-NKU/MaskDiffusion.
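
To make the mechanism described in the abstract more concrete, here is a minimal NumPy sketch of cross-attention whose per-token weights are rescaled by a mask before being applied to the values. It is not the authors' implementation: the abstract only says that the mask is conditioned on the attention maps and the prompt embeddings, so the masking rule `toy_adaptive_mask`, its parameters `sim_threshold` and `suppress`, the renormalisation step, and all function names are assumptions made for illustration. For simplicity the keys stand in for the prompt embeddings.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def toy_adaptive_mask(attn, token_emb, sim_threshold=0.5, suppress=0.1):
    """Hypothetical masking rule (an assumption, not the paper's rule).

    attn:      (n_pixels, n_tokens) cross-attention map.
    token_emb: (n_tokens, d) prompt token embeddings.

    For every pair of tokens whose embeddings are similar (cosine similarity
    above sim_threshold), each pixel keeps the token with the larger attention
    weight and down-weights the other by the factor `suppress`.
    """
    n_tokens = attn.shape[1]
    unit = token_emb / np.linalg.norm(token_emb, axis=-1, keepdims=True)
    sim = unit @ unit.T
    mask = np.ones_like(attn)
    for i in range(n_tokens):
        for j in range(i + 1, n_tokens):
            if sim[i, j] > sim_threshold:
                i_wins = attn[:, i] >= attn[:, j]
                mask[~i_wins, i] = suppress  # token j dominates these pixels
                mask[i_wins, j] = suppress   # token i dominates these pixels
    return mask


def masked_cross_attention(q, k, v, mask_fn):
    """Cross-attention with a per-pixel, per-token mask applied to the weights.

    q: (n_pixels, d) image-feature queries.
    k: (n_tokens, d) text keys (used here as the prompt embeddings).
    v: (n_tokens, d_v) text values.
    """
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    mask = mask_fn(attn, k)
    weighted = attn * mask
    # Renormalise so each pixel's weights sum to 1 again (an assumption).
    weighted = weighted / weighted.sum(axis=-1, keepdims=True)
    return weighted @ v, weighted, mask


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.normal(size=(6, 4))
    # Tokens 0 and 1 point in the same direction, token 2 is orthogonal.
    k = np.array([[1.0, 0, 0, 0], [2.0, 0, 0, 0], [0, 1.0, 0, 0]])
    v = rng.normal(size=(3, 5))
    out, weights, mask = masked_cross_attention(q, k, v, toy_adaptive_mask)
    print(out.shape, mask)
```

Because the mask only rescales existing attention weights, a sketch like this needs no retraining and adds only a few element-wise operations per attention layer, which is in keeping with the training-free, low-overhead framing of the abstract.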

Updated: 2024-12-12
