StyleAdapter: A Unified Stylized Image Generation Model
International Journal of Computer Vision (IF 11.6) | Pub Date: 2024-10-25 | DOI: 10.1007/s11263-024-02253-x
Zhouxia Wang, Xintao Wang, Liangbin Xie, Zhongang Qi, Ying Shan, Wenping Wang, Ping Luo
This work focuses on generating high-quality images that combine the style of reference images with the content of a given textual description. Current leading approaches such as DreamBooth and LoRA require fine-tuning for each new style, which is time-consuming and computationally expensive. We propose StyleAdapter, a unified stylized image generation model capable of producing a variety of stylized images that match both the content of a given prompt and the style of reference images, without per-style fine-tuning. StyleAdapter introduces a two-path cross-attention (TPCA) module that processes the style information and the textual prompt separately, and it cooperates with a semantic suppressing vision model (SSVM) to suppress the semantic content of the style images. This ensures that the prompt retains control over the content of the generated images while mitigating the negative influence of semantic information leaking from the style references: the content of each generated image adheres to the prompt, and its style aligns with the style references. Moreover, StyleAdapter can be integrated with existing controllable synthesis methods, such as T2I-Adapter and ControlNet, to attain a more controllable and stable generation process. Extensive experiments demonstrate the superiority of our method over previous works.
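The core TPCA idea described above, attending to the text prompt and the style references through two separate cross-attention paths and then fusing the results, can be sketched as follows. This is a minimal single-head NumPy illustration, not the paper's implementation; the additive fusion with a scalar `style_scale`, the weight shapes, and all function names are assumptions for exposition.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, context, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention:
    queries come from the image latents, keys/values from the context."""
    q = query @ w_q            # (n_q, d)
    k = context @ w_k          # (n_c, d)
    v = context @ w_v          # (n_c, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def tpca_block(x, text_emb, style_emb, params, style_scale=1.0):
    """Hypothetical two-path cross-attention block: the text prompt and
    the style references are attended to separately, then combined with
    a residual connection (fusion scheme is an assumption)."""
    text_out = cross_attention(x, text_emb, *params["text"])
    style_out = cross_attention(x, style_emb, *params["style"])
    return x + text_out + style_scale * style_out

# Usage sketch with random weights and embeddings.
rng = np.random.default_rng(0)
d = 16
params = {
    "text": tuple(0.1 * rng.standard_normal((d, d)) for _ in range(3)),
    "style": tuple(0.1 * rng.standard_normal((d, d)) for _ in range(3)),
}
latents = rng.standard_normal((4, d))        # 4 image tokens
text_emb = rng.standard_normal((6, d))       # 6 prompt tokens
style_emb = rng.standard_normal((3, d))      # 3 style-reference tokens
out = tpca_block(latents, text_emb, style_emb, params)
```

Keeping the two paths separate is what lets the prompt path govern content while the style path's contribution can be scaled (or, in the paper, semantically suppressed via SSVM) without disturbing the text conditioning.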