StyleCrafter: Taming Artistic Video Diffusion with Reference-Augmented Adapter Learning
ACM Transactions on Graphics (IF 7.8), Pub Date: 2024-11-19, DOI: 10.1145/3687975
Gongye Liu, Menghan Xia, Yong Zhang, Haoxin Chen, Jinbo Xing, Yibo Wang, Xintao Wang, Ying Shan, Yujiu Yang

Text-to-video (T2V) models have shown remarkable capabilities in generating diverse videos. However, they struggle to produce user-desired artistic videos, owing to (i) the inherent difficulty of expressing specific styles in text and (ii) the generally degraded style fidelity of T2V models. To address these challenges, we introduce StyleCrafter, a generic method that enhances pretrained T2V models with a style control adapter, enabling video generation in any style given a reference image. Considering the scarcity of artistic video data, we first train the style control adapter on style-rich image datasets, then transfer the learned stylization ability to video generation through a tailor-made finetuning paradigm. To promote content-style disentanglement, we employ carefully designed data augmentation strategies that encourage decoupled learning. Additionally, we propose a scale-adaptive fusion module that balances the influence of text-based content features and image-based style features, which improves generalization across diverse text and style combinations. StyleCrafter efficiently generates high-quality stylized videos that align with the text content and resemble the style of the reference image. Experiments demonstrate that our approach is more flexible and efficient than existing competitors. Project page: https://gongyeliu.github.io/StyleCrafter.github.io/
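
To make the adapter design concrete, below is a minimal PyTorch sketch of one plausible realization of the two mechanisms named in the abstract: a style branch that injects reference-image features through its own cross-attention projections, and a scale-adaptive fusion that weights the style contribution based on the inputs rather than a fixed hyperparameter. This is an illustrative assumption, not the authors' implementation; the module names, the decoupled key/value layout, the pooled-feature scale predictor, and all dimensions are hypothetical.

```python
import torch
import torch.nn as nn


def cross_attention(q_proj, k_proj, v_proj, x, ctx):
    # Single-head cross-attention from spatial tokens x to a context sequence ctx.
    q, k, v = q_proj(x), k_proj(ctx), v_proj(ctx)
    w = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return w @ v


class StyleFusedCrossAttention(nn.Module):
    """Decoupled text/style cross-attention with an input-dependent scale on
    the style branch (a hypothetical stand-in for scale-adaptive fusion)."""

    def __init__(self, dim, txt_dim, sty_dim):
        super().__init__()
        self.to_q = nn.Linear(dim, dim, bias=False)
        # Separate key/value projections for the text and style contexts;
        # only the style-side weights would belong to the added adapter.
        self.txt_k = nn.Linear(txt_dim, dim, bias=False)
        self.txt_v = nn.Linear(txt_dim, dim, bias=False)
        self.sty_k = nn.Linear(sty_dim, dim, bias=False)
        self.sty_v = nn.Linear(sty_dim, dim, bias=False)
        # Hypothetical scale predictor: pools both contexts and emits one
        # fusion weight per sample for the style branch.
        self.scale_mlp = nn.Sequential(
            nn.Linear(txt_dim + sty_dim, dim), nn.SiLU(), nn.Linear(dim, 1)
        )

    def forward(self, x, txt_ctx, sty_ctx):
        content = cross_attention(self.to_q, self.txt_k, self.txt_v, x, txt_ctx)
        style = cross_attention(self.to_q, self.sty_k, self.sty_v, x, sty_ctx)
        pooled = torch.cat([txt_ctx.mean(dim=1), sty_ctx.mean(dim=1)], dim=-1)
        scale = torch.sigmoid(self.scale_mlp(pooled)).unsqueeze(1)  # (B, 1, 1)
        return content + scale * style  # text drives content; style is re-weighted


# Toy shapes: 64 latent tokens, CLIP-like text (77 x 768) and style (16 x 1024) tokens.
attn = StyleFusedCrossAttention(dim=320, txt_dim=768, sty_dim=1024)
out = attn(torch.randn(2, 64, 320), torch.randn(2, 77, 768), torch.randn(2, 16, 1024))
print(out.shape)  # torch.Size([2, 64, 320])
```

The point this sketch tries to capture is that the style weight is predicted from the actual text and style inputs, so a single trained adapter could serve many text/style combinations instead of requiring a hand-tuned strength per style.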
