Still-Moving: Customized Video Generation without Customized Video Data
ACM Transactions on Graphics (IF 7.8), Pub Date: 2024-11-19, DOI: 10.1145/3687945
Hila Chefer, Shiran Zada, Roni Paiss, Ariel Ephrat, Omer Tov, Michael Rubinstein, Lior Wolf, Tali Dekel, Tomer Michaeli, Inbar Mosseri

Customizing text-to-image (T2I) models has seen tremendous progress recently, particularly in areas such as personalization, stylization, and conditional generation. However, expanding this progress to video generation is still in its infancy, primarily due to the lack of customized video data. In this work, we introduce Still-Moving, a novel generic framework for customizing a text-to-video (T2V) model, without requiring any customized video data. The framework applies to the prominent T2V design where the video model is built over a T2I model (e.g., via inflation). We assume access to a customized version of the T2I model, trained only on still image data (e.g., using DreamBooth). Naively plugging in the weights of the customized T2I model into the T2V model often leads to significant artifacts or insufficient adherence to the customization data. To overcome this issue, we train lightweight Spatial Adapters that adjust the features produced by the injected T2I layers. Importantly, our adapters are trained on "frozen videos" (i.e., repeated images), constructed from image samples generated by the customized T2I model. This training is facilitated by a novel Motion Adapter module, which allows us to train on such static videos while preserving the motion prior of the video model. At test time, we remove the Motion Adapter modules and leave in only the trained Spatial Adapters. This restores the motion prior of the T2V model while adhering to the spatial prior of the customized T2I model. We demonstrate the effectiveness of our approach on diverse tasks including personalized, stylized, and conditional generation. In all evaluated scenarios, our method seamlessly integrates the spatial prior of the customized T2I model with a motion prior supplied by the T2V model.
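To make the mechanism concrete, below is a minimal PyTorch-style sketch of the two adapter types and the "frozen video" construction the abstract describes. It is illustrative only: the class and function names (SpatialAdapter, MotionAdapter, AdaptedSpatialBlock, make_frozen_video), the bottleneck adapter design, and the wiring are assumptions made for exposition, not the paper's actual implementation.

```python
# Illustrative sketch only; names, shapes, and wiring are assumptions,
# not the released Still-Moving implementation.
import torch
import torch.nn as nn


class SpatialAdapter(nn.Module):
    """Lightweight residual adapter placed after an injected (customized) T2I spatial layer."""

    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, hidden)
        self.up = nn.Linear(hidden, dim)
        nn.init.zeros_(self.up.weight)  # zero-init so the adapter starts as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch * frames, tokens, dim) -- per-frame spatial features
        return x + self.up(torch.relu(self.down(x)))


class MotionAdapter(nn.Module):
    """Train-time-only adapter on the temporal layers; dropped at inference so the
    T2V model's original motion prior is restored."""

    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, hidden)
        self.up = nn.Linear(hidden, dim)
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch * tokens, frames, dim) -- per-location temporal features
        return x + self.up(torch.relu(self.down(x)))


class AdaptedSpatialBlock(nn.Module):
    """Wraps a frozen spatial block whose weights were injected from the customized
    T2I model with a trainable Spatial Adapter."""

    def __init__(self, t2i_block: nn.Module, dim: int):
        super().__init__()
        self.t2i_block = t2i_block  # customized T2I weights, kept frozen
        self.adapter = SpatialAdapter(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.adapter(self.t2i_block(x))


def make_frozen_video(image: torch.Tensor, num_frames: int) -> torch.Tensor:
    """Build a 'frozen video' by repeating one customized-T2I sample along the time axis.
    image: (C, H, W) -> (num_frames, C, H, W)"""
    return image.unsqueeze(0).repeat(num_frames, 1, 1, 1)
```

In this sketch, training would update only the adapter parameters on frozen videos sampled from the customized T2I model; at test time the MotionAdapter branches are discarded and only the SpatialAdapter residuals remain, mirroring the procedure described in the abstract.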

Updated: 2024-11-19