Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation
International Journal of Computer Vision (IF 11.6) Pub Date: 2024-10-24, DOI: 10.1007/s11263-024-02271-9
David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, Rui Zhao, Lingmin Ran, Yuchao Gu, Difei Gao, Mike Zheng Shou

Significant advancements have been achieved in large-scale pre-trained text-to-video Diffusion Models (VDMs). However, previous methods rely either solely on pixel-based VDMs, which incur high computational costs, or on latent-based VDMs, which often struggle with precise text-video alignment. In this paper, we are the first to propose a hybrid model, dubbed Show-1, which marries pixel-based and latent-based VDMs for text-to-video generation. Our model first uses pixel-based VDMs to produce a low-resolution video with strong text-video correlation. We then propose a novel expert translation method that employs latent-based VDMs to upsample the low-resolution video to high resolution, which also removes potential artifacts and corruptions from the low-resolution output. Compared to latent VDMs, Show-1 produces high-quality videos with precise text-video alignment; compared to pixel VDMs, Show-1 is much more efficient (GPU memory usage during inference is 15 GB vs. 72 GB). Furthermore, Show-1 can be readily adapted for motion customization and video stylization through simple finetuning of the temporal attention layers. Our model achieves state-of-the-art performance on standard video generation benchmarks. The code for Show-1 is publicly available, along with additional video results.
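The two-stage design described above can be sketched in outline. This is a minimal, hypothetical illustration of the data flow only: the function names (`pixel_vdm_lowres`, `latent_vdm_upsample`, `show1_pipeline`), the frame count, and the resolutions are assumptions for demonstration, and the stage bodies are numpy stand-ins rather than real diffusion models.

```python
import numpy as np

def pixel_vdm_lowres(prompt: str, num_frames: int = 8, size: int = 64) -> np.ndarray:
    # Stand-in for stage 1, the pixel-based VDM: generate a low-resolution
    # video with strong text-video alignment. Here we just return noise of
    # the right shape (frames, height, width, channels).
    rng = np.random.default_rng(0)
    return rng.standard_normal((num_frames, size, size, 3))

def latent_vdm_upsample(lowres_video: np.ndarray, scale: int = 4) -> np.ndarray:
    # Stand-in for stage 2, the latent-based "expert translation": upsample
    # the low-resolution video to high resolution. A real implementation
    # would run latent diffusion conditioned on the low-res frames; here we
    # use nearest-neighbor repetition as a placeholder.
    return lowres_video.repeat(scale, axis=1).repeat(scale, axis=2)

def show1_pipeline(prompt: str) -> np.ndarray:
    low = pixel_vdm_lowres(prompt)             # stage 1: pixel VDM, 64x64
    high = latent_vdm_upsample(low, scale=4)   # stage 2: latent VDM, 256x256
    return high

video = show1_pipeline("a panda playing guitar")
print(video.shape)  # (8, 256, 256, 3)
```

The key point the sketch captures is the division of labor: the expensive, alignment-faithful pixel diffusion runs only at low resolution, and the cheaper latent diffusion handles the resolution increase.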


Updated: 2024-10-25