LEO: Generative Latent Image Animator for Human Video Synthesis

International Journal of Computer Vision (IF 11.6) · Pub Date: 2024-09-27 · DOI: 10.1007/s11263-024-02231-3

Yaohui Wang, Xin Ma, Xinyuan Chen, Cunjian Chen, Antitza Dantcheva, Bo Dai, Yu Qiao
Spatio-temporal coherency is a major challenge in synthesizing high-quality videos, particularly human videos that contain rich global and local deformations. To address this challenge, previous approaches have resorted to different features in the generation process to represent appearance and motion separately. However, in the absence of strict mechanisms guaranteeing such disentanglement, separating motion from appearance has remained difficult, resulting in spatial distortions and temporal jittering that break spatio-temporal coherency. Motivated by this, we propose LEO, a novel framework for human video synthesis that places emphasis on spatio-temporal coherency. Our key idea is to represent motion as a sequence of flow maps in the generation process, which inherently isolates motion from appearance. We implement this idea via a flow-based image animator and a Latent Motion Diffusion Model (LMDM). The former bridges the space of motion codes with the space of flow maps and synthesizes video frames in a warp-and-inpaint manner; the latter learns to capture the motion prior in the training data by synthesizing sequences of motion codes. Extensive quantitative and qualitative analysis suggests that LEO significantly improves coherent synthesis of human videos over previous methods on the TaichiHD, FaceForensics, and CelebV-HQ datasets. In addition, the effective disentanglement of appearance and motion in LEO enables two additional tasks: infinite-length human video synthesis and content-preserving video editing. Project page: https://wyhsirius.github.io/LEO-project/.
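To make the two-stage design concrete, the sketch below illustrates the generation flow the abstract describes: an LMDM samples a sequence of motion codes, and a flow-based animator turns each code into a flow map plus inpainted content, rendering frames by warp-and-inpaint from a single appearance frame. This is a minimal illustration assuming hypothetical interfaces; the function names, tensor shapes, and the `lmdm.sample` / `animator` signatures are our own stand-ins, not the authors' code.

```python
# A minimal PyTorch sketch of the two-stage pipeline described above.
# All module interfaces (lmdm.sample, the animator's outputs, shapes)
# are hypothetical illustrations, not the authors' actual API.
import torch
import torch.nn.functional as F


def warp(frame, flow):
    """Backward-warp a frame (B, C, H, W) with a flow map (B, H, W, 2)
    expressed as offsets in normalized [-1, 1] grid coordinates."""
    b, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                            torch.linspace(-1, 1, w), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
    return F.grid_sample(frame, grid + flow, align_corners=True)


@torch.no_grad()
def synthesize_video(lmdm, animator, start_frame, num_frames):
    """Stage 1: the LMDM samples a sequence of motion codes (the learned
    motion prior). Stage 2: the animator decodes each code into a flow
    map, an occlusion mask, and inpainted content; frames are rendered by
    warp-and-inpaint, so appearance comes only from the start frame."""
    codes = lmdm.sample(num_frames)                  # (T, code_dim)
    frames = [start_frame]
    for t in range(num_frames):
        flow, mask, inpainted = animator(start_frame, codes[t])
        warped = warp(start_frame, flow)             # move visible pixels
        frames.append(mask * warped + (1 - mask) * inpainted)
    return torch.stack(frames, dim=1)                # (B, T+1, C, H, W)


# Stand-ins so the sketch runs end to end; a real LMDM would run reverse
# diffusion, and a real animator would predict dense flow from the code.
class _StubLMDM:
    def sample(self, t, code_dim=128):
        return torch.randn(t, code_dim)


class _StubAnimator(torch.nn.Module):
    def forward(self, frame, code):
        b, _, h, w = frame.shape
        flow = torch.zeros(b, h, w, 2)               # identity motion
        mask = torch.ones(b, 1, h, w)                # nothing occluded
        return flow, mask, frame.clone()


video = synthesize_video(_StubLMDM(), _StubAnimator(),
                         torch.randn(1, 3, 64, 64), num_frames=16)
print(video.shape)  # torch.Size([1, 17, 3, 64, 64])
```

Note the structural property this buys: because every frame is rendered by warping one appearance frame, appearance cannot drift over time, and temporal coherency reduces to the smoothness of the sampled motion-code sequence.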