ControlAR: Controllable Image Generation with Autoregressive Models
arXiv - CS - Computer Vision and Pattern Recognition. Pub Date: 2024-10-03, DOI: arXiv-2410.02705
Zongming Li, Tianheng Cheng, Shoufa Chen, Peize Sun, Haocheng Shen, Longjin Ran, Xiaoxin Chen, Wenyu Liu, Xinggang Wang

Autoregressive (AR) models have reformulated image generation as next-token prediction, demonstrating remarkable potential and emerging as strong competitors to diffusion models. However, control-to-image generation, akin to ControlNet, remains largely unexplored within AR models. A natural approach, inspired by advances in Large Language Models, is to tokenize control images into tokens and prefill them into the autoregressive model before decoding image tokens; however, this still falls short of ControlNet in generation quality and suffers from inefficiency. To this end, we introduce ControlAR, an efficient and effective framework for integrating spatial controls into autoregressive image generation models. First, we explore control encoding for AR models and propose a lightweight control encoder that transforms spatial inputs (e.g., Canny edges or depth maps) into control tokens. ControlAR then exploits conditional decoding to generate the next image token conditioned on a per-token fusion of control and image tokens, similar to positional encodings. Compared to prefilling tokens, conditional decoding significantly strengthens the control capability of AR models while maintaining their efficiency. Furthermore, the proposed ControlAR surprisingly empowers AR models with arbitrary-resolution image generation via conditional decoding and specific controls. Extensive experiments demonstrate the controllability of ControlAR for autoregressive control-to-image generation across diverse inputs, including edges, depths, and segmentation masks. Moreover, both quantitative and qualitative results indicate that ControlAR surpasses previous state-of-the-art controllable diffusion models, e.g., ControlNet++. Code, models, and a demo will soon be available at https://github.com/hustvl/ControlAR.
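The core idea above, fusing a control token with the image-token embedding at every decoding step rather than prefilling control tokens up front, can be sketched in a toy form. The sketch below is a minimal illustration, not the paper's implementation: the "control encoder" is a single random projection, the "AR model" is an embedding table plus an output projection, and all names (`encode_control`, `decode_step`, `W_ctrl`, etc.) are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB, DIM, SEQ = 16, 8, 4  # toy vocabulary size, embedding dim, sequence length

# Hypothetical lightweight control encoder: maps per-position spatial
# features of a control image (e.g. a Canny-edge patch) to control tokens.
W_ctrl = rng.normal(size=(DIM, DIM))

def encode_control(control_patches):
    # control_patches: (SEQ, DIM) flattened spatial features -> (SEQ, DIM) tokens
    return control_patches @ W_ctrl

# Stand-ins for the autoregressive backbone: token embedding + output head.
embed = rng.normal(size=(VOCAB, DIM))
W_out = rng.normal(size=(DIM, VOCAB))

def decode_step(prev_token_id, control_token):
    # Per-token fusion: add the control token to the image-token embedding,
    # analogous to adding a positional encoding, then predict the next token.
    h = embed[prev_token_id] + control_token
    logits = h @ W_out
    return int(np.argmax(logits))  # greedy choice for illustration

control_patches = rng.normal(size=(SEQ, DIM))
ctrl_tokens = encode_control(control_patches)

tokens = [0]  # assume a BOS-like start token with id 0
for t in range(SEQ):
    tokens.append(decode_step(tokens[-1], ctrl_tokens[t]))
print(tokens)
```

Because the control token for position `t` is consumed at step `t`, the decoded sequence length is tied to the number of control tokens rather than fixed at training time, which hints at why this scheme can accommodate arbitrary resolutions, unlike prefilling, where the control tokens only lengthen the prompt.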

Updated: 2024-10-04