Video Description: A Comprehensive Survey of Deep Learning Approaches

Artificial Intelligence Review (IF 10.7) · Pub Date: 2023-04-11 · DOI: 10.1007/s10462-023-10414-6 · Ghazala Rafiq, Muhammad Rafiq, Gyu Sang Choi
Video description refers to understanding visual content and transforming that understanding into automatic textual narration. It bridges the key AI fields of computer vision and natural language processing, with real-time and practical applications. Deep learning-based approaches to video description have demonstrated improved results over conventional approaches. The current literature lacks a thorough interpretation of the recently developed sequence-to-sequence techniques employed for video description. This paper fills that gap by focusing mainly on deep learning-enabled approaches to automatic caption generation. Sequence-to-sequence models follow an Encoder–Decoder architecture that employs a specific composition of CNNs, RNNs, or their variants LSTM or GRU as the encoder and decoder blocks. This standard architecture can be fused with an attention mechanism that focuses on salient visual features, achieving high-quality results. Reinforcement learning employed within the Encoder–Decoder structure can progressively deliver state-of-the-art captions by balancing exploration and exploitation strategies. The transformer is a modern and efficient transductive architecture for robust output. Free from recurrence and based solely on self-attention, it permits parallelization and training on massive amounts of data, fully utilizing the available GPUs for most NLP tasks. With the recent emergence of several transformer variants, handling long-term dependencies is no longer an obstacle for researchers engaged in video processing for summarization and description, or for autonomous-vehicle, surveillance, and instructional purposes; they can draw promising directions from this survey.
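As a concrete illustration of the CNN-plus-RNN Encoder–Decoder pattern surveyed above, the following is a minimal PyTorch sketch of a sequence-to-sequence video captioner. All dimensions, names, and the teacher-forced forward pass are illustrative assumptions rather than the design of any specific surveyed model; per-frame CNN features are assumed to be precomputed.

```python
# Minimal sketch of the sequence-to-sequence (CNN features + LSTM
# Encoder-Decoder) video-captioning pattern. Sizes, names, and the toy
# forward pass are illustrative assumptions only.
import torch
import torch.nn as nn

class VideoCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=10000):
        super().__init__()
        # Per-frame features (e.g., from a pretrained 2D CNN) are assumed
        # precomputed; the encoder summarizes them across time.
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.decoder = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frame_feats, captions):
        # frame_feats: (batch, num_frames, feat_dim)
        # captions:    (batch, seq_len) token ids, used for teacher forcing
        _, (h, c) = self.encoder(frame_feats)   # video summary state
        emb = self.embed(captions)              # (batch, seq_len, hidden)
        dec_out, _ = self.decoder(emb, (h, c))  # decode conditioned on video
        return self.out(dec_out)                # per-step token logits

# Toy usage with random tensors standing in for real features and tokens.
model = VideoCaptioner()
feats = torch.randn(2, 16, 2048)          # 2 clips, 16 frames each
caps = torch.randint(0, 10000, (2, 12))   # 2 captions, 12 tokens each
logits = model(feats, caps)               # shape: (2, 12, 10000)
```

At inference time, teacher forcing would be replaced by greedy or beam search over the decoder's step-wise outputs.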
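The parallelism argument for the transformer can likewise be made concrete. The short sketch below, again with assumed shapes and names, shows recurrence-free self-attention over a sequence of frame features: every time step attends to every other in a single parallel operation, which is what allows transformers to exploit GPUs and train on massive data.

```python
# Recurrence-free self-attention over frame features (illustrative
# shapes; not the configuration of any specific surveyed model).
import torch
import torch.nn as nn

frame_feats = torch.randn(2, 16, 512)    # (batch, frames, feature dim)
self_attn = nn.MultiheadAttention(embed_dim=512, num_heads=8,
                                  batch_first=True)
# Query = Key = Value = the frame sequence itself; there is no sequential
# loop, so all time steps are processed in parallel on the GPU.
ctx, weights = self_attn(frame_feats, frame_feats, frame_feats)
print(ctx.shape, weights.shape)  # (2, 16, 512) and (2, 16, 16)
```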