Rethinking the Two-Stage Framework for Grounded Situation Recognition,arXiv - CS - Computer Vision and Pattern Recognition


arXiv - CS - Computer Vision and Pattern Recognition. Pub Date: 2021-12-10, DOI: arxiv-2112.05375
Meng Wei, Long Chen, Wei Ji, Xiaoyu Yue, Tat-Seng Chua

Grounded Situation Recognition (GSR), i.e., recognizing the salient activity (or verb) category in an image (e.g., buying) and detecting all corresponding semantic roles (e.g., agent and goods), is an essential step towards "human-like" event understanding. Since each verb is associated with a specific set of semantic roles, all existing GSR methods resort to a two-stage framework: predicting the verb in the first stage and detecting the semantic roles in the second stage. However, both stages have clear drawbacks: 1) The cross-entropy (XE) loss widely used for object recognition is insufficient for verb classification because daily activities exhibit large intra-class variation and high inter-class similarity. 2) All semantic roles are detected in an autoregressive manner, which fails to model the complex semantic relations between different roles. To this end, we propose SituFormer, a novel GSR model that consists of a Coarse-to-Fine Verb Model (CFVM) and a Transformer-based Noun Model (TNM). CFVM is a two-step verb prediction model: a coarse-grained model trained with XE loss first proposes a set of verb candidates, and a fine-grained model trained with triplet loss then re-ranks these candidates using enhanced verb features (not only separable but also discriminative). TNM is a transformer-based semantic role detection model that detects all roles in parallel. Owing to the global relation modeling ability and flexibility of the transformer decoder, TNM can fully explore the statistical dependencies among the roles. Extensive validation on the challenging SWiG benchmark shows that SituFormer achieves new state-of-the-art performance, with significant gains under various metrics. Code is available at https://github.com/kellyiss/SituFormer.
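
The coarse-to-fine idea described for CFVM can be illustrated with a minimal Python sketch. This is not the SituFormer implementation: the function names, the toy embeddings, and the choice to re-rank candidates by squared Euclidean distance to a per-verb reference embedding are all assumptions made for illustration. The abstract only states that a coarse model proposes verb candidates and a fine model trained with triplet loss re-ranks them.

```python
# Illustrative sketch only; names and the distance-based re-ranking are assumptions,
# not taken from the SituFormer code base.

def top_k_candidates(logits, k):
    """Coarse step: indices of the k highest-scoring verbs (e.g. from an XE-trained classifier)."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss: pull the positive closer than the negative by `margin`."""
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(0.0, gap + margin)

def rerank(image_embedding, candidates, verb_embeddings):
    """Fine step (assumed form): order candidates by distance to the image embedding."""
    return sorted(
        candidates,
        key=lambda v: squared_distance(image_embedding, verb_embeddings[v]),
    )

# Toy data: four verbs with hypothetical 2-D reference embeddings.
verbs = ["buying", "selling", "running", "sleeping"]
verb_embeddings = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

coarse_logits = [2.0, 1.9, 0.1, -1.0]   # coarse model slightly prefers "buying"
image_embedding = [0.2, 0.9]            # fine-grained feature lies closer to "selling"

candidates = top_k_candidates(coarse_logits, k=2)
reranked = rerank(image_embedding, candidates, verb_embeddings)
print([verbs[i] for i in candidates])   # coarse order
print([verbs[i] for i in reranked])     # order after fine-grained re-ranking
```

In this toy case the coarse scores for "buying" and "selling" are nearly tied, which mirrors the high inter-class similarity the abstract mentions, and the metric-learned embedding is what separates them. A real system would learn the embeddings with the triplet loss rather than fix them by hand.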


Updated: 2021-12-13
