Rethinking the Two-Stage Framework for Grounded Situation Recognition
arXiv - CS - Computer Vision and Pattern Recognition. Pub Date: 2021-12-10, DOI: arxiv-2112.05375. Authors: Meng Wei, Long Chen, Wei Ji, Xiaoyu Yue, Tat-Seng Chua
Grounded Situation Recognition (GSR), i.e., recognizing the salient activity
(or verb) category in an image (e.g., buying) and detecting all corresponding
semantic roles (e.g., agent and goods), is an essential step towards
"human-like" event understanding. Since each verb is associated with a specific
set of semantic roles, all existing GSR methods resort to a two-stage
framework: predicting the verb in the first stage and detecting the semantic
roles in the second stage. However, there are obvious drawbacks in both stages:
1) The cross-entropy (XE) loss widely used for object recognition is
insufficient for verb classification due to the large intra-class variation and
high inter-class similarity among daily activities. 2) All semantic roles are
detected in an autoregressive manner, which fails to model the complex semantic
relations between different roles. To this end, we propose a novel SituFormer
for GSR which consists of a Coarse-to-Fine Verb Model (CFVM) and a
Transformer-based Noun Model (TNM). CFVM is a two-step verb prediction model: a
coarse-grained model trained with XE loss first proposes a set of verb
candidates, and then a fine-grained model trained with triplet loss re-ranks
these candidates with enhanced verb features (not only separable but also
discriminative). TNM is a transformer-based semantic role detection model,
which detects all roles in parallel. Owing to the global relation modeling
ability and flexibility of the transformer decoder, TNM can fully explore the
statistical dependency of the roles. Extensive validations on the challenging
SWiG benchmark show that SituFormer achieves a new state-of-the-art performance
with significant gains under various metrics. Code is available at
https://github.com/kellyiss/SituFormer.
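
The coarse-to-fine verb prediction described above can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that); it assumes the fine-grained step compares an image embedding with one embedding per verb by Euclidean distance. All function names, the toy embeddings, and the margin value are hypothetical.

```python
import numpy as np

def top_k_candidates(logits, k):
    """Coarse step: indices of the k highest-scoring verbs, best first."""
    return np.argsort(-logits)[:k]

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard margin-based triplet loss on Euclidean distances.

    The loss is zero once the anchor is closer to the positive than to
    the negative by at least `margin` (an assumed value, not from the paper).
    """
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

def rerank(candidates, image_embedding, verb_embeddings):
    """Fine step: reorder the candidates by distance to the image embedding."""
    dists = np.linalg.norm(verb_embeddings[candidates] - image_embedding, axis=1)
    return candidates[np.argsort(dists)]

# Toy data (illustrative only): the coarse classifier slightly prefers
# "selling", but the image embedding lies much closer to "buying".
verbs = ["buying", "selling", "paying", "running"]
logits = np.array([2.0, 2.5, 1.0, -1.0])
verb_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]])
image_embedding = np.array([0.9, 0.1])

candidates = top_k_candidates(logits, k=2)
reranked = rerank(candidates, image_embedding, verb_embeddings)
print("coarse order:", [verbs[i] for i in candidates])
print("re-ranked   :", [verbs[i] for i in reranked])
```

In this toy case the coarse step proposes "selling" and "buying", and the distance-based re-ranking moves "buying" to the top. During training, a triplet loss of the form shown would push the image embedding toward the correct verb and away from a confusable one.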
Updated: 2021-12-13