LoViT: Long Video Transformer for surgical phase recognition,Medical Image Analysis


LoViT: Long Video Transformer for surgical phase recognition
Medical Image Analysis (IF 10.7), Pub Date: 2024-10-05, DOI: 10.1016/j.media.2024.103366
Yang Liu, Maxence Boels, Luis C. Garcia-Peraza-Herrera, Tom Vercauteren, Prokar Dasgupta, Alejandro Granados, Sébastien Ourselin

Online surgical phase recognition plays a significant role in building contextual tools that could quantify performance and oversee the execution of surgical workflows. Current approaches are limited in two ways. First, they train spatial feature extractors with frame-level supervision, which can lead to incorrect predictions because similar frames appear in different phases. Second, they fuse local and global features poorly because of computational constraints, which can affect the analysis of the long videos commonly encountered in surgical interventions. In this paper, we present a two-stage method, called Long Video Transformer (LoViT), emphasizing the development of a temporally-rich spatial feature extractor and a phase transition map. The temporally-rich spatial feature extractor is designed to capture critical temporal information within the surgical video frames. The phase transition map provides essential insights into the dynamic transitions between different surgical phases. LoViT combines these innovations with a multi-scale temporal aggregator consisting of two cascaded L-Trans modules based on self-attention, followed by a G-Informer module based on ProbSparse self-attention for processing global temporal information. The multi-scale temporal head then leverages the temporally-rich spatial features and phase transition map to classify surgical phases using phase transition-aware supervision. Our approach consistently outperforms state-of-the-art methods on the Cholec80 and AutoLaparo datasets. Compared to Trans-SVNet, LoViT achieves a 2.4 pp (percentage point) improvement in video-level accuracy on Cholec80 and a 3.1 pp improvement on AutoLaparo. These results demonstrate that our approach achieves state-of-the-art surgical phase recognition on two datasets with different surgical procedures and temporal sequencing characteristics. The project page is available at https://github.com/MRUIL/LoViT.
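
The reported gains are stated in percentage points of video-level accuracy. The abstract does not define the metric, so the following is only a minimal sketch under an assumed definition: accuracy is computed per video as the fraction of correctly labeled frames, then averaged over videos. The function names and the toy labels are hypothetical and are not taken from the paper or the LoViT repository.

```python
from typing import List, Sequence, Tuple

Video = Tuple[Sequence[int], Sequence[int]]  # (predicted phases, true phases)


def frame_accuracy(pred: Sequence[int], true: Sequence[int]) -> float:
    """Fraction of frames in one video whose predicted phase matches the label."""
    if len(pred) != len(true) or len(true) == 0:
        raise ValueError("pred and true must be non-empty and equally long")
    return sum(p == t for p, t in zip(pred, true)) / len(true)


def video_level_accuracy(videos: List[Video]) -> float:
    """Assumed definition: mean of per-video accuracies, in percent.

    Every video counts equally, regardless of how many frames it has.
    """
    accs = [frame_accuracy(pred, true) for pred, true in videos]
    return 100.0 * sum(accs) / len(accs)


def pooled_frame_accuracy(videos: List[Video]) -> float:
    """For contrast: accuracy over all frames pooled together, in percent."""
    correct = sum(p == t for pred, true in videos for p, t in zip(pred, true))
    total = sum(len(true) for _, true in videos)
    return 100.0 * correct / total


def pp_gain(new_pct: float, baseline_pct: float) -> float:
    """Improvement in percentage points: a plain difference of two percentages."""
    return new_pct - baseline_pct


# Toy labels for illustration only; these are NOT results from the paper.
toy_videos: List[Video] = [
    ([0, 0, 1, 1], [0, 0, 1, 2]),                          # 3 of 4 frames correct
    ([0, 1, 1, 2, 2, 2, 2, 2], [0, 1, 1, 2, 2, 2, 2, 2]),  # 8 of 8 frames correct
]

if __name__ == "__main__":
    print(video_level_accuracy(toy_videos))   # per-video mean
    print(pooled_frame_accuracy(toy_videos))  # long videos weigh more here
    print(pp_gain(87.5, 85.0))                # difference in percentage points
```

Under this assumed definition a short video influences the score as much as a long one, which is why the per-video mean and the pooled frame accuracy differ on the toy data. A gain in percentage points is an absolute difference between two percentages, not a relative change.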


Updated: 2024-10-05
