A data-driven framework for identifying patient subgroups on which an AI/machine learning model may underperform

npj Digital Medicine (IF 12.4). Pub Date: 2024-11-21. DOI: 10.1038/s41746-024-01275-6
Adarsh Subbaswamy, Berkman Sahiner, Nicholas Petrick, Vinay Pai, Roy Adams, Matthew C. Diamond, Suchi Saria
A fundamental goal of evaluating the performance of a clinical model is to ensure it performs well across a diverse intended patient population. A primary challenge is that the data used in model development and testing often consist of many overlapping, heterogeneous patient subgroups that may not be explicitly defined or labeled. While a model’s average performance on a dataset may be high, the model can have significantly lower performance for certain subgroups, which may be hard to detect. We describe an algorithmic framework for identifying subgroups with potential performance disparities (AFISP), which produces a set of interpretable phenotypes corresponding to subgroups for which the model’s performance may be relatively lower. This could allow model evaluators, including developers and users, to identify possible failure modes prior to wide-scale deployment. We illustrate AFISP by applying it to a patient deterioration model, detecting significant subgroup performance disparities, and show that AFISP is significantly more scalable than existing algorithmic approaches.
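The abstract does not spell out AFISP's procedure, but the problem it motivates, that a model's average performance can mask markedly worse performance on particular subgroups, can be illustrated with a simple audit. The sketch below is not AFISP: it brute-forces over subgroups that are already labeled by patient attributes, whereas AFISP targets the harder case where subgroups are not explicitly defined or labeled. All column names, attribute choices, and thresholds here are hypothetical, and scikit-learn's roc_auc_score is assumed as the performance metric.

```python
# Minimal illustrative sketch (not the AFISP algorithm): audit a fitted clinical
# risk model's held-out predictions across candidate patient subgroups and flag
# those whose AUROC trails the overall AUROC by more than a chosen margin.
import pandas as pd
from sklearn.metrics import roc_auc_score

def flag_underperforming_subgroups(df, score_col, label_col, group_cols,
                                   min_size=50, auroc_gap=0.05):
    """Return subgroups whose AUROC falls more than `auroc_gap` below the overall AUROC."""
    overall = roc_auc_score(df[label_col], df[score_col])
    rows = []
    for col in group_cols:
        for value, sub in df.groupby(col):
            # Skip tiny or single-class subgroups where AUROC is undefined or unstable.
            if len(sub) < min_size or sub[label_col].nunique() < 2:
                continue
            auroc = roc_auc_score(sub[label_col], sub[score_col])
            if overall - auroc > auroc_gap:
                rows.append({"phenotype": f"{col} == {value!r}",
                             "n": len(sub),
                             "auroc": round(auroc, 3),
                             "gap_vs_overall": round(overall - auroc, 3)})
    columns = ["phenotype", "n", "auroc", "gap_vs_overall"]
    return pd.DataFrame(rows, columns=columns).sort_values("gap_vs_overall",
                                                           ascending=False)

# Hypothetical usage: `preds` holds the model's risk scores, observed outcomes,
# and labeled patient attributes for an evaluation set.
# preds = pd.DataFrame({"score": ..., "deteriorated": ..., "age_band": ..., "admission_type": ...})
# report = flag_underperforming_subgroups(preds, "score", "deteriorated",
#                                         group_cols=["age_band", "admission_type"])
```

A per-attribute audit like this only works when the relevant subgroups are known and labeled in advance; the contribution described in the abstract is precisely to discover interpretable, potentially overlapping phenotypes without that prior labeling, and to do so more scalably than existing approaches.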