The use of gene expression datasets in feature selection research: 20 years of inherent bias?


WIREs Data Mining and Knowledge Discovery (IF 6.4), Pub Date: 2023-11-16, DOI: 10.1002/widm.1523
Bruno I. Grisci (1, 2), Bruno César Feltes (3), Joice de Faria Poloni (3), Pedro H. Narloch (1), Márcio Dorn (1, 4, 5)


Feature selection algorithms are frequently employed in preprocessing machine learning pipelines applied to biological data to identify relevant features. The use of feature selection in gene expression studies began at the end of the 1990s with the analysis of human cancer microarray datasets. Since then, gene expression technology has been perfected, the Human Genome Project has been completed, new microarray platforms have been created and discontinued, and RNA-seq has gradually replaced microarrays. However, most feature selection methods in the last two decades were designed, evaluated, and validated on the same datasets from the microarray technology's infancy. In this review of over 1200 publications regarding feature selection and gene expression, published between 2010 and 2020, we found that 57% of the publications used at least one outdated dataset, 23% used only outdated data, and 32% did not cite data sources. Other issues include referencing databases that are no longer available, the slow adoption of RNA-seq datasets, and bias toward human cancer data, even for methods designed for a broader scope. In the most popular datasets, some being 23 years old, mislabeled samples, experimental biases, distribution shifts, and the absence of classification challenges are common. These problems are more predominant in publications with computer science backgrounds compared to publications from biology and can lead to inaccurate and misleading biological results.


Updated: 2023-11-16
