How data heterogeneity affects innovating knowledge and information in gene identification: A statistical learning perspective
Journal of Innovation & Knowledge (IF 15.6), Pub Date: 2024-07-31, DOI: 10.1016/j.jik.2024.100514
Jun Zhao, Fangyi Lao, Guan'ao Yan, Yi Zhang

Data heterogeneity, particularly noted in fields such as genetics, has been identified as a key feature of big data, posing significant challenges to innovation in knowledge and information. This paper focuses on characterizing and understanding the so-called "curse of heterogeneity" in gene identification for low infant birth weight from a statistical learning perspective. Owing to the computational and analytical advantages of expectile regression in handling heterogeneity, this paper proposes a flexible, regularized, partially linear additive expectile regression model for high-dimensional heterogeneous data. Unlike most existing works that assume Gaussian or sub-Gaussian error distributions, we adopt a more realistic, less stringent assumption that the errors have only finite moments. Additionally, we derive a two-step algorithm to address the reduced optimization problem and demonstrate that our method, with a probability approaching one, achieves optimal estimation accuracy. Furthermore, we demonstrate that the proposed algorithm converges at least linearly, ensuring the practical applicability of our method. Monte Carlo simulations reveal that our method's resulting estimator performs well in terms of estimation accuracy, model selection, and heterogeneity identification. Empirical analysis in gene trait expression further underscores the potential for guiding public health interventions.
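To make the role of expectile regression in capturing heterogeneity concrete, the following is a minimal sketch of plain linear expectile regression fitted by iteratively reweighted least squares. It is not the paper's regularized, partially linear additive model or its two-step algorithm. The function names `expectile_loss` and `fit_linear_expectile`, and the simulated data with heavy-tailed Student-t noise (chosen only to mimic errors with finite moments), are assumptions made for illustration.

```python
import numpy as np

def expectile_loss(residuals, tau):
    """Asymmetric squared loss: weight tau on positive residuals, 1 - tau otherwise."""
    weights = np.where(residuals > 0, tau, 1.0 - tau)
    return weights * residuals ** 2

def fit_linear_expectile(X, y, tau=0.5, max_iter=100, tol=1e-10):
    """Fit a linear tau-expectile by iteratively reweighted least squares (illustrative)."""
    X1 = np.column_stack([np.ones(len(y)), X])    # prepend an intercept column
    beta = np.linalg.lstsq(X1, y, rcond=None)[0]  # start from ordinary least squares
    for _ in range(max_iter):
        residuals = y - X1 @ beta
        w = np.where(residuals > 0, tau, 1.0 - tau)
        XtW = X1.T * w                            # scale each observation by its weight
        new_beta = np.linalg.solve(XtW @ X1, XtW @ y)
        converged = np.max(np.abs(new_beta - beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta

# Simulated heteroscedastic data (hypothetical): the noise scale grows with x,
# and the noise is Student-t with 5 degrees of freedom, so it is heavy-tailed
# but still has finite low-order moments.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 2.0, size=2000)
y = 1.0 + 2.0 * x + (0.5 + x) * rng.standard_t(5, size=2000)

for tau in (0.1, 0.5, 0.9):
    intercept, slope = fit_linear_expectile(x, y, tau)
    print(f"tau={tau}: intercept={intercept:.3f}, slope={slope:.3f}")
```

At `tau=0.5` the weights are all equal, so the fit coincides with ordinary least squares. When the fitted slope differs across expectile levels, as it should for this simulated data, the covariate affects the spread of the response and not only its mean, which is the kind of heterogeneity the abstract refers to.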

Updated: 2024-07-31