Convergence Diagnostics for Entity Resolution
Annual Review of Statistics and Its Application (IF 7.4), Pub Date: 2024-04-24, DOI: 10.1146/annurev-statistics-040522-114848
Serge Aleshin-Guendel, Rebecca C. Steorts
Entity resolution is the process of merging and removing duplicate records from multiple data sources, often in the absence of unique identifiers. Bayesian models for entity resolution allow one to include a priori information, quantify uncertainty in important applications, and directly estimate a partition of the records. Markov chain Monte Carlo (MCMC) sampling is the primary computational method for approximate posterior inference in this setting, but due to the high dimensionality of the space of partitions, there are no agreed-upon standards for diagnosing nonconvergence of MCMC sampling. In this article, we review Bayesian entity resolution, with a focus on the specific challenges that it poses for the convergence of a Markov chain. We review prior methods for convergence diagnostics, discussing their weaknesses. We provide recommendations for using MCMC sampling for Bayesian entity resolution, focusing on the use of modern diagnostics that are commonplace in applied Bayesian statistics. Using simulated data, we find that a commonly used Gibbs sampler performs poorly compared with two alternatives.
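The abstract does not name the specific diagnostics it recommends. As an illustration only, the sketch below computes split-R-hat, one diagnostic that is widely used in applied Bayesian statistics, on a scalar summary of each sampled partition. The choice of summary (the number of clusters), the function name `split_rhat`, and the simulated traces are all assumptions made for this example and are not taken from the article. The sketch also omits rank normalization and effective sample size, which usually accompany this statistic in practice.

```python
import random
import statistics


def split_rhat(chains):
    """Split-R-hat for a scalar summary; `chains` is a list of equal-length lists."""
    # Split every chain in half so that drift within a chain also inflates the statistic.
    half = len(chains[0]) // 2
    halves = []
    for chain in chains:
        halves.append(chain[:half])
        halves.append(chain[half:2 * half])
    n = half
    means = [statistics.fmean(h) for h in halves]
    within = statistics.fmean([statistics.variance(h) for h in halves])  # W
    between = n * statistics.variance(means)  # B
    # Pooled variance estimate, compared against the within-chain variance.
    var_plus = (n - 1) / n * within + between / n
    return (var_plus / within) ** 0.5


# Hypothetical traces of "number of clusters" per iteration (simulated, not real output).
rng = random.Random(0)
mixed = [[500 + rng.gauss(0, 5) for _ in range(1000)] for _ in range(4)]
stuck = [[level + rng.gauss(0, 5) for _ in range(1000)]
         for level in (480, 500, 520, 540)]

rhat_mixed = split_rhat(mixed)  # chains agree, so the value is close to 1
rhat_stuck = split_rhat(stuck)  # chains sit at different levels, so the value is well above 1
print(rhat_mixed, rhat_stuck)
```

A common rule of thumb treats values above roughly 1.01 as a sign of nonconvergence. A value near 1 for one scalar summary does not show that the chain has explored the full space of partitions, so several summaries are typically monitored.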
Updated: 2024-04-24