Introduction to special issues on historical record linking
Historical Methods: A Journal of Quantitative and Interdisciplinary History (IF 1.6). Pub Date: 2020-04-02. DOI: 10.1080/01615440.2020.1707445
Kenneth M. Sylvester, J. David Hacker

Historical record linkage has responded to two large opportunities in recent years. The growth of computational power and the emergence of full-count historical census data are both revolutionizing the analysis of historical population change. The increased availability of full-count census data has expanded the comparative terrain for addressing multigenerational or cross-population change. The exponential increase in the resolution of analysis invites scholars to revisit many assumptions about populations of interest, sample weighting, validation or ground-truthing, and measurement. As Ruggles, Fitch, and Roberts (2018) suggest, the systematic effort to link repeated observations for social and economic research reaches back to work in the 1930s. But in the last two decades, the focus has shifted onto a larger geographic stage, moving from intensive studies of local and regional settings to national and international studies of migration, mobility, and population change. The projects in this special issue, and the one that preceded it in volume 51(4) in 2018, are representative of this combination of computational power and geographic reach. As the authors argue, the richness of full-count data allows for comparative and rigorously validated matches between historical individuals. But there is still great uncertainty: there are false matches, and there are individuals who go missing over time. In "Simple strategies for improving inference with linked data: a case study of the 1850–1930 IPUMS linked representative historical samples," Bailey, Cole, and Massey argue for closer attention to the systematic bias introduced by machine linking algorithms when working with longitudinal or intergenerational data.
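The two error types at stake here can be made concrete with a toy evaluation of a linking algorithm against a hand-linked validation set. A minimal sketch in Python; the record identifiers and links are invented for illustration, not drawn from any of the papers:

```python
# Hand-linked "ground truth": record id in census A -> record id in census B
# (None means the hand-linkers found no true match for that record).
truth = {"a1": "b1", "a2": "b2", "a3": "b3", "a4": None}

# Links proposed by a hypothetical automated algorithm (a3 left unlinked).
proposed = {"a1": "b1", "a2": "b9", "a4": "b7"}

# Type I errors (false matches): the algorithm linked a record, but to the
# wrong person, or to someone when no true match exists.
false_matches = sum(1 for a, b in proposed.items() if truth.get(a) != b)

# Type II errors (missed matches): a true match exists in the validation
# data but the algorithm failed to recover it.
missed_matches = sum(
    1 for a, b in truth.items() if b is not None and proposed.get(a) != b
)

print(false_matches, missed_matches)
```

Because both error types are typically nonrandom (common names, transcription quality, and mobility all shape who gets linked), raw counts like these are the starting point for the bias adjustments discussed next.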
They recommend that researchers adjust for nonrandom false matches (Type I errors) and missed matches (Type II errors) by incorporating validation variables into linking inference methods and by employing a regression-based weighting procedure to customize research samples. Both approaches are illustrated in relation to the 1850–1930 Integrated Public Use Microdata Series Linked Representative Samples (IPUMS-IRS). Custom weights are developed against a hand-linked training data set in order to document the performance of the linking algorithm. Validation variables are used to reduce the share of low-quality links in a sample (conditioning on information such as the commonness of a last name or disagreement about birthplace over time). This smaller and less biased sample is then evaluated for its representativeness of the reference population. A simple linear regression and a heteroscedasticity-robust Wald test of joint significance are used to test the null hypothesis of no relationship between the covariates and the likelihood that an observation is linked.

Abramitzky, Mill, and Perez also argue for linking methods that customize large historical data sets to arrive at longitudinal samples that represent the populations of interest as closely as possible. In their paper "Linking individuals across historical sources: A fully automated approach," they advocate that researchers move toward methods that are highly replicable, and they provide Stata-based code for reproducing their linking algorithm. Rather than relying on any kind of ground-truthing for validation, as was done in the original IPUMS-IRS (Goeken et al. 2011) and in Bailey, Cole, and Massey (2019), Abramitzky et al. argue for matching based on probability scores derived from the Expectation-Maximization (EM) algorithm.
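The regression-based weighting idea in Bailey, Cole, and Massey's procedure can be sketched with simulated data: regress an indicator for "record was linked" on covariates observed for the whole reference population, then weight linked records by the inverse of their predicted link propensity. Everything below is an invented toy example, not the authors' code; the covariates, the linear probability model, and the 0.05 propensity floor are simplifying assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
# Covariates observed for everyone in the reference population.
age = rng.uniform(0, 60, n)
common_name = rng.integers(0, 2, n)  # 1 = very common surname (hypothetical)

# By construction, common-surname records are harder to link, so the
# linked subsample under-represents them -- a nonrandom Type II error.
true_link_rate = 0.6 - 0.3 * common_name
linked = rng.random(n) < true_link_rate

# Linear probability model: linked ~ 1 + age + common_name.
X = np.column_stack([np.ones(n), age, common_name])
beta, *_ = np.linalg.lstsq(X, linked.astype(float), rcond=None)
propensity = np.clip(X @ beta, 0.05, 1.0)  # floor guards against tiny fits

# Inverse-propensity weights for the linked subsample.
weights = 1.0 / propensity[linked]

# Weighted, the linked sample should again resemble the full population.
raw_share = common_name[linked].mean()
weighted_share = np.average(common_name[linked], weights=weights)
print(round(raw_share, 3), round(weighted_share, 3), round(common_name.mean(), 3))
```

The unweighted share of common surnames among linked records is biased downward, while the weighted share lands close to the population share, which is the sense in which the weights "customize" the sample to the reference population.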
After blocking on subsets of the larger populations and using Jaro-Winkler distances to compare name strings, Abramitzky et al. leverage the iterative nature of the EM algorithm to reach a local maximum of a likelihood function describing the probability of a match. Once match probabilities are estimated, researchers can assess how well the linked records suit the research question and the reference populations. In comparing the representativeness of linked IPUMS-IRS samples to this automated approach,
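The Jaro-Winkler comparison named above is short enough to write out. A from-scratch sketch for illustration (in practice one would use a vetted library such as jellyfish, and in the automated pipeline these scores feed the EM step rather than a hard threshold):

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: characters matching within a sliding window,
    penalized by transpositions among the matched characters."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if not len1 or not len2:
        return 0.0
    window = max(len1, len2) // 2 - 1
    m1, m2 = [False] * len1, [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count transpositions: matched characters taken in order from each string.
    t = k = 0
    for i in range(len1):
        if m1[i]:
            while not m2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Boost the Jaro score for strings sharing a common prefix (max 4 chars),
    reflecting that historical name variants usually agree at the start."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # classic textbook pair
```

Scores like these, computed within blocks (e.g. same birth state and birth-year band), become the inputs from which the EM algorithm estimates the probability that a candidate pair is a true match.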
