Are Large-Scale Data From Private Companies Reliable? An Analysis of Machine-Generated Business Location Data in a Popular Dataset
Social Science Computer Review (IF 3.0), Pub Date: 2024-04-15, DOI: 10.1177/08944393241245390
Nikolitsa Grigoropoulou, Mario L. Small
Large-scale data from private companies offer new opportunities to examine topics of scientific and social significance, such as racial inequality, partisan polarization, and activity-based segregation. However, because such data are often generated through automated processes, their accuracy and reliability for social science research remain unclear. The present study examines how quality issues in large-scale data from private companies can afflict the reporting of even ostensibly uncomplicated values. We assess the reliability with which an often-used device tracking data source, SafeGraph, sorted data it acquired on financial institutions into categories, such as banks and payday lenders, based on a standard classification system. We find major classification problems that vary by type of institution, and remarkably high rates of unidentified closures and duplicate records. We suggest that classification problems can affect research based on large-scale private data in four ways: detection, efficiency, validity, and bias. We discuss the implications of our findings, and list a set of problems researchers should consider when using large-scale data from companies.
Updated: 2024-04-15