The Internet Never Forgets: A Four-Step Scraping Tutorial, Codebase, and Database for Longitudinal Organizational Website Data, Organizational Research Methods



Organizational Research Methods (IF 8.9), Pub Date: 2024-11-04, DOI: 10.1177/10944281241284941
Richard F.J. Haans, Marc J. Mertens

Websites represent a crucial avenue for organizations to reach customers, attract talent, and disseminate information to stakeholders. Despite their importance, strikingly little work in organization and management research has tapped into this source of longitudinal big data. In this paper, we highlight the unique nature and profound potential of longitudinal website data and present novel open-source codebases and databases that make these data accessible. Specifically, our codebase offers a general-purpose setup, building on four central steps to scrape historical websites using the Wayback Machine. Our open-access CompuCrawl database was built using this four-step approach. It contains websites of North American firms in the Compustat database between 1996 and 2020, covering 11,277 firms with 86,303 firm-year observations and 1,617,675 webpages. We describe the coverage of our database and illustrate its use by applying word-embedding models to reveal the evolving meaning of the concept of “sustainability” over time. Finally, we outline several avenues for future research enabled by our step-by-step longitudinal web scraping approach and our CompuCrawl database.
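The abstract does not enumerate the four steps, so the following is not the authors' codebase. It is only a minimal sketch of the general idea of locating one archived snapshot per year for a firm's domain, assuming the Wayback Machine's public CDX search endpoint and its JSON output, in which the first row is a header. The function names `build_cdx_query` and `snapshots_by_year` and the domain `example.com` are hypothetical, chosen for illustration.

```python
from urllib.parse import urlencode

# Public CDX search endpoint of the Wayback Machine.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(domain, start_year, end_year):
    """Build a CDX query URL asking for successful captures of `domain`."""
    params = {
        "url": domain,
        "from": str(start_year),
        "to": str(end_year),
        "output": "json",
        "filter": "statuscode:200",   # keep only HTTP 200 captures
        "collapse": "timestamp:4",    # collapse on the year digits of the timestamp
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def snapshots_by_year(cdx_rows):
    """Map year -> replay URL of the first capture listed for that year.

    `cdx_rows` is the parsed JSON response: a list of lists whose first
    row is the header (it includes "timestamp" and "original").
    """
    if not cdx_rows:
        return {}
    header, *rows = cdx_rows
    ts_i = header.index("timestamp")
    orig_i = header.index("original")
    result = {}
    for row in rows:
        timestamp = row[ts_i]
        year = int(timestamp[:4])
        # setdefault keeps the first capture seen for each year.
        result.setdefault(
            year, f"https://web.archive.org/web/{timestamp}/{row[orig_i]}"
        )
    return result


if __name__ == "__main__":
    # Illustrative, hand-written rows in the CDX JSON layout (not real data).
    sample = [
        ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
        ["com,example)/", "19961231235959", "http://example.com/", "text/html", "200", "ABC", "100"],
        ["com,example)/", "19970101000000", "http://example.com/", "text/html", "200", "DEF", "100"],
    ]
    print(build_cdx_query("example.com", 1996, 2020))
    for year, url in sorted(snapshots_by_year(sample).items()):
        print(year, url)
```

The sketch keeps network access out of the functions so that the parsing logic can be checked offline; fetching the query URL and downloading each replay URL would be separate steps.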


Updated: 2024-11-04
