The Effects of the Training Sample Size, Ground Truth Reliability, and NLP Method on Language-Based Automatic Interview Scores’ Psychometric Properties
Organizational Research Methods (IF 8.9), Pub Date: 2024-07-25, DOI: 10.1177/10944281241264027
Louis Hickman 1, Josh Liff 2, Caleb Rottman 2, Charles Calderwood 1

While machine learning (ML) can validly score psychological constructs from behavior, several conditions often change across studies, making it difficult to understand why the psychometric properties of ML models differ across studies. We address this gap in the context of automatically scored interviews. Across multiple datasets, for interview- or question-level scoring of self-reported, tested, and interviewer-rated constructs, we manipulate the training sample size and natural language processing (NLP) method while observing differences in ground truth reliability. We examine how these factors influence the ML model scores’ test–retest reliability and convergence, and we develop multilevel models for estimating the convergent-related validity of ML model scores in similar interviews. When the ground truth is interviewer ratings, hundreds of observations are adequate for research purposes, while larger samples are recommended for practitioners to support generalizability across populations and time. However, self-reports and tested constructs require larger training samples. Particularly when the ground truth is interviewer ratings, NLP embedding methods improve upon count-based methods. Given mixed findings regarding ground truth reliability, we discuss future research possibilities on factors that affect supervised ML models’ psychometric properties.
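The study's central NLP manipulation contrasts count-based text representations with embedding-based ones feeding a supervised scoring model. Below is a minimal sketch of those two conditions, assuming scikit-learn and sentence-transformers are available; the transcripts, ratings, encoder choice, and ridge regressor are illustrative stand-ins, not the authors' pipeline.

```python
# A minimal sketch of the two feature-extraction conditions, assuming
# scikit-learn and sentence-transformers. All data and model choices here
# are hypothetical placeholders, not the study's actual pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sentence_transformers import SentenceTransformer

# Hypothetical training data: interview-response transcripts paired with
# interviewer ratings serving as the ground truth.
responses = [
    "I led a team of five through a difficult product launch...",
    "I prefer working independently on well-scoped tasks...",
]
ratings = [4.5, 2.0]

# Count-based condition: sparse TF-IDF term frequencies.
tfidf = TfidfVectorizer()
X_counts = tfidf.fit_transform(responses)

# Embedding condition: dense sentence embeddings from a pretrained encoder
# (model name is an assumed, commonly used choice).
encoder = SentenceTransformer("all-MiniLM-L6-v2")
X_embed = encoder.encode(responses)

# Supervised scoring model trained separately on each representation.
model_counts = Ridge().fit(X_counts, ratings)
model_embed = Ridge().fit(X_embed, ratings)
```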

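The abstract also mentions multilevel models for estimating the convergent validity of ML scores in similar interviews. A hedged sketch of one such model follows, assuming statsmodels; the variable names (r_convergent, n_train, method, dataset) and all values are hypothetical placeholders, not the study's data or specification.

```python
# A hedged sketch of a random-intercept multilevel model, assuming
# statsmodels. Columns and values are hypothetical, not the study's data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical convergence estimates: one row per trained ML model,
# nested within the dataset it was trained on.
df = pd.DataFrame({
    "dataset":      ["A"] * 4 + ["B"] * 4 + ["C"] * 4,
    "n_train":      [200, 200, 800, 800] * 3,
    "method":       ["count", "embed"] * 6,
    "r_convergent": [0.30, 0.41, 0.38, 0.52,
                     0.25, 0.35, 0.33, 0.47,
                     0.28, 0.39, 0.36, 0.50],
})

# Random intercepts for dataset absorb between-sample differences (e.g.,
# in ground truth reliability); fixed effects estimate the training sample
# size and NLP method effects on convergent validity.
mlm = smf.mixedlm("r_convergent ~ np.log(n_train) + method",
                  data=df, groups=df["dataset"]).fit()
print(mlm.summary())
```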
Updated: 2024-07-25