Can Large Language Models Assess Personality From Asynchronous Video Interviews? A Comprehensive Evaluation of Validity, Reliability, Fairness, and Rating Patterns
IEEE Transactions on Affective Computing (IF 9.6), Pub Date: 2024-03-08, DOI: 10.1109/taffc.2024.3374875
Tianyi Zhang, Antonis Koutsoumpis, Janneke K. Oostrom, Djurre Holtrop, Sina Ghassemi, Reinout E. de Vries

The advent of Artificial Intelligence (AI) technologies has precipitated the rise of asynchronous video interviews (AVIs) as an alternative to conventional job interviews. These one-way video interviews are conducted online and can be analyzed with AI algorithms to automate and speed up the selection procedure. In particular, the rapid advancement of Large Language Models (LLMs) has significantly lowered the cost of and technical barriers to developing AI systems for automatic personality and interview performance evaluation. However, the generative and task-unspecific nature of LLMs may pose risks and introduce biases when they are used to evaluate humans based on their AVI responses. In this study, we conducted a comprehensive evaluation of the validity, reliability, fairness, and rating patterns of two widely used LLMs, GPT-3.5 and GPT-4, in assessing personality and interview performance from an AVI. We compared the personality and interview performance ratings of the LLMs with those of a task-specific AI model and human annotators, using simulated AVI responses from 685 participants. The results show that the LLMs can achieve similar or even better zero-shot validity than the task-specific AI model when predicting personality traits, and that the verbal explanations the LLMs generate for their personality ratings can be interpreted in terms of personality items designed according to psychological theories. However, the LLMs also showed uneven performance across traits, insufficient test-retest reliability, and certain biases. Caution is therefore warranted when applying LLMs in human-related application scenarios, especially for high-stakes decisions such as employment.
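The abstract does not reproduce the authors' prompts or evaluation pipeline. As a rough illustration of how zero-shot personality rating of an AVI transcript with an LLM might be set up, the following minimal Python sketch calls the OpenAI chat API; the prompt wording, trait label, rating scale, and function names are illustrative assumptions, not the authors' protocol.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical zero-shot prompt; the paper's actual instructions are not given in the abstract.
PROMPT_TEMPLATE = (
    "You are rating a job applicant's personality from their interview answer.\n"
    "Rate the applicant on {trait} using a 1-5 scale (1 = very low, 5 = very high).\n"
    "Reply with a single number followed by a one-sentence justification.\n\n"
    "Interview question: {question}\n"
    "Applicant's answer: {answer}\n"
)

def rate_trait(question: str, answer: str, trait: str, model: str = "gpt-4") -> str:
    """Ask the LLM for a zero-shot rating of one personality trait from an AVI answer."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # reduce run-to-run variation; the paper reports test-retest issues
        messages=[{
            "role": "user",
            "content": PROMPT_TEMPLATE.format(trait=trait, question=question, answer=answer),
        }],
    )
    return response.choices[0].message.content

# Example call with placeholder transcript text:
# print(rate_trait("Tell me about a time you led a team.", "<applicant transcript>", "Extraversion"))
```

Under this kind of setup, validity could be estimated by correlating the LLM's numeric ratings with human annotator ratings (e.g., Pearson's r), and test-retest reliability by repeating identical calls and correlating the two sets of ratings.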

Updated: 2024-03-08