Wisdom of the silicon crowd: LLM ensemble prediction capabilities rival human crowd accuracy
Science Advances (IF 11.7), Pub Date: 2024-11-08, DOI: 10.1126/sciadv.adp1528
Philipp Schoenegger, Indre Tuminauskaite, Peter S. Park, Rafael Valdece Sousa Bastos, Philip E. Tetlock
Human forecasting accuracy improves through the “wisdom of the crowd” effect, in which aggregated predictions tend to outperform individual ones. Past research suggests that individual large language models (LLMs) tend to underperform compared to human crowd aggregates. We simulate a wisdom of the crowd effect with LLMs. Specifically, we use an ensemble of 12 LLMs to make probabilistic predictions about 31 binary questions, comparing them with those made by 925 human forecasters in a 3-month tournament. We show that the LLM crowd outperforms a no-information benchmark and is statistically indistinguishable from the human crowd. We also observe human-like biases, such as the acquiescence bias. In another study, we find that LLM predictions (of GPT-4 and Claude 2) improve when exposed to the median human prediction, increasing accuracy by 17 to 28%. However, simply averaging human and machine forecasts yields more accurate results. Our findings suggest that LLM predictions can rival the human crowd’s forecasting accuracy through simple aggregation.
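The aggregation idea in the abstract can be illustrated with a minimal sketch. The abstract does not say how forecasts are aggregated or scored, so this sketch assumes the median as the crowd aggregate and the Brier score as the accuracy measure. It treats a constant 50% forecast as the no-information benchmark and a simple mean as the human-machine average. All function names and numbers below are hypothetical toy values, not data or results from the study.

```python
from statistics import median

def brier_score(forecast: float, outcome: int) -> float:
    """Squared error of a probability forecast for a binary outcome (lower is better)."""
    return (forecast - outcome) ** 2

def mean_brier(forecasts, outcomes) -> float:
    """Average Brier score over a set of questions."""
    return sum(brier_score(f, o) for f, o in zip(forecasts, outcomes)) / len(outcomes)

def crowd_forecast(forecasts) -> float:
    """Aggregate one question's forecasts; the median is an assumed choice here."""
    return median(forecasts)

def blend(human: float, machine: float) -> float:
    """Simple average of a human forecast and a machine forecast."""
    return (human + machine) / 2

# Hypothetical toy data: 3 binary questions, 4 model forecasts per question.
model_forecasts = [
    [0.70, 0.80, 0.60, 0.90],
    [0.20, 0.40, 0.10, 0.30],
    [0.55, 0.65, 0.45, 0.75],
]
outcomes = [1, 0, 1]  # hypothetical resolutions (1 = yes, 0 = no)

# One aggregated "silicon crowd" forecast per question.
crowd = [crowd_forecast(q) for q in model_forecasts]

# Assumed no-information benchmark: always predict 50%.
baseline = [0.5] * len(outcomes)

crowd_score = mean_brier(crowd, outcomes)
baseline_score = mean_brier(baseline, outcomes)

print(f"crowd Brier:    {crowd_score:.3f}")
print(f"baseline Brier: {baseline_score:.3f}")
```

In this toy setup the aggregated forecast beats the 50% baseline only because the made-up forecasts lean toward the made-up outcomes. The sketch shows the mechanics of aggregation and scoring, not the paper's findings.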
Updated: 2024-11-08