Wisdom of the Silicon Crowd: LLM Ensemble Prediction Capabilities Match Human Crowd Accuracy, arXiv - CS - Computation and Language


arXiv - CS - Computation and Language. Pub Date: 2024-02-29, DOI: arxiv-2402.19379
Philipp Schoenegger, Indre Tuminauskaite, Peter S. Park, Philip E. Tetlock


Wisdom of the Silicon Crowd: LLM Ensemble Prediction Capabilities Match Human Crowd Accuracy

Human forecasting accuracy in practice relies on the 'wisdom of the crowd' effect, in which predictions about future events are significantly improved by aggregating across a crowd of individual forecasters. Past work on the forecasting ability of large language models (LLMs) suggests that frontier LLMs, as individual forecasters, underperform compared to the gold standard of a human crowd forecasting tournament aggregate. In Study 1, we expand this research by using an LLM ensemble approach consisting of a crowd of twelve LLMs. We compare the aggregated LLM predictions on 31 binary questions to those of a crowd of 925 human forecasters from a three-month forecasting tournament. Our main analysis shows that the LLM crowd outperforms a simple no-information benchmark and is statistically equivalent to the human crowd. We also observe an acquiescence effect, with mean model predictions being significantly above 50%, despite an almost even split of positive and negative resolutions. Moreover, in Study 2, we test whether LLM predictions (of GPT-4 and Claude 2) can be improved by drawing on human cognitive output. We find that both models' forecasting accuracy benefits from exposure to the median human prediction as information, improving accuracy by between 17% and 28%, though this leads to less accurate predictions than simply averaging human and machine forecasts. Our results suggest that LLMs can achieve forecasting accuracy rivaling that of human crowd forecasting tournaments via the simple, practically applicable method of forecast aggregation. This replicates the 'wisdom of the crowd' effect for LLMs and opens up their use for a variety of applications throughout society.
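
As a rough illustration of the forecast aggregation described in the abstract, the sketch below combines several individual probability forecasts per question into a single crowd forecast and compares it with a no-information benchmark of 50%. The abstract does not state the aggregation rule or the scoring metric, so the use of the median and of the Brier score here is an assumption, and all numbers and names (`llm_forecasts`, `human_median`, `aggregate_crowd`) are hypothetical.

```python
from statistics import median


def brier_score(forecasts, outcomes):
    """Mean squared error between probabilities and binary outcomes (lower is better)."""
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(outcomes)


def aggregate_crowd(per_question_forecasts):
    """Collapse each question's individual forecasts into one crowd forecast.

    Assumption: the median is used as the aggregation rule.
    """
    return [median(question) for question in per_question_forecasts]


# Hypothetical data: 3 binary questions, 4 model forecasts per question.
llm_forecasts = [
    [0.70, 0.60, 0.80, 0.65],
    [0.55, 0.40, 0.60, 0.50],
    [0.30, 0.45, 0.20, 0.35],
]
outcomes = [1, 0, 0]  # 1 = resolved positively, 0 = resolved negatively

# Score the aggregated crowd against a benchmark that always predicts 50%.
crowd = aggregate_crowd(llm_forecasts)
crowd_score = brier_score(crowd, outcomes)
baseline_score = brier_score([0.5] * len(outcomes), outcomes)

# Simple average of the machine crowd and a hypothetical human median forecast.
human_median = [0.80, 0.30, 0.25]
hybrid = [(m + h) / 2 for m, h in zip(crowd, human_median)]
hybrid_score = brier_score(hybrid, outcomes)

print(f"LLM crowd: {crowd_score:.4f}")
print(f"50% benchmark: {baseline_score:.4f}")
print(f"Human-machine average: {hybrid_score:.4f}")
```

On this made-up data the crowd forecast scores better than the 50% benchmark, and the human-machine average scores better still. This reflects only the chosen numbers and is not a reproduction of the paper's results.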

Updated: 2024-03-02
