Embers of autoregression show how large language models are shaped by the problem they are trained to solve
Proceedings of the National Academy of Sciences of the United States of America (IF 9.4)
Pub Date: 2024-10-04, DOI: 10.1073/pnas.2322420121
R. Thomas McCoy, Shunyu Yao, Dan Friedman, Mathew D. Hardy, Thomas L. Griffiths
The widespread adoption of large language models (LLMs) makes it important to recognize their strengths and limitations. We argue that to develop a holistic understanding of these systems, we must consider the problem that they were trained to solve: next-word prediction over Internet text. By recognizing the pressures that this task exerts, we can make predictions about the strategies that LLMs will adopt, allowing us to reason about when they will succeed or fail. Using this approach—which we call the teleological approach—we identify three factors that we hypothesize will influence LLM accuracy: the probability of the task to be performed, the probability of the target output, and the probability of the provided input. To test our predictions, we evaluate five LLMs (GPT-3.5, GPT-4, Claude 3, Llama 3, and Gemini 1.0) on 11 tasks, and we find robust evidence that LLMs are influenced by probability in the hypothesized ways. Many of the experiments reveal surprising failure modes. For instance, GPT-4’s accuracy at decoding a simple cipher is 51% when the output is a high-probability sentence but only 13% when it is low-probability, even though this task is a deterministic one for which probability should not matter. These results show that AI practitioners should be careful about using LLMs in low-probability situations. More broadly, we conclude that we should not evaluate LLMs as if they are humans but should instead treat them as a distinct type of system—one that has been shaped by its own particular set of pressures.
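The cipher experiment mentioned above involves a deterministic letter-substitution task (a shift cipher such as rot-13), where the correct answer is fully determined by the input regardless of how probable the output sentence is. The sketch below illustrates that setup in Python; the test sentences and prompt wording are hypothetical stand-ins, not the paper's actual stimuli.

```python
import string

def shift_cipher(text: str, shift: int = 13) -> str:
    """Apply a simple alphabetic shift cipher (rot-13 when shift=13).

    The task is deterministic: applying rot-13 twice returns the original
    text, so a system that truly executes the algorithm should be
    unaffected by how probable the decoded sentence is.
    """
    lower = string.ascii_lowercase
    upper = string.ascii_uppercase
    table = str.maketrans(
        lower + upper,
        lower[shift:] + lower[:shift] + upper[shift:] + upper[:shift],
    )
    return text.translate(table)

# Illustrative test items (hypothetical, not the paper's stimuli):
# a high-probability target sentence vs. a low-probability word-shuffled one.
high_prob_target = "The quick brown fox jumps over the lazy dog."
low_prob_target = "Dog lazy the over jumps fox brown quick the."

for target in (high_prob_target, low_prob_target):
    encoded = shift_cipher(target)          # what the model would be asked to decode
    assert shift_cipher(encoded) == target  # rot-13 is its own inverse
    print(f"Decode this rot-13 text: {encoded!r}  (expected: {target!r})")
```

Comparing model accuracy on the two kinds of targets, while holding the encoded inputs' structure fixed, is what reveals the probability sensitivity reported in the abstract (e.g., 51% vs. 13% for GPT-4).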
Updated: 2024-10-04