Are LLMs ready to do astronomy?
Nature Astronomy (IF 12.9), Pub Date: 2024-12-16, DOI: 10.1038/s41550-024-02458-7
Lindsay Oldham

As modern astronomy matures, individual researchers are becoming increasingly specialized, often at the expense of detailed knowledge in other fields; meanwhile, the dissemination of the vast datasets being collected by today’s telescopes is limited by the work hours available in the research community. The recent and rapid development of large language models (LLMs) may present a solution to both these problems, but only if their grasp and manipulation of the astronomy literature can be trusted. Yuan-Sen Ting and colleagues investigate this question through a comprehensive comparison of existing proprietary and open-weights (that is, modifiable and free to access) LLMs, and touch on the bigger issue of how artificial intelligence has the potential to transform the practice of science.

The authors compile a benchmarking set of ~4,500 multiple-choice questions, covering a range of astronomy topics drawn from over 85 published review articles, and feed these into a selection of ~20 LLMs to evaluate the accuracy, cost-efficiency, rate of improvement and confidence level of each. They find that proprietary models such as Anthropic’s Claude series generally outperform open-weights models, reaching a maximum accuracy of 85% but with significant scatter, though their current costs may be prohibitive. Open-weights models tend to be less accurate but are more affordable to run and are improving more rapidly. Interestingly, the authors also identify human-language-based and subject-specific performance differences between models, which they attribute to the varying amounts of training data available in each case.

Updated: 2024-12-17