Beyond Preferences in AI Alignment

Philosophical Studies (IF 1.1), Pub Date: 2024-11-09, DOI: 10.1007/s11098-024-02249-w
Tan Zhi-Xuan, Micah Carroll, Matija Franklin, Hal Ashton
The dominant practice of AI alignment assumes (1) that preferences are an adequate representation of human values, (2) that human rationality can be understood in terms of maximizing the satisfaction of preferences, and (3) that AI systems should be aligned with the preferences of one or more humans to ensure that they behave safely and in accordance with our values. Whether implicitly followed or explicitly endorsed, these commitments constitute what we term a preferentist approach to AI alignment. In this paper, we characterize and challenge the preferentist approach, describing conceptual and technical alternatives that are ripe for further research. We first survey the limits of rational choice theory as a descriptive model, explaining how preferences fail to capture the thick semantic content of human values, and how utility representations neglect the possible incommensurability of those values. We then critique the normativity of expected utility theory (EUT) for humans and AI, drawing upon arguments showing how rational agents need not comply with EUT, while highlighting how EUT is silent on which preferences are normatively acceptable. Finally, we argue that these limitations motivate a reframing of the targets of AI alignment: Instead of alignment with the preferences of a human user, developer, or humanity writ large, AI systems should be aligned with normative standards appropriate to their social roles, such as the role of a general-purpose assistant. Furthermore, these standards should be negotiated and agreed upon by all relevant stakeholders. On this alternative conception of alignment, a multiplicity of AI systems will be able to serve diverse ends, aligned with normative standards that promote mutual benefit and limit harm despite our plural and divergent values.
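As background (a sketch for readers, not drawn from the paper itself), the preferentist framing that the authors critique is standardly formalized via expected utility theory: a preference relation \succeq over outcomes or lotteries is assumed to be representable by a utility function u, and rational, preference-aligned action is then cast as expected utility maximization,

\[
  a^{*} \in \arg\max_{a \in A} \; \mathbb{E}_{x \sim p(\cdot \mid a)}\!\left[ u(x) \right],
\]

where A is the agent's action set and p(\cdot \mid a) is the distribution over outcomes given action a. The notation here is illustrative only; the paper argues that rational agents need not satisfy the axioms that guarantee such a representation, and that a single u may fail to capture thick or incommensurable human values.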