The Framework for AI Tool Assessment in Mental Health (FAITA - Mental Health): a scale for evaluating AI-powered mental health tools
World Psychiatry (IF 60.5). Pub date: 2024-09-16. DOI: 10.1002/wps.21248. Ashleigh Golden, Elias Aboujaoude
Even within the ever-evolving landscape of digital mental health interventions, the advent of generative artificial intelligence (GAI), large language models (LLMs), and generative pre-trained transformers (GPTs) represents a paradigm shift. These technologies bring the promise of scalable and personalized diagnostics, psychoeducation and treatment that may help close a stubborn access-to-care gap1. At the same time, the risk to patients’ health from unmonitored AI-powered care, and to users’ data from insecure platforms, presents unprecedented challenges. The enthusiasm and fear that AI mental health offerings simultaneously generate make a comprehensive tool for their systematic assessment a timely necessity.
To our knowledge, no comprehensive scale exists for systematically evaluating AI interventions. Abbasian et al2 suggested helpful metrics for assessing AI health care conversations, without explicitly tailoring them to mental health. AI scholar L. Eliot3 advocated rating mental health chatbots by their autonomy or degree of independence from human oversight. Pfohl et al4 put the focus squarely on evaluating equity and bias. These efforts highlight the need for a comprehensive toolbox for evaluating AI interventions in mental health – one that encompasses autonomy and equity, but also efficacy, user experience, safety and ethical integrity, among other crucial dimensions5.
Evaluative digital mental health tools that predate the rise of AI provide valuable lessons. The now-discontinued nonprofit One Mind PsyberGuide6 offered reviews of digital mental health apps with a focus on three dimensions: credibility, user experience, and transparency. This framework seemed to fulfill an important role across several constituencies: Psihogios et al7 praised it in their paper on pediatric mobile health apps; Nesamoney8 endorsed it for helping app developers and designers; and Garland et al9 described it as more comprehensive and user-friendly than other app review platforms, including the one offered by the American Psychological Association.
In creating an assessment framework for AI-powered mental health tools, PsyberGuide is a reasonable starting point. Besides short app reviews by users and lengthier expert reviews, it offered scoring guidelines for its dimensions. Given the importance of AI tools “learning” from ongoing feedback and reviews, and of a scoring system that facilitates comparisons across AI offerings, it forms a helpful basis.
Here we introduce the Framework for AI Tool Assessment in Mental Health (FAITA - Mental Health), a structured scale developed by updating PsyberGuide's “credibility”, “user experience” and “transparency” dimensions for the AI “age”, and incorporating three crucial new dimensions: “user agency”, “equity and inclusivity” and “crisis management” (see supplementary information for the full structured FAITA - Mental Health form).
Our framework reflects awareness of both the potential and challenges of AI tools, and emphasizes evidence base, user-centric design, safety, personalization, cultural sensitivity, and the ethical use of technology. Ultimately, the framework aims to promote “best practices” and to guide industry development of AI technologies that benefit users while respecting their rights. Additionally, the framework seeks to be sufficiently flexible to accommodate continued evolution in the field and, with some minor modifications, adaptation to other medical disciplines impacted by AI (e.g., “FAITA - Genetics”).
The framework's first dimension, “credibility”, evaluates AI-powered mental health tools according to their scientific underpinnings and user goal achievement capabilities. Integrating the three subdimensions of “proposed goal”, “evidence-based content” and “retention”, this dimension advocates for interventions that have clear and measurable goals, are grounded in validated research and practices, and can keep users meaningfully engaged over time. Each subdimension is awarded up to 2 points, for a maximum dimension score of 6 for the most “credible” tool.
The second dimension for assessing AI mental health tools, “user experience”, addresses more complex interactions than those encountered in static mental health apps. As such, PsyberGuide's “user experience” dimension – with its focus on engagement, functionality and esthetics – was found to be insufficient, and three new subdimensions were incorporated: “personalized adaptability”, to evaluate the AI's ability to improve from user feedback over time; “quality of interactions”, to evaluate the naturalness of exchanges; and “mechanisms for feedback”, to underscore the importance of users’ ability to report issues, suggest improvements, and seek assistance. Each subdimension on the “user experience” dimension is awarded up to 2 points, for a maximum dimension score of 6.
The third dimension, “user agency”, is new and underlines the importance of empowering users to manage their personal data and treatment choices. It is divided into two subdimensions. The first, “user autonomy, data protection, and privacy”, focuses on control over personal health data, clearly worded and user-friendly consent processes, robust data protection protocols, secure storage, and users' ability to actively manage their data. The second, “user empowerment”, focuses on users’ self-efficacy and capacity for self-management, gauging AI interventions’ inclusion of tools that support users' independence, as well as encouraging the application of skills learned using the tool to real-life contexts in ways that prevent dependency on the tool. Each subdimension is awarded up to 2 points, for a maximum “user agency” dimension score of 4.
The fourth dimension, “equity and inclusivity”, is also new and consists of two subdimensions: “cultural sensitivity and inclusivity”, which assesses a tool's capability to engage with users from diverse cultural backgrounds and emphasizes the need for content recognizing cultural and other identity differences; and “bias and fairness”, which addresses the tool's commitment to diversify its training material and remove biases that might impact fairness and equity. Each subdimension is awarded up to 2 points, for a maximum “equity and inclusivity” dimension score of 4.
The fifth dimension, “transparency”, remains from PsyberGuide, but now extends beyond data management to include the AI's ownership, funding, business model, development processes, and primary stakeholders. It highlights the importance of providing clear and comprehensive information about operational and business practices, so that users are better equipped to make informed decisions on using such technologies. It also aims to help developers adhere to best practices by disclosing information regarding their tools’ intention and governance. The “transparency” dimension carries a maximum score of 2.
Finally, the new sixth dimension of “crisis management” evaluates the safeguarding of user well-being and whether the mental health AI tool provides immediate, effective support in emergencies. It emphasizes comprehensive safety protocols and crisis management features that not only steer users to relevant local resources during crises, but also facilitate follow-through with these resources. The “crisis management” dimension carries a maximum score of 2.
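Because each rated item is capped at 2 points and the dimension maxima simply add up, the rubric lends itself to straightforward programmatic tallying. The Python sketch below is our own illustration rather than part of the published scale: the RUBRIC mapping mirrors the dimension and subdimension names given above, while the score_tool helper, the max_total check, and the example ratings are hypothetical.

```python
# A minimal sketch (assumed, not an official implementation) of the
# FAITA - Mental Health rubric described above, encoded for scoring.
from typing import Dict, List

# Each rated item is worth up to 2 points, as specified in the text.
MAX_ITEM_SCORE = 2

# The six dimensions. "Transparency" and "crisis management" are scored
# as single items (maximum 2 each); the others sum their subdimensions.
RUBRIC: Dict[str, List[str]] = {
    "credibility": ["proposed goal", "evidence-based content", "retention"],
    "user experience": ["personalized adaptability",
                        "quality of interactions", "mechanisms for feedback"],
    "user agency": ["user autonomy, data protection, and privacy",
                    "user empowerment"],
    "equity and inclusivity": ["cultural sensitivity and inclusivity",
                               "bias and fairness"],
    "transparency": ["transparency"],
    "crisis management": ["crisis management"],
}

def max_total() -> int:
    """Maximum overall score: 6 + 6 + 4 + 4 + 2 + 2 = 24."""
    return sum(MAX_ITEM_SCORE * len(items) for items in RUBRIC.values())

def score_tool(ratings: Dict[str, int]) -> Dict[str, int]:
    """Aggregate 0-2 item ratings into per-dimension and overall scores."""
    scores: Dict[str, int] = {}
    for dimension, items in RUBRIC.items():
        for item in items:
            r = ratings.get(item, 0)  # unrated items default to 0
            if not 0 <= r <= MAX_ITEM_SCORE:
                raise ValueError(f"{item!r}: rating must be 0-{MAX_ITEM_SCORE}")
        scores[dimension] = sum(ratings.get(item, 0) for item in items)
    scores["overall"] = sum(scores.values())
    return scores

if __name__ == "__main__":
    assert max_total() == 24  # matches the scale's 0-24 overall range
    # Hypothetical ratings for an imaginary chatbot, for illustration only.
    example = {"proposed goal": 2, "evidence-based content": 1, "retention": 1,
               "personalized adaptability": 1, "quality of interactions": 2,
               "mechanisms for feedback": 2,
               "user autonomy, data protection, and privacy": 2,
               "user empowerment": 1,
               "cultural sensitivity and inclusivity": 1, "bias and fairness": 1,
               "transparency": 2, "crisis management": 1}
    print(score_tool(example))  # per-dimension scores plus overall (here 17)
```

Encoding the point caps this way makes side-by-side comparison across AI offerings, one of the stated aims of the scoring system, a matter of comparing per-dimension and overall totals.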
Integrating GAI, LLMs and GPTs into mental health care heralds a promising but complicated new era. The promise of these technologies for delivering personalized, accessible and scalable mental health support is immense. So, unfortunately, are the challenges. We developed the FAITA - Mental Health to equip users, clinicians, researchers, and industry and public health stakeholders with a scale for comprehensively evaluating the quality, safety, integrity and user-centricity of AI-powered mental health tools.
With an overall score ranging from 0 to 24 (the sum of the six dimension maxima: 6 + 6 + 4 + 4 + 2 + 2), this scale attempts to capture the complexities of AI-driven mental health care, while accommodating ongoing evolution in the field and possible adaptations to other medical disciplines. Formal research is required to empirically test its strengths, weaknesses, and most pertinent components.