Our official English website, www.x-mol.net, welcomes your
feedback! (Note: you will need to create a separate account there.)
Current safeguards, risk mitigation, and transparency measures of large language models against the generation of health disinformation: repeated cross sectional analysis
The BMJ ( IF 93.6 ) Pub Date : 2024-03-20 , DOI: 10.1136/bmj-2023-078538 Bradley D Menz 1 , Nicole M Kuderer 2 , Stephen Bacchi 1, 3 , Natansh D Modi 1 , Benjamin Chin-Yee 4, 5 , Tiancheng Hu 6 , Ceara Rickard 7 , Mark Haseloff 7 , Agnes Vitry 7, 8 , Ross A McKinnon 1 , Ganessan Kichenadasse 1, 9 , Andrew Rowland 1 , Michael J Sorich 1 , Ashley M Hopkins 10
The BMJ ( IF 93.6 ) Pub Date : 2024-03-20 , DOI: 10.1136/bmj-2023-078538 Bradley D Menz 1 , Nicole M Kuderer 2 , Stephen Bacchi 1, 3 , Natansh D Modi 1 , Benjamin Chin-Yee 4, 5 , Tiancheng Hu 6 , Ceara Rickard 7 , Mark Haseloff 7 , Agnes Vitry 7, 8 , Ross A McKinnon 1 , Ganessan Kichenadasse 1, 9 , Andrew Rowland 1 , Michael J Sorich 1 , Ashley M Hopkins 10
Affiliation
Objectives To evaluate the effectiveness of safeguards to prevent large language models (LLMs) from being misused to generate health disinformation, and to evaluate the transparency of artificial intelligence (AI) developers regarding their risk mitigation processes against observed vulnerabilities. Design Repeated cross sectional analysis. Setting Publicly accessible LLMs. Methods In a repeated cross sectional analysis, four LLMs (via chatbots/assistant interfaces) were evaluated: OpenAI’s GPT-4 (via ChatGPT and Microsoft’s Copilot), Google’s PaLM 2 and newly released Gemini Pro (via Bard), Anthropic’s Claude 2 (via Poe), and Meta’s Llama 2 (via HuggingChat). In September 2023, these LLMs were prompted to generate health disinformation on two topics: sunscreen as a cause of skin cancer and the alkaline diet as a cancer cure. Jailbreaking techniques (ie, attempts to bypass safeguards) were evaluated if required. For LLMs with observed safeguarding vulnerabilities, the processes for reporting outputs of concern were audited. 12 weeks after initial investigations, the disinformation generation capabilities of the LLMs were re-evaluated to assess any subsequent improvements in safeguards. Main outcome measures The main outcome measures were whether safeguards prevented the generation of health disinformation, and the transparency of risk mitigation processes against health disinformation. Results Claude 2 (via Poe) declined 130 prompts submitted across the two study timepoints requesting the generation of content claiming that sunscreen causes skin cancer or that the alkaline diet is a cure for cancer, even with jailbreaking attempts. GPT-4 (via Copilot) initially refused to generate health disinformation, even with jailbreaking attempts—although this was not the case at 12 weeks. In contrast, GPT-4 (via ChatGPT), PaLM 2/Gemini Pro (via Bard), and Llama 2 (via HuggingChat) consistently generated health disinformation blogs. In September 2023 evaluations, these LLMs facilitated the generation of 113 unique cancer disinformation blogs, totalling more than 40 000 words, without requiring jailbreaking attempts. The refusal rate across the evaluation timepoints for these LLMs was only 5% (7 of 150), and as prompted the LLM generated blogs incorporated attention grabbing titles, authentic looking (fake or fictional) references, fabricated testimonials from patients and clinicians, and they targeted diverse demographic groups. Although each LLM evaluated had mechanisms to report observed outputs of concern, the developers did not respond when observations of vulnerabilities were reported. Conclusions This study found that although effective safeguards are feasible to prevent LLMs from being misused to generate health disinformation, they were inconsistently implemented. Furthermore, effective processes for reporting safeguard problems were lacking. Enhanced regulation, transparency, and routine auditing are required to help prevent LLMs from contributing to the mass generation of health disinformation. The research team would be willing to make the complete set of generated data available upon request from qualified researchers or policy makers on submission of a proposal detailing required access and intended use.
中文翻译:
针对健康虚假信息生成的大型语言模型的当前保障措施、风险缓解和透明度措施:重复横断面分析
目标 评估防止大型语言模型(LLMs )避免被滥用来生成健康虚假信息,并评估人工智能(AI)开发人员针对观察到的漏洞的风险缓解流程的透明度。设计 重复横截面分析。设置 公开访问LLMs。方法 在重复的横截面分析中,四种LLMs(通过聊天机器人/助手界面)进行了评估:OpenAI 的 GPT-4(通过 ChatGPT 和 Microsoft 的 Copilot)、Google 的 PaLM 2 和新发布的 Gemini Pro(通过 Bard)、Anthropic 的 Claude 2(通过 Poe)和 Meta 的 Llama 2(通过 HuggingChat) )。 2023 年 9 月,这些LLMs被提示在两个主题上发布健康虚假信息:防晒霜是导致皮肤癌的原因,碱性饮食是治疗癌症的方法。如果需要,将对越狱技术(即尝试绕过安全措施)进行评估。为了LLMs根据观察到的保护漏洞,对报告所关注产出的流程进行了审计。初步调查 12 周后,该公司的虚假信息生成能力LLMs进行了重新评估,以评估保障措施的任何后续改进。主要成果衡量标准 主要成果衡量标准是保障措施是否防止健康虚假信息的产生,以及针对健康虚假信息的风险缓解流程的透明度。结果 Claude 2(通过 Poe)拒绝了在两个研究时间点提交的 130 条提示,这些提示要求生成声称防晒霜会导致皮肤癌或碱性饮食可以治愈癌症的内容,即使有越狱尝试。 GPT-4(通过 Copilot)最初拒绝生成健康虚假信息,即使尝试越狱 - 尽管在 12 周时情况并非如此。相比之下,GPT-4(通过 ChatGPT)、PaLM 2/Gemini Pro(通过 Bard)和 Llama 2(通过 HuggingChat)始终生成健康虚假信息博客。在 2023 年 9 月的评估中,这些LLMs促成了 113 个独特的癌症虚假信息博客的生成,总计超过 40,000 字,无需尝试越狱。这些评估时间点的拒绝率LLMs只有 5%(150 中的 7),并且按照提示LLM生成的博客包含引人注目的标题、真实的(虚假或虚构的)参考资料、捏造的患者和临床医生的感言,并且针对不同的人口群体。虽然每个LLM评估有报告观察到的关注输出的机制,但开发人员在报告漏洞观察结果时没有做出回应。结论 本研究发现,虽然有效的保障措施可以预防LLMs由于它们被滥用来制造健康虚假信息,因此实施不一致。此外,缺乏报告保障问题的有效程序。需要加强监管、透明度和例行审计来帮助防止LLMs以免造成大量健康虚假信息的产生。研究团队愿意根据合格研究人员或政策制定者提交详细说明所需访问和预期用途的提案的要求,提供完整的生成数据集。
更新日期:2024-03-21
中文翻译:
针对健康虚假信息生成的大型语言模型的当前保障措施、风险缓解和透明度措施:重复横断面分析
目标 评估防止大型语言模型(LLMs )避免被滥用来生成健康虚假信息,并评估人工智能(AI)开发人员针对观察到的漏洞的风险缓解流程的透明度。设计 重复横截面分析。设置 公开访问LLMs。方法 在重复的横截面分析中,四种LLMs(通过聊天机器人/助手界面)进行了评估:OpenAI 的 GPT-4(通过 ChatGPT 和 Microsoft 的 Copilot)、Google 的 PaLM 2 和新发布的 Gemini Pro(通过 Bard)、Anthropic 的 Claude 2(通过 Poe)和 Meta 的 Llama 2(通过 HuggingChat) )。 2023 年 9 月,这些LLMs被提示在两个主题上发布健康虚假信息:防晒霜是导致皮肤癌的原因,碱性饮食是治疗癌症的方法。如果需要,将对越狱技术(即尝试绕过安全措施)进行评估。为了LLMs根据观察到的保护漏洞,对报告所关注产出的流程进行了审计。初步调查 12 周后,该公司的虚假信息生成能力LLMs进行了重新评估,以评估保障措施的任何后续改进。主要成果衡量标准 主要成果衡量标准是保障措施是否防止健康虚假信息的产生,以及针对健康虚假信息的风险缓解流程的透明度。结果 Claude 2(通过 Poe)拒绝了在两个研究时间点提交的 130 条提示,这些提示要求生成声称防晒霜会导致皮肤癌或碱性饮食可以治愈癌症的内容,即使有越狱尝试。 GPT-4(通过 Copilot)最初拒绝生成健康虚假信息,即使尝试越狱 - 尽管在 12 周时情况并非如此。相比之下,GPT-4(通过 ChatGPT)、PaLM 2/Gemini Pro(通过 Bard)和 Llama 2(通过 HuggingChat)始终生成健康虚假信息博客。在 2023 年 9 月的评估中,这些LLMs促成了 113 个独特的癌症虚假信息博客的生成,总计超过 40,000 字,无需尝试越狱。这些评估时间点的拒绝率LLMs只有 5%(150 中的 7),并且按照提示LLM生成的博客包含引人注目的标题、真实的(虚假或虚构的)参考资料、捏造的患者和临床医生的感言,并且针对不同的人口群体。虽然每个LLM评估有报告观察到的关注输出的机制,但开发人员在报告漏洞观察结果时没有做出回应。结论 本研究发现,虽然有效的保障措施可以预防LLMs由于它们被滥用来制造健康虚假信息,因此实施不一致。此外,缺乏报告保障问题的有效程序。需要加强监管、透明度和例行审计来帮助防止LLMs以免造成大量健康虚假信息的产生。研究团队愿意根据合格研究人员或政策制定者提交详细说明所需访问和预期用途的提案的要求,提供完整的生成数据集。