PubChem synonym filtering process using crowdsourcing
Journal of Cheminformatics (IF 7.1), Pub Date: 2024-06-16, DOI: 10.1186/s13321-024-00868-3
Sunghwan Kim, Bo Yu, Qingliang Li, Evan E Bolton
PubChem (https://pubchem.ncbi.nlm.nih.gov) is a public chemical information resource containing more than 100 million unique chemical structures. One of the most requested tasks in PubChem and other chemical databases is to search chemicals by name (also commonly called a “chemical synonym”). PubChem performs this task by looking up chemical synonym-structure associations provided by individual depositors to PubChem. In addition, these synonyms are used for many purposes, including creating links between chemicals and PubMed articles (using Medical Subject Headings (MeSH) terms). However, these depositor-provided name-structure associations are subject to substantial discrepancies within and between depositors, making it difficult to unambiguously map a chemical name to a specific chemical structure. The present paper describes PubChem’s crowdsourcing-based synonym filtering strategy, which resolves inter- and intra-depositor discrepancies in synonym-structure associations as well as in the chemical-MeSH associations. The PubChem synonym filtering process was developed based on the analysis of four crowd-voting strategies, which differ in the consistency threshold value employed (60% vs. 70%) and in how intra-depositor discrepancies are resolved (a single vote vs. multiple votes per depositor) prior to inter-depositor crowd-voting. The agreement of voting was determined at six levels of chemical equivalency, which consider varying isotopic composition, stereochemistry, and connectivity of chemical structures and their primary components. While all four strategies showed comparable results, Strategy I (one vote per depositor with a 60% consistency threshold) resulted in the most synonyms assigned to a single chemical structure as well as the most synonym-structure associations disambiguated across the six chemical equivalency contexts.
Based on the results of this study, Strategy I was implemented in PubChem’s filtering process that cleans up synonym-structure associations as well as chemical-MeSH associations. This consistency-based filtering process is designed to look for a consensus in name-structure associations but cannot attest to their correctness. As a result, it can fail to recognize correct name-structure associations (or incorrect ones), for example, when a synonym is provided by only one depositor or when many contributors are incorrect. However, this filtering process is an important starting point for quality control in name-structure associations in large chemical databases like PubChem.
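The voting idea behind Strategy I (one vote per depositor, 60% consistency threshold) can be illustrated with a minimal Python sketch. This is not PubChem's implementation. The function name `filter_synonyms`, the record layout, the rule that a depositor votes for the structure it most often links to a synonym (abstaining on a tie), and the use of "greater than or equal to" for the threshold are all assumptions made for illustration. The abstract does not specify these details, and the sketch ignores the six chemical equivalency levels.

```python
from collections import Counter, defaultdict


def filter_synonyms(records, threshold_percent=60):
    """Return {synonym: structure} for synonyms that reach consensus.

    records: iterable of (depositor, synonym, structure) tuples.
    All names and rules here are illustrative assumptions, not PubChem's code.
    """
    # Count how often each depositor links each synonym to each structure.
    claims = defaultdict(lambda: defaultdict(Counter))
    for depositor, synonym, structure in records:
        claims[synonym][depositor][structure] += 1

    consensus = {}
    for synonym, depositors in claims.items():
        votes = Counter()
        for counts in depositors.values():
            ranked = counts.most_common(2)
            # Assumption: a depositor whose top two structures tie abstains.
            if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
                continue
            # One vote per depositor, cast for its most frequent structure.
            votes[ranked[0][0]] += 1

        total = sum(votes.values())
        if total == 0:
            continue
        structure, top = votes.most_common(1)[0]
        # Integer comparison avoids floating-point edge cases at the threshold.
        if top * 100 >= threshold_percent * total:
            consensus[synonym] = structure
    return consensus


# Hypothetical depositors (A-E) and structure identifiers (S1-S4).
records = [
    ("A", "aspirin", "S1"), ("A", "aspirin", "S1"), ("A", "aspirin", "S2"),
    ("B", "aspirin", "S1"),
    ("C", "aspirin", "S2"),
    ("D", "glucose", "S3"),
    ("E", "glucose", "S4"),
]
print(filter_synonyms(records))
print(filter_synonyms(records, threshold_percent=70))
```

In this toy data, two of three depositors agree on the structure for "aspirin", which clears a 60% threshold but not a 70% one, while "glucose" is split evenly and is filtered out under both. As the abstract notes, agreement of this kind measures consistency among depositors, not correctness.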
Updated: 2024-06-17