Leveraging Researcher Domain Expertise to Annotate Concepts Within Imbalanced Data, Communication Methods and Measures

Leveraging Researcher Domain Expertise to Annotate Concepts Within Imbalanced Data
Communication Methods and Measures (IF 11.4), Pub Date: 2023-02-22, DOI: 10.1080/19312458.2023.2182278
Dror K. Markus ₁, Guy Mor-Lan ₁, Tamir Sheafer ₁,₂, Shaul R. Shenhav ₁

ABSTRACT

As more computational communication researchers turn to supervised machine learning for text classification, applying these techniques to imbalanced datasets becomes a recurring challenge. The issue is critical in our domain, where researchers often seek to identify and study theoretically interesting categories that are rare in the target corpus. Specifically, a skewed distribution of texts across categories can lead to a lengthy and expensive annotation stage, forcing practitioners to sample and label large numbers of texts to train a classification model. In this paper, we provide an overview of the issue and describe existing strategies for mitigating it. Noting the pitfalls of previous solutions, we then present a semi-supervised method, Expert Initiated Latent Space Sampling, that complements researcher domain expertise with a systematic, unsupervised exploration of the latent semantic space to overcome these limitations. Using simulations to systematically evaluate our method and compare it with existing approaches, we show that our procedure offers significant advantages in efficiency and accuracy in many classification tasks.
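To make the annotation cost of imbalance concrete, the following Python sketch (not taken from the paper) estimates how many texts must be labeled under plain random sampling before a target number of rare-category examples is found. It assumes each sampled text independently belongs to the rare category with probability equal to its prevalence, so the expected count is the negative binomial mean, target divided by prevalence. The function names, the 2% prevalence, and the target of 200 positives are illustrative assumptions, not figures reported by the authors.

```python
import random


def expected_labels_needed(target_positives: int, prevalence: float) -> float:
    """Expected number of randomly sampled texts to label before
    `target_positives` rare-category texts are found (mean = k / p)."""
    if target_positives < 1:
        raise ValueError("target_positives must be at least 1")
    if not 0.0 < prevalence <= 1.0:
        raise ValueError("prevalence must be in (0, 1]")
    return target_positives / prevalence


def simulate_labels_needed(target_positives: int, prevalence: float,
                           rng: random.Random) -> int:
    """Simulate one annotation run: draw texts at random until enough
    rare-category texts have been labeled; return the total labeled."""
    labeled = 0
    found = 0
    while found < target_positives:
        labeled += 1
        if rng.random() < prevalence:  # this text is in the rare category
            found += 1
    return labeled


if __name__ == "__main__":
    # Hypothetical scenario: a category covering 2% of the corpus,
    # and a goal of 200 positive training examples.
    k, p = 200, 0.02
    rng = random.Random(0)
    runs = [simulate_labels_needed(k, p, rng) for _ in range(200)]
    print("analytic expectation:", expected_labels_needed(k, p))
    print("simulated mean:", sum(runs) / len(runs))
```

Under these assumptions the cost grows in inverse proportion to prevalence: halving the rare category's share doubles the expected labeling effort, which is the burden that targeted sampling strategies aim to reduce.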


Updated: 2023-02-22
