Explainable detection of adverse drug reaction with imbalanced data distribution

Jin Wang, Liang-Chih Yu, Xuejie Zhang

Abstract

Analysis of health-related texts can be used to detect adverse drug reactions (ADR). The greatest challenge for ADR detection lies in imbalanced data distributions, where words related to ADR symptoms often belong to minority classes. As a result, trained models tend to converge to a point that is strongly biased towards the majority class and ignores the minority class. Because the most commonly used cross-entropy criterion is an approximation to accuracy, the model focuses more readily on the majority class to achieve high accuracy. To address this issue, existing methods apply either oversampling or down-sampling strategies to balance the data distribution and exploit the most difficult samples of the minority class. However, increasing or reducing the number of individual tokens alone in sequence labeling tasks results in the loss of the syntactic relations of the sentence. This paper proposes a weighted variant of the conditional random field (CRF) for data-imbalanced sequence labeling tasks. Such a weighting strategy can alleviate data distribution imbalances between majority and minority classes. Unlike a softmax output layer, the CRF can capture the relationship of labels between tokens. The locally interpretable model-agnostic explanations (LIME) algorithm was applied to investigate performance differences between models with and without the weighted loss function. Experimental results on two different ADR tasks show that the proposed model outperforms previously proposed sequence labeling methods.

Introduction

An adverse drug reaction refers to any injury caused by taking medication, and the incidence of such injuries is quite high, especially in cases of large doses or long duration of medication use. However, such reactions are unpredictable, and pre-market clinical trials of new drugs are usually only conducted on samples of 500-3000 people, for a single type of disease, and often exclude special populations (e.g., the elderly, pregnant women, and children). Therefore, such trials often fail to identify relatively rare adverse reactions, late-onset reactions, or adverse reactions that occur in special populations, which only become apparent after large-scale use [1]. This raises an urgent need for post-marketing drug safety surveillance after drug approval [2–5].

Methods

Fig 4 shows the overall framework of the proposed weighted BERT-CRF model for ADR detection, which consists of three parts. The first part is a pre-trained BERT model, the second part is a bi-directional LSTM, and the third part is a weighted CRF output layer. The details of each part are described as follows.
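To make the role of the third part concrete, the following is a minimal sketch of a linear-chain CRF negative log-likelihood in plain Python. It assumes the emission scores come from the encoder (BERT followed by the bi-directional LSTM) and are passed in as plain lists. The function names, the toy scores, and in particular the weighting rule in `weighted_crf_nll` (scaling the sentence loss by the mean class weight of the gold labels) are illustrative assumptions, not the authors' formulation, which is not reproduced in this text.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def crf_nll(emissions, transitions, gold):
    """Negative log-likelihood of the gold label path.

    emissions[t][y]   : score of label y at token t (from the encoder)
    transitions[p][y] : score of moving from label p to label y
    gold[t]           : index of the gold label at token t
    """
    n_labels = len(transitions)

    # Score of the gold path: emissions plus transitions along the path.
    score = emissions[0][gold[0]]
    for t in range(1, len(gold)):
        score += transitions[gold[t - 1]][gold[t]] + emissions[t][gold[t]]

    # Forward algorithm: log of the summed scores over all label paths.
    alpha = list(emissions[0])
    for t in range(1, len(emissions)):
        alpha = [
            logsumexp([alpha[p] + transitions[p][y] for p in range(n_labels)])
            + emissions[t][y]
            for y in range(n_labels)
        ]
    log_z = logsumexp(alpha)
    return log_z - score


def weighted_crf_nll(emissions, transitions, gold, class_weights):
    """ASSUMPTION: scale the sentence loss by the mean weight of its gold labels."""
    mean_weight = sum(class_weights[y] for y in gold) / len(gold)
    return mean_weight * crf_nll(emissions, transitions, gold)


# Toy example with two labels (0 = "O", 1 = "ADR") and three tokens.
emissions = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
transitions = [[0.0, 0.0], [0.0, 0.0]]
gold = [0, 1, 0]
plain = crf_nll(emissions, transitions, gold)
weighted = weighted_crf_nll(emissions, transitions, gold, [1.0, 3.0])
```

With all scores set to zero, every one of the eight label paths is equally likely, so the unweighted loss equals log 8. The weighted variant increases the loss for sentences whose gold path contains minority labels, which is the general effect a weighted CRF is meant to achieve.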

Results

This section presents the experiments conducted on several corpora to evaluate the performance of the proposed weighted BERT-CRF model against different neural networks for the ADR detection task.

Discussion

This section uses an explainable algorithm to offer some explanation of, or speculation on, the empirical results.

The effect of different loss functions

The performance of the proposed weighted CRF mainly depends on the weight assignment strategy. That is, the loss of the minority classes should be assigned a heavier weight than that of the majority classes. For comparison, we introduce two other strategies. The first uses the inverse value of the sample numbers (Strategy-1), and the other uses the inverse ratio of the sample numbers (Strategy-2). As indicated in Fig 2A, the proposed weight assignment strategy in Eq (12) (Weighted Loss) outperformed both of the aforementioned strategies, since the proposed method considers both the number of samples and the ratio of each class.
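The two comparison strategies can be sketched as follows. This is a minimal illustration under stated assumptions: Strategy-1 is read as w_c = 1 / n_c and Strategy-2 as w_c = N / n_c, where n_c is the number of tokens with label c and N is the total number of tokens. The function names and the toy label counts are hypothetical, and the proposed weighting of Eq (12) is not reproduced here because its definition does not appear in this text.

```python
from collections import Counter


def inverse_count_weights(labels):
    """Strategy-1 (assumed reading): weight = 1 / number of samples in the class."""
    counts = Counter(labels)
    return {label: 1.0 / n for label, n in counts.items()}


def inverse_ratio_weights(labels):
    """Strategy-2 (assumed reading): weight = 1 / (class share of all samples)."""
    counts = Counter(labels)
    total = len(labels)
    return {label: total / n for label, n in counts.items()}


# Hypothetical token-level label distribution: "O" is the majority class.
labels = ["O"] * 90 + ["B-ADR"] * 6 + ["I-ADR"] * 4

strategy_1 = inverse_count_weights(labels)
strategy_2 = inverse_ratio_weights(labels)

for label in ("O", "B-ADR", "I-ADR"):
    print(label, round(strategy_1[label], 4), round(strategy_2[label], 2))
```

Under both readings the minority labels receive larger weights than "O". The two schemes differ only by the constant factor N, so they yield the same relative weighting between classes but a different overall loss scale.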

Conclusions

In this study, a weighted conditional random field is proposed for imbalanced-data ADR detection tasks. It applies a pre-trained language model as a context encoder, which is both robust to strongly class-imbalanced datasets and able to integrate dependencies and syntactic information between tokens. To address the data imbalance issue, a weighted variant of the CRF loss function is implemented to assign more weight to the minority classes, forcing the model to pay more attention to such classes to ensure effective detection. Furthermore, we introduce an explainable algorithm that provides a qualitative understanding of the relationship between the input features and the corresponding prediction, in order to compare the behaviors of the models with and without the weighted loss function.

Experimental results show that the proposed weighted variant of CRF provides a significant performance boost without changing the model architecture in imbalanced-data tasks. Future work will attempt to adjust the weights of training samples based on target metrics or to build a separate network for weight prediction.

Citation: Wang J, Yu L-C, Zhang X (2022) Explainable detection of adverse drug reaction with imbalanced data distribution. PLoS Comput Biol 18(6): e1010144. https://doi.org/10.1371/journal.pcbi.1010144

Editor: Andrey Rzhetsky, University of Chicago, UNITED STATES

Received: December 11, 2021; Accepted: April 26, 2022; Published: June 15, 2022

Copyright: © 2022 Wang et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The input data files and code for model analysis and evaluation are available at: https://github.com/wangjin0818/adverse_drug_reaction/.

Funding: This work was supported by the National Natural Science Foundation of China (NSFC) under Grants Nos. 61702443 (to JW), 61966038 (to JW) and 61762091 (to XJZ), and in part by the Ministry of Science and Technology, Taiwan, ROC, under Grant No. MOST110-2628-E-155-002 (to LCY). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010144#sec022
