The Effect of Resampling on Data‐Imbalanced Conditions for Prediction towards Nuclear Receptor Profiling Using Deep Learning

2020
In toxicity evaluation based on nuclear receptor signalling pathways, in silico prediction tools are used to detect the early stages of long-term toxicity, to prioritize newly synthesized chemicals, and to obtain selectivity and sensitivity. Computational prediction models are promising tools for screening the toxicity of chemical-protein interactions, as deep learning has improved prediction accuracy. The challenge, however, is data imbalance: when the set of toxic compounds is much smaller than the set of nontoxic compounds, prediction accuracy is low for the toxic class, which is the class that provides the information relevant to toxicity hazard. In this paper, we examine the effect of data imbalance in toxicity assessment data for the nuclear receptors AR (LBD), ER (LBD), AhR, and PPAR, and identify a severe imbalance between prediction performance on the toxic and nontoxic datasets. Because assessing toxicity hazards requires balanced selectivity and sensitivity, we investigate data resampling methods to reduce the bias in binary classification for toxicity hazard profiling of nuclear receptors. With simple resampling methods, the experiments achieved a sensitivity of 0.714, a specificity of 0.787, an overall accuracy of 0.829, and a ROC-AUC of 0.822.
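The abstract does not say which resampling methods were used, so the following is only a minimal sketch of one simple option, random oversampling of the minority (toxic) class, together with the sensitivity and specificity calculations. The function names `random_oversample` and `sensitivity_specificity`, the label convention (1 = toxic, 0 = nontoxic), and the toy data are assumptions for illustration, not the authors' implementation.

```python
import random
from collections import Counter


def random_oversample(features, labels, seed=0):
    """Duplicate randomly chosen samples of smaller classes until every
    class has as many samples as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(features, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(rows) for rows in by_class.values())

    out_x, out_y = [], []
    for y, rows in by_class.items():
        # Draw with replacement from the class to fill the gap to `target`.
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for x in rows + extra:
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y


def sensitivity_specificity(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity) for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


if __name__ == "__main__":
    # Toy data (assumed): 2 toxic (label 1) and 8 nontoxic (label 0) compounds.
    X = [[i] for i in range(10)]
    y = [1, 1] + [0] * 8
    X_bal, y_bal = random_oversample(X, y)
    print(Counter(y_bal))
    print(sensitivity_specificity([1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 1]))
```

In a sketch like this, resampling would be applied to the training split only, so that the reported sensitivity and specificity are measured on data with the original class distribution.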