Incorporate Lexicon into Self-training: A Distantly Supervised Chinese Medical NER.

2021
Medical named entity recognition (NER) tasks usually lack sufficient annotation data. Distant supervision is often used to alleviate this problem, which can quickly and automatically generate annotated training datasets through dictionaries. However, the current distantly supervised method suffers from noisy labeling due to limited coverage of the dictionary, which will cause a large number of unlabeled entities. We call this phenomenon an incomplete annotation problem. To tackle the incomplete annotation problem, we propose a novel distantly supervised method for Chinese medical NER. Specifically, we propose a high recall self-training mechanism to recall potential unlabeled entities in the distant supervision dataset. To reduce error in the high recall self-training, we propose a fine-grained lexicon enhanced scoring and ranking mechanism. Our method improves 3.2% and 5.03% compared to the baseline models on the dataset we proposed and a benchmark dataset for Chinese medical NER.
    • Correction
    • Source
    • Cite
    • Save
    17
    References
    0
    Citations
    NaN
    KQI
    []
    Baidu
    map