Automated assessment of biological database assertions using the scientific literature

2019
Large biological databases such as GenBank contain vast numbers of records, the content of which is substantively based on external resources, including published literature. Manual curation is used to establish whether the literature and the records are indeed consistent. In this paper we explore an automated method for assessing the consistency of biological assertions, intended to assist biocurators, which we call BARC (Biocuration tool for Assessment of Relation Consistency). In this method a biological assertion is represented as a relation between two objects (for example, a gene and a disease); we then use our novel set-based relevance algorithm, SaBRA, to retrieve pertinent literature, and apply a classifier to estimate the likelihood that this relation (assertion) is correct. Our experiments on assessing gene–disease relations and protein–protein interactions using the PubMed Central collection show that BARC can be effective at assisting curators to perform data cleansing. Specifically, BARC substantially outperforms the best baselines, improving F-measure by 3.5% and 13% on gene–disease relations and protein–protein interactions, respectively. A feature analysis additionally showed that all feature types are informative, as are all fields of the documents. BARC provides a clear benefit for the biocuration community, as there are no prior automated tools for identifying inconsistent assertions in large-scale biological databases.
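To make the pipeline shape concrete, here is a minimal sketch of an assertion represented as a relation between two objects, with toy co-mention features computed over a handful of documents. It is not SaBRA or BARC's classifier: the names Relation, co_mention_features and support_score are hypothetical, the substring matching is a simplifying assumption, and the Jaccard-style score merely stands in for the kind of signal a trained classifier might consume.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """A biological assertion as a relation between two objects (hypothetical type)."""
    subject: str  # e.g. a gene symbol
    obj: str      # e.g. a disease name


def co_mention_features(relation, documents):
    """Toy features: documents mentioning each object, and both together.

    Assumption: plain case-insensitive substring matching, not entity recognition.
    """
    s, o = relation.subject.lower(), relation.obj.lower()
    texts = [d.lower() for d in documents]
    return {
        "n_subject": sum(s in t for t in texts),
        "n_object": sum(o in t for t in texts),
        "n_both": sum(s in t and o in t for t in texts),
    }


def support_score(features):
    """Illustrative score in [0, 1]: Jaccard overlap of the two mention sets."""
    union = features["n_subject"] + features["n_object"] - features["n_both"]
    return features["n_both"] / union if union else 0.0


if __name__ == "__main__":
    docs = [
        "BRCA1 mutations are linked to breast cancer.",
        "BRCA1 is expressed in many tissues.",
        "Breast cancer screening guidelines.",
    ]
    feats = co_mention_features(Relation("BRCA1", "breast cancer"), docs)
    print(feats, support_score(feats))
```

In a fuller system, features like these would be computed per document field and passed to a supervised classifier rather than collapsed into a single hand-written score.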