Using multiple reference genomes to identify and resolve annotation inconsistencies

2019
Background: Advances in sequencing technologies have led to the release of reference genomes and annotations for multiple individuals within more well-studied systems. While these new genome assemblies share significant portions of synteny with each other, the annotated structure of gene models within these regions can differ. Of particular concern are split-gene misannotations, in which a single gene is incorrectly annotated as two distinct genes or two genes are incorrectly annotated as a single gene. These misannotations can have major impacts on functional prediction, estimates of expression, and many downstream analyses. Results: We developed a high-throughput method based on pairwise comparisons of annotations that detects potential split-gene misannotations and quantifies support for whether the genes should be merged into a single gene model. We demonstrate the utility of our method using gene annotations of three reference genomes from maize (B73, PH207, and W22), a difficult system from an annotation perspective due to the size and complexity of the genome. On average, we find several hundred of these potential split-gene misannotations in each pairwise comparison, corresponding to 3-5% of gene models across annotations. To determine which state (i.e. one gene or multiple genes) is biologically supported, we utilize RNAseq data from 10 tissues throughout development along with a novel metric and simulation framework. The methods we have developed require minimal human interaction and can be applied to future assemblies to aid in annotation efforts. Conclusions: Split-gene misannotations occur at appreciable frequency in maize annotations. We have developed a method to easily identify and correct these misannotations. Importantly, this method is generic in that it can utilize any type of short-read expression data. Failure to account for split-gene misannotations has serious consequences for biological inference, particularly for expression-based analyses.
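
The idea of detecting candidates by pairwise comparison of annotations can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that gene-level homology hits between two annotations are already available, and it flags a gene in one annotation as a split-gene candidate when it matches two or more consecutive genes on the same chromosome in the other. The function name, input layout, and gene identifiers are all hypothetical.

```python
from collections import defaultdict


def find_split_gene_candidates(hits, gene_order_b):
    """Flag genes in annotation A that match a run of adjacent genes in B.

    hits: iterable of (gene_a, gene_b) homology pairs (assumed precomputed).
    gene_order_b: dict mapping gene_b -> (chromosome, rank along chromosome).
    Returns a dict mapping gene_a -> list of matched B genes in genomic order.
    """
    # Collect every B gene matched by each A gene.
    matches = defaultdict(set)
    for gene_a, gene_b in hits:
        matches[gene_a].add(gene_b)

    candidates = {}
    for gene_a, genes_b in matches.items():
        if len(genes_b) < 2:
            continue  # a one-to-one match is not a split-gene candidate
        positions = sorted(gene_order_b[g] for g in genes_b)
        if len({chrom for chrom, _ in positions}) != 1:
            continue  # matched genes must lie on a single chromosome
        ranks = [rank for _, rank in positions]
        # Consecutive ranks mean no unmatched gene sits between the matches.
        if ranks[-1] - ranks[0] == len(ranks) - 1:
            candidates[gene_a] = sorted(genes_b, key=lambda g: gene_order_b[g])
    return candidates


# Toy data with hypothetical identifiers, for illustration only.
example_hits = [("A1", "B1"), ("A1", "B2"), ("A2", "B3"), ("A3", "B4"), ("A3", "B6")]
example_order = {
    "B1": ("chr1", 0), "B2": ("chr1", 1), "B3": ("chr1", 2),
    "B4": ("chr2", 0), "B5": ("chr2", 1), "B6": ("chr2", 2),
}
example_candidates = find_split_gene_candidates(example_hits, example_order)
```

In this toy input only A1 is flagged, because its two matches are adjacent, whereas the matches of A3 are separated by an unmatched gene. A candidate found this way says nothing about which annotation is correct; as the abstract notes, that decision relies on expression data.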