Using multiple reference genomes to identify and resolve annotation inconsistencies
2019
Background: Advances in sequencing technologies have led to the release of reference genomes and annotations for multiple individuals within more well-studied systems. While these new genome assemblies share significant portions of synteny with one another, the annotated structure of gene models within these regions can differ. Of particular concern are split-gene misannotations, in which a single gene is incorrectly annotated as two distinct genes or two genes are incorrectly annotated as a single gene. These misannotations can have major impacts on functional prediction, estimates of expression, and many downstream analyses.
Results: We developed a high-throughput method based on pairwise comparisons of annotations that detects potential split-gene misannotations and quantifies support for whether the genes should be merged into a single gene model. We demonstrate the utility of our method using gene annotations of three reference genomes from maize (B73, PH207, and W22), a difficult system from an annotation perspective due to the size and complexity of the genome. On average, we find several hundred of these potential split-gene misannotations in each pairwise comparison, corresponding to 3-5% of gene models across annotations. To determine which state (i.e., one gene or multiple genes) is biologically supported, we utilize RNAseq data from 10 tissues throughout development along with a novel metric and simulation framework. The methods we have developed require minimal human interaction and can be applied to future assemblies to aid in annotation efforts.
Conclusions: Split-gene misannotations occur at appreciable frequency in maize annotations. We have developed a method to easily identify and correct these misannotations. Importantly, this method is generic in that it can utilize any type of short-read expression data. Failure to account for split-gene misannotations has serious consequences for biological inference, particularly for expression-based analyses.
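
To make the pairwise-comparison idea in the Results concrete, the following is a minimal sketch of how split-gene candidates might be flagged between two annotations. It is not the authors' implementation: the function name `find_split_candidates`, the input layout (a list of homology hits plus a gene-order table for the second annotation), and the adjacency criterion are all assumptions made for illustration. The expression-based metric and simulation framework mentioned in the abstract are not specified there and are not reproduced here.

```python
from collections import defaultdict


def find_split_candidates(hits, order_b):
    """Flag genes in annotation A that match several adjacent genes in annotation B.

    hits    -- iterable of (gene_a, gene_b) homology pairs (hypothetical input)
    order_b -- dict mapping gene_b -> (chromosome, gene index along that chromosome)
    Returns a dict gene_a -> list of adjacent gene_b IDs in chromosomal order.
    """
    # Collect every B gene that each A gene matches.
    partners = defaultdict(set)
    for gene_a, gene_b in hits:
        partners[gene_a].add(gene_b)

    candidates = {}
    for gene_a, genes_b in partners.items():
        # A one-to-one match cannot be a split-gene candidate.
        if len(genes_b) < 2:
            continue
        # Skip genes whose partners lack positional information.
        if any(g not in order_b for g in genes_b):
            continue
        located = sorted(order_b[g] + (g,) for g in genes_b)
        chroms = {chrom for chrom, _, _ in located}
        indices = [idx for _, idx, _ in located]
        # Assumed criterion: partners sit on one chromosome with no gene between them.
        consecutive = all(b - a == 1 for a, b in zip(indices, indices[1:]))
        if len(chroms) == 1 and consecutive:
            candidates[gene_a] = [g for _, _, g in located]
    return candidates


if __name__ == "__main__":
    # Toy data with made-up gene IDs, for illustration only.
    hits = [("A1", "B1"), ("A1", "B2"), ("A2", "B3"), ("A3", "B4"), ("A3", "B9")]
    order_b = {
        "B1": ("chr1", 0), "B2": ("chr1", 1), "B3": ("chr1", 2),
        "B4": ("chr1", 3), "B9": ("chr2", 0),
    }
    print(find_split_candidates(hits, order_b))
```

In this toy input, only `A1` is reported, because its two partners are neighbors on the same chromosome. `A3` matches genes on different chromosomes and is skipped. A candidate flagged this way says nothing about which annotation is correct; per the abstract, that decision rests on expression evidence.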