Bacterial genome reduction as a result of short read sequence assembly
2016
High-throughput comparative genomics has changed our view of bacterial evolution and relatedness. Many genomic comparisons, especially those regarding the accessory genome that is variably conserved across strains in a species, are performed using assembled genomes. For completed genomes, an assumption is made that the entire genome was incorporated into the genome assembly, while for draft assemblies, often constructed from short sequence reads, an assumption is made that the genome assembly is an approximation of the entire genome. To understand the potential effects of short read assemblies on the estimation of the complete genome, we downloaded all completed bacterial genomes from GenBank, simulated short reads, assembled the simulated short reads and compared the resulting assembly to the completed assembly. Although most simulated assemblies demonstrated little reduction, others were reduced by as much as 25%, which was correlated with the repeat structure of the genome. A comparative analysis of lost coding region sequences demonstrated that up to 48 CDSs or up to ~112,000 bases of coding region sequence were missing from some draft assemblies compared to their finished counterparts. Although this effect was observed to some extent in 32% of genomes, only minimal effects were observed on pan-genome statistics when using simulated draft genome assemblies. The benefits and limitations of using draft genome assemblies should be fully realized before interpreting data from assembly-based comparative analyses.
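As an illustration of the comparison step, the following minimal sketch computes how much shorter a draft assembly is than its completed counterpart. It assumes that reduction is measured as the relative difference in total sequence length, which the abstract does not state explicitly. The function names `fasta_total_length` and `percent_reduction` and the toy sequences are hypothetical and are not taken from the study.

```python
from io import StringIO
from typing import Iterable


def fasta_total_length(lines: Iterable[str]) -> int:
    """Sum sequence lengths over all records in FASTA-formatted lines."""
    total = 0
    for line in lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and record headers
        total += len(line)
    return total


def percent_reduction(complete_len: int, draft_len: int) -> float:
    """Percentage of the completed genome length absent from the draft.

    Assumption: reduction is defined on total length only; the study may
    have used a different or more detailed measure.
    """
    if complete_len <= 0:
        raise ValueError("complete_len must be positive")
    return 100.0 * (complete_len - draft_len) / complete_len


# Toy data (hypothetical, not from the study): one finished chromosome
# and a two-contig draft assembly of the same genome.
complete = StringIO(">chromosome\nACGTACGTAC\nGTACGTACGT\n")
draft = StringIO(">contig_1\nACGTACGT\n>contig_2\nACGTACG\n")

complete_len = fasta_total_length(complete)
draft_len = fasta_total_length(draft)
reduction = percent_reduction(complete_len, draft_len)
print(f"{reduction:.1f}% reduction")
```

In practice the two inputs would be open file handles for the finished genome and the assembled contigs. A length-based comparison of this kind says nothing about which coding regions were lost, which would require an alignment or annotation-based comparison.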