Bacterial genome reduction as a result of short read sequence assembly

2016
High-throughput comparative genomics has changed our view of bacterial evolution and relatedness. Many genomic comparisons, especially those regarding the accessory genome that is variably conserved across strains in a species, are performed using assembled genomes. For completed genomes, an assumption is made that the entire genome was incorporated into the genome assembly, while for draft assemblies, often constructed from short sequence reads, an assumption is made that the genome assembly is an approximation of the entire genome. To understand the potential effects of short read assemblies on the estimation of the complete genome, we downloaded all completed bacterial genomes from GenBank, simulated short reads, assembled the simulated short reads and compared the resulting assembly to the completed assembly. Although most simulated assemblies demonstrated little reduction, others were reduced by as much as 25%, which was correlated with the repeat structure of the genome. A comparative analysis of lost coding region sequences demonstrated that up to 48 CDSs, or up to ~112,000 bases of coding region sequence, were missing from some draft assemblies compared to their finished counterparts. Although this effect was observed to some extent in 32% of genomes, only minimal effects were observed on pan-genome statistics when using simulated draft genome assemblies. The benefits and limitations of using draft genome assemblies should be fully realized before interpreting data from assembly-based comparative analyses.
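As a rough illustration of what a figure such as "reduced by as much as 25%" means, the sketch below computes the percent reduction of a draft assembly relative to its finished genome. It assumes reduction is measured simply as the difference in total sequence length; the abstract does not state the exact metric used, and the function name and numbers are hypothetical.

```python
def percent_reduction(finished_length: int, contig_lengths: list[int]) -> float:
    """Percent of the finished genome length that is absent from the draft.

    Assumption: reduction is measured by total length only, not by alignment.
    """
    if finished_length <= 0:
        raise ValueError("finished_length must be positive")
    draft_length = sum(contig_lengths)  # total bases across all draft contigs
    return 100.0 * (finished_length - draft_length) / finished_length


# Hypothetical example: a 4 Mb finished genome and a three-contig draft.
finished = 4_000_000
contigs = [1_500_000, 1_000_000, 500_000]
reduction = percent_reduction(finished, contigs)
print(f"Draft assembly is reduced by {reduction:.1f}%")
```

A length-based measure like this cannot tell which regions were lost; identifying missing coding sequences would require aligning the draft against the finished genome.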