Validation of predicted anonymous proteins simply using Fisher's exact test

2021 
Motivation: Given its increasing efficiency and accuracy and its decreasing cost, genome sequencing has become the primary (and often the sole) experimental method to characterize newly discovered organisms, in particular from the microbial world (bacteria, archaea, viruses). This generates an ever-increasing number of predicted proteins whose existence is unwarranted, in particular when they do not share significant similarity with proteins of model organisms. As a last resort, computing the selection pressure from pairwise alignments of the corresponding Open Reading Frames (ORFs) can validate their existence. However, this approach is error-prone, as it is not usually associated with a significance test.

Results: We introduce the use of the straightforward Fisher's exact test as a post-processing of the results provided by the popular CODEML pairwise sequence comparison software. The respective rates of nucleotide changes at non-synonymous vs. synonymous positions (as determined by CODEML) are turned into the entries of a 2x2 contingency table, the probability of which is computed under the null hypothesis that the two classes of positions do not behave differently (i.e. the ORFs do not encode real proteins). We show that strong negative selection pressures do not always provide a significant argument in favor of the existence of proteins.
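The post-processing step described in the Results can be sketched as follows. This is a minimal illustration, not the author's implementation. The abstract states only that the rates are turned into a 2x2 table, so the layout used here is an assumption: changed vs. unchanged positions, for non-synonymous vs. synonymous sites. The function names `contingency_from_codeml` and `fisher_exact_two_sided` are hypothetical. The inputs `N`, `S`, `dN` and `dS` are assumed to be the site counts and substitution rates from a CODEML pairwise comparison, and the numbers in the example are made up.

```python
from math import comb


def contingency_from_codeml(N, S, dN, dS):
    """Hypothetical helper: build (a, b, c, d) from CODEML-style values.

    Assumed layout (not confirmed by the source):
        a = non-synonymous sites changed,  b = non-synonymous sites unchanged
        c = synonymous sites changed,      d = synonymous sites unchanged
    Counts are rounded to integers; changed counts are capped at the number
    of sites so that a saturated rate (> 1) cannot give a negative cell.
    """
    n_sites, s_sites = round(N), round(S)
    a = min(round(N * dN), n_sites)
    c = min(round(S * dS), s_sites)
    return a, n_sites - a, c, s_sites - c


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is at most as probable as the observed one.
    # The small relative tolerance guards against floating-point ties.
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-7))
    return min(1.0, total)


if __name__ == "__main__":
    # Two made-up ORF pairs with the same dN/dS = 0.25 but different lengths.
    for label, (N, S) in {"short": (60, 30), "long": (600, 300)}.items():
        table = contingency_from_codeml(N, S, dN=0.05, dS=0.2)
        print(label, table, fisher_exact_two_sided(*table))
```

In this made-up example both pairs have the same dN/dS ratio, yet under the assumed table layout the short pair does not reach the usual 0.05 threshold (p is about 0.055), while the long pair is far below it. This is consistent with the point that a low ratio alone is not always a significant argument.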