Enhanced Permutation Tests via Multiple Pruning

2020 
Big multi-omics data in bioinformatics consist of a huge number of features and a relatively small number of samples. In addition, features from multi-omics data have their own specific characteristics, depending on whether they come from genomics, proteomics, metabolomics, etc. Because of these distinct characteristics, standard statistical analyses that rely on parametric assumptions may sometimes fail to provide exact asymptotic results. To resolve this issue, permutation tests offer a way to analyze multi-omics data exactly, because they are distribution-free and flexible to use. In permutation tests, p-values are evaluated by estimating the location of each test statistic in an empirical null distribution generated by random shuffling. However, the permutation approach can become infeasible as the number of features grows, because multiple hypothesis testing then demands more stringent control of type I error and, consequently, a much larger number of permutations is required to reach significance. To address this problem, we propose a well-organized strategy, “ENhanced Permutation tests via multiple Pruning (ENPP).” ENPP prunes features in every permutation round if they are determined to be non-significant. In other words, if, judged against a certain number of pre-determined cutoffs, the statistics from permuted data sets exceed a feature's original statistic too many times, the feature is determined to be non-significant. ENPP then removes that feature and continues the process without it in the next permutation round. Our simulation study showed that ENPP could remove about 50% of the features at the first permutation round; by the 100th permutation round, 98% of the features had been removed, and only 7.4% of the computation time of the original unpruned permutation approach was required.
In addition, we applied this approach to a real data set (Korea Association REsource: KARE) of 327,872 SNPs to find associations with a non-normally distributed phenotype (fasting plasma glucose), interpreted the results, and discussed the feasibility and advantages of the approach.
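
The pruning idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the absolute Pearson correlation used as the test statistic, the single fixed cutoff `max_exceed` (the abstract mentions several pre-determined cutoffs without specifying them), and the p-value estimate `(exceed + 1) / (rounds + 1)` are all assumptions made for illustration.

```python
import numpy as np


def abs_corr(X, y):
    """Absolute Pearson correlation of each column of X with y.

    Assumes no column of X (and not y) is constant.
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return np.abs(Xc.T @ yc) / denom


def pruned_permutation_test(X, y, n_perm=1000, max_exceed=10, seed=0):
    """Permutation test that drops features once they look non-significant.

    Hypothetical simplification: a feature is pruned as soon as its
    permuted statistic has matched or exceeded the observed statistic
    `max_exceed` times.
    """
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    observed = abs_corr(X, y)
    exceed = np.zeros(n_features, dtype=int)  # exceedance count per feature
    rounds = np.zeros(n_features, dtype=int)  # rounds each feature took part in
    active = np.arange(n_features)            # indices of features not yet pruned

    for _ in range(n_perm):
        if active.size == 0:
            break
        y_perm = rng.permutation(y)  # shuffle the phenotype labels
        perm_stat = abs_corr(X[:, active], y_perm)
        exceed[active] += perm_stat >= observed[active]
        rounds[active] += 1
        # Keep only features that have not yet reached the cutoff.
        active = active[exceed[active] < max_exceed]

    # Empirical p-value based on the rounds each feature actually completed.
    p_values = (exceed + 1) / (rounds + 1)
    return p_values, rounds
```

Calling `pruned_permutation_test(X, y)` with an `(n_samples, n_features)` array `X` and a phenotype vector `y` returns the estimated p-values together with the number of rounds each feature survived. Features pruned early receive only a coarse p-value, which is acceptable under this sketch because they have already been judged non-significant.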