A Spark-Based Approach for High-Efficiency Embedded Feature Selection

2019 
Embedded feature selection is an important branch of feature engineering. However, because its iterative mechanism requires excessive computing time, embedded feature selection based on variational inference cannot meet the real-time demands of practical applications. To address this, a Spark-based embedded feature selection approach is proposed in this study. A variational relevance vector machine with an automatic relevance determination (ARD) kernel is selected as the base model, so that each input feature has its own independent scaling parameter. Because Spark is well suited to iterative computation, a parallel strategy is designed that stores the training samples in Resilient Distributed Datasets (RDDs) and achieves parallel acceleration through operators such as filter, map, and flatMap. In addition, reusable RDDs are cached during the calculation to avoid repeated computation, and the parameters required in parallel computing are broadcast to reduce the time spent on data shuffling, which further improves the efficiency of the algorithm. The approach also exploits the positive definiteness of the high-dimensional matrix, using singular value decomposition to compute its inverse and determinant and so further reduce computing time. Experiments are conducted on regression problems with extensive parametric studies. The results demonstrate that the proposed approach performs well in both acceleration and prediction accuracy.
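The abstract does not give the details of the SVD step, so the following is only a minimal sketch of the general idea, assuming a dense symmetric positive definite matrix and NumPy. The function name `spd_inverse_and_logdet` and the test matrix are hypothetical and are not taken from the paper. For such a matrix A = U diag(s) V^T with every singular value s_i > 0, the inverse is V diag(1/s) U^T and the determinant is the product of the s_i. The sketch returns the log-determinant because the raw product can overflow in high dimensions.

```python
import numpy as np


def spd_inverse_and_logdet(a):
    """Inverse and log-determinant of a symmetric positive definite matrix via SVD.

    Hypothetical helper for illustration; not the paper's implementation.
    """
    # A = U diag(s) V^T, with all singular values s > 0 for an SPD matrix.
    u, s, vt = np.linalg.svd(a)
    # A^{-1} = V diag(1/s) U^T; dividing the columns of V by s applies diag(1/s).
    inverse = (vt.T / s) @ u.T
    # det(A) is the product of the singular values, so log det(A) = sum(log s).
    log_det = float(np.sum(np.log(s)))
    return inverse, log_det


# Illustrative SPD matrix (assumed data, not from the paper).
rng = np.random.default_rng(0)
b = rng.standard_normal((5, 5))
a = b @ b.T + 5.0 * np.eye(5)

a_inv, a_logdet = spd_inverse_and_logdet(a)
```

In a relevance vector machine this kind of routine would typically be applied to the posterior covariance or marginal-likelihood covariance on each iteration, which is why a stable inverse and determinant matter for overall run time.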