Group Sparse Representation With WaveNet Vocoder Adaptation for Spectrum and Prosody Conversion

2019
The statistical approach to voice conversion typically consists of a feature conversion module followed by a vocoder. So far, the feature conversion studies are mainly focused on the conversion of spectrum. However, speaker identity is also characterized by prosodic features, such as fundamental frequency (F0) and energy contour among others. In this paper, we study the transformation of speaker characteristics both in terms of spectrum and prosody. We propose two novel techniques that effectively use a limited amount of source-target training data and leverage a large general speech corpus to improve the voice conversion quality. First, we study the phonetic sparse representation under the group sparsity mathematical formulation. We use phonetic posteriorgrams (PPGs) together with spectral and prosody features to form tandem feature in the phonetic dictionary. The tandem feature allow us to estimate an activation matrix that is less dependent on source speakers, thus providing a better voice conversion quality. Second, we study the use of WaveNet vocoder that can be trained on general speech corpus from multiple speakers and adapted on target speaker data to improve the vocoding quality. We benefit from the large general speech databases that are used to train the PPG generator, and the WaveNet vocoder. The experiments show that the proposed conversion framework outperforms the traditional spectrum and prosody conversion techniques in both objective and subjective evaluations.
    • Correction
    • Source
    • Cite
    • Save
    69
    References
    0
    Citations
    NaN
    KQI
    []
    Baidu
    map